charm AT lists.siebelschool.illinois.edu
Subject: Charm++ parallel programming system
List archive
- From: Sam White <white67 AT illinois.edu>
- To: Kiril Dichev <K.Dichev AT qub.ac.uk>
- Cc: "charm AT lists.cs.illinois.edu" <charm AT lists.cs.illinois.edu>
- Subject: Re: [charm] Fault tolerant Jacobi
- Date: Fri, 20 Jul 2018 16:10:56 -0500
- Authentication-results: illinois.edu; spf=pass smtp.mailfrom=samt.white AT gmail.com; dkim=pass header.d=gmail.com header.s=20161025; dkim=pass header.d=illinois-edu.20150623.gappssmtp.com header.s=20150623; dmarc=none header.from=illinois.edu
Hi Kiril,
The checkpoint/restart-based fault tolerance schemes described in that paper are available in production for Charm++ and AMPI programs. That includes checkpointing to disk or in-memory, with online recovery. To build Charm++/AMPI with double in-memory checkpoint/restart support, you should build with the 'syncft' option, as in './build AMPI netlrts-linux-x86_64 syncft -j16 --with-production'. I just pushed some cleanup of tests/ampi/jacobi3d/, so if you do 'git pull origin charm' now, then run 'make syncfttest' in that directory you should see the test run with the '+killFile <file>' option.
Also, syncft is currently only supported on the netlrts and verbs communication layers, and message logging fault tolerance is not maintained as a production feature anymore, though it shouldn't be hard to revive it. If you can share, we'd be interested to hear what you're working on.
-Sam
The checkpoint/restart-based fault tolerance schemes described in that paper are available in production for Charm++ and AMPI programs. That includes checkpointing to disk or in-memory, with online recovery. To build Charm++/AMPI with double in-memory checkpoint/restart support, you should build with the 'syncft' option, as in './build AMPI netlrts-linux-x86_64 syncft -j16 --with-production'. I just pushed some cleanup of tests/ampi/jacobi3d/, so if you do 'git pull origin charm' now, then run 'make syncfttest' in that directory you should see the test run with the '+killFile <file>' option.
Also, syncft is currently only supported on the netlrts and verbs communication layers, and message logging fault tolerance is not maintained as a production feature anymore, though it shouldn't be hard to revive it. If you can share, we'd be interested to hear what you're working on.
-Sam
On Fri, Jul 20, 2018 at 10:15 AM, Kiril Dichev <K.Dichev AT qub.ac.uk> wrote:
Hello,
I am a new user of Charm++ and AMPI.
I’ve done some research on fault tolerance in MPI in the last year, and I see some nice ways to couple it with AMPI (happy to explain if anyone is interested). I used a Jacobi solver before, so it would be nice to use the same for AMPI to get going. I am especially interested to test the parallel recovery capabilities that were presented in work such as this one, for Jacobi among other codes: https://repositoriotec.tec.ac.cr/bitstream/handle/2238/7150/Using%20Migratable%20Objects%20to%20Enhance%20Fault%20Tolerance%20Schemes%20in%20Supercomputers.pdf?sequence=1&isAllowed=y
However, I am not sure where to begin. I pulled the official Charm++ repo, which contains some MPI Jacobi code in tests/ampi/jacobi3d. In particular, it has some kill files as well, which a very old tutorial tells me can be used to specify failure scenarios for PEs. However, it seems the +pkill_file option doesn’t even exist anymore, so that’s outdated, and I don’t know if the code is up-to-date either.
On the other hand, there is a repo here, according to the documentation in the main repo:
… which I can’t access, and apparently it also has Jacobi codes I can run with AMPI. Maybe that is the one I need? If it is, can I use it if I’m not affiliated with any US institutions?
Any help which is the up-to-date Jacobi + AMPI would be much appreciated. In addition, any help how to experiment with parallel recovery via migration would be great.
Regards,Kiril Dichev
- [charm] Fault tolerant Jacobi, Kiril Dichev, 07/20/2018
- <Possible follow-up(s)>
- Re: [charm] Fault tolerant Jacobi, Sam White, 07/20/2018
- Re: [charm] Fault tolerant Jacobi, Kiril Dichev, 07/23/2018
Archive powered by MHonArc 2.6.19.