Okay, very informative, I will experiment a bit and see which way
to go from here.
Regards,
Kiril
On 07/08/18 15:33, Sam White wrote:
That's sort of an open question right now, since
message logging fault tolerance in Charm++ was (from what I can
tell) developed primarily around 2008-2014, but did not make it
into production support. It was never enabled by default in the
build and was broken at least 2 years ago when we switched to
64-bit IDs for most globally visible entities inside Charm++
(and did not update the mlogft code to match the new IDs). To
build Charm++ with message logging fault tolerance, you need to
build with the 'mlogft' option, as in './build charm++
netlrts-linux-x86_64 mlogft'. My advice would be to first check
out older releases of Charm++ like v6.7.1, v6.6.1, or v6.5.1 and
test out mlogft builds of those.
Redmine issue: https://charm.cs.illinois.edu/redmine/issues/1244
-Sam
On Mon, Aug 6, 2018 at 10:10 AM, Kiril Dichev <K.Dichev AT qub.ac.uk> wrote:
Okay, thanks Sam. It seems to me my best shot is to enable
message logging for Charm and try to understand the
partial rollback implemented there. You mentioned in an earlier email that message
logging doesn’t quite work. How do I need to build it, and what do I need to
change to make it work?
Regards,
Kiril
On 2 Aug 2018, at 13:52, Sam White <white67 AT illinois.edu> wrote:
Yeah, that would require
several changes to the current
checkpoint/restart support. Currently the
runtime always packs and unpacks all of the
MPI state and the user-level thread stack.
The heap data can either be automatically
packed and unpacked via Isomalloc or done
manually via PUP routines. With PUP routines
you can selectively pack/unpack only certain
data. We also assume a full rollback, but I
think there's been work on partial rollbacks
(at least in terms of message logging fault
tolerance).
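To make the selective pack/unpack point concrete, here's a rough sketch of a user PUP routine using the C PUP interface (pup_c.h). The struct and field names are just placeholders, and how you register the routine with AMPI differs between versions, so treat this as illustrative only:

    #include <stdlib.h>
    #include "pup_c.h"   /* Charm++'s C PUP interface */

    struct app_state { int iter; int n; double *grid; };

    /* Only the fields pup'ed here are saved and restored at a checkpoint;
       anything left out is simply not part of the checkpoint. */
    void pup_app_state(pup_er p, void *d)
    {
        struct app_state *s = (struct app_state *)d;
        pup_int(p, &s->iter);
        pup_int(p, &s->n);
        if (pup_isUnpacking(p))              /* re-allocate heap data on restart */
            s->grid = (double *)malloc(s->n * sizeof(double));
        pup_doubles(p, s->grid, s->n);
    }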
If you'd like to look into this more, here
are some starting points: TCharm::pup() and
TCharm::pupThread() in
charm/src/libs/ck-libs/tcharm/tcharm.C
are where the thread stack and PUP routines
are called. The thread stack is pup'ed in
charm/src/conv-core/threads.c.
Others can chime in if they have more info
on the checkpoint/restart strategies and
partial rollbacks. Hope this helps,
Sam
On Thu, Aug 2, 2018 at 7:22 AM, Kiril Dichev <K.Dichev AT qub.ac.uk> wrote:
Thanks Sam. I understand the recovery better now. The fact that the recovery is almost transparent to the application code (almost, because I still need to provide the PUP pack/unpack routines for checkpointing) is great. The ULFM Jacobi code more than doubles in size compared to what you guys have managed for AMPI, so I see the advantages there. However, there are some issues when it comes to playing around with different rollback strategies, which is what I am getting at. The important question for my research is whether I can test some partial rollback strategies, where some processes don’t roll back and others do.
Can I draw a line between runtime/communicator/MPI data and application data after a failure? Can I choose to reset all MPI-related data, such as communicators (whether they reside on the heap or the stack), but disable:
a) the reset of the application data residing on the stack, which includes pretty much all static variables that hold application data (such as the iteration counter)
b) the application checkpoint (the actual array I provide pack/unpack routines for)
at an MPI process (say, P1) of my choice?
I suspect this is probably impossible to do with the existing design, but maybe you can give me your take on this.
Regards,
Kiril
On 1 Aug 2018, at 17:49, Sam White <white67 AT illinois.edu> wrote:
1. AMPI, unlike ULFM, makes fault recovery transparent to the user. It does so based on its ability to copy and migrate all of the state that an AMPI rank owns and has associated with it. We rely on the Isomalloc memory allocator to do this automatically, or else the user can write explicit Pack/UnPack (PUP) routines to serialize and deserialize their state at runtime. Either way, the runtime takes care of serializing all of its own internal state and the user-level thread stack.
Here's how it all works step-by-step:
- The application periodically checkpoints its state via explicit calls to AMPI_Migrate(), for either in-memory or disk checkpoints.
- The runtime system continuously monitors for failures by having processes periodically ping each other (buddies are laid out in a ring fashion).
- When a failure is detected, all AMPI ranks are rolled back to their latest checkpoint, and the AMPI ranks (user-level threads) that were on the failed process are restarted in a different (already existing) OS process. If this introduces an imbalance of ranks per process, the user can call AMPI_Migrate() to perform dynamic load balancing and smooth that out.
- Since all of the runtime's state, as well as the application's heap and thread memory, is restored to its last checkpoint, and because AMPI supports virtualization, no user-level changes such as MPI_Comm_shrink are needed in the application to continue running. All communicators will continue to work, since all AMPI ranks, even those on the failed process, will continue execution from the last checkpoint. The only change is the physical location of some ranks, which AMPI already virtualizes anyway.
2. Execution proceeds from the last checkpoint (the call to AMPI_Migrate), since we roll back all heap and stack memory to what it was during that call.
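To make that concrete, here is a rough sketch of the application side (compute_one_iteration() is just a placeholder, and the 'ampi_checkpoint'/'in_memory' info key/value pair is from memory, so check the AMPI manual for the exact MPI_Info keys that AMPI_Migrate() accepts in your version):

    #include <mpi.h>

    void compute_one_iteration(int iter);    /* application work, defined elsewhere */

    /* Periodically checkpoint in memory via AMPI_Migrate(); after a failure,
       every rank resumes from the most recent of these calls. */
    void run(int maxiter)
    {
        MPI_Info chkpt;
        MPI_Info_create(&chkpt);
        MPI_Info_set(chkpt, "ampi_checkpoint", "in_memory");  /* or "to_file=dir" */

        for (int iter = 0; iter < maxiter; iter++) {
            compute_one_iteration(iter);
            if (iter % 50 == 0)
                AMPI_Migrate(chkpt);         /* checkpoint; also a migration point */
        }
        MPI_Info_free(&chkpt);
    }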
Also note that the code in charm/src/arch/mpi/machine.C is actually the MPI communication layer upon which Charm++ can be built (just like OFI, InfiniBand Verbs, Cray uGNI, IBM PAMI, etc.), while Adaptive MPI lives in charm/src/libs/ck-libs/ampi/. Most of the code that makes this really work is inside Charm++ and its core location management, though, and AMPI basically looks like any other Charm++ application, with the addition of Isomalloc to automate serialization.
Let us know if you have further questions about any of this,
Sam
On Wed, Aug 1, 2018 at 9:24 AM, Kiril Dichev <K.Dichev AT qub.ac.uk> wrote:
Hi again,
I am afraid I will need some more clarification on the way MPI recovery works after crashes in Adaptive MPI. In the sample fault-tolerant Jacobi versions for ULFM (e.g. http://fault-tolerance.org/2017/11/11/sc17-tutorial/), a lot of the MPI recovery logic is in the actual application code. It is not an easy thing to go through, but there are well-defined phases, such as the following (roughly sketched in code below):
1. revoke the communicator upon failure detection
2. shrink the communicator via MPI_Comm_shrink
3. expand it again via MPI_Comm_spawn
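Roughly, the recovery path built from those phases looks like this (my simplified sketch, not the actual tutorial code; MPIX_Comm_revoke and MPIX_Comm_shrink are the ULFM extensions from mpi-ext.h, and the respawn/merge step is abbreviated):

    #include <mpi.h>
    #include <mpi-ext.h>

    /* Called by survivors once a failure has been detected. */
    MPI_Comm recover(MPI_Comm comm, const char *exe, int nfailed)
    {
        MPI_Comm shrunk, spawned, repaired;

        /* 1. revoke: make the failure visible on all surviving ranks */
        MPIX_Comm_revoke(comm);

        /* 2. shrink: build a communicator containing only the survivors */
        MPIX_Comm_shrink(comm, &shrunk);

        /* 3. expand: spawn replacements and merge them back in */
        MPI_Comm_spawn(exe, MPI_ARGV_NULL, nfailed, MPI_INFO_NULL,
                       0, shrunk, &spawned, MPI_ERRCODES_IGNORE);
        MPI_Intercomm_merge(spawned, 0, &repaired);

        return repaired;  /* survivors then longjmp back to the iteration start */
    }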
Now, I have been very much focused on how checkpoint/restart happens, which is mostly contained in src/ck-core/ckmemcheckpoint.C. The only indication there of the MPI recovery is the calls find_spare_mpirank and mpi_restart_crashed (‘mpi_restart_crashed’ is implemented in src/arch/mpi/machine.C); however, the implementation of these routines there doesn’t give away too much.
For the moment, I have the following questions:
1. So how exactly does Adaptive MPI perform the above steps, which seem necessary no matter how they are implemented? I understand Adaptive MPI does not implement MPI_Comm_shrink, but it must implement something along these lines. How and where does this happen? Also, since Adaptive MPI seems to be more thread-oriented, does it create a new Unix process, or does it create a new thread within an existing process?
2. How exactly does execution continue post-failure in the application code, say from the start of a new iteration? This is a bit more explicit for ULFM, where survivors use the C calls setjmp/longjmp to reset to the start of a compute iteration. But how does that work with the Adaptive MPI runtime?
Thanks.
Regards,
Kiril
On 20 Jul 2018, at 22:10, Sam White <white67 AT illinois.edu> wrote:
Hi Kiril,
The checkpoint/restart-based fault tolerance schemes described in that paper are available in production for Charm++ and AMPI programs. That includes checkpointing to disk or in-memory, with online recovery. To build Charm++/AMPI with double in-memory checkpoint/restart support, you should build with the 'syncft' option, as in './build AMPI netlrts-linux-x86_64 syncft -j16 --with-production'. I just pushed some cleanup of tests/ampi/jacobi3d/, so if you do 'git pull origin charm' now, then run 'make syncfttest' in that directory, you should see the test run with the '+killFile <file>' option.
Also, syncft is currently only supported on the netlrts and verbs communication layers, and message logging fault tolerance is not maintained as a production feature anymore, though it shouldn't be hard to revive it. If you can share, we'd be interested to hear what you're working on.
-Sam
On Fri, Jul 20, 2018 at 10:15 AM, Kiril Dichev <K.Dichev AT qub.ac.uk> wrote:
Hello,
I am a new user of Charm++ and AMPI. I’ve done some research on fault tolerance in MPI in the last year, and I see some nice ways to couple it with AMPI (happy to explain if anyone is interested). I used a Jacobi solver before, so it would be nice to use the same for AMPI to get going. I am especially interested in testing the parallel recovery capabilities that were presented in work such as this one, for Jacobi among other codes: https://repositoriotec.tec.ac.cr/bitstream/handle/2238/7150/Using%20Migratable%20Objects%20to%20Enhance%20Fault%20Tolerance%20Schemes%20in%20Supercomputers.pdf?sequence=1&isAllowed=y
However, I am not sure where to begin. I pulled the official Charm++ repo, which contains some MPI Jacobi code in tests/ampi/jacobi3d. In particular, it has some kill files as well, which a very old tutorial tells me can be used to specify failure scenarios for PEs. However, it seems the +pkill_file option doesn’t even exist anymore, so that’s outdated, and I don’t know if the code is up-to-date either.
On the other hand, there is a repo here, according to the documentation in the main repo:
ssh://charm.cs.illinois.edu:9418/benchmarks/ampi-benchmarks
… which I can’t access, and apparently it also has Jacobi codes I can run with AMPI. Maybe that is the one I need? If it is, can I use it if I’m not affiliated with any US institutions?
Any help on which is the up-to-date Jacobi + AMPI code would be much appreciated. In addition, any help on how to experiment with parallel recovery via migration would be great.
Regards,
Kiril Dichev
--
Kiril Dichev
Research Fellow
High Performance and Distributed Computing
http://www.qub.ac.uk/research-centres/HPDC/
School of EEECS
Queen's University Belfast