charm AT lists.siebelschool.illinois.edu
Subject: Charm++ parallel programming system
- From: Phil Miller <mille121 AT illinois.edu>
- To: "Brunner, Robert Kraemer" <rbrunner AT illinois.edu>
- Cc: "charm AT cs.illinois.edu" <charm AT cs.illinois.edu>
- Subject: Re: [charm] Charm/Converse fault tolerance on BW?
- Date: Thu, 17 Mar 2016 10:58:59 -0500
On Thu, Mar 17, 2016 at 10:18 AM, Brunner, Robert Kraemer <rbrunner AT illinois.edu> wrote:
> What is the state of fault tolerance support on Cray XE systems (in particular, Blue Waters) with respect to allowing user code to catch node failures? Can the runtime notify the user program that a node has failed, and allow the user program to handle the failure, and perhaps to keep running, taking the loss of the node and any associated objects into account?
At a higher level, Charm++'s approach to fault tolerance is to make the failure transparent to the user code. The application will never see missing nodes and objects, because the runtime system takes responsibility for recovering them consistently.
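To make "transparent" a bit more concrete, here is a toy C++ model of the idea. It is not Charm++ code: `ToyRuntime`, `Counter`, and all of the method names are hypothetical, and the sketch assumes a checkpoint store that survives node failures and a simple roll-back-everything recovery policy. The only point is that the application addresses objects by id and never learns which node holds them, or that a node was lost.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <vector>

// Toy model only. ToyRuntime, Counter, and every method name here are
// hypothetical and are NOT part of the Charm++ API.

// An application object: it holds state and knows nothing about nodes.
struct Counter {
    int value = 0;
    void step() { ++value; }
};

class ToyRuntime {
    std::vector<std::map<std::size_t, Counter>> nodes_;  // live objects per node
    std::vector<bool> alive_;                            // which nodes are up
    std::map<std::size_t, Counter> checkpoint_;          // last snapshot; assumed
                                                         // to survive failures

    // Choose a surviving node for an object, starting from its "home" node.
    std::size_t pickLiveNode(std::size_t id) const {
        for (std::size_t k = 0; k < nodes_.size(); ++k) {
            std::size_t n = (id + k) % nodes_.size();
            if (alive_[n]) return n;
        }
        throw std::runtime_error("no surviving node");
    }

public:
    explicit ToyRuntime(std::size_t nodeCount)
        : nodes_(nodeCount), alive_(nodeCount, true) {}

    // The runtime, not the caller, decides where the object lives.
    void create(std::size_t id) { nodes_[pickLiveNode(id)][id] = Counter{}; }

    // Objects are addressed by id only; the runtime locates them.
    Counter& at(std::size_t id) {
        for (auto& node : nodes_) {
            auto it = node.find(id);
            if (it != node.end()) return it->second;
        }
        throw std::out_of_range("unknown object id");
    }

    // Snapshot every live object.
    void checkpoint() {
        checkpoint_.clear();
        for (const auto& node : nodes_)
            checkpoint_.insert(node.begin(), node.end());
    }

    // Simulate a crash, then recover inside the runtime: roll every object
    // back to the last snapshot and re-place it on a surviving node.
    void failNode(std::size_t node) {
        assert(node < nodes_.size());
        alive_[node] = false;
        for (auto& n : nodes_) n.clear();
        for (const auto& entry : checkpoint_)
            nodes_[pickLiveNode(entry.first)][entry.first] = entry.second;
    }

    std::size_t objectCount() const {
        std::size_t total = 0;
        for (const auto& node : nodes_) total += node.size();
        return total;
    }
};

// Application-level code: it names objects, never nodes, and contains no
// failure handling of its own.
int runDemo() {
    ToyRuntime rt(2);
    for (std::size_t id = 0; id < 4; ++id) rt.create(id);
    for (int i = 0; i < 3; ++i)
        for (std::size_t id = 0; id < 4; ++id) rt.at(id).step();
    rt.checkpoint();
    rt.failNode(0);  // stands in for a hardware failure
    int total = 0;
    for (std::size_t id = 0; id < 4; ++id) {
        rt.at(id).step();  // same calls as before the failure
        total += rt.at(id).value;
    }
    return total;
}
```

In this sketch `runDemo()` keeps making the same calls after `failNode(0)` as it did before, and all four objects are still reachable. The cost of that simplicity is that any work done after the last `checkpoint()` is rolled back, which is the usual trade-off in checkpoint-based recovery.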