charm AT lists.siebelschool.illinois.edu
Subject: Charm++ parallel programming system
- From: Phil Miller <mille121 AT illinois.edu>
- To: Jozsef Bakosi <jbakosi AT lanl.gov>
- Cc: "charm AT lists.cs.illinois.edu" <charm AT lists.cs.illinois.edu>, Evan Ramos <evan AT hpccharm.com>
- Subject: Re: [charm] mis-matched client callbacks in reduction messages
- Date: Tue, 7 Nov 2017 16:31:55 -0600
On Tue, Nov 7, 2017 at 4:30 PM, Jozsef Bakosi <jbakosi AT lanl.gov> wrote:
> Hi Phil,
> I'm having a hard time with that checkout. Here is what I do:
>
>   git clone https://charm.cs.illinois.edu/gerrit/charm && cd charm
>   git fetch https://charm.cs.illinois.edu/gerrit/charm refs/changes/27/3227/1 && git checkout FETCH_HEAD
>   ./build charm++ mpi-linux-x86_64 --with-prio-type=int --enable-randomized-msgq --suffix randq-debug --build-shared -j36 -g
>
> This is fine, then when I build my code, I get the link error:
>
>   /usr/bin/ld: cannot find -lhwloc_embedded
>
> Does that ring some bells for you? Is that being pulled in by my mpi? I'm
> probably screwing something up...

That's related to some recent changes we've made, though that particular failure is kinda surprising to me.
What machine is this, and what output do you get from the following commands?

  which mpicxx
  mpicxx -show

> Thanks,
> Jozsef
On 11.03.2017 15:09, Phil Miller wrote:
> You can try out a rough patch to print basic details of the mis-matched
> reductions here:
>
> https://charm.cs.illinois.edu/gerrit/3227
>
> Right now, it just prints the reducer and callback types as numeric
> values. Deeper information would require a bunch more code, and those
> values should be enough to distinguish among a couple of suspect
> contribute() calls.
>
> On Fri, Nov 3, 2017 at 2:58 PM, Phil Miller <mille121 AT illinois.edu>
> wrote:
>
> Hi Jozsef,
>
> It's not whining at all. This is a bothersome problem to address.
>
> Randomized queues will change the order in which available messages get
> delivered to individual chares. If randomization produces some perverse
> order that leads to an inconsistent reduction sequence, that same order
> could just as well occur by chance in non-randomized execution. Note
> that randomization only operates on messages queued for delivery to
> objects - the objects themselves can (and must) structure and sequence
> their processing to ensure consistent operation. Multiple reductions
> are thus *not* inconsistent with randomized queueing. If the abort is
> triggered, then there is a message delivery order that causes different
> elements in an array to make a different sequence of contribute calls.
>
> In the particular case you present, I would recommend sparing yourself
> the bulk of the frustrating reasoning about message ordering, and
> moving the contribution of the diagnostics onto a bound array.
>
> If you are still getting the error when running under randomized
> queues, there may be more going on than is apparent in the code you
> presented. I'm putting together a patch that will provide deeper
> diagnostic information for you.
> Phil
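
[Editor's sketch] The point above, that the abort means different elements made a different sequence of contribute calls, can be illustrated with a small toy model. This is plain C++, not Charm++ code, and it does not reflect how the runtime is implemented. The names Element and firstMismatch are hypothetical. The model assumes only that the n-th contribute() call of every element feeds the n-th reduction of the array, so all elements must name the same callback at position n.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy model, not Charm++ code: one array element that records the
// callback named by each of its contribute() calls, in call order.
struct Element {
    std::vector<std::string> calls;

    void contribute(const std::string& callback) { calls.push_back(callback); }
};

// Returns the index of the first reduction in which some element names a
// different callback than the first element does, or -1 if none is found.
// Only reductions the first element has contributed to are compared.
int firstMismatch(const std::vector<Element>& elements) {
    assert(!elements.empty());
    const std::vector<std::string>& reference = elements.front().calls;
    for (std::size_t n = 0; n < reference.size(); ++n) {
        for (const Element& e : elements) {
            // Elements that have not reached reduction n yet are skipped.
            if (n < e.calls.size() && e.calls[n] != reference[n]) {
                return static_cast<int>(n);
            }
        }
    }
    return -1;
}
```

If two elements both call contribute("advance") and then contribute("diagnostics"), firstMismatch reports no problem. If a different message delivery order makes one element issue the two calls in the opposite order, reduction 0 already sees two different callbacks, which corresponds to the mis-matched situation discussed in this thread.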
>
> On Thu, Nov 2, 2017 at 12:24 PM, Jozsef Bakosi wrote:
>
> Hi Phil,
>
> Sorry for the whining, but this error is giving me way too much trouble
> and I don't think my understanding is getting better.
>
> So I am successfully using shadow arrays and they do appear to work
> around this problem. (So far I have tried this, successfully, only with
> groups.)
>
> Since I have mainly been getting this problem with multiple reductions
> using the randomized-queue build of Charm++, I wonder if my requirement
> that logic involving SDAG and multiple reductions execute correctly
> (i.e., without this error) makes sense even with randomized queues. I am
> thinking that randomized queues will most likely fire off multiple
> reductions in a different (i.e., random) order, effectively taking the
> ordering out of my hands. Do you think that's true? Aren't multiple
> reductions inherently incompatible with randomized queues?
>
> To make it more concrete, I have the following simplified scenario in
> pseudo-code:
>
> class ChareArray : public CBase_ChareArray {
>   /*entry*/ void dt() {
>     // compute some dt specific to this array element
>     double dt = ...
>     // allreduce:
>     contribute( to all elements of ChareArray targeting advance(mindt),
>                 delivering the minimum of some dt to all elements )
>   }
>   /*entry*/ void advance(double mindt) {
>     contribute( to some single chare collecting some diagnostics )
>     if (continue time stepping)
>       dt();
>     else
>       contribute( to some single chare eventually calling ckExit() )
>   }
> }
>
> So during time stepping there are really two contribute calls, and I'm
> pretty sure these two generate the "mis-matched client callbacks in
> reduction messages" error. (I don't think the logic gets to the
> contribute that will eventually get to ckExit().)
>
> When I start one of them from a bound/shadow array, I still get the
> error, but only with randomized queues. The order of contributions to
> the two reductions (per single chare), I believe, is guaranteed here.
> But won't randomized queues screw up the order? Can that even be done?
> Do I want too much?
>
> Jozsef
>
> On 10.29.2017 17:22, Jozsef Bakosi wrote:
> > On 10.27.2017 11:38, Jozsef Bakosi wrote:
> > > On 10.27.2017 11:02, Phil Miller wrote:
> > > > We use an approach of creating bound 'shadow' arrays to act as
> > > > independent reduction (sequencing) contexts to address this
> > > > limitation. We've used this approach in a few places in our code,
> > > > including the LiveViz in-situ visualization library and the
> > > > collision detection library.
> > > >
> > > > In a little more detail, when constructing a chare array, it's
> > > > possible to specify that it should be bound to another existing
> > > > chare array. That means that elements of the same index will
> > > > always live on the same PE. So, you can instantiate some auxiliary
> > > > arrays, one per reduction stream, and bind them to your main
> > > > computation arrays. Since elements with corresponding indices are
> > > > guaranteed to be co-located, the main element can get a pointer to
> > > > each auxiliary via a ckLocal() call, and then call
> > > > aux->contribute(...) rather than implicitly this->contribute().
> > > > So, the setup code gets a bit more complicated, and the code
> > > > actually invoking the reductions gets just a little more involved.
> > > >
> > > > Is that a clear description? Does that approach work for you?
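
[Editor's sketch] The shadow-array idea described above, one independent sequencing context per reduction stream, can be illustrated with a toy model in plain C++. This is not Charm++ code: binding, co-location on a PE, and ckLocal() are not modeled. ReductionContext, MainElement, contributeVia, and streamsConsistent are hypothetical names chosen for this illustration. Each stream keeps its own call sequence, so the relative order of calls across streams no longer matters.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Toy model, not Charm++ code: stands in for one auxiliary (shadow)
// element, recording the callback of each contribute() call in order.
struct ReductionContext {
    std::vector<std::string> calls;

    void contribute(const std::string& callback) { calls.push_back(callback); }
};

// A main element with one hypothetical shadow context per reduction stream.
struct MainElement {
    std::map<std::string, ReductionContext> shadows;  // keyed by stream name

    // Mirrors calling aux->contribute(...) instead of this->contribute().
    void contributeVia(const std::string& stream, const std::string& callback) {
        shadows[stream].contribute(callback);
    }
};

// True if, for every stream of the first element, all elements agree on
// the callback sequence over the prefix they have in common.
bool streamsConsistent(const std::vector<MainElement>& elements) {
    assert(!elements.empty());
    for (const auto& entry : elements.front().shadows) {
        const std::vector<std::string>& ref = entry.second.calls;
        for (const MainElement& e : elements) {
            auto it = e.shadows.find(entry.first);
            if (it == e.shadows.end()) continue;  // nothing contributed yet
            const std::vector<std::string>& seq = it->second.calls;
            const std::ptrdiff_t common =
                static_cast<std::ptrdiff_t>(std::min(seq.size(), ref.size()));
            if (!std::equal(seq.begin(), seq.begin() + common, ref.begin())) {
                return false;
            }
        }
    }
    return true;
}
```

In this model, one element may contribute to a "mindt" stream and then a "diag" stream while another does so in the opposite order, and streamsConsistent still returns true. A mismatch appears only when two elements disagree within the same stream.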
> > >
> > > I think that would work and I do use bound arrays for a different
> > > purpose.
> > >
> > > So how would I have to use this? Here is what I think I need to do:
> > > I have to identify all reductions that can happen in an order that
> > > is not necessarily guaranteed to be always the same and fire them
> > > from bound arrays instead (each from a different chare array)?
> >
> > Is there a way to tell which two reductions caused the "mis-matched
> > client callbacks in reduction messages" error? I do get a traceback
> > from one, but can I get one from the other one somehow so I know which
> > reduction I have to initiate from a shadow array?
> >
> > Thanks,
> > J