- From: Phil Miller <phil AT hpccharm.com>
- To: "Kale, Laxmikant V" <kale AT illinois.edu>
- Cc: Steve Petruzza <spetruzza AT sci.utah.edu>, charm <charm AT lists.cs.illinois.edu>
- Subject: Re: [charm] Timing execution at scale
- Date: Mon, 8 Aug 2016 11:07:21 -0500
By "three global chare arrays" and the questions around them, Steve is referring to a few distinct but related things:
3. The readonly proxy variables, which are initialized by the ckNew calls and subsequently broadcast to all PEs
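For concreteness, readonly proxies for three arrays would be set up roughly like this (module, chare, and variable names here are made up, and the element counts are assumed to exist):

    /* module.ci -- hypothetical names, sketch only */
    mainmodule app {
      readonly CProxy_InputTask   inputProxy;
      readonly CProxy_ComputeTask computeProxy;
      readonly CProxy_OutputTask  outputProxy;

      mainchare Main { entry Main(CkArgMsg *m); };
      array [1D] InputTask   { entry InputTask(); };
      array [1D] ComputeTask { entry ComputeTask(); };
      array [1D] OutputTask  { entry OutputTask(); };
    };

    /* app.C -- the readonly globals are assigned in the main chare's
       constructor; once it returns, the runtime broadcasts them to all PEs. */
    CProxy_InputTask   inputProxy;
    CProxy_ComputeTask computeProxy;
    CProxy_OutputTask  outputProxy;

    Main::Main(CkArgMsg *m) {
      inputProxy   = CProxy_InputTask::ckNew(nInput);     // nInput etc. assumed
      computeProxy = CProxy_ComputeTask::ckNew(nCompute);
      outputProxy  = CProxy_OutputTask::ckNew(nOutput);
      delete m;
    }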
No extra overhead for 3 chare arrays.
(I don't understand what you mean by "read only"; a chare array is not read only. Or do you mean their content doesn't change? That's probably irrelevant.)
Yes, but with this scheme you can collect data from those chare arrays separately.
What do you do with those ints coming back? If it's any commutative/associative operation, you should use a reduction. Even if it is not, you can do a reduction with the reduction type CkReduction::concat.
Would the input array need to send done messages too? Do they still have an int payload? Anyway, you can use an appropriate type of reduction for it.
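For example, a concat reduction from the compute array back to the main chare could look roughly like this (entry and variable names are made up; mainProxy is assumed to be a readonly proxy to the main chare):

    // Each element contributes its int; CkReduction::concat hands the main
    // chare all the contributed ints in a single reduction message.
    void ComputeTask::finish(int myValue) {
      CkCallback cb(CkIndex_Main::allDone(NULL), mainProxy);
      contribute(sizeof(int), &myValue, CkReduction::concat, cb);
    }

    // .ci (hypothetical):  entry void allDone(CkReductionMsg *msg);
    void Main::allDone(CkReductionMsg *msg) {
      int  n      = msg->getSize() / sizeof(int);
      int *values = (int *) msg->getData();
      // ... consume the n collected ints ...
      delete msg;
    }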
Laxmikant (Sanjay) Kale http://charm.cs.uiuc.edu
Professor, Computer Science kale AT illinois.edu
201 N. Goodwin Avenue Ph: (217) 244-0094
Urbana, IL 61801-2302
From: Steve Petruzza <spetruzza AT sci.utah.edu>
Date: Monday, August 8, 2016 at 9:31 AM
To: Laxmikant Kale <kale AT illinois.edu>
Cc: Phil Miller <phil AT hpccharm.com>
Subject: Re: [charm] Timing execution at scale
The done message has only an “int” payload.
I am considering splitting the main chare array into 3 (global) arrays: input tasks, computing tasks, and output tasks.
This way I can easily manage the timing of the initialization, computing, and finalisation phases.
Would it add any overhead to have 3 global chare arrays (/*read only*/) instead of a single one? All three will be created in the main chare.
Of course tasks from the computation phase, for example, will eventually call tasks from the finalisation phase. Initialization and finalisation tasks are generally about 10% of the total number of tasks.
Does it make sense?
Thank you,
Steve
On 08 Aug 2016, at 17:05, Kale, Laxmikant V <kale AT illinois.edu> wrote:
That looks like a GNI-layer memory error (running out of pinned memory?). It could be due to the barrage of messages to PE 0 (how big are the done messages?), or it could be due to a leak in the Charm++ system (less likely, because we have applications that run for days). It may be worthwhile running on another layer (Charm++ on MPI on the Cray, or Charm++ on another non-Cray machine of your choice).
Quiescence detection may be worth using if your “done” messages have no payload (and even if they do have a small payload, it could be collected via some other group reduction after quiescence).
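A rough sketch of the quiescence-detection variant (entry and variable names here are hypothetical):

    // In the main chare, after firing off the work; no done() messages needed.
    void Main::startComputation() {
      startTime = CkWallTimer();                 // member double, assumed
      computeProxy.doWork();                     // hypothetical task array proxy
      CkStartQD(CkCallback(CkIndex_Main::quiescent(), mainProxy));
    }

    // Invoked by the runtime once no messages are in flight or being processed.
    void Main::quiescent() {
      CkPrintf("phase took %f s\n", CkWallTimer() - startTime);
      // Any small per-element data could now be collected with a reduction.
    }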
In any case, we shouldn’t have that error, and we need to find out what resource is running out and why.
I think it’s best to have Phil continue to advise you (and maybe another person, if Phil asks someone else).
Laxmikant (Sanjay) Kale http://charm.cs.uiuc.edu
Professor, Computer Science kale AT illinois.edu
201 N. Goodwin Avenue Ph: (217) 244-0094
Urbana, IL 61801-2302
From: Steve Petruzza <spetruzza AT sci.utah.edu>
Date: Monday, August 8, 2016 at 8:45 AM
To: Laxmikant Kale <kale AT illinois.edu>
Cc: Phil Miller <phil AT hpccharm.com>
Subject: Re: [charm] Timing execution at scale
Thank you Kale,
I briefly described the application on the other thread ("Scalability issues using large chare array").
No, the N tasks are not independent; they call a few other methods (all in the same global proxy). The only large-range call I make is for the timing.
I can tell you that it reaches 64K cores using almost 600K tasks (in the same chare array), strong-scaling well (taking the timing with this 10% global "done" call).
When I go further, up to 128K cores with over 1M tasks, I get:
[0] GNI_GNI_MemRegister mem buffer; err=GNI_RC_ERROR_RESOURCE
or (non SMP)
[0] GNI_GNI_MemRegister mem buffer; err=GNI_RC_ERROR_RESOURCE
------- Partition 0 Processor 0 Exiting: Called CmiAbort ------
Reason: GNI_RC_CHECK
I wonder if the two things could be related.
On the side I will investigate if there is some task indexing/prefixing issue in my code that could be related to over 1 million tasks…
But surely there is still some resource distribution/requesting issue.
Regarding the timing “issue”: do you think that dumping the time value to a file (one file per proc) from these 10% of procs and then post-processing would be better?
Thank you,
Steve
On 08 Aug 2016, at 16:10, Kale, Laxmikant V <kale AT illinois.edu> wrote:
If you are running this on 100+ cores, yes, the 10% of tasks sending done messages will be a bottleneck at the processor running the main chare. Basically, even assuming you are not doing any significant computation for each done message, it will take a microsecond or so to process each one.
Creating a section for one-time use is not a good solution (at least for now; we have a distributed section creation feature in the pipeline).
If all 100% of the chares can participate in the done function (some with no data) and the data being collected from them is reducible (a sum of numbers, or even collecting a set of solutions), it may be worthwhile using a reduction over the entire array.
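For instance, something along these lines (names are illustrative; done would be a [reductiontarget] entry on the main chare):

    // Every element of the single array contributes, so the runtime's reduction
    // tree delivers one combined message to the main chare instead of 10% of
    // the elements sending point-to-point done() messages to PE 0.
    void Task::phaseDone() {
      int value = hasResult ? myResult : 0;      // elements with no data add 0
      CkCallback cb(CkReductionTarget(Main, done), mainProxy);
      contribute(sizeof(int), &value, CkReduction::sum_int, cb);
    }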
As a tangential thought:
Do you have N tasks that are independent of each other (no lateral communication from task i to task j)? For such master-slave or search applications, you should use singleton chares and seed balancers.
I am also wondering if we should take this conversation off charm mailing list for now, and keep it between a few of us. We can summarize to the mailing list later. (I dropped the mailing list).
It will be helpful to see some sort of skeleton of your application.
Laxmikant (Sanjay) Kale http://charm.cs.uiuc.edu
Professor, Computer Science kale AT illinois.edu
201 N. Goodwin Avenue Ph: (217) 244-0094
Urbana, IL 61801-2302
From: Steve Petruzza <spetruzza AT sci.utah.edu>
Reply-To: Steve Petruzza <spetruzza AT sci.utah.edu>
Date: Monday, August 8, 2016 at 4:18 AM
To: charm <charm AT lists.cs.illinois.edu>
Subject: [charm] Timing execution at scale
Hi,
In my application I have a global chare array with N tasks, where N can vary from 1K to 1M.
At the moment I am timing the execution from the main chare, where a subset of tasks (10%) at the end will call the “done" function of the main chare.
Do you think that these 10% of calls could add considerable overhead to the timing?
The alternative, I think, could be to use a reduction operation on a ProxySection, but wouldn't creating such a proxy section (it has to be global, created by the main chare) with 10% of the tasks, plus the reduction operation, create an even bigger overhead anyway (at least in memory)?
If necessary, is there any other alternative to precisely time a specific part of the execution? Could Projections help (at scale)?
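For reference, the current scheme is roughly the following (names simplified):

    // Main chare: each of the 10% "reporting" tasks invokes done(); the elapsed
    // wall time is recorded when the last one arrives.
    void Main::done(int payload) {
      if (++doneCount == expectedDone) {          // expectedDone = 10% of N
        CkPrintf("phase time: %f s\n", CkWallTimer() - startTime);
        // ... continue with the next phase ...
      }
    }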
Steve