- From: Phil Miller <phil AT hpccharm.com>
- To: "Kale, Laxmikant V" <kale AT illinois.edu>
- Cc: Steve Petruzza <spetruzza AT sci.utah.edu>, charm <charm AT lists.cs.illinois.edu>
- Subject: Re: [charm] Timing execution at scale
- Date: Mon, 8 Aug 2016 11:07:21 -0500
By "three global chare arrays" and the questions around them, Steve is referring to a few distinct but related things:
3. The readonly proxy variables, which are initialized by the ckNew calls and subsequently broadcast to all PEs
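For concreteness, readonly proxies for three arrays would be set up roughly like this (module, chare, and variable names here are made up, and the element counts are assumed to exist):

    /* module.ci -- hypothetical names, sketch only */
    mainmodule app {
      readonly CProxy_InputTask   inputProxy;
      readonly CProxy_ComputeTask computeProxy;
      readonly CProxy_OutputTask  outputProxy;

      mainchare Main { entry Main(CkArgMsg *m); };
      array [1D] InputTask   { entry InputTask(); };
      array [1D] ComputeTask { entry ComputeTask(); };
      array [1D] OutputTask  { entry OutputTask(); };
    };

    /* app.C -- the readonly globals are assigned in the main chare's
       constructor; once it returns, the runtime broadcasts them to all PEs. */
    CProxy_InputTask   inputProxy;
    CProxy_ComputeTask computeProxy;
    CProxy_OutputTask  outputProxy;

    Main::Main(CkArgMsg *m) {
      inputProxy   = CProxy_InputTask::ckNew(nInput);     // nInput etc. assumed
      computeProxy = CProxy_ComputeTask::ckNew(nCompute);
      outputProxy  = CProxy_OutputTask::ckNew(nOutput);
      delete m;
    }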
No extra overhead for 3 chare arrays.
(I don't understand what you mean by "read only"; a chare array is not read only. Or do you mean their content doesn't change? That's probably irrelevant.)
Yes, but with this scheme you can collect data from those chare arrays separately.
What do you do with those ints coming back? If it's any commutative/associative operation, you should use a reduction. Even if it is not, you can do a reduction with the reduction type CkReduction::concat.
Would the input array need to send done messages too? Do they still have an int payload? Anyway, you can use an appropriate type of reduction for it.
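For example, a concat reduction from the compute array back to the main chare could look roughly like this (entry and variable names are made up; mainProxy is assumed to be a readonly proxy to the main chare):

    // Each element contributes its int; CkReduction::concat hands the main
    // chare all the contributed ints in a single reduction message.
    void ComputeTask::finish(int myValue) {
      CkCallback cb(CkIndex_Main::allDone(NULL), mainProxy);
      contribute(sizeof(int), &myValue, CkReduction::concat, cb);
    }

    // .ci (hypothetical):  entry void allDone(CkReductionMsg *msg);
    void Main::allDone(CkReductionMsg *msg) {
      int  n      = msg->getSize() / sizeof(int);
      int *values = (int *) msg->getData();
      // ... consume the n collected ints ...
      delete msg;
    }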
Laxmikant (Sanjay) Kale http://charm.cs.uiuc.edu
Professor, Computer Science kale AT illinois.edu
201 N. Goodwin Avenue Ph: (217) 244-0094
Urbana, IL 61801-2302
From: Steve Petruzza <spetruzza AT sci.utah.edu>
Date: Monday, August 8, 2016 at 9:31 AM
To: Laxmikant Kale <kale AT illinois.edu>
Cc: Phil Miller <phil AT hpccharm.com>
Subject: Re: [charm] Timing execution at scale
The done message has only an “int” payload.
I am considering splitting the main chare array into 3 (global) arrays: input tasks, computing tasks, and output tasks.
This way I can easily manage the timing of the initialization, computing, and finalisation phases.
Would it add any overhead to have 3 global chare arrays (/*read only*/) instead of a single one? All three will be created in the main chare.
Of course tasks from the computation phase, for example, will eventually call tasks from the finalisation phase. Initialization and finalisation tasks are generally about 10% of the total number of tasks.
Does it make sense?
Thank you,
Steve
On 08 Aug 2016, at 17:05, Kale, Laxmikant V <kale AT illinois.edu> wrote:
That looks like a GNI-layer memory error (running out of pinned memory?). It could be due to the barrage of messages to PE 0 (how big are the done messages?), or it could be due to a leak in the Charm++ system (less likely, because we have applications that run for days). It may be worthwhile running on another layer (Charm++ on MPI on the Cray, or Charm++ on another non-Cray machine of your choice).
Quiescence detection may be worth using if your “done” messages have no payload (and even if they do have a small payload, it could be collected via some other group reduction after quiescence).
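A rough sketch of the quiescence-detection variant (entry and variable names here are hypothetical):

    // In the main chare, after firing off the work; no done() messages needed.
    void Main::startComputation() {
      startTime = CkWallTimer();                 // member double, assumed
      computeProxy.doWork();                     // hypothetical task array proxy
      CkStartQD(CkCallback(CkIndex_Main::quiescent(), mainProxy));
    }

    // Invoked by the runtime once no messages are in flight or being processed.
    void Main::quiescent() {
      CkPrintf("phase took %f s\n", CkWallTimer() - startTime);
      // Any small per-element data could now be collected with a reduction.
    }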
In any case, we shouldn’t have that error, and we need to find out what resource is running out and why.
I think it’s best to have Phil continue to advise you (and maybe another person, if Phil asks someone else).
Laxmikant (Sanjay) Kale http://charm.cs.uiuc.edu
Professor, Computer Science kale AT illinois.edu
201 N. Goodwin Avenue Ph: (217) 244-0094
Urbana, IL 61801-2302
From: Steve Petruzza <spetruzza AT sci.utah.edu>
Date: Monday, August 8, 2016 at 8:45 AM
To: Laxmikant Kale <kale AT illinois.edu>
Cc: Phil Miller <phil AT hpccharm.com>
Subject: Re: [charm] Timing execution at scale
Thank you Kale,
I briefly described the application on the other thread ("Scalability issues using large chare array").
No, the N tasks are not independent; they call a few other methods (all in the same global proxy). The only large-range call I make is for the timing.
I can tell you that it reaches 64K cores using almost 600K tasks (in the same chare array), strong-scaling well (taking the timing with this 10% global "done" call).
When I go further, up to 128K cores with over 1M tasks, I get:
[0] GNI_GNI_MemRegister mem buffer; err=GNI_RC_ERROR_RESOURCE
or (non SMP)
[0] GNI_GNI_MemRegister mem buffer; err=GNI_RC_ERROR_RESOURCE
------- Partition 0 Processor 0 Exiting: Called CmiAbort ------
Reason: GNI_RC_CHECK
I wonder if the two things could be related.
On the side I will investigate if there is some task indexing/prefixing issue in my code that could be related to over 1 million tasks…
But surely there is still some resource distribution/requesting issue.
Regarding the timing “issue”: do you think that dumping the time value to a file (one file per proc) from these 10% of procs and then post-processing would be better?
Thank you,
Steve
On 08 Aug 2016, at 16:10, Kale, Laxmikant V <kale AT illinois.edu> wrote:
If you are running this on 100+ cores, yes, the 10% of tasks sending done messages will be a bottleneck at the processor running the main chare. Basically, even assuming you are not doing any significant computation for each done message, it will take a microsecond or so to process each one.
Creating a section for one-time use is not a good solution (at least for now; we have a distributed section creation feature in the pipeline).
If all 100% of the chares can participate in the done function (some with no data) and the data being collected from them is reducible (a sum of numbers, or even collecting a set of solutions), it may be worthwhile using a reduction over the entire array.
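For instance, something along these lines (names are illustrative; done would be a [reductiontarget] entry on the main chare):

    // Every element of the single array contributes, so the runtime's reduction
    // tree delivers one combined message to the main chare instead of 10% of
    // the elements sending point-to-point done() messages to PE 0.
    void Task::phaseDone() {
      int value = hasResult ? myResult : 0;      // elements with no data add 0
      CkCallback cb(CkReductionTarget(Main, done), mainProxy);
      contribute(sizeof(int), &value, CkReduction::sum_int, cb);
    }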
As a tangential thought:
Do you have N tasks that are independent of each other (no lateral communication from task i to task j)? For such master-slave or search applications, you should use singleton chares and seed balancers.
I am also wondering if we should take this conversation off charm mailing list for now, and keep it between a few of us. We can summarize to the mailing list later. (I dropped the mailing list).
It will be helpful to see some sort of skeleton of your application.
Laxmikant (Sanjay) Kale http://charm.cs.uiuc.edu
Professor, Computer Science kale AT illinois.edu
201 N. Goodwin Avenue Ph: (217) 244-0094
Urbana, IL 61801-2302
From: Steve Petruzza <spetruzza AT sci.utah.edu>
Reply-To: Steve Petruzza <spetruzza AT sci.utah.edu>
Date: Monday, August 8, 2016 at 4:18 AM
To: charm <charm AT lists.cs.illinois.edu>
Subject: [charm] Timing execution at scale
Hi,
In my application I have a global chare array with N tasks, where N can vary from 1K to 1M.
At the moment I am timing the execution from the main chare, where a subset of tasks (10%) at the end will call the “done" function of the main chare.
Do you think that these 10% of calls could add considerable overhead to the timing?
The alternative, I think, could be to use a reduction operation on a ProxySection, but wouldn't creating such a proxy section (it has to be global, created by the main chare) with 10% of the tasks, plus the reduction operation, create an even bigger overhead anyway (at least in memory)?
If necessary, is there any other alternative to precisely time a specific part of the execution? Could Projections help (at scale)?
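For reference, the current scheme is roughly the following (names simplified):

    // Main chare: each of the 10% "reporting" tasks invokes done(); the elapsed
    // wall time is recorded when the last one arrives.
    void Main::done(int payload) {
      if (++doneCount == expectedDone) {          // expectedDone = 10% of N
        CkPrintf("phase time: %f s\n", CkWallTimer() - startTime);
        // ... continue with the next phase ...
      }
    }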
Steve