- From: Evghenii Gaburov <e-gaburov AT northwestern.edu>
- To: "charm AT cs.uiuc.edu" <charm AT cs.uiuc.edu>
- Cc: "Kale, Laxmikant V" <kale AT illinois.edu>
- Subject: [charm] huge overhead for CkStartQD(..) question for large # Chares on many CPUs
- Date: Sat, 24 Sep 2011 22:25:12 +0000
Hi,
I have a Charm++ code which uses CkStartQD(..) (example shown below).
The overhead of this code becomes huge once the number of chares reaches 1024.
The CPU topology is such that 16 nodes are in use, each an 8-way SMP.
$ ./charmrun +p128 ./test 128 done in 0.0011 sec (for 128 chares)
$ ./charmrun +p128 ./test 256 done in 0.0019 sec (for 256 chares)
$ ./charmrun +p128 ./test 512 done in 0.0032 sec ( etc...)
$ ./charmrun +p128 ./test 1024 done in 0.130 sec
$ ./charmrun +p128 ./test 2048 done in 0.126 sec
$ ./charmrun +p128 ./test 4096 done in 0.138 sec
I did some tests with a reduced number of CPUs; the timings below show pretty
much the same behaviour.
The overhead of 0.1 sec is way too high for my purposes; ideally I want it to
be < 10 ms. I am also curious: is it by design that the overhead scales
linearly with the number of chares? And is there a specific reason why the
overhead goes up so much with such a large number of chares?
If in the code below I replace
CkStartQD(CkIndex_System::ImportII(), &thishandle);
with
ImportII();
the overhead is negligible:
$ ./charmrun +p128 ./test 4096 done in 0.00031 sec
I use this CkStartQD(..) as a barrier to make sure no new messages are
executed until all previously launched messages from all chares have been
processed. However, I cannot accept such a huge overhead for a large number
of chares.
Any help with this?
I use charm-6.2 with mpi-linux-x86_64, compiled with -O4 -DCMK_OPTIMIZE=1.
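(To be concrete, the application is compiled through charmc with those flags;
a rough sketch, with placeholder file names:)

$ charmc test.ci                                # interface translator: generates test.decl.h / test.def.h
$ charmc -O4 -DCMK_OPTIMIZE=1 -c test.C         # compile the chare code with the flags above
$ charmc -language charm++ -O4 -DCMK_OPTIMIZE=1 -o test test.o   # link against the charm-6.2 runtime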
Cheers,
Evghenii
$ ./charmrun +p16 ./test 128 done in 0.0006928 sec
$ ./charmrun +p16 ./test 256 done in 0.001204 sec
$ ./charmrun +p16 ./test 512 done in 0.00248384 sec
$ ./charmrun +p16 ./test 1024 done in 0.136 sec
$ ./charmrun +p16 ./test 2048 done in 0.135 sec
$ ./charmrun +p16 ./test 4096 done in 0.542 sec
This setup uses 2 compute nodes, each with an 8-core CPU (8-way SMP).
-----------
$ ./charmrun +p12 ./test 128 done in 0.000758 sec
$ ./charmrun +p12 ./test 256 done in 0.00132 sec
$ ./charmrun +p12 ./test 512 done in 0.00252 sec
$ ./charmrun +p12 ./test 1024 done in 0.00464 sec
$ ./charmrun +p12 ./test 2048 done in 0.00876 sec
$ ./charmrun +p12 ./test 4096 done in 0.141 sec
This setup uses 2 compute nodes with the following configuration (from the Charm++ output):
[egy820@qnode0290 charm]$ ./charmrun +p12 ./fvmhd3d 128
Running on 12 processors: ./fvmhd3d 128
mpirun -np 12 -machinefile /hpc/opt/torque/nodes/qnode0290/aux//1178704.qsched01 ./fvmhd3d 128
Charm++> Running on MPI version: 2.1 multi-thread support: 0 (max supported: -1)
Warning> Randomization of stack pointer is turned on in kernel, thread
migration may not work! Run 'echo 0 > /proc/sys/kernel/randomize_va_space' as
root to disable it, or try run with '+isomalloc_sync'.
Charm++> Running on 2 unique compute nodes (8-way SMP).
Charm++> Cpu topology info:
PE to node map: 0 0 0 0 0 0 0 0 1 1 1 1
Node to PE map:
Chip #0: 0 1 2 3 4 5 6 7
Chip #1: 8 9 10 11
--------------
$ ./charmrun +p8 ./test 128 done in 0.0006 sec
$ ./charmrun +p8 ./test 256 done in 0.0011 sec
$ ./charmrun +p8 ./test 512 done in 0.0020 sec
$ ./charmrun +p8 ./test 1024 done in 0.0038 sec
$ ./charmrun +p8 ./test 2048 done in 0.0075 sec
$ ./charmrun +p8 ./test 4096 done in 0.0158 sec
One 8-way SMP node is used here.
-------------
Here is the code:
void MainChare::doTimeStep()
{
  const double t0 = CkWalltimer();
  systemProxy.Iterate(CkCallbackResumeThread());
  CkPrintf("done in %g sec \n", CkWalltimer() - t0);
}

void System::Iterate(CkCallback &cb)
{
  Iterate_completeCb = cb;
  systemProxy[thisIndex].Import(CkCallback(CkIndex_System::IterateII(), thishandle));
}

void System::IterateII()
{
  contribute(0, 0, CkReduction::concat, Iterate_completeCb);
}

void System::Import(CkCallback &cb)
{
  Import_completeCb = cb;
#if 1
  // Import_ngb_recursively(recursion_depth=2); /* this one recursively sends msg */
  CkStartQD(CkIndex_System::ImportII(), &thishandle);
  /* CkStartQD(..) makes sure that all launched msg are completed, */
  /* and if none are launched, makes sure that arrived msg from remote chares */
  /* are processed before going further */
#else
  ImportII();
#endif
}

void System::ImportII()
{
  Import_completeCb.send();
}
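(For completeness: if I read the manual correctly, the same quiescence request
can also be written with the callback-based overload of CkStartQD. A minimal
sketch, untested, and I would not expect it to change the QD cost itself:)

void System::Import(CkCallback &cb)
{
  Import_completeCb = cb;
  /* callback form of the same quiescence request: once quiescence is   */
  /* detected, the runtime invokes ImportII() on this chare via the cb. */
  CkStartQD(CkCallback(CkIndex_System::ImportII(), thishandle));
}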
--
Evghenii Gaburov,
e-gaburov AT northwestern.edu