- From: Evghenii Gaburov <e-gaburov AT northwestern.edu>
- To: "charm AT cs.uiuc.edu" <charm AT cs.uiuc.edu>
- Cc: "Kale, Laxmikant V" <kale AT illinois.edu>
- Subject: [charm] huge overhead for CkStartQD(..) question for large # Chares on many CPUs
- Date: Sat, 24 Sep 2011 22:25:12 +0000
Hi,
I have a Charm++ code which uses CkStartQD(..) (example shown below).
The overhead of this code becomes huge once the number of chares reaches 1024.
The CPU topology is such that 16 nodes are in use, each an 8-way SMP.
$ ./charmrun +p128 ./test 128 done in 0.0011 sec (for 128 chares)
$ ./charmrun +p128 ./test 256 done in 0.0019 sec (for 256 chares)
$ ./charmrun +p128 ./test 512 done in 0.0032 sec ( etc...)
$ ./charmrun +p128 ./test 1024 done in 0.130 sec
$ ./charmrun +p128 ./test 2048 done in 0.126 sec
$ ./charmrun +p128 ./test 4096 done in 0.138 sec
I did some tests with a reduced number of CPUs; the timings below show pretty
much the same behaviour.
The overhead of 0.1 sec is way too high for my purposes; ideally I want it to
be < 10 ms. I am also curious: is it by design that the overhead scales
linearly with the number of chares? And is there a specific reason why the
overhead goes up so much with such a large number of chares?
If in the code below I replace
CkStartQD(CkIndex_System::ImportII(), &thishandle);
with
ImportII();
the overhead is negligible:
$ ./charmrun +p128 ./test 4096 done in 0.00031 sec
I use this CkStartQD(..) as a barrier to make sure no new messages are
executed until all previously launched messages from all chares have been
processed. However, I cannot accept such a huge overhead for a large number
of chares.
Any help with this?
I use charm-6.2 with mpi-linux-x86_64, compiled with -O4 -DCMK_OPTIMIZE=1.
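(To be concrete, the application is compiled through charmc with those flags;
a rough sketch, with placeholder file names:)

$ charmc test.ci                                # interface translator: generates test.decl.h / test.def.h
$ charmc -O4 -DCMK_OPTIMIZE=1 -c test.C         # compile the chare code with the flags above
$ charmc -language charm++ -O4 -DCMK_OPTIMIZE=1 -o test test.o   # link against the charm-6.2 runtime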
Cheers,
Evghenii
$ ./charmrun +p16 ./test 128 done in 0.0006928 sec
$ ./charmrun +p16 ./test 256 done in 0.001204 sec
$ ./charmrun +p16 ./test 512 done in 0.00248384 sec
$ ./charmrun +p16 ./test 1024 done in 0.136 sec
$ ./charmrun +p16 ./test 2048 done in 0.135 sec
$ ./charmrun +p16 ./test 4096 done in 0.542 sec
This setup uses 2 compute nodes, each with an 8-core CPU (8-way SMP).
-----------
$ ./charmrun +p12 ./test 128 done in 0.000758 sec
$ ./charmrun +p12 ./test 256 done in 0.00132 sec
$ ./charmrun +p12 ./test 512 done in 0.00252 sec
$ ./charmrun +p12 ./test 1024 done in 0.00464 sec
$ ./charmrun +p12 ./test 2048 done in 0.00876 sec
$ ./charmrun +p12 ./test 4096 done in 0.141 sec
This setup uses 2 compute nodes with the following configuration (from the Charm++ output):
[egy820@qnode0290 charm]$ ./charmrun +p12 ./fvmhd3d 128
Running on 12 processors: ./fvmhd3d 128
mpirun -np 12 -machinefile /hpc/opt/torque/nodes/qnode0290/aux//1178704.qsched01 ./fvmhd3d 128
Charm++> Running on MPI version: 2.1 multi-thread support: 0 (max supported: -1)
Warning> Randomization of stack pointer is turned on in kernel, thread
migration may not work! Run 'echo 0 > /proc/sys/kernel/randomize_va_space' as
root to disable it, or try run with '+isomalloc_sync'.
Charm++> Running on 2 unique compute nodes (8-way SMP).
Charm++> Cpu topology info:
PE to node map: 0 0 0 0 0 0 0 0 1 1 1 1
Node to PE map:
Chip #0: 0 1 2 3 4 5 6 7
Chip #1: 8 9 10 11
--------------
$ ./charmrun +p8 ./test 128 done in 0.0006 sec
$ ./charmrun +p8 ./test 256 done in 0.0011 sec
$ ./charmrun +p8 ./test 512 done in 0.0020 sec
$ ./charmrun +p8 ./test 1024 done in 0.0038 sec
$ ./charmrun +p8 ./test 2048 done in 0.0075 sec
$ ./charmrun +p8 ./test 4096 done in 0.0158 sec
One 8-way SMP node is used here.
-------------
Here is the code:
void MainChare::doTimeStep()
{
  const double t0 = CkWalltimer();
  systemProxy.Iterate(CkCallbackResumeThread());
  CkPrintf("done in %g sec \n", CkWalltimer() - t0);
}

void System::Iterate(CkCallback &cb)
{
  Iterate_completeCb = cb;
  systemProxy[thisIndex].Import(CkCallback(CkIndex_System::IterateII(), thishandle));
}

void System::IterateII()
{
  contribute(0, 0, CkReduction::concat, Iterate_completeCb);
}

void System::Import(CkCallback &cb)
{
  Import_completeCb = cb;
#if 1
  // Import_ngb_recursively(recursion_depth=2); /* this one recursively sends msg */
  CkStartQD(CkIndex_System::ImportII(), &thishandle);
  /* CkStartQD(..) makes sure that all launched msg are completed, */
  /* and if none are launched, makes sure that arrived msg from remote chares */
  /* are processed before going further */
#else
  ImportII();
#endif
}

void System::ImportII()
{
  Import_completeCb.send();
}
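(For completeness: if I read the manual correctly, the same quiescence request
can also be written with the callback-based overload of CkStartQD. A minimal
sketch, untested, and I would not expect it to change the QD cost itself:)

void System::Import(CkCallback &cb)
{
  Import_completeCb = cb;
  /* callback form of the same quiescence request: once quiescence is   */
  /* detected, the runtime invokes ImportII() on this chare via the cb. */
  CkStartQD(CkCallback(CkIndex_System::ImportII(), thishandle));
}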
--
Evghenii Gaburov,
e-gaburov AT northwestern.edu