- From: Eric Bohm <ebohm AT illinois.edu>
- To: Alexander Frolov <alexndr.frolov AT gmail.com>
- Cc: charm AT cs.uiuc.edu
- Subject: Re: [charm] Profiling and tuning charm++ applications
- Date: Thu, 23 Jul 2015 13:34:55 -0500
- List-archive: <http://lists.cs.uiuc.edu/pipermail/charm/>
- List-id: CHARM parallel programming system <charm.cs.uiuc.edu>
Hi Alex,

On 07/23/2015 05:29 AM, Alexander Frolov wrote:
> Hello Eric,
>
> On Wed, Jul 22, 2015 at 7:50 PM, Eric Bohm <ebohm AT illinois.edu> wrote:
>> Hello Alex,
>> Charm++ applications can easily reach peak utilization. However, there are a number of factors which may be affecting your performance. The MPI target for Charm++ is one of the simplest to build, but it is unlikely to be the one that gives the best performance. For single-node scalability you will probably experience better performance using a different target. Try multicore-linux64.
>
> I am targeting scaling on an InfiniBand cluster; single-node SMP performance is not that interesting, which is why I am building with mpicc. By the way, does the Charm++ runtime support a combination of multicore and MPI?
Yes. If you add smp to the build line, you will have a version which allows multiple worker threads in a process along with a distinct communication thread. Typical usage would be to indicate the number of worker threads via the +ppn parameter. That number should be chosen to be one fewer than the number of execution threads available to the process, so that the communication thread can use the remaining resource. FYI: best performance on InfiniBand is usually obtained with the verbs-linux-x86_64-smp build.

>> It is difficult to diagnose your specific problem in the abstract; however, the most common cause of poor single-core utilization is overly fine granularity in the simulation decomposition. Experiencing a substantial drop from 1 to 2 cores suggests a load imbalance issue may also be present, but I recommend you examine compute granularity first. A modest increase in work per chare is likely to help. The Projections tool can be used to evaluate the current situation.
>
> Thank you for your suggestion. It is true that my application is very fine-grained. Unfortunately, even a modest increase in granularity requires reimplementing, and even rethinking, the algorithm. But I will try it anyway.
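For concreteness, the builds and a launch line would look roughly like this (a sketch only; the binary name "myapp" and the core counts are hypothetical, so adapt them to your application and node size):

    # single-node shared-memory build
    ./build charm++ multicore-linux64 --with-production

    # InfiniBand verbs build in SMP mode
    ./build charm++ verbs-linux-x86_64 smp --with-production

    # e.g. on a hypothetical 16-core node: 15 worker threads per process,
    # leaving one core free for the communication thread
    ./charmrun +p15 ./myapp +ppn 15 +setcpuaffinity

Linking the application with -tracemode projections will produce the event logs that the Projections tool reads.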
> What I do not understand is the low utilization of the CPU cores, which, as I see it, should not be connected to the Charm++ application (or even the runtime), but should depend only on the time the MPI processes have been running on the CPUs.
If you examine the time profile graph of your performance, it will distinguish between time spent in your entry methods (various colors) and time spent handling message packing/unpacking (black at top). The combination of the two is the overall utilization. If the messaging overhead is a substantial fraction of the total, then you have a granularity problem. If refactoring for coarser granularity is very difficult, you may wish to look into using the TRAM library (see the manual appendices), as it aggregates messages in a way that helps reduce the overhead of processing many tiny messages with tiny execution granularity.

Regarding process switching, you can force affinity by appending the +setcpuaffinity flag, and specifically choose bindings by using the +pemap L[-U[:S[.R]+O]] arguments. See section C.2.2 of the manual (http://charm.cs.illinois.edu/manuals/html/charm++/manual.html) for details.

> I am using mpirun (which is actually a script of the task manager). The custom task manager on the system I use does not support other ways of running applications.
>
> Thank you!
The + arguments are parsed by your application itself, as a consequence of building it with the Charm++ library; they are not parsed by mpirun. So CPU affinity can be set that way even when launching through mpirun.

On 07/22/2015 11:24 AM, Alexander Frolov wrote:
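For example, a launch through the site's mpirun wrapper might look like the following (a sketch only; the binary name, process count, and core numbering are hypothetical and should be adapted to your nodes):

    # the charm++ runtime flags ride along after the binary name;
    # mpirun ignores them and the application parses them at startup
    mpirun -np 4 ./myapp +ppn 15 +setcpuaffinity +pemap 0-14 +commap 15

Here +pemap 0-14 pins the worker threads to cores 0 through 14, and +commap 15 pins the communication thread to core 15.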
- [charm] Profiling and tuning charm++ applications, Alexander Frolov, 07/22/2015
- Re: [charm] Profiling and tuning charm++ applications, Eric Bohm, 07/22/2015
- Re: [charm] Profiling and tuning charm++ applications, Alexander Frolov, 07/23/2015
- Re: [charm] Profiling and tuning charm++ applications, Eric Bohm, 07/23/2015
- Re: [charm] Profiling and tuning charm++ applications, Alexander Frolov, 07/23/2015
- Re: [charm] Profiling and tuning charm++ applications, Kale, Laxmikant V, 07/23/2015
- Re: [charm] Profiling and tuning charm++ applications, Alexander Frolov, 07/23/2015
- Re: [charm] Profiling and tuning charm++ applications, Eric Bohm, 07/23/2015
- Re: [charm] Profiling and tuning charm++ applications, Alexander Frolov, 07/23/2015
- Re: [charm] Profiling and tuning charm++ applications, Eric Bohm, 07/22/2015