Charm++ parallel programming system — mailing list archive (charm AT lists.siebelschool.illinois.edu)
- From: "Kale, Laxmikant V" <kale AT illinois.edu>
- To: "charm AT cs.illinois.edu" <charm AT cs.illinois.edu>, Dan Kokron <dkokron AT gmail.com>
- Subject: Re: [charm] Optimization
- Date: Thu, 27 Aug 2020 20:45:54 +0000
(Eric/Jim and others will answer your specific questions, but a few quick thoughts:)
It may be worthwhile trying 4 processes on a node (a sketch of such a launch follows these suggestions).
Also, is hyperthreading useful? Comparing runs with and without the extra threads is another thing to try.
Looking at Projections traces for a few timesteps will be useful as well.
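For concreteness, a 4-process-per-node (2 per socket) launch could look like the sketch below, adapted from the mpiexec line quoted later in this message. The +ppn/+pemap/+commap split (9 physical cores plus their HT siblings per process, one comm core per process) and the switch to --bind-to none (so that Charm++'s +setcpuaffinity controls the pinning) are assumptions to be checked against the node's CPU numbering, not a tested recipe:

# hypothetical 4-processes-per-node variant; core numbers assume the same
# 0-39 physical / 40-79 hyperthread layout as the original launch line
mpiexec -np $((2 * nsockets)) --map-by ppr:2:socket --bind-to none \
    -x UCX_TLS="rc,xpmem,self" /Linux-x86_64-icc-ucx-smp-xpmem-avx512/namd2 \
    +ppn 18 \
    +pemap 1-9,41-49,11-19,51-59,21-29,61-69,31-39,71-79 \
    +commap 0,10,20,30 \
    +setcpuaffinity +showcpuaffinity restart.namd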
--
Laxmikant Kale                    http://charm.cs.uiuc.edu
Professor, Computer Science       kale AT cs.uiuc.edu
201 N. Goodwin Avenue             Ph: (217) 244-0094
Urbana, IL 61801-2302             FAX: (217) 265-6582
From: Dan Kokron <dkokron AT gmail.com>
Sent: Wednesday, August 26, 2020 10:22 AM
To: charm AT cs.illinois.edu <charm AT cs.illinois.edu>
Subject: [charm] Optimization
I am working with researchers who are running COVID-19 simulations using NAMD. I have performed an extensive search of the parameter space trying to find the best performance for their case. I am asking the question here because the performance of this case depends heavily on communication.
Eric Bohm suggested that UCX+SMP would provide the best scaling, yet that configuration (or my use of it) falls significantly behind native UCX. See attached.
Hardware:
multi-node Xeon (Skylake); each node has two Gold 6148 CPUs (40 hardware cores total, with HT enabled)
nodes are connected with EDR InfiniBand
Software:
NAMD git/master
CHARM++ 6.10.2
HPCX 2.7.0 (OpenMPI + UCX-1.9.0)
Intel 2019.5.281 compiler
CHARM++ for the UCX+SMP build was built with
setenv base_charm_opts "-O3 -ip -g -xCORE-AVX512"
./build charm++ ucx-linux-x86_64 icc ompipmix smp --suffix avx512 --with-production $base_charm_opts --basedir=$HPCX_UCX_DIR --basedir=$HPCX_MPI_DIR -j12
The native UCX build was the same except without the 'smp' option.
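That is, presumably:

./build charm++ ucx-linux-x86_64 icc ompipmix --suffix avx512 --with-production $base_charm_opts --basedir=$HPCX_UCX_DIR --basedir=$HPCX_MPI_DIR -j12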
The UCX+SMP build of NAMD was built with
FLOATOPTS = -ip -O3 -xCORE-AVX512 -qopt-zmm-usage=high -fp-model fast=2 -no-prec-div -qoverride-limits -DNAMD_DISABLE_SSE -qopenmp-simd -DNAMD_AVXTILES
./config Linux-x86_64-icc --charm-arch ucx-linux-x86_64-ompipmix-smp-icc-avx512 --with-fftw3 --fftw-prefix /fftw-3.3.8/install/namd --charm-opts -verbose
Simulation Case:
1,764,532 atoms (see attached output listing)
UCX+SMP launch
mpiexec -np $nsockets --map-by ppr:1:socket --bind-to core -x UCX_TLS="rc,xpmem,self" /Linux-x86_64-icc-ucx-smp-xpmem-avx512/namd2 +ppn 38 +pemap 1-19,21-39,41-59,61-79 +commap 0,20 +setcpuaffinity +showcpuaffinity restart.namd
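For reference, my reading of that launch line (the per-flag breakdown below is my own annotation, not tool output):

# -np $nsockets --map-by ppr:1:socket -> one SMP process per socket, i.e. 2 per node
# +ppn 38                             -> 38 worker threads (PEs) per process, 76 per node
# +pemap 1-19,21-39,41-59,61-79       -> workers pinned to every physical core except 0 and 20,
#                                        plus the matching hyperthread siblings 41-59 and 61-79
# +commap 0,20                        -> the two communication threads pinned to cores 0 and 20
# (cores 40 and 60, the HT siblings of the comm cores, are left idle)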
Would you expect native UCX to outperform UCX+SMP in this scenario?
Can you suggest some ways to improve the performance of my UCX+SMP build?
Dan