charm AT lists.siebelschool.illinois.edu
Subject: Charm++ parallel programming system
RE: [charm] question about charm++ IP address definition as part of cputopology.C
- From: "Choi, Jaemin" <jchoi157 AT illinois.edu>
- To: "pierre.carrier AT hpe.com" <pierre.carrier AT hpe.com>
- Cc: "kim.mcmahon AT hpe.com" <kim.mcmahon AT hpe.com>, "brian.gilmer AT hpe.com" <brian.gilmer AT hpe.com>, "steven.warren AT hpe.com" <steven.warren AT hpe.com>, "Kale, Laxmikant V" <kale AT illinois.edu>, "charm AT lists.cs.illinois.edu" <charm AT lists.cs.illinois.edu>
- Subject: RE: [charm] question about charm++ IP address definition as part of cputopology.C
- Date: Fri, 10 Sep 2021 20:39:26 +0000
- Accept-language: en-US
Hi Pierre,
The Charm++ topology code currently uses socket IP addresses to distinguish between physical nodes. The LrtsInitCpuTopo routine in src/conv-core/cputopology.C obtains the IP address by calling skt_my_ip, which is defined in src/util/sockRoutines.C. It seems to use getifaddrs if that's available, and I don't think it reads anything from SLURM.

In the for loop in cpuTopoHandler, it loops over the IP addresses received from all the PEs and constructs the hostTable, which should have one record per physical node. One way of debugging this would be to print the IP addresses in cpuTopoHandler and check whether some PEs on the same physical node report different IP addresses (which would explain detecting 15 hosts instead of the actual 8).
One other thing: I don't think 'mpi-linux-amd64' is a valid target when building Charm++; I believe it should be 'mpi-linux-x86_64'.
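For reference, an MPI SMP build with that target usually looks along these lines (the options here are illustrative, not the exact recipe from the elided build steps earlier in the thread):

```shell
# Illustrative Charm++ build line; adjust options for your site.
./build charm++ mpi-linux-x86_64 smp --with-production -j8
```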
Best,
Jaemin Choi
PhD Candidate in Computer Science
Research Assistant at the Parallel Programming Laboratory
University of Illinois Urbana-Champaign
From: Kale, Laxmikant V <kale AT illinois.edu>
From: Carrier, Pierre <pierre.carrier AT hpe.com>
... Charm++ is built using these steps:
Building NAMD with UCX doesn't work. On our internal system this recipe works and allows us to run 4 GPUs per node. However, on a customer's machine with a different SLURM configuration, the error shown below appears.
Thank you. Best regards, Pierre
From: Carrier, Pierre
Hi Prof. Kale,
I work at HPE on some NAMD benchmarks, with others (in CC) who are currently trying to resolve the following problem. On one of our systems the number of nodes is detected incorrectly when trying to run with 4 GPUs per node. For example, I get the following output
Charm++> Running in SMP mode: 32 processes, 4 worker threads (PEs) + 1 comm threads per process, 128 PEs total
Charm++> Running on 15 hosts (1 sockets x 64 cores x 2 PUs = 128-way SMP)
...where the SLURM script uses the following syntax:
#SBATCH --nodes=8
...
srun --ntasks=32 --ntasks-per-node=4 --cpu-bind=none \
    ${NAMD_PATH}/namd2 ++ppn 4 +devices 0,1,2,3 \
    ${INPUT_PATH}/chromat100-bench.namd &> namd.log
Using that subdivision, I expect to be running on 8 nodes. Following that error, the output becomes:
FATAL ERROR: Number of devices (4) is not a multiple of number of processes (3). Sharing devices between processes is inefficient. Specify +ignoresharing (each process uses all visible devices) if not all devices are visible to each process, otherwise adjust number of processes to evenly divide number of devices, specify subset of devices with +devices argument (e.g., +devices 0,2), or multiply list shared devices (e.g., +devices 0,1,2,0).
This is just a consequence of the fact that the number of nodes, numNodes, is incorrect.
I traced the error to the variable "topomsg->nodes", which is incorrectly computed, and to hostTable.size(), at the line where printTopology is called:
The comments that I added are the values I’m supposed to have when running on a different system that is configured differently with SLURM but can run correctly.
Could you please direct me to someone who can explain the principles of this part of the Charm++ code, in particular which variables are read from the system (SLURM?) in order to define proc->ip, the nodeIDs, and numNodes?
That part of the Charm++ code was written by:

/** This scheme relies on using IP address to identify physical nodes
 *  written by Gengbin Zheng 9/2008

...but I believe he is now at Intel, if LinkedIn is up to date.
Thank you for your help.
Best regards, Pierre

Pierre Carrier, Ph.D.
Apps & Performance Engineering
(651) 354-3570
- [charm] Fw: question about charm++ IP address definition as part of cputopology.C, Kale, Laxmikant V, 09/09/2021
- RE: [charm] question about charm++ IP address definition as part of cputopology.C, Choi, Jaemin, 09/10/2021