charm AT lists.siebelschool.illinois.edu
Subject: Charm++ parallel programming system
List archive
- From: "Ortega, Bob" <bobo AT mail.smu.edu>
- To: Ronak Buch <rabuch2 AT illinois.edu>
- Cc: "charm AT lists.cs.illinois.edu" <charm AT lists.cs.illinois.edu>
- Subject: Re: [charm] FW: Projections
- Date: Mon, 14 Dec 2020 23:19:22 +0000
Ronak,
Thanks again, and I'm really sorry if you received multiple copies of my previous message.
Yes, we are running an MPI build of Charm++ because our attempts at a UCX build fail. I worked with Mellanox Support and found that the failures are due to the fact that we are using HPC-X (Mellanox) version 2.1; Mellanox has suggested we upgrade to v2.7.
Here are some email messages between me and Aleksey of Mellanox Support:
From: Aleksey Senin <alekseys AT nvidia.com>
Hi Bob, the NAMD compilation is looking for the ucp_put_nb symbol, which doesn't exist in your setup but does exist in newer HPC-X/UCX versions. Updating will resolve this specific issue. But again, NAMD is an application we are not familiar with, and unfortunately we cannot support it or recommend which HPC-X version should be used. Please raise NAMD-related questions with the NAMD vendor.
Regards, Aleksey
From: Ortega, Bob <bobo AT mail.smu.edu>
Agreed, this is not a Mellanox issue, but should I be using HPC-X version 2.2 or higher to resolve the build issue?
Thanks!
From: Aleksey Senin <alekseys AT nvidia.com>
Hi Bob, as the logs show, it is v2.1, and the ucp_put_nb symbol that NAMD is looking for does not exist in this version. I don't see a Mellanox issue here. Please contact the NAMD vendor regarding the recommended version to use. Current and archived versions of the HPC-X toolkit are available here: https://www.mellanox.com/products/hpc-x-toolkit
Let me know if there is anything else that I can assist you with.
Aleksey
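A complementary way to check for the symbol, beyond grepping the headers as done further down in this thread, is to look in the installed UCX library itself. A minimal sketch, assuming the usual HPC-X layout with the UCX libraries under $HPCX_UCX_DIR/lib:
nm -D $HPCX_UCX_DIR/lib/libucp.so | grep ucp_put_nb   # no output means the symbol is absent
ucx_info -v                                           # confirms the bundled UCX release
If the grep prints nothing, the installed UCX predates the API that Charm++'s UCX machine layer expects.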
From: Ortega, Bob <bobo AT mail.smu.edu>
Here are the outputs:
[bobo@login04 charm-6.10.2]$ ofed_info -s
MLNX_OFED_LINUX-4.9-0.1.7.0:

[bobo@login04 charm-6.10.2]$ ucx_info -v
# UCT version=1.3.0 revision 0b45e29
# configured with: --enable-optimizations --disable-logging --disable-debug --disable-assertions --disable-params-check --with-knem=/scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel7-u4-x86-64-MOFED-CHECKER/hpcx_root/hpcx-v2.1.0-gcc-MLNX_OFED_LINUX-4.1-1.0.2.0-redhat7.4-x86_64/knem --prefix=/scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel7-u4-x86-64-MOFED-CHECKER/hpcx_root/hpcx-v2.1.0-gcc-MLNX_OFED_LINUX-4.1-1.0.2.0-redhat7.4-x86_64/ucx

[bobo@login04 charm-6.10.2]$ env |grep HPCX
HPCX_MPI_DIR=/hpc/applications/hpc-x/hpcx-v2.1.0/gcc-9.2/hpcx-ompi
HPCX_DIR=/hpc/applications/hpc-x/hpcx-v2.1.0/gcc-9.2
HPCX_HOME=/hpc/applications/hpc-x/hpcx-v2.1.0/gcc-9.2
HPCX_SHARP_DIR=/hpc/applications/hpc-x/hpcx-v2.1.0/gcc-9.2/sharp
HPCX_IPM_LIB=/hpc/applications/hpc-x/hpcx-v2.1.0/gcc-9.2/hpcx-ompi/tests/ipm-2.0.6/lib/libipm.so
HPCX_HCOLL_DIR=/hpc/applications/hpc-x/hpcx-v2.1.0/gcc-9.2/hcoll
HPCX_MXM_DIR=/hpc/applications/hpc-x/hpcx-v2.1.0/gcc-9.2/mxm
HPCX_UCX_DIR=/hpc/applications/hpc-x/hpcx-v2.1.0/gcc-9.2/ucx
HPCX_IPM_DIR=/hpc/applications/hpc-x/hpcx-v2.1.0/gcc-9.2/hpcx-ompi/tests/ipm-2.0.6
HPCX_OSHMEM_DIR=/hpc/applications/hpc-x/hpcx-v2.1.0/gcc-9.2/hpcx-ompi
HPCX_FCA_DIR=/hpc/applications/hpc-x/hpcx-v2.1.0/gcc-9.2/fca
HPCX_MPI_TESTS_DIR=/hpc/applications/hpc-x/hpcx-v2.1.0/gcc-9.2/hpcx-ompi/tests

[bobo@login04 charm-6.10.2]$ find $HPCX_UCX_DIR -iname ucp_version.h -exec grep -ir 'API.*version' {} \;
 * UCP API version is 1.3
#define UCP_API_VERSION UCP_VERSION(1, 3)
 * UCP API version is 1.3
#define UCP_API_VERSION UCP_VERSION(1, 3)
 * UCP API version is 1.3
#define UCP_API_VERSION UCP_VERSION(1, 3)
 * UCP API version is 1.3
#define UCP_API_VERSION UCP_VERSION(1, 3)
 * UCP API version is 1.3
#define UCP_API_VERSION UCP_VERSION(1, 3)

[bobo@login04 charm-6.10.2]$ find $HPCX_UCX_DIR -name ucp.h -exec egrep "ucp_put_nb\(" {} \;
[bobo@login04 charm-6.10.2]$
From: Aleksey Senin <alekseys AT nvidia.com>
Hi, what is the output of the following commands?
ofed_info -s
ucx_info -v
env |grep HPCX
find $HPCX_UCX_DIR -iname ucp_version.h -exec grep -ir 'API.*version' {} \;
find $HPCX_UCX_DIR -name ucp.h -exec egrep "ucp_put_nb\(" {} \;
Please be sure to have supportadmin AT mellanox.com on the ‘CC’ list.
Many thanks, Aleksey
From: Ortega, Bob <bobo AT mail.smu.edu>
Aleksey,
Thank you so much for the response.
However, I am not sure which version of the HPC-X module I am using. How can I check and confirm that? Yes, hpcx/hpcx_2.1 is on the system, but I am using the default HPC-X module, which is listed simply as hpcx/hpcx. When I issue the command
module whatis hpcx/hpcx
it just comes back with:
hpcx/hpcx : Mellanox HPC-X toolkit.
No version number. Since it is failing, I'll assume I am using a version below 2.7, but then again, there could be other problems.
Thanks! Bob
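One way to pin down which release the default hpcx/hpcx module actually provides, since its whatis text carries no version, is to inspect the paths the module sets rather than its description. A minimal sketch, assuming an Environment Modules/Lmod setup like the one above:
module show hpcx/hpcx      # the prepended paths embed the version string, e.g. hpcx-v2.1.0
echo $HPCX_DIR             # the same information from the environment after loading the module
ucx_info -v                # reports the UCX version and revision bundled with that HPC-X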
From: Aleksey Senin <alekseys AT nvidia.com>
Hello Robert,
Based on the failure log, which comes from NAMD, this is not a Mellanox issue but an incompatible HPC-X/UCX version. The failure says that it cannot find ucp_put_nb:
machine-onesided.C:125:21: error: 'ucp_put_nb' was not declared in this scope; did you mean 'ucp_put_nbi'?
  125 |     statusReq = ucp_put_nb(ep, ncpyOpInfo->srcPtr,
      |                 ^~~~~~~~~~
      |                 ucp_put_nbi
The list of available modules shows you are using HPC-X v2.1, and based on the UCX log, ucp_put_nb was added in HPC-X v2.2. Please use the latest HPC-X v2.7, or check with the NAMD developers which HPC-X/UCX version is compatible and tested.
===
git blame src/ucp/api/ucp.h | grep ucp_put_nb
d0adb154c4 (Sergey Oblomov        2020-02-17 15:57:20 +0200 1323)  * @ref ucp_tag_send_sync_nbx, @ref ucp_tag_recv_nbx, @ref ucp_put_nbx,
706ea2281b (Pavel Shamis (Pasha)  2015-12-22 15:55:54 -0500 3491) ucs_status_t ucp_put_nbi(ucp_ep_h ep, const void *buffer, size_t length,
b634aa8e31 (Nathan Hjelm          2018-02-20 14:26:09 -0700 3537) ucs_status_ptr_t ucp_put_nb(ucp_ep_h ep, const void *buffer, size_t length,
c98497f289 (Sergey Oblomov        2020-06-29 10:04:53 +0300 3592) ucs_status_ptr_t ucp_put_nbx(ucp_ep_h ep, const void *buffer, size_t count,

git describe --contains b634aa8e31
hpcx-v2.2~170^2~1
===
Let me know if there is anything else that I can help you with.
Aleksey
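The git commands in the block above can be reproduced on an upstream UCX checkout to see when a symbol entered the API. A minimal sketch, assuming the public openucx/ucx repository (HPC-X tags such as hpcx-v2.2 live in Mellanox's own tree, so the tag names will differ):
git clone https://github.com/openucx/ucx.git && cd ucx
git log -S ucp_put_nb --oneline -- src/ucp/api/ucp.h   # commits that introduced or touched the symbol
git describe --contains <commit-hash>                  # earliest tag containing a given commit
Substitute a hash from the log output for <commit-hash>.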
From: Ronak Buch <rabuch2 AT illinois.edu>
Hi Bob,
I did in fact receive your message; I'm glad to see that things are working properly.
You seem to be running NAMD in parallel already. I've emboldened the lines in the startup output below that indicate your parallel execution (you're running on two physical nodes, 36 cores in total). One place you can look to get an overview of how Charm++ parallel execution works and the various flags and parameters you can use to customize execution is our manual: https://charm.readthedocs.io/en/latest/charm++/manual.html.
(One thing I should also note is that you are running an MPI build of Charm++, which is generally not recommended unless your platform doesn't support any of the native Charm++ machine layers, as those usually provide higher performance. Assuming you're using SMU's ManeFrame II, you'll probably want to use the UCX machine layer for Charm++.)
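For reference, once an HPC-X new enough to provide ucp_put_nb is loaded (v2.2 or later, per the Mellanox exchange above), a UCX-layer Charm++ build would look roughly like the sketch below. The module name is hypothetical and the options are only the common defaults, not a tested recipe for ManeFrame II:
module load hpcx/hpcx-2.7                      # hypothetical module name for the newer HPC-X
./build charm++ ucx-linux-x86_64 --with-production -j8
# then rebuild NAMD against the resulting ucx-linux-x86_64 tree instead of mpi-linux-x86_64-icc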
Charm++> Running on MPI version: 3.1
Charm++> level of thread support used: MPI_THREAD_SINGLE (desired: MPI_THREAD_SINGLE)
Charm++> Running in non-SMP mode: 36 processes (PEs)
Charm++> Using recursive bisection (scheme 3) for topology aware partitions
Converse/Charm++ Commit ID: v6.10.2-0-g7bf00fa-namd-charm-6.10.2-build-2020-Aug-05-556
Trace: logsize: 10000000
Charm++: Tracemode Projections enabled.
Trace: traceroot: /users/bobo/NAMD/NAMD_2.14_Source/Linux-x86_64-icc/./namd2.prj
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 2 hosts (2 sockets x 9 cores x 1 PUs = 18-way SMP)
Charm++> cpu topology info is gathered in 0.024 seconds.
Info: NAMD 2.14 for Linux-x86_64-MPI
Info:
Info: Please visit http://www.ks.uiuc.edu/Research/namd/
Info: for updates, documentation, and support information.
Info:
Info: Please cite Phillips et al., J. Chem. Phys. 153:044130 (2020) doi:10.1063/5.0014475
Info: in all publications reporting results obtained with NAMD.
Info:
Info: Based on Charm++/Converse 61002 for mpi-linux-x86_64-icc
Info: Built Wed Dec 9 22:01:36 CST 2020 by bobo on login04
Info: 1 NAMD 2.14 Linux-x86_64-MPI 36 v001 bobo
Info: Running on 36 processors, 36 nodes, 2 physical nodes.
Info: CPU topology information available.
Info: Charm++/Converse parallel runtime startup completed at 0.769882 s
Info: 2118.93 MB of memory in use based on /proc/self/stat
Info: Configuration file is stmv/stmv.namd
Info: Changed directory to stmv
TCL: Suspending until startup complete.
Info: SIMULATION PARAMETERS:
Info: TIMESTEP 1
Info: NUMBER OF STEPS 500
Info: STEPS PER CYCLE 20
Info: PERIODIC CELL BASIS 1 216.832 0 0
Info: PERIODIC CELL BASIS 2 0 216.832 0
Thanks, Ronak
On Mon, Dec 14, 2020 at 8:56 AM Ortega, Bob <bobo AT mail.smu.edu> wrote:
- [charm] FW: Projections, Ortega, Bob, 12/10/2020
- Re: [charm] FW: Projections, Ronak Buch, 12/10/2020
- Re: [charm] FW: Projections, Ortega, Bob, 12/10/2020
- Re: [charm] FW: Projections, Ronak Buch, 12/11/2020
- Re: [charm] FW: Projections, Ortega, Bob, 12/14/2020
- Re: [charm] FW: Projections, Ronak Buch, 12/11/2020
- Re: [charm] FW: Projections, Ronak Buch, 12/14/2020
- Re: [charm] FW: Projections, Ortega, Bob, 12/14/2020
- [charm] FW: FW: Projections, Ortega, Bob, 12/15/2020