charm AT lists.siebelschool.illinois.edu
Subject: Charm++ parallel programming system
List archive
- From: Ronak Buch <rabuch2 AT illinois.edu>
- To: "Ortega, Bob" <bobo AT mail.smu.edu>
- Cc: "charm AT lists.cs.illinois.edu" <charm AT lists.cs.illinois.edu>
- Subject: Re: [charm] FW: Projections
- Date: Mon, 14 Dec 2020 16:39:13 -0500
Ronak,
I apologize. I've decided not to include or attach a screenshot from the Projections run, because SYMPA keeps telling me it cannot distribute the message with the screenshot attached.
Sorry if you've received this message multiple times. Usually I get a confirmation from the mailing list server, but this time I realized I had not copied the list, so I just wanted to make sure you received this. Sometimes when I include a graphic it's too large; the earlier attempt was above 400 KB, so I've now resized it to below 400 KB.
***********************************************************************************************
That worked! No warning messages.
We are attempting to confirm that NAMD/Charm++ actually runs in parallel on our system, which has been a long-standing issue. We do have a serial version running, but since we have the ability to run applications in parallel, that is what this latest round of testing is about. This is why I have been seeking tools/resources to better understand what is going on during these runs.
As I mentioned to Nitin, I would really like to understand the output from NAMD better, in particular where it indicates that things are running in parallel (see the note just after the excerpt). The output starts off like this:
Charm++> Running on MPI version: 3.1
Charm++> level of thread support used: MPI_THREAD_SINGLE (desired: MPI_THREAD_SINGLE)
Charm++> Running in non-SMP mode: 36 processes (PEs)
Charm++> Using recursive bisection (scheme 3) for topology aware partitions
Converse/Charm++ Commit ID: v6.10.2-0-g7bf00fa-namd-charm-6.10.2-build-2020-Aug-05-556
Trace: logsize: 10000000
Charm++: Tracemode Projections enabled.
Trace: traceroot: /users/bobo/NAMD/NAMD_2.14_Source/Linux-x86_64-icc/./namd2.prj
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 2 hosts (2 sockets x 9 cores x 1 PUs = 18-way SMP)
Charm++> cpu topology info is gathered in 0.024 seconds.
Info: NAMD 2.14 for Linux-x86_64-MPI
Info:
Info: Please visit http://www.ks.uiuc.edu/Research/namd/
Info: for updates, documentation, and support information.
Info:
Info: Please cite Phillips et al., J. Chem. Phys. 153:044130 (2020) doi:10.1063/5.0014475
Info: in all publications reporting results obtained with NAMD.
Info:
Info: Based on Charm++/Converse 61002 for mpi-linux-x86_64-icc
Info: Built Wed Dec 9 22:01:36 CST 2020 by bobo on login04
Info: 1 NAMD 2.14 Linux-x86_64-MPI 36 v001 bobo
Info: Running on 36 processors, 36 nodes, 2 physical nodes.
Info: CPU topology information available.
Info: Charm++/Converse parallel runtime startup completed at 0.769882 s
Info: 2118.93 MB of memory in use based on /proc/self/stat
Info: Configuration file is stmv/stmv.namd
Info: Changed directory to stmv
TCL: Suspending until startup complete.
Info: SIMULATION PARAMETERS:
Info: TIMESTEP 1
Info: NUMBER OF STEPS 500
Info: STEPS PER CYCLE 20
Info: PERIODIC CELL BASIS 1 216.832 0 0
Info: PERIODIC CELL BASIS 2 0 216.832 0
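For reference, the lines in that excerpt which summarize the parallel layout appear to be "Charm++> Running in non-SMP mode: 36 processes (PEs)" and "Info: Running on 36 processors, 36 nodes, 2 physical nodes." A minimal sketch for pulling those lines out of a run log (assuming the log file name used in the srun command quoted later in this thread):
grep -E "Running in non-SMP mode|Running on .* processors" namd2.prj.fp-gpgpu-3.6.log
# For this run, this should print:
#   Charm++> Running in non-SMP mode: 36 processes (PEs)
#   Info: Running on 36 processors, 36 nodes, 2 physical nodes.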
So, in addition to learning more about Projections, what other tools/apps/resources would you recommend that might help in monitoring/analyzing our attempts at parallelization?
Thanks!
Bob
From: Ronak Buch <rabuch2 AT illinois.edu>
Date: Friday, December 11, 2020 at 2:02 PM
To: "Ortega, Bob" <bobo AT mail.smu.edu>
Cc: "charm AT lists.cs.illinois.edu" <charm AT lists.cs.illinois.edu>, Nitin Bhat <nitin AT hpccharm.com>
Subject: Re: [charm] FW: Projections
Hi Bob,
Your run command should look something like:
date;time srun -n 36 -N 2 -p fp-gpgpu-3 --mem=36GB ./namd2.prj stmv/stmv.namd +logsize 10000000 >namd2.prj.fp-gpgpu-3.6.log;date
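For clarity, here is the same command spread over a few lines, with comments describing each piece (functionally identical to the one-liner above; nothing new is added):
date
# -n 36: launch 36 MPI ranks (Charm++ PEs); -N 2: spread them across 2 nodes
# -p fp-gpgpu-3: Slurm partition; --mem=36GB: memory request
# +logsize 10000000: Charm++ runtime flag read at startup; it enlarges the
#   in-memory Projections event log so it flushes to disk less often
time srun -n 36 -N 2 -p fp-gpgpu-3 --mem=36GB \
    ./namd2.prj stmv/stmv.namd +logsize 10000000 \
    > namd2.prj.fp-gpgpu-3.6.log
date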
Thanks,
Ronak
On Thu, Dec 10, 2020 at 3:31 PM Ortega, Bob <bobo AT mail.smu.edu> wrote:
Ronak,
Thank you for the quick reply.
Well, I’m using srun to run NAMD. Here’s the command,
date;time srun -n 36 -N 2 -p fp-gpgpu-3 --mem=36GB ./namd2.prj stmv/stmv.namd >namd2.prj.fp-gpgpu-3.6.log;date
How can I submit a similar charmrun command targeting 36 processors, 2 nodes, the fp-gpgpu-3 queue partition, 36 GB of memory, and a +logsize of 10000000?
Oh, I’m not getting the exception anymore and unfortunately, during that run, I didn’t log the results to a file.
If it occurs again, I’ll forward the log file.
Thanks,
Bob
From: Ronak Buch <rabuch2 AT illinois.edu>
Date: Thursday, December 10, 2020 at 2:09 PM
To: "Ortega, Bob" <bobo AT mail.smu.edu>
Cc: "charm AT lists.cs.illinois.edu" <charm AT lists.cs.illinois.edu>, Nitin Bhat <nitin AT hpccharm.com>
Subject: Re: [charm] FW: Projections
Hi Bob,
Regarding the +logsize parameter: it is a runtime parameter, not a compile-time parameter, so you shouldn't add it to the Makefile; you should add it to your run command (e.g. ./charmrun +p2 ./namd <namd input file name> +logsize 10000000).
Regarding the exception you're seeing, I'm not sure why that's happening; it's likely due to some issue in initialization. Would it be possible for you to share the generated logs for debugging?
Thanks,
Ronak
On Thu, Dec 10, 2020 at 12:36 PM Ortega, Bob <bobo AT mail.smu.edu> wrote:
From: "Ortega, Bob" <bobo AT mail.smu.edu>
Date: Thursday, December 10, 2020 at 11:24 AM
To: "charm AT lists.cs.illinois.edu" <charm AT lists.cs.illinois.edu>
Cc: Nitin Bhat <nitin AT hpccharm.com>
Subject: FW: Projections
Nitin Bhat was kind enough to review my questions about some errors and messages I am receiving while running NAMD/Charm++ with Projections enabled.
I am including the email messages I sent to Nitin about these issues. Please let me know how I might resolve them, and point me to any references that may help clarify proper use of Projections so I can take further advantage of its capabilities.
Thanks,
Bob
From: Nitin Bhat <nitin AT hpccharm.com>
Date: Thursday, December 10, 2020 at 10:55 AM
To: "Ortega, Bob" <bobo AT mail.smu.edu>
Subject: Re: Projections
Hi Bob,
I am just now reading your latest emails about the issues you're seeing with Projections.
Could you reach out to the Charm mailing list (charm AT lists.cs.illinois.edu) with both of the issues you're seeing (this one and the previous Java exception you saw when you launched Projections)? The folks who work with (and develop) Projections will be able to better address them.
Thanks,
Nitin
On Dec 10, 2020, at 8:52 AM, Ortega, Bob <bobo AT mail.smu.edu> wrote:
Nitin,
Thanks again for your support. I’m now trying to find out how to use the following runtime option,
+logsize NUM
Because when I run the namd2.prj binary, I get this message at the end of the output:
*************************************************************
Warning: Projections log flushed to disk 101 times on 36 cores: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35.
Warning: The performance data is likely invalid, unless the flushes have been explicitly synchronized by your program.
Warning: This may be fixed by specifying a larger +logsize (current value 1000000).
I thought that perhaps this was supposed to go in the Makefile under the Projections section, so I added this line there:
+logsize 10000000
But I am still getting the warning message.
Thanks,
Bob
Nitin,
As noted in an earlier email, I was able to run Projections successfully on traces generated by a run with 18 processors and 1 node. But when I tried with 180 processors and 10 nodes, I got the following error when trying to run Projections:
Do you know what could be the problem here?
Thanks,
Bob