charm AT lists.siebelschool.illinois.edu
Subject: Charm++ parallel programming system
List archive
- From: Evan Ramos <evan AT hpccharm.com>
- To: Jakub Homola <jakub.homola AT vsb.cz>, charm <charm AT lists.cs.illinois.edu>
- Subject: Re: [charm] errors when running on multiple physical nodes
- Date: Fri, 25 Oct 2019 12:36:44 -0500
Hello,
So after hours of debugging, I found two bugs in my code. First, I used PUPbytes on a structure that contained pointers, which produced invalid pointers when the structure was copied between nodes. Second, in one place I assumed that messages always arrive in order; they did not, so chares ended up working with not-yet-initialized fields.
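A minimal sketch of the PUPbytes bug, written in plain C++ rather than against the Charm++ API: the `Particles` struct and the `packBytes`, `packDeep`, and `unpackDeep` helpers are hypothetical names introduced only for illustration. It assumes that a bytewise copy transfers the pointer value itself, while a deep copy transfers the data the pointer refers to. In Charm++ the corresponding fix would usually be a `pup` routine that serializes the pointed-to data instead of declaring the type with PUPbytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical structure (not from the original program): it refers to a heap buffer.
struct Particles {
    std::size_t count;
    double* values;   // pointer member: only meaningful in the sending process
};

// Bytewise packing: copies `count` and the pointer VALUE, not the doubles behind it.
// On another node the unpacked pointer would refer to memory that does not exist there.
std::vector<char> packBytes(const Particles& p) {
    std::vector<char> buf(sizeof(Particles));
    std::memcpy(buf.data(), &p, sizeof(Particles));
    return buf;
}

// Deep packing: writes the element count followed by the pointed-to data.
// Assumes p.count > 0 and p.values is valid.
std::vector<char> packDeep(const Particles& p) {
    std::vector<char> buf(sizeof(std::size_t) + p.count * sizeof(double));
    std::memcpy(buf.data(), &p.count, sizeof(std::size_t));
    std::memcpy(buf.data() + sizeof(std::size_t), p.values,
                p.count * sizeof(double));
    return buf;
}

// Deep unpacking: allocates fresh storage on the receiving side and fills it.
// The caller owns the returned buffer and must release it with delete[].
Particles unpackDeep(const std::vector<char>& buf) {
    Particles p;
    std::memcpy(&p.count, buf.data(), sizeof(std::size_t));
    p.values = new double[p.count];
    std::memcpy(p.values, buf.data() + sizeof(std::size_t),
                p.count * sizeof(double));
    return p;
}
```

Within a single process the bytewise copy appears to work, because the copied pointer still refers to valid memory. This is consistent with the errors showing up only when running on multiple physical nodes.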
Anyway, thank you for your help.
Jakub Homola
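The message-ordering bug described in this message can be sketched in the same way. This is plain C++ under stated assumptions, not Charm++ code: the `Worker` class and its `init` and `data` methods are hypothetical stand-ins for a chare and two of its entry methods. The idea is to buffer any data that arrives before initialization rather than assume delivery order. Charm++'s Structured Dagger notation is another way to express this kind of dependency.

```cpp
#include <cassert>
#include <vector>

// Hypothetical receiver standing in for a chare; all names are illustrative.
class Worker {
public:
    // "init" message: sets up state, then processes anything that arrived early.
    void init(int scale) {
        scale_ = scale;
        initialized_ = true;
        for (int v : pending_) process(v);
        pending_.clear();
    }

    // "data" message: may be delivered before init, so it is buffered if needed.
    void data(int value) {
        if (!initialized_) {
            pending_.push_back(value);
            return;
        }
        process(value);
    }

    int total() const { return total_; }

private:
    // Uses scale_, so it must only run after init() has been called.
    void process(int value) { total_ += scale_ * value; }

    bool initialized_ = false;
    int scale_ = 0;
    int total_ = 0;
    std::vector<int> pending_;
};
```

With this structure, calling `data` before `init` leaves the total untouched until `init` runs, after which the buffered values are processed with the correct `scale_`.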
From: Evan Ramos
Sent: Wednesday, October 23, 2019 00:56
To: Jakub Homola
Cc: charm
Subject: Re: [charm] errors when running on multiple physical nodes
Hi Jakub,
I believe this example is a red herring because `-memory paranoid` is currently unsupported in SMP mode. Inspection of the crash in a debugger indicates something is going wrong inside the paranoid mode's data structures. I don't observe this issue in a non-SMP build.
--
Evan A. Ramos
Software Engineer
Charmworks, Inc.
On Tue, Oct 22, 2019 at 5:34 PM Jakub Homola <jakub.homola AT vsb.cz> wrote:
Hello,
Thanks for the answer.
So I tried a couple of things around that and found out that there actually is some memory corruption. However, it seems to be in Charm++-generated code: a simple hello world program produces a heap corruption. I am attaching the program and the resulting error message.
I compiled the Charm++ library with `./build charm++ netlrts-linux-x86_64 icc smp -j`, compiled the simple hello world program with `/path/to/charmc Hello.ci` followed by `/path/to/charmc -g -memory paranoid *.cpp -o Hello.x`, and ran it with `./Hello.x`. I also tried running it with `./charmrun ./Hello.x ++local`, but the same error occurred.
The error happened before anything in the mainchare constructor started executing. The same problem occurred on my local Ubuntu virtual machine as well as on a Salomon cluster node running CentOS.
I think this could be related to the original problem I had, and I would appreciate any help.
Thank you,
Jakub Homola
From: Evan Ramos
Sent: Monday, October 21, 2019 21:04
To: Jakub Homola
Cc: charm AT lists.cs.illinois.edu
Subject: Re: [charm] errors when running on multiple physical nodes
Hi Jakub,
Judging by the stack trace in your second error message, it is likely that the problem is somewhere in your code. I would highly recommend becoming familiar with the ++debug and ++debug-no-pause options, as they will allow you to investigate the issue directly using GDB. You may also want to rebuild Charm++ without the `--with-production` option to enable error checking in the runtime. This set of instructions may help you set up X forwarding: https://uisapp2.iu.edu/confluence-prd/pages/viewpage.action?pageId=280461906
Regards,
--
Evan A. Ramos
Software Engineer
Charmworks, Inc.