charm AT lists.siebelschool.illinois.edu
Subject: Charm++ parallel programming system
List archive
- From: "Mikida, Eric P" <mikida2 AT illinois.edu>
- To: "charm AT lists.cs.illinois.edu" <charm AT lists.cs.illinois.edu>, Jozsef Bakosi <jbakosi AT lanl.gov>
- Subject: Re: [charm] Restart from checkpoint files
- Date: Fri, 21 Jun 2019 19:06:24 +0000
Hey Jozsef,
Are you also compiling your application with -g? It looks like the only debug symbols showing up in the trace are from within the runtime system itself.
If this error is coming from your application, it is most likely caused by a message being deleted twice. This can happen either when a message pointer is shared across multiple chares and more than one of them tries to delete it, or when an entry method is marked [nokeep] but you still delete the message yourself.
Eric
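To make the two failure modes Eric describes concrete, here is a small standalone C++ model of a reference-counted message buffer. It is a sketch, not Charm++ code: `ToyMsg`, `toyFree`, and `toyDeliver` are hypothetical names, and the real bookkeeping inside `CkFreeMsg`/`CmiFree` differs. The only assumption carried over from the thread is that the runtime frees a [nokeep] message itself after the entry method returns, so a second free by user code trips the same kind of "reference count was zero" check.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical, simplified model of a reference-counted message buffer.
// NOT the Charm++ implementation; it only mimics the idea behind
// "CmiFree reference count was zero-- is this a duplicate free?".
struct ToyMsg {
  int refCount = 1;     // a freshly created message has one owner
  std::string freedBy;  // hypothetical: remember who performed the final free
};

// Drop one reference. Freeing a message whose count is already zero is
// reported as a duplicate free, naming both the first and second freer.
void toyFree(ToyMsg& m, const std::string& who) {
  if (m.refCount == 0) {
    throw std::runtime_error("duplicate free by " + who +
                             " (first freed by " + m.freedBy + ")");
  }
  if (--m.refCount == 0) m.freedBy = who;
}

// Model of delivering a message to an entry method. When nokeep is true,
// the "runtime" frees the message after the handler returns, so the
// handler itself must not free it.
template <typename Handler>
void toyDeliver(ToyMsg& m, bool nokeep, Handler handler) {
  handler(m);
  if (nokeep) toyFree(m, "runtime (nokeep)");
}
```

In this model, two chares calling `toyFree` on the same shared `ToyMsg` fails on the second call, and a [nokeep] handler that calls `toyFree` itself fails when `toyDeliver` performs the runtime's free. Recording who freed the buffer first is one way such a check could point back at the offending object; whether the real runtime exposes anything similar is not established here.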
From: Jozsef Bakosi <jbakosi AT lanl.gov>
Sent: Saturday, June 15, 2019 7:58:19 AM
To: charm AT lists.cs.illinois.edu
Subject: [charm] Restart from checkpoint files
Hi folks,
Restarting from checkpoint files I'm getting:
[0]CkRestartMain done. sending out callback.
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: CmiFree reference count was zero-- is this a duplicate free?
[0] Stack Traceback:
[0:0] [0x76464e]
[0:1] [0x761898]
[0:2] [0x768b04]
[0:3] CkFreeMsg+0x28 [0x60a398]
[0:4] [0x5dab08]
[0:5] [0x76430a]
[0:6] [0x763c08]
[0:7] [0x5dc45e]
[0:8] [0x5d4a12]
[0:9] __libc_start_main+0xeb [0x7fffec65309b]
[0:10] [0x4d790a]
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI COMMUNICATOR 3 DUP
FROM 0 with errorcode 1
I tried the link option -memory paranoid and the runtime option ++debug,
as well as building Charm++ in debug mode, but got no additional information:
build charm++ mpi-linux-x86_64 --enable-error-checking --with-prio-type=int --enable-randomized-msgq --suffix randq-debug --build-shared -j36 -w -stdlib=libc++ -g
How can I get more information about which chare/group triggers the error
when restarting from a checkpoint?
Thanks,
Jozsef
- [charm] Restart from checkpoint files, Jozsef Bakosi, 06/15/2019
- Re: [charm] Restart from checkpoint files, Mikida, Eric P, 06/21/2019
- Re: [charm] Restart from checkpoint files, Jozsef Bakosi, 06/24/2019
- Re: [charm] Restart from checkpoint files, Eric Mikida, 06/24/2019
- Re: [charm] Restart from checkpoint files, Jozsef Bakosi, 06/26/2019
- Re: [charm] Restart from checkpoint files, Eric Mikida, 06/24/2019
- Re: [charm] Restart from checkpoint files, Jozsef Bakosi, 06/24/2019
- Re: [charm] Restart from checkpoint files, Mikida, Eric P, 06/21/2019