charm AT lists.siebelschool.illinois.edu
Subject: Charm++ parallel programming system
List archive
- From: Fouzhan Hosseini <F.Hosseini AT leeds.ac.uk>
- To: "Vipul Harsh, -" <vharsh2 AT illinois.edu>
- Cc: "charm AT cs.uiuc.edu" <charm AT cs.uiuc.edu>
- Subject: Re: [charm] code crash when run with migration based LB - charm++ 6.7.1
- Date: Wed, 12 Oct 2016 17:14:26 +0000
Hi Vipul,
Thank you for your reply. I create new array sections because I need new sections with different elements.
Thanks for referring me to the example at tests/charm++/delegation/multicast. What I am doing in my code is quite different, and I am afraid I still have a problem with placing CkSectionInfo as a chare data member.
I will send you my code shortly.
Thanks,
Fouzhan
Sent: Monday, October 10, 2016 10:30:48 PM
To: Fouzhan Hosseini
Cc: charm AT cs.uiuc.edu
Subject: RE: code crash when run with migration based LB - charm++ 6.7.1
Sent: 06 October 2016 06:31:45
To: Vipul Harsh, -
Cc: charm AT cs.uiuc.edu
Subject: Re: [charm] code crash when run with migration based LB - charm++ 6.7.1
Hi Vipul,
Thanks a lot for your reply. I cannot pup CkSectionInfo objects. As I mentioned in my first email, I've defined CkSectionInfo objects in the local scope of the entry methods that contribute to a section reduction, and they are updated by calling CkGetSectionInfo(). The Charm++ manual, sec. 14.3, says the cookie should not be used as a one-time local variable. However, if I define CkSectionInfo objects as data members of chare elements, the program does not finish (I guess something goes wrong in the underlying message passing). Moving the definition of the CkSectionInfo objects to the local scope of functions seems to work, and the program always finishes successfully as long as I do not use migration-based LB.
I was confused by the Charm++ manual's suggestion. In each iteration, I define new array sections, broadcast/multicast to each section, and then each section contributes to only one reduction operation. I guess the CkSectionInfo objects must be defined locally, since a new one is needed from one iteration to the next. I do not know many implementation details, and obviously I could be wrong! I can put together an example code that reproduces the infinite run, if anybody is happy to look at it.
I still get the "corrupted double-linked list" error when using LB.
Thanks,
Fouzhan
Sent: Wednesday, October 5, 2016 6:37:36 PM
To: Fouzhan Hosseini
Cc: charm AT cs.uiuc.edu
Subject: RE: code crash when run with migration based LB - charm++ 6.7.1
Sent: 03 October 2016 18:36:46
To: charm AT cs.uiuc.edu
Subject: [charm] code crash when run with migration based LB - charm++ 6.7.1
Dear All,
I have written a Charm++ program that works fine on both a multi-core machine and a cluster. However, when the program is linked and executed with one of the available migration-based load balancing strategies (e.g. +balancer GreedyLB), it usually crashes with the error message "corrupted double-linked list.." or "seg fault". I have been trying to track down the problem, but I am not sure where it is coming from. I have a few questions.
I am new to the Charm++ community; I hope this is the right place to raise questions, ask for help, and report bugs.
1) There are two chare arrays in my code, and a PUP method is implemented for both. I only have simple entry methods (no threaded or sync methods), but I make heavy use of Structured Dagger (SDAG) to express coordination between entry methods (for, if and when statements, and matching on reference numbers). "__sdag_pup(p);" is added in the PUP methods. Is there anything else I am supposed to add to my code to be able to use migration-based LB?
2) I am using the CkMulticast library with array sections and section reductions. Each array section contributes to only one reduction, so I've defined a local variable of type CkSectionInfo in the relevant chare member functions, which is updated by calling "CkGetSectionInfo()". I do not quite understand how CkSectionInfo objects are updated in the CkMulticast library in case of migration, so I wondered whether this could cause a problem.
3) There is an entry method called Merger(), which is written in SDAG. In this method there is a when statement waiting on another entry method called RecvBSlabSet1(). RecvBSlabSet1() is called when a section reduction on the other array completes. These two entry methods are often mentioned in the error message stack trace. I am including the stack trace in case it is useful. Both of these entry methods belong to a chare array called JointContourNet.
======= Backtrace: =========
/lib64/libc.so.6(+0x7b184)[0x7ffff6957184]
/lib64/libc.so.6(+0x7d235)[0x7ffff6959235]
JCN(_ZN4SDAG10MsgClosureD0Ev+0x24)[0x60e0b4]
JCN(_ZN4SDAG6BufferD0Ev+0x46)[0x60e126]
JCN(_ZN15JointContourNet7_when_0EPN23Closure_JointContourNet16Merger_4_closureEi+0x2bc)[0x4c82ac]
JCN(_ZN15JointContourNet13RecvBSlabSet1EP14CkReductionMsg+0x188)[0x4c8af8]
JCN(CkDeliverMessageFree+0x22)[0x530652]
JCN(_ZN14CkLocRec_local11invokeEntryEP12CkMigratablePvib+0x240)[0x54a570]
JCN(_ZN14CkLocRec_local7deliverEP14CkArrayMessage11CkDeliver_ti+0x314)[0x54b504]
JCN(_ZN8CkLocMgr7deliverEP9CkMessage11CkDeliver_ti+0xec)[0x546fdc]
JCN(_Z15_processHandlerPvP11CkCoreState+0x437)[0x537327]
JCN(CsdScheduleForever+0x48)[0x5f9ff8]
JCN(CsdScheduler+0x2d)[0x5fa28d]
JCN(ConverseInit+0x3ea)[0x5f8f6a]
JCN(main+0x2c)[0x4bcd5c]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7ffff68fdb15]
Regards,
Fouzhan
- [charm] code crash when run with migration based LB - charm++ 6.7.1, Fouzhan Hosseini, 10/03/2016
- RE: [charm] code crash when run with migration based LB - charm++ 6.7.1, Vipul Harsh, -, 10/05/2016
- Re: [charm] code crash when run with migration based LB - charm++ 6.7.1, Fouzhan Hosseini, 10/06/2016
- RE: [charm] code crash when run with migration based LB - charm++ 6.7.1, Vipul Harsh, -, 10/10/2016
- Re: [charm] code crash when run with migration based LB - charm++ 6.7.1, Fouzhan Hosseini, 10/12/2016