charm AT lists.siebelschool.illinois.edu
Subject: Charm++ parallel programming system
Re: [charm] Errors when running charm++ v6.6 with obverts for the qlogic infiniband interface.
- From: Abhishek Gupta <gupta59 AT illinois.edu>
- To: "Low, John J." <jlow AT mcs.anl.gov>
- Cc: "charm AT cs.uiuc.edu" <charm AT cs.uiuc.edu>
- Subject: Re: [charm] Errors when running charm++ v6.6 with obverts for the qlogic infiniband interface.
- Date: Wed, 7 May 2014 17:00:27 -0500
- List-archive: <http://lists.cs.uiuc.edu/pipermail/charm/>
- List-id: CHARM parallel programming system <charm.cs.uiuc.edu>
Hello,
Can you please test by applying the attached patches? Apply qlogic.patch
first; if that does not work, please apply qlogic2.patch on top of it.
Abhishek
On Wed, May 7, 2014 at 9:04 AM, Low, John J. <jlow AT mcs.anl.gov> wrote:
Charm++ developers,
I have made several attempts to build charm++ on a Xeon-based cluster with a QLogic QDR InfiniBand network. I built charm++ with the following command:
"./build charm++ net-linux-x86_64 ibverbs icc --with-production"
When I test this build with the hello command I get the following errors.
************************************************************************
Charmrun> IBVERBS version of charmrun
Charmrun> started all node programs in 0.104 seconds.
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: Failed to change qp state to RTS: you may need some device-specific parameters in machine-ibevrbs
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: Failed to change qp state to RTS: you may need some device-specific parameters in machine-ibevrbs
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: Failed to change qp state to RTS: you may need some device-specific parameters in machine-ibevrbs
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: Failed to change qp state to RTS: you may need some device-specific parameters in machine-ibevrbs
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: Failed to change qp state to RTS: you may need some device-specific parameters in machine-ibevrbs
[0] Stack Traceback:
[0:0] CmiAbort+0x4c [0x51d92c]
[0:1] initInfiOtherNodeData+0x180 [0x51d100]
[0:2] [0x511d48]
[0:3] ConverseInit+0x13a6 [0x51ab66]
[0:4] main+0x57 [0x46e567]
[0:5] __libc_start_main+0xfd [0x34cc41ed1d]
[0:6] [0x46a199]
[0] Stack Traceback:
[0:0] CmiAbort+0x4c [0x51d92c]
[0:1] initInfiOtherNodeData+0x180 [0x51d100]
[0:2] [0x511d48]
[0:3] ConverseInit+0x13a6 [0x51ab66]
[0:4] main+0x57 [0x46e567]
[0:5] __libc_start_main+0xfd [0x34cc41ed1d]
[0:6] [0x46a199]
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: Failed to change qp state to RTS: you may need some device-specific parameters in machine-ibevrbs
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: Failed to change qp state to RTS: you may need some device-specific parameters in machine-ibevrbs
Fatal error on PE 0> Failed to change qp state to RTS: you may need some device-specific parameters in machine-ibevrbs
************************************************************************
I am using version 13.1.3 of the Intel compilers.
Any suggestions on how to build a working ibverbs version of Charm++ for the QLogic PSM interface would be helpful. We find that the apoa1 benchmark for namd2.9 with charm++ over mvapich2 does not scale past a few hundred cores on this machine. We would like to see good scaling up to a few thousand cores for NAMD. I think having a version of charm++ with ibverbs would help.
Thanks,
John J. Low
Principal Computational Science Specialist
Computing, Environment and Life Sciences
Building 240, 2143
9700 South Cass Avenue
Argonne National Laboratory
Argonne, IL 60439.
630-252-0045
www.linkedin.com/pub/john-low/15/8b0/5aa/
_______________________________________________
charm mailing list
charm AT cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/charm
From 08a5ab57e10f115eff65846298deaa2d9eee8eb3 Mon Sep 17 00:00:00 2001
From: Abhishek <gupta59 AT illinois.edu>
Date: Wed, 7 May 2014 16:51:45 -0500
Subject: [PATCH] Ibverbs - Qlogic bug fix

---
 src/arch/net/machine-ibverbs.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/arch/net/machine-ibverbs.c b/src/arch/net/machine-ibverbs.c
index 2d72cc6..bba6a8a 100644
--- a/src/arch/net/machine-ibverbs.c
+++ b/src/arch/net/machine-ibverbs.c
@@ -897,8 +897,8 @@ struct infiOtherNodeData *initInfiOtherNodeData(int node,int addr[3]){
     // Error code 22 means that there was an invalid parameter when calling to this verbs, try with qlogic-specific parameters
     if (err==22) {
-      attr.timeout = 26;
-      attr.retry_cnt = 20;
+      attr.timeout = 14;
+      attr.retry_cnt = 7;
       MACHSTATE3(3,"Retry:dlid 0x%x qp 0x%x psn 0x%x",attr.ah_attr.dlid,attr.dest_qp_num,attr.sq_psn);
-- 
1.7.1
From f4a68b340d2e710622e4ff943264c11c1065ccb7 Mon Sep 17 00:00:00 2001
From: Abhishek <gupta59 AT illinois.edu>
Date: Wed, 7 May 2014 16:54:30 -0500
Subject: [PATCH] Qlogic specific define for verbs

---
 src/arch/net/machine-ibverbs.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/src/arch/net/machine-ibverbs.c b/src/arch/net/machine-ibverbs.c
index bba6a8a..cab43f2 100644
--- a/src/arch/net/machine-ibverbs.c
+++ b/src/arch/net/machine-ibverbs.c
@@ -31,7 +31,7 @@
 #include <infiniband/verbs.h>
-//#define QLOGIC
+#define QLOGIC
 #ifndef QLOGIC
 enum ibv_mtu mtu = IBV_MTU_2048;
 #else
-- 
1.7.1
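A note on the values in the first patch, offered as a sketch rather than a diagnosis. As far as I understand the ibv_modify_qp documentation, attr.timeout is a 5-bit field that encodes a local ACK timeout of 4.096 us * 2^timeout, and attr.retry_cnt is a 3-bit field, so its largest valid value is 7. Under that reading, the old retry_cnt of 20 does not fit in the field, which may be why the retry path also failed. The helper names below are hypothetical and are not part of Charm++ or libibverbs; they only do the arithmetic.

```c
#include <assert.h>
#include <stdio.h>

/* Field widths as described in the ibv_modify_qp documentation
 * (assumption: timeout is 5 bits, retry_cnt is 3 bits). */
#define QP_TIMEOUT_MAX   31u
#define QP_RETRY_CNT_MAX 7u

/* Local ACK timeout in microseconds: 4.096 us * 2^timeout.
 * A timeout of 0 means "wait forever"; we return 0.0 as a sentinel. */
double local_ack_timeout_usec(unsigned timeout)
{
    assert(timeout <= QP_TIMEOUT_MAX);
    if (timeout == 0)
        return 0.0;
    return 4.096 * (double)(1ULL << timeout);
}

/* Returns 1 if retry_cnt fits in the 3-bit field, 0 otherwise. */
int retry_cnt_in_range(unsigned retry_cnt)
{
    return retry_cnt <= QP_RETRY_CNT_MAX;
}

/* Hypothetical helper: print what a (timeout, retry_cnt) pair means. */
void describe_qp_retry_params(unsigned timeout, unsigned retry_cnt)
{
    printf("timeout=%u -> %.3f ms per attempt, retry_cnt=%u (%s)\n",
           timeout,
           local_ack_timeout_usec(timeout) / 1000.0,
           retry_cnt,
           retry_cnt_in_range(retry_cnt) ? "fits in 3 bits"
                                         : "does not fit in 3 bits");
}
```

Calling describe_qp_retry_params(26, 20) and describe_qp_retry_params(14, 7) compares the old and new settings: by this formula the per-attempt timeout drops from roughly 275 seconds to roughly 67 milliseconds, and the retry count moves back into range.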
- [charm] Errors when running charm++ v6.6 with obverts for the qlogic infiniband interface., Low, John J., 05/07/2014
- Re: [charm] Errors when running charm++ v6.6 with obverts for the qlogic infiniband interface., Abhishek Gupta, 05/07/2014