- From: Lukasz Flis <l.flis AT cyf-kr.edu.pl>
- To: "Undisclosed.Recipients":
- Subject: [charm] charm 6.2.2 and ibverbs
- Date: Tue, 23 Nov 2010 01:54:05 +0100
- List-archive: <http://lists.cs.uiuc.edu/pipermail/charm>
- List-id: CHARM parallel programming system <charm.cs.uiuc.edu>
- Organization: ACC Cyfronet
Dear Charm++ users and developers,
While testing our new cluster with QLogic-based InfiniBand, we
encountered a problem with NAMD (based on charm 6.2.2).
An attempt to run the ibverbs-based version ends with the following error:
...
Charmrun> node programs all started
Charmrun remote shell(n12-1-2.local.0)> remote responding...
Charmrun remote shell(n12-1-2.local.1)> remote responding...
Charmrun remote shell(n12-1-2.local.0)> starting node-program...
Charmrun remote shell(n12-1-2.local.0)> rsh phase successful.
Charmrun remote shell(n12-1-2.local.1)> starting node-program...
Charmrun remote shell(n12-1-2.local.1)> rsh phase successful.
Charmrun> Waiting for 0-th client to connect.
Charmrun> Waiting for 1-th client to connect.
Charmrun> All clients connected.
Charmrun> IP tables sent.
Charmrun> node programs all connected
Charmrun: error on request socket--
Socket closed before recv.
Using strace, we obtained additional information about the problem:
write(11, "\32\0\0\0\36\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\5\0\0\0"..., 120) = -1 EINVAL (Invalid argument)
write(2, "Failed to modify QP to RTS\n", 27) = 27
 | 00000  46 61 69 6c 65 64 20 74  6f 20 6d 6f 64 69 66 79  Failed t o modify |
 | 00010  20 51 50 20 74 6f 20 52  54 53 0a                 QP to R TS.       |
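As a quick sanity check on the hex dump above, the bytes can be decoded back to text. This is only a minimal sketch: the hex values are copied from the two dump rows, and the helper name `decode_dump` is made up for illustration. It is not part of strace or Charm++.

```python
# Hex bytes copied from the two dump rows of the strace output above.
DUMP_ROWS = [
    "46 61 69 6c 65 64 20 74 6f 20 6d 6f 64 69 66 79",
    "20 51 50 20 74 6f 20 52 54 53 0a",
]


def decode_dump(rows):
    """Join hex-dump rows and decode them as ASCII text.

    Hypothetical helper for illustration only.
    """
    return bytes.fromhex(" ".join(rows)).decode("ascii")


message = decode_dump(DUMP_ROWS)
print(repr(message))
```

The decoded text matches the 27-byte string written to stderr, so the dump adds nothing beyond the message itself. The actual failure is the preceding write() on descriptor 11 that returns EINVAL.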
We confirmed that the problem is Charm++ related by running the basic megatest
(pgm) from the charm++ suite. The error was exactly the same.
The OpenMPI (1.4.3) version of charm using IBVerbs did not report any problems.
Additional information:
IB Stack OFED 1.5.2
libibverbs-1.1.4-0.14.gb6c138b
libipathverbs-1.2-2.el5
IB cards: InfiniBand: QLogic Corp. IBA7322 QDR InfiniBand HCA (rev 01)
The same version of charm works properly on Mellanox HCAs with OFED 1.4.2.
I am not sure whether this is a charm problem or IB related. Any comments and
ideas on how to debug the problem are welcome.
Best Regards
--
Lukasz Flis