charm AT lists.siebelschool.illinois.edu
Subject: Charm++ parallel programming system
[charm] Problem running ibverbs version of NAMD on nodes with both IB and 10 GbE
- From: Stephen Cousins <steve.cousins AT maine.edu>
- To: charm AT cs.uiuc.edu
- Subject: [charm] Problem running ibverbs version of NAMD on nodes with both IB and 10 GbE
- Date: Mon, 04 Jun 2012 17:24:02 -0000
- List-archive: <http://lists.cs.uiuc.edu/pipermail/charm>
- List-id: CHARM parallel programming system <charm.cs.uiuc.edu>
Hi,
We have 16 nodes with both IB and 10 GbE interfaces (both interfaces are Mellanox). We also have 16 nodes that have just IB. I can run NAMD on the IB-only nodes just fine; however, if the job is allocated a node that has both IB and 10 GbE, it fails at startup.
charmrun output is:
Charmrun> IBVERBS version of charmrun
[0] Stack Traceback:
[0:0] CmiAbort+0x5c [0xcbef1e]
[0:1] initInfiOtherNodeData+0x14a [0xcbe488]
[0:2] /opt/scyld/NAMD_2.9_Linux-x86_64-ibverbs/namd2 [0xcbcf8b]
[0:3] /opt/scyld/NAMD_2.9_Linux-x86_64-ibverbs/namd2 [0xcbd9e5]
[0:4] ConverseInit+0x1cd [0xcbe001]
[0:5] _ZN7BackEnd4initEiPPc+0x6f [0x58ad13]
[0:6] main+0x2f [0x585fd7]
[0:7] __libc_start_main+0xf4 [0x3633c1d9b4]
[0:8] _ZNSt8ios_base4InitD1Ev+0x4a [0x54105a]
And STDERR for the job is:
Charmrun> started all node programs in 1.544 seconds.
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: failed to change qp state to RTR
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: failed to change qp state to RTR
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: failed to change qp state to RTR
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: failed to change qp state to RTR
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: failed to change qp state to RTR
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: failed to change qp state to RTR
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: failed to change qp state to RTR
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: failed to change qp state to RTR
Fatal error on PE 0> failed to change qp state to RTR
We are running Moab as the scheduler and Torque as the resource manager. The version of NAMD is:
NAMD_2.9_Linux-x86_64-ibverbs
Do I need to specify that the link layer should use IB as opposed to Ethernet?
ibstat for the nodes with both interconnects:
CA 'mlx4_0'
    CA type: MT4099
    Number of ports: 2
    Firmware version: 2.10.0
    Hardware version: 0
    Node GUID: 0xffffffffffffffff
    System image GUID: 0xffffffffffffffff
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 10
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x00010000
        Port GUID: 0x0202c9fffe34e8f0
        Link layer: Ethernet
    Port 2:
        State: Down
        Physical state: Disabled
        Rate: 10
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x00010000
        Port GUID: 0x0202c9fffe34e8f1
        Link layer: Ethernet
CA 'mlx4_1'
    CA type: MT26428
    Number of ports: 1
    Firmware version: 2.9.1000
    Hardware version: b0
    Node GUID: 0x002590ffff16b658
    System image GUID: 0x002590ffff16b65b
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 40
        Base lid: 21
        LMC: 0
        SM lid: 3
        Capability mask: 0x02510868
        Port GUID: 0x002590ffff16b659
        Link layer: InfiniBand
ibstat for a node with just IB:
CA 'mlx4_0'
    CA type: MT26428
    Number of ports: 1
    Firmware version: 2.9.1000
    Hardware version: b0
    Node GUID: 0x002590ffff16bbe8
    System image GUID: 0x002590ffff16bbeb
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 40
        Base lid: 14
        LMC: 0
        SM lid: 3
        Capability mask: 0x02510868
        Port GUID: 0x002590ffff16bbe9
        Link layer: InfiniBand
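In the two listings above, the dual-interconnect nodes report the Ethernet adapter first (mlx4_0) and the InfiniBand port on mlx4_1, while the IB-only nodes have InfiniBand on mlx4_0. To check this across many nodes, a small script along the following lines could help. It is only a sketch: the function names and the abbreviated SAMPLE text are made up for illustration, it assumes ibstat output shaped like the listings above, and it says nothing about which device the ibverbs layer actually opens.

```python
import re

# Abbreviated ibstat text for a dual-interconnect node (fields trimmed
# from the listing above; only the lines the parser looks at are kept).
SAMPLE = """\
CA 'mlx4_0'
    Port 1:
        State: Active
        Physical state: LinkUp
        Link layer: Ethernet
    Port 2:
        State: Down
        Physical state: Disabled
        Link layer: Ethernet
CA 'mlx4_1'
    Port 1:
        State: Active
        Physical state: LinkUp
        Link layer: InfiniBand
"""


def parse_ibstat(text):
    """Return (ca, port, state, link_layer) tuples, one per port."""
    ports = []
    ca = port = state = None
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"CA '([^']+)'", line)
        if m:  # start of a new adapter section
            ca, port, state = m.group(1), None, None
            continue
        m = re.match(r"Port (\d+):", line)
        if m:  # start of a new port section ("Port GUID:" does not match)
            port, state = int(m.group(1)), None
            continue
        if port is None:
            continue  # adapter-level fields, not needed here
        if line.startswith("State:"):
            state = line.split(":", 1)[1].strip()
        elif line.startswith("Link layer:"):
            ports.append((ca, port, state, line.split(":", 1)[1].strip()))
    return ports


def active_infiniband_ports(text):
    """Return (ca, port) pairs that are Active with an InfiniBand link layer."""
    return [(ca, port) for ca, port, state, layer in parse_ibstat(text)
            if state == "Active" and layer == "InfiniBand"]


for ca, port in active_infiniband_ports(SAMPLE):
    print(f"{ca} port {port}")
```

On a real node, the text passed to active_infiniband_ports would be the captured output of ibstat rather than SAMPLE.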
Thanks for your help.
Steve
--
______________________________________________________________________
Steve Cousins - Supercomputer Engineer/Administrator - Univ of Maine
Marine Sciences, 452 Aubert Hall Target Tech, 20 Godfrey Drive
Orono, ME 04469 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Orono, ME 04473
(207) 581-4302 ~ steve.cousins at maine.edu ~ (207) 866-6552