charm AT lists.siebelschool.illinois.edu
Subject: Charm++ parallel programming system
Re: [charm] [ppl] Problem running ibverbs version of NAMD on nodes with both IB and 10 GbE
- From: Gengbin Zheng <gzheng AT illinois.edu>
- To: Stephen Cousins <steve.cousins AT maine.edu>
- Cc: "charm AT cs.uiuc.edu" <charm AT cs.uiuc.edu>
- Subject: Re: [charm] [ppl] Problem running ibverbs version of NAMD on nodes with both IB and 10 GbE
- Date: Mon, 11 Jun 2012 14:00:28 -0500
- List-archive: <http://lists.cs.uiuc.edu/pipermail/charm>
- List-id: CHARM parallel programming system <charm.cs.uiuc.edu>
Hi Steve,
The Charm ib layer always assumes the ib port is 1, which is why it does not
work when a node has multiple interfaces.
I checked in a fix that tests the ib ports. It is in the latest main branch.
Can you give it a try?
Gengbin
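
The port-selection idea described above can be illustrated outside of Charm++. What follows is only a minimal sketch, not the fix that was checked in: it parses `ibstat` text like the output quoted below and returns the first port that is Active with an InfiniBand link layer, instead of assuming port 1 of the first adapter. The function names `parse_ibstat` and `pick_ib_port` and the abbreviated `SAMPLE` text are hypothetical and exist only for this illustration.

```python
import re

# Abbreviated ibstat output modeled on the dual-interface node in this thread.
SAMPLE = """\
CA 'mlx4_0'
    Number of ports: 2
    Port 1:
        State: Active
        Link layer: Ethernet
    Port 2:
        State: Down
        Link layer: Ethernet
CA 'mlx4_1'
    Number of ports: 1
    Port 1:
        State: Active
        Link layer: InfiniBand
"""


def parse_ibstat(text):
    """Return a list of (ca_name, port_number, fields) tuples from ibstat text."""
    ports = []
    ca = None
    port = None
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"CA '(.+)'$", line)
        if m:
            # A new adapter starts; no port is open yet.
            ca, port = m.group(1), None
            continue
        m = re.match(r"Port (\d+):$", line)
        if m and ca is not None:
            port = (ca, int(m.group(1)), {})
            ports.append(port)
            continue
        if port is not None and ":" in line:
            # Only "key: value" lines that follow a "Port N:" header are kept.
            key, _, value = line.partition(":")
            port[2][key.strip()] = value.strip()
    return ports


def pick_ib_port(ports):
    """Return (ca_name, port_number) of the first active InfiniBand port, or None."""
    for ca, num, fields in ports:
        if fields.get("State") == "Active" and fields.get("Link layer") == "InfiniBand":
            return ca, num
    return None


if __name__ == "__main__":
    print(pick_ib_port(parse_ibstat(SAMPLE)))
```

On the sample text the sketch skips both Ethernet ports on mlx4_0 and selects port 1 of mlx4_1, which is the kind of choice a hard-coded port 1 on the first adapter cannot make.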
On Mon, Jun 11, 2012 at 11:12 AM, Stephen Cousins <steve.cousins AT maine.edu> wrote:
> Hi,
>
> Is this list active? Does anyone have any ideas about how charmrun can
> specify specific types of interfaces when using ibverbs?
>
> Thanks,
>
> Steve
>
> On Mon, Jun 4, 2012 at 1:24 PM, Stephen Cousins <steve.cousins AT maine.edu> wrote:
>>
>> Hi,
>>
>> We have 16 nodes with both IB and 10 GbE interfaces (both interfaces are
>> Mellanox). We also have 16 nodes that have just IB. I can run NAMD on the
>> IB-only nodes just fine; however, if the job is allocated a node that has
>> both IB and 10 GbE, it does not work.
>>
>> charmrun output is:
>>
>> Charmrun> IBVERBS version of charmrun
>> [0] Stack Traceback:
>> [0:0] CmiAbort+0x5c [0xcbef1e]
>> [0:1] initInfiOtherNodeData+0x14a [0xcbe488]
>> [0:2] /opt/scyld/NAMD_2.9_Linux-x86_64-ibverbs/namd2 [0xcbcf8b]
>> [0:3] /opt/scyld/NAMD_2.9_Linux-x86_64-ibverbs/namd2 [0xcbd9e5]
>> [0:4] ConverseInit+0x1cd [0xcbe001]
>> [0:5] _ZN7BackEnd4initEiPPc+0x6f [0x58ad13]
>> [0:6] main+0x2f [0x585fd7]
>> [0:7] __libc_start_main+0xf4 [0x3633c1d9b4]
>> [0:8] _ZNSt8ios_base4InitD1Ev+0x4a [0x54105a]
>>
>>
>> And STDERR for the job is:
>>
>> Charmrun> started all node programs in 1.544 seconds.
>> ------------- Processor 0 Exiting: Called CmiAbort ------------
>> Reason: failed to change qp state to RTR
>> ------------- Processor 0 Exiting: Called CmiAbort ------------
>> Reason: failed to change qp state to RTR
>> ------------- Processor 0 Exiting: Called CmiAbort ------------
>> Reason: failed to change qp state to RTR
>> ------------- Processor 0 Exiting: Called CmiAbort ------------
>> Reason: failed to change qp state to RTR
>> ------------- Processor 0 Exiting: Called CmiAbort ------------
>> Reason: failed to change qp state to RTR
>> ------------- Processor 0 Exiting: Called CmiAbort ------------
>> Reason: failed to change qp state to RTR
>> ------------- Processor 0 Exiting: Called CmiAbort ------------
>> Reason: failed to change qp state to RTR
>> ------------- Processor 0 Exiting: Called CmiAbort ------------
>> Reason: failed to change qp state to RTR
>> Fatal error on PE 0> failed to change qp state to RTR
>>
>>
>> We are running Moab and Torque for the scheduler and resource manager. The
>> version of NAMD is:
>>
>> NAMD_2.9_Linux-x86_64-ibverbs
>>
>> Do I need to specify that the link layer should use IB as opposed to
>> Ethernet?
>>
>> ibstat for the nodes with both interconnects:
>>
>> CA 'mlx4_0'
>>     CA type: MT4099
>>     Number of ports: 2
>>     Firmware version: 2.10.0
>>     Hardware version: 0
>>     Node GUID: 0xffffffffffffffff
>>     System image GUID: 0xffffffffffffffff
>>     Port 1:
>>         State: Active
>>         Physical state: LinkUp
>>         Rate: 10
>>         Base lid: 0
>>         LMC: 0
>>         SM lid: 0
>>         Capability mask: 0x00010000
>>         Port GUID: 0x0202c9fffe34e8f0
>>         Link layer: Ethernet
>>     Port 2:
>>         State: Down
>>         Physical state: Disabled
>>         Rate: 10
>>         Base lid: 0
>>         LMC: 0
>>         SM lid: 0
>>         Capability mask: 0x00010000
>>         Port GUID: 0x0202c9fffe34e8f1
>>         Link layer: Ethernet
>> CA 'mlx4_1'
>>     CA type: MT26428
>>     Number of ports: 1
>>     Firmware version: 2.9.1000
>>     Hardware version: b0
>>     Node GUID: 0x002590ffff16b658
>>     System image GUID: 0x002590ffff16b65b
>>     Port 1:
>>         State: Active
>>         Physical state: LinkUp
>>         Rate: 40
>>         Base lid: 21
>>         LMC: 0
>>         SM lid: 3
>>         Capability mask: 0x02510868
>>         Port GUID: 0x002590ffff16b659
>>         Link layer: InfiniBand
>>
>>
>> ibstat for a node with just IB:
>>
>> CA 'mlx4_0'
>>     CA type: MT26428
>>     Number of ports: 1
>>     Firmware version: 2.9.1000
>>     Hardware version: b0
>>     Node GUID: 0x002590ffff16bbe8
>>     System image GUID: 0x002590ffff16bbeb
>>     Port 1:
>>         State: Active
>>         Physical state: LinkUp
>>         Rate: 40
>>         Base lid: 14
>>         LMC: 0
>>         SM lid: 3
>>         Capability mask: 0x02510868
>>         Port GUID: 0x002590ffff16bbe9
>>         Link layer: InfiniBand
>>
>> Thanks for your help.
>>
>> Steve
>>
>> --
>> ______________________________________________________________________
>> Steve Cousins - Supercomputer Engineer/Administrator - Univ of Maine
>> Marine Sciences, 452 Aubert Hall Target Tech, 20 Godfrey Drive
>> Orono, ME 04469 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Orono, ME 04473
>> (207) 581-4302 ~ steve.cousins at maine.edu ~ (207) 866-6552
>>