charm AT lists.siebelschool.illinois.edu
Subject: Charm++ parallel programming system
Re: [charm] [ppl] Problem running ibverbs version of NAMD on nodes with both IB and 10 GbE
- From: Gengbin Zheng <gzheng AT illinois.edu>
- To: Stephen Cousins <steve.cousins AT maine.edu>
- Cc: Charm Mailing List <charm AT cs.illinois.edu>
- Subject: Re: [charm] [ppl] Problem running ibverbs version of NAMD on nodes with both IB and 10 GbE
- Date: Tue, 12 Jun 2012 09:05:36 -0500
- List-archive: <http://lists.cs.uiuc.edu/pipermail/charm>
- List-id: CHARM parallel programming system <charm.cs.uiuc.edu>
I added one more line of code to check whether the link layer is IB or
Ethernet from the ibv_port_attr; hope this helps.
The change is in the current git branch.
Gengbin
On Tue, Jun 12, 2012 at 8:15 AM, Stephen Cousins
<steve.cousins AT maine.edu>
wrote:
> Hi Gengbin,
>
> That is where I started, but it doesn't appear to be possible to reorder the
> devices on this cluster. There is no udev on the nodes. For now, what you have
> done should be fine. I'll try to test it today.
>
> Thanks very much.
>
> Steve
>
> On Tue, Jun 12, 2012 at 1:37 AM, Gengbin Zheng
> <gzheng AT illinois.edu>
> wrote:
>>
>> I am not very sure about this. Maybe it cannot check whether the port is IB
>> or Ethernet.
>> As a workaround, if you make port 1 IB and port 2 Ethernet, Charm should
>> work.
>>
>> Gengbin
>>
>> On Mon, Jun 11, 2012 at 5:28 PM, Stephen Cousins
>> <steve.cousins AT maine.edu>
>> wrote:
>> > So if it is a 10 GbE port that uses the same driver and is up and running
>> > (doing Ethernet things), will it still get in the way? If so, can your
>> > solution check whether it is IB vs. Ethernet? Right now I just want
>> > something to work, so even if we have to disable Ethernet on that device
>> > that would be fine. A longer-term goal, though, would be to be able to use
>> > both IB and 10 GbE on these nodes.
>> >
>> > Thanks,
>> >
>> > Steve
>> >
>> >
>> > On Mon, Jun 11, 2012 at 4:10 PM, Gengbin Zheng
>> > <gzheng AT illinois.edu>
>> > wrote:
>> >>
>> >> I added a call to ibv_query_port to test whether a port is valid or not,
>> >> starting from port number 1.
>> >>
>> >> Gengbin
>> >>
>> >>
>> >> On Mon, Jun 11, 2012 at 2:57 PM, Stephen Cousins
>> >> <steve.cousins AT maine.edu>
>> >> wrote:
>> >> > Hi Gengbin,
>> >> >
>> >> > Thanks a lot. I'll give it a try.
>> >> >
>> >> > How is the new test done? Do you check the link layer to make sure it
>> >> > actually is an IB device as opposed to Ethernet?
>> >> >
>> >> > Steve
>> >> >
>> >> > On Mon, Jun 11, 2012 at 3:00 PM, Gengbin Zheng
>> >> > <gzheng AT illinois.edu>
>> >> > wrote:
>> >> >>
>> >> >> Hi Steve,
>> >> >>
>> >> >> The Charm ib layer always assumes the IB port is 1; that is why it
>> >> >> didn't work when you have multiple interfaces.
>> >> >> I checked in a fix to test the IB ports. It is in the latest main
>> >> >> branch.
>> >> >> Can you give it a try?
>> >> >>
>> >> >> Gengbin
>> >> >>
>> >> >> On Mon, Jun 11, 2012 at 11:12 AM, Stephen Cousins
>> >> >> <steve.cousins AT maine.edu>
>> >> >> wrote:
>> >> >> > Hi,
>> >> >> >
>> >> >> > Is this list active? Does anyone have any ideas about how charmrun
>> >> >> > can select a specific type of interface when using ibverbs?
>> >> >> >
>> >> >> > Thanks,
>> >> >> >
>> >> >> > Steve
>> >> >> >
>> >> >> > On Mon, Jun 4, 2012 at 1:24 PM, Stephen Cousins
>> >> >> > <steve.cousins AT maine.edu>
>> >> >> > wrote:
>> >> >> >>
>> >> >> >> Hi,
>> >> >> >>
>> >> >> >> We have 16 nodes with both IB and 10 GbE interfaces (both
>> >> >> >> interfaces are Mellanox). We also have 16 nodes that have just IB.
>> >> >> >> I can run NAMD on the IB-only nodes just fine; however, if the job
>> >> >> >> is allocated a node that has both IB and 10 GbE then it does not
>> >> >> >> work.
>> >> >> >>
>> >> >> >> charmrun output is:
>> >> >> >>
>> >> >> >> Charmrun> IBVERBS version of charmrun
>> >> >> >> [0] Stack Traceback:
>> >> >> >> [0:0] CmiAbort+0x5c [0xcbef1e]
>> >> >> >> [0:1] initInfiOtherNodeData+0x14a [0xcbe488]
>> >> >> >> [0:2] /opt/scyld/NAMD_2.9_Linux-x86_64-ibverbs/namd2 [0xcbcf8b]
>> >> >> >> [0:3] /opt/scyld/NAMD_2.9_Linux-x86_64-ibverbs/namd2 [0xcbd9e5]
>> >> >> >> [0:4] ConverseInit+0x1cd [0xcbe001]
>> >> >> >> [0:5] _ZN7BackEnd4initEiPPc+0x6f [0x58ad13]
>> >> >> >> [0:6] main+0x2f [0x585fd7]
>> >> >> >> [0:7] __libc_start_main+0xf4 [0x3633c1d9b4]
>> >> >> >> [0:8] _ZNSt8ios_base4InitD1Ev+0x4a [0x54105a]
>> >> >> >>
>> >> >> >>
>> >> >> >> And STDERR for the job is:
>> >> >> >>
>> >> >> >> Charmrun> started all node programs in 1.544 seconds.
>> >> >> >> ------------- Processor 0 Exiting: Called CmiAbort ------------
>> >> >> >> Reason: failed to change qp state to RTR
>> >> >> >> ------------- Processor 0 Exiting: Called CmiAbort ------------
>> >> >> >> Reason: failed to change qp state to RTR
>> >> >> >> ------------- Processor 0 Exiting: Called CmiAbort ------------
>> >> >> >> Reason: failed to change qp state to RTR
>> >> >> >> ------------- Processor 0 Exiting: Called CmiAbort ------------
>> >> >> >> Reason: failed to change qp state to RTR
>> >> >> >> ------------- Processor 0 Exiting: Called CmiAbort ------------
>> >> >> >> Reason: failed to change qp state to RTR
>> >> >> >> ------------- Processor 0 Exiting: Called CmiAbort ------------
>> >> >> >> Reason: failed to change qp state to RTR
>> >> >> >> ------------- Processor 0 Exiting: Called CmiAbort ------------
>> >> >> >> Reason: failed to change qp state to RTR
>> >> >> >> ------------- Processor 0 Exiting: Called CmiAbort ------------
>> >> >> >> Reason: failed to change qp state to RTR
>> >> >> >> Fatal error on PE 0> failed to change qp state to RTR
>> >> >> >>
>> >> >> >>
>> >> >> >> We are running Moab and Torque for the scheduler and resource
>> >> >> >> manager.
>> >> >> >> The
>> >> >> >> version of NAMD is:
>> >> >> >>
>> >> >> >> NAMD_2.9_Linux-x86_64-ibverbs
>> >> >> >>
>> >> >> >> Do I need to specify that the link layer should use IB as opposed
>> >> >> >> to
>> >> >> >> Ethernet?
>> >> >> >>
>> >> >> >> ibstat for the nodes with both interconnects:
>> >> >> >>
>> >> >> >> CA 'mlx4_0'
>> >> >> >> CA type: MT4099
>> >> >> >> Number of ports: 2
>> >> >> >> Firmware version: 2.10.0
>> >> >> >> Hardware version: 0
>> >> >> >> Node GUID: 0xffffffffffffffff
>> >> >> >> System image GUID: 0xffffffffffffffff
>> >> >> >> Port 1:
>> >> >> >> State: Active
>> >> >> >> Physical state: LinkUp
>> >> >> >> Rate: 10
>> >> >> >> Base lid: 0
>> >> >> >> LMC: 0
>> >> >> >> SM lid: 0
>> >> >> >> Capability mask: 0x00010000
>> >> >> >> Port GUID: 0x0202c9fffe34e8f0
>> >> >> >> Link layer: Ethernet
>> >> >> >> Port 2:
>> >> >> >> State: Down
>> >> >> >> Physical state: Disabled
>> >> >> >> Rate: 10
>> >> >> >> Base lid: 0
>> >> >> >> LMC: 0
>> >> >> >> SM lid: 0
>> >> >> >> Capability mask: 0x00010000
>> >> >> >> Port GUID: 0x0202c9fffe34e8f1
>> >> >> >> Link layer: Ethernet
>> >> >> >> CA 'mlx4_1'
>> >> >> >> CA type: MT26428
>> >> >> >> Number of ports: 1
>> >> >> >> Firmware version: 2.9.1000
>> >> >> >> Hardware version: b0
>> >> >> >> Node GUID: 0x002590ffff16b658
>> >> >> >> System image GUID: 0x002590ffff16b65b
>> >> >> >> Port 1:
>> >> >> >> State: Active
>> >> >> >> Physical state: LinkUp
>> >> >> >> Rate: 40
>> >> >> >> Base lid: 21
>> >> >> >> LMC: 0
>> >> >> >> SM lid: 3
>> >> >> >> Capability mask: 0x02510868
>> >> >> >> Port GUID: 0x002590ffff16b659
>> >> >> >> Link layer: InfiniBand
>> >> >> >>
>> >> >> >>
>> >> >> >> ibstat for a node with just IB:
>> >> >> >>
>> >> >> >> CA 'mlx4_0'
>> >> >> >> CA type: MT26428
>> >> >> >> Number of ports: 1
>> >> >> >> Firmware version: 2.9.1000
>> >> >> >> Hardware version: b0
>> >> >> >> Node GUID: 0x002590ffff16bbe8
>> >> >> >> System image GUID: 0x002590ffff16bbeb
>> >> >> >> Port 1:
>> >> >> >> State: Active
>> >> >> >> Physical state: LinkUp
>> >> >> >> Rate: 40
>> >> >> >> Base lid: 14
>> >> >> >> LMC: 0
>> >> >> >> SM lid: 3
>> >> >> >> Capability mask: 0x02510868
>> >> >> >> Port GUID: 0x002590ffff16bbe9
>> >> >> >> Link layer: InfiniBand
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >> Thanks for your help.
>> >> >> >>
>> >> >> >> Steve
>> >> >> >>
>> >> >> >> --
>> >> >> >> ______________________________________________________________________
>> >> >> >> Steve Cousins - Supercomputer Engineer/Administrator - Univ of Maine
>> >> >> >> Marine Sciences, 452 Aubert Hall          Target Tech, 20 Godfrey Drive
>> >> >> >> Orono, ME 04469    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~    Orono, ME 04473
>> >> >> >> (207) 581-4302 ~ steve.cousins at maine.edu ~ (207) 866-6552
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >
>> >
>> >
>> >
>> >
>> >
>
>
>
>
>
- [charm] Problem running ibverbs version of NAMD on nodes with both IB and 10 GbE, Stephen Cousins, 06/04/2012
- Re: [charm] Problem running ibverbs version of NAMD on nodes with both IB and 10 GbE, Stephen Cousins, 06/06/2012
- Re: [charm] Problem running ibverbs version of NAMD on nodes with both IB and 10 GbE, Stephen Cousins, 06/11/2012
- Re: [charm] [ppl] Problem running ibverbs version of NAMD on nodes with both IB and 10 GbE, Gengbin Zheng, 06/11/2012
- Re: [charm] [ppl] Problem running ibverbs version of NAMD on nodes with both IB and 10 GbE, Gengbin Zheng, 06/12/2012
- Re: [charm] [ppl] Problem running ibverbs version of NAMD on nodes with both IB and 10 GbE, Jim Phillips, 06/12/2012
- Re: [charm] [ppl] Problem running ibverbs version of NAMD on nodes with both IB and 10 GbE, Gengbin Zheng, 06/12/2012
- Re: [charm] [ppl] Problem running ibverbs version of NAMD on nodes with both IB and 10 GbE, Gengbin Zheng, 06/14/2012
- Re: [charm] [ppl] Problem running ibverbs version of NAMD on nodes with both IB and 10 GbE, Stephen Cousins, 06/14/2012
- Re: [charm] [ppl] Problem running ibverbs version of NAMD on nodes with both IB and 10 GbE, Gengbin Zheng, 06/11/2012