charm AT lists.siebelschool.illinois.edu
Subject: Charm++ parallel programming system
Re: [charm] [ppl] Problem running ibverbs version of NAMD on nodes with both IB and 10 GbE
- From: Jim Phillips <jim AT ks.uiuc.edu>
- To: Gengbin Zheng <gzheng AT illinois.edu>
- Cc: Charm Mailing List <charm AT cs.illinois.edu>
- Subject: Re: [charm] [ppl] Problem running ibverbs version of NAMD on nodes with both IB and 10 GbE
- Date: Tue, 12 Jun 2012 11:37:53 -0500 (CDT)
- List-archive: <http://lists.cs.uiuc.edu/pipermail/charm>
- List-id: CHARM parallel programming system <charm.cs.uiuc.edu>
Is there also a way to test if the port is active? People have reported that their two-port IB cards only work if the first port is used.
-Jim
On Tue, 12 Jun 2012, Gengbin Zheng wrote:
I added one more line of code to check whether the link layer is IB or Ethernet from the ibv_port_attr; hope this helps.
The change is in the current git branch.
Gengbin
On Tue, Jun 12, 2012 at 8:15 AM, Stephen Cousins <steve.cousins AT maine.edu> wrote:
Hi Gengbin,
That is where I started, but it doesn't appear to be possible to reorder the devices on this cluster; there is no udev on the nodes. For now, what you have done should be fine. I'll try to test it today.
Thanks very much.
Steve
On Tue, Jun 12, 2012 at 1:37 AM, Gengbin Zheng <gzheng AT illinois.edu> wrote:
I am not very sure about this; maybe it cannot check whether the port is IB or Ethernet.
As a workaround, if you make port 1 IB and port 2 Ethernet, Charm should work.
Gengbin
On Mon, Jun 11, 2012 at 5:28 PM, Stephen Cousins <steve.cousins AT maine.edu> wrote:
So if it is a 10 GbE port that uses the same driver and is up and running (doing Ethernet things), will it still get in the way? If so, can your solution check whether it is IB vs. Ethernet? Right now I just want something to work, so even if we have to disable Ethernet on that device, that would be fine. A longer-term goal, though, would be to be able to use both IB and 10 GbE on these nodes.
Thanks,
Steve
On Mon, Jun 11, 2012 at 4:10 PM, Gengbin Zheng <gzheng AT illinois.edu> wrote:
I added a call to ib_query_port to test if a port is valid or not, starting from port number 1.
Gengbin
On Mon, Jun 11, 2012 at 2:57 PM, Stephen Cousins <steve.cousins AT maine.edu> wrote:
Hi Gengbin,
Thanks a lot. I'll give it a try.
How is the new test done? Do you check the link layer to make sure it actually is an IB device as opposed to Ethernet?
Steve
On Mon, Jun 11, 2012 at 3:00 PM, Gengbin Zheng <gzheng AT illinois.edu> wrote:
Hi Steve,
The Charm ibverbs layer always assumes the IB port is 1; that is why it didn't work when you have multiple interfaces.
I checked in a fix to test the IB ports. It is in the latest main branch. Can you give it a try?
Gengbin
On Mon, Jun 11, 2012 at 11:12 AM, Stephen Cousins <steve.cousins AT maine.edu> wrote:
Hi,
Is this list active? Does anyone have any ideas about how charmrun can select a specific type of interface when using ibverbs?
Thanks,
Steve
On Mon, Jun 4, 2012 at 1:24 PM, Stephen Cousins <steve.cousins AT maine.edu> wrote:
Hi,
We have 16 nodes with both IB and 10 GbE interfaces (both interfaces are Mellanox). We also have 16 nodes that have just IB. I can run NAMD on the IB-only nodes just fine; however, if the job is allocated a node that has both IB and 10 GbE, then it does not work.
charmrun output is:
Charmrun> IBVERBS version of charmrun
[0] Stack Traceback:
[0:0] CmiAbort+0x5c [0xcbef1e]
[0:1] initInfiOtherNodeData+0x14a [0xcbe488]
[0:2] /opt/scyld/NAMD_2.9_Linux-x86_64-ibverbs/namd2 [0xcbcf8b]
[0:3] /opt/scyld/NAMD_2.9_Linux-x86_64-ibverbs/namd2 [0xcbd9e5]
[0:4] ConverseInit+0x1cd [0xcbe001]
[0:5] _ZN7BackEnd4initEiPPc+0x6f [0x58ad13]
[0:6] main+0x2f [0x585fd7]
[0:7] __libc_start_main+0xf4 [0x3633c1d9b4]
[0:8] _ZNSt8ios_base4InitD1Ev+0x4a [0x54105a]
And STDERR for the job is:
Charmrun> started all node programs in 1.544 seconds.
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: failed to change qp state to RTR
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: failed to change qp state to RTR
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: failed to change qp state to RTR
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: failed to change qp state to RTR
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: failed to change qp state to RTR
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: failed to change qp state to RTR
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: failed to change qp state to RTR
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: failed to change qp state to RTR
Fatal error on PE 0> failed to change qp state to RTR
We are running Moab and Torque for the scheduler and resource manager. The version of NAMD is NAMD_2.9_Linux-x86_64-ibverbs.
Do I need to specify that the link layer should use IB as opposed to Ethernet?
ibstat for the nodes with both interconnects:
CA 'mlx4_0'
CA type: MT4099
Number of ports: 2
Firmware version: 2.10.0
Hardware version: 0
Node GUID: 0xffffffffffffffff
System image GUID: 0xffffffffffffffff
Port 1:
State: Active
Physical state: LinkUp
Rate: 10
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00010000
Port GUID: 0x0202c9fffe34e8f0
Link layer: Ethernet
Port 2:
State: Down
Physical state: Disabled
Rate: 10
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00010000
Port GUID: 0x0202c9fffe34e8f1
Link layer: Ethernet
CA 'mlx4_1'
CA type: MT26428
Number of ports: 1
Firmware version: 2.9.1000
Hardware version: b0
Node GUID: 0x002590ffff16b658
System image GUID: 0x002590ffff16b65b
Port 1:
State: Active
Physical state: LinkUp
Rate: 40
Base lid: 21
LMC: 0
SM lid: 3
Capability mask: 0x02510868
Port GUID: 0x002590ffff16b659
Link layer: InfiniBand
ibstat for a node with just IB:
CA 'mlx4_0'
CA type: MT26428
Number of ports: 1
Firmware version: 2.9.1000
Hardware version: b0
Node GUID: 0x002590ffff16bbe8
System image GUID: 0x002590ffff16bbeb
Port 1:
State: Active
Physical state: LinkUp
Rate: 40
Base lid: 14
LMC: 0
SM lid: 3
Capability mask: 0x02510868
Port GUID: 0x002590ffff16bbe9
Link layer: InfiniBand
Thanks for your help.
Steve
--
______________________________________________________________________
Steve Cousins - Supercomputer Engineer/Administrator - Univ of Maine
Marine Sciences, 452 Aubert Hall Target Tech, 20 Godfrey Drive
Orono, ME 04469 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Orono, ME 04473
(207) 581-4302 ~ steve.cousins at maine.edu ~ (207) 866-6552
_______________________________________________
charm mailing list
charm AT cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/charm
_______________________________________________
ppl mailing list
ppl AT cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/ppl
- [charm] Problem running ibverbs version of NAMD on nodes with both IB and 10 GbE, Stephen Cousins, 06/04/2012
- Re: [charm] Problem running ibverbs version of NAMD on nodes with both IB and 10 GbE, Stephen Cousins, 06/06/2012
- Re: [charm] Problem running ibverbs version of NAMD on nodes with both IB and 10 GbE, Stephen Cousins, 06/11/2012
- Re: [charm] [ppl] Problem running ibverbs version of NAMD on nodes with both IB and 10 GbE, Gengbin Zheng, 06/11/2012
- Re: [charm] [ppl] Problem running ibverbs version of NAMD on nodes with both IB and 10 GbE, Gengbin Zheng, 06/12/2012
- Re: [charm] [ppl] Problem running ibverbs version of NAMD on nodes with both IB and 10 GbE, Jim Phillips, 06/12/2012
- Re: [charm] [ppl] Problem running ibverbs version of NAMD on nodes with both IB and 10 GbE, Gengbin Zheng, 06/12/2012
- Re: [charm] [ppl] Problem running ibverbs version of NAMD on nodes with both IB and 10 GbE, Gengbin Zheng, 06/14/2012
- Re: [charm] [ppl] Problem running ibverbs version of NAMD on nodes with both IB and 10 GbE, Stephen Cousins, 06/14/2012
- Re: [charm] [ppl] Problem running ibverbs version of NAMD on nodes with both IB and 10 GbE, Gengbin Zheng, 06/11/2012