charm AT lists.siebelschool.illinois.edu
Subject: Charm++ parallel programming system
- From: Thomas Albers <talbers AT binghamton.edu>
- To: charm AT cs.uiuc.edu
- Subject: Re: [charm] [ppl] Myrinet MX broken in Charm-6.3.2
- Date: Thu, 16 Jun 2011 17:28:33 -0400
- List-archive: <http://lists.cs.uiuc.edu/pipermail/charm>
- List-id: CHARM parallel programming system <charm.cs.uiuc.edu>
Hello!
> There actually never was native x86_64 support for MX because InfiniBand
> and x86_64 took over the market at about the same time MX came out.
> Since it exists on other 64-bit platforms the port should be pretty
> easy, just copying conv-mach-mx.h and conv-mach-mx.sh to
> arch/net-linux_86_64 from another platform and figuring out what the
> right #defines are.
As I said, the idea is to use Open-MX instead of TCP/IP networking in
the hope that the lower latency translates into greater speed when
running NAMD.
I have:
#define CMK_USE_MX 1
#define CMK_NETPOLL 1
#define CMK_BARRIER_USE_COMMON_CODE 0
This fails immediately, with Open-MX versions 1.4.0 and 1.4.901:
ta@porsche ~/NAMD_2.8_Source/charm-6.3.2/tests/charm++/megatest $ ./charmrun +p2 ./pgm
Charmrun> error 93620 attaching to node:
Socket closed before recv.
Perhaps more interesting is the behavior when using Open MPI (1.4.3). The
test suite finishes most of the time, but sometimes it aborts like this:
ta@porsche ~/NAMD_2.8_Source/charm-6.3.2/mpi-linux-x86_64-smp-mpicxx/tests/charm++ $ mpirun --mca btl mx,sm,self -H porsche,porsche,porsche,porsche,ferrari,ferrari,ferrari,ferrari,yamaha,yamaha,yamaha,yamaha,michelin,michelin,michelin,michelin,michelin,michelin ./pgm
...
test 36: initiated [multi marshall (olawlor)]
[16] Stack Traceback:
[16:0] CmiAbort+0x58 [0x58c26a]
[16:1] [0x58d8a2]
[16:2] CmiHandleMessage+0x2b [0x58d5c7]
[16:3] CsdScheduleForever+0x5c [0x5911d6]
[16:4] CsdScheduler+0xd [0x59124e]
[16:5] [0x58cc1c]
[16:6] [0x58d1c7]
[16:7] +0x6ac4 [0x7fa3a8942ac4]
[16:8] clone+0x6d [0x7fa3a6f793ed]
------------- Processor 16 Exiting: Called CmiAbort ------------
Reason: Converse zero handler executed-- was a message corrupted?
Does the behavior of Open-MX differ in some subtle way from Myrinet-MX?
How could I be helpful in tracking down the bug, and is it worth it?
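Since the abort only shows up some of the time, one way to put a number on it would be to rerun the test in a loop and count the failing runs. The following is only a sketch: the function name count_failures, the run count, and the log directory are illustrative choices and not part of Charm++ or Open MPI, and it assumes the launcher exits with a non-zero status when a rank aborts. If that assumption does not hold, the saved logs can be searched for "Called CmiAbort" instead.

```shell
#!/bin/sh
# Sketch: run a command repeatedly and count how many runs exit non-zero.
# Usage: count_failures RUNS LOGDIR COMMAND [ARGS...]
# RUNS and LOGDIR are hypothetical parameters chosen for this example.
count_failures() {
    runs=$1
    logdir=$2
    shift 2
    mkdir -p "$logdir" || return 2
    failures=0
    i=1
    while [ "$i" -le "$runs" ]; do
        # Save stdout and stderr of every run so a failing trace can be read later.
        if ! "$@" > "$logdir/run-$i.log" 2>&1; then
            failures=$((failures + 1))
            # Rename the log of a failed run so it is easy to spot.
            mv "$logdir/run-$i.log" "$logdir/run-$i.FAILED.log"
        fi
        i=$((i + 1))
    done
    # Print the number of failed runs as the only output.
    echo "$failures"
}
```

For example, `count_failures 50 mx-logs mpirun --mca btl mx,sm,self -H porsche,ferrari ./pgm` would print the number of failed runs out of 50. Repeating the same loop with `--mca btl tcp,sm,self` would indicate whether the abort is specific to the MX path, which bears on the Open-MX question.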
Regards,
Thomas