charm AT lists.siebelschool.illinois.edu
Subject: Charm++ parallel programming system
List archive
- From: Eric Bohm <ebohm AT illinois.edu>
- To: <charm AT cs.uiuc.edu>
- Subject: Re: [charm] Migration error with AMPI + ibverbs (+ SMP)
- Date: Fri, 22 Mar 2013 13:50:08 -0500
- List-archive: <http://lists.cs.uiuc.edu/pipermail/charm/>
- List-id: CHARM parallel programming system <charm.cs.uiuc.edu>
How
On 03/22/2013 08:16 AM, Rafael Keller Tesser wrote:

Hello,

I don't think my previous message reached the list (I wasn't subscribed to it), so I am including a copy at the end of this message.

I ported a geophysics application to AMPI in order to experiment with its load-balancing features. I wanted to run some tests over InfiniBand, but I was getting a migration error when using AMPI compiled with the ibverbs option. This error did not happen when running without ibverbs. Looking at the compilation options used in the nightly regression tests (http://charm.cs.illinois.edu/autobuild/cur/), I noticed that the ibverbs versions were being compiled with the option "-thread context". So I built Charm++, AMPI, and my application with this option, and now it seems to work over InfiniBand. However, I'd like to know what this option means; I couldn't find that information anywhere.

On another topic, I am also interested in using the smp module. At first I was getting a migration error; then I found out I needed to pass the option +CmiNoProcForComThread to the runtime. So now I can execute my application with ibverbs OR smp, but not with ibverbs AND smp together. When I run the application on Charm++/AMPI built with both ibverbs and smp, I get a segmentation violation. When I compile the program with "-memory paranoid", the error disappears.

Commands used to build Charm++ and AMPI:

./build charm++ net-linux-x86_64 ibverbs smp -j16 --with-production -thread context
./build AMPI net-linux-x86_64 ibverbs smp -j16 --with-production -thread context

I am passing the following options to charmrun (on 4 nodes x 8 cores per node):

./charmrun ondes3d +p32 +vp 128 +mapping BLOCK_MAP ++remote-shell ssh +setcpuaffinity +balancer GreedyLB +CmiNoProcForComThread

I also tested with the migration test program that comes with Charm++ (in the subdirectory tests/ampi/migration). It doesn't give a segmentation violation, but it sometimes hangs during migration. I have included its output below this message.

Any idea on what the problem may be?

--
Best regards,

Rafael Keller Tesser
GPPD - Grupo de Processamento Paralelo e Distribuído
Instituto de Informática / UFRGS
Porto Alegre, RS - Brasil

-------------------

**** Output of the migration test program (until it hangs): ****

./charmrun ./pgm +p2 +vp4 +CmiNoProcForComThread
Charmrun> IBVERBS version of charmrun
Charmrun> started all node programs in 1.198 seconds.
Converse/Charm++ Commit ID:
Charm++> scheduler running in netpoll mode.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (8-way SMP).
Charm++> cpu topology info is gathered in 0.002 seconds.
begin migrating
begin migrating
begin migrating
begin migrating
Trying to migrate partition 1 from pe 0 to 1
Entering TCHARM_Migrate_to, FEM_My_partition is 1, CkMyPe() is 0, migrate_test is 0
Leaving TCHARM_Migrate_to, FEM_My_partition is 1, CkMyPe() is 1, migrate_test is 0
Done with step 0
Done with step 0
Done with step 0
Done with step 0
Trying to migrate partition 1 from pe 1 to 0
Entering TCHARM_Migrate_to, FEM_My_partition is 1, CkMyPe() is 1, migrate_test is 1
Leaving TCHARM_Migrate_to, FEM_My_partition is 1, CkMyPe() is 0, migrate_test is 1
Done with step 1
Done with step 1
Done with step 1
Done with step 1
Trying to migrate partition 1 from pe 0 to 1
Entering TCHARM_Migrate_to, FEM_My_partition is 1, CkMyPe() is 0, migrate_test is 0
Leaving TCHARM_Migrate_to, FEM_My_partition is 1, CkMyPe() is 1, migrate_test is 0
Done with step 2
Done with step 2
Done with step 2
Done with step 2
done migrating
done migrating
done migrating
done migrating
All tests passed

./charmrun ./pgm +p2 +vp20 +CmiNoProcForComThread
Charmrun> IBVERBS version of charmrun
Charmrun> started all node programs in 1.174 seconds.
Converse/Charm++ Commit ID:
Charm++> scheduler running in netpoll mode.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (8-way SMP).
Charm++> cpu topology info is gathered in 0.002 seconds.
begin migrating
begin migrating
Trying to migrate partition 1 from pe 0 to 1
begin migrating
Entering TCHARM_Migrate_to, FEM_My_partition is 1, CkMyPe() is 0, migrate_test is 0

--------------------------------------------------------------

**** My previous message: ****

From: Rafael Keller Tesser <rafael.tesser AT inf.ufrgs.br>
Date: Thu, Mar 21, 2013 at 10:38 AM
Subject: Migration error with AMPI + ibverbs
To: charm AT cs.uiuc.edu

Hello,

I ported a geophysics application to AMPI in order to experiment with its load-balancing features. Without load balancing, the application runs without any error on both Gigabit Ethernet and InfiniBand. With load balancing, the application runs fine on Gigabit Ethernet. With the ibverbs version of Charm++, however, I get the following error during the first load-balancing step:

--
...
CharmLB> GreedyLB: PE [0] Memory: LBManager: 921 KB CentralLB: 87 KB
CharmLB> GreedyLB: PE [0] #Objects migrating: 247, LBMigrateMsg size: 0.02 MB
CharmLB> GreedyLB: PE [0] strategy finished at 55.669918 duration 0.007592 s
[0] Starting ReceiveMigration step 0 at 55.672409
Charmrun: error on request socket-- Socket closed before recv.
--

I am sending the full output in a file attached to this message (output.txt).

The error also happens with the AMPI migration test program that comes with Charm++ (located in tests/ampi/migration). Its outputs are attached to this message. I get this error both with Charm-6.4.0 and with the development version from the Git repository.

AMPI was built with:

./build charm++ net-linux-x86_64 ibverbs --with-production -j16
./build AMPI net-linux-x86_64 ibverbs --with-production -j16

Do you have any ideas on what may be causing this error?

--
Best regards,

Rafael Keller Tesser
GPPD - Grupo de Processamento Paralelo e Distribuído
Instituto de Informática / UFRGS

_______________________________________________
charm mailing list
charm AT cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/charm
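For context on what the tests/ampi/migration runs quoted above are exercising: an AMPI program's MPI ranks run as migratable user-level threads, and the test repeatedly moves them between PEs. The sketch below is not that test's source; it is a minimal, hypothetical AMPI-style program showing where a load-balancing/migration call sits in an iterative code. It uses the AMPI extension AMPI_Migrate with the predefined info object AMPI_INFO_LB_SYNC, as documented for current AMPI; the exact spelling in the Charm++ 6.4.0-era builds discussed in this thread may differ, so treat that name and signature as an assumption.

/* Minimal AMPI-style sketch of an iterative code with a migration point.
 * Not the tests/ampi/migration program; illustration only. */
#include <stdio.h>
#include <mpi.h>   /* with AMPI, mpi.h also declares the AMPI_* extensions */

int main(int argc, char **argv)
{
    int rank, nranks, step;
    double local, sum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    local = (double)rank;   /* stand-in for real per-rank state */

    for (step = 0; step < 3; step++) {
        /* stand-in for the application's real computation and communication */
        MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("Done with step %d (sum = %g)\n", step, sum);

        /* Collective migration point: with a load balancer enabled
         * (e.g. +balancer GreedyLB), the runtime may migrate virtual
         * ranks to other PEs here.  AMPI_Migrate/AMPI_INFO_LB_SYNC is
         * the current AMPI spelling and is assumed, not taken from
         * this thread. */
        AMPI_Migrate(AMPI_INFO_LB_SYNC);
    }

    MPI_Finalize();
    return 0;
}

A program like this would be compiled with AMPI's ampicc wrapper and launched much like the runs quoted above, e.g. ./charmrun ./pgm +p2 +vp4 +balancer GreedyLB +CmiNoProcForComThread, so that each +vp virtual rank is a migratable thread. The "-thread" build option passed to ./build chooses how those user-level threads are implemented; as far as I know, "context" selects the implementation built on the POSIX ucontext routines.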
- [charm] Migration error with AMPI + ibverbs (+ SMP), Rafael Keller Tesser, 03/22/2013
- Re: [charm] Migration error with AMPI + ibverbs (+ SMP), Eric Bohm, 03/22/2013
- <Possible follow-up(s)>
- Re: [charm] Migration error with AMPI + ibverbs (+ SMP), Jain, Nikhil, 03/22/2013
- Re: [charm] Migration error with AMPI + ibverbs (+ SMP), Jain, Nikhil, 03/22/2013
- Re: [charm] Migration error with AMPI + ibverbs (+ SMP), Rafael Keller Tesser, 03/23/2013
- Re: [charm] Migration error with AMPI + ibverbs (+ SMP), Jain, Nikhil, 03/27/2013
- [charm] Migration error with AMPI + ibverbs (+ SMP), Rafael Keller Tesser, 03/22/2013