No luck. The same type of string-processing error occurs at some point when reading the key itself; see below.
rfvander@klondike:~/charm-6.7.1/examples/ampi/Cjacobi3D$ $HOME/charm-6.7.1/bin/charmrun ./jacobi 2 2 2 1000 +p 2 +vp 8 +isomalloc_sync +balancer RotateLB +LBDebug 1
Running command: ./jacobi 2 2 2 1000 +p 2 +vp 8 +isomalloc_sync +balancer RotateLB +LBDebug 1
Charm++: standalone mode (not using charmrun)
Charm++> Running in Multicore mode: 2 threads
Converse/Charm++ Commit ID:
Warning> Randomization of stack pointer is turned on in kernel.
Charm++> synchronizing isomalloc memory region...
[0] consolidated Isomalloc memory region: 0x440000000 - 0x7f1a00000000 (133258240 megs)
CharmLB> Verbose level 1, load balancing period: 0.5 seconds
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (16-way SMP).
Charm++> cpu topology info is gathered in 0.000 seconds.
[0] RotateLB created
iter 1 time: 0.079971 maxerr: 2020.200000
iter 2 time: 0.059791 maxerr: 1696.968000
iter 3 time: 0.050566 maxerr: 1477.170240
iter 4 time: 0.046094 maxerr: 1319.433024
iter 5 time: 0.045918 maxerr: 1200.918072
iter 6 time: 0.045842 maxerr: 1108.425519
iter 7 time: 0.045895 maxerr: 1033.970839
iter 8 time: 0.045871 maxerr: 972.509242
iter 9 time: 0.045872 maxerr: 920.721889
iter 10 time: 0.045870 maxerr: 876.344030
CharmLB> RotateLB: PE [0] step 0 starting at 0.758304 Memory: 72.253906 MB
CharmLB> RotateLB: PE [0] strategy starting at 0.758354
CharmLB> RotateLB: PE [0] Memory: LBManager: 920 KB CentralLB: 3 KB
CharmLB> RotateLB: PE [0] #Objects migrating: 8, LBMigrateMsg size: 0.00 MB
CharmLB> RotateLB: PE [0] strategy finished at 0.758360 duration 0.000006 s
CharmLB> RotateLB: PE [0] step 0 finished at 0.786232 duration 0.027928 s
iter 11 time: 0.063298 maxerr: 837.779089
iter 12 time: 0.045806 maxerr: 803.868831
iter 13 time: 0.045729 maxerr: 773.751705
iter 14 time: 0.045843 maxerr: 746.772667
iter 15 time: 0.045770 maxerr: 722.424056
iter 16 time: 0.045805 maxerr: 700.305763
iter 17 time: 0.045858 maxerr: 680.097726
iter 18 time: 0.045809 maxerr: 661.540528
iter 19 time: 0.044910 maxerr: 644.421422
iter 20 time: 0.041548 maxerr: 628.564089
iter 21 time: 0.040014 maxerr: 613.821009
iter 22 time: 0.039945 maxerr: 600.067696
iter 23 time: 0.039926 maxerr: 587.198273
iter 24 time: 0.039924 maxerr: 575.122054
iter 25 time: 0.039885 maxerr: 563.760848
iter 26 time: 0.040128 maxerr: 553.046836
iter 27 time: 0.040071 maxerr: 542.920870
iter 28 time: 0.039904 maxerr: 533.331094
iter 29 time: 0.039919 maxerr: 524.231833
iter 30 time: 0.039921 maxerr: 515.582675
CharmLB> RotateLB: PE [0] step 1 starting at 1.648019 Memory: 75.172928 MB
CharmLB> RotateLB: PE [0] strategy starting at 1.648106
CharmLB> RotateLB: PE [0] Memory: LBManager: 920 KB CentralLB: 3 KB
CharmLB> RotateLB: PE [0] #Objects migrating: 8, LBMigrateMsg size: 0.00 MB
CharmLB> RotateLB: PE [0] strategy finished at 1.648112 duration 0.000006 s
CharmLB> RotateLB: PE [0] step 1 finished at 1.665523 duration 0.017504 s
iter 31 time: 0.050692 maxerr: 507.347718
iter 32 time: 0.040078 maxerr: 499.494943
iter 33 time: 0.040256 maxerr: 491.995690
iter 34 time: 0.040043 maxerr: 484.824219
iter 35 time: 0.040006 maxerr: 477.957338
iter 36 time: 0.040048 maxerr: 471.374089
iter 37 time: 0.040035 maxerr: 465.055477
iter 38 time: 0.040001 maxerr: 458.984241
iter 39 time: 0.040005 maxerr: 453.144656
iter 40 time: 0.040110 maxerr: 447.522361
iter 41 time: 0.040379 maxerr: 442.104210
iter 42 time: 0.040126 maxerr: 436.878145
iter 43 time: 0.040149 maxerr: 431.833082
iter 44 time: 0.040228 maxerr: 426.958810
iter 45 time: 0.040168 maxerr: 422.245909
iter 46 time: 0.040041 maxerr: 417.685669
iter 47 time: 0.040055 maxerr: 413.270025
iter 48 time: 0.040096 maxerr: 408.991494
iter 49 time: 0.039997 maxerr: 404.843126
iter 50 time: 0.040021 maxerr: 400.818454
CharmLB> RotateLB: PE [0] step 2 starting at 2.476987 Memory: 75.238968 MB
CharmLB> RotateLB: PE [0] strategy starting at 2.477029
CharmLB> RotateLB: PE [0] Memory: LBManager: 920 KB CentralLB: 3 KB
CharmLB> RotateLB: PE [0] #Objects migrating: 8, LBMigrateMsg size: 0.00 MB
CharmLB> RotateLB: PE [0] strategy finished at 2.477035 duration 0.000006 s
CharmLB> RotateLB: PE [0] step 2 finished at 2.493661 duration 0.016674 s
iter 51 time: 0.050363 maxerr: 396.911452
iter 52 time: 0.040102 maxerr: 393.116496
iter 53 time: 0.039939 maxerr: 389.428332
iter 54 time: 0.039998 maxerr: 385.842045
iter 55 time: 0.040045 maxerr: 382.353031
iter 56 time: 0.040046 maxerr: 378.956970
iter 57 time: 0.040027 maxerr: 375.649808
iter 58 time: 0.039957 maxerr: 372.427733
iter 59 time: 0.040017 maxerr: 369.287159
iter 60 time: 0.040044 maxerr: 366.224708
iter 61 time: 0.040012 maxerr: 363.237194
iter 62 time: 0.039956 maxerr: 360.321610
iter 63 time: 0.039989 maxerr: 357.475116
iter 64 time: 0.040022 maxerr: 354.695025
iter 65 time: 0.039989 maxerr: 351.978797
iter 66 time: 0.040025 maxerr: 349.324022
iter 67 time: 0.039996 maxerr: 346.728419
iter 68 time: 0.039968 maxerr: 344.189822
iter 69 time: 0.040082 maxerr: 341.706174
iter 70 time: 0.040181 maxerr: 339.275521
CharmLB> RotateLB: PE [0] step 3 starting at 3.302705 Memory: 75.305084 MB
CharmLB> RotateLB: PE [0] strategy starting at 3.302795
CharmLB> RotateLB: PE [0] Memory: LBManager: 920 KB CentralLB: 3 KB
CharmLB> RotateLB: PE [0] #Objects migrating: 8, LBMigrateMsg size: 0.00 MB
CharmLB> RotateLB: PE [0] strategy finished at 3.302802 duration 0.000007 s
CharmLB> RotateLB: PE [0] step 3 finished at 3.318951 duration 0.016246 s
iter 71 time: 0.049915 maxerr: 336.896006
iter 72 time: 0.040021 maxerr: 334.565860
iter 73 time: 0.040179 maxerr: 332.283400
iter 74 time: 0.040051 maxerr: 330.047020
iter 75 time: 0.040005 maxerr: 327.855193
iter 76 time: 0.040029 maxerr: 325.706456
iter 77 time: 0.040045 maxerr: 323.599418
iter 78 time: 0.040035 maxerr: 321.532746
iter 79 time: 0.040319 maxerr: 319.505169
iter 80 time: 0.040152 maxerr: 317.515469
iter 81 time: 0.040000 maxerr: 315.562481
iter 82 time: 0.040090 maxerr: 313.645090
iter 83 time: 0.040004 maxerr: 311.762228
iter 84 time: 0.040049 maxerr: 309.912871
iter 85 time: 0.040071 maxerr: 308.096037
iter 86 time: 0.039998 maxerr: 306.310783
iter 87 time: 0.040066 maxerr: 304.556206
iter 88 time: 0.039985 maxerr: 302.831437
iter 89 time: 0.040058 maxerr: 301.135641
iter 90 time: 0.040069 maxerr: 299.468016
WARNING: Unknown MPI_Info key given to AMPI_Migrate: ampi_load_balanceÿÿÿÿÿÿÿ%
From: Van Der Wijngaart, Rob F
Sent: Monday, November 28, 2016 3:59 PM
To: 'Phil Miller' <mille121 AT illinois.edu>
Cc: 'Sam White' <white67 AT illinois.edu>; 'charm AT cs.uiuc.edu' <charm AT cs.uiuc.edu>
Subject: RE: [charm] Adaptive MPI
For now I am overriding the load balancer test in the code that reads its key value and am just executing TCHARM_Migrate() whenever the key is found, regardless of its value.
Keep fingers crossed.
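[Editor's note: a hypothetical sketch of that kind of override, with illustrative names only and not the actual AMPI internals: scan the MPI_Info object for the "ampi_load_balance" key and migrate whenever it is present, ignoring the possibly corrupted value.]

/* Hypothetical sketch, not the real AMPI source: migrate whenever the
 * "ampi_load_balance" key is found, regardless of its value. */
#include <string.h>
#include <mpi.h>
#include "tcharm.h"   /* declares TCHARM_Migrate(); assumed available on the ampicc include path */

static void migrate_if_key_present(MPI_Info hints) {
  int nkeys = 0;
  MPI_Info_get_nkeys(hints, &nkeys);
  for (int i = 0; i < nkeys; i++) {
    char key[MPI_MAX_INFO_KEY];
    MPI_Info_get_nthkey(hints, i, key);
    if (strcmp(key, "ampi_load_balance") == 0) {
      TCHARM_Migrate();   /* the value may carry a stray trailing character, so ignore it */
      return;
    }
  }
}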
Hi Phil,
So far I had been using Charm 6.7.0, but I started to notice errors that appeared to be caused by the migration routines in AMPI, so I tried out the new version, 6.7.1. The way the load-balancing hints are read appears to be corrupted. Please see below for a run with an example from examples/ampi/Cjacobi3D. The first time the value of the load balancer key is read it is correct, but every subsequent time it is actually used, the library attaches a random character. I inserted a debug print that produces the line:
key 0 equals ampi_load_balance with value sync
Rob
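[Editor's note: for reference, a minimal sketch of the hint-passing pattern the runs below exercise, assuming the MPI_Info interface documented for AMPI 6.7.1 (key "ampi_load_balance", value "sync"); the function name is illustrative and not taken from the example code.]

#include <mpi.h>

/* Hedged sketch: create an MPI_Info object, set the documented
 * "ampi_load_balance" hint to "sync", and hand it to AMPI_Migrate()
 * at a synchronization point in the iteration loop. */
static void migrate_with_sync_hint(void) {
  MPI_Info hints;
  MPI_Info_create(&hints);
  MPI_Info_set(hints, "ampi_load_balance", "sync");
  AMPI_Migrate(hints);   /* AMPI reads the key/value pair set above */
  MPI_Info_free(&hints);
}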
rfvander@klondike:~/charm-6.7.1/examples/ampi/Cjacobi3D$ $HOME/charm-6.7.1/bin/charmrun ./jacobi 2 2
2 30 +p 2 +vp 8 +isomalloc_sync +balancer RotateLB +LBDebug 1
Running command: ./jacobi 2 2 2 30 +p 2 +vp 8 +isomalloc_sync +balancer RotateLB +LBDebug 1
Charm++: standalone mode (not using charmrun)
Charm++> Running in Multicore mode: 2 threads
Converse/Charm++ Commit ID:
Warning> Randomization of stack pointer is turned on in kernel.
Charm++> synchronizing isomalloc memory region...
[0] consolidated Isomalloc memory region: 0x440000000 - 0x7f5d00000000 (133532672 megs)
CharmLB> Verbose level 1, load balancing period: 0.5 seconds
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (16-way SMP).
Charm++> cpu topology info is gathered in 0.000 seconds.
[0] RotateLB created
iter 1 time: 0.078998 maxerr: 2020.200000
iter 2 time: 0.059326 maxerr: 1696.968000
iter 3 time: 0.050306 maxerr: 1477.170240
iter 4 time: 0.045964 maxerr: 1319.433024
iter 5 time: 0.045959 maxerr: 1200.918072
iter 6 time: 0.045985 maxerr: 1108.425519
iter 7 time: 0.045932 maxerr: 1033.970839
iter 8 time: 0.045992 maxerr: 972.509242
iter 9 time: 0.045941 maxerr: 920.721889
iter 10 time: 0.045945 maxerr: 876.344030
key 0 equals ampi_load_balance with value sync
key 0 equals ampi_load_balance with value sync
key 0 equals ampi_load_balance with value sync
key 0 equals ampi_load_balance with value sync
key 0 equals ampi_load_balance with value sync
key 0 equals ampi_load_balance with value sync
key 0 equals ampi_load_balance with value sync
key 0 equals ampi_load_balance with value sync
CharmLB> RotateLB: PE [0] step 0 starting at 0.853504 Memory: 72.253906 MB
CharmLB> RotateLB: PE [0] strategy starting at 0.853559
CharmLB> RotateLB: PE [0] Memory: LBManager: 920 KB CentralLB: 3 KB
CharmLB> RotateLB: PE [0] #Objects migrating: 8, LBMigrateMsg size: 0.00 MB
CharmLB> RotateLB: PE [0] strategy finished at 0.853564 duration 0.000005 s
CharmLB> RotateLB: PE [0] step 0 finished at 0.882196 duration 0.028692 s
iter 11 time: 0.063316 maxerr: 837.779089
iter 12 time: 0.046134 maxerr: 803.868831
iter 13 time: 0.046079 maxerr: 773.751705
iter 14 time: 0.046063 maxerr: 746.772667
iter 15 time: 0.046088 maxerr: 722.424056
iter 16 time: 0.046083 maxerr: 700.305763
iter 17 time: 0.046087 maxerr: 680.097726
iter 18 time: 0.046047 maxerr: 661.540528
iter 19 time: 0.044149 maxerr: 644.421422
iter 20 time: 0.040968 maxerr: 628.564089
iter 21 time: 0.040264 maxerr: 613.821009
iter 22 time: 0.040429 maxerr: 600.067696
iter 23 time: 0.040471 maxerr: 587.198273
iter 24 time: 0.040278 maxerr: 575.122054
iter 25 time: 0.040325 maxerr: 563.760848
iter 26 time: 0.040425 maxerr: 553.046836
iter 27 time: 0.040186 maxerr: 542.920870
iter 28 time: 0.040066 maxerr: 533.331094
iter 29 time: 0.040020 maxerr: 524.231833
key 0 equals ampi_load_balance with value synca
WARNING: Unknown MPI_Info value (synca) given to AMPI_Migrate for key: ampi_load_balance
key 0 equals ampi_load_balance with value synca
WARNING: Unknown MPI_Info value (synca) given to AMPI_Migrate for key: ampi_load_balance
key 0 equals ampi_load_balance with value synca
WARNING: Unknown MPI_Info value (synca) given to AMPI_Migrate for key: ampi_load_balance
key 0 equals ampi_load_balance with value synca
WARNING: Unknown MPI_Info value (synca) given to AMPI_Migrate for key: ampi_load_balance
key 0 equals ampi_load_balance with value synca
WARNING: Unknown MPI_Info value (synca) given to AMPI_Migrate for key: ampi_load_balance
key 0 equals ampi_load_balance with value synca
WARNING: Unknown MPI_Info value (synca) given to AMPI_Migrate for key: ampi_load_balance
key 0 equals ampi_load_balance with value synca
WARNING: Unknown MPI_Info value (synca) given to AMPI_Migrate for key: ampi_load_balance
iter 30 time: 0.040080 maxerr: 515.582675
key 0 equals ampi_load_balance with value synca
WARNING: Unknown MPI_Info value (synca) given to AMPI_Migrate for key: ampi_load_balance
[Partition 0][Node 0] End of program
From: unmobile AT gmail.com [mailto:unmobile AT gmail.com] On Behalf Of Phil Miller
Sent: Friday, November 25, 2016 2:09 PM
To: Van Der Wijngaart, Rob F <rob.f.van.der.wijngaart AT intel.com>
Cc: Sam White <white67 AT illinois.edu>; charm AT cs.uiuc.edu
Subject: Re: [charm] Adaptive MPI
Sam: It seems like it should be straightforward to add an assertion in our API entry/exit tracking sentries to catch this kind of issue. Essentially, it would need to check that the calling thread is actually an AMPI process thread that's supposed to be running. We should also document that PUP routines for AMPI code can't call MPI routines.
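[Editor's note: to make that rule concrete, a minimal sketch with a hypothetical chunk type and names, using the pup_* functions and pup_is* predicates from Charm++'s C PUP interface in pup_c.h: cache anything you would otherwise query via MPI, such as the rank, in the migrated data itself, and keep the PUP routine free of MPI_* calls.]

#include <stdio.h>
#include <stdlib.h>
#include "pup_c.h"

/* Hedged sketch of a PUP routine with no MPI calls.  The rank is
 * cached in the chunk via MPI_Comm_rank() during normal execution,
 * before migration, so tracing output here never needs MPI. */
typedef struct {
  int my_rank;    /* cached ahead of time, never queried here */
  int n;
  double *data;
} chunk;

void chunk_pup(pup_er p, chunk *c) {
  pup_int(p, &c->my_rank);
  pup_int(p, &c->n);
  if (pup_isUnpacking(p))
    c->data = (double *)malloc(c->n * sizeof(double));
  pup_doubles(p, c->data, c->n);
  printf("rank %d pupped %d doubles\n", c->my_rank, c->n);  /* no MPI_* calls */
  if (pup_isDeleting(p))
    free(c->data);
}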
On Thu, Nov 24, 2016 at 5:36 PM, Van Der Wijngaart, Rob F <rob.f.van.der.wijngaart AT intel.com> wrote:
Hi Sam,
I put the code away for a bit and just started looking at it again. I identified one major (and vexing) source of errors: I tried to get ranks to print what they were doing (using MPI_Comm_rank) inside the PUP routine, and also to synchronize (MPI_Barrier) to order the output. But that is evidently not valid inside the routine, depending on the mode with which it is called. The first two entries are fine, but once migration takes place, errors result. I took all MPI calls out of the PUP routine, and now the code progresses a lot further. It still bombs, but I am pretty sure I can track down the segmentation violation.
Happy Thanksgiving!
Rob
From: samt.white AT gmail.com [mailto:samt.white AT gmail.com] On Behalf Of Sam White
Sent: Wednesday, November 23, 2016 1:30 PM
Your code is failing inside the call to pup_isPacking(p)? Or it is failing while packing? A pup_er is indeed a pointer.
Also, you should still be using '+isomalloc_sync' whenever Charm gives you that warning during startup: even though you aren't using Isomalloc Heaps, AMPI is using Isomalloc Stacks for its user-level threads.
-Sam
On Wed, Nov 23, 2016 at 3:06 PM, Van Der Wijngaart, Rob F <rob.f.van.der.wijngaart AT intel.com> wrote:
Thanks, Sam. The code crashes inside AMPI_Migrate, so it doesn’t reach any print statements after that. I tracked down the statement that causes the crash. It is this one: pup_isPacking(p), where p is of type pup_er. I presume that is a pointer, so I printed it as such. They all look like reasonable addresses to me. None of the ranks prints NULL.
Rob
From: samt.white AT gmail.com [mailto:samt.white AT gmail.com] On Behalf Of Sam White
Sent: Wednesday, November 23, 2016 12:21 PM
To: Van Der Wijngaart, Rob F <rob.f.van.der.wijngaart AT intel.com>
Cc: charm AT cs.uiuc.edu
Subject: Re: Adaptive MPI
The Isomalloc failure appears to be a locking issue during Charm/Converse startup in SMP/multicore builds when running with Isomalloc. We are looking at this now: https://charm.cs.illinois.edu/redmine/issues/1310.
If you switch to a non-SMP/multicore build it will work.
To debug the issue with your PUP code, I would suggest adding print statements before/after your AMPI_Migrate() call and inside the PUP routine. When debugging these types of issues, it often helps to see where in the PUP process (sizing, packing, deleting, unpacking) the runtime is when it fails.
-Sam
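[Editor's note: a minimal sketch of that tracing pattern, with hypothetical routine and argument names; the pup_is* predicates come from Charm++'s C PUP interface in pup_c.h.]

#include <stdio.h>
#include "pup_c.h"

/* Report which phase of the PUP process the runtime is in. */
static const char *pup_phase(pup_er p) {
  if (pup_isSizing(p))    return "sizing";
  if (pup_isPacking(p))   return "packing";
  if (pup_isUnpacking(p)) return "unpacking";
  if (pup_isDeleting(p))  return "deleting";
  return "unknown";
}

void dchunkpup(pup_er p, void *chunk) {
  (void)chunk;  /* fields would be pupped here with pup_* calls */
  printf("dchunkpup: phase = %s\n", pup_phase(p));
}

/* At the call site, bracket the migration itself:
 *   printf("before AMPI_Migrate\n");  ... call AMPI_Migrate here ...  printf("after AMPI_Migrate\n"); */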
On Wed, Nov 23, 2016 at 11:28 AM, Van Der Wijngaart, Rob F <rob.f.van.der.wijngaart AT intel.com> wrote:
Hi Sam,
The first experiment was successful, but the isomalloc example hangs; see below. Unless it is a symptom of something bigger, I am not going to worry about the latter, since I wasn’t planning to use isomalloc for heap migration anyway. My regular MPI code on which the AMPI version is based runs fine for all the parameters I have tried, but I reckon that it may contain a memory bug that manifests itself only with load balancing.
Rob
rfvander@klondike:~/Cjacobi3D$ make
/opt/charm/charm-6.7.0/multicore-linux64/bin/ampicxx -c jacobi.C
/opt/charm/charm-6.7.0/multicore-linux64/bin/ampicxx -o jacobi jacobi.o -module CommonLBs -lm
/opt/charm/charm-6.7.0/multicore-linux64/bin/ampicxx -c -DNO_PUP jacobi.C -o jacobi.iso.o
/opt/charm/charm-6.7.0/multicore-linux64/bin/ampicxx -o jacobi.iso jacobi.iso.o -module CommonLBs -memory isomalloc
/opt/charm/charm-6.7.0/multicore-linux64/bin/ampicxx -c -tlsglobal jacobi.C -o jacobi.tls.o
/opt/charm/charm-6.7.0/multicore-linux64/bin/ampicxx -o jacobi.tls jacobi.tls.o -tlsglobal -module CommonLBs #-memory isomalloc
/opt/charm/charm-6.7.0/multicore-linux64/bin/../lib/libconv-util.a(sockRoutines.o): In function `skt_lookup_ip':
sockRoutines.c:(.text+0x334): warning: Using 'gethostbyname' in statically linked applications requires at runtime the shared
libraries from the glibc version used for linking
/opt/charm/charm-6.7.0/multicore-linux64/bin/ampicxx -c jacobi-get.C
/opt/charm/charm-6.7.0/multicore-linux64/bin/ampicxx -o jacobi-get jacobi-get.o -module CommonLBs -lm
rfvander@klondike:~/Cjacobi3D$ ./charmrun +p3 ./jacobi 2 2 2 +vp8 +balancer RotateLB +LBDebug 1
Running command: ./jacobi 2 2 2 +vp8 +balancer RotateLB +LBDebug 1 +p3
Charm++: standalone mode (not using charmrun)
Charm++> Running in Multicore mode: 3 threads
Converse/Charm++ Commit ID: v6.7.0-1-gca55e1d
Warning> Randomization of stack pointer is turned on in kernel, thread migration may not work! Run 'echo 0 > /proc/sys/kernel/randomize_va_space'
as root to disable it, or try run with '+isomalloc_sync'.
CharmLB> Verbose level 1, load balancing period: 0.5 seconds
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (16-way SMP).
Charm++> cpu topology info is gathered in 0.000 seconds.
[0] RotateLB created
iter 1 time: 0.142733 maxerr: 2020.200000
iter 2 time: 0.157225 maxerr: 1696.968000
iter 3 time: 0.172039 maxerr: 1477.170240
iter 4 time: 0.146178 maxerr: 1319.433024
iter 5 time: 0.123098 maxerr: 1200.918072
iter 6 time: 0.131063 maxerr: 1108.425519
iter 7 time: 0.138213 maxerr: 1033.970839
iter 8 time: 0.138295 maxerr: 972.509242
iter 9 time: 0.138113 maxerr: 920.721889
iter 10 time: 0.121553 maxerr: 876.344030
CharmLB> RotateLB: PE [0] step 0 starting at 1.489509 Memory: 72.253906 MB
CharmLB> RotateLB: PE [0] strategy starting at 1.489573
CharmLB> RotateLB: PE [0] Memory: LBManager: 920 KB CentralLB: 3 KB
CharmLB> RotateLB: PE [0] #Objects migrating: 8, LBMigrateMsg size: 0.00 MB
CharmLB> RotateLB: PE [0] strategy finished at 1.489592 duration 0.000019 s
CharmLB> RotateLB: PE [0] step 0 finished at 1.507922 duration 0.018413 s
iter 11 time: 0.152840 maxerr: 837.779089
iter 12 time: 0.136401 maxerr: 803.868831
iter 13 time: 0.138095 maxerr: 773.751705
iter 14 time: 0.139319 maxerr: 746.772667
iter 15 time: 0.139327 maxerr: 722.424056
iter 16 time: 0.141794 maxerr: 700.305763
iter 17 time: 0.142484 maxerr: 680.097726
iter 18 time: 0.141056 maxerr: 661.540528
iter 19 time: 0.153895 maxerr: 644.421422
iter 20 time: 0.198588 maxerr: 628.564089
[Partition 0][Node 0] End of program
rfvander@klondike:~/Cjacobi3D$ ./charmrun +p3 ./jacobi.iso 2 2 2 +vp8 +balancer RotateLB +LBDebug 1
Running command: ./jacobi.iso 2 2 2 +vp8 +balancer RotateLB +LBDebug 1 +p3
Charm++: standalone mode (not using charmrun)
Charm++> Running in Multicore mode: 3 threads
^C
rfvander@klondike:~/Cjacobi3D$ ./charmrun +p3 ./jacobi.iso 2 2 2 +vp8 +balancer RotateLB +LBDebug 1 +isomalloc_sync
Running command: ./jacobi.iso 2 2 2 +vp8 +balancer RotateLB +LBDebug 1 +isomalloc_sync +p3
Charm++: standalone mode (not using charmrun)
Charm++> Running in Multicore mode: 3 threads
From: samt.white AT gmail.com [mailto:samt.white AT gmail.com] On Behalf Of Sam White
Sent: Wednesday, November 23, 2016 7:10 AM
To: Van Der Wijngaart, Rob F <rob.f.van.der.wijngaart AT intel.com>
Cc: charm AT cs.uiuc.edu
Subject: Re: Adaptive MPI
Can you try an example AMPI program with load balancing? You can try charm/examples/ampi/Cjacobi3D/, running with something like './charmrun +p3 ./jacobi 2 2 2 +vp8 +balancer RotateLB +LBDebug 1'. You can also test that example with Isomalloc by running jacobi.iso (and as the warning in the Charm preamble output suggests, run with +isomalloc_sync). It also might help to build Charm++/AMPI with '-g' to get stacktraces.
-Sam
On Wed, Nov 23, 2016 at 2:19 AM, Van Der Wijngaart, Rob F <rob.f.van.der.wijngaart AT intel.com> wrote:
Hello Team,
I am trying to troubleshoot my Adaptive MPI code that uses dynamic load balancing. It crashes with a segmentation fault in AMPI_Migrate. I checked, and dchunkpup (which I supplied) is called within AMPI_Migrate and finishes on all ranks. That is not to say it is correct, but the crash is not happening there. It could have corrupted memory elsewhere, though, so I gutted it so that it only asks for and prints the MPI rank of the ranks entering it. I added graceful exit code after the call to AMPI_Migrate, but that is evidently not reached. I understand that this information is not enough for you to identify the problem, but at present I don’t know where to start, since the error occurs in code that I did not write. Could you give me some pointers on where to start? Thanks!
Below is some relevant output. If I replace the RotateLB load balancer with RefineLB, some ranks do pass the AMPI_Migrate call, but that is evidently because the load balancer left them alone.
Rob
rfvander@klondike:~/esg-prk-devel/AMPI/AMR$ make clean; make amr USE_PUPER=1
rm -f amr.o MPI_bail_out.o wtime.o amr *.optrpt *~ charmrun stats.json amr.decl.h amr.def.h
/opt/charm/charm-6.7.0/multicore-linux64/bin/ampicc -O3 -std=c99 -DADAPTIVE_MPI -DRESTRICT_KEYWORD=0 -DVERBOSE=0 -DDOUBLE=1
-DRADIUS=2 -DSTAR=1 -DLOOPGEN=0 -DUSE_PUPER=1 -I../../include -c amr.c
In file included from amr.c:66:0:
../../include/par-res-kern_general.h: In function ‘prk_malloc’:
../../include/par-res-kern_general.h:136:11: warning: implicit declaration of function ‘posix_memalign’ [-Wimplicit-function-declaration]
ret = posix_memalign(&ptr,alignment,bytes);
^
amr.c: In function ‘AMPI_Main’:
amr.c:842:14: warning: format ‘%d’ expects argument of type ‘int’, but argument 3 has type ‘long int’ [-Wformat=]
printf("ERROR: rank %d's BG work tile smaller than stencil radius: %d\n",
^
amr.c:1080:14: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has type ‘long int’ [-Wformat=]
printf("ERROR: rank %d's work tile %d smaller than stencil radius: %d\n",
^
amr.c:1518:14: warning: format ‘%d’ expects argument of type ‘int’, but argument 3 has type ‘long int’ [-Wformat=]
printf("Rank %d about to call AMPI_Migrate in iter %d\n", my_ID, iter);
^
amr.c:1520:14: warning: format ‘%d’ expects argument of type ‘int’, but argument 3 has type ‘long int’ [-Wformat=]
printf("Rank %d called AMPI_Migrate in iter %d\n", my_ID, iter);
^
/opt/charm/charm-6.7.0/multicore-linux64/bin/ampicc -O3 -std=c99 -DADAPTIVE_MPI -DRESTRICT_KEYWORD=0 -DVERBOSE=0 -DDOUBLE=1
-DRADIUS=2 -DSTAR=1 -DLOOPGEN=0 -DUSE_PUPER=1 -I../../include -c ../../common/MPI_bail_out.c
In file included from ../../common/MPI_bail_out.c:51:0:
../../include/par-res-kern_general.h: In function ‘prk_malloc’:
../../include/par-res-kern_general.h:136:11: warning: implicit declaration of function ‘posix_memalign’ [-Wimplicit-function-declaration]
ret = posix_memalign(&ptr,alignment,bytes);
^
/opt/charm/charm-6.7.0/multicore-linux64/bin/ampicc -O3 -std=c99 -DADAPTIVE_MPI -DRESTRICT_KEYWORD=0 -DVERBOSE=0 -DDOUBLE=1
-DRADIUS=2 -DSTAR=1 -DLOOPGEN=0 -DUSE_PUPER=1 -I../../include -c ../../common/wtime.c
/opt/charm/charm-6.7.0/multicore-linux64/bin/ampicc -language ampi -o amr -O3 -std=c99 -DADAPTIVE_MPI amr.o MPI_bail_out.o
wtime.o -lm -module CommonLBs
cc1plus: warning: command line option ‘-std=c99’ is valid for C/ObjC but not for C++
rfvander@klondike:~/esg-prk-devel/AMPI/AMR$ /opt/charm/charm-6.7.0/bin/charmrun ./amr 20 1000 500 3 10 5 1 FINE_GRAIN +p
8 +vp 16 +balancer RotateLB +LBDebug 1
Running command: ./amr 20 1000 500 3 10 5 1 FINE_GRAIN +p 8 +vp 16 +balancer RotateLB +LBDebug 1
Charm++: standalone mode (not using charmrun)
Charm++> Running in Multicore mode: 8 threads
Converse/Charm++ Commit ID: v6.7.0-1-gca55e1d
Warning> Randomization of stack pointer is turned on in kernel, thread migration may not work! Run 'echo 0 > /proc/sys/kernel/randomize_va_space'
as root to disable it, or try run with '+isomalloc_sync'.
CharmLB> Verbose level 1, load balancing period: 0.5 seconds
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (16-way SMP).
Charm++> cpu topology info is gathered in 0.001 seconds.
[0] RotateLB created
Parallel Research Kernels Version 2.17
MPI AMR stencil execution on 2D grid
Number of ranks = 16
Background grid size = 1000
Radius of stencil = 2
Tiles in x/y-direction on BG = 4/4
Tiles in x/y-direction on ref 0 = 4/4
Tiles in x/y-direction on ref 1 = 4/4
Tiles in x/y-direction on ref 2 = 4/4
Tiles in x/y-direction on ref 3 = 4/4
Type of stencil = star
Data type = double precision
Compact representation of stencil loop body
Number of iterations = 20
Load balancer = FINE_GRAIN
Refinement rank spread = 16
Refinements:
Background grid points = 500
Grid size = 3993
Refinement level = 3
Period = 10
Duration = 5
Sub-iterations = 1
Rank 12 about to call AMPI_Migrate in iter 0
Rank 12 entered dchunkpup
Rank 7 about to call AMPI_Migrate in iter 0
Rank 7 entered dchunkpup
Rank 8 about to call AMPI_Migrate in iter 0
Rank 8 entered dchunkpup
Rank 4 about to call AMPI_Migrate in iter 0
Rank 4 entered dchunkpup
Rank 15 about to call AMPI_Migrate in iter 0
Rank 15 entered dchunkpup
Rank 11 about to call AMPI_Migrate in iter 0
Rank 11 entered dchunkpup
Rank 3 about to call AMPI_Migrate in iter 0
Rank 1 about to call AMPI_Migrate in iter 0
Rank 1 entered dchunkpup
Rank 3 entered dchunkpup
Rank 13 about to call AMPI_Migrate in iter 0
Rank 13 entered dchunkpup
Rank 6 about to call AMPI_Migrate in iter 0
Rank 6 entered dchunkpup
Rank 0 about to call AMPI_Migrate in iter 0
Rank 0 entered dchunkpup
Rank 9 about to call AMPI_Migrate in iter 0
Rank 9 entered dchunkpup
Rank 5 about to call AMPI_Migrate in iter 0
Rank 5 entered dchunkpup
Rank 2 about to call AMPI_Migrate in iter 0
Rank 2 entered dchunkpup
Rank 10 about to call AMPI_Migrate in iter 0
Rank 10 entered dchunkpup
Rank 14 about to call AMPI_Migrate in iter 0
Rank 14 entered dchunkpup
CharmLB> RotateLB: PE [0] step 0 starting at 0.507547 Memory: 990.820312 MB
CharmLB> RotateLB: PE [0] strategy starting at 0.511685
CharmLB> RotateLB: PE [0] Memory: LBManager: 920 KB CentralLB: 19 KB
CharmLB> RotateLB: PE [0] #Objects migrating: 16, LBMigrateMsg size: 0.00 MB
CharmLB> RotateLB: PE [0] strategy finished at 0.511696 duration 0.000011 s
Segmentation fault (core dumped)