Hello Team,
I am trying to troubleshoot my Adaptive MPI code, which uses dynamic load balancing. It crashes with a segmentation
fault in AMPI_Migrate. I verified that dchunkpup (the pup routine I supplied) is called from within AMPI_Migrate and finishes on all ranks. That does not prove it is correct, but the crash is not happening there. Since it could still have corrupted memory elsewhere, I gutted
it so that it only queries and prints the MPI rank of each rank entering it. I also added graceful-exit code after the call to AMPI_Migrate, but that is evidently never reached. I understand that this information is not enough for you to identify the problem,
but at present I don't know where to start, since the error occurs in code that I did not write. Could you give me some pointers on where to begin? Thanks!
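For reference, the gutted routine looks roughly like this. This is only a sketch, assuming AMPI's C pup interface (pup_er and pup_int from pup_c.h) and a chunk struct previously registered with MPI_Register; the struct contents and field names here are placeholders, not my actual code:

```c
/* Sketch of the gutted pup routine (assumed names: dchunk, dchunkpup;
   the real application state has been stripped out for debugging). */
#include <stdio.h>
#include "pup_c.h"   /* pup_er, pup_int -- from the Charm++/AMPI headers */

typedef struct {
    int my_ID;
    /* ... all real application buffers elided while isolating the crash ... */
} dchunk;

void dchunkpup(pup_er p, void *data)
{
    dchunk *c = (dchunk *)data;
    printf("Rank %d entered dchunkpup\n", c->my_ID);
    /* Only the trivial state is pupped; nothing else is packed/unpacked. */
    pup_int(p, &c->my_ID);
    printf("Rank %d leaving dchunkpup\n", c->my_ID);
}
```

Even with the routine reduced to this, the segmentation fault still occurs after the load balancer's strategy phase, as the output below shows.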
Below is some relevant output. If I replace the RotateLB load balancer with RefineLB, some ranks do pass the AMPI_Migrate
call, but that is evidently because the load balancer left them alone.
Rob
rfvander@klondike:~/esg-prk-devel/AMPI/AMR$ make clean; make amr USE_PUPER=1
rm -f amr.o MPI_bail_out.o wtime.o amr *.optrpt *~ charmrun stats.json amr.decl.h amr.def.h
/opt/charm/charm-6.7.0/multicore-linux64/bin/ampicc -O3 -std=c99 -DADAPTIVE_MPI -DRESTRICT_KEYWORD=0 -DVERBOSE=0 -DDOUBLE=1
-DRADIUS=2 -DSTAR=1 -DLOOPGEN=0 -DUSE_PUPER=1 -I../../include -c amr.c
In file included from amr.c:66:0:
../../include/par-res-kern_general.h: In function ‘prk_malloc’:
../../include/par-res-kern_general.h:136:11: warning: implicit declaration of function ‘posix_memalign’ [-Wimplicit-function-declaration]
ret = posix_memalign(&ptr,alignment,bytes);
^
amr.c: In function ‘AMPI_Main’:
amr.c:842:14: warning: format ‘%d’ expects argument of type ‘int’, but argument 3 has type ‘long int’ [-Wformat=]
printf("ERROR: rank %d's BG work tile smaller than stencil radius: %d\n",
^
amr.c:1080:14: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has type ‘long int’ [-Wformat=]
printf("ERROR: rank %d's work tile %d smaller than stencil radius: %d\n",
^
amr.c:1518:14: warning: format ‘%d’ expects argument of type ‘int’, but argument 3 has type ‘long int’ [-Wformat=]
printf("Rank %d about to call AMPI_Migrate in iter %d\n", my_ID, iter);
^
amr.c:1520:14: warning: format ‘%d’ expects argument of type ‘int’, but argument 3 has type ‘long int’ [-Wformat=]
printf("Rank %d called AMPI_Migrate in iter %d\n", my_ID, iter);
^
/opt/charm/charm-6.7.0/multicore-linux64/bin/ampicc -O3 -std=c99 -DADAPTIVE_MPI -DRESTRICT_KEYWORD=0 -DVERBOSE=0 -DDOUBLE=1
-DRADIUS=2 -DSTAR=1 -DLOOPGEN=0 -DUSE_PUPER=1 -I../../include -c ../../common/MPI_bail_out.c
In file included from ../../common/MPI_bail_out.c:51:0:
../../include/par-res-kern_general.h: In function ‘prk_malloc’:
../../include/par-res-kern_general.h:136:11: warning: implicit declaration of function ‘posix_memalign’ [-Wimplicit-function-declaration]
ret = posix_memalign(&ptr,alignment,bytes);
^
/opt/charm/charm-6.7.0/multicore-linux64/bin/ampicc -O3 -std=c99 -DADAPTIVE_MPI -DRESTRICT_KEYWORD=0 -DVERBOSE=0 -DDOUBLE=1
-DRADIUS=2 -DSTAR=1 -DLOOPGEN=0 -DUSE_PUPER=1 -I../../include -c ../../common/wtime.c
/opt/charm/charm-6.7.0/multicore-linux64/bin/ampicc -language ampi -o amr -O3 -std=c99 -DADAPTIVE_MPI amr.o MPI_bail_out.o
wtime.o -lm -module CommonLBs
cc1plus: warning: command line option ‘-std=c99’ is valid for C/ObjC but not for C++
rfvander@klondike:~/esg-prk-devel/AMPI/AMR$ /opt/charm/charm-6.7.0/bin/charmrun ./amr 20 1000 500 3 10 5 1 FINE_GRAIN +p
8 +vp 16 +balancer RotateLB +LBDebug 1
Running command: ./amr 20 1000 500 3 10 5 1 FINE_GRAIN +p 8 +vp 16 +balancer RotateLB +LBDebug 1
Charm++: standalone mode (not using charmrun)
Charm++> Running in Multicore mode: 8 threads
Converse/Charm++ Commit ID: v6.7.0-1-gca55e1d
Warning> Randomization of stack pointer is turned on in kernel, thread migration may not work! Run 'echo 0 > /proc/sys/kernel/randomize_va_space'
as root to disable it, or try run with '+isomalloc_sync'.
CharmLB> Verbose level 1, load balancing period: 0.5 seconds
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (16-way SMP).
Charm++> cpu topology info is gathered in 0.001 seconds.
[0] RotateLB created
Parallel Research Kernels Version 2.17
MPI AMR stencil execution on 2D grid
Number of ranks = 16
Background grid size = 1000
Radius of stencil = 2
Tiles in x/y-direction on BG = 4/4
Tiles in x/y-direction on ref 0 = 4/4
Tiles in x/y-direction on ref 1 = 4/4
Tiles in x/y-direction on ref 2 = 4/4
Tiles in x/y-direction on ref 3 = 4/4
Type of stencil = star
Data type = double precision
Compact representation of stencil loop body
Number of iterations = 20
Load balancer = FINE_GRAIN
Refinement rank spread = 16
Refinements:
Background grid points = 500
Grid size = 3993
Refinement level = 3
Period = 10
Duration = 5
Sub-iterations = 1
Rank 12 about to call AMPI_Migrate in iter 0
Rank 12 entered dchunkpup
Rank 7 about to call AMPI_Migrate in iter 0
Rank 7 entered dchunkpup
Rank 8 about to call AMPI_Migrate in iter 0
Rank 8 entered dchunkpup
Rank 4 about to call AMPI_Migrate in iter 0
Rank 4 entered dchunkpup
Rank 15 about to call AMPI_Migrate in iter 0
Rank 15 entered dchunkpup
Rank 11 about to call AMPI_Migrate in iter 0
Rank 11 entered dchunkpup
Rank 3 about to call AMPI_Migrate in iter 0
Rank 1 about to call AMPI_Migrate in iter 0
Rank 1 entered dchunkpup
Rank 3 entered dchunkpup
Rank 13 about to call AMPI_Migrate in iter 0
Rank 13 entered dchunkpup
Rank 6 about to call AMPI_Migrate in iter 0
Rank 6 entered dchunkpup
Rank 0 about to call AMPI_Migrate in iter 0
Rank 0 entered dchunkpup
Rank 9 about to call AMPI_Migrate in iter 0
Rank 9 entered dchunkpup
Rank 5 about to call AMPI_Migrate in iter 0
Rank 5 entered dchunkpup
Rank 2 about to call AMPI_Migrate in iter 0
Rank 2 entered dchunkpup
Rank 10 about to call AMPI_Migrate in iter 0
Rank 10 entered dchunkpup
Rank 14 about to call AMPI_Migrate in iter 0
Rank 14 entered dchunkpup
CharmLB> RotateLB: PE [0] step 0 starting at 0.507547 Memory: 990.820312 MB
CharmLB> RotateLB: PE [0] strategy starting at 0.511685
CharmLB> RotateLB: PE [0] Memory: LBManager: 920 KB CentralLB: 19 KB
CharmLB> RotateLB: PE [0] #Objects migrating: 16, LBMigrateMsg size: 0.00 MB
CharmLB> RotateLB: PE [0] strategy finished at 0.511696 duration 0.000011 s
Segmentation fault (core dumped)