charm AT lists.siebelschool.illinois.edu
Subject: Charm++ parallel programming system
List archive
- From: Eric Bohm <ebohm AT illinois.edu>
- To: <charm AT lists.cs.illinois.edu>
- Subject: Re: [charm] memory management errors after ckexit called
- Date: Thu, 20 Aug 2015 14:00:02 -0500
I can confirm that there is an occasional crash. However, the manifestation is different from the one previously reported under this subject line. In this case, the crash that I can reproduce occurs during startup, though it may require hundreds of repetitions for it to occur. Does that match your experience?

So far the most sensible stack trace I've been able to extract implicates string handling of command line arguments:

#5  0x000000000052914d in std::set<std::string, std::less<std::string>, std::allocator<std::string> >::insert(std::string&&) ()
#6  0x00000000005223ae in _registerCommandLineOpt(char const*) ()
#7  0x0000000000593a73 in LBDatabase::initnodeFn() ()
#8  0x0000000000523e51 in InitCallTable::enumerateInitCalls() ()

On 08/19/2015 01:37 PM, Scott Field wrote:

Hi Phil,
Thank you (and the charm++ development team) for taking a look into this issue.

There might be a remaining problem since the segfaults continue to appear. I'm using the most recent version (git hash 28284bb5e62196febdf7a72fce5cbba1e3613639) and have built charm++ with './build charm++ multicore-linux64 gcc -j3 -std=c++11'. The segfaults appear in some of the included examples, for example the one located in 'examples/charm++/hello/1darray'. Three such errors were generated after running 50 executions of './hello +p2'. As before, no errors are produced when run with +p1.
Best,
Scott
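Frames #5-#8 of the startup trace quoted above show a std::set<std::string> insertion reached from an initnode registration path. The sketch below is only an illustration of why an unguarded shared set is fragile under concurrent registration: the registry name and option strings are made up, and whether the real registration code can in fact be reached from more than one thread at startup is exactly the open question. Removing the mutex turns the inserts into a data race that can corrupt the set and crash only once in many runs, which would be consistent with a crash needing hundreds of repetitions to appear.

#include <mutex>
#include <set>
#include <string>
#include <thread>

// Shared option registry, similar in shape to a set of recognized
// command-line options filled in during startup. Names are illustrative.
static std::set<std::string> g_opts;
static std::mutex g_opts_mutex;  // without this lock, concurrent inserts
                                 // are a data race and can corrupt the set

static void registerOpt(const char *name) {
    std::lock_guard<std::mutex> lock(g_opts_mutex);
    g_opts.insert(name);
}

int main() {
    // Two threads registering options at startup, as an SMP/multicore run
    // might if the registration path were reached concurrently.
    std::thread t1([] { for (int i = 0; i < 100000; ++i) registerOpt("+balancer"); });
    std::thread t2([] { for (int i = 0; i < 100000; ++i) registerOpt("+LBPeriod"); });
    t1.join();
    t2.join();
    return 0;
}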
On Tue, Aug 18, 2015 at 12:01 PM, Phil Miller <mille121 AT illinois.edu> wrote:

There's been a longer delay than we'd hoped, but we believe this issue is now resolved: https://charm.cs.illinois.edu/redmine/issues/761

Please let us know if you encounter further issues.

On Sun, Jun 21, 2015 at 5:37 PM, Phil Miller <mille121 AT illinois.edu> wrote:
The case I have seems to reproduce the issue with 100% reliability. I've entered it in our bug tracker here: https://charm.cs.illinois.edu/redmine/issues/761

Hopefully, we'll have this fixed in the next week or so.

On Sun, Jun 21, 2015 at 5:33 PM, Scott Field <sfield AT astro.cornell.edu> wrote:
Hi Phil,
Thanks for taking a closer look at this. It sounds like you have some good leads, but if there's anything I can do on my end please let me know. I did have a look at the jacobi3d example (using "./jacobi3d 10 10 +p3") and found the error to occur more frequently as compared to simplearrayhello. On my machine, I needed to run "MALLOC_CHECK_=3 ./hello +p4 10" 50 times before encountering an error. By comparison, our regression tests fail about 50% of the time.
Best,
Scott
On Sat, Jun 20, 2015 at 4:44 PM, Phil Miller <mille121 AT illinois.edu> wrote:
I just tried to reproduce your report on a simple example program, tests/charm++/simplearrayhello. Running that with the command line "MALLOC_CHECK_=3 ./hello +p4 10" seems to show the issue. I think we can take it from here - no need for you to do further testing on your end.
On Fri, Jun 19, 2015 at 5:42 PM, Phil Miller <mille121 AT illinois.edu> wrote:
Could you please post the backtrace from the two calls to CkExit()?

Could you also try your test on these alternative builds of Charm++:

net-linux-x86_64-smp
netlrts-linux-x86_64-smp

Instead of just +p P, you'll need to pass +p P +ppn P when launching your program. This is to test a hypothesis that the switch from an older (net) to newer (netlrts) generation of the underlying infrastructure broke the code in some subtle way. Alternately, you could try just reverting commit 61718f94316d22087075f213e9f1d60d9efbdb95 and building/running multicore-linux64 against that.

Thank you for your help in hunting this down. If you have a reasonably small test case that you'd rather just pass us to see if it's a runtime system bug, we'd be happy to take a look.
On Wed, Jun 17, 2015 at 3:01 PM, Scott Field <sfield AT astro.cornell.edu> wrote:

Hi Phil,

On Tue, Jun 16, 2015 at 4:25 PM, Phil Miller <mille121 AT illinois.edu> wrote:

Hi Scott,
This list is definitely an appropriate place to post about potential bugs. Thanks for bringing it up.

Thank you for the tips. Disabling production didn't catch anything, unfortunately. Running with +p1 is fine. Errors occur only when more than 1 thread is used.
Can you compile the whole thing with -g and run it under gdb to see which ostensibly invalid pointer is actually being freed at that point?
Sure. The charm++ build is now ./build charm++ multicore-linux64 -g -std=c++11

I set my breakpoint at CkExit and get the following output with +p1 (reported here to compare with +p2):
Breakpoint 1, CkExit () at init.C:895
895         envelope *env = _allocEnv(StartExitMsg);
(gdb)
(gdb) n
896         env->setSrcPe(CkMyPe());
(gdb)
897         CmiSetHandler(env, _exitHandlerIdx);
(gdb)
898         CmiSyncSendAndFree(0, env->getTotalsize(), (char *)env);
(gdb)
903         if(!CharmLibInterOperate)
(gdb)
904         CsdScheduler(-1);
(gdb)
Breakpoint 1, CkExit () at init.C:895
895         envelope *env = _allocEnv(StartExitMsg);
(gdb)
896         env->setSrcPe(CkMyPe());
(gdb)
897         CmiSetHandler(env, _exitHandlerIdx);
(gdb)
898         CmiSyncSendAndFree(0, env->getTotalsize(), (char *)env);
(gdb)
903         if(!CharmLibInterOperate)
(gdb)
904         CsdScheduler(-1);
(gdb)
[Partition 0][Node 0] End of program
[Inferior 1 (process 7163) exited normally]
I was a bit surprised that CkExit was called twice. Anyway, running with +p2 produces
Breakpoint 1, CkExit () at init.C:895
895         envelope *env = _allocEnv(StartExitMsg);
(gdb) n
896         env->setSrcPe(CkMyPe());
(gdb)
897         CmiSetHandler(env, _exitHandlerIdx);
(gdb)
898         CmiSyncSendAndFree(0, env->getTotalsize(), (char *)env);
(gdb)
903         if(!CharmLibInterOperate)
(gdb)
904         CsdScheduler(-1);
(gdb)
Breakpoint 1, CkExit () at init.C:895
895         envelope *env = _allocEnv(StartExitMsg);
(gdb)
896         env->setSrcPe(CkMyPe());
(gdb)
897         CmiSetHandler(env, _exitHandlerIdx);
(gdb)
898         CmiSyncSendAndFree(0, env->getTotalsize(), (char *)env);
(gdb)
903         if(!CharmLibInterOperate)
(gdb)
904         CsdScheduler(-1);
(gdb) step
CsdScheduler (maxmsgs=-1) at convcore.c:1797
1797        if (maxmsgs<0) CsdScheduleForever();
(gdb)
CsdScheduleForever () at convcore.c:1848
1848        int isIdle=0;
(gdb)
CsdSchedulerState_new (s=0x7fffffffd5b0) at convcore.c:1660
1660        s->localQ=CpvAccess(CmiLocalQueue);
(gdb) n

Program received signal SIGSEGV, Segmentation fault.
CsdSchedulerState_new (s=0x7fffffffd5b0) at convcore.c:1660
1660        s->localQ=CpvAccess(CmiLocalQueue);
Best,
Scott
On Tue, Jun 16, 2015 at 3:14 PM, Scott Field <sfield AT astro.cornell.edu> wrote:
Hi,
Recently, after pulling a bleeding-edge version of the charm++ code, all of our regression tests now fail with either a segmentation fault or "double free or corruption (!prev): 0x0000000001c4de20 ***". The error appears to occur after CkExit is called.
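For context, the failures show up around the ordinary Charm++ shutdown path: a mainchare calls CkExit() once the work has completed, and the runtime then drains and exits. Below is a minimal sketch in the spirit of the bundled examples/charm++/hello/1darray example; the names, the element count, and the trivial work are illustrative and are not our actual test code.

// hello.ci (interface file, processed by charmc):
//   mainmodule hello {
//     readonly CProxy_Main mainProxy;
//     mainchare Main {
//       entry Main(CkArgMsg *m);
//       entry void done();
//     };
//     array [1D] Hello {
//       entry Hello();
//       entry void sayHi();
//     };
//   };

#include "hello.decl.h"

/* readonly */ CProxy_Main mainProxy;

class Main : public CBase_Main {
  int reported;      // elements that have checked in so far
  int numElements;
public:
  Main(CkArgMsg *m) : reported(0), numElements(4) {
    delete m;
    mainProxy = thisProxy;
    // Create a 1D chare array and broadcast the (trivial) work.
    CProxy_Hello arr = CProxy_Hello::ckNew(numElements);
    arr.sayHi();
  }
  void done() {
    // Shut the runtime down once every element has reported; the errors
    // described above appear after this point, on the way out.
    if (++reported == numElements) CkExit();
  }
};

class Hello : public CBase_Hello {
public:
  Hello() {}
  Hello(CkMigrateMessage *m) {}
  void sayHi() {
    CkPrintf("Hello from element %d on PE %d\n", thisIndex, CkMyPe());
    mainProxy.done();
  }
};

#include "hello.def.h"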
Charm++ was built on my laptop with

>>> ./build charm++ multicore-linux32 gcc --with-production -j3 -std=c++11
Using git's bisect utility, I was able to track down the first commit where things go wrong. The git hash and commit message are c96750026bbc7a9190f1381e7ac9ea56ae86f80e and "Bug #695: disable comm thread in multicore builds". More specifically, if I edit line 200 of the file src/arch/util/machine-common-core.c from "#define CMK_SMP_NO_COMMTHD CMK_MULTICORE" to "#define CMK_SMP_NO_COMMTHD 0", the error message goes away and all tests pass again. Honestly, I don't really know why this change fixes the problem -- it's pretty far under the hood.
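For reference, the one-line local change described above just swaps the macro's definition; this is a workaround that hides the symptom, not a proposed fix.

/* src/arch/util/machine-common-core.c, near line 200 */

/* As of commit c967500 ("Bug #695: disable comm thread in multicore builds"): */
#define CMK_SMP_NO_COMMTHD CMK_MULTICORE

/* Local edit after which all of our tests pass again: */
/* #define CMK_SMP_NO_COMMTHD 0 */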
A few questions:

1) Is this list an appropriate place to post information about potential bugs?
2) Does this seem to be a charm++ bug introduced by that commit? Or a fix which has simply broken our code? I had a hard time tracking down the source of the error. Oddly enough, I could not reproduce the same error when using valgrind (although it did report an "Uninitialised value was created by a stack allocation" which it tracked to one of the declaration files created by charmc).
With MALLOC_CHECK_ set to 3 I get the following:

*** Error in `./Evolve1DScalarWave': free(): invalid pointer: 0x000000000203c920 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x7338f)[0x7f4cebc2e38f]
/lib/x86_64-linux-gnu/libc.so.6(+0x81fb6)[0x7f4cebc3cfb6]
/lib/x86_64-linux-gnu/libc.so.6(+0x3c280)[0x7f4cebbf7280]
/lib/x86_64-linux-gnu/libc.so.6(+0x3c2a5)[0x7f4cebbf72a5]
./Evolve1DScalarWave[0x670b4a]
./Evolve1DScalarWave[0x5e39ed]
./Evolve1DScalarWave(CsdScheduleForever+0x48)[0x673e88]
./Evolve1DScalarWave(CsdScheduler+0x2d)[0x67413d]
./Evolve1DScalarWave(_ZN12ElementChareI16ScalarWaveSystemILi1EEE11endTimeStepEv+0x448)[0x580d3c]
./Evolve1DScalarWave(_ZN12ElementChareI16ScalarWaveSystemILi1EEE13endComputeRhsEv+0x5331DScalarWave': free(): invalid pointer: 0x000000000203c920 ***
Best,
Scott
charm mailing list
charm AT cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/charm
- Re: [charm] memory management errors after ckexit called, Phil Miller, 08/18/2015
- Re: [charm] memory management errors after ckexit called, Scott Field, 08/19/2015
- Re: [charm] memory management errors after ckexit called, Eric Bohm, 08/20/2015
- Re: [charm] memory management errors after ckexit called, Scott Field, 08/21/2015
- Re: [charm] memory management errors after ckexit called, Eric Bohm, 08/20/2015
- Re: [charm] memory management errors after ckexit called, Scott Field, 08/19/2015