SEGV in handle_cb_queue::enqueue() for PAR backend with many tests
Using the 2017.9.0 release on dirac/gcc-7/smp/debug:
{pcp-d-5 ~/upcxx-2017.9.0/example/prog-guide} env UPCXX_GASNET_CONDUIT=smp UPCXX_CODEMODE=debug UPCXX_THREADMODE=seq UPCXX_INSTALL=/home/pcp1/bonachea/upcxx-2017.9.0/foo gmake clean compute-pi-multi-examples
rm -rf hello-world compute-pi compute-pi-multi-examples persona-example
g++ compute-pi-multi-examples.cpp -DUPCXX_BACKEND=gasnet1_seq -D_GNU_SOURCE=1 -DGASNET_SEQ -I/home/pcp1/bonachea/upcxx-2017.9.0/foo/gasnet.debug/include -I/home/pcp1/bonachea/upcxx-2017.9.0/foo/gasnet.debug/include/smp-conduit -I/home/pcp1/bonachea/upcxx-2017.9.0/foo/upcxx.debug.gasnet1_seq.smp/include -std=c++11 -Wno-inline -g3 -Wno-unused -Wno-unused-parameter -Wno-address -std=c++11 -Wno-inline -L/home/pcp1/bonachea/upcxx-2017.9.0/foo/upcxx.debug.gasnet1_seq.smp/lib -lupcxx -lpthread -L/home/pcp1/bonachea/upcxx-2017.9.0/foo/gasnet.debug/lib -lgasnet-smp-seq -lrt -L/usr/local/pkg/gcc/7.2.0/lib/gcc/x86_64-pc-linux-gnu/7.2.0 -lgcc -lm -o compute-pi-multi-examples
{pcp-d-5 ~/upcxx-2017.9.0/example/prog-guide} env GASNET_PSHM_NODES=1 ./compute-pi-multi-examples
Testing compute-pi-multi-examples.cpp with 1 ranks
Calculating pi with 100000 trials, distributed across 1 ranks.
rpc: pi estimate: 3.14152, rank 0 alone: 3.14152
rpc_no_barrier: pi estimate: 3.14152, rank 0 alone: 3.14152
global_ptrs: pi estimate: 3.14152, rank 0 alone: 3.14152
distobj: pi estimate: 3.14152, rank 0 alone: 3.14152
async_distobj: pi estimate: 3.14152, rank 0 alone: 3.14152
atomics: pi estimate: 3.14152, rank 0 alone: 3.14152
quiescence: pi estimate: 3.14152, rank 0 alone: 3.14152
Computed pi to be 3.14152
SUCCESS
{pcp-d-5 ~/upcxx-2017.9.0/example/prog-guide} env UPCXX_GASNET_CONDUIT=smp UPCXX_CODEMODE=debug UPCXX_THREADMODE=par UPCXX_INSTALL=/home/pcp1/bonachea/upcxx-2017.9.0/foo gmake clean compute-pi-multi-examples
rm -rf hello-world compute-pi compute-pi-multi-examples persona-example
g++ compute-pi-multi-examples.cpp -DUPCXX_BACKEND=gasnetex_par -D_GNU_SOURCE=1 -DGASNET_PAR -D_REENTRANT -I/home/pcp1/bonachea/upcxx-2017.9.0/foo/gasnet.debug/include -I/home/pcp1/bonachea/upcxx-2017.9.0/foo/gasnet.debug/include/smp-conduit -I/home/pcp1/bonachea/upcxx-2017.9.0/foo/upcxx.debug.gasnetex_par.smp/include -std=c++11 -Wno-inline -g3 -Wno-unused -Wno-unused-parameter -Wno-address -std=c++11 -Wno-inline -L/home/pcp1/bonachea/upcxx-2017.9.0/foo/upcxx.debug.gasnetex_par.smp/lib -lupcxx -lpthread -L/home/pcp1/bonachea/upcxx-2017.9.0/foo/gasnet.debug/lib -lgasnet-smp-par -lpthread -lrt -L/usr/local/pkg/gcc/7.2.0/lib/gcc/x86_64-pc-linux-gnu/7.2.0 -lgcc -lm -o compute-pi-multi-examples
{pcp-d-5 ~/upcxx-2017.9.0/example/prog-guide} env GASNET_PSHM_NODES=1 ./compute-pi-multi-examples
Testing compute-pi-multi-examples.cpp with 1 ranks
Calculating pi with 100000 trials, distributed across 1 ranks.
rpc: pi estimate: 3.14152, rank 0 alone: 3.14152
rpc_no_barrier: pi estimate: 3.14152, rank 0 alone: 3.14152
*** Caught a fatal signal: SIGSEGV(11) on node 0/1
Note the test works fine on the SEQ backend but crashes on the PAR backend. In both cases this is a single rank containing a single thread (this test does not spawn threads).
Here is the crash stack:
Program received signal SIGSEGV, Segmentation fault.
0x000000000043b72c in upcxx::backend::gasnet::handle_cb_queue::enqueue (this=0x8b0188 <upcxx::backend::master+136>, cb=0x8f0580)
at /home/pcp1/bonachea/upcxx-2017.9.0/.nobs/art/9745a86cc2134db69f60402e204d700b7b484511/upcxx/backend/gasnet/handle_cb.hpp:33
33 *this->tailp_ = cb;
(gdb) where
#0 0x000000000043b72c in upcxx::backend::gasnet::handle_cb_queue::enqueue (this=0x8b0188 <upcxx::backend::master+136>, cb=0x8f0580)
at /home/pcp1/bonachea/upcxx-2017.9.0/.nobs/art/9745a86cc2134db69f60402e204d700b7b484511/upcxx/backend/gasnet/handle_cb.hpp:33
#1 0x000000000043a758 in upcxx::backend::rma_put (rank_d=0, buf_d=0x7fff763563c8, buf_s=0x8f05b0, buf_size=4, cb=0x8f0580) at /home/pcp1/bonachea/upcxx-2017.9.0/src/backend/gasnet/runtime.cpp:261
#2 0x00000000004110dd in upcxx::rput<int, upcxx::nil_cx, upcxx::nil_cx, upcxx::future_cx<0> > (value_s=78538, gp_d=..., cxs=...)
at /home/pcp1/bonachea/upcxx-2017.9.0/foo/upcxx.debug.gasnetex_par.smp/include/upcxx/rput.hpp:187
#3 0x0000000000404bfe in global_ptrs::reduce_to_rank0 (my_hits=78538) at global-ptrs-reduce_to_rank0.hpp:15
#4 0x000000000040559a in main (argc=1, argv=0x7fffffffd5f8) at compute-pi-multi-examples.cpp:83
(gdb) print *this
$4 = {head_ = 0x0, tailp_ = 0x0}
Looks like a bug pushing onto an empty queue - appears something is not properly resetting to the expected empty list state.
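To make the expected empty state concrete, here is a minimal sketch of a tail-pointer queue. Only the member names head_ and tailp_ come from the gdb output above; the node type, dequeue(), and well_formed() are hypothetical, and this is not the actual handle_cb.hpp code. In a queue of this shape, the empty state is tailp_ == &head_, so a printout of {head_ = 0x0, tailp_ = 0x0} describes an object that was never put into the empty state at all.

```cpp
#include <cassert>

// Hypothetical stand-in for a callback node; not the real upcxx handle_cb type.
struct node {
  node *next = nullptr;
  int id = 0;
};

// Illustrative tail-pointer queue. The names head_ and tailp_ mirror the gdb
// output; everything else is an assumption about how such a queue is built.
struct tail_queue {
  node *head_ = nullptr;
  node **tailp_ = &head_;  // empty state: tailp_ points at head_, never null

  tail_queue() = default;
  tail_queue(const tail_queue &) = delete;             // a copy would alias tailp_
  tail_queue &operator=(const tail_queue &) = delete;

  // False for an object whose member initializers never ran.
  bool well_formed() const { return tailp_ != nullptr; }

  void enqueue(node *n) {
    assert(well_formed());
    n->next = nullptr;
    *tailp_ = n;         // this store faults when tailp_ is null
    tailp_ = &n->next;
  }

  node *dequeue() {
    node *n = head_;
    if (n == nullptr) return nullptr;
    head_ = n->next;
    if (head_ == nullptr) tailp_ = &head_;  // restore the empty state
    return n;
  }
};
```

With this layout, no sequence of enqueue and dequeue calls can leave tailp_ null, which suggests looking at construction rather than at the list logic.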
In addition to compute-pi-multi-examples, I see similar crashes when running these nobs tests against the par backend: rpc_barrier, rput, atomics
Comments (20)
-
Account Deleted -
reporter This occurred on both dirac (Linux) with modern gcc and my Cygwin laptop, so it's not highly platform dependent. I have not yet tested elsewhere.
Note the handle_cb_queue object itself appears to be intact, but the tail pointer inside it is nulled.
-
Account Deleted I don't know the machine "dirac" so I can't test anything. Am curious to see how this behaves:
#include <iostream>

struct poo {
  poo *head = nullptr;
  poo **tailp = &head;
  poo() = default;
};

thread_local poo foo;

int main() {
  std::cout << "head=" << foo.head << " tail=" << foo.tailp << '\n';
  return 0;
}
-
reporter Dirac is pcp-d-[1256] behind n2001. You have access.
-
Dan's report said:
In both cases this is a single rank containing a single thread (this test does not spawn threads).
So the "TLS constructors not running correctly on this platform" theory doesn't seem likely to me.
However, I will acknowledge that I cannot reproduce the same failure on Mac OS X Sierra w/ Apple's Clang 9.0.0.
-
Account Deleted I can't extract the gasnetex collaborator tarball on dirac, tar xzf hangs:
wget http://mantis.lbl.gov/nightly/unlisted/GASNet-EX-collaborator-snapshot.tar.gz
tar xzf GASNet-EX-collaborator-snapshot.tar.gz
-
Re: can't extract the gasnetex collaborator tarball
I am seeing the same. Will regenerate the tarball ASAP.
-
The tarball has been regenerated and I can extract it now. I have no clue what was wrong w/ the previous one.
FWIW: I can reproduce the reported error on Dirac using gcc-5.1.0 instead of gcc-7.2.0.
-
Account Deleted I'm still failing to extract the tarball on Dirac, same hang. I even tried using python's "tarfile" module to extract it and that hung too. Works fine on my laptop.
-
Account Deleted False alarm, it just takes a really long time (3 minutes)! And rm -r of the resulting tree takes 21 seconds!
-
reporter FWIW: Based on the nightly tester results, the platforms that crash on these nobs tests using the PAR backend include Mac OS X High Sierra + clang, and the crashes might be slightly more widespread than those on compute-pi-multi-examples (although we only have one night's worth of automated testing for this so far).
-
@jbachan re: slow I/O times.
FWIW: Scott also reported to Eric and me that I/O times for $HOME on this system were very slow recently.
Eric will look into it, I think, but I am not sure what can be done.
-
Account Deleted Thanks, I'm operating out of /tmp/jdbachan for now and it speeds things up drastically.
I'm having trouble getting the debugger to do what I want. I can reproduce the issue with RANKS=2 using test/atomic.cpp. So I launch with GASNET_FREEZE_ON_ERROR=1, and it stops me at the line where the assign-to-null is happening, which looks like *(this->tailp_) = ... . In this context this is a global variable (actually a field of a field of a global variable, same thing), so &this->tailp_ has a nice fixed address. I printed that out and got 0x8e7670. I then kill both processes and start another run with GASNET_FREEZE=1. I attach a gdb instance to each, run watch *0x8e7670 (also tried watch *(void***)0x8e7670) in each, then set gasnet_frozen=0 and continue, expecting to be halted by the watch each time my this->tailp_ changes. But instead no watchpoints are ever triggered and I run straight through to the segfault. I'm sure that 0x8e7670 is the right address to watch since gdb nicely tells me its name at the segfault (upcxx::backend::master+...). Does anybody have ideas as to why my watches aren't firing?
-
@jbachan some ideas:
There is some address-space randomization on the Dirac nodes. You might want to re-verify that you are getting the same address on every run.
You may want to try module load gdb/newest to get a newer gdb in your $PATH.
-
reporter I would recommend starting with GASNET_FREEZE=1 and setting a breakpoint on the enqueue and dequeue methods, since that's where things probably start going wrong.
Also note that compute-pi-multi-examples/PAR fails on dirac using a single rank / single thread, which could simplify matters even further since everything is in one process.
-
Account Deleted @PHHargrove @bonachea thanks for your input, I think I found the bug. The data structure's constructor is never being called because a header is being pulled in with different #defines in place depending on the translation unit. So one TU thinks the per-persona state is the empty struct, while another correctly sees it has a linked list in it.
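For readers following along, the failure mode described in that comment can be simulated in a single file. This is a sketch only: persona_state_empty, persona_state_queue, raw_view, and both helper functions are hypothetical names, the two structs stand in for what two translation units would each see from one header under different #defines, and it is not the upcxx code or the actual fix. It also assumes a null pointer is represented by all-zero bytes, which holds on mainstream platforms.

```cpp
#include <cstring>
#include <new>

// Hypothetical callback node, standing in for the real callback type.
struct cb { cb *next = nullptr; };

// What the TU with the "wrong" #defines sees: an empty struct, no queue.
struct persona_state_empty {};

// What the TU with the intended #defines sees: the queue members.
struct persona_state_queue {
  cb *head_ = nullptr;
  cb **tailp_ = &head_;  // intended empty state
};

// Plain view used only to inspect the raw bytes of the storage.
struct raw_view { cb *head_; cb **tailp_; };

static_assert(sizeof(persona_state_empty) != sizeof(persona_state_queue),
              "the two TUs disagree about the layout");
static_assert(sizeof(raw_view) == sizeof(persona_state_queue),
              "raw_view must cover the same bytes as the queue view");

// Construct the zeroed storage through the empty view, then inspect it the
// way the other TU would: both pointers are still null.
inline bool tail_is_null_after_empty_view_ctor() {
  alignas(persona_state_queue) unsigned char storage[sizeof(persona_state_queue)] = {};
  new (storage) persona_state_empty();  // runs no member initializers
  raw_view seen;
  std::memcpy(&seen, storage, sizeof seen);
  return seen.head_ == nullptr && seen.tailp_ == nullptr;
}

// Construct through the queue view: tailp_ points at head_ as intended.
inline bool tail_is_valid_after_queue_view_ctor() {
  alignas(persona_state_queue) unsigned char storage[sizeof(persona_state_queue)] = {};
  persona_state_queue *q = new (storage) persona_state_queue();
  return q->head_ == nullptr && q->tailp_ == &q->head_;
}
```

The first helper reproduces the {head_ = 0x0, tailp_ = 0x0} state from the crash report: the storage is zeroed and "constructed", yet the member initializers never ran. In a real build the mismatch is an ODR violation, and compilers are not required to diagnose it.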
-
reporter - changed status to resolved
Nightly tests confirm this was resolved in 8ca817a
-
reporter - changed milestone to 2017.12.31 release
-
reporter - changed version to 2017.9.0 release
-
I'm having trouble believing it's in the linked-list logic. That's the same code in use with SEQ, and it's only 30 lines that look solid. More likely to be TLS constructors not running correctly on this platform.