SEGV in handle_cb_queue::enqueue() for PAR backend with many tests

Issue #81 resolved
Dan Bonachea created an issue

Using the 2017.9.0 release on dirac/gcc-7/smp/debug:

{pcp-d-5 ~/upcxx-2017.9.0/example/prog-guide} env UPCXX_GASNET_CONDUIT=smp UPCXX_CODEMODE=debug UPCXX_THREADMODE=seq UPCXX_INSTALL=/home/pcp1/bonachea/upcxx-2017.9.0/foo gmake clean compute-pi-multi-examples      
rm -rf hello-world compute-pi compute-pi-multi-examples persona-example
g++ compute-pi-multi-examples.cpp -DUPCXX_BACKEND=gasnet1_seq -D_GNU_SOURCE=1 -DGASNET_SEQ -I/home/pcp1/bonachea/upcxx-2017.9.0/foo/gasnet.debug/include -I/home/pcp1/bonachea/upcxx-2017.9.0/foo/gasnet.debug/include/smp-conduit -I/home/pcp1/bonachea/upcxx-2017.9.0/foo/upcxx.debug.gasnet1_seq.smp/include -std=c++11 -Wno-inline -g3 -Wno-unused -Wno-unused-parameter -Wno-address -std=c++11 -Wno-inline -L/home/pcp1/bonachea/upcxx-2017.9.0/foo/upcxx.debug.gasnet1_seq.smp/lib -lupcxx -lpthread -L/home/pcp1/bonachea/upcxx-2017.9.0/foo/gasnet.debug/lib -lgasnet-smp-seq -lrt -L/usr/local/pkg/gcc/7.2.0/lib/gcc/x86_64-pc-linux-gnu/7.2.0 -lgcc -lm -o compute-pi-multi-examples 
{pcp-d-5 ~/upcxx-2017.9.0/example/prog-guide} env GASNET_PSHM_NODES=1 ./compute-pi-multi-examples
Testing compute-pi-multi-examples.cpp with 1 ranks
Calculating pi with 100000 trials, distributed across 1 ranks.
rpc: pi estimate: 3.14152, rank 0 alone: 3.14152
rpc_no_barrier: pi estimate: 3.14152, rank 0 alone: 3.14152
global_ptrs: pi estimate: 3.14152, rank 0 alone: 3.14152
distobj: pi estimate: 3.14152, rank 0 alone: 3.14152
async_distobj: pi estimate: 3.14152, rank 0 alone: 3.14152
atomics: pi estimate: 3.14152, rank 0 alone: 3.14152
quiescence: pi estimate: 3.14152, rank 0 alone: 3.14152
Computed pi to be 3.14152
SUCCESS

{pcp-d-5 ~/upcxx-2017.9.0/example/prog-guide} env UPCXX_GASNET_CONDUIT=smp UPCXX_CODEMODE=debug UPCXX_THREADMODE=par UPCXX_INSTALL=/home/pcp1/bonachea/upcxx-2017.9.0/foo gmake clean compute-pi-multi-examples   
rm -rf hello-world compute-pi compute-pi-multi-examples persona-example
g++ compute-pi-multi-examples.cpp -DUPCXX_BACKEND=gasnetex_par -D_GNU_SOURCE=1 -DGASNET_PAR -D_REENTRANT -I/home/pcp1/bonachea/upcxx-2017.9.0/foo/gasnet.debug/include -I/home/pcp1/bonachea/upcxx-2017.9.0/foo/gasnet.debug/include/smp-conduit -I/home/pcp1/bonachea/upcxx-2017.9.0/foo/upcxx.debug.gasnetex_par.smp/include -std=c++11 -Wno-inline -g3 -Wno-unused -Wno-unused-parameter -Wno-address -std=c++11 -Wno-inline -L/home/pcp1/bonachea/upcxx-2017.9.0/foo/upcxx.debug.gasnetex_par.smp/lib -lupcxx -lpthread -L/home/pcp1/bonachea/upcxx-2017.9.0/foo/gasnet.debug/lib -lgasnet-smp-par -lpthread -lrt -L/usr/local/pkg/gcc/7.2.0/lib/gcc/x86_64-pc-linux-gnu/7.2.0 -lgcc -lm -o compute-pi-multi-examples 
{pcp-d-5 ~/upcxx-2017.9.0/example/prog-guide} env GASNET_PSHM_NODES=1 ./compute-pi-multi-examples                                                                                                              
Testing compute-pi-multi-examples.cpp with 1 ranks
Calculating pi with 100000 trials, distributed across 1 ranks.
rpc: pi estimate: 3.14152, rank 0 alone: 3.14152
rpc_no_barrier: pi estimate: 3.14152, rank 0 alone: 3.14152
*** Caught a fatal signal: SIGSEGV(11) on node 0/1

Note the test works fine on the SEQ backend but crashes on the PAR backend. In both cases this is a single rank containing a single thread (this test does not spawn threads).

Here is the crash stack:

Program received signal SIGSEGV, Segmentation fault.
0x000000000043b72c in upcxx::backend::gasnet::handle_cb_queue::enqueue (this=0x8b0188 <upcxx::backend::master+136>, cb=0x8f0580)
    at /home/pcp1/bonachea/upcxx-2017.9.0/.nobs/art/9745a86cc2134db69f60402e204d700b7b484511/upcxx/backend/gasnet/handle_cb.hpp:33
33          *this->tailp_ = cb;
(gdb) where
#0  0x000000000043b72c in upcxx::backend::gasnet::handle_cb_queue::enqueue (this=0x8b0188 <upcxx::backend::master+136>, cb=0x8f0580)
    at /home/pcp1/bonachea/upcxx-2017.9.0/.nobs/art/9745a86cc2134db69f60402e204d700b7b484511/upcxx/backend/gasnet/handle_cb.hpp:33
#1  0x000000000043a758 in upcxx::backend::rma_put (rank_d=0, buf_d=0x7fff763563c8, buf_s=0x8f05b0, buf_size=4, cb=0x8f0580) at /home/pcp1/bonachea/upcxx-2017.9.0/src/backend/gasnet/runtime.cpp:261
#2  0x00000000004110dd in upcxx::rput<int, upcxx::nil_cx, upcxx::nil_cx, upcxx::future_cx<0> > (value_s=78538, gp_d=..., cxs=...)
    at /home/pcp1/bonachea/upcxx-2017.9.0/foo/upcxx.debug.gasnetex_par.smp/include/upcxx/rput.hpp:187
#3  0x0000000000404bfe in global_ptrs::reduce_to_rank0 (my_hits=78538) at global-ptrs-reduce_to_rank0.hpp:15
#4  0x000000000040559a in main (argc=1, argv=0x7fffffffd5f8) at compute-pi-multi-examples.cpp:83
(gdb) print *this
$4 = {head_ = 0x0, tailp_ = 0x0}

Looks like a bug pushing onto an empty queue - appears something is not properly resetting to the expected empty list state.
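For readers unfamiliar with the tail-pointer idiom, here is a minimal sketch of such a queue. It is not the UPC++ source: `cb_node`, `cb_queue` and `queue_roundtrip_ok` are hypothetical names, and only the members `head_` and `tailp_` are taken from the gdb output above. In a well-formed empty queue `tailp_` points at `head_`, so the `{head_ = 0x0, tailp_ = 0x0}` state shown above should never be reachable through enqueue/dequeue alone.

```cpp
#include <cassert>

// Hypothetical stand-in for the callback node type; not the UPC++ definition.
struct cb_node {
  cb_node *next = nullptr;
};

// Hypothetical queue; only the member names head_ and tailp_ come from the gdb output.
struct cb_queue {
  cb_node *head_ = nullptr;
  cb_node **tailp_ = &head_;  // empty state: tailp_ points at head_, never null

  void enqueue(cb_node *cb) {
    assert(tailp_ != nullptr);  // fails if the member initializers never ran
    cb->next = nullptr;
    *tailp_ = cb;               // same shape as the faulting statement
    tailp_ = &cb->next;
  }

  cb_node *dequeue() {
    cb_node *cb = head_;
    if (cb != nullptr) {
      head_ = cb->next;
      if (head_ == nullptr) tailp_ = &head_;  // restore the empty state
    }
    return cb;
  }
};

// Returns true when FIFO order holds and the queue ends in the empty state.
bool queue_roundtrip_ok() {
  cb_queue q;
  cb_node a, b;
  q.enqueue(&a);
  q.enqueue(&b);
  bool fifo = q.dequeue() == &a && q.dequeue() == &b && q.dequeue() == nullptr;
  return fifo && q.head_ == nullptr && q.tailp_ == &q.head_;
}
```

Under these assumptions, a null `tailp_` points to the object never having been constructed (or having been overwritten) rather than to the list operations themselves.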

In addition to compute-pi-multi-examples, I see similar crashes when running these nobs tests against the par backend: rpc_barrier, rput, atomics

Comments (20)

  1. Former user Account Deleted

    I'm having trouble believing it's in the linked-list logic. That's the same code in use with SEQ, and it's only 30 lines that look solid. More likely to be TLS constructors not running correctly on this platform.

  2. Dan Bonachea reporter

    This occurred on both dirac (Linux) with modern gcc and my Cygwin laptop, so it's not highly platform dependent. I have not yet tested elsewhere.

    Note the handle_cb_queue object itself appears to be intact, but the tail pointer inside it is nulled.

  3. Former user Account Deleted

    I don't know the machine "dirac" so I can't test anything. Am curious to see how this behaves:

    #include<iostream>
    struct poo {
      poo *head = nullptr;
      poo **tailp = &head;
      poo() = default;
    };
    thread_local poo foo;
    int main() {
      std::cout << "head="<<foo.head<<" tail="<<foo.tailp<<'\n';
      return 0;
    }
    
  4. Paul Hargrove

    Dan's report said:

    In both cases this is a single rank containing a single thread (this test does not spawn threads).

    So the "TLS constructors not running correctly on this platform" theory doesn't seem likely to me.

    However, I will acknowledge that I cannot reproduce the same failure on Mac OS X Sierra w/ Apple's Clang 9.0.0.

  5. Former user Account Deleted

    I can't extract the gasnetex collaborator tarball on dirac, tar xzf hangs:

    wget http://mantis.lbl.gov/nightly/unlisted/GASNet-EX-collaborator-snapshot.tar.gz
    tar xzf GASNet-EX-collaborator-snapshot.tar.gz
    
  6. Paul Hargrove

    Re: can't extract the gasnetex collaborator tarball

    I am seeing the same. Will regenerate the tarball ASAP.

  7. Paul Hargrove

    The tarball has been regenerated and I can extract it now. I have no clue what was wrong w/ the previous one.

    FWIW: I can reproduce the reported error on Dirac using gcc-5.1.0 instead of gcc-7.2.0.

  8. Former user Account Deleted

    I'm still failing to extract the tarball on Dirac, same hang. I even tried using python's "tarfile" module to extract it and that hung too. Works fine on my laptop.

  9. Former user Account Deleted

    False alarm, it just takes a really long time (3 minutes)!

    And rm -r the resulting tree takes 21 seconds!

  10. Dan Bonachea reporter

    FWIW: Based on the nightly tester results, the crash results on these nobs tests using the PAR backend:

    include Mac OS X High Sierra + clang, and might be slightly more widespread than the crashes on compute-pi-multi-examples (although we only have one night's worth of automated testing for this so far).

  11. Paul Hargrove

    @jbachan re: slow I/O times.

    FWIW: Scott also reported to Eric and me that I/O times for $HOME on this system were very slow recently.
    Eric will look into it, I think, but I am not sure what can be done.

  12. Former user Account Deleted

    Thanks, I'm operating out of /tmp/jdbachan for now and it speeds things up drastically.

    I'm having trouble getting the debugger to do what I want. I can reproduce the issue with RANKS=2 using test/atomic.cpp. So I launch with GASNET_FREEZE_ON_ERROR=1, and it stops me at the line where the assign-to-null is happening, which looks like *(this->tailp_) = .... In this context `this` is a global variable (actually a field of a field of a global variable, same thing). So &this->tailp_ has a nice fixed address; I printed that out and got 0x8e7670. I then kill both processes and start another run with GASNET_FREEZE=1. I attach a gdb instance to each, run watch *0x8e7670 (also tried watch *(void***)0x8e7670) in each, set gasnet_frozen=0, and continue, expecting to be halted by the watch each time my this->tailp_ changes. But instead no watchpoints are ever triggered and I run straight through to the segfault. I'm sure that 0x8e7670 is the right address to watch, since gdb nicely tells me its name at the segfault (upcxx::backend::master+...). Does anybody have ideas as to why my watches aren't firing?

  13. Paul Hargrove

    @jbachan some ideas:

    There is some address-space randomization on the Dirac nodes.
    You might want to re-verify that you are getting same address on every run.

    You may want to try module load gdb/newest to get a newer gdb in your $PATH.

  14. Dan Bonachea reporter

    I would recommend starting with GASNET_FREEZE=1 and setting a breakpoint on the enqueue and dequeue methods, since that's where things probably start going wrong.

    Also note that compute-pi-multi-examples/PAR fails on dirac using a single rank / single thread, which could simplify matters even further since everything is in one process.

  15. Former user Account Deleted

    @PHHargrove @bonachea thanks for your input, I think I found the bug. The data structure's constructor is never being called because a header is pulled in with different #defines in place depending on the translation unit. So one TU thinks the per-persona state is the empty struct, while another correctly sees it has a linked list in it.
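The failure mode described in the last comment can be imitated in a single file. This is only a sketch under stated assumptions: the two namespaces stand in for two translation units that see different definitions of the same struct (in the real bug the name would be identical, which is a one-definition-rule violation), and `persona_state`, `tu_without_define`, `tu_with_define` and both helper functions are hypothetical names, not UPC++ identifiers. It also assumes a null pointer is represented by all-zero bytes, as on common platforms.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <new>

// View seen by a TU compiled without the relevant #define: an empty struct.
namespace tu_without_define {
struct persona_state {};
}

// View seen by a TU compiled with the #define: the state holds a list.
namespace tu_with_define {
struct persona_state {
  void *head_ = nullptr;
  void **tailp_ = &head_;
};
}

using full_state = tu_with_define::persona_state;

// Construct through the empty view, then inspect the bytes where the
// full view expects tailp_. Nothing was written there, so it is still zero.
bool tail_is_null_after_empty_ctor() {
  alignas(full_state) unsigned char storage[sizeof(full_state)] = {};
  new (storage) tu_without_define::persona_state;  // writes no members
  void **tailp = nullptr;
  std::memcpy(&tailp, storage + offsetof(full_state, tailp_), sizeof tailp);
  return tailp == nullptr;
}

// Construct through the full view: the member initializers run and
// tailp_ refers to head_, the expected empty-queue state.
bool tail_is_valid_after_full_ctor() {
  alignas(full_state) unsigned char storage[sizeof(full_state)] = {};
  full_state *s = new (storage) full_state;
  bool ok = (s->tailp_ == &s->head_);
  s->~full_state();
  return ok;
}
```

With zero-initialized storage, as for a global, the first case leaves both fields null, which matches the `{head_ = 0x0, tailp_ = 0x0}` state printed by gdb in the report.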
