uts_omp_ranks crash w/ clang on Linux/x86-64

Issue #93 resolved
Paul Hargrove created an issue

The recently added uts_omp_ranks test is crashing on Dirac with clang-5.0.0.
Failures can be seen here

With an optimized build of GASNet one just sees SEGV.
However, the debug build of GASNet hits assertion failures (from the GASNet and upcxx runtimes) before reaching a SEGV.

This appears to be exclusive to clang, since gcc on the same system did not fail.

Comments (8)

  1. Dan Bonachea

    This has been failing consistently every night for the past 6 weeks.

    Based on 11/27 results, this can also manifest as a hang in debug mode after failing the same debug check:

    GASNet gasnetc_AMRequestMediumM returning an error code: GASNET_ERR_BAD_ARG (Invalid function parameter passed)
      at /home/upcnightly/EX-dirac-ibv-clang/runtime/src/gasnet/ibv-conduit/gasnet_core_sndrcv.c:4717
      reason: node index too high
    *** Caught a signal: SIGINT(2) on node 3/8
    

    Combining this with the evidence from comment #0, it seems likely there is a buffer overrun in UPC++ that results in random memory corruption, usually scribbling over a node index later passed to AMRequestMedium.

    Here is a crash stack with a single node with one OMP thread, revealing an invocation of upcxx::persona::active_with_caller with a NULL this pointer:

    $ env OMP_NUM_THREADS=1 upcxx-run -np 1 uts_omp_ranks-par
    Using default UTS_WIDTH=100
    Using default UTS_WIDTH=100
    *** Caught a fatal signal: SIGSEGV(11) on node 0/1
    [0] Invoking GDB for backtrace...
    [0] /usr/local/pkg/gdb/newest/bin/gdb -nx -batch -x /tmp/gasnet_hbpX0b '/home/data2/upcnightly/dirac/EX-dirac-ibv-clang/work/dbg/gasnet/tests/upcr-harness/external-upcxx/./uts_omp_ranks-par' 2010
    [0]   Id   Target Id         Frame 
    [0] * 1    Thread 0x7fc6bf07b8c0 (LWP 2010) "uts_omp_ranks-p" 0x00007fc6bd490dbc in waitpid () from /lib64/libc.so.6
    [0] 
    [0] Thread 1 (Thread 0x7fc6bf07b8c0 (LWP 2010)):
    [0] #0  0x00007fc6bd490dbc in waitpid () from /lib64/libc.so.6
    [0] #1  0x00007fc6bd413cc2 in do_system () from /lib64/libc.so.6
    [0] #2  0x000000000047c1e7 in gasneti_system_redirected (cmd=0x8064f0 <gasneti_bt_gdb.cmd> "/usr/local/pkg/gdb/newest/bin/gdb -nx -batch -x /tmp/gasnet_hbpX0b '/home/data2/upcnightly/dirac/EX-dirac-ibv-clang/work/dbg/gasnet/tests/upcr-harness/external-upcxx/./uts_omp_ranks-par' 2010", stdout_fd=6) at /home/upcnightly/EX-dirac-ibv-clang/runtime/src/gasnet/gasnet_tools.c:967
    [0] #3  0x000000000047b97a in gasneti_bt_gdb (fd=6) at /home/upcnightly/EX-dirac-ibv-clang/runtime/src/gasnet/gasnet_tools.c:1214
    [0] #4  0x0000000000474ec7 in gasneti_print_backtrace (fd=2) at /home/upcnightly/EX-dirac-ibv-clang/runtime/src/gasnet/gasnet_tools.c:1483
    [0] #5  0x0000000000475931 in _gasneti_print_backtrace_ifenabled (fd=2) at /home/upcnightly/EX-dirac-ibv-clang/runtime/src/gasnet/gasnet_tools.c:1614
    [0] #6  0x000000000055a844 in gasneti_defaultSignalHandler (sig=11) at /home/upcnightly/EX-dirac-ibv-clang/runtime/src/gasnet/gasnet_internal.c:1422
    [0] #7  <signal handler called>
    [0] #8  upcxx::persona::active_with_caller (this=0x0) at /home/upcnightly/EX-dirac-ibv-clang/runtime/bld/dbg/upcxx/.nobs/art/e15af02af9865558b39073c315239fd0d9eaf673/upcxx/persona.hpp:90
    [0] #9  0x00000000004127c1 in upcxx::persona::lpc_ff<uts_parallel(unsigned long&, upcxx::digest&)::$_1>(uts_parallel(unsigned long&, upcxx::digest&)::$_1) (this=0x0, fn=...) at /home/upcnightly/EX-dirac-ibv-clang/runtime/bld/dbg/upcxx/.nobs/art/8bfe9e72b7a2c266f353293000f7a5b00770be0d/upcxx/persona.hpp:96
    [0] #10 0x00000000004125ea in void vranks::send<uts_parallel(unsigned long&, upcxx::digest&)::$_1>(int, uts_parallel(unsigned long&, upcxx::digest&)::$_1)::{lambda()#1}::operator()() const (this=0x7fff51d019f0) at /home/upcnightly/EX-dirac-ibv-clang/runtime/bld/dbg/upcxx/test/uts/vranks_omp_ranks.hpp:32
    [0] #11 0x00000000004124d5 in upcxx::commanding<void vranks::send<uts_parallel(unsigned long&, upcxx::digest&)::$_1>(int, uts_parallel(unsigned long&, upcxx::digest&)::$_1)::{lambda()#1}>::execute(upcxx::parcel_reader&) (r=...) at /home/upcnightly/EX-dirac-ibv-clang/runtime/bld/dbg/upcxx/.nobs/art/8bfe9e72b7a2c266f353293000f7a5b00770be0d/upcxx/command.hpp:46
    [0] #12 0x0000000000412463 in upcxx::detail::command_executor<void vranks::send<uts_parallel(unsigned long&, upcxx::digest&)::$_1>(int, uts_parallel(unsigned long&, upcxx::digest&)::$_1)::{lambda()#1}>(upcxx::parcel_reader&) (r=...) at /home/upcnightly/EX-dirac-ibv-clang/runtime/bld/dbg/upcxx/.nobs/art/8bfe9e72b7a2c266f353293000f7a5b00770be0d/upcxx/command.hpp:70
    [0] #13 0x000000000040ae9d in upcxx::command_execute (r=...) at /home/upcnightly/EX-dirac-ibv-clang/runtime/bld/dbg/upcxx/.nobs/art/e15af02af9865558b39073c315239fd0d9eaf673/upcxx/command.hpp:76
    [0] #14 0x000000000041030f in upcxx::backend::gasnet::rpc_inbox::burst (this=0x7fd808 <(anonymous namespace)::rpcs_user_>, burst_n=100) at /home/upcnightly/EX-dirac-ibv-clang/runtime/bld/dbg/upcxx/src/backend/gasnet/rpc_inbox.cpp:15
    [0] #15 0x0000000000409b37 in upcxx::progress(upcxx::progress_level)::$_1::operator()() const (this=0x7fff51d01bb0) at /home/upcnightly/EX-dirac-ibv-clang/runtime/bld/dbg/upcxx/src/backend/gasnet/runtime.cpp:402
    [0] #16 0x0000000000408c89 in upcxx::detail::persona_as_top<upcxx::progress(upcxx::progress_level)::$_1>(upcxx::persona&, upcxx::progress(upcxx::progress_level)::$_1&&) (p=..., fn=...) at /home/upcnightly/EX-dirac-ibv-clang/runtime/bld/dbg/upcxx/.nobs/art/e15af02af9865558b39073c315239fd0d9eaf673/upcxx/persona.hpp:321
    [0] #17 0x00000000004070ad in upcxx::progress (level=upcxx::progress_level::user) at /home/upcnightly/EX-dirac-ibv-clang/runtime/bld/dbg/upcxx/src/backend/gasnet/runtime.cpp:399
    [0] #18 0x000000000041604e in vranks::progress () at /home/upcnightly/EX-dirac-ibv-clang/runtime/bld/dbg/upcxx/test/uts/vranks_omp_ranks.hpp:37
    [0] #19 0x00000000004112f5 in qd_progress (local_quiescence=true) at /home/upcnightly/EX-dirac-ibv-clang/runtime/bld/dbg/upcxx/test/uts/uts.cpp:258
    [0] #20 0x00000000004111dc in uts_parallel (par_node_n=@0x7fff51d01d58: 1, par_hash=...) at /home/upcnightly/EX-dirac-ibv-clang/runtime/bld/dbg/upcxx/test/uts/uts.cpp:150
    [0] #21 0x0000000000411f3f in main::$_0::operator() (this=0x7fff51d02138, vrank_me1=0, vrank_n1=4) at /home/upcnightly/EX-dirac-ibv-clang/runtime/bld/dbg/upcxx/test/uts/uts.cpp:51
    [0] #22 0x0000000000411d2a in .omp_outlined.(void) (.global_tid.=0x7fff51d01e90, .bound_tid.=0x7fff51d01e88, fn=..., bar1=...) at /home/upcnightly/EX-dirac-ibv-clang/runtime/bld/dbg/upcxx/test/uts/vranks_omp_ranks.hpp:56
    [0] #23 0x00007fc6bda3ef43 in __kmp_invoke_microtask () from /usr/local/pkg/pathscale/ekopath-6.0.963/lib/6.0.963/x8664/64/libomp.so
    [0] #24 0x00007fc6bd9d6ad9 in __kmp_fork_call () from /usr/local/pkg/pathscale/ekopath-6.0.963/lib/6.0.963/x8664/64/libomp.so
    [0] #25 0x00007fc6bd9cf029 in __kmpc_fork_call () from /usr/local/pkg/pathscale/ekopath-6.0.963/lib/6.0.963/x8664/64/libomp.so
    [0] #26 0x0000000000410b73 in vranks::spawn<main::$_0> (fn=...) at /home/upcnightly/EX-dirac-ibv-clang/runtime/bld/dbg/upcxx/test/uts/vranks_omp_ranks.hpp:49
    [0] #27 0x0000000000410a29 in main () at /home/upcnightly/EX-dirac-ibv-clang/runtime/bld/dbg/upcxx/test/uts/uts.cpp:43
    
  2. john bachan

    I would love to go digging into this, but I'm still not sure how to reproduce an environment just given a link to the crash page. I assume I've either missed or forgotten helpful emails and links.

  3. Dan Bonachea

    The relevant configuration page is here, although that probably has both more and less information than you need.

    The UPC++ CI build log is here

    I was able to reproduce on dirac with:

    SMP:
    env CC="/usr/local/pkg/clang/5.0.0/bin/clang -Wno-unused-command-line-argument" CXX="/usr/local/pkg/openmpi-2.1.1/clang-5.0.0/bin/mpicxx -Wno-unused-command-line-argument" UPCXX_BACKEND=gasnetex_par DBGSYM=1 OPTLEV=0 nobs exe test/uts/uts_omp_ranks.cpp
    
    IBV:
    env CC="/usr/local/pkg/clang/5.0.0/bin/clang -Wno-unused-command-line-argument" CXX="/usr/local/pkg/openmpi-2.1.1/clang-5.0.0/bin/mpicxx -Wno-unused-command-line-argument" GASNET_CONDUIT=ibv UPCXX_BACKEND=gasnetex_par DBGSYM=1 OPTLEV=0 nobs exe test/uts/uts_omp_ranks.cpp
    
  4. Paul Hargrove reporter

    @jbachan reminder: "on dirac" means the systems "pcp-d-5" and "pcp-d-6", reachable from n2001.

  5. john bachan

    Found the problem. This is either a bug in the OpenMP runtime or a gap in my understanding of the OMP_NUM_THREADS env var. According to https://gcc.gnu.org/onlinedocs/libgomp/OMP_005fNUM_005fTHREADS.html, the wording seems to imply that OMP_NUM_THREADS is just a suggested default for the number of threads to spin up. I expected a parallel region that explicitly requests more threads than that default to have the extra threads spun up. This is the behavior I see on other OpenMP runtimes, but it wasn't happening here. The fix was to explicitly call omp_set_num_threads(N) at startup, before creating a parallel region with num_threads(N).

  6. john bachan

    Fixed issue 93. It appears that OpenMP runtimes interpret the OMP_NUM_THREADS env var differently. This fix calls omp_set_num_threads at startup to ensure that parallel regions requiring N threads get N distinct threads, even when N exceeds OMP_NUM_THREADS.

    → <<cset f3701ac78ccf>>
