berkeleylab/upcxx issue #93: uts_omp_ranks crash w/ clang on Linux/x86-64

Issue #93 resolved

Paul Hargrove created an issue 2017-10-18

The recently added uts_omp_ranks test is crashing on Dirac with clang-5.0.0.
Failures can be seen here

With an optimized build of GASNet one just sees a SEGV.
However, the debug build of GASNet is seeing assertion failures (from GASNet and upcxx runtime) prior to reaching a SEGV.

This appears to be exclusive to clang, since gcc on the same system did not fail.

Comments (8)

Dan Bonachea

This has been failing consistently every night for the past 6 weeks.

Based on 11/27 results, this can also manifest as a hang in debug mode after failing the same debug check:

GASNet gasnetc_AMRequestMediumM returning an error code: GASNET_ERR_BAD_ARG (Invalid function parameter passed)
  at /home/upcnightly/EX-dirac-ibv-clang/runtime/src/gasnet/ibv-conduit/gasnet_core_sndrcv.c:4717
  reason: node index too high
*** Caught a signal: SIGINT(2) on node 3/8

Combining this with the evidence from comment #0, it seems likely there is a buffer overrun in UPC++ that results in random memory corruption, usually scribbling over a node index later passed to AMRequestMedium.

Here is a crash stack with a single node with one OMP thread, revealing an invocation of upcxx::persona::active_with_caller with a NULL this pointer:

$ env OMP_NUM_THREADS=1 upcxx-run -np 1 uts_omp_ranks-par
Using default UTS_WIDTH=100
Using default UTS_WIDTH=100
*** Caught a fatal signal: SIGSEGV(11) on node 0/1
[0] Invoking GDB for backtrace...
[0] /usr/local/pkg/gdb/newest/bin/gdb -nx -batch -x /tmp/gasnet_hbpX0b '/home/data2/upcnightly/dirac/EX-dirac-ibv-clang/work/dbg/gasnet/tests/upcr-harness/external-upcxx/./uts_omp_ranks-par' 2010
[0]   Id   Target Id         Frame 
[0] * 1    Thread 0x7fc6bf07b8c0 (LWP 2010) "uts_omp_ranks-p" 0x00007fc6bd490dbc in waitpid () from /lib64/libc.so.6
[0] 
[0] Thread 1 (Thread 0x7fc6bf07b8c0 (LWP 2010)):
[0] #0  0x00007fc6bd490dbc in waitpid () from /lib64/libc.so.6
[0] #1  0x00007fc6bd413cc2 in do_system () from /lib64/libc.so.6
[0] #2  0x000000000047c1e7 in gasneti_system_redirected (cmd=0x8064f0 <gasneti_bt_gdb.cmd> "/usr/local/pkg/gdb/newest/bin/gdb -nx -batch -x /tmp/gasnet_hbpX0b '/home/data2/upcnightly/dirac/EX-dirac-ibv-clang/work/dbg/gasnet/tests/upcr-harness/external-upcxx/./uts_omp_ranks-par' 2010", stdout_fd=6) at /home/upcnightly/EX-dirac-ibv-clang/runtime/src/gasnet/gasnet_tools.c:967
[0] #3  0x000000000047b97a in gasneti_bt_gdb (fd=6) at /home/upcnightly/EX-dirac-ibv-clang/runtime/src/gasnet/gasnet_tools.c:1214
[0] #4  0x0000000000474ec7 in gasneti_print_backtrace (fd=2) at /home/upcnightly/EX-dirac-ibv-clang/runtime/src/gasnet/gasnet_tools.c:1483
[0] #5  0x0000000000475931 in _gasneti_print_backtrace_ifenabled (fd=2) at /home/upcnightly/EX-dirac-ibv-clang/runtime/src/gasnet/gasnet_tools.c:1614
[0] #6  0x000000000055a844 in gasneti_defaultSignalHandler (sig=11) at /home/upcnightly/EX-dirac-ibv-clang/runtime/src/gasnet/gasnet_internal.c:1422
[0] #7  <signal handler called>
[0] #8  upcxx::persona::active_with_caller (this=0x0) at /home/upcnightly/EX-dirac-ibv-clang/runtime/bld/dbg/upcxx/.nobs/art/e15af02af9865558b39073c315239fd0d9eaf673/upcxx/persona.hpp:90
[0] #9  0x00000000004127c1 in upcxx::persona::lpc_ff<uts_parallel(unsigned long&, upcxx::digest&)::$_1>(uts_parallel(unsigned long&, upcxx::digest&)::$_1) (this=0x0, fn=...) at /home/upcnightly/EX-dirac-ibv-clang/runtime/bld/dbg/upcxx/.nobs/art/8bfe9e72b7a2c266f353293000f7a5b00770be0d/upcxx/persona.hpp:96
[0] #10 0x00000000004125ea in void vranks::send<uts_parallel(unsigned long&, upcxx::digest&)::$_1>(int, uts_parallel(unsigned long&, upcxx::digest&)::$_1)::{lambda()#1}::operator()() const (this=0x7fff51d019f0) at /home/upcnightly/EX-dirac-ibv-clang/runtime/bld/dbg/upcxx/test/uts/vranks_omp_ranks.hpp:32
[0] #11 0x00000000004124d5 in upcxx::commanding<void vranks::send<uts_parallel(unsigned long&, upcxx::digest&)::$_1>(int, uts_parallel(unsigned long&, upcxx::digest&)::$_1)::{lambda()#1}>::execute(upcxx::parcel_reader&) (r=...) at /home/upcnightly/EX-dirac-ibv-clang/runtime/bld/dbg/upcxx/.nobs/art/8bfe9e72b7a2c266f353293000f7a5b00770be0d/upcxx/command.hpp:46
[0] #12 0x0000000000412463 in upcxx::detail::command_executor<void vranks::send<uts_parallel(unsigned long&, upcxx::digest&)::$_1>(int, uts_parallel(unsigned long&, upcxx::digest&)::$_1)::{lambda()#1}>(upcxx::parcel_reader&) (r=...) at /home/upcnightly/EX-dirac-ibv-clang/runtime/bld/dbg/upcxx/.nobs/art/8bfe9e72b7a2c266f353293000f7a5b00770be0d/upcxx/command.hpp:70
[0] #13 0x000000000040ae9d in upcxx::command_execute (r=...) at /home/upcnightly/EX-dirac-ibv-clang/runtime/bld/dbg/upcxx/.nobs/art/e15af02af9865558b39073c315239fd0d9eaf673/upcxx/command.hpp:76
[0] #14 0x000000000041030f in upcxx::backend::gasnet::rpc_inbox::burst (this=0x7fd808 <(anonymous namespace)::rpcs_user_>, burst_n=100) at /home/upcnightly/EX-dirac-ibv-clang/runtime/bld/dbg/upcxx/src/backend/gasnet/rpc_inbox.cpp:15
[0] #15 0x0000000000409b37 in upcxx::progress(upcxx::progress_level)::$_1::operator()() const (this=0x7fff51d01bb0) at /home/upcnightly/EX-dirac-ibv-clang/runtime/bld/dbg/upcxx/src/backend/gasnet/runtime.cpp:402
[0] #16 0x0000000000408c89 in upcxx::detail::persona_as_top<upcxx::progress(upcxx::progress_level)::$_1>(upcxx::persona&, upcxx::progress(upcxx::progress_level)::$_1&&) (p=..., fn=...) at /home/upcnightly/EX-dirac-ibv-clang/runtime/bld/dbg/upcxx/.nobs/art/e15af02af9865558b39073c315239fd0d9eaf673/upcxx/persona.hpp:321
[0] #17 0x00000000004070ad in upcxx::progress (level=upcxx::progress_level::user) at /home/upcnightly/EX-dirac-ibv-clang/runtime/bld/dbg/upcxx/src/backend/gasnet/runtime.cpp:399
[0] #18 0x000000000041604e in vranks::progress () at /home/upcnightly/EX-dirac-ibv-clang/runtime/bld/dbg/upcxx/test/uts/vranks_omp_ranks.hpp:37
[0] #19 0x00000000004112f5 in qd_progress (local_quiescence=true) at /home/upcnightly/EX-dirac-ibv-clang/runtime/bld/dbg/upcxx/test/uts/uts.cpp:258
[0] #20 0x00000000004111dc in uts_parallel (par_node_n=@0x7fff51d01d58: 1, par_hash=...) at /home/upcnightly/EX-dirac-ibv-clang/runtime/bld/dbg/upcxx/test/uts/uts.cpp:150
[0] #21 0x0000000000411f3f in main::$_0::operator() (this=0x7fff51d02138, vrank_me1=0, vrank_n1=4) at /home/upcnightly/EX-dirac-ibv-clang/runtime/bld/dbg/upcxx/test/uts/uts.cpp:51
[0] #22 0x0000000000411d2a in .omp_outlined.(void) (.global_tid.=0x7fff51d01e90, .bound_tid.=0x7fff51d01e88, fn=..., bar1=...) at /home/upcnightly/EX-dirac-ibv-clang/runtime/bld/dbg/upcxx/test/uts/vranks_omp_ranks.hpp:56
[0] #23 0x00007fc6bda3ef43 in __kmp_invoke_microtask () from /usr/local/pkg/pathscale/ekopath-6.0.963/lib/6.0.963/x8664/64/libomp.so
[0] #24 0x00007fc6bd9d6ad9 in __kmp_fork_call () from /usr/local/pkg/pathscale/ekopath-6.0.963/lib/6.0.963/x8664/64/libomp.so
[0] #25 0x00007fc6bd9cf029 in __kmpc_fork_call () from /usr/local/pkg/pathscale/ekopath-6.0.963/lib/6.0.963/x8664/64/libomp.so
[0] #26 0x0000000000410b73 in vranks::spawn<main::$_0> (fn=...) at /home/upcnightly/EX-dirac-ibv-clang/runtime/bld/dbg/upcxx/test/uts/vranks_omp_ranks.hpp:49
[0] #27 0x0000000000410a29 in main () at /home/upcnightly/EX-dirac-ibv-clang/runtime/bld/dbg/upcxx/test/uts/uts.cpp:43

2017-11-28T06:27:55+00:00

john bachan
I would love to go digging into this, but I'm still not sure how to reproduce an environment just given a link to the crash page. I assume I've either missed or forgotten helpful emails and links.
- 2017-11-29T17:21:50+00:00

Dan Bonachea

The relevant configuration page is here, although that probably has both more and less information than you need.

The UPC++ CI build log is here

I was able to reproduce on dirac with:

SMP:
env CC="/usr/local/pkg/clang/5.0.0/bin/clang -Wno-unused-command-line-argument" CXX="/usr/local/pkg/openmpi-2.1.1/clang-5.0.0/bin/mpicxx -Wno-unused-command-line-argument" UPCXX_BACKEND=gasnetex_par DBGSYM=1 OPTLEV=0 nobs exe test/uts/uts_omp_ranks.cpp

IBV:
env CC="/usr/local/pkg/clang/5.0.0/bin/clang -Wno-unused-command-line-argument" CXX="/usr/local/pkg/openmpi-2.1.1/clang-5.0.0/bin/mpicxx -Wno-unused-command-line-argument" GASNET_CONDUIT=ibv UPCXX_BACKEND=gasnetex_par DBGSYM=1 OPTLEV=0 nobs exe test/uts/uts_omp_ranks.cpp

2017-11-29T18:46:47+00:00

Paul Hargrove reporter
@jbachan reminder: "on dirac" means the systems "pcp-d-5" and "pcp-d-6", reachable from n2001.
- 2017-11-29T19:06:47+00:00
john bachan
Found the problem. This is either a bug in the OMP runtime or a gap in my understanding of the OMP_NUM_THREADS env var. According to https://gcc.gnu.org/onlinedocs/libgomp/OMP_005fNUM_005fTHREADS.html, the wording seems to imply that OMP_NUM_THREADS is just a suggested default for the number of threads to spin up. I was expecting a parallel region that explicitly requests more threads than that to have the extra threads spun up, which is the behavior I see on other OMP runtimes. That wasn't happening here. The fix was to explicitly call omp_set_num_threads(N) at startup, before creating a parallel region with num_threads(N).
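For anyone hitting the same thing, here is a minimal sketch of the pattern described above. It is not the code from the actual changeset: `spawn_team` is a hypothetical helper, and the `_OPENMP` guard is only there so the snippet also builds without OpenMP. It calls omp_set_num_threads(N) first, opens a parallel region with num_threads(N), and reports the team size the runtime really created so the caller can compare it against N.

```cpp
#include <cassert>
#if defined(_OPENMP)
#include <omp.h>
#endif

// Hypothetical helper (not taken from the UPC++ test sources).
// Requests a team of `n` threads and returns the size of the team that
// the OpenMP runtime actually created for the parallel region.
int spawn_team(int n) {
  assert(n >= 1);
  int team_size = 1;  // what a build without OpenMP would observe
#if defined(_OPENMP)
  // Set the default team size explicitly instead of relying on how the
  // runtime interprets the OMP_NUM_THREADS environment variable.
  omp_set_num_threads(n);
  #pragma omp parallel num_threads(n)
  {
    // One thread records the team size; the others wait at the implicit
    // barrier that ends the single construct.
    #pragma omp single
    team_size = omp_get_num_threads();
  }
#endif
  return team_size;
}
```

A runtime is still allowed to hand back fewer threads than requested (for example because of a thread limit), so code that needs exactly N distinct threads should probably compare the return value against N and bail out early rather than assume it.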
- 2018-01-09T01:01:31+00:00
john bachan
- changed status to resolved
Fixed issue 93. It appears that OMP runtimes interpret the OMP_NUM_THREADS env var differently. This fix calls omp_set_num_threads at startup to ensure that parallel regions requiring N threads get N distinct threads, even when N exceeds OMP_NUM_THREADS.

→ <<cset f3701ac78ccf>>
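One way to picture the failure mode this fix guards against is sketched below. It is an assumption that the test keeps one handle per virtual rank and that slots for threads that were never created stay null; the thread only shows the resulting call on a NULL `this` pointer. `Mailbox` and `missing_vranks` are hypothetical names standing in for whatever the test really uses.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a per-thread handle such as a persona pointer.
struct Mailbox {
  int owner;
};

// Returns the indices of virtual ranks whose slot was never registered,
// i.e. is still null. A non-empty result means fewer threads started than
// the program expected, and sending to those ranks would dereference null.
std::vector<std::size_t> missing_vranks(const std::vector<const Mailbox*>& slots) {
  std::vector<std::size_t> missing;
  for (std::size_t i = 0; i < slots.size(); ++i) {
    if (slots[i] == nullptr) {
      missing.push_back(i);
    }
  }
  return missing;
}
```

Running a check like this once after the parallel region starts would turn the SEGV into an immediate, readable error.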
- 2018-01-09T01:07:52+00:00
Dan Bonachea
Fix confirmed by nightly tests.

Thanks John!
- 2018-01-10T02:18:25+00:00
john bachan
- assigned issue to john bachan
- 2018-01-18T21:28:14+00:00

Assignee: john bachan

Type: bug

Priority: major

Status: resolved

Component: Runtime

Milestone: 2017.12.31 release

Version: Development Branch

Votes: 0

Watchers: 3
