uts_omp_ranks crash w/ clang on Linux/x86-64
The recently added uts_omp_ranks test is crashing on Dirac with clang-5.0.0.
Failures can be seen here
With an optimized build of GASNet one just sees a SEGV.
However, the debug build of GASNet is seeing assertion failures (from GASNet and upcxx runtime) prior to reaching a SEGV.
This appears to be exclusive to clang, since gcc on the same system did not fail.
Comments (8)
-
I would love to go digging into this, but I'm still not sure how to reproduce an environment just given a link to the crash page. I assume I've either missed or forgotten helpful emails and links.
-
The relevant configuration page is here, although that probably has both more and less information than you need.
The UPC++ CI build log is here
I was able to reproduce on dirac with:
SMP:
env CC="/usr/local/pkg/clang/5.0.0/bin/clang -Wno-unused-command-line-argument" CXX="/usr/local/pkg/openmpi-2.1.1/clang-5.0.0/bin/mpicxx -Wno-unused-command-line-argument" UPCXX_BACKEND=gasnetex_par DBGSYM=1 OPTLEV=0 nobs exe test/uts/uts_omp_ranks.cpp
IBV:
env CC="/usr/local/pkg/clang/5.0.0/bin/clang -Wno-unused-command-line-argument" CXX="/usr/local/pkg/openmpi-2.1.1/clang-5.0.0/bin/mpicxx -Wno-unused-command-line-argument" GASNET_CONDUIT=ibv UPCXX_BACKEND=gasnetex_par DBGSYM=1 OPTLEV=0 nobs exe test/uts/uts_omp_ranks.cpp
-
@jbachan reminder: "on dirac" means the systems "pcp-d-5" and "pcp-d-6", reachable from n2001.
-
Found the problem. This is either a bug in the OpenMP runtime or a misunderstanding on my part of the OMP_NUM_THREADS env var. The wording at https://gcc.gnu.org/onlinedocs/libgomp/OMP_005fNUM_005fTHREADS.html seems to imply that OMP_NUM_THREADS is just a suggested default for the number of threads to spin up. I was expecting a parallel region that explicitly requests more threads than that to have the extra threads spun up, which is the behavior I see on other OpenMP runtimes, but that wasn't happening here. The fix was to explicitly call omp_set_num_threads(N) at startup, before creating a parallel region with num_threads(N).
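For anyone wanting to check this behavior on their own runtime, here is a minimal sketch of the workaround, not the actual change in the test. The helper name request_team is hypothetical, and the `_OPENMP` guards are only there so the snippet still builds when OpenMP is disabled. It calls omp_set_num_threads(n) first, then opens a parallel region with num_threads(n) and reports how many threads the runtime really delivered.

```cpp
#include <cassert>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper (not from the UPC++ sources): asks for a team of n
// threads and returns the team size the runtime actually provided.
int request_team(int n) {
  assert(n >= 1);
#ifdef _OPENMP
  // The workaround from this issue: set the thread count explicitly
  // before entering the region, rather than relying on num_threads(n)
  // alone to exceed OMP_NUM_THREADS.
  omp_set_num_threads(n);
#endif
  int team = 1;  // stays 1 if the code is compiled without OpenMP
  #pragma omp parallel num_threads(n)
  {
    // Exactly one thread records the team size; the implicit barrier at
    // the end of 'single' makes the write visible after the region.
    #pragma omp single
    {
#ifdef _OPENMP
      team = omp_get_num_threads();
#endif
    }
  }
  return team;
}
```

Comparing request_team(N) against N with OMP_NUM_THREADS set below N should show whether a given runtime hands out the extra threads. A result smaller than N would mean code that assumes N distinct threads is unsafe there.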
-
- changed status to resolved
Fixed issue 93. It appears that OMP runtimes interpret OMP_NUM_THREADS env var differently. This fix calls omp_set_num_threads at startup to ensure parallel regions requiring N threads get N distinct threads, even when N exceeds OMP_NUM_THREADS.
→ <<cset f3701ac78ccf>>
-
Fix confirmed by nightly tests.
Thanks John!
-
This has been failing consistently every night for the past 6 weeks.
Based on 11/27 results, this can also manifest as a hang in debug mode after failing the same debug check:
Combining this with the evidence from comment #0, it seems likely there is a buffer overrun in UPC++ that results in random memory corruption, usually scribbling over a node index later passed to AMRequestMedium.
Here is a crash stack from a single node with one OMP thread, revealing an invocation of upcxx::persona::active_with_caller with a NULL this pointer: