lpc_barrier test fails with gcc 5.3.0 on Cori with 2 ranks
This is on Cori on the develop branch.
Here is the output of run-tests:
UPCXX revision: upcxx-2018.3.3-21-g6940e10
System: Linux cori02 4.4.114-94.11-default #1 SMP Thu Feb 1 19:28:26 UTC 2018 (4309ff9) x86_64 x86_64 x86_64 GNU/Linux
LSB Version: n/a
Distributor ID: SUSE
Description: SUSE Linux Enterprise Server 12 SP3
Release: 12.3
Codename: n/a
Date: Wed Jul 18 14:47:47 PDT 2018
Current directory: /tmp/kamil/upcxx-origin
Install directory:
Settings:
Checking platform...
WARNING: To build for Cray XC compute nodes, you should set the CROSS variable (e.g. CROSS=cray-aries-slurm)
/opt/cray/pe/craype/2.5.14/bin/CC
g++ (GCC) 5.3.0 20151204 (Cray Inc.)
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
/opt/cray/pe/craype/2.5.14/bin/cc
gcc (GCC) 5.3.0 20151204 (Cray Inc.)
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Using udp conduit (set environment variable CONDUIT=<udp|smp|ibv|aries> to change)
Running tests on 2 ranks
Setting up upcxx... (this may take a while)
Running test atomics.cpp
Test result: SUCCESS
Test result: SUCCESS
Running test collectives.cpp
Test result: SUCCESS
Test result: SUCCESS
Running test dist_object.cpp
Test result: SUCCESS
Test result: SUCCESS
Running test future.cpp
Test result: SUCCESS
Running test local_team.cpp
Test result: SUCCESS
Test result: SUCCESS
Running test lpc_barrier.cpp
Test failed, the trace can be found in test/run-tests.err
The contents of run-tests.err:
CC -std=c++11 -D_GNU_SOURCE=1 -I/tmp/kamil/upcxx-origin/.nobs/art/d1bd407ca7d1e3d14f387712d24ea9d108918227 -DNOBS_DISCOVERY -MM -MT x /tmp/kamil/upcxx-origin/test/lpc_barrier.cpp
CC -std=c++11 -D_GNU_SOURCE=1 -I/tmp/kamil/upcxx-origin/.nobs/art/d1bd407ca7d1e3d14f387712d24ea9d108918227 -DUPCXX_ASSERT_ENABLED=1 -DUPCXX_MPSC_QUEUE_ATOMIC=1 -DNOBS_DISCOVERY -MM -MT x /tmp/kamil/upcxx-origin/test/lpc_barrier.cpp
CC -std=c++11 -D_GNU_SOURCE=1 -I/tmp/kamil/upcxx-origin/.nobs/art/b682178a24687e2ca9bc78fabecfaa7f054a890b -DUPCXX_ASSERT_ENABLED=1 -DUPCXX_MPSC_QUEUE_ATOMIC=1 -DNOBS_DISCOVERY -MM -MT x /tmp/kamil/upcxx-origin/src/persona.cpp
CC -std=c++11 -D_GNU_SOURCE=1 -I/tmp/kamil/upcxx-origin/.nobs/art/b682178a24687e2ca9bc78fabecfaa7f054a890b -DUPCXX_ASSERT_ENABLED=1 -DUPCXX_MPSC_QUEUE_ATOMIC=1 -O0 -g -Wall -c /tmp/kamil/upcxx-origin/src/persona.cpp -o /tmp/kamil/upcxx-origin/.nobs/art/aafb32a422eee789c4cc1dd2180099ac57b8d81c.persona.cpp.o
CC -std=c++11 -D_GNU_SOURCE=1 -I/tmp/kamil/upcxx-origin/.nobs/art/d1bd407ca7d1e3d14f387712d24ea9d108918227 -DUPCXX_ASSERT_ENABLED=1 -DUPCXX_MPSC_QUEUE_ATOMIC=1 -O0 -g -Wall -c /tmp/kamil/upcxx-origin/test/lpc_barrier.cpp -o /tmp/kamil/upcxx-origin/.nobs/art/b840b83f239bfa4e194949f0fa2936a0a2293234.lpc_barrier.cpp.o
CC -o /tmp/kamil/upcxx-origin/.nobs/art/554ac8a3fe08a138203fac20c2f99d0941fcf872.x /tmp/kamil/upcxx-origin/.nobs/art/28c4424bc41a9df509f35c7b9afc076a470fff29.diagnostic.cpp.o /tmp/kamil/upcxx-origin/.nobs/art/b6858c629e8a56e4a5e33274aa9f1afc2395e5a7.core.cpp.o /tmp/kamil/upcxx-origin/.nobs/art/aafb32a422eee789c4cc1dd2180099ac57b8d81c.persona.cpp.o /tmp/kamil/upcxx-origin/.nobs/art/b840b83f239bfa4e194949f0fa2936a0a2293234.lpc_barrier.cpp.o -pthread
Test: lpc_barrier.cpp
Barrier 0
Barrier 1
Barrier 2
Barrier 3
Barrier 4
Barrier 5
Barrier 6
Barrier 7
Barrier 8
Barrier 9
0: from left
8: from left
1: from left
3: from left
6: from left
5: from left
4: from left
2: from left
9: from left
7: from left
Eyeball me! No 'rights' before this message, no 'lefts' after.
7: from right
6: from right
8: from right
9: from right
5: from right
4: from right
0: from right
3: from right
2: from right
1: from right
sourceme: line 32: 30811 Segmentation fault (core dumped) "${NOBS_PATH}/tool.py" "$@"
Output of module list:
1) modules/3.2.10.6
2) nsg/1.2.0
3) gcc/5.3.0
4) craype-haswell
5) craype-network-aries
6) craype/2.5.14
7) cray-mpich/7.7.0
8) altd/2.0
9) darshan/3.1.4
10) cray-libsci/18.03.1
11) udreg/2.3.2-6.0.5.0_13.12__ga14955a.ari
12) ugni/6.0.14-6.0.5.0_16.9__g19583bb.ari
13) pmi/5.0.13
14) dmapp/7.1.1-6.0.5.0_49.8__g1125556.ari
15) gni-headers/5.0.12-6.0.5.0_2.15__g2ef1ebc.ari
16) xpmem/2.2.4-6.0.5.1_8.18__g35d5e73.ari
17) job/2.2.2-6.0.5.0_8.47__g3c644b5.ari
18) dvs/2.7_2.2.65-6.0.5.2_16.2__gbec2cb0
19) alps/6.5.28-6.0.5.0_18.6__g13a91b6.ari
20) rca/2.2.16-6.0.5.0_15.34__g5e09e6d.ari
21) atp/2.1.1
22) PrgEnv-gnu/6.0.4
I do not run into this with gcc 6.1.0 or gcc 7.3.0.
Comments (9)
- reporter: I get the same results with CROSS=cray-aries-slurm CONDUIT=aries. No backtrace with GASNET_BACKTRACE=1.
- To recreate the environment, do I just module swap gcc/<current> with gcc/5.3.0? Having tried that, lpc_barrier ran fine for me using the Aries conduit.
- reporter: The following replicates this with the latest develop and PrgEnv-intel:
$ export CROSS=cray-aries-slurm CONDUIT=aries RANKS=2
$ module load gcc/5.3.0
$ ./run-tests
- As I mentioned on the call today, it is possible this issue is related to one I have yet to enter into the tracker:
On Theta (Cray XC40 at ALCF) and Titan (Cray XK series at OLCF) I have recently (past 5 days) seen several tests getting a SEGV below std::thread::~thread with gcc-5.x's libstdc++, both via PrgEnv-gnu and PrgEnv-intel. On Titan, I see the problem with all GCC versions installed. I have seen SEGV from lpc_barrier in automated testing on those systems, but do not have any backtrace for it.
However, here is what is seen for one of the uts variants:
[7] #8  <signal handler called>
[7] #9  0x0000000000000000 in ?? ()
[7] #10 0x0000000000479979 in __gthread_equal (__t1=0, __t2=0) at /opt/gcc/5.3.0/snos/include/g++/x86_64-suse-linux/bits/gthr-default.h:680
[7] #11 0x000000000047ea3f in std::operator== (__x=..., __y=...) at /opt/gcc/5.3.0/snos/include/g++/thread:84
[7] #12 0x000000000047ea95 in std::thread::joinable (this=0xcf2d80) at /opt/gcc/5.3.0/snos/include/g++/thread:170
[7] #13 0x000000000047ea5e in std::thread::~thread (this=0xcf2d80, __in_chrg=<optimized out>) at /opt/gcc/5.3.0/snos/include/g++/thread:150
[7] #14 0x000000000047a9b5 in vranks::spawn<main()::<lambda(int, int)> >(<lambda(int, int)>) (fn=...) at /lustre/atlas2/csc296/scratch/hargrove/upcnightly-titan/EX-titan-gemini-gcc/runtime/work/dbg/upcxx/test/uts/vranks_hybrid.hpp:74
[7] #15 0x0000000000479b1a in main () at /lustre/atlas2/csc296/scratch/hargrove/upcnightly-titan/EX-titan-gemini-gcc/runtime/work/dbg/upcxx/test/uts/uts.cpp:54
Dan commented earlier:
Also setting GASNET_BACKTRACE=1 might generate a backtrace, unless the crash is actually coming from the ssh commands used for spawning (which seems possible given the normal GEX fatal signal output is missing).
However, lpc_barrier is a non-GASNet test, so capturing a core file is required to get the backtrace. I have done so just now interactively on Titan, where the problem is present even with gcc-7.3.0 (though now appearing in join):
Core was generated by `./lpc_barrier-par'.
Program terminated with signal 11, Segmentation fault.
#0  0x0000000000000000 in ?? ()
(gdb) where
#0  0x0000000000000000 in ?? ()
#1  0x000000000042cda3 in __gthread_join (__value_ptr=0x0, __threadid=<optimized out>) at /b/tmp/peint/build-cray-gcc-20180126.202153.829775000/cray-gcc/BUILD/snos_objdir/x86_64-suse-linux/libstdc++-v3/include/x86_64-suse-linux/bits/gthr-default.h:668
#2  std::thread::join (this=0x7b36e0) at ../../../../../cray-gcc-7.3.0-201801270210.d61239fc6000b/libstdc++-v3/src/c++11/thread.cc:136
#3  0x0000000000401d90 in main () at /lustre/atlas2/csc296/scratch/hargrove/upcnightly-titan/EX-titan-gemini-gcc/runtime/work/dbg/upcxx/test/lpc_barrier.cpp:164
I anticipate entering this all again in another issue, which will focus specifically on the possibility that this is a GCC library bug and will contain more info on the versions tested. However, in the meantime I would appreciate it if @jdbachan and @akamil could double-check the test code using std::thread. Dan and I both had a look and found nothing obviously wrong.
- I have yet to complete a new issue with the details, but I have confirmed that the C++ code below (no UPC++) gets a SEGV at thread destruction time when compiled against libstdc++ from the gcc/5.3.0 module on Edison, Titan and Theta (Cori is in maintenance today).
So, I am resolving this issue as "invalid" (for lack of a better way to say "real bug, but not in our product").
FWIW, I am finding that gcc/7.3.0 is free of the problem on both Edison and Theta (but not Titan for some reason).
#include <atomic>
#include <iostream>
#include <thread>
#include <sched.h>

const int thread_n = 8;

int main() {
  std::atomic<int> setup_bar{0};

  auto thread_fn = [&](int me) {
    setup_bar.fetch_add(1);
    while(setup_bar.load(std::memory_order_relaxed) != thread_n)
      sched_yield();
  };

  std::thread* threads[thread_n];
  for(int t=1; t < thread_n; t++)
    threads[t] = new std::thread{thread_fn, t};
  thread_fn(0);

  for(int t=1; t < thread_n; t++) {
    threads[t]->join();
    delete threads[t];
  }

  std::cout << "Done.\n";
  return 0;
}
- changed status to invalid
Appears to be a genuine bug in libstdc++ from gcc/5.3.0 on the Crays. However, it is not a UPC++ bug.
- changed component to External
- changed status to wontfix
This has been confirmed to be a bug in the C++ threads implementation on Cray.
We are tracking the status of this external issue here: https://upc-bugs.lbl.gov/bugzilla/show_bug.cgi?id=3813
Note this is using udp-conduit, which should work but is not the preferred conduit here and is less carefully tested (in particular, it looks like this might be related to exit-time handling, which can sometimes be sensitive to spawning details). It's worth following the instructions and setting CROSS=cray-aries-slurm CONDUIT=aries to see if that affects the results. Also setting GASNET_BACKTRACE=1 might generate a backtrace, unless the crash is actually coming from the ssh commands used for spawning (which seems possible given the normal GEX fatal signal output is missing).
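The suggested settings can be collected into a short snippet (a sketch based on the comment above; run-tests is the script from the report and is assumed to be invoked from the UPC++ source tree):

```shell
# Settings suggested above (sketch; run from the upcxx source tree).
export CROSS=cray-aries-slurm   # cross-compile for Cray XC compute nodes
export CONDUIT=aries            # use the aries conduit instead of udp
export GASNET_BACKTRACE=1       # ask GASNet to print a backtrace on fatal signals
# ./run-tests                   # then re-run the test script
```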