lpc_barrier test fails with gcc 5.3.0 on Cori with 2 ranks

Issue #157 wontfix
Amir Kamil created an issue

This is on Cori on the develop branch.

Here is the output of run-tests:

UPCXX revision: upcxx-2018.3.3-21-g6940e10
System: Linux cori02 4.4.114-94.11-default #1 SMP Thu Feb 1 19:28:26 UTC 2018 (4309ff9) x86_64 x86_64 x86_64 GNU/Linux
LSB Version:    n/a
Distributor ID: SUSE
Description:    SUSE Linux Enterprise Server 12 SP3
Release:    12.3
Codename:   n/a

Date: Wed Jul 18 14:47:47 PDT 2018
Current directory: /tmp/kamil/upcxx-origin
Install directory:
Settings:

Checking platform...
WARNING: To build for Cray XC compute nodes, you should set the CROSS variable (e.g. CROSS=cray-aries-slurm)
/opt/cray/pe/craype/2.5.14/bin/CC
g++ (GCC) 5.3.0 20151204 (Cray Inc.)
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

/opt/cray/pe/craype/2.5.14/bin/cc
gcc (GCC) 5.3.0 20151204 (Cray Inc.)
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.


Using udp conduit (set environment variable CONDUIT=<udp|smp|ibv|aries> to change)
Running tests on 2 ranks
Setting up upcxx... (this may take a while)
Running test atomics.cpp
Test result: SUCCESS
Test result: SUCCESS
Running test collectives.cpp
Test result: SUCCESS
Test result: SUCCESS
Running test dist_object.cpp
Test result: SUCCESS
Test result: SUCCESS
Running test future.cpp
Test result: SUCCESS
Running test local_team.cpp
Test result: SUCCESS
Test result: SUCCESS
Running test lpc_barrier.cpp
Test failed, the trace can be found in test/run-tests.err

The contents of run-tests.err:

CC -std=c++11 -D_GNU_SOURCE=1 -I/tmp/kamil/upcxx-origin/.nobs/art/d1bd407ca7d1e3d14f387712d24ea9d108918227 -DNOBS_DISCOVERY -MM -MT x /tmp/kamil/upcxx-origin/test/lpc_barrier.cpp

CC -std=c++11 -D_GNU_SOURCE=1 -I/tmp/kamil/upcxx-origin/.nobs/art/d1bd407ca7d1e3d14f387712d24ea9d108918227 -DUPCXX_ASSERT_ENABLED=1 -DUPCXX_MPSC_QUEUE_ATOMIC=1 -DNOBS_DISCOVERY -MM -MT x /tmp/kamil/upcxx-origin/test/lpc_barrier.cpp

CC -std=c++11 -D_GNU_SOURCE=1 -I/tmp/kamil/upcxx-origin/.nobs/art/b682178a24687e2ca9bc78fabecfaa7f054a890b -DUPCXX_ASSERT_ENABLED=1 -DUPCXX_MPSC_QUEUE_ATOMIC=1 -DNOBS_DISCOVERY -MM -MT x /tmp/kamil/upcxx-origin/src/persona.cpp

CC -std=c++11 -D_GNU_SOURCE=1 -I/tmp/kamil/upcxx-origin/.nobs/art/b682178a24687e2ca9bc78fabecfaa7f054a890b -DUPCXX_ASSERT_ENABLED=1 -DUPCXX_MPSC_QUEUE_ATOMIC=1 -O0 -g -Wall -c /tmp/kamil/upcxx-origin/src/persona.cpp -o /tmp/kamil/upcxx-origin/.nobs/art/aafb32a422eee789c4cc1dd2180099ac57b8d81c.persona.cpp.o

CC -std=c++11 -D_GNU_SOURCE=1 -I/tmp/kamil/upcxx-origin/.nobs/art/d1bd407ca7d1e3d14f387712d24ea9d108918227 -DUPCXX_ASSERT_ENABLED=1 -DUPCXX_MPSC_QUEUE_ATOMIC=1 -O0 -g -Wall -c /tmp/kamil/upcxx-origin/test/lpc_barrier.cpp -o /tmp/kamil/upcxx-origin/.nobs/art/b840b83f239bfa4e194949f0fa2936a0a2293234.lpc_barrier.cpp.o

CC -o /tmp/kamil/upcxx-origin/.nobs/art/554ac8a3fe08a138203fac20c2f99d0941fcf872.x /tmp/kamil/upcxx-origin/.nobs/art/28c4424bc41a9df509f35c7b9afc076a470fff29.diagnostic.cpp.o /tmp/kamil/upcxx-origin/.nobs/art/b6858c629e8a56e4a5e33274aa9f1afc2395e5a7.core.cpp.o /tmp/kamil/upcxx-origin/.nobs/art/aafb32a422eee789c4cc1dd2180099ac57b8d81c.persona.cpp.o /tmp/kamil/upcxx-origin/.nobs/art/b840b83f239bfa4e194949f0fa2936a0a2293234.lpc_barrier.cpp.o -pthread

Test: lpc_barrier.cpp
Barrier 0
Barrier 1
Barrier 2
Barrier 3
Barrier 4
Barrier 5
Barrier 6
Barrier 7
Barrier 8
Barrier 9
0: from left
8: from left
1: from left
3: from left
6: from left
5: from left
4: from left
2: from left
9: from left
7: from left
Eyeball me! No 'rights' before this message, no 'lefts' after.
7: from right
6: from right
8: from right
9: from right
5: from right
4: from right
0: from right
3: from right
2: from right
1: from right
sourceme: line 32: 30811 Segmentation fault      (core dumped) "${NOBS_PATH}/tool.py" "$@"

Output of module list:

  1) modules/3.2.10.6
  2) nsg/1.2.0
  3) gcc/5.3.0
  4) craype-haswell
  5) craype-network-aries
  6) craype/2.5.14
  7) cray-mpich/7.7.0
  8) altd/2.0
  9) darshan/3.1.4
 10) cray-libsci/18.03.1
 11) udreg/2.3.2-6.0.5.0_13.12__ga14955a.ari
 12) ugni/6.0.14-6.0.5.0_16.9__g19583bb.ari
 13) pmi/5.0.13
 14) dmapp/7.1.1-6.0.5.0_49.8__g1125556.ari
 15) gni-headers/5.0.12-6.0.5.0_2.15__g2ef1ebc.ari
 16) xpmem/2.2.4-6.0.5.1_8.18__g35d5e73.ari
 17) job/2.2.2-6.0.5.0_8.47__g3c644b5.ari
 18) dvs/2.7_2.2.65-6.0.5.2_16.2__gbec2cb0
 19) alps/6.5.28-6.0.5.0_18.6__g13a91b6.ari
 20) rca/2.2.16-6.0.5.0_15.34__g5e09e6d.ari
 21) atp/2.1.1
 22) PrgEnv-gnu/6.0.4

I do not run into this with gcc 6.1.0 or gcc 7.3.0.

Comments (9)

  1. Dan Bonachea

    Note this is using udp-conduit, which should work but is not the preferred conduit here and is less carefully tested. In particular, it looks like this might be related to exit-time handling, which can sometimes be sensitive to spawning details.

    It's worth following the instructions and setting CROSS=cray-aries-slurm CONDUIT=aries to see if that affects the results. Also setting GASNET_BACKTRACE=1 might generate a backtrace, unless the crash is actually coming from the ssh commands used for spawning (which seems possible given the normal GEX fatal signal output is missing).
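
    Concretely, the suggested re-run would set the variables named above (values straight from the install instructions and this comment):

    ```shell
    export CROSS=cray-aries-slurm    # cross-configure for XC compute nodes
    export CONDUIT=aries             # use the native Aries conduit
    export GASNET_BACKTRACE=1        # ask GASNet to print a backtrace on fatal signals
    ```

    followed by re-running ./run-tests as before.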

  2. Amir Kamil reporter

    I get the same results with CROSS=cray-aries-slurm CONDUIT=aries. No backtrace with GASNET_BACKTRACE=1.

  3. john bachan

    To recreate the environment, do I just module swap gcc/<current> with gcc/5.3.0?

    Having tried that, lpc_barrier ran fine for me using the aries conduit.

  4. Amir Kamil reporter

    The following replicates this with the latest develop and PrgEnv-intel:

    $ export CROSS=cray-aries-slurm CONDUIT=aries RANKS=2
    $ module load gcc/5.3.0
    $ ./run-tests
    
  5. Paul Hargrove

    As I mentioned on the call today, it is possible this issue is related to one I have yet to enter into the tracker:

    On Theta (Cray XC40 at ALCF) and Titan (Cray XK series at OLCF) I have recently (past 5 days) seen several tests getting a SEGV below std::thread::~thread with gcc-5.x's libstdc++, both via PrgEnv-gnu and PrgEnv-intel.
    On Titan, I see the problem with all GCC versions installed.

    I have seen SEGV from lpc_barrier in automated testing on those systems, but do not have any backtrace for it.
    However, here is what is seen for one of the uts variants:

    [7] #8  <signal handler called>
    [7] #9  0x0000000000000000 in ?? ()
    [7] #10 0x0000000000479979 in __gthread_equal (__t1=0, __t2=0) at /opt/gcc/5.3.0/snos/include/g++/x86_64-suse-linux/bits/gthr-default.h:680
    [7] #11 0x000000000047ea3f in std::operator== (__x=..., __y=...) at /opt/gcc/5.3.0/snos/include/g++/thread:84
    [7] #12 0x000000000047ea95 in std::thread::joinable (this=0xcf2d80) at /opt/gcc/5.3.0/snos/include/g++/thread:170
    [7] #13 0x000000000047ea5e in std::thread::~thread (this=0xcf2d80, __in_chrg=<optimized out>) at /opt/gcc/5.3.0/snos/include/g++/thread:150
    [7] #14 0x000000000047a9b5 in vranks::spawn<main()::<lambda(int, int)> >(<lambda(int, int)>) (fn=...) at /lustre/atlas2/csc296/scratch/hargrove/upcnightly-titan/EX-titan-gemini-gcc/runtime/work/dbg/upcxx/test/uts/vranks_hybrid.hpp:74
    [7] #15 0x0000000000479b1a in main () at /lustre/atlas2/csc296/scratch/hargrove/upcnightly-titan/EX-titan-gemini-gcc/runtime/work/dbg/upcxx/test/uts/uts.cpp:54
    

    Dan commented earlier:

    Also setting GASNET_BACKTRACE=1 might generate a backtrace, unless the crash is actually coming from the ssh commands used for spawning (which seems possible given the normal GEX fatal signal output is missing).

    However, lpc_barrier is a non-GASNet test. So, capturing a core file is required to get the backtrace. I have done so just now interactively on Titan, where the problem is present even with gcc-7.3.0 (though now appearing in join):

    Core was generated by `./lpc_barrier-par'.
    Program terminated with signal 11, Segmentation fault.
    #0  0x0000000000000000 in ?? ()
    (gdb) where
    #0  0x0000000000000000 in ?? ()
    #1  0x000000000042cda3 in __gthread_join (__value_ptr=0x0, __threadid=<optimized out>)
        at /b/tmp/peint/build-cray-gcc-20180126.202153.829775000/cray-gcc/BUILD/snos_objdir/x86_64-suse-linux/libstdc++-v3/include/x86_64-suse-linux/bits/gthr-default.h:668
    #2  std::thread::join (this=0x7b36e0)
        at ../../../../../cray-gcc-7.3.0-201801270210.d61239fc6000b/libstdc++-v3/src/c++11/thread.cc:136
    #3  0x0000000000401d90 in main ()
        at /lustre/atlas2/csc296/scratch/hargrove/upcnightly-titan/EX-titan-gemini-gcc/runtime/work/dbg/upcxx/test/lpc_barrier.cpp:164
    

    I anticipate entering this all again in another issue, which will focus specifically on the possibility that this is a GCC library bug and will contain more info on the versions tested. In the meantime, however, I would appreciate it if @jdbachan and @akamil could double-check the test code's use of std::thread. Dan and I both had a look and found nothing obviously wrong.
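
    For reference, the lifetime rule a std::thread user must satisfy is narrow: each thread object must be join()ed or detach()ed exactly once before its destructor runs, or the destructor calls std::terminate(). A minimal sketch of that pattern (illustrative only, not the actual lpc_barrier code):

    ```cpp
    #include <iostream>
    #include <thread>
    #include <vector>

    int main() {
      std::vector<std::thread> workers;
      for (int t = 0; t < 4; t++)
        workers.emplace_back([t] { /* per-thread work would go here */ });

      // Every thread must be joined (or detached) exactly once before its
      // destructor runs; a still-joinable thread at destruction calls
      // std::terminate(). A crash *inside* join()/~thread() in code that
      // follows this rule points at the library, not the caller.
      for (auto &w : workers)
        w.join();

      std::cout << "all joined\n";
      return 0;
    }
    ```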

  6. Paul Hargrove

    I have yet to file a new issue with the details, but I have confirmed that the C++ code below (no UPC++) gets a SEGV at thread destruction time when compiled against libstdc++ from the gcc/5.3.0 module on Edison, Titan, and Theta (Cori is in maintenance today).

    So, I am resolving this issue as "invalid" (for lack of a better way to say "real bug, but not in our product").

    FWIW, I am finding that gcc/7.3.0 is free of the problem on both Edison and Theta (but not Titan for some reason).

    #include <atomic>
    #include <iostream>
    #include <thread>
    
    #include <sched.h>
    
    const int thread_n = 8;
    
    int main() {
      // Simple spin barrier: every thread increments the counter, then
      // spins until all thread_n threads (including main) have arrived.
      std::atomic<int> setup_bar{0};
      auto thread_fn = [&](int me) {
        setup_bar.fetch_add(1);
        while(setup_bar.load(std::memory_order_relaxed) != thread_n)
          sched_yield();
      };
    
      std::thread* threads[thread_n];
      for(int t=1; t < thread_n; t++)
        threads[t] = new std::thread{thread_fn, t};
      thread_fn(0);  // main thread participates as thread 0
    
      // The SEGV occurs here, in std::thread::join()/~thread() inside
      // libstdc++, not in the code above.
      for(int t=1; t < thread_n; t++) {
        threads[t]->join();
        delete threads[t];
      }
    
      std::cout << "Done.\n";
    
      return 0;
    }
    