lpc_barrier exit SEGV on cori

Issue #71 resolved
Dan Bonachea created an issue

run-tests is currently failing on cori because lpc_barrier gets a SIGSEGV on exit - see full trace below.

This test does not use GASNet, so the crash is conduit-independent.

Note it still has to be run inside an interactive batch job due to the cross-compilation.

{cori[1] ~/UPC/upcxx} git describe --always
b0a593a
{cori[1] ~/UPC/upcxx} module list
Currently Loaded Modulefiles:
  1) modules/3.2.10.6                               7) udreg/2.3.2-6.0.4.0_12.2__g2f9c3ee.ari        13) dvs/2.7_2.2.31-6.0.4.1_6.1__gb3b87e6          19) Base-opts/2.4.123-6.0.4.0_10.1__g6460790.ari
  2) gcc/6.3.0                                      8) ugni/6.0.14-6.0.4.0_14.1__ge7db4a2.ari        14) alps/6.4.1-6.0.4.0_7.2__g86d0f3d.ari          20) cray-libsci/17.06.1
  3) craype-haswell                                 9) dmapp/7.1.1-6.0.4.0_46.2__gb8abda2.ari        15) rca/2.2.11-6.0.4.0_13.2__g84de67a.ari         21) pmi/5.0.12
  4) craype-network-aries                          10) gni-headers/5.0.11-6.0.4.0_7.2__g7136988.ari  16) cray-shmem/7.6.0                              22) atp/2.1.1
  5) craype/2.5.12                                 11) xpmem/2.2.2-6.0.4.0_3.1__g43b0535.ari         17) bupc/2.26.0                                   23) PrgEnv-gnu/6.0.4
  6) cray-mpich/7.6.0                              12) job/2.2.2-6.0.4.0_8.2__g3c644b5.ari           18) git/2.9.1
{cori[1] ~/UPC/upcxx} rm -Rf .nobs ; env DBGSYM=1 OPTLEV=0 nobs run test/lpc_barrier.cpp   
CC -std=c++11 -D_GNU_SOURCE=1 -I/global/homes/b/bonachea/UPC/upcxx/.nobs/art/107df87196fc3e462886e41ee307b7343d725152 -MM -MT x /global/u1/b/bonachea/UPC/upcxx/test/lpc_barrier.cpp

CC -std=c++11 -D_GNU_SOURCE=1 -I/global/homes/b/bonachea/UPC/upcxx/.nobs/art/107df87196fc3e462886e41ee307b7343d725152 -O0 -g -Wall -c /global/u1/b/bonachea/UPC/upcxx/test/lpc_barrier.cpp -o /global/homes/b/bonachea/UPC/upcxx/.nobs/art/d3b9d0c421c5f114f489f9893c296fdc40d290e1.lpc_barrier.cpp.o

CC -std=c++11 -D_GNU_SOURCE=1 -I/global/homes/b/bonachea/UPC/upcxx/.nobs/art/7244447ed1b899e12e9515320f25dae776bd0b84 -MM -MT x /global/u1/b/bonachea/UPC/upcxx/src/persona.cpp

CC -std=c++11 -D_GNU_SOURCE=1 -I/global/homes/b/bonachea/UPC/upcxx/.nobs/art/7244447ed1b899e12e9515320f25dae776bd0b84 -MM -MT x /global/u1/b/bonachea/UPC/upcxx/src/future/core.cpp

CC -std=c++11 -D_GNU_SOURCE=1 -I/global/homes/b/bonachea/UPC/upcxx/.nobs/art/7244447ed1b899e12e9515320f25dae776bd0b84 -MM -MT x /global/u1/b/bonachea/UPC/upcxx/src/diagnostic.cpp

CC -std=c++11 -D_GNU_SOURCE=1 -I/global/homes/b/bonachea/UPC/upcxx/.nobs/art/7244447ed1b899e12e9515320f25dae776bd0b84 -MM -MT x /global/u1/b/bonachea/UPC/upcxx/src/lpc_inbox.cpp

CC -std=c++11 -D_GNU_SOURCE=1 -I/global/homes/b/bonachea/UPC/upcxx/.nobs/art/7244447ed1b899e12e9515320f25dae776bd0b84 -O0 -g -Wall -c /global/u1/b/bonachea/UPC/upcxx/src/future/core.cpp -o /global/homes/b/bonachea/UPC/upcxx/.nobs/art/7d6d5014fd14fef7f9ef9093a8406acd222193cb.core.cpp.o

CC -std=c++11 -D_GNU_SOURCE=1 -I/global/homes/b/bonachea/UPC/upcxx/.nobs/art/7244447ed1b899e12e9515320f25dae776bd0b84 -O0 -g -Wall -c /global/u1/b/bonachea/UPC/upcxx/src/diagnostic.cpp -o /global/homes/b/bonachea/UPC/upcxx/.nobs/art/9750feb5cf2440c84d13bbe8af56862750cb6f76.diagnostic.cpp.o

CC -std=c++11 -D_GNU_SOURCE=1 -I/global/homes/b/bonachea/UPC/upcxx/.nobs/art/7244447ed1b899e12e9515320f25dae776bd0b84 -O0 -g -Wall -c /global/u1/b/bonachea/UPC/upcxx/src/persona.cpp -o /global/homes/b/bonachea/UPC/upcxx/.nobs/art/376b63e6156cd83ff358cf89777b2da6d109ee00.persona.cpp.o

CC -std=c++11 -D_GNU_SOURCE=1 -I/global/homes/b/bonachea/UPC/upcxx/.nobs/art/7244447ed1b899e12e9515320f25dae776bd0b84 -O0 -g -Wall -c /global/u1/b/bonachea/UPC/upcxx/src/lpc_inbox.cpp -o /global/homes/b/bonachea/UPC/upcxx/.nobs/art/bfd68c56514a2091909af823ad57e003e8e2b770.lpc_inbox.cpp.o

CC -o /global/homes/b/bonachea/UPC/upcxx/.nobs/art/5a333efd10c50cb6ba131ec0af87ef9e7d84a503.x /global/homes/b/bonachea/UPC/upcxx/.nobs/art/9750feb5cf2440c84d13bbe8af56862750cb6f76.diagnostic.cpp.o /global/homes/b/bonachea/UPC/upcxx/.nobs/art/7d6d5014fd14fef7f9ef9093a8406acd222193cb.core.cpp.o /global/homes/b/bonachea/UPC/upcxx/.nobs/art/bfd68c56514a2091909af823ad57e003e8e2b770.lpc_inbox.cpp.o /global/homes/b/bonachea/UPC/upcxx/.nobs/art/376b63e6156cd83ff358cf89777b2da6d109ee00.persona.cpp.o /global/homes/b/bonachea/UPC/upcxx/.nobs/art/d3b9d0c421c5f114f489f9893c296fdc40d290e1.lpc_barrier.cpp.o -lpthread

Test: lpc_barrier.cpp
Barrier 0
Barrier 1
Barrier 2
Barrier 3
Barrier 4
Barrier 5
Barrier 6
Barrier 7
Barrier 8
Barrier 9
0: from left
2: from left
3: from left
7: from left
6: from left
5: from left
4: from left
1: from left
8: from left
9: from left
Eyeball me! No 'rights' before this message, no 'lefts' after.
9: from right
5: from right
8: from right
7: from right
6: from right
3: from right
0: from right
4: from right
2: from right
1: from right
Segmentation fault

{cori[1] ~/UPC/upcxx} srun -N 1 /global/homes/b/bonachea/UPC/upcxx/.nobs/art/5a333efd10c50cb6ba131ec0af87ef9e7d84a503.x
Test: lpc_barrier.cpp
Barrier 0
Barrier 1
Barrier 2
Barrier 3
Barrier 4
Barrier 5
Barrier 6
Barrier 7
Barrier 8
Barrier 9
2: from left
0: from left
8: from left
1: from left
6: from left
7: from left
9: from left
4: from left
5: from left
3: from left
Eyeball me! No 'rights' before this message, no 'lefts' after.
5: from right
0: from right
2: from right
6: from right
7: from right
9: from right
8: from right
1: from right
4: from right
3: from right
srun: error: nid00063: task 0: Segmentation fault
srun: Terminating job step 7054140.4

{cori[1] ~/UPC/upcxx} gdb /global/homes/b/bonachea/UPC/upcxx/.nobs/art/5a333efd10c50cb6ba131ec0af87ef9e7d84a503.x
GNU gdb (GDB; SUSE Linux Enterprise 12) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-suse-linux".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://bugs.opensuse.org/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /global/homes/b/bonachea/UPC/upcxx/.nobs/art/5a333efd10c50cb6ba131ec0af87ef9e7d84a503.x...done.
(gdb) r
Starting program: /global/u1/b/bonachea/UPC/upcxx/.nobs/art/5a333efd10c50cb6ba131ec0af87ef9e7d84a503.x 
## Loaded 'bupc/2.26.0-6.0.4-gnu-6.3.0' based on currently loaded modules.
## If you change PrgEnv, craype or compiler modules then you must
## run 'module switch bupc bupc' to get the correct bupc module.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Test: lpc_barrier.cpp
[New Thread 0x7ffff7ff8700 (LWP 48503)]
[New Thread 0x7ffff77f7700 (LWP 48504)]
[New Thread 0x7ffff6ff6700 (LWP 48505)]
[New Thread 0x7ffff67f5700 (LWP 48506)]
[New Thread 0x7ffff5ff4700 (LWP 48507)]
[New Thread 0x7ffff57f3700 (LWP 48508)]
[New Thread 0x7ffff4ff2700 (LWP 48509)]
[New Thread 0x7fffd7fff700 (LWP 48510)]
[New Thread 0x7fffd77fe700 (LWP 48511)]
Barrier 0
Barrier 1
Barrier 2
Barrier 3
Barrier 4
Barrier 5
Barrier 6
Barrier 7
Barrier 8
Barrier 9
9: from left
5: from left
2: from left
7: from left
3: from left
8: from left
6: from left
4: from left
0: from left
1: from left
Eyeball me! No 'rights' before this message, no 'lefts' after.
0: from right
8: from right
1: from right
3: from right
7: from right
6: from right
2: from right
4: from right
5: from right
9: from right
[Thread 0x7fffd77fe700 (LWP 48511) exited]
[Thread 0x7fffd7fff700 (LWP 48510) exited]
[Thread 0x7ffff4ff2700 (LWP 48509) exited]
[Thread 0x7ffff57f3700 (LWP 48508) exited]
[Thread 0x7ffff5ff4700 (LWP 48507) exited]
[Thread 0x7ffff67f5700 (LWP 48506) exited]
[Thread 0x7ffff6ff6700 (LWP 48505) exited]
[Thread 0x7ffff77f7700 (LWP 48504) exited]
[Thread 0x7ffff7ff8700 (LWP 48503) exited]

Thread 1 "5a333efd10c50cb" received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb) where
#0  0x0000000000000000 in ?? ()
#1  0x000000000041a997 in __gthread_join (__value_ptr=0x0, __threadid=<optimized out>) at /tmp/peint/cray-gcc/BUILD/snos_objdir/x86_64-suse-linux/libstdc++-v3/include/x86_64-suse-linux/bits/gthr-default.h:668
#2  std::thread::join (this=0x7c20a0) at ../../../../../cray-gcc-6.3.0-201701050407.93fe37becc347/libstdc++-v3/src/c++11/thread.cc:136
#3  0x000000000040425f in main () at /global/u1/b/bonachea/UPC/upcxx/test/lpc_barrier.cpp:167
(gdb) info threads
  Id   Target Id         Frame 
* 1    Thread 0x7c09c0 (LWP 48406) "5a333efd10c50cb" 0x0000000000000000 in ?? ()
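
For anyone trying to narrow this down: the backtrace ends in std::thread::join called from main, so a stripped-down program that only spawns and joins threads may help separate the test's LPC logic from the toolchain and environment. The following is a minimal sketch under that assumption. It is not the code from test/lpc_barrier.cpp; the function name join_all and the choice of nine workers (matching the nine "New Thread" lines in the gdb session above) are illustrative only.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical stand-in for the thread teardown at the end of the test.
// Spawns n do-nothing worker threads, joins each one, and returns how
// many joins completed.
int join_all(int n) {
    std::vector<std::thread> workers;
    workers.reserve(n);
    for (int i = 0; i < n; ++i) {
        workers.emplace_back([] { /* worker body intentionally empty */ });
    }

    int joined = 0;
    for (std::thread &t : workers) {
        assert(t.joinable());  // every freshly created thread should be joinable
        t.join();              // the call that appears in frame #2 of the trace
        ++joined;
    }
    return joined;
}
```

Calling join_all(9) from a small main and building it with the same CC flags and loaded modules as the failing run would show whether the crash needs anything from upcxx at all. If this also dies in the join, the test's own logic is probably not the cause.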

Comments (10)

  1. Paul Hargrove

    FWIW here are the environment modules loaded when the nightly tester runs:

    modules/3.2.10.6
    nsg/1.2.0
    cray-mpich/7.6.0
    darshan/3.1.4
    gcc/6.3.0
    craype-haswell
    craype-network-aries
    craype/2.5.12
    cray-libsci/17.06.1
    udreg/2.3.2-6.0.4.0_12.2__g2f9c3ee.ari
    ugni/6.0.14-6.0.4.0_14.1__ge7db4a2.ari
    pmi/5.0.12
    dmapp/7.1.1-6.0.4.0_46.2__gb8abda2.ari
    gni-headers/5.0.11-6.0.4.0_7.2__g7136988.ari
    xpmem/2.2.2-6.0.4.0_3.1__g43b0535.ari
    job/2.2.2-6.0.4.0_8.2__g3c644b5.ari
    dvs/2.7_2.2.31-6.0.4.1_6.1__gb3b87e6
    alps/6.4.1-6.0.4.0_7.2__g86d0f3d.ari
    rca/2.2.11-6.0.4.0_13.2__g84de67a.ari
    atp/2.1.1
    PrgEnv-gnu/6.0.4
    

    I didn't do an exact comparison but both bupc and cray-shmem popped out as present in Dan's list and absent from the list above.

  2. Former user Account Deleted

    I cannot reproduce on either haswell or knl, running under either gdb or srun. The trace shows that it's std::thread::join that is failing; I'm not sure how that could be our (my) fault.

  3. Paul Hargrove

    Hmm. Worked for me w/o removing the cray-shmem env module when I compiled on the front end before launching an interactive job to run. I did not set any env vars.

    I will be trying to more closely reproduce Dan's failure.

    Dan,
    How is it that env .... nobs ... works for you when nobs is a shell function, not a real executable?

  4. Paul Hargrove

    I've tried compiling on the front-end or the compute node.
    I've tried with and without the bupc env module loaded.
    I've tried with and without DBGSYM=1 OPTLEV=0.
    While I've not tried all 8 combinations, I have covered each dimension and still cannot reproduce.

    FWIW: I am running git clean -fxd between attempts to be certain of no artifacts spilling over from one to the next.

  5. Dan Bonachea reporter

    How is it that env .... nobs ... works for you when nobs is a shell function, not a real executable?

    I have a nobs driver script so I can invoke it from any shell or within env without explicitly sourcing anything.

    Here it is with bash and without my script:

    bonachea@nid00072:~/UPC/upcxx> . sourceme
    bonachea@nid00072:~/UPC/upcxx> git describe --always
    fc49099
    bonachea@nid00072:~/UPC/upcxx> rm -Rf .nobs ; DBGSYM=1 OPTLEV=0 nobs run test/lpc_barrier.cpp   
    CC -std=c++11 -D_GNU_SOURCE=1 -I/global/u1/b/bonachea/UPC/upcxx/.nobs/art/107df87196fc3e462886e41ee307b7343d725152 -MM -MT x /global/u1/b/bonachea/UPC/upcxx/test/lpc_barrier.cpp
    
    CC -std=c++11 -D_GNU_SOURCE=1 -I/global/u1/b/bonachea/UPC/upcxx/.nobs/art/107df87196fc3e462886e41ee307b7343d725152 -O0 -g -Wall -c /global/u1/b/bonachea/UPC/upcxx/test/lpc_barrier.cpp -o /global/u1/b/bonachea/UPC/upcxx/.nobs/art/162cff595faed947ac86c2015cabc3a468065842.lpc_barrier.cpp.o
    
    CC -std=c++11 -D_GNU_SOURCE=1 -I/global/u1/b/bonachea/UPC/upcxx/.nobs/art/7244447ed1b899e12e9515320f25dae776bd0b84 -MM -MT x /global/u1/b/bonachea/UPC/upcxx/src/persona.cpp
    
    CC -std=c++11 -D_GNU_SOURCE=1 -I/global/u1/b/bonachea/UPC/upcxx/.nobs/art/7244447ed1b899e12e9515320f25dae776bd0b84 -MM -MT x /global/u1/b/bonachea/UPC/upcxx/src/future/core.cpp
    
    CC -std=c++11 -D_GNU_SOURCE=1 -I/global/u1/b/bonachea/UPC/upcxx/.nobs/art/7244447ed1b899e12e9515320f25dae776bd0b84 -MM -MT x /global/u1/b/bonachea/UPC/upcxx/src/diagnostic.cpp
    
    CC -std=c++11 -D_GNU_SOURCE=1 -I/global/u1/b/bonachea/UPC/upcxx/.nobs/art/7244447ed1b899e12e9515320f25dae776bd0b84 -MM -MT x /global/u1/b/bonachea/UPC/upcxx/src/lpc_inbox.cpp
    
    CC -std=c++11 -D_GNU_SOURCE=1 -I/global/u1/b/bonachea/UPC/upcxx/.nobs/art/7244447ed1b899e12e9515320f25dae776bd0b84 -O0 -g -Wall -c /global/u1/b/bonachea/UPC/upcxx/src/diagnostic.cpp -o /global/u1/b/bonachea/UPC/upcxx/.nobs/art/9afa4e0afc4f9291624c72ae2fe53c8ef69bcdd9.diagnostic.cpp.o
    
    CC -std=c++11 -D_GNU_SOURCE=1 -I/global/u1/b/bonachea/UPC/upcxx/.nobs/art/7244447ed1b899e12e9515320f25dae776bd0b84 -O0 -g -Wall -c /global/u1/b/bonachea/UPC/upcxx/src/lpc_inbox.cpp -o /global/u1/b/bonachea/UPC/upcxx/.nobs/art/ffe2051ddde8e33bde333abe2135d77627ff2d95.lpc_inbox.cpp.o
    
    CC -std=c++11 -D_GNU_SOURCE=1 -I/global/u1/b/bonachea/UPC/upcxx/.nobs/art/7244447ed1b899e12e9515320f25dae776bd0b84 -O0 -g -Wall -c /global/u1/b/bonachea/UPC/upcxx/src/future/core.cpp -o /global/u1/b/bonachea/UPC/upcxx/.nobs/art/79b2cb10970b4996f25115c98664c3731cc0dc34.core.cpp.o
    
    CC -std=c++11 -D_GNU_SOURCE=1 -I/global/u1/b/bonachea/UPC/upcxx/.nobs/art/7244447ed1b899e12e9515320f25dae776bd0b84 -O0 -g -Wall -c /global/u1/b/bonachea/UPC/upcxx/src/persona.cpp -o /global/u1/b/bonachea/UPC/upcxx/.nobs/art/cd323a957e343279073980471d88a162f2d1b35d.persona.cpp.o
    
    CC -o /global/u1/b/bonachea/UPC/upcxx/.nobs/art/e1b9b4b8c1ba82130339300c8e766dff0cdfd16c.x /global/u1/b/bonachea/UPC/upcxx/.nobs/art/9afa4e0afc4f9291624c72ae2fe53c8ef69bcdd9.diagnostic.cpp.o /global/u1/b/bonachea/UPC/upcxx/.nobs/art/79b2cb10970b4996f25115c98664c3731cc0dc34.core.cpp.o /global/u1/b/bonachea/UPC/upcxx/.nobs/art/ffe2051ddde8e33bde333abe2135d77627ff2d95.lpc_inbox.cpp.o /global/u1/b/bonachea/UPC/upcxx/.nobs/art/cd323a957e343279073980471d88a162f2d1b35d.persona.cpp.o /global/u1/b/bonachea/UPC/upcxx/.nobs/art/162cff595faed947ac86c2015cabc3a468065842.lpc_barrier.cpp.o -lpthread
    
    Test: lpc_barrier.cpp
    Barrier 0
    Barrier 1
    Barrier 2
    Barrier 3
    Barrier 4
    Barrier 5
    Barrier 6
    Barrier 7
    Barrier 8
    Barrier 9
    0: from left
    7: from left
    5: from left
    8: from left
    4: from left
    3: from left
    1: from left
    2: from left
    6: from left
    9: from left
    Eyeball me! No 'rights' before this message, no 'lefts' after.
    9: from right
    6: from right
    0: from right
    1: from right
    5: from right
    3: from right
    4: from right
    8: from right
    7: from right
    2: from right
    Segmentation fault
    bonachea@nid00072:~/UPC/upcxx> module list
    Currently Loaded Modulefiles:
      1) modules/3.2.10.6                               7) udreg/2.3.2-6.0.4.0_12.2__g2f9c3ee.ari        13) dvs/2.7_2.2.31-6.0.4.1_6.1__gb3b87e6          19) Base-opts/2.4.123-6.0.4.0_10.1__g6460790.ari
      2) gcc/6.3.0                                      8) ugni/6.0.14-6.0.4.0_14.1__ge7db4a2.ari        14) alps/6.4.1-6.0.4.0_7.2__g86d0f3d.ari          20) cray-libsci/17.06.1
      3) craype-haswell                                 9) dmapp/7.1.1-6.0.4.0_46.2__gb8abda2.ari        15) rca/2.2.11-6.0.4.0_13.2__g84de67a.ari         21) pmi/5.0.12
      4) craype-network-aries                          10) gni-headers/5.0.11-6.0.4.0_7.2__g7136988.ari  16) cray-shmem/7.6.0                              22) atp/2.1.1
      5) craype/2.5.12                                 11) xpmem/2.2.2-6.0.4.0_3.1__g43b0535.ari         17) bupc/2.26.0                                   23) PrgEnv-gnu/6.0.4
      6) cray-mpich/7.6.0                              12) job/2.2.2-6.0.4.0_8.2__g3c644b5.ari           18) git/2.9.1
    bonachea@nid00072:~/UPC/upcxx> 
    
  6. Dan Bonachea reporter

    The nightly tester output includes:

    --- App stderr ---
    Tue Sep 26 03:12:24 2017: [unset]:_pmi_alps_get_apid:alps response not OKAY
    Tue Sep 26 03:12:24 2017: [unset]:_pmi_init:_pmi_alps_init returned -1
    

    I now suspect this is due to Cray not liking us running a.out executables directly on a compute node without using PMI.

  7. Paul Hargrove

    Dan and I worked out that the key difference is the presence of the "darshan" environment module.
    It is loaded by default for me, and the nightly testers (and presumably for John as well).
    However it is not in Dan's modules listed anywhere above.

    We have no idea why Darshan (an I/O characterization library) is relevant, but Dan's errors went away when he loaded it. Similarly, after unloading the darshan module I also get the SEGV Dan reported.

    Note that it appears to matter whether or not darshan is loaded when the executable is built. After that, loading the module does not prevent the executable from SEGVing.
