lpc_barrier exit SEGV on cori
run-tests is currently failing on cori because lpc_barrier gets a SIGSEGV on exit - see full trace below.
This test does not use GASNet, so the crash is conduit-independent.
Note it still has to be run inside an interactive batch job due to the cross-compilation.
{cori[1] ~/UPC/upcxx} git describe --always
b0a593a
{cori[1] ~/UPC/upcxx} module list
Currently Loaded Modulefiles:
1) modules/3.2.10.6 7) udreg/2.3.2-6.0.4.0_12.2__g2f9c3ee.ari 13) dvs/2.7_2.2.31-6.0.4.1_6.1__gb3b87e6 19) Base-opts/2.4.123-6.0.4.0_10.1__g6460790.ari
2) gcc/6.3.0 8) ugni/6.0.14-6.0.4.0_14.1__ge7db4a2.ari 14) alps/6.4.1-6.0.4.0_7.2__g86d0f3d.ari 20) cray-libsci/17.06.1
3) craype-haswell 9) dmapp/7.1.1-6.0.4.0_46.2__gb8abda2.ari 15) rca/2.2.11-6.0.4.0_13.2__g84de67a.ari 21) pmi/5.0.12
4) craype-network-aries 10) gni-headers/5.0.11-6.0.4.0_7.2__g7136988.ari 16) cray-shmem/7.6.0 22) atp/2.1.1
5) craype/2.5.12 11) xpmem/2.2.2-6.0.4.0_3.1__g43b0535.ari 17) bupc/2.26.0 23) PrgEnv-gnu/6.0.4
6) cray-mpich/7.6.0 12) job/2.2.2-6.0.4.0_8.2__g3c644b5.ari 18) git/2.9.1
{cori[1] ~/UPC/upcxx} rm -Rf .nobs ; env DBGSYM=1 OPTLEV=0 nobs run test/lpc_barrier.cpp
CC -std=c++11 -D_GNU_SOURCE=1 -I/global/homes/b/bonachea/UPC/upcxx/.nobs/art/107df87196fc3e462886e41ee307b7343d725152 -MM -MT x /global/u1/b/bonachea/UPC/upcxx/test/lpc_barrier.cpp
CC -std=c++11 -D_GNU_SOURCE=1 -I/global/homes/b/bonachea/UPC/upcxx/.nobs/art/107df87196fc3e462886e41ee307b7343d725152 -O0 -g -Wall -c /global/u1/b/bonachea/UPC/upcxx/test/lpc_barrier.cpp -o /global/homes/b/bonachea/UPC/upcxx/.nobs/art/d3b9d0c421c5f114f489f9893c296fdc40d290e1.lpc_barrier.cpp.o
CC -std=c++11 -D_GNU_SOURCE=1 -I/global/homes/b/bonachea/UPC/upcxx/.nobs/art/7244447ed1b899e12e9515320f25dae776bd0b84 -MM -MT x /global/u1/b/bonachea/UPC/upcxx/src/persona.cpp
CC -std=c++11 -D_GNU_SOURCE=1 -I/global/homes/b/bonachea/UPC/upcxx/.nobs/art/7244447ed1b899e12e9515320f25dae776bd0b84 -MM -MT x /global/u1/b/bonachea/UPC/upcxx/src/future/core.cpp
CC -std=c++11 -D_GNU_SOURCE=1 -I/global/homes/b/bonachea/UPC/upcxx/.nobs/art/7244447ed1b899e12e9515320f25dae776bd0b84 -MM -MT x /global/u1/b/bonachea/UPC/upcxx/src/diagnostic.cpp
CC -std=c++11 -D_GNU_SOURCE=1 -I/global/homes/b/bonachea/UPC/upcxx/.nobs/art/7244447ed1b899e12e9515320f25dae776bd0b84 -MM -MT x /global/u1/b/bonachea/UPC/upcxx/src/lpc_inbox.cpp
CC -std=c++11 -D_GNU_SOURCE=1 -I/global/homes/b/bonachea/UPC/upcxx/.nobs/art/7244447ed1b899e12e9515320f25dae776bd0b84 -O0 -g -Wall -c /global/u1/b/bonachea/UPC/upcxx/src/future/core.cpp -o /global/homes/b/bonachea/UPC/upcxx/.nobs/art/7d6d5014fd14fef7f9ef9093a8406acd222193cb.core.cpp.o
CC -std=c++11 -D_GNU_SOURCE=1 -I/global/homes/b/bonachea/UPC/upcxx/.nobs/art/7244447ed1b899e12e9515320f25dae776bd0b84 -O0 -g -Wall -c /global/u1/b/bonachea/UPC/upcxx/src/diagnostic.cpp -o /global/homes/b/bonachea/UPC/upcxx/.nobs/art/9750feb5cf2440c84d13bbe8af56862750cb6f76.diagnostic.cpp.o
CC -std=c++11 -D_GNU_SOURCE=1 -I/global/homes/b/bonachea/UPC/upcxx/.nobs/art/7244447ed1b899e12e9515320f25dae776bd0b84 -O0 -g -Wall -c /global/u1/b/bonachea/UPC/upcxx/src/persona.cpp -o /global/homes/b/bonachea/UPC/upcxx/.nobs/art/376b63e6156cd83ff358cf89777b2da6d109ee00.persona.cpp.o
CC -std=c++11 -D_GNU_SOURCE=1 -I/global/homes/b/bonachea/UPC/upcxx/.nobs/art/7244447ed1b899e12e9515320f25dae776bd0b84 -O0 -g -Wall -c /global/u1/b/bonachea/UPC/upcxx/src/lpc_inbox.cpp -o /global/homes/b/bonachea/UPC/upcxx/.nobs/art/bfd68c56514a2091909af823ad57e003e8e2b770.lpc_inbox.cpp.o
CC -o /global/homes/b/bonachea/UPC/upcxx/.nobs/art/5a333efd10c50cb6ba131ec0af87ef9e7d84a503.x /global/homes/b/bonachea/UPC/upcxx/.nobs/art/9750feb5cf2440c84d13bbe8af56862750cb6f76.diagnostic.cpp.o /global/homes/b/bonachea/UPC/upcxx/.nobs/art/7d6d5014fd14fef7f9ef9093a8406acd222193cb.core.cpp.o /global/homes/b/bonachea/UPC/upcxx/.nobs/art/bfd68c56514a2091909af823ad57e003e8e2b770.lpc_inbox.cpp.o /global/homes/b/bonachea/UPC/upcxx/.nobs/art/376b63e6156cd83ff358cf89777b2da6d109ee00.persona.cpp.o /global/homes/b/bonachea/UPC/upcxx/.nobs/art/d3b9d0c421c5f114f489f9893c296fdc40d290e1.lpc_barrier.cpp.o -lpthread
Test: lpc_barrier.cpp
Barrier 0
Barrier 1
Barrier 2
Barrier 3
Barrier 4
Barrier 5
Barrier 6
Barrier 7
Barrier 8
Barrier 9
0: from left
2: from left
3: from left
7: from left
6: from left
5: from left
4: from left
1: from left
8: from left
9: from left
Eyeball me! No 'rights' before this message, no 'lefts' after.
9: from right
5: from right
8: from right
7: from right
6: from right
3: from right
0: from right
4: from right
2: from right
1: from right
Segmentation fault
{cori[1] ~/UPC/upcxx} srun -N 1 /global/homes/b/bonachea/UPC/upcxx/.nobs/art/5a333efd10c50cb6ba131ec0af87ef9e7d84a503.x
Test: lpc_barrier.cpp
Barrier 0
Barrier 1
Barrier 2
Barrier 3
Barrier 4
Barrier 5
Barrier 6
Barrier 7
Barrier 8
Barrier 9
2: from left
0: from left
8: from left
1: from left
6: from left
7: from left
9: from left
4: from left
5: from left
3: from left
Eyeball me! No 'rights' before this message, no 'lefts' after.
5: from right
0: from right
2: from right
6: from right
7: from right
9: from right
8: from right
1: from right
4: from right
3: from right
srun: error: nid00063: task 0: Segmentation fault
srun: Terminating job step 7054140.4
{cori[1] ~/UPC/upcxx} gdb /global/homes/b/bonachea/UPC/upcxx/.nobs/art/5a333efd10c50cb6ba131ec0af87ef9e7d84a503.x
GNU gdb (GDB; SUSE Linux Enterprise 12) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-suse-linux".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://bugs.opensuse.org/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /global/homes/b/bonachea/UPC/upcxx/.nobs/art/5a333efd10c50cb6ba131ec0af87ef9e7d84a503.x...done.
(gdb) r
Starting program: /global/u1/b/bonachea/UPC/upcxx/.nobs/art/5a333efd10c50cb6ba131ec0af87ef9e7d84a503.x
## Loaded 'bupc/2.26.0-6.0.4-gnu-6.3.0' based on currently loaded modules.
## If you change PrgEnv, craype or compiler modules then you must
## run 'module switch bupc bupc' to get the correct bupc module.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Test: lpc_barrier.cpp
[New Thread 0x7ffff7ff8700 (LWP 48503)]
[New Thread 0x7ffff77f7700 (LWP 48504)]
[New Thread 0x7ffff6ff6700 (LWP 48505)]
[New Thread 0x7ffff67f5700 (LWP 48506)]
[New Thread 0x7ffff5ff4700 (LWP 48507)]
[New Thread 0x7ffff57f3700 (LWP 48508)]
[New Thread 0x7ffff4ff2700 (LWP 48509)]
[New Thread 0x7fffd7fff700 (LWP 48510)]
[New Thread 0x7fffd77fe700 (LWP 48511)]
Barrier 0
Barrier 1
Barrier 2
Barrier 3
Barrier 4
Barrier 5
Barrier 6
Barrier 7
Barrier 8
Barrier 9
9: from left
5: from left
2: from left
7: from left
3: from left
8: from left
6: from left
4: from left
0: from left
1: from left
Eyeball me! No 'rights' before this message, no 'lefts' after.
0: from right
8: from right
1: from right
3: from right
7: from right
6: from right
2: from right
4: from right
5: from right
9: from right
[Thread 0x7fffd77fe700 (LWP 48511) exited]
[Thread 0x7fffd7fff700 (LWP 48510) exited]
[Thread 0x7ffff4ff2700 (LWP 48509) exited]
[Thread 0x7ffff57f3700 (LWP 48508) exited]
[Thread 0x7ffff5ff4700 (LWP 48507) exited]
[Thread 0x7ffff67f5700 (LWP 48506) exited]
[Thread 0x7ffff6ff6700 (LWP 48505) exited]
[Thread 0x7ffff77f7700 (LWP 48504) exited]
[Thread 0x7ffff7ff8700 (LWP 48503) exited]
Thread 1 "5a333efd10c50cb" received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb) where
#0 0x0000000000000000 in ?? ()
#1 0x000000000041a997 in __gthread_join (__value_ptr=0x0, __threadid=<optimized out>) at /tmp/peint/cray-gcc/BUILD/snos_objdir/x86_64-suse-linux/libstdc++-v3/include/x86_64-suse-linux/bits/gthr-default.h:668
#2 std::thread::join (this=0x7c20a0) at ../../../../../cray-gcc-6.3.0-201701050407.93fe37becc347/libstdc++-v3/src/c++11/thread.cc:136
#3 0x000000000040425f in main () at /global/u1/b/bonachea/UPC/upcxx/test/lpc_barrier.cpp:167
(gdb) info threads
Id Target Id Frame
* 1 Thread 0x7c09c0 (LWP 48406) "5a333efd10c50cb" 0x0000000000000000 in ?? ()
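Since frame #2 of the trace is std::thread::join called from main, a standalone program that only spawns and joins threads may help show whether the crash depends on any UPC++ code at all. The following is a minimal sketch, not taken from lpc_barrier.cpp: the helper name spawn_and_join and the choice of thread count are assumptions made for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper (not part of UPC++ or lpc_barrier.cpp): start n threads,
// join every one of them, and return how many thread bodies actually ran.
int spawn_and_join(int n) {
  assert(n >= 0);
  std::atomic<int> ran{0};
  std::vector<std::thread> workers;
  workers.reserve(static_cast<std::size_t>(n));
  for (int i = 0; i < n; ++i) {
    // Each worker only bumps a counter; no barrier or LPC logic is involved.
    workers.emplace_back([&ran] { ran.fetch_add(1); });
  }
  for (std::thread &t : workers) {
    t.join();  // same library call as frame #2 in the trace above
  }
  return ran.load();
}
```

Calling spawn_and_join(9) from a small main and building it with the same CC -std=c++11 -O0 -g ... -lpthread flags shown in the build log would mirror the nine worker threads gdb reports. If that also faults inside join, the problem presumably lies in the toolchain or link environment rather than in the test.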
Comments (10)
-
Account Deleted I cannot reproduce on either haswell or knl, running under either gdb or srun. The trace shows that it's std::thread::join which is failing; I'm not sure how that could be our (my) fault.
-
Hmm. Worked for me w/o removing the cray-shmem env module when I compiled on the front end before launching an interactive job to run. I did not set any env vars.
I will be trying to more closely reproduce Dan's failure.
Dan,
How is it that env .... nobs ... works for you when nobs is a shell function, not a real executable?
-
I've tried compiling on the front-end or the compute node.
I've tried with and without the bupc env module loaded.
I've tried with and without DBGSYM=1 OPTLEV=0.
While I've not tried all 8 combinations, I have covered each dimension and still cannot reproduce.
FWIW: I am running git clean -fxd between attempts to be certain that no artifacts spill over from one to the next.
-
reporter "How is it that env .... nobs ... works for you when nobs is a shell function, not a real executable?"
I have a nobs driver script so I can invoke it from any shell or within env without explicitly sourcing anything.
Here it is with bash and without my script:
bonachea@nid00072:~/UPC/upcxx> . sourceme
bonachea@nid00072:~/UPC/upcxx> git describe --always
fc49099
bonachea@nid00072:~/UPC/upcxx> rm -Rf .nobs ; DBGSYM=1 OPTLEV=0 nobs run test/lpc_barrier.cpp
CC -std=c++11 -D_GNU_SOURCE=1 -I/global/u1/b/bonachea/UPC/upcxx/.nobs/art/107df87196fc3e462886e41ee307b7343d725152 -MM -MT x /global/u1/b/bonachea/UPC/upcxx/test/lpc_barrier.cpp
CC -std=c++11 -D_GNU_SOURCE=1 -I/global/u1/b/bonachea/UPC/upcxx/.nobs/art/107df87196fc3e462886e41ee307b7343d725152 -O0 -g -Wall -c /global/u1/b/bonachea/UPC/upcxx/test/lpc_barrier.cpp -o /global/u1/b/bonachea/UPC/upcxx/.nobs/art/162cff595faed947ac86c2015cabc3a468065842.lpc_barrier.cpp.o
CC -std=c++11 -D_GNU_SOURCE=1 -I/global/u1/b/bonachea/UPC/upcxx/.nobs/art/7244447ed1b899e12e9515320f25dae776bd0b84 -MM -MT x /global/u1/b/bonachea/UPC/upcxx/src/persona.cpp
CC -std=c++11 -D_GNU_SOURCE=1 -I/global/u1/b/bonachea/UPC/upcxx/.nobs/art/7244447ed1b899e12e9515320f25dae776bd0b84 -MM -MT x /global/u1/b/bonachea/UPC/upcxx/src/future/core.cpp
CC -std=c++11 -D_GNU_SOURCE=1 -I/global/u1/b/bonachea/UPC/upcxx/.nobs/art/7244447ed1b899e12e9515320f25dae776bd0b84 -MM -MT x /global/u1/b/bonachea/UPC/upcxx/src/diagnostic.cpp
CC -std=c++11 -D_GNU_SOURCE=1 -I/global/u1/b/bonachea/UPC/upcxx/.nobs/art/7244447ed1b899e12e9515320f25dae776bd0b84 -MM -MT x /global/u1/b/bonachea/UPC/upcxx/src/lpc_inbox.cpp
CC -std=c++11 -D_GNU_SOURCE=1 -I/global/u1/b/bonachea/UPC/upcxx/.nobs/art/7244447ed1b899e12e9515320f25dae776bd0b84 -O0 -g -Wall -c /global/u1/b/bonachea/UPC/upcxx/src/diagnostic.cpp -o /global/u1/b/bonachea/UPC/upcxx/.nobs/art/9afa4e0afc4f9291624c72ae2fe53c8ef69bcdd9.diagnostic.cpp.o
CC -std=c++11 -D_GNU_SOURCE=1 -I/global/u1/b/bonachea/UPC/upcxx/.nobs/art/7244447ed1b899e12e9515320f25dae776bd0b84 -O0 -g -Wall -c /global/u1/b/bonachea/UPC/upcxx/src/lpc_inbox.cpp -o /global/u1/b/bonachea/UPC/upcxx/.nobs/art/ffe2051ddde8e33bde333abe2135d77627ff2d95.lpc_inbox.cpp.o
CC -std=c++11 -D_GNU_SOURCE=1 -I/global/u1/b/bonachea/UPC/upcxx/.nobs/art/7244447ed1b899e12e9515320f25dae776bd0b84 -O0 -g -Wall -c /global/u1/b/bonachea/UPC/upcxx/src/future/core.cpp -o /global/u1/b/bonachea/UPC/upcxx/.nobs/art/79b2cb10970b4996f25115c98664c3731cc0dc34.core.cpp.o
CC -std=c++11 -D_GNU_SOURCE=1 -I/global/u1/b/bonachea/UPC/upcxx/.nobs/art/7244447ed1b899e12e9515320f25dae776bd0b84 -O0 -g -Wall -c /global/u1/b/bonachea/UPC/upcxx/src/persona.cpp -o /global/u1/b/bonachea/UPC/upcxx/.nobs/art/cd323a957e343279073980471d88a162f2d1b35d.persona.cpp.o
CC -o /global/u1/b/bonachea/UPC/upcxx/.nobs/art/e1b9b4b8c1ba82130339300c8e766dff0cdfd16c.x /global/u1/b/bonachea/UPC/upcxx/.nobs/art/9afa4e0afc4f9291624c72ae2fe53c8ef69bcdd9.diagnostic.cpp.o /global/u1/b/bonachea/UPC/upcxx/.nobs/art/79b2cb10970b4996f25115c98664c3731cc0dc34.core.cpp.o /global/u1/b/bonachea/UPC/upcxx/.nobs/art/ffe2051ddde8e33bde333abe2135d77627ff2d95.lpc_inbox.cpp.o /global/u1/b/bonachea/UPC/upcxx/.nobs/art/cd323a957e343279073980471d88a162f2d1b35d.persona.cpp.o /global/u1/b/bonachea/UPC/upcxx/.nobs/art/162cff595faed947ac86c2015cabc3a468065842.lpc_barrier.cpp.o -lpthread
Test: lpc_barrier.cpp
Barrier 0
Barrier 1
Barrier 2
Barrier 3
Barrier 4
Barrier 5
Barrier 6
Barrier 7
Barrier 8
Barrier 9
0: from left
7: from left
5: from left
8: from left
4: from left
3: from left
1: from left
2: from left
6: from left
9: from left
Eyeball me! No 'rights' before this message, no 'lefts' after.
9: from right
6: from right
0: from right
1: from right
5: from right
3: from right
4: from right
8: from right
7: from right
2: from right
Segmentation fault
bonachea@nid00072:~/UPC/upcxx> module list
Currently Loaded Modulefiles:
1) modules/3.2.10.6 7) udreg/2.3.2-6.0.4.0_12.2__g2f9c3ee.ari 13) dvs/2.7_2.2.31-6.0.4.1_6.1__gb3b87e6 19) Base-opts/2.4.123-6.0.4.0_10.1__g6460790.ari
2) gcc/6.3.0 8) ugni/6.0.14-6.0.4.0_14.1__ge7db4a2.ari 14) alps/6.4.1-6.0.4.0_7.2__g86d0f3d.ari 20) cray-libsci/17.06.1
3) craype-haswell 9) dmapp/7.1.1-6.0.4.0_46.2__gb8abda2.ari 15) rca/2.2.11-6.0.4.0_13.2__g84de67a.ari 21) pmi/5.0.12
4) craype-network-aries 10) gni-headers/5.0.11-6.0.4.0_7.2__g7136988.ari 16) cray-shmem/7.6.0 22) atp/2.1.1
5) craype/2.5.12 11) xpmem/2.2.2-6.0.4.0_3.1__g43b0535.ari 17) bupc/2.26.0 23) PrgEnv-gnu/6.0.4
6) cray-mpich/7.6.0 12) job/2.2.2-6.0.4.0_8.2__g3c644b5.ari 18) git/2.9.1
bonachea@nid00072:~/UPC/upcxx>
-
reporter The nightly tester output includes:
--- App stderr ---
Tue Sep 26 03:12:24 2017: [unset]:_pmi_alps_get_apid:alps response not OKAY
Tue Sep 26 03:12:24 2017: [unset]:_pmi_init:_pmi_alps_init returned -1
I now suspect this is due to Cray not liking us running a.out executables directly on a compute node without using PMI.
-
Dan and I worked out that the key difference is the presence of the "darshan" environment module.
It is loaded by default for me, and the nightly testers (and presumably for John as well).
However, it is not in Dan's modules listed anywhere above.
We have no idea why Darshan (an I/O characterization library) is relevant, but Dan's errors went away when he loaded it. Similarly, after unloading the darshan module I also get the SEGV Dan reported.
Note that what appears to matter is whether or not darshan is loaded when the executable is built. Loading the module after the build does not prevent the executable from SEGVing.
-
- changed component to External
-
- changed status to resolved
Not our fault as far as we can tell. NERSC seems to require the darshan module.
FWIW here are the environment modules loaded when the nightly tester runs:
I didn't do an exact comparison but both bupc and cray-shmem popped out as present in Dan's list and absent from the list above.