Regression (CRASH) on compute-pi-multi-examples
Having recently merged both PR#110 and PR#116, we now have a SEGV from compute-pi-multi-examples on a few platforms not included in the manual CI performed pre-merge.
These appear to be SMP, UDP or MPI platforms.
Today's backtraces can be seen here for seq and here for par.
I see SEGV on Linux and WSL platforms, and an invalid free() on one macOS tester.
As I type this, not all of the results for today are done. So, there may be other failure modes I've not yet seen.
However, the links above will show new results as they are completed today.
Comments (7)
-
reporter -
Here is a relevant stack trace snapshot from dirac/ibv/pgi, for archival purposes:
[7] #7 <signal handler called>
[7] #8 0x000000000044d287 in upcxx::detail::lpc_inbox<(upcxx::detail::intru_queue_safety)0>::burst(int)::{lambda(upcxx::detail::lpc_base*)#1} (m=0x1b9e150) at /home/data2/upcnightly/dirac/EX-dirac-ibv-pgi/work/dbg/upcxx/.nobs/art/1aea07ef7e76a04150854a24409354e8509d0273/upcxx/lpc.hpp:78
[7] #9 0x0000000000448d28 in upcxx::detail::intru_queue::burst_something (max_n=100, fn=0x7ffd39fea1c0, head1=0x1b9e150) at /home/data2/upcnightly/dirac/EX-dirac-ibv-pgi/work/dbg/upcxx/.nobs/art/1aea07ef7e76a04150854a24409354e8509d0273/upcxx/intru_queue.hpp:173
[7] #10 0x0000000000448cae in upcxx::detail::intru_queue::burst (max_n=100, fn=0x7ffd39fea1c0) at /home/data2/upcnightly/dirac/EX-dirac-ibv-pgi/work/dbg/upcxx/.nobs/art/1aea07ef7e76a04150854a24409354e8509d0273/upcxx/intru_queue.hpp:158
[7] #11 0x0000000000448e7f in upcxx::detail::lpc_inbox::burst (max_n=100) at /home/data2/upcnightly/dirac/EX-dirac-ibv-pgi/work/dbg/upcxx/.nobs/art/1aea07ef7e76a04150854a24409354e8509d0273/upcxx/lpc.hpp:78
[7] #12 0x00000000004494e6 in upcxx::detail::persona_tls (p=0xcda9f0 <upcxx::backend::master>) at /home/data2/upcnightly/dirac/EX-dirac-ibv-pgi/work/dbg/upcxx/.nobs/art/1aea07ef7e76a04150854a24409354e8509d0273/upcxx/persona.hpp:720
[7] #13 0x00000000004599c8 in upcxx::progress(upcxx::progress_level)::{lambda(upcxx::persona&)#1} (p=...) at /home/data2/upcnightly/dirac/EX-dirac-ibv-pgi/work/dbg/upcxx/src/backend/gasnet/runtime.cpp:1541
[7] #14 0x00000000004496d6 in upcxx::detail::persona_tls (fn=0x7ffd39fea3e0) at /home/data2/upcnightly/dirac/EX-dirac-ibv-pgi/work/dbg/upcxx/.nobs/art/1aea07ef7e76a04150854a24409354e8509d0273/upcxx/persona.hpp:700
[7] #15 0x00000000004570af in upcxx::progress(enum _ZN5upcxx14progress_levelE) (level=_ZN5upcxx14progress_level4userE) at /home/data2/upcnightly/dirac/EX-dirac-ibv-pgi/work/dbg/upcxx/src/backend/gasnet/runtime.cpp:1531
[7] #16 0x00000000004565e1 in upcxx::finalize () at /home/data2/upcnightly/dirac/EX-dirac-ibv-pgi/work/dbg/upcxx/src/backend/gasnet/runtime.cpp:777
[7] #17 0x000000000040b46d in main (argc=2, argv=0x7ffd39feb918) at /home/data2/upcnightly/dirac/EX-dirac-ibv-pgi/work/dbg/upcxx/example/prog-guide/compute-pi-multi-examples.cpp:99
-
reporter The following is sufficient to reproduce, starting from a fresh login (default modules) on Dirac.
{phargrov@pcp-d-1 ~}$ module load pgi
{phargrov@pcp-d-1 ~}$ module swap mpi mpi/openmpi4-pgi-19.7
{phargrov@pcp-d-1 ~}$ export GASNET_IBV_SPAWNER=ssh GASNET_BACKTRACE=1
{phargrov@pcp-d-1 ~}$ cd upcxx
{phargrov@pcp-d-1 upcxx}$ git describe
upcxx-2019.3.7-88-ga6d23f1
{phargrov@pcp-d-1 upcxx}$ rm -rf .nobs/
{phargrov@pcp-d-1 upcxx}$ export UPCXX_INSTALL=$(pwd)/inst-pgi
{phargrov@pcp-d-1 upcxx}$ PATH+=:$UPCXX_INSTALL/bin
{phargrov@pcp-d-1 upcxx}$ CC=pgcc CXX=mpicxx ./install $UPCXX_INSTALL >inst-pgi-log.txt 2>&1
{phargrov@pcp-d-1 upcxx}$ tail -1 inst-pgi-log.txt
UPC++ successfully installed
{phargrov@pcp-d-1 upcxx}$ upcxx -g -network=ibv example/prog-guide/compute-pi-multi-examples.cpp
{phargrov@pcp-d-1 upcxx}$ upcxx-run -n 2 ./a.out
[backtrace as previously posted by Dan]
-
reporter Not sure if this helps (perhaps it was obvious to John), but here is a slightly more precise diagnosis of the point at which the SEGV is occurring:
--- a/src/lpc.hpp
+++ b/src/lpc.hpp
@@ -75,7 +75,10 @@ namespace upcxx {
       // returns num lpc's executed
       int burst(int max_n = 100) {
-        return q_.burst(max_n, [](lpc_base *m) { m->vtbl->execute_and_delete(m); });
+        return q_.burst(max_n, [](lpc_base *m) {
+            UPCXX_ASSERT(!!m, "NULL m");
+            UPCXX_ASSERT(!!(m->vtbl), "NULL m->vtbl"); // <== FAILS
+            m->vtbl->execute_and_delete(m); });
       }
     };
   }
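For anyone who wants to experiment with this style of check outside the runtime, here is a minimal standalone sketch of the same idea: validate each queued node and its dispatch table before calling through it. Everything below is a hypothetical stand-in written for illustration. The names lpc_check, validate, counting_lpc and this free-function burst are assumptions, and it is not the actual UPC++ lpc_inbox or intru_queue code.

```cpp
#include <cassert>

// Hypothetical stand-ins for illustration only; NOT the UPC++ definitions.
struct lpc_base;

struct lpc_vtable {
  void (*execute_and_delete)(lpc_base*);
};

struct lpc_base {
  const lpc_vtable* vtbl = nullptr;  // null here mimics the failing assertion
  lpc_base* next = nullptr;          // intrusive singly linked list hook
};

enum class lpc_check { ok, null_node, null_vtbl };

// Mirrors the two assertions in the diff, but reports instead of aborting.
inline lpc_check validate(const lpc_base* m) {
  if (m == nullptr) return lpc_check::null_node;
  if (m->vtbl == nullptr) return lpc_check::null_vtbl;
  return lpc_check::ok;
}

// A concrete node that bumps a counter when executed, then deletes itself.
struct counting_lpc : lpc_base {
  int* counter;
  explicit counting_lpc(int* c) : counter(c) { vtbl = &the_vtbl; }
  static void run(lpc_base* m) {
    assert(m != nullptr);
    auto* self = static_cast<counting_lpc*>(m);
    ++*self->counter;
    delete self;
  }
  static const lpc_vtable the_vtbl;
};
const lpc_vtable counting_lpc::the_vtbl = { &counting_lpc::run };

// Executes up to max_n nodes starting at head and returns how many ran.
// Stops at the first node that fails validation and records why in *status.
inline int burst(lpc_base*& head, int max_n, lpc_check* status) {
  int n = 0;
  *status = lpc_check::ok;
  while (head != nullptr && n < max_n) {
    lpc_base* m = head;
    *status = validate(m);
    if (*status != lpc_check::ok) break;
    head = m->next;  // unlink first: execute_and_delete frees m
    m->vtbl->execute_and_delete(m);
    ++n;
  }
  return n;
}
```

In this toy version a node with a null vtbl stops the drain with lpc_check::null_vtbl instead of crashing inside the call, which is the same information the failing UPCXX_ASSERT gives in the real run.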
-
- changed status to resolved
Bugfix issue 263.
upcxx::backend::gasnet::rma_put_then_am_master_procol() was not returning the correct synchronization level achieved during injection.
→ <<cset 4eff0dde2e1b>>
-
Thanks for the detailed crash report and reproduction instructions. They made this one quick to resolve!
-
reporter Thanks for the prompt fix, @john bachan.
Fix confirmed on Dirac (pgi) and on Summit (clang). I had hoped to retry Theta (PrgEnv-llvm) where I'd seen the problem before. However, the queues are too clogged for that.
-
Important observation: The invalid-free failures seen on macOS Sierra are for a tester built with UPCXX_LPC_INBOX=locked. We have issue 245 for persona-example hanging in that configuration. So, unless that case helps elucidate the other failures, it should not be considered "representative".
I am seeking a system where @john bachan can reproduce, and unfortunately that was the only tester with a failure that I believe he has access to.
UPDATE: since I typed the above, the EX-dirac-ibv-pgi tester has shown the error. Dan has added the backtrace from that run to this issue.