Test failures with IBV's recv thread enabled

Issue #495 resolved
Paul Hargrove created an issue

On 2021.07.20, the EX-dirac-ibv-gcc-pthr regression tester ran for the first time with the ibv-conduit AM receive progress thread enabled. This resulted in numerous new failures (crashes and hangs).

The failures on gasnet-tests were expected, as were failures in a small fraction of the external-upcxx tests that use the GASNet-EX collectives. However, there were also several unexpected crashes (all in -seq tests), such as the one from misc_perf-seq shown at the end of this Description.

Dan thinks this is due to a non-threadsafe LPC queue being manipulated by an AM handler running on one of the ibv hidden threads while the primordial thread is polling. Fixing this would likely sacrifice most of the performance benefit of threadmode=seq when ibv-conduit is configured with hidden threads.

  • option 1: force no hidden threads (#error)
  • option 2: add a workaround to the UPC++ runtime and change configure to default hidden threads to off
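To make the trade-off concrete, here is a minimal sketch of what a thread-safe queue could look like. It is an illustration only: `locked_lpc_queue` and `run_demo` are hypothetical names, the design is a plain mutex around a `std::deque`, and none of it is taken from UPC++'s `intru_queue`. The lock acquired on every enqueue and every burst is the kind of overhead that threadmode=seq is meant to avoid.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical stand-in for a persona's LPC queue (not UPC++'s intru_queue).
class locked_lpc_queue {
public:
  // Safe to call from any thread, e.g. an AM handler on a hidden thread.
  void enqueue(std::function<void()> lpc) {
    std::lock_guard<std::mutex> guard(mu_);
    pending_.push_back(std::move(lpc));
  }

  // Called by the owning thread while polling; runs everything queued so far
  // and returns how many LPCs were executed.
  std::size_t burst() {
    std::deque<std::function<void()>> batch;
    {
      std::lock_guard<std::mutex> guard(mu_);
      batch.swap(pending_);
    }
    for (auto &lpc : batch) lpc();  // run callbacks outside the lock
    return batch.size();
  }

private:
  std::mutex mu_;
  std::deque<std::function<void()>> pending_;
};

// Simulates one hidden thread enqueueing n LPCs while the caller polls.
// Returns the number of LPCs that actually ran.
inline std::size_t run_demo(std::size_t n) {
  locked_lpc_queue q;
  std::size_t executed = 0;  // only touched by the polling thread
  std::thread hidden([&q, &executed, n] {
    for (std::size_t i = 0; i < n; ++i) q.enqueue([&executed] { ++executed; });
  });
  std::size_t drained = 0;
  while (drained < n) drained += q.burst();  // the "primordial" thread polling
  hidden.join();
  assert(drained == n);
  return executed;
}
```

Calling `run_demo(1000)` pushes 1000 callbacks from the simulated hidden thread and drains them on the calling thread. Without the mutex, the same access pattern would be a data race.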

While this observation was made with the develop branch, it seems likely to me that this bug has always been present but was never seen, because this non-default configuration had no testing coverage until now.

*** Caught a fatal signal (proc 3): SIGSEGV(11)
[3] Invoking GDB for backtrace...
[3] /usr/local/pkg/gdb/newest/bin/gdb -nx -batch -x /tmp/gasnet_hEYqLm '/home/data2/upcnightly/dirac/EX-dirac-ibv-gcc-pthr/work/dbg/gasnet/tests/upcr-harness/external-upcxx/./misc_perf-seq' 13115
[3] [New LWP 13119]
[3] [New LWP 13123]
[3] [New LWP 13127]
[3] [Thread debugging using libthread_db enabled]
[3] Using host libthread_db library "/usr/lib64/libthread_db.so.1".
[3] 0x00007eff5e06f159 in waitpid () from /usr/lib64/libc.so.6
[3] To enable execution of this file add
[3]     add-auto-load-safe-path /usr/local/pkg/gcc/11.1.0/lib64/libstdc++.so.6.0.29-gdb.py
[3] line to your configuration file "/home/pagoda1/phargrov/.gdbinit".
[3] To completely disable this security protection add
[3]     set auto-load safe-path /
[3] line to your configuration file "/home/pagoda1/phargrov/.gdbinit".
[3] For more information about this security protection see the
[3] "Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
[3]     info "(gdb)Auto-loading safe path"
[3] #0  0x00007eff5e06f159 in waitpid () from /usr/lib64/libc.so.6
[3] #1  0x00007eff5dfecde2 in do_system () from /usr/lib64/libc.so.6
[3] #2  0x00007eff5dfed191 in system () from /usr/lib64/libc.so.6
[3] #3  0x0000565466b29be5 in gasneti_system_redirected (cmd=0x565467277a40 <cmd> "/usr/local/pkg/gdb/newest/bin/gdb -nx -batch -x /tmp/gasnet_hEYqLm '/home/data2/upcnightly/dirac/EX-dirac-ibv-gcc-pthr/work/dbg/gasnet/tests/upcr-harness/external-upcxx/./misc_perf-seq' 13115", stdout_fd=23) at /scratch/upcnightly/EX-dirac-ibv-gcc-pthr/runtime/src/gasnet/gasnet_tools.c:1279
[3] #4  0x0000565466b2a443 in gasneti_bt_gdb (fd=23) at /scratch/upcnightly/EX-dirac-ibv-gcc-pthr/runtime/src/gasnet/gasnet_tools.c:1544
[3] #5  0x0000565466b2ad79 in gasneti_print_backtrace (fd=2) at /scratch/upcnightly/EX-dirac-ibv-gcc-pthr/runtime/src/gasnet/gasnet_tools.c:1829
[3] #6  0x0000565466b2b40e in _gasneti_print_backtrace_ifenabled (fd=2) at /scratch/upcnightly/EX-dirac-ibv-gcc-pthr/runtime/src/gasnet/gasnet_tools.c:1962
[3] #7  0x0000565466d66fb2 in gasneti_defaultSignalHandler (sig=11) at /scratch/upcnightly/EX-dirac-ibv-gcc-pthr/runtime/src/gasnet/gasnet_internal.c:1027
[3] #8  <signal handler called>
[3] #9  std::__atomic_base<upcxx::detail::lpc_base*>::load (__m=std::memory_order::relaxed, this=0x9) at /usr/local/pkg/gcc/11.1.0/include/c++/11.1.0/bits/atomic_base.h:838
[3] #10 std::atomic<upcxx::detail::lpc_base*>::load (this=0x9, __m=std::memory_order::relaxed) at /usr/local/pkg/gcc/11.1.0/include/c++/11.1.0/atomic:570
[3] #11 0x00005654669d30a3 in upcxx::detail::intru_queue<upcxx::detail::lpc_base, (upcxx::detail::intru_queue_safety)0, &upcxx::detail::lpc_base::intruder>::burst_something<upcxx::detail::persona_tls::burst_user(upcxx::persona&)::{lambda(upcxx::detail::lpc_base*)#1}>(upcxx::detail::persona_tls::burst_user(upcxx::persona&)::{lambda(upcxx::detail::lpc_base*)#1}&&, upcxx::detail::lpc_base*) (this=0x5654672714a8 <upcxx::backend::master+72>, fn=..., head1=0x1) at /home/data2/upcnightly/dirac/EX-dirac-ibv-gcc-pthr/work/dbg/upcxx/bld/upcxx.assert1.optlev0.dbgsym1.gasnet_seq.ibv/include/upcxx/intru_queue.hpp:137
[3] #12 0x00005654669d0f03 in upcxx::detail::intru_queue<upcxx::detail::lpc_base, (upcxx::detail::intru_queue_safety)0, &upcxx::detail::lpc_base::intruder>::burst<upcxx::detail::persona_tls::burst_user(upcxx::persona&)::{lambda(upcxx::detail::lpc_base*)#1}>(upcxx::detail::persona_tls::burst_user(upcxx::persona&)::{lambda(upcxx::detail::lpc_base*)#1}&&) (this=0x5654672714a8 <upcxx::backend::master+72>, fn=...) at /home/data2/upcnightly/dirac/EX-dirac-ibv-gcc-pthr/work/dbg/upcxx/bld/upcxx.assert1.optlev0.dbgsym1.gasnet_seq.ibv/include/upcxx/intru_queue.hpp:124
[3] #13 0x00005654669ceb91 in upcxx::detail::persona_tls::burst_user (this=0x7eff5fc36768, p=...) at /home/data2/upcnightly/dirac/EX-dirac-ibv-gcc-pthr/work/dbg/upcxx/bld/upcxx.assert1.optlev0.dbgsym1.gasnet_seq.ibv/include/upcxx/persona.hpp:821
[3] #14 0x00005654669c5750 in operator() (__closure=0x7ffd89732160, p=...) at /home/data2/upcnightly/dirac/EX-dirac-ibv-gcc-pthr/work/dbg/upcxx/src/backend/gasnet/runtime.cpp:2116
[3] #15 0x00005654669c7204 in upcxx::detail::persona_tls::foreach_active_as_top<do_progress<(upcxx::progress_level)1>()::<lambda(upcxx::persona&)> >(struct {...} &&) (this=0x7eff5fc36768, fn=...) at /home/data2/upcnightly/dirac/EX-dirac-ibv-gcc-pthr/work/dbg/upcxx/bld/upcxx.assert1.optlev0.dbgsym1.gasnet_seq.ibv/include/upcxx/persona.hpp:778
[3] #16 0x00005654669c58d7 in do_progress<(upcxx::progress_level)1> () at /home/data2/upcnightly/dirac/EX-dirac-ibv-gcc-pthr/work/dbg/upcxx/src/backend/gasnet/runtime.cpp:2102
[3] #17 0x00005654669c3121 in upcxx::detail::progress_user () at /home/data2/upcnightly/dirac/EX-dirac-ibv-gcc-pthr/work/dbg/upcxx/src/backend/gasnet/runtime.cpp:2155
[3] #18 0x000056546696b22a in upcxx::progress (level=upcxx::progress_level::user) at /home/data2/upcnightly/dirac/EX-dirac-ibv-gcc-pthr/work/dbg/upcxx-inst/include/upcxx/backend.hpp:34
[3] #19 0x0000565466969a17 in upcxx::detail::future_wait_upcxx_progress_user::operator() (this=0x7ffd89732757) at /home/data2/upcnightly/dirac/EX-dirac-ibv-gcc-pthr/work/dbg/upcxx-inst/include/upcxx/future/future1.hpp:53
[3] #20 0x000056546696db6e in upcxx::detail::future1<upcxx::detail::future_kind_shref<upcxx::detail::future_header_ops_general, false>>::wait<-1, upcxx::detail::future_wait_upcxx_progress_user>(upcxx::detail::future_wait_upcxx_progress_user&&) && (this=0x7ffd89732720, progress=...) at /home/data2/upcnightly/dirac/EX-dirac-ibv-gcc-pthr/work/dbg/upcxx-inst/include/upcxx/future/future1.hpp:358
[3] #21 0x00005654668f8580 in doit3 () at /home/data2/upcnightly/dirac/EX-dirac-ibv-gcc-pthr/work/dbg/upcxx/bench/misc_perf.cpp:256
[3] #22 0x00005654668f7a82 in doit2 () at /home/data2/upcnightly/dirac/EX-dirac-ibv-gcc-pthr/work/dbg/upcxx/bench/misc_perf.cpp:248
[3] #23 0x00005654668f50d2 in doit1 () at /home/data2/upcnightly/dirac/EX-dirac-ibv-gcc-pthr/work/dbg/upcxx/bench/misc_perf.cpp:179
[3] #24 0x00005654668f3170 in doit () at /home/data2/upcnightly/dirac/EX-dirac-ibv-gcc-pthr/work/dbg/upcxx/bench/misc_perf.cpp:123
[3] #25 0x00005654668f2c2f in main (argc=1, argv=0x7ffd89733f28) at /home/data2/upcnightly/dirac/EX-dirac-ibv-gcc-pthr/work/dbg/upcxx/bench/misc_perf.cpp:101

Comments (4)

  1. Dan Bonachea

    I've confirmed my diagnosis described in the original report: in SEQ threadmode, AM handlers assume they run on the primordial thread and push things onto the master persona's non-threadsafe queue, which explodes when ibv-conduit runs those AM handlers on a hidden thread.

    PR 379 "fixes" the problem of the ibv receive thread by forcing it off at configure time and asserting no conduit is advertising a non-zero GASNET_HIDDEN_AM_CONCURRENCY_LEVEL, because currently the runtime code does not correctly handle AM handlers running on non-primordial threads (which includes conduit hidden threads) in SEQ mode.
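A compile-time check of the kind described above could be sketched as follows. This is not the code from PR 379: the predicate `hidden_am_threads_disabled` is a hypothetical helper, and the fallback `#define` exists only so the sketch compiles without GASNet headers (a real build would take the value the conduit advertises).

```cpp
#include <cassert>

// Assumption: stand-in value so this sketch builds on its own; in a real
// build the conduit's headers would supply this macro.
#ifndef GASNET_HIDDEN_AM_CONCURRENCY_LEVEL
#define GASNET_HIDDEN_AM_CONCURRENCY_LEVEL 0
#endif

// Hypothetical predicate: true when the conduit advertises that no hidden
// thread can run AM handlers concurrently with the client's own threads.
constexpr bool hidden_am_threads_disabled(int hidden_am_level) {
  return hidden_am_level == 0;
}

// Refuse to build when AM handlers could run on a conduit hidden thread.
static_assert(hidden_am_threads_disabled(GASNET_HIDDEN_AM_CONCURRENCY_LEVEL),
              "AM handlers on conduit hidden threads are not supported");
```

Failing at build time turns a hard-to-diagnose crash such as the SIGSEGV above into an immediate and explicit error.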

    I have some incomplete WIP on this branch for actually supporting this case, but as discussed in our 2021-09-20 meeting there is little perceived benefit to a GASNet-level progress thread in UPC++, because it only AM polls and can't/won't advance the UPC++ runtime's internal progress. Users who want useful asynchronous UPC++ progress need to run threadmode=par and roll their own progress thread around upcxx::progress().
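The "roll your own progress thread" pattern can be sketched generically as below. To keep the example self-contained it takes an arbitrary callable instead of calling `upcxx::progress()` directly. The names `progress_thread` and `run_progress_demo` are hypothetical, and a real UPC++ program would additionally need threadmode=par and would have to decide which personas the thread holds.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <thread>
#include <utility>

// Runs progress_fn in a loop on a dedicated thread until destroyed.
// In a UPC++ program, progress_fn would wrap a call to upcxx::progress().
class progress_thread {
public:
  explicit progress_thread(std::function<void()> progress_fn)
      : worker_([this, fn = std::move(progress_fn)] {
          while (!stop_.load(std::memory_order_acquire)) {
            fn();
            std::this_thread::yield();  // be polite to the main thread
          }
        }) {}

  ~progress_thread() {
    stop_.store(true, std::memory_order_release);
    worker_.join();
  }

  progress_thread(const progress_thread &) = delete;
  progress_thread &operator=(const progress_thread &) = delete;

private:
  std::atomic<bool> stop_{false};  // declared first: initialized before worker_ starts
  std::thread worker_;
};

// Usage sketch: returns how many times the stand-in progress function ran.
inline int run_progress_demo() {
  std::atomic<int> calls{0};
  {
    progress_thread t([&calls] { calls.fetch_add(1, std::memory_order_relaxed); });
    while (calls.load(std::memory_order_relaxed) == 0) std::this_thread::yield();
  }  // destructor signals the thread to stop and joins it here
  const int n = calls.load();
  assert(n > 0);
  return n;
}
```

Tying the thread's lifetime to an object means it is always stopped and joined before the enclosing scope exits, which matters if the loop must end before runtime finalization.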