Crashes of issue421.cpp with NVIDIA compilers on ppc64le
We now have automated testing of the most recent (21.2) release of the "nvhpc"-branded compilers from NVIDIA (successor to the "PGI" brand).
The testing is showing intermittent SEGVs on the issue421.cpp test on the PPC64le platform, though not (yet?) on x86_64. Additionally, issue421c.cpp has now been seen to crash as well.
Because this occurs only in optimized (opt) builds, we don't have backtraces from the automated tests. However, I have reproduced the crashes manually and will provide backtraces in comments.
Comments (8)
-
reporter -
reporter Representative backtrace from issue421c on PPC64le:
[2] #7 <signal handler called>
[2] #8 __GI___libc_free (mem=0x3234203e3d202930) at malloc.c:3102
[2] #9 0x000000001000bee8 in upcxx::backend::gasnet::rpc_as_lpc::cleanup<true, false> () at /home/phargrov/upcxx/B-pgi-21.2/bld/upcxx.assert0.optlev3.dbgsym0.gasnet_seq.smp/include/upcxx/backend/gasnet/runtime.hpp:723
[2] #10 _ZN5upcxx6detail33apply_variadic_as_future_dispatchIONS0_7commandIJPNS0_8lpc_baseEEE13after_executeIZZZZZNS0_4copyINS_11completionsIJNS_6rpc_cxINS_15remote_cx_eventENS_14bound_functionIZ4mainEUlvE1_JEEEEEEEEEENS0_11copy_traitsIT_E8return_tEiiPviiSK_mOSH_ENKUlvE3_clEvENKUlSK_E_clESK_ENKUlvE_clEvENKUlvE_clEvEUlvE_Lb0EXadL_ZNS_7backend6gasnet10rpc_as_lpc7cleanupILb1ELb0EEEvS4_EEEESt5tupleIJEEvEclESW_ () at /home/phargrov/upcxx/B-pgi-21.2/bld/upcxx.assert0.optlev3.dbgsym0.gasnet_seq.smp/include/upcxx/future/apply.hpp:29
[2] #11 _ZN5upcxx6detail7commandIJPNS0_8lpc_baseEEE12the_executorIZZZZZNS0_4copyINS_11completionsIJNS_6rpc_cxINS_15remote_cx_eventENS_14bound_functionIZ4mainEUlvE1_JEEEEEEEEEENS0_11copy_traitsIT_E8return_tEiiPviiSJ_mOSG_ENKUlvE3_clEvENKUlSJ_E_clESJ_ENKUlvE_clEvENKUlvE_clEvEUlvE_XadL_ZNS_7backend6gasnet10rpc_as_lpc9reader_ofES3_EEXadL_ZNSS_7cleanupILb1ELb0EEEvS3_EEEEvS3_ () at /home/phargrov/upcxx/B-pgi-21.2/bld/upcxx.assert0.optlev3.dbgsym0.gasnet_seq.smp/include/upcxx/command.hpp:232
[2] #12 0x000000001002ae88 in upcxx::progress(upcxx::progress_level)::{lambda(upcxx::persona&)#1}::operator() () at /home/phargrov/upcxx/src/backend/gasnet/runtime.cpp:792
[2] #13 0x0000000010025a30 in void upcxx::detail::persona_tls::foreach_active_as_top<upcxx::progress(upcxx::progress_level)::{lambda(upcxx::persona&)#1}>(upcxx::progress(upcxx::progress_level)::{lambda(upcxx::persona&)#1}&&) () at /home/phargrov/upcxx/B-pgi-21.2/bld/upcxx.assert0.optlev3.dbgsym0.gasnet_seq.smp/include/upcxx/persona.hpp:772
[2] #14 upcxx::progress () at /home/phargrov/upcxx/src/backend/gasnet/runtime.cpp:2091
[2] #15 0x0000000010007614 in main () at ../test/regression/issue421c.cpp:55
-
reporter 500 consecutive runs of issue421b show no failures, while the other tests would fail at least once in ten trials. Perhaps this difference is of some value in identifying the problem.
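To put a rough number on how significant 500 clean runs are, here is a minimal sketch. It assumes that runs are independent and that the per-run failure probability of a test sharing the defect would be about 0.1; that figure is a conservative reading of "at least once in ten trials", not a measured rate. The function name `prob_all_pass` is made up for illustration.

```cpp
#include <cmath>

// Probability of observing zero failures in n independent runs,
// given an assumed per-run failure probability p_fail.
double prob_all_pass(double p_fail, int n) {
    return std::pow(1.0 - p_fail, n);
}
```

Under those assumptions, `prob_all_pass(0.1, 500)` is on the order of 1e-23, so 500 clean runs of issue421b would be very unlikely if it failed at a rate similar to the other two tests.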
reporter Pre-backtrace output from 421b, as requested by @Dan Bonachea
Test: issue421c.cpp
Ranks: 4
[0] (gp: 0, 0x7a39e56403e0, heap=0) => 42 expect=42
[3] (gp: 3, 0x7a39e56403e0, heap=0) => 42 expect=42
[1] (gp: 1, 0x7a39e56403e0, heap=0) => 42 expect=42
[2] (gp: 2, 0x7a39e56403e0, heap=0) => 42 expect=42
[3] (gp: 3, 0x7a39e56403e0, heap=0) => 420 expect=420
[0] (gp: 0, 0x7a39e56403e0, heap=0) => 421 expect=421
[1] (gp: 1, 0x7a39e56403e0, heap=0) => 422 expect=422
[2] (gp: 2, 0x7a39e56403e0, heap=0) => 423 expect=423
*** Caught a fatal signal (proc 1): SIGSEGV(11)
*** Caught a fatal signal (proc 2): SIGSEGV(11)
*** Caught a fatal signal (proc 3): SIGSEGV(11)
NOTICE: We recommend linking the debug version of GASNet to assist you in resolving this application issue.
NOTICE: We recommend linking the debug version of GASNet to assist you in resolving this application issue.
NOTICE: We recommend linking the debug version of GASNet to assist you in resolving this application issue.
*** Caught a fatal signal (proc 0): SIGSEGV(11)
NOTICE: We recommend linking the debug version of GASNet to assist you in resolving this application issue.
Another failing run shows one more line prior to the SEGVs:
[0] (gp: 0, 0x70fc5c9e03e0, heap=0) => 42 expect=42
[2] (gp: 2, 0x70fc5c9e03e0, heap=0) => 42 expect=42
[3] (gp: 3, 0x70fc5c9e03e0, heap=0) => 42 expect=42
[1] (gp: 1, 0x70fc5c9e03e0, heap=0) => 42 expect=42
[0] (gp: 0, 0x70fc5c9e03e0, heap=0) => 421 expect=421
[3] (gp: 3, 0x70fc5c9e03e0, heap=0) => 420 expect=420
[1] (gp: 1, 0x70fc5c9e03e0, heap=0) => 422 expect=422
[2] (gp: 2, 0x70fc5c9e03e0, heap=0) => 423 expect=423
[0] (gp: 0, 0x70fc5c9e0400, heap=0) => 420 expect=420
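One detail that may be worth checking: the `mem=` argument in frame #8 of the issue421c backtrace, 0x3234203e3d202930, consists entirely of printable ASCII bytes. The sketch below reads such a value as eight bytes in little-endian order, which is how it would sit in memory on PPC64le. The helper name `bytes_le` is made up for illustration and is not part of UPC++ or GASNet.

```cpp
#include <cstdint>
#include <string>

// Interpret a 64-bit value as the 8 bytes it occupies in memory on a
// little-endian machine, lowest address first, and return them as text.
std::string bytes_le(std::uint64_t v) {
    std::string s;
    for (int i = 0; i < 8; ++i) {
        // Byte i of the in-memory representation is bits [8*i, 8*i+7].
        s.push_back(static_cast<char>((v >> (8 * i)) & 0xff));
    }
    return s;
}
```

For the value above, `bytes_le(0x3234203e3d202930ULL)` yields the text `0) => 42`, which matches a fragment of the test's own output lines (`heap=0) => 42`). That hints that the pointer passed to free() may have been overwritten by, or read from, a buffer holding output text, though this alone does not establish the cause.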
-
These stack traces both show crashes with line numbers that appear to be in the remote-to-local (h2h copy-get) step, which I believe is a different path from the PGI crashes described in issue 421 where the loopback path was implicated. So we are probably looking at a distinct defect.
They may still be related, but there's at least a chance that my in-progress work on copy will happen to resolve this.
-
assigned issue to
-
assigned issue to
-
Might be resolved in pull request 327
-
- changed status to resolved
Automated testing on all of:
all confirm this appears to have been resolved by pull request 327, merged on 2021-03-07
Representative backtrace from a failure of issue421 on PPC64le. This looks to me very similar to one of the two original failure modes I reported in pull request 289.