Crashes of issue421.cpp with NVIDIA compilers on ppc64le

Issue #457 resolved
Paul Hargrove created an issue

We now have automated testing of the most recent (21.2) release of the "nvhpc" branded compilers from NVIDIA (successor to the "PGI" brand).

The testing is showing intermittent SEGVs on the issue421.cpp test on the PPC64le platform, though not (yet?) on x86_64. Additionally, issue421c.cpp has now been seen to crash as well.

Because this is occurring only in optimized builds, we don't have backtraces from the automated tests. However, I have reproduced the failure manually and will provide backtraces in comments.

Comments (8)

  1. Paul Hargrove reporter

    Representative backtrace from a failure of issue421 on PPC64le. This looks to me very similar to one of the two original failure modes I reported in pull request 289.

    [1] #7  <signal handler called>
    [1] #8  __GI___libc_free (mem=0x90000000001013aa) at malloc.c:3102
    [1] #9  0x000000001000bde8 in upcxx::backend::gasnet::rpc_as_lpc::cleanup<true, false> () at /home/phargrov/upcxx/B-pgi-21.2/bld/upcxx.assert0.optlev3.dbgsym0.gasnet_seq.smp/include/upcxx/backend/gasnet/runtime.hpp:723
    [1] #10 _ZN5upcxx6detail33apply_variadic_as_future_dispatchIONS0_7commandIJPNS0_8lpc_baseEEE13after_executeIZZZZZNS0_4copyINS_11completionsIJNS_6rpc_cxINS_15remote_cx_eventENS_14bound_functionIZ4mainEUlvE1_JEEEEEEEEEENS0_11copy_traitsIT_E8return_tEiiPviiSK_mOSH_ENKUlvE3_clEvENKUlSK_E_clESK_ENKUlvE_clEvENKUlvE_clEvEUlvE_Lb0EXadL_ZNS_7backend6gasnet10rpc_as_lpc7cleanupILb1ELb0EEEvS4_EEEESt5tupleIJEEvEclESW_ () at /home/phargrov/upcxx/B-pgi-21.2/bld/upcxx.assert0.optlev3.dbgsym0.gasnet_seq.smp/include/upcxx/future/apply.hpp:29
    [1] #11 _ZN5upcxx6detail7commandIJPNS0_8lpc_baseEEE12the_executorIZZZZZNS0_4copyINS_11completionsIJNS_6rpc_cxINS_15remote_cx_eventENS_14bound_functionIZ4mainEUlvE1_JEEEEEEEEEENS0_11copy_traitsIT_E8return_tEiiPviiSJ_mOSG_ENKUlvE3_clEvENKUlSJ_E_clESJ_ENKUlvE_clEvENKUlvE_clEvEUlvE_XadL_ZNS_7backend6gasnet10rpc_as_lpc9reader_ofES3_EEXadL_ZNSS_7cleanupILb1ELb0EEEvS3_EEEEvS3_ () at /home/phargrov/upcxx/B-pgi-21.2/bld/upcxx.assert0.optlev3.dbgsym0.gasnet_seq.smp/include/upcxx/command.hpp:232
    [1] #12 0x000000001002ad88 in upcxx::progress(upcxx::progress_level)::{lambda(upcxx::persona&)#1}::operator() () at /home/phargrov/upcxx/src/backend/gasnet/runtime.cpp:792
    [1] #13 0x0000000010025930 in void upcxx::detail::persona_tls::foreach_active_as_top<upcxx::progress(upcxx::progress_level)::{lambda(upcxx::persona&)#1}>(upcxx::progress(upcxx::progress_level)::{lambda(upcxx::persona&)#1}&&) () at /home/phargrov/upcxx/B-pgi-21.2/bld/upcxx.assert0.optlev3.dbgsym0.gasnet_seq.smp/include/upcxx/persona.hpp:772
    [1] #14 upcxx::progress () at /home/phargrov/upcxx/src/backend/gasnet/runtime.cpp:2091
    [1] #15 0x0000000010007474 in main () at ../test/regression/issue421.cpp:44
    
  2. Paul Hargrove reporter

    Representative backtrace from issue421c on PPC64le.

    [2] #7  <signal handler called>
    [2] #8  __GI___libc_free (mem=0x3234203e3d202930) at malloc.c:3102
    [2] #9  0x000000001000bee8 in upcxx::backend::gasnet::rpc_as_lpc::cleanup<true, false> () at /home/phargrov/upcxx/B-pgi-21.2/bld/upcxx.assert0.optlev3.dbgsym0.gasnet_seq.smp/include/upcxx/backend/gasnet/runtime.hpp:723
    [2] #10 _ZN5upcxx6detail33apply_variadic_as_future_dispatchIONS0_7commandIJPNS0_8lpc_baseEEE13after_executeIZZZZZNS0_4copyINS_11completionsIJNS_6rpc_cxINS_15remote_cx_eventENS_14bound_functionIZ4mainEUlvE1_JEEEEEEEEEENS0_11copy_traitsIT_E8return_tEiiPviiSK_mOSH_ENKUlvE3_clEvENKUlSK_E_clESK_ENKUlvE_clEvENKUlvE_clEvEUlvE_Lb0EXadL_ZNS_7backend6gasnet10rpc_as_lpc7cleanupILb1ELb0EEEvS4_EEEESt5tupleIJEEvEclESW_ () at /home/phargrov/upcxx/B-pgi-21.2/bld/upcxx.assert0.optlev3.dbgsym0.gasnet_seq.smp/include/upcxx/future/apply.hpp:29
    [2] #11 _ZN5upcxx6detail7commandIJPNS0_8lpc_baseEEE12the_executorIZZZZZNS0_4copyINS_11completionsIJNS_6rpc_cxINS_15remote_cx_eventENS_14bound_functionIZ4mainEUlvE1_JEEEEEEEEEENS0_11copy_traitsIT_E8return_tEiiPviiSJ_mOSG_ENKUlvE3_clEvENKUlSJ_E_clESJ_ENKUlvE_clEvENKUlvE_clEvEUlvE_XadL_ZNS_7backend6gasnet10rpc_as_lpc9reader_ofES3_EEXadL_ZNSS_7cleanupILb1ELb0EEEvS3_EEEEvS3_ () at /home/phargrov/upcxx/B-pgi-21.2/bld/upcxx.assert0.optlev3.dbgsym0.gasnet_seq.smp/include/upcxx/command.hpp:232
    [2] #12 0x000000001002ae88 in upcxx::progress(upcxx::progress_level)::{lambda(upcxx::persona&)#1}::operator() () at /home/phargrov/upcxx/src/backend/gasnet/runtime.cpp:792
    [2] #13 0x0000000010025a30 in void upcxx::detail::persona_tls::foreach_active_as_top<upcxx::progress(upcxx::progress_level)::{lambda(upcxx::persona&)#1}>(upcxx::progress(upcxx::progress_level)::{lambda(upcxx::persona&)#1}&&) () at /home/phargrov/upcxx/B-pgi-21.2/bld/upcxx.assert0.optlev3.dbgsym0.gasnet_seq.smp/include/upcxx/persona.hpp:772
    [2] #14 upcxx::progress () at /home/phargrov/upcxx/src/backend/gasnet/runtime.cpp:2091
    [2] #15 0x0000000010007614 in main () at ../test/regression/issue421c.cpp:55
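
    One observation that may help triage, offered as a hypothesis rather than a diagnosis: the argument passed to free() in frame #8 (0x3234203e3d202930) consists entirely of printable ASCII bytes. The sketch below decodes a 64-bit value in little-endian byte order, as it would be laid out in memory on PPC64le. The helper name decode_le_ascii is hypothetical and is not part of UPC++ or GASNet.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: interpret a 64-bit value as the 8 bytes it occupies
// in little-endian memory (lowest byte first). Non-printable bytes are
// rendered as '.' so the result is always 8 characters long.
std::string decode_le_ascii(std::uint64_t value) {
    std::string text;
    for (int i = 0; i < 8; ++i) {
        unsigned char byte =
            static_cast<unsigned char>((value >> (8 * i)) & 0xff);
        text += (byte >= 0x20 && byte < 0x7f) ? static_cast<char>(byte) : '.';
    }
    return text;
}
```

    Calling decode_le_ascii(0x3234203e3d202930ULL) yields the string "0) => 42", which matches a fragment of the test's own output lines ("heap=0) => 42 expect=42"). That would be consistent with a pointer being read from memory that has been overwritten by, or reused for, formatted output text, though the backtrace alone does not establish this. The pointer in the issue421 backtrace (0x90000000001013aa) contains no printable bytes.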
    
  3. Paul Hargrove reporter

    500 consecutive runs of issue421b show no failures, while the other tests would fail at least once in ten trials. Perhaps this difference is of some value in identifying the problem.
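
    To put a rough number on how significant that difference is, the sketch below computes the probability of a clean streak under two assumptions that are not stated above: that runs are independent, and that the per-run failure rate is a fixed value (10% is used here as a stand-in for "about once in ten trials"). The function name prob_all_pass is hypothetical.

```cpp
#include <cmath>

// Probability of observing zero failures in `runs` independent trials,
// given an assumed per-trial failure probability `p_fail`.
double prob_all_pass(double p_fail, int runs) {
    return std::pow(1.0 - p_fail, runs);
}
```

    Under those assumptions, prob_all_pass(0.1, 500) is on the order of 1e-23, so 500 clean runs of issue421b would be very unlikely to be luck if it shared the failure rate of the other two tests.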

  4. Paul Hargrove reporter

    Pre-backtrace output from 421c, as requested by @Dan Bonachea

    Test: issue421c.cpp
    Ranks: 4
    [0] (gp: 0, 0x7a39e56403e0, heap=0) => 42 expect=42
    [3] (gp: 3, 0x7a39e56403e0, heap=0) => 42 expect=42
    [1] (gp: 1, 0x7a39e56403e0, heap=0) => 42 expect=42
    [2] (gp: 2, 0x7a39e56403e0, heap=0) => 42 expect=42
    [3] (gp: 3, 0x7a39e56403e0, heap=0) => 420 expect=420
    [0] (gp: 0, 0x7a39e56403e0, heap=0) => 421 expect=421
    [1] (gp: 1, 0x7a39e56403e0, heap=0) => 422 expect=422
    [2] (gp: 2, 0x7a39e56403e0, heap=0) => 423 expect=423
    *** Caught a fatal signal (proc 1): SIGSEGV(11)
    *** Caught a fatal signal (proc 2): SIGSEGV(11)
    *** Caught a fatal signal (proc 3): SIGSEGV(11)
    NOTICE: We recommend linking the debug version of GASNet to assist you in resolving this application issue.
    NOTICE: We recommend linking the debug version of GASNet to assist you in resolving this application issue.
    NOTICE: We recommend linking the debug version of GASNet to assist you in resolving this application issue.
    *** Caught a fatal signal (proc 0): SIGSEGV(11)
    NOTICE: We recommend linking the debug version of GASNet to assist you in resolving this application issue.
    

    Another failing run shows one more line prior to the SEGVs:

    [0] (gp: 0, 0x70fc5c9e03e0, heap=0) => 42 expect=42
    [2] (gp: 2, 0x70fc5c9e03e0, heap=0) => 42 expect=42
    [3] (gp: 3, 0x70fc5c9e03e0, heap=0) => 42 expect=42
    [1] (gp: 1, 0x70fc5c9e03e0, heap=0) => 42 expect=42
    [0] (gp: 0, 0x70fc5c9e03e0, heap=0) => 421 expect=421
    [3] (gp: 3, 0x70fc5c9e03e0, heap=0) => 420 expect=420
    [1] (gp: 1, 0x70fc5c9e03e0, heap=0) => 422 expect=422
    [2] (gp: 2, 0x70fc5c9e03e0, heap=0) => 423 expect=423
    [0] (gp: 0, 0x70fc5c9e0400, heap=0) => 420 expect=420
    
  5. Dan Bonachea

    These stack traces both show crashes with line numbers that appear to be in the remote-to-local (h2h copy-get) step, which I believe is a different path from the PGI crashes described in issue 421 where the loopback path was implicated. So we are probably looking at a distinct defect.

    They may still be related, and there's at least a chance that my in-progress work on copy will happen to resolve this.
