Regression (CRASH) on compute-pi-multi-examples

Issue #263 resolved
Paul Hargrove created an issue

Having recently merged both PR#110 and PR#116, we now have a SEGV from compute-pi-multi-examples on a few platforms not included in the manual CI performed pre-merge.
These appear to be SMP, UDP or MPI platforms.

Today's backtraces can be seen here for seq and here for par.

I see SEGV on Linux and WSL platforms, and an invalid free() on one macOS tester.
As I type this, not all of the results for today are done. So, there may be other failure modes I've not yet seen.
However, the links above will show new results as they are completed today.

Comments (7)

  1. Paul Hargrove reporter

    Important observation: The invalid-free failures seen on macOS Sierra are for a tester built with UPCXX_LPC_INBOX=locked. We have issue 245 for persona-example hanging in that configuration.
    So, unless that case helps elucidate the other failures, it should not be considered "representative".

    I am seeking a system where @john bachan can reproduce, and unfortunately that was the only tester with a failure that I believe he has access to.

    UPDATE: since I typed the above, the EX-dirac-ibv-pgi tester has shown the error. Dan has added the backtrace from that run to this issue.

  2. Dan Bonachea

    Here is a relevant stack trace snapshot from dirac/ibv/pgi, for archival purposes:

    [7] #7  <signal handler called>
    [7] #8  0x000000000044d287 in upcxx::detail::lpc_inbox<(upcxx::detail::intru_queue_safety)0>::burst(int)::{lambda(upcxx::detail::lpc_base*)#1} (m=0x1b9e150) at /home/data2/upcnightly/dirac/EX-dirac-ibv-pgi/work/dbg/upcxx/.nobs/art/1aea07ef7e76a04150854a24409354e8509d0273/upcxx/lpc.hpp:78
    [7] #9  0x0000000000448d28 in upcxx::detail::intru_queue::burst_something (max_n=100, fn=0x7ffd39fea1c0, head1=0x1b9e150) at /home/data2/upcnightly/dirac/EX-dirac-ibv-pgi/work/dbg/upcxx/.nobs/art/1aea07ef7e76a04150854a24409354e8509d0273/upcxx/intru_queue.hpp:173
    [7] #10 0x0000000000448cae in upcxx::detail::intru_queue::burst (max_n=100, fn=0x7ffd39fea1c0) at /home/data2/upcnightly/dirac/EX-dirac-ibv-pgi/work/dbg/upcxx/.nobs/art/1aea07ef7e76a04150854a24409354e8509d0273/upcxx/intru_queue.hpp:158
    [7] #11 0x0000000000448e7f in upcxx::detail::lpc_inbox::burst (max_n=100) at /home/data2/upcnightly/dirac/EX-dirac-ibv-pgi/work/dbg/upcxx/.nobs/art/1aea07ef7e76a04150854a24409354e8509d0273/upcxx/lpc.hpp:78
    [7] #12 0x00000000004494e6 in upcxx::detail::persona_tls (p=0xcda9f0 <upcxx::backend::master>) at /home/data2/upcnightly/dirac/EX-dirac-ibv-pgi/work/dbg/upcxx/.nobs/art/1aea07ef7e76a04150854a24409354e8509d0273/upcxx/persona.hpp:720
    [7] #13 0x00000000004599c8 in upcxx::progress(upcxx::progress_level)::{lambda(upcxx::persona&)#1} (p=...) at /home/data2/upcnightly/dirac/EX-dirac-ibv-pgi/work/dbg/upcxx/src/backend/gasnet/runtime.cpp:1541
    [7] #14 0x00000000004496d6 in upcxx::detail::persona_tls (fn=0x7ffd39fea3e0) at /home/data2/upcnightly/dirac/EX-dirac-ibv-pgi/work/dbg/upcxx/.nobs/art/1aea07ef7e76a04150854a24409354e8509d0273/upcxx/persona.hpp:700
    [7] #15 0x00000000004570af in upcxx::progress(enum _ZN5upcxx14progress_levelE) (level=_ZN5upcxx14progress_level4userE) at /home/data2/upcnightly/dirac/EX-dirac-ibv-pgi/work/dbg/upcxx/src/backend/gasnet/runtime.cpp:1531
    [7] #16 0x00000000004565e1 in upcxx::finalize () at /home/data2/upcnightly/dirac/EX-dirac-ibv-pgi/work/dbg/upcxx/src/backend/gasnet/runtime.cpp:777
    [7] #17 0x000000000040b46d in main (argc=2, argv=0x7ffd39feb918) at /home/data2/upcnightly/dirac/EX-dirac-ibv-pgi/work/dbg/upcxx/example/prog-guide/compute-pi-multi-examples.cpp:99
    
  3. Paul Hargrove reporter

    The following is sufficient to reproduce, starting from a fresh login (default modules) on Dirac.

    {phargrov@pcp-d-1 ~}$ module load pgi
    {phargrov@pcp-d-1 ~}$ module swap mpi mpi/openmpi4-pgi-19.7
    {phargrov@pcp-d-1 ~}$ export GASNET_IBV_SPAWNER=ssh GASNET_BACKTRACE=1
    {phargrov@pcp-d-1 ~}$ cd upcxx
    {phargrov@pcp-d-1 upcxx}$ git describe
    upcxx-2019.3.7-88-ga6d23f1
    {phargrov@pcp-d-1 upcxx}$ rm -rf .nobs/
    {phargrov@pcp-d-1 upcxx}$ export UPCXX_INSTALL=$(pwd)/inst-pgi
    {phargrov@pcp-d-1 upcxx}$ PATH+=:$UPCXX_INSTALL/bin
    {phargrov@pcp-d-1 upcxx}$ CC=pgcc CXX=mpicxx ./install $UPCXX_INSTALL >inst-pgi-log.txt 2>&1
    {phargrov@pcp-d-1 upcxx}$ tail -1 inst-pgi-log.txt
    UPC++ successfully installed
    {phargrov@pcp-d-1 upcxx}$ upcxx -g -network=ibv example/prog-guide/compute-pi-multi-examples.cpp
    {phargrov@pcp-d-1 upcxx}$ upcxx-run -n 2 ./a.out
    [backtrace as previously posted by Dan]
    
  4. Paul Hargrove reporter

    Not sure if this helps (perhaps it was obvious to John), but here is a slightly more precise diagnosis of the point at which the SEGV is occurring:

    --- a/src/lpc.hpp
    +++ b/src/lpc.hpp
    @@ -75,7 +75,10 @@ namespace upcxx {
    
           // returns num lpc's executed
           int burst(int max_n = 100) {
    -        return q_.burst(max_n, [](lpc_base *m) { m->vtbl->execute_and_delete(m); });
    +        return q_.burst(max_n, [](lpc_base *m) {
    +                      UPCXX_ASSERT(!!m, "NULL m");
    +                      UPCXX_ASSERT(!!(m->vtbl), "NULL m->vtbl");  // <== FAILS
    +                      m->vtbl->execute_and_delete(m); });
           }
         };
       }
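
The check added in the diff above can be illustrated outside of UPC++ with a minimal sketch. Everything below is a hypothetical stand-in and not the UPC++ implementation: `lpc_vtable`, `counting_lpc`, `simple_queue` and `is_dispatchable` are names invented for illustration, and only `lpc_base`, `vtbl`, `execute_and_delete` and `burst` echo identifiers that appear in the backtrace and diff. It shows an intrusive queue whose nodes dispatch through a hand-rolled vtable pointer, and why a node whose `vtbl` is NULL cannot be executed safely. It says nothing about how the NULL `vtbl` arose in the actual bug, which this thread does not state.

```cpp
#include <cassert>

// Hypothetical stand-ins for the types named in the backtrace. This is an
// illustration of the dispatch pattern only, not UPC++ code.
struct lpc_base;

struct lpc_vtable {
  void (*execute_and_delete)(lpc_base *);
};

struct lpc_base {
  lpc_base *next = nullptr;          // intrusive link to the next queued node
  const lpc_vtable *vtbl = nullptr;  // must be set before the node is enqueued
};

// A concrete callback node: bumps a counter when executed, then frees itself.
struct counting_lpc : lpc_base {
  int *counter;
  explicit counting_lpc(int *c) : counter(c) { vtbl = &the_vtbl; }
  static void run(lpc_base *m) {
    counting_lpc *self = static_cast<counting_lpc *>(m);
    ++*self->counter;
    delete self;
  }
  static const lpc_vtable the_vtbl;
};
const lpc_vtable counting_lpc::the_vtbl = {&counting_lpc::run};

struct simple_queue {
  lpc_base *head = nullptr;
  lpc_base *tail = nullptr;

  void enqueue(lpc_base *m) {
    m->next = nullptr;
    if (tail) tail->next = m; else head = m;
    tail = m;
  }

  // True only if the node can be dispatched without dereferencing NULL.
  static bool is_dispatchable(const lpc_base *m) {
    return m != nullptr && m->vtbl != nullptr;
  }

  // Executes up to max_n queued callbacks and returns how many ran.
  int burst(int max_n = 100) {
    int n = 0;
    while (head && n < max_n) {
      lpc_base *m = head;
      head = m->next;  // unlink first: the node deletes itself when executed
      if (!head) tail = nullptr;
      // Same two conditions the diff asserts, checked before the call.
      assert(is_dispatchable(m) && "NULL m or NULL m->vtbl");
      m->vtbl->execute_and_delete(m);
      ++n;
    }
    return n;
  }
};
```

In this sketch a default-constructed `lpc_base` has a NULL `vtbl`, so enqueuing one and calling `burst()` would trip the assertion instead of crashing inside the indirect call, which is the kind of earlier, clearer failure the added `UPCXX_ASSERT` lines aim for.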
    
  5. john bachan

    Thanks for the detailed crash report and reproduction instructions. It made this one quick to resolve!

  6. Paul Hargrove reporter

    Thanks for the prompt fix, @john bachan .
    Fix confirmed on Dirac (pgi) and on Summit (clang).

    I had hoped to retry Theta (PrgEnv-llvm) where I'd seen the problem before. However, the queues are too clogged for that.
