Clarify semantic restrictions on `UPCXX_THREADMODE=seq`

Issue #142 resolved
Dan Bonachea created an issue

Currently UPCXX_THREADMODE=seq UPCXX_CODEMODE=debug hits an assertion failure on any communication injection after calling liberate_master_persona(), even when there are no other threads.

I don't think we've ever formally documented the restrictions on UPCXX_THREADMODE=seq mode, but we really need to do that - I assumed any app that never used threads was safe to use the entire UPC++ API. Due to this bug seq mode apps cannot use some features of personas (although they are admittedly not strongly motivated).

This needs to either be fixed or documented as a specified restriction for seq mode.

Comments (8)

  1. Dan Bonachea reporter

    Failure on dirac using nightly build:

    $ env UPCXX_THREADMODE=seq UPCXX_GASNET_CONDUIT=udp upcxx -g issue142.cpp && upcxx-run -np 2 ./a.out
    Test: issue142.cpp
    Ranks: 2
    *** FATAL ERROR: 
    //////////////////////////////////////////////////
    UPC++ assertion failure:
     rank=0
     file=/usr/local/pkg/upcxx-dirac/gcc-7.3.0-p1/nightly-2018.05.02/upcxx.debug.gasnet_seq.udp/include/upcxx/backend/gasnet/runtime.hpp:161
    
    Failed condition: !1 || backend::master.active_with_caller()
    
    To have UPC++ freeze during these errors so you can attach a debugger, rerun the program with GASNET_FREEZE_ON_ERROR=1 in the environment.
    //////////////////////////////////////////////////
    
    [0] Invoking GDB for backtrace...
    [0] /usr/local/pkg/gdb/newest/bin/gdb -nx -batch -x /tmp/gasnet_18Hwz9 '/home/pcp1/bonachea/UPC/upcxx/test/regression/./a.out' 12083
    [0] [Thread debugging using libthread_db enabled]
    [0] Using host libthread_db library "/lib64/libthread_db.so.1".
    [0] 0x00007f44ba34410c in waitpid () from /lib64/libc.so.6
    [0] To enable execution of this file add
    [0]     add-auto-load-safe-path /usr/local/pkg/gcc/7.3.0-p1/lib64/libstdc++.so.6.0.24-gdb.py
    [0] line to your configuration file "/home/pcp1/bonachea/.gdbinit".
    [0] To completely disable this security protection add
    [0]     set auto-load safe-path /
    [0] line to your configuration file "/home/pcp1/bonachea/.gdbinit".
    [0] For more information about this security protection see the
    [0] "Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
    [0]     info "(gdb)Auto-loading safe path"
    [0] #0  0x00007f44ba34410c in waitpid () from /lib64/libc.so.6
    [0] #1  0x00007f44ba2c1de2 in do_system () from /lib64/libc.so.6
    [0] #2  0x0000000000474fc9 in gasneti_system_redirected (cmd=0x949600 <cmd> "/usr/local/pkg/gdb/newest/bin/gdb -nx -batch -x /tmp/gasnet_18Hwz9 '/home/pcp1/bonachea/UPC/upcxx/test/regression/./a.out' 12083", stdout_fd=7) at /tmp/upcxx-nightly-dirac-gcc/bld/upcxx_install/berkeleylab-upcxx-develop/.nobs/art/b730c1ad0b03efaf006937aef2c53eb17407ceb9/GASNet-EX-collaborator-snapshot/gasnet_tools.c:1021
    [0] #3  0x0000000000475924 in gasneti_bt_gdb (fd=7) at /tmp/upcxx-nightly-dirac-gcc/bld/upcxx_install/berkeleylab-upcxx-develop/.nobs/art/b730c1ad0b03efaf006937aef2c53eb17407ceb9/GASNet-EX-collaborator-snapshot/gasnet_tools.c:1268
    [0] #4  0x00000000004760f0 in gasneti_print_backtrace (fd=2) at /tmp/upcxx-nightly-dirac-gcc/bld/upcxx_install/berkeleylab-upcxx-develop/.nobs/art/b730c1ad0b03efaf006937aef2c53eb17407ceb9/GASNet-EX-collaborator-snapshot/gasnet_tools.c:1537
    [0] #5  0x00000000004766e8 in _gasneti_print_backtrace_ifenabled (fd=2) at /tmp/upcxx-nightly-dirac-gcc/bld/upcxx_install/berkeleylab-upcxx-develop/.nobs/art/b730c1ad0b03efaf006937aef2c53eb17407ceb9/GASNet-EX-collaborator-snapshot/gasnet_tools.c:1668
    [0] #6  0x00000000004743a7 in gasneti_fatalerror (msg=0x67f12d "\n%s") at /tmp/upcxx-nightly-dirac-gcc/bld/upcxx_install/berkeleylab-upcxx-develop/.nobs/art/b730c1ad0b03efaf006937aef2c53eb17407ceb9/GASNet-EX-collaborator-snapshot/gasnet_tools.c:591
    [0] #7  0x00000000004435de in upcxx::assert_failed (file=0x675b30 "/usr/local/pkg/upcxx-dirac/gcc-7.3.0-p1/nightly-2018.05.02/upcxx.debug.gasnet_seq.udp/include/upcxx/backend/gasnet/runtime.hpp", line=161, msg=0x675af0 "Failed condition: !1 || backend::master.active_with_caller()") at /tmp/upcxx-nightly-dirac-gcc/bld/upcxx_install/berkeleylab-upcxx-develop/src/diagnostic.cpp:42
    [0] #8  0x0000000000410a62 in _ZN5upcxx7backend14send_am_masterILNS_14progress_levelE1ENS_14bound_functionIZNS_3rpcINS_11completionsIJNS_9future_cxINS_18operation_cx_eventEEEEEEOZNS_11dist_objectIiE5fetchEiEUlRSB_E_JSC_EEENS_6detail10rpc_returnIFT0_DpT1_ET_NSF_18rpc_remote_resultsISK_vE4typeEE4typeEiSL_OSH_DpOSI_EUlRNS3_ISD_JSC_EEEE_JSU_EEEEEviSR_ (recipient=1, fn=...) at /usr/local/pkg/upcxx-dirac/gcc-7.3.0-p1/nightly-2018.05.02/upcxx.debug.gasnet_seq.udp/include/upcxx/backend/gasnet/runtime.hpp:161
    [0] #9  0x000000000041003c in upcxx::rpc<upcxx::completions<upcxx::future_cx<upcxx::operation_cx_event> >, upcxx::dist_object<int>::fetch(int)::{lambda(upcxx::dist_object<int>&)#1}&&, upcxx::dist_object<int>&>(int, upcxx::dist_object<int>::fetch(int)::{lambda(upcxx::dist_object<int>&)#1}&& (upcxx::dist_object<int>&), upcxx::detail::rpc_return&&, (upcxx::dist_object<int>::fetch(int)::{lambda(upcxx::dist_object<int>&)#1}&&)...) (recipient=1, cxs=..., fn=..., args#0=...) at /usr/local/pkg/upcxx-dirac/gcc-7.3.0-p1/nightly-2018.05.02/upcxx.debug.gasnet_seq.udp/include/upcxx/rpc.hpp:213
    [0] #10 0x000000000040f99f in upcxx::rpc<upcxx::dist_object<int>::fetch(int)::{lambda(upcxx::dist_object<int>&)#1}, upcxx::dist_object<int>&>(int, upcxx::dist_object<int>::fetch(int)::{lambda(upcxx::dist_object<int>&)#1}, upcxx::dist_object<int>&) (recipient=1, fn=..., args#0=...) at /usr/local/pkg/upcxx-dirac/gcc-7.3.0-p1/nightly-2018.05.02/upcxx.debug.gasnet_seq.udp/include/upcxx/rpc.hpp:251
    [0] #11 0x000000000040f53b in upcxx::dist_object<int>::fetch (this=0x7ffc355e2590, rank=1) at /usr/local/pkg/upcxx-dirac/gcc-7.3.0-p1/nightly-2018.05.02/upcxx.debug.gasnet_seq.udp/include/upcxx/dist_object.hpp:143
    [0] #12 0x000000000040e001 in main () at issue142.cpp:29
    *** Caught a fatal signal: SIGABRT(6) on node 0/2
    Abort
    
  2. john bachan

    In auditing the code, I have discovered/recollected the following about backend=seq:

    • The runtime's AM handlers blindly assume they own the master persona, and push to its queue in the fast thread-unsafe way. Thus, any upcxx call that calls into gasnet requires the master persona. This is why we ASSERT ownership of master in so many places, which now seems like not enough (rput/rget should assert too). I think this is a good thing.

    • The runtime guard calls to gasnet_AMPoll() on the condition: if(this_thread == primordial_thread) gasnet_AMPoll(). This, when combined with the above, does not allow the primordial thread to call progress while another thread has master: silent racy corruption ensues. I am not in love with this, and would like to see the guard enhanced to also check master ownership. But that's beyond this release.

    • Shared heap allocation/deallocation ASSERTS ownership of master. Seems fine to me.

    So here's what I believe the documentation should say about restrictions present when backend=seq:

    • That only the primordial thread may hold the master persona.
    • That the following operations require the master persona (thus the primordial thread too):
      • "inter-process communicating operations" (air-quotes to compensate for absence of explicitly enumerated list):
        • Note: progress() is not included here, which is always safe from any thread any persona. Only a thread owning master drains RPCs. All calls to progress() drain inbound lpc's.
      • shared heap allocate/deallocate.

    I do not advocate munging with the ASSERT's before the release.

  3. Dan Bonachea reporter

    @john bachan : Thanks for the analysis.

    I agree we definitely should not be changing assertions for this release. Although IIUC we are basically saying that liberate_master_persona() might eventually be a runtime error in SEQ mode, because there is nothing useful that can be done with that persona aside from re-acquiring it with the same thread (so we might as well immediately signal this as a programming error).

    If you have time to draft something today to clarify these SEQ-mode restrictions in the documentation, that could be helpful to end users. The spec currently does not mention SEQ mode at all - this is solely a feature of the implementation, and I think that's a Good Thing. I'm thinking documentation for these restrictions could go as a new section in docs/implementation-defined.md?

  4. Log in to comment