Rare segfault in CCS tests in verify_segment/verify_all

Issue #572 resolved
Colin MacLean created an issue

There is a rare SEGV in CCS tests on macOS. Examples:

  1. https://socks.lbl.gov/Pagoda/upcxx/-/jobs/10613 ccs-dynamic-dynupcxx-par-opt-udp
  2. https://socks.lbl.gov/Pagoda/upcxx/-/jobs/15051 ccs-dynamic-seq-opt-udp
  3. https://socks.lbl.gov/Pagoda/upcxx/-/jobs/15222 ccs-inlib-seq-opt-udp

The cause has yet to be identified.

Comments (5)

  1. Dan Bonachea

    Crash stack from example 3 above:

    ...
    [0] ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    *** Caught a fatal signal (proc 0): SIGSEGV(11)
    *** NOTICE (proc 0): We recommend linking the debug version of GASNet to assist you in resolving this application issue.
    [0] Invoking LLDB for backtrace...
    [0] /usr/bin/lldb -p 33351 -o 'bt all' -o quit
    [0] (lldb) process attach --pid 33351
    [0] Process 33351 stopped
    [0] * thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
    [0]     frame #0: 0x00007ff8198e7a82 libsystem_kernel.dylib`swtch_pri + 10
    [0] libsystem_kernel.dylib`swtch_pri:
    [0] ->  0x7ff8198e7a82 <+10>: retq   
    [0]     0x7ff8198e7a83 <+11>: nop    
    [0] 
    [0] libsystem_kernel.dylib`swtch:
    [0]     0x7ff8198e7a84 <+0>:  movq   %rcx, %r10
    [0]     0x7ff8198e7a87 <+3>:  movl   $0x100003c, %eax          ; imm = 0x100003C 
    [0] Target 0: (test-ccs-inlib) stopped.
    [0] Executable module set to "/Users/gitlab-runner/builds/hLH-n9g1/0/Pagoda/upcxx/test-ccs-inlib-seq-opt-udp/test-ccs-inlib".
    [0] Architecture set to: x86_64-apple-macosx-.
    [0] (lldb) bt all
    [0] * thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
    [0]   * frame #0: 0x00007ff8198e7a82 libsystem_kernel.dylib`swtch_pri + 10
    [0]     frame #1: 0x00007ff819921c2f libsystem_pthread.dylib`cthread_yield + 11
    [0]     frame #2: 0x00000001032b9e0c libccs-inlib.so`gasneti_bt_lldb + 300
    [0]     frame #3: 0x00000001032b6c07 libccs-inlib.so`gasneti_print_backtrace + 759
    [0]     frame #4: 0x000000010335c8f0 libccs-inlib.so`gasneti_defaultSignalHandler + 112
    [0]     frame #5: 0x00007ff819939dfd libsystem_platform.dylib`_sigtramp + 29
    [0]     frame #6: 0x00007ff824127f81
    [0]     frame #7: 0x0000000103299ecc libccs-inlib.so`upcxx::detail::persona_tls::burst_user(upcxx::persona&) + 124
    [0]     frame #8: 0x0000000103294cb7 libccs-inlib.so`upcxx::detail::progress_user() + 759
    [0]     frame #9: 0x00000001032a0c3e libccs-inlib.so`upcxx::detail::segmap_cache::verify_segment(unsigned long) + 526
    [0]     frame #10: 0x000000010328c762 libccs-inlib.so`upcxx_test2() + 242
    [0]     frame #11: 0x000000010328cfbb libccs-inlib.so`upcxx_test() + 187
    [0]     frame #12: 0x000000010304af89 test-ccs-inlib`main + 9
    [0]     frame #13: 0x000000010455952e dyld`start + 462
    [0] (lldb) quit
    
  2. Dan Bonachea

    Here is the test code for the crash stack above, from test/ccs/test.cxx:

     11 void upcxx_test2()
     12 {
     13   upcxx::experimental::relocation::enforce_verification(true);
     14   upcxx::experimental::relocation::verify_all();
     15   bool printrank = (upcxx::rank_me() == 0) || (upcxx::rank_me() == upcxx::rank_n() - 1);
     16   if (printrank)
     17     upcxx::experimental::relocation::debug_write_segment_table();
     18   print_test_header();
     19   void* handle = dlopen(XSTR(CCS_DLOPEN_LIB), RTLD_NOW);
     20   if (!handle) {
     21     fprintf(stderr, "%s\n", dlerror());
     22     exit(EXIT_FAILURE);
     23   }
     24   int (*dlopen_function)() = reinterpret_cast<int(*)()>(dlsym(handle, "dlopen_function"));
     25   int (*dlopen_cpp_function)() = reinterpret_cast<int(*)()>(dlsym(handle, "_Z19dlopen_cpp_functionv"));
     26   upcxx::experimental::relocation::verify_segment(dlopen_function);
     27   if (printrank)
     28     upcxx::experimental::relocation::debug_write_ptr(dlopen_function);
     29   auto fut1 = upcxx::rpc(0,test_segment_function);
     30   auto fut2 = upcxx::rpc(0,dynamic_linked_function);
     31   auto fut3 = upcxx::rpc(0,dlopen_function);
     32   auto fut4 = upcxx::rpc(0,dlopen_cpp_function);
    ...
    

    I believe there are two related problems here that together cause the crash:

    1. verify_segment() is specified as "Progress Level: internal", but the persona_tls::burst_user() frame in the crash stack reveals we are incorrectly reaching user-level progress.

      • This is due to several incorrect calls in CCS to the general future::wait(detail::future_wait_upcxx_progress_user), which makes user-level progress. See backend/gasnet/runtime.cpp:1064 for an example of how to correctly wait for an internal future without user-level progress (an internal-use idiom we should probably factor out for use in CCS and elsewhere); the first sketch after this list illustrates it.
    2. There is no explicit synchronization between the collective dlopen() on test line 19 and the RPC injections on lines 31-32, which rely on that load being complete and the segment being registered on target rank 0. verify_segment() includes a reduce-to-all, which guarantees that all processes have run dlopen() and reached the start of verify_segment(), but does NOT guarantee that all processes have finished verify_segment() and registered the new code segment. This is poor programming practice in the test, and it was just plain wrong before we recently removed user-level progress from verify_segment(). It shouldn't actually cause a problem now that verify_segment() is supposed to exclude user-level progress. However, due to problem 1, there is a race where the incorrect user-level progress inside verify_segment() can sometimes attempt to run the incoming RPC from lines 31-32 before the new segment is registered, leading to a crash. The second sketch after this list shows one way the test could synchronize explicitly.
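
    As a concrete illustration of problem 1's fix, here is a minimal sketch of the internal-wait idiom using only the public progress API; the helper name wait_internal is hypothetical, and the actual idiom at backend/gasnet/runtime.cpp:1064 may be structured differently:

      #include <upcxx/upcxx.hpp>

      // Sketch: wait for a future while making only internal-level progress,
      // so user-level callbacks (such as incoming RPCs) never execute here.
      // Contrast with fut.wait(), which defaults to user-level progress.
      template<typename ...T>
      void wait_internal(upcxx::future<T...> const &fut) {
        while (!fut.is_ready())
          upcxx::progress(upcxx::progress_level::internal);
      }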

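    And here is the explicit synchronization problem 2 calls for, shown as an illustrative rearrangement of the test code above rather than a committed fix:

      upcxx::experimental::relocation::verify_segment(dlopen_function);
      // Ensure every rank has finished verify_segment() and registered the
      // newly dlopen()ed segment before any rank injects RPCs that execute
      // functions from that segment.
      upcxx::barrier();
      auto fut3 = upcxx::rpc(0, dlopen_function);
      auto fut4 = upcxx::rpc(0, dlopen_cpp_function);
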
  3. Colin MacLean reporter

    For 2, excluding user-level progress was part of my design for correctness. When I wrote it, I relied on the fact that verify_* requires the master persona and that incoming RPCs are also executed by the master persona, which avoids the race and the need for an exit barrier. I was just unfamiliar with the API for achieving internal-only progress, and so violated that condition of my design. So, as I designed things, only 1) is a bug.
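
    A minimal sketch of that design invariant, assuming the public persona API (upcxx::master_persona(), persona::active_with_caller()) and a hypothetical helper name:

      #include <cassert>
      #include <upcxx/upcxx.hpp>

      // Both verify_* and incoming RPC execution are bound to the master
      // persona, so restricting verify_* to internal-level progress is enough
      // to keep incoming RPCs from interleaving with it on this thread.
      void wait_internal_on_master(upcxx::future<> const &fut) {
        assert(upcxx::master_persona().active_with_caller());
        while (!fut.is_ready())
          upcxx::progress(upcxx::progress_level::internal); // no user RPCs run
      }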
