Suspected race condition on CCS `verify_{all,segment}`

Issue #556 resolved
Dan Bonachea created an issue

Below is a crash recently seen in GitLab on macos_current-xcode on test-ccs-dynamic-dynupcxx-par-opt-udp for develop @ 4c81adf6 . Auditing the code I have reason to believe this represents a race condition in the algorithm.

[0] * thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
[0]   * frame #0: 0x00007ff812734a82 libsystem_kernel.dylib`swtch_pri + 10
[0]     frame #1: 0x00007ff81276ec2f libsystem_pthread.dylib`cthread_yield + 11
[0]     frame #2: 0x0000000106266375 libupcxx.so`gasneti_bt_lldb + 309
[0]     frame #3: 0x00000001062630b7 libupcxx.so`gasneti_print_backtrace + 759
[0]     frame #4: 0x0000000106318af0 libupcxx.so`gasneti_defaultSignalHandler + 112
[0]     frame #5: 0x00007ff812786dfd libsystem_platform.dylib`_sigtramp + 29
[0]     frame #6: 0x00007ff81b546f81
[0]     frame #7: 0x000000010622f00c libupcxx.so`upcxx::detail::persona_tls::burst_user(upcxx::persona&) + 124
[0]     frame #8: 0x0000000106228c4b libupcxx.so`upcxx::detail::progress_user() + 731
[0]     frame #9: 0x0000000106246f6e libupcxx.so`upcxx::detail::segmap_cache::verify_segment(unsigned long, upcxx::entry_barrier) + 510
[0]     frame #10: 0x0000000105f93ddc test-ccs-dynamic-dynupcxx`upcxx_test2() + 252
[0]     frame #11: 0x0000000105f9462b test-ccs-dynamic-dynupcxx`upcxx_test() + 187
[0]     frame #12: 0x0000000105f93cc9 test-ccs-dynamic-dynupcxx`main + 9
[0]     frame #13: 0x000000010eaa751e dyld`start + 462

The relevant app code looks like this: (with some simplification)

  handle = dlopen(...);
  dlopen_function = dlsym(handle, "dlopen_function");
  // ... 
  upcxx::experimental::relocation::verify_segment(dlopen_function);
  // ... 
  auto fut3 = upcxx::rpc(0,dlopen_function);

which I believe is a correct/documented way to use the API.

The relevant paths of the verify_segment(ptr) algorithm look like this:

  1. entry barrier
  2. rebuild segment map
  3. find the hash for the relevant segment containing ptr
  4. reduce_all(hash, hash_equality_check_operator).wait()
  5. IF (all ranks had the hash) AND (not yet marked verified in map)
    1. advance epoch
    2. mark segment as verified in map
    3. assign "short index" and add to verified segment to segment vector at that index

The verify_all() algorithm is more complicated due to handling all the segments and potentially verifying several in a single call, but steps 4-5 have the same basic shape, ie a blocking reduction followed by non-collective work to update the segment map and segment vector.

I believe there's a race here, because nothing prevents rank i from completing steps 4-5 and sending an rpc to rank j (in this case rank 0) before rank j has left step 4 and updated the local segment vector. In particular, I suspect the crash above occurred because rank 0 was still inside the user-level progress of future::wait() in step 4 when an RPC arrived from rank i targeting the newly verified segment that has not yet completed registration at rank 0. Specifically, function_token_ms_idx::detokenize() -> segmap_cache::lookup_at_idx() will attempt to access a segment_vector_ entry that does not yet exist, reading garbage. If this happened to be debug mode it would probably fail the assertion at ccs.hpp:401:

UPCXX_ASSERT(static_cast<int64_t>(idx) < static_cast<int64_t>(segment_vector_.size()));

but since this run was opt mode it just reads garbage and crashes.

The simplest solution is probably to add an exit barrier to both verify functions after step 5 to ensure the segment metadata is globally updated before any caller returns and potentially starts injecting RPC that relies upon the updated segment metadata at the target.

Comments (1)

  1. Log in to comment