- changed status to resolved
Suspected race condition on CCS `verify_{all,segment}`
Below is a crash recently seen in GitLab on macos_current-xcode on test-ccs-dynamic-dynupcxx-par-opt-udp for develop @ 4c81adf6 . Auditing the code I have reason to believe this represents a race condition in the algorithm.
[0] * thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
[0] * frame #0: 0x00007ff812734a82 libsystem_kernel.dylib`swtch_pri + 10
[0] frame #1: 0x00007ff81276ec2f libsystem_pthread.dylib`cthread_yield + 11
[0] frame #2: 0x0000000106266375 libupcxx.so`gasneti_bt_lldb + 309
[0] frame #3: 0x00000001062630b7 libupcxx.so`gasneti_print_backtrace + 759
[0] frame #4: 0x0000000106318af0 libupcxx.so`gasneti_defaultSignalHandler + 112
[0] frame #5: 0x00007ff812786dfd libsystem_platform.dylib`_sigtramp + 29
[0] frame #6: 0x00007ff81b546f81
[0] frame #7: 0x000000010622f00c libupcxx.so`upcxx::detail::persona_tls::burst_user(upcxx::persona&) + 124
[0] frame #8: 0x0000000106228c4b libupcxx.so`upcxx::detail::progress_user() + 731
[0] frame #9: 0x0000000106246f6e libupcxx.so`upcxx::detail::segmap_cache::verify_segment(unsigned long, upcxx::entry_barrier) + 510
[0] frame #10: 0x0000000105f93ddc test-ccs-dynamic-dynupcxx`upcxx_test2() + 252
[0] frame #11: 0x0000000105f9462b test-ccs-dynamic-dynupcxx`upcxx_test() + 187
[0] frame #12: 0x0000000105f93cc9 test-ccs-dynamic-dynupcxx`main + 9
[0] frame #13: 0x000000010eaa751e dyld`start + 462
The relevant app code looks like this: (with some simplification)
handle = dlopen(...);
dlopen_function = dlsym(handle, "dlopen_function");
// ...
upcxx::experimental::relocation::verify_segment(dlopen_function);
// ...
auto fut3 = upcxx::rpc(0,dlopen_function);
which I believe is a correct/documented way to use the API.
The relevant paths of the verify_segment(ptr)
algorithm look like this:
- entry barrier
- rebuild segment map
- find the hash for the relevant segment containing
ptr
reduce_all(hash, hash_equality_check_operator).wait()
- IF (all ranks had the hash) AND (not yet marked verified in map)
- advance epoch
- mark segment as verified in map
- assign "short index" and add to verified segment to segment vector at that index
The verify_all()
algorithm is more complicated due to handling all the segments and potentially verifying several in a single call, but steps 4-5 have the same basic shape, ie a blocking reduction followed by non-collective work to update the segment map and segment vector.
I believe there's a race here, because nothing prevents rank i
from completing steps 4-5 and sending an rpc to rank j
(in this case rank 0) before rank j
has left step 4 and updated the local segment vector. In particular, I suspect the crash above occurred because rank 0 was still inside the user-level progress of future::wait()
in step 4 when an RPC arrived from rank i
targeting the newly verified segment that has not yet completed registration at rank 0. Specifically, function_token_ms_idx::detokenize()
-> segmap_cache::lookup_at_idx()
will attempt to access a segment_vector_
entry that does not yet exist, reading garbage. If this happened to be debug mode it would probably fail the assertion at ccs.hpp:401:
UPCXX_ASSERT(static_cast<int64_t>(idx) < static_cast<int64_t>(segment_vector_.size()));
but since this run was opt mode it just reads garbage and crashes.
The simplest solution is probably to add an exit barrier to both verify functions after step 5 to ensure the segment metadata is globally updated before any caller returns and potentially starts injecting RPC that relies upon the updated segment metadata at the target.
Comments (1)
-
reporter - Log in to comment
Fixed in PR 442, merged @ 3307cc9