Rare segfault in CCS tests in verify_segment/verify_all
There is a rare SEGV in CCS tests on macOS. Examples:
- https://socks.lbl.gov/Pagoda/upcxx/-/jobs/10613 ccs-dynamic-dynupcxx-par-opt-udp
- https://socks.lbl.gov/Pagoda/upcxx/-/jobs/15051 ccs-dynamic-seq-opt-udp
- https://socks.lbl.gov/Pagoda/upcxx/-/jobs/15222 ccs-inlib-seq-opt-udp
The cause has yet to be identified.
Comments (5)
Here is the test code for the crash stack above, from test/ccs/test.cxx:
11 void upcxx_test2()
12 {
13   upcxx::experimental::relocation::enforce_verification(true);
14   upcxx::experimental::relocation::verify_all();
15   bool printrank = (upcxx::rank_me() == 0) || (upcxx::rank_me() == upcxx::rank_n() - 1);
16   if (printrank)
17     upcxx::experimental::relocation::debug_write_segment_table();
18   print_test_header();
19   void* handle = dlopen(XSTR(CCS_DLOPEN_LIB), RTLD_NOW);
20   if (!handle) {
21     fprintf(stderr, "%s\n", dlerror());
22     exit(EXIT_FAILURE);
23   }
24   int (*dlopen_function)() = reinterpret_cast<int(*)()>(dlsym(handle, "dlopen_function"));
25   int (*dlopen_cpp_function)() = reinterpret_cast<int(*)()>(dlsym(handle, "_Z19dlopen_cpp_functionv"));
26   upcxx::experimental::relocation::verify_segment(dlopen_function);
27   if (printrank)
28     upcxx::experimental::relocation::debug_write_ptr(dlopen_function);
29   auto fut1 = upcxx::rpc(0,test_segment_function);
30   auto fut2 = upcxx::rpc(0,dynamic_linked_function);
31   auto fut3 = upcxx::rpc(0,dlopen_function);
32   auto fut4 = upcxx::rpc(0,dlopen_cpp_function);
...
I believe there are two related problems here that together cause the crash:
1. verify_segment() is specified as "Progress Level: internal", but the crash stack inside persona_tls::burst_user() reveals we are incorrectly reaching user-level progress.
   - This is due to several incorrect calls in CCS to the general future::wait(detail::future_wait_upcxx_progress_user). See backend/gasnet/runtime.cpp:1064 for an example of how to correctly wait for an internal future without user-level progress (an internal-use idiom we should probably factor for use in CCS and elsewhere).
2. There is no explicit synchronization between the collective dlopen() on test line 19 and the RPC injection on lines 31-32 that rely on that load being complete and segment registered on target rank 0. verify_segment() includes a reduce-to-all, which guarantees that all processes have run dlopen() and reached the start of verify_segment(), but does NOT guarantee that all processes have finished verify_segment() and registered the new code segment. This is poor programming practice for the test, and was just plain wrong before we recently removed user-level progress from verify_segment(). It shouldn't actually cause a problem now that verify_segment() should exclude user-level progress. However, due to problem 1, there's a race where the incorrect user-level progress in verify_segment() can sometimes attempt to run the incoming RPC from lines 31-32 before the new segment is registered, leading to a crash.
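The interaction between the two problems can be illustrated with a toy, single-process model. This is only a sketch under stated assumptions: it does not use UPC++ at all, and every name in it (ToyRuntime, ProgressLevel, run_scenario, the "dlopen_lib" label) is hypothetical and chosen for illustration. It assumes that registration is the last step of verification and that an RPC from a faster peer is already queued on rank 0 when verification starts.

```cpp
#include <cassert>
#include <deque>
#include <set>
#include <string>

// Toy model only: none of these names are UPC++ APIs.
enum class ProgressLevel { Internal, User };

struct ToyRuntime {
    std::set<std::string> registered_segments;  // segments this rank has registered
    std::deque<std::string> incoming_rpcs;      // each entry names the segment holding the RPC target
    int executed = 0;                           // RPCs that ran against a registered segment
    int crashes = 0;                            // RPCs that ran too early (stands in for the SEGV)

    // Advance the runtime. Only user-level progress is allowed to run incoming RPCs.
    void progress(ProgressLevel level) {
        if (level != ProgressLevel::User) return;
        while (!incoming_rpcs.empty()) {
            std::string seg = incoming_rpcs.front();
            incoming_rpcs.pop_front();
            if (registered_segments.count(seg)) ++executed;
            else ++crashes;
        }
    }

    // Stand-in for verification: wait on an internal operation at the given
    // progress level, then register the segment as the final step.
    void verify_segment(const std::string& seg, ProgressLevel wait_level) {
        progress(wait_level);
        registered_segments.insert(seg);
    }
};

// Rank 0's view: a faster peer has already sent an RPC targeting the new
// segment before rank 0 has finished verifying it.
ToyRuntime run_scenario(ProgressLevel wait_level) {
    ToyRuntime rank0;
    rank0.incoming_rpcs.push_back("dlopen_lib");
    rank0.verify_segment("dlopen_lib", wait_level);
    rank0.progress(ProgressLevel::User);  // later user-level progress, e.g. waiting on RPC futures
    return rank0;
}
```

In this model, run_scenario(ProgressLevel::User) runs the queued RPC before the segment is registered and counts one crash, while run_scenario(ProgressLevel::Internal) defers the RPC until after registration and executes it normally. An explicit barrier after verification in the test would remove the dependence on the progress level altogether.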
- marked as blocker
- changed title to Rare segfault in CCS tests in verify_segment/verify_all
Fixing this is a release blocker.
reporter: For 2, excluding user-level progress was part of my design for correctness. When I wrote it, I relied on the fact that verify_* require the master persona and that incoming RPCs are also executed by the master persona, which avoids a race and the need for an exit barrier. I was just unfamiliar with the API for achieving internal-only progress and violated that condition of my design. So, as I designed things, only 1 is a bug.
- changed status to resolved
Crash stack from example 3 above: