CCS: thread-safety race in segment verification
The internal function segmap_cache::check_verification()
is called from the various different internal tokenize()
functions by arbitrary threads while injecting RPC. The version in the 2022.3.0 release looks like this:
void segmap_cache::check_verification(uintptr_t start, uintptr_t end, uintptr_t uptr)
{
#if !UPCXXI_FORCE_LEGACY_RELOCATIONS
if (segmap_cache::enforce_verification_ && !(flag_map_[start] & (uint16_t) segment_flags::verified))
{
throw segment_verification_error(verification_failed_message(start, end, uptr));
}
#endif
}
When CCS is active and verification is enabled (on by default for debug codemode) it consults the std::unordered_map<...> segmap_cache::flag_map_
without holding the segmap_cache::mutex_
that should be protecting this access. This is a data race, and if this unsynchronized read occurs concurrently with a experimental::relo::verify_segment()
or experimental::relo::verify_all()
operation (both of which clear and rebuild the flag_map_
), the reader could observe garbage and/or empty flags, leading to a spurious verification failure exception.
The simplest fix is to acquire the mutex immediately before consulting the flag_map_
. This will slightly increase the cost of verification, but the verification feature is off by default in production mode (and we should ensure the mutex cycle is skipped when verification is off).
Recommended workaround for users of the current release is to disable verification in debug codemode, ie: upcxx::experimental::relo::enforce_verification(false)
Solving this is a release blocker.
Comments (2)
-
-
reporter - changed status to resolved
Issue 551: Fix issues with segment flags
- Thread safety issue with verification flag checking fixed by deducing
verification state from
function_token_ms
vsfunction_token_ms_idx
differentiation - Mixed static_casts and c-style casts on flags cleaned up and typing centralized to typedefs.
Resolves issue
#551→ <<cset 71821e835c17>>
- Log in to comment
Ah, yes. The design had been with the assumption that
verify_*()
would require quiescence of RPC injection. That’s also how I had designed the old level 2 cache, to avoid locking. We’ve been making some changes to not make such an assumption. Right now, I’d put this as more of an improper documentation issue for the requirements onverify_*()
, since 2022.3.0 was released with the assumption that some of these calls weren’t thread safe with RPC injection. Apart fromverify_*()
, the access is all read-only.