CCS: thread-safety race in segment verification

Issue #551 resolved
Dan Bonachea created an issue

The internal function segmap_cache::check_verification() is called from the various different internal tokenize() functions by arbitrary threads while injecting RPC. The version in the 2022.3.0 release looks like this:

  void segmap_cache::check_verification(uintptr_t start, uintptr_t end, uintptr_t uptr)
  { 
#if !UPCXXI_FORCE_LEGACY_RELOCATIONS
    if (segmap_cache::enforce_verification_ && !(flag_map_[start] & (uint16_t) segment_flags::verified))
    {
      throw segment_verification_error(verification_failed_message(start, end, uptr));
    }
#endif
  } 

When CCS is active and verification is enabled (on by default for debug codemode) it consults the std::unordered_map<...> segmap_cache::flag_map_ without holding the segmap_cache::mutex_ that should be protecting this access. This is a data race, and if this unsynchronized read occurs concurrently with a experimental::relo::verify_segment() or experimental::relo::verify_all() operation (both of which clear and rebuild the flag_map_ ), the reader could observe garbage and/or empty flags, leading to a spurious verification failure exception.

The simplest fix is to acquire the mutex immediately before consulting the flag_map_. This will slightly increase the cost of verification, but the verification feature is off by default in production mode (and we should ensure the mutex cycle is skipped when verification is off).

Recommended workaround for users of the current release is to disable verification in debug codemode, ie: upcxx::experimental::relo::enforce_verification(false)

Solving this is a release blocker.

Comments (2)

  1. Colin MacLean

    Ah, yes. The design had been with the assumption that verify_*() would require quiescence of RPC injection. That’s also how I had designed the old level 2 cache, to avoid locking. We’ve been making some changes to not make such an assumption. Right now, I’d put this as more of an improper documentation issue for the requirements on verify_*(), since 2022.3.0 was released with the assumption that some of these calls weren’t thread safe with RPC injection. Apart from verify_*(), the access is all read-only.

  2. Dan Bonachea reporter

    Issue 551: Fix issues with segment flags

    • Thread safety issue with verification flag checking fixed by deducing verification state from function_token_ms vs function_token_ms_idx differentiation
    • Mixed static_casts and c-style casts on flags cleaned up and typing centralized to typedefs.

    Resolves issue #551

    → <<cset 71821e835c17>>

  3. Log in to comment