CCS: thread-safety race in segment verification

The internal function segmap_cache::check_verification() is called from the various different internal tokenize() functions by arbitrary threads while injecting RPC. The version in the 2022.3.0 release looks like this:

  void segmap_cache::check_verification(uintptr_t start, uintptr_t end, uintptr_t uptr)
  { 
#if !UPCXXI_FORCE_LEGACY_RELOCATIONS
    if (segmap_cache::enforce_verification_ && !(flag_map_[start] & (uint16_t) segment_flags::verified))
    {
      throw segment_verification_error(verification_failed_message(start, end, uptr));
    }
#endif
  }

When CCS is active and verification is enabled (on by default for debug codemode) it consults the std::unordered_map<...> segmap_cache::flag_map_ without holding the segmap_cache::mutex_ that should be protecting this access. This is a data race, and if this unsynchronized read occurs concurrently with a experimental::relo::verify_segment() or experimental::relo::verify_all() operation (both of which clear and rebuild the flag_map_ ), the reader could observe garbage and/or empty flags, leading to a spurious verification failure exception.

The simplest fix is to acquire the mutex immediately before consulting the flag_map_. This will slightly increase the cost of verification, but the verification feature is off by default in production mode (and we should ensure the mutex cycle is skipped when verification is off).

Recommended workaround for users of the current release is to disable verification in debug codemode, ie: upcxx::experimental::relo::enforce_verification(false)

Solving this is a release blocker.

Comments (2)