Fix undocumented dependency arc involving `experimental::relo::verify_{segment,all}`

Issue #548 resolved
Dan Bonachea created an issue

experimental::relo::verify_{segment,all} hold a lock on an internal mutex, segmap_cache::mutex_, during their collective calls. This makes release of the mutex dependent on the progress of other ranks. Either this must be specified as a precondition for entry (with the entry_barrier arguments documented), or the implementation should be improved to remove the dependency and the need for an entry_barrier.

Currently our docs for experimental::relo::verify_{segment,all} look like this:

void verify_all(entry_barrier eb = entry_barrier::user)

World collective function. All processes have their segment maps compared against rank 0 for verification. Marks segments as verified if they are symmetric among all processes but does not raise an error if there are failures. Segments invalid for UPC++ RPCs can still be used indirectly within functions in valid segments. Called automatically by upcxx::init(). This function should be called after dlopen if UPC++ intends to RPC the functions contained within this library.

There's at least one minor documentation bug here, which is that we don't mention the entry_barrier argument at all. Once you get past the obvious "performs the requested entry barrier upon entry", there's a more subtle question regarding what is required to be true after this barrier completes, which is part of our spec for every other function in the API with an entry_barrier argument (and the motivation for including this as part of the semantics at all).

E.g. the spec for atomic_domain<T>::destroy() (emphasis added):

After the entry barrier (§12.2) specified by lev completes, or upon entry if lev == entry_barrier::none, all operations on this atomic domain must have signaled operation completion.

and team::destroy() (emphasis added):

After the entry barrier (§12.2) specified by lev completes, or upon entry if lev == entry_barrier::none, the operations on this team must not require internal-level or user-level progress from any persona before they can complete.
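
For reference, these destroy calls take the barrier level as an ordinary argument. An illustrative usage sketch (hypothetical code, not taken from the spec):

    #include <upcxx/upcxx.hpp>
    #include <cstdint>

    int main() {
      upcxx::init();
      upcxx::atomic_domain<std::int32_t> ad({upcxx::atomic_op::fetch_add}, upcxx::world());
      // ... atomic operations on ad, all signaled complete before this point ...
      // Collective call; the quoted precondition must hold once the entry
      // barrier specified by the argument completes:
      ad.destroy(upcxx::entry_barrier::user);
      upcxx::finalize();
    }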

So the question becomes: what preconditions (if any) do verify_{segment,all}() require to be true once the entry barrier completes (or upon entry for entry_barrier::none)? For example, if the answer is "global quiescence of RPC", that definitely needs to be spelled out. Alternatively, if we believe an entry barrier is mandatory to avoid potential deadlocks inside the implementation, then we should consider making it a permanent part of the implementation and removing the caller's ability to weaken it.

The implementation of verify_all in 2022.3.0 and current develop holds the segmap_cache::mutex_ while performing collectives (two broadcasts and a reduction), which seems like a dangerous practice, because it creates an undocumented dependency arc from the master persona performing world collectives to any other RPC activity, potentially blocking concurrent injection of RPC from other threads. I'm relatively certain this could be used to construct a deadlock if we allow entry_barrier::none AND don't require RPC quiescence; it could be avoided by changing either of those properties, OR by improving the implementation to release the mutex while performing the collectives.
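
Schematically, the problematic structure looks something like the following sketch (not the actual source; every name below is a hypothetical placeholder):

    #include <mutex>

    std::mutex segmap_mutex; // stands in for segmap_cache::mutex_

    enum class entry_barrier { none, internal, user };

    void do_entry_barrier(entry_barrier) { /* barrier per the argument */ }
    void broadcast_and_reduce_segment_maps() { /* two broadcasts + a reduction */ }
    void mark_verified_segments() { /* mutate the segmap data structure */ }

    void verify_all(entry_barrier eb = entry_barrier::user) {
      do_entry_barrier(eb);
      // The lock is acquired here and held across world collectives, so
      // releasing it depends on every other rank reaching this call:
      std::lock_guard<std::mutex> g(segmap_mutex);
      broadcast_and_reduce_segment_maps(); // blocks until all ranks participate
      mark_verified_segments();            // the only part that needs the lock
    }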

Comments (8)

  1. Colin MacLean

    The main reason for the barrier is to make legacy mode have collective side effects similar to performing the reductions.

    The code asserts that verify_* is called from the master persona. That requirement is also missing from the documentation.

    I’m not seeing the potential deadlock. If the master persona holds the lock, RPC injections from other threads into non-cached segments will wait until the master persona is done performing the verify_*. If another thread holds the lock in order to retrieve information for creating a cache item, then the master persona waits until that has happened.

  2. Dan Bonachea reporter

    The potential deadlock looks something like this:

    Rank 0:

    Master thread:

        verify_all(entry_barrier::none); // immediately grabs the mutex
    

    Second thread:

        sleep(10); // wait until master entered verify_all
        rpc_ff(1, []() { flag = 1; }); // assume a cache miss on this segment lookup
           // RPC injection gets stuck behind the local mutex (undocumented control dependency)
    

    Rank 1:

    Master thread:

        while(!flag) upcxx::progress(); // wait for RPC arrival
        verify_all(entry_barrier::none); // never reaches here == deadlock
    

    Here the client has arranged to call verify_all collectively from the master persona on every rank. However, the undocumented control dependence between the master's verify_all invocation and concurrent RPC injection on that process, combined with the point-to-point RPC control dependence, creates a deadlock cycle.

    Put another way, the current behavior of verify_all(entry_barrier::none) is to grab the mutex immediately for a period of time that is bounded only by the arrival of every other process at the operation, and there's nothing guaranteeing that will happen in bounded time while outgoing RPC from non-master threads is potentially stalled on one or more ranks.

    Note this is only possible with entry_barrier::none, because with entry_barrier::internal or entry_barrier::user we are guaranteed that none of the master personas acquire the mutex until all of them have arrived at the collective, which ensures the collectives invoked by the implementation can complete and release the mutex in bounded time.
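
    For concreteness, here is the scenario above assembled into one untested sketch. `flag` is a per-rank global introduced purely for illustration, and this assumes at least two ranks and the thread-safe (par) build of UPC++:

        #include <upcxx/upcxx.hpp>
        #include <atomic>
        #include <chrono>
        #include <thread>

        std::atomic<int> flag{0}; // illustrative; set on rank 1 by the incoming RPC

        int main() {
          upcxx::init();
          if (upcxx::rank_me() == 0) {
            std::thread second([]() {
              std::this_thread::sleep_for(std::chrono::seconds(10)); // let master enter verify_all
              // Assume a cache miss on this segment lookup: injection blocks
              // on the internal mutex held by the master persona.
              upcxx::rpc_ff(1, []() { flag = 1; });
            });
            // Immediately grabs the mutex, then waits inside collectives for rank 1:
            upcxx::experimental::relo::verify_all(upcxx::entry_barrier::none);
            second.join();
          } else if (upcxx::rank_me() == 1) {
            while (!flag) upcxx::progress(); // waits for an RPC that is never injected
            upcxx::experimental::relo::verify_all(upcxx::entry_barrier::none); // never reached
          } else {
            upcxx::experimental::relo::verify_all(upcxx::entry_barrier::none);
          }
          upcxx::finalize();
        }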

    I think there are several potential "fixes" here. If possible I'd like to avoid requiring global RPC quiescence during verification operations, because that's an annoying requirement that is difficult to deploy/check/enforce/debug and impacts composability and programmability. So I think that leaves us with:

    1. Change the implementation of verify_* to always perform a user-level barrier on entry, and remove the entry_barrier argument (easiest to reason about correctness), OR
    2. Upgrade the implementation to not hold the mutex during the collective calls, but only while manipulating the segmap data structure (allows removal of the entry barrier and maximizes concurrency during verification, at the cost of increased logic complexity).

    Given that we don't expect performance of these verification calls to be critical, I'd lean towards the first option, which is easiest to get correct and race-free.

  3. Colin MacLean

    PR 462 unlocks the mutex when performing collectives. The holder of the lock no longer has a dependency on the progress of other ranks to release it. This removes the possibility of deadlock in situations like the example above.

    Entry barriers were removed from these functions, as they are no longer needed. The user no longer needs to stop performing actions that can miss the CCS cache before entry.
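
    Roughly, the structure after the change looks like this (an illustrative sketch, not the actual diff; the mutex name stands in for segmap_cache::mutex_):

        #include <mutex>

        std::mutex segmap_mutex; // stands in for segmap_cache::mutex_

        void verify_all_fixed() {
          std::unique_lock<std::mutex> lk(segmap_mutex);
          // ... snapshot local segment-map state under the lock ...
          lk.unlock(); // release before any collective communication
          // ... the two broadcasts and the reduction run here without the
          //     lock, so other threads can still fill the cache and inject RPCs ...
          lk.lock();   // re-acquire only to apply the results
          // ... mark segments verified in the segmap data structure ...
        }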

  4. Dan Bonachea reporter

    Fixed Issue #548

    Release mutex lock during collective operations within verify_{segment,all} to prevent potential deadlocks.

    Removed entry barriers in verify_{segment,all}

    → <<cset 9f636531c193>>
