Fix undocumented dependency arc involving `experimental::relo::verify_{segment,all}`

Issue #548 resolved
Dan Bonachea created an issue

experimental::relo::verify_{segment,all} hold a lock on an internal mutex, segmap_cache::mutex_, during their collective calls. This makes release of the mutex dependent on the progress of other ranks. Either this must be specified as a precondition for entry (with the entry_barrier arguments documented), or the implementation should be improved to remove the dependency and the need for an entry_barrier.

Currently our docs for experimental::relo::verify_{segment,all} look like this:

void verify_all(entry_barrier eb = entry_barrier::user)

World collective function. All processes have their segment maps compared against rank 0 for verification. Marks segments as verified if they are symmetric among all processes but does not raise an error if there are failures. Segments invalid for UPC++ RPCs can still be used indirectly within functions in valid segments. Called automatically by upcxx::init(). This function should be called after dlopen if UPC++ intends to RPC the functions contained within this library.

There's at least one minor documentation bug here, which is that we don't mention the entry_barrier argument at all. Once you get past the obvious "performs the requested entry barrier upon entry", there's a more subtle question regarding what is required to be true after this barrier completes, which is part of our spec for every other function in the API with an entry_barrier argument (and the motivation for including this as part of the semantics at all).

E.g. the spec for atomic_domain<T>::destroy() (emphasis added):

After the entry barrier (§12.2) specified by lev completes, or upon entry if lev == entry_barrier::none, all operations on this atomic domain must have signaled operation completion.

and team::destroy() (emphasis added):

After the entry barrier (§12.2) specified by lev completes, or upon entry if lev == entry_barrier::none, the operations on this team must not require internal-level or user-level progress from any persona before they can complete.
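
For reference, these destroy calls take the barrier level as an ordinary argument. An illustrative usage sketch (hypothetical code, not taken from the spec):

    #include <upcxx/upcxx.hpp>
    #include <cstdint>

    int main() {
      upcxx::init();
      upcxx::atomic_domain<std::int32_t> ad({upcxx::atomic_op::fetch_add}, upcxx::world());
      // ... atomic operations on ad, all signaled complete before this point ...
      // Collective call; the quoted precondition must hold once the entry
      // barrier specified by the argument completes:
      ad.destroy(upcxx::entry_barrier::user);
      upcxx::finalize();
    }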

So the question becomes: what preconditions (if any) do verify_{segment,all}() require to be true once the entry barrier completes (or upon entry for entry_barrier::none)? For example, if the answer is "global quiescence of RPC", that definitely needs to be spelled out. Alternatively, if we believe an entry barrier is mandatory to avoid potential deadlocks inside the implementation, then we should consider making it a permanent part of the implementation and removing the caller's ability to weaken it.

The implementation of verify_all in 2022.3.0 and current develop holds the segmap_cache::mutex_ while performing collectives (two broadcasts and a reduction), which seems like a dangerous practice, because it creates an undocumented dependency arc from the master persona performing world collectives to any other RPC activity, potentially blocking concurrent injection of RPC from other threads. I'm relatively certain this could be used to construct a deadlock if we allow entry_barrier::none AND don't require RPC quiescence; it could be avoided by changing either of those properties, OR by improving the implementation to release the mutex while performing the collectives.
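
Schematically, the problematic structure looks something like the following sketch (not the actual source; every name below is a hypothetical placeholder):

    #include <mutex>

    std::mutex segmap_mutex; // stands in for segmap_cache::mutex_

    enum class entry_barrier { none, internal, user };

    void do_entry_barrier(entry_barrier) { /* barrier per the argument */ }
    void broadcast_and_reduce_segment_maps() { /* two broadcasts + a reduction */ }
    void mark_verified_segments() { /* mutate the segmap data structure */ }

    void verify_all(entry_barrier eb = entry_barrier::user) {
      do_entry_barrier(eb);
      // The lock is acquired here and held across world collectives, so
      // releasing it depends on every other rank reaching this call:
      std::lock_guard<std::mutex> g(segmap_mutex);
      broadcast_and_reduce_segment_maps(); // blocks until all ranks participate
      mark_verified_segments();            // the only part that needs the lock
    }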

Comments (8)

  1. Colin MacLean

    The main reason for the barrier is to make legacy mode have collective side effects similar to performing the reductions.

    The code asserts that verify_* is called from the master persona. That requirement is also missing from the documentation.

    I’m not seeing the potential deadlock. If the master persona holds the lock, RPC injections from other threads into non-cached segments will wait until the master persona is done performing the verify_*. If another thread holds the lock in order to retrieve information for creating a cache item, then the master persona waits until that has happened.

  2. Dan Bonachea reporter

    The potential deadlock looks something like this:

    Rank 0:

    Master thread:

        verify_all(entry_barrier::none); // immediately grabs the mutex
    

    Second thread:

        sleep(10); // wait until master entered verify_all
        rpc_ff(1, []() { flag = 1; }); // assume a cache miss on this segment lookup
           // RPC injection gets stuck behind the local mutex (undocumented control dependency)
    

    Rank 1:

    Master thread:

        while(!flag) upcxx::progress(); // wait for RPC arrival
        verify_all(entry_barrier::none); // never reaches here == deadlock
    

    Here the client has arranged to call verify_all collectively from the master persona on every rank. However, the undocumented control dependence between the master's verify_all invocation and concurrent RPC injection on that process, combined with the point-to-point RPC control dependence, creates a deadlock cycle.

    Put another way, the current behavior of verify_all(entry_barrier::none) is to grab the mutex immediately for a period of time that is bounded only by the arrival of every other process at the operation, and there's nothing guaranteeing that will happen in bounded time while outgoing RPC from non-master threads is potentially stalled on one or more ranks.

    Note this is only possible with entry_barrier::none, because with entry_barrier::internal or entry_barrier::user we are guaranteed that none of the master personas acquire the mutex until all of them have arrived at the collective, which ensures the collectives invoked by the implementation can complete and release the mutex in bounded time.
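
    For concreteness, here is the scenario above assembled into one untested sketch. `flag` is a per-rank global introduced purely for illustration, and this assumes at least two ranks and the thread-safe (par) build of UPC++:

        #include <upcxx/upcxx.hpp>
        #include <atomic>
        #include <chrono>
        #include <thread>

        std::atomic<int> flag{0}; // illustrative; set on rank 1 by the incoming RPC

        int main() {
          upcxx::init();
          if (upcxx::rank_me() == 0) {
            std::thread second([]() {
              std::this_thread::sleep_for(std::chrono::seconds(10)); // let master enter verify_all
              // Assume a cache miss on this segment lookup: injection blocks
              // on the internal mutex held by the master persona.
              upcxx::rpc_ff(1, []() { flag = 1; });
            });
            // Immediately grabs the mutex, then waits inside collectives for rank 1:
            upcxx::experimental::relo::verify_all(upcxx::entry_barrier::none);
            second.join();
          } else if (upcxx::rank_me() == 1) {
            while (!flag) upcxx::progress(); // waits for an RPC that is never injected
            upcxx::experimental::relo::verify_all(upcxx::entry_barrier::none); // never reached
          } else {
            upcxx::experimental::relo::verify_all(upcxx::entry_barrier::none);
          }
          upcxx::finalize();
        }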

    I think there are several potential "fixes" here. If possible I'd like to avoid requiring global RPC quiescence during verification operations, because that's an annoying requirement that is difficult to deploy/check/enforce/debug and impacts composability and programmability. So I think that leaves us with:

    1. Change the implementation of verify_* to always perform a user-level barrier on entry, and remove the entry_barrier argument (easiest to reason about correctness), OR
    2. Upgrade the implementation to not hold the mutex during the collective calls, but only while manipulating the segmap data structure (allows removal of the entry barrier and maximizes concurrency during verification, at the cost of increased logic complexity).

    Given that we don't expect performance of these verification calls to be critical, I'd lean towards the first option, which is easiest to get correct and race-free.

  3. Colin MacLean

    PR 462 unlocks the mutex when performing collectives. The holder of the lock no longer has a dependency on the progress of other ranks to release it. This removes the possibility of deadlock in situations like the example above.

    Entry barriers were removed from these functions, as they are no longer needed. The user no longer needs to stop performing actions that can miss the CCS cache before entry.
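
    Roughly, the structure after the change looks like this (an illustrative sketch, not the actual diff; the mutex name stands in for segmap_cache::mutex_):

        #include <mutex>

        std::mutex segmap_mutex; // stands in for segmap_cache::mutex_

        void verify_all_fixed() {
          std::unique_lock<std::mutex> lk(segmap_mutex);
          // ... snapshot local segment-map state under the lock ...
          lk.unlock(); // release before any collective communication
          // ... the two broadcasts and the reduction run here without the
          //     lock, so other threads can still fill the cache and inject RPCs ...
          lk.lock();   // re-acquire only to apply the results
          // ... mark segments verified in the segmap data structure ...
        }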

  4. Dan Bonachea reporter

    Fixed Issue #548

    Release mutex lock during collective operations within verify_{segment,all} to prevent potential deadlocks.

    Removed entry barriers in verify_{segment,all}

    → <<cset 9f636531c193>>
