Deprecate/Prohibit collective calls inside progress callbacks

Issue #169 resolved
Dan Bonachea created an issue

There is mounting evidence (eg impl issue 412, impl issue 416) that invoking functions requiring a collective call inside the restricted context (ie in callbacks invoked by user-level progress) is a Very Bad Idea.

Motivation

We don't currently prohibit this practice, but it turns out to be semantically problematic for several reasons:

  1. It's very hard to maintain the collective ordering property in non-trivial codes. All operations specified as "collective" must be initiated according to the collective ordering property, but are NOT guaranteed to generate and signal completions in a collective order. There are two components to this:
    1. Overlapped asynchronous collectives may truly complete in different orders on different processes (eg due to lack of ordering in the underlying network), leading to completion signalling in different orders, and therefore chained callbacks running in non-collective orders.
    2. When two or more completions have been signaled in one process and callbacks are scheduled on each of the now-readied futures, there is no guarantee regarding the relative order in which those callbacks are executed during the next user-level progress call.
      Taken together, the consequence of these is that it's quite difficult to guarantee correct collective ordering in any code that allows more than one asynchronous callback "in-flight" at a time (on any process) that will call a collective. Basically this only works if there is a direct chain of dependencies enforcing a total order between all such callbacks (and any collectives issued synchronously outside progress). A sketch illustrating this hazard appears after this list.
  2. Injection of collective calls into the progress engine breaks modularity.
    1. Once any piece of code has injected an asynchronous callback into the progress engine that will invoke a collective, it has imposed a GLOBAL restriction on the subsequent actions of the master persona. So for example if the client of a library calls in to perform some asynchronous operation and the library implementation injects such a callback into the progress stream and returns control to the caller, then it's erroneous for the caller to make any further collective calls itself until it can prove that all such callbacks implementing the library have "drained". Otherwise it runs a risk of breaking the process-wide collective ordering property.
    2. The problem here is the collective ordering property is (necessarily) a process-wide constraint, and asynchronous injection of collectives causes enforcement of that property to "bleed" across abstraction boundaries.
  3. Debugging violations of the collective ordering property is very hard.
    1. We do not provide any assertions to detect violation of collective ordering preconditions, and deploying such checking would require breaking the UPC++ semantics of some calls, would be very expensive (requiring global communication at every such point), and would strongly perturb the behavior of the application.
    2. The failure modes for breaking collective ordering preconditions commonly include deadlock and silent data corruption (eg mis-associating dist_objects across ranks). Even expert UPC++ programmers find it very difficult to track down such problems to the root cause.
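
To make problem 1 above concrete, here is a hedged sketch (the scenario is illustrative, not taken from the proposal) of two overlapped collectives whose completion callbacks each initiate a further collective. Initiation of f1 and f2 is collectively ordered, but their completions are not, so the broadcasts issued inside the callbacks may be initiated in different orders on different processes:

```cpp
#include <upcxx/upcxx.hpp>

int main() {
  upcxx::init();

  // Two overlapped collectives, initiated in the same order on every rank
  // (satisfying the collective ordering property at initiation time).
  // Roots 0 and 1 assume a job with at least 2 ranks.
  auto f1 = upcxx::broadcast(upcxx::rank_me(), 0);
  auto f2 = upcxx::broadcast(upcxx::rank_me(), 1);

  // HAZARDOUS today and prohibited under this proposal: each callback runs in
  // the restricted context and initiates another collective. f1 and f2 may
  // complete in different orders on different ranks, so the inner broadcast
  // calls can be issued in non-collective order across the job.
  auto g1 = f1.then([](int v) { return upcxx::broadcast(v, 0); });
  auto g2 = f2.then([](int v) { return upcxx::broadcast(v, 1); });

  upcxx::when_all(g1, g2).wait();
  upcxx::finalize();
}
```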

This issue exists to discuss the possibility of simply prohibiting all collective calls from within the restricted context, and enforcing this with assertions (at least in DEBUG mode).

Proposal

Change Section 10.2 "Restricted Context":

User code running in the restricted context must assume that for the duration of the context all other attempts at making user-level progress, from any thread on any process, may result in a no-op every time.

Append:

Furthermore, any call that is specified as "collective" shall not be invoked by a thread while running in the restricted context.

The implementation would be adjusted to generate a fatal error during attempts to issue a collective call within the restricted context. This would happen at least for debug mode, but probably also for production mode to prevent effective creation of a "dialect" where violations of this rule are silently accepted.

Alternative deployment: we could decree this usage as obsolescent in the next release, modify the implementation to issue a non-fatal warning about deprecated functionality (with a request for feedback), and delay the hard prohibition and fatal error enforcement to a subsequent release.

Implications

All of the following would be prohibited inside callbacks running inside progress (a sketch of one now-erroneous pattern follows the list):

  1. Initiation of barriers, reductions, broadcasts, and any collective comms added in the future
  2. Construction of dist_object
  3. team::split() and team::destroy() (and likely team constructors added in the future)
  4. Construction/destruction of atomic_domain
  5. Construction/destruction of cuda_device (and likely memory kind devices added in the future)
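
As a hedged illustration (not part of the proposal text), a pattern like the following, which is permitted today whenever the collective ordering precondition happens to hold, would become erroneous because the dist_object constructor is a collective executed inside a completion callback:

```cpp
#include <upcxx/upcxx.hpp>

int main() {
  upcxx::init();

  // ERRONEOUS under the proposal: the callback runs inside progress
  // (the restricted context) and constructs a dist_object, a collective.
  auto done = upcxx::barrier_async().then([]() {
    upcxx::dist_object<int> d(upcxx::rank_me());
    // ... bootstrap further communication using d ...
    // (illustration only; d is also destroyed as the callback returns)
  });
  done.wait();

  upcxx::finalize();
}
```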

These new restrictions would represent a breaking change that prohibits a class of programs that are currently permitted and correct, although arguably not scalable/maintainable.

The extent to which this would impact deployed and future real codes is a matter for discussion. I'm very curious to hear about any real use cases that rely upon this functionality, and whether that usage is fundamental or could be re-factored/improved to obey the proposed restrictions.

Comments (14)

  1. Rob Egan

    Being the chief culprit here, I’m in favor of this, and have been busy refactoring my code today. So far there has not been any performance hit and the hangs that were revealed in my stress testing have gone away.

    Most of the changes made the code cleaner and easier to read without all the future dependency chaining, indentation and lambda-capture cruft. However, in my refactoring today I did come across one case where future-chaining the collectives is cleaner and simpler… in prefix reduction of a large block, it is convenient to call the prefix reduction recursively for each chunk instead of expanding all the promise dependencies which would be required for the entire interleaved workflow. Additionally, calling wait and barrier at each iteration would add latency and communication to an already complicated orchestration of rpcs that benefits from pipelining/interleaving the data set.

    https://bitbucket.org/berkeleylab/upcxx-utils/src/185bcfe6ed1aec93ea4280eb2a32136e7e436d43/include/upcxx_utils/reduce_prefix.hpp#lines-778

    So, if there is something (a persona maybe?) that implements a serialized channel that tags the ids when creating dist_objects or any other collective calls, then it might be safe to implement a strictly ordered set of collective calls within that context, although that will still be very hard to get perfectly correct.

  2. Dan Bonachea reporter

    in prefix reduction of a large block, it is convenient to call the prefix reduction recursively for each chunk instead of expanding all the promise dependencies which would be required for the entire interleaved workflow. Additionally, calling wait and barrier at each iteration would add latency and communication to an already complicated orchestration of rpcs that benefits from pipelining/interleaving the data set.

    If I understand your example correctly, you are observing that it's problematic to recursively invoke your own collective operation (prefix_reduce) from inside restricted context, and I agree, for all the same reasons described above. For a totally ordered linear chain of collective-invoking callbacks the ordering property is feasible to maintain, but this still represents a modularity problem if control is returned to your library client while such a dependency-chain is in-flight... ie the asynchronous invocation of collective operations inside your implementation (dist_object construction) would interfere with concurrent collectives invoked by the client (including additional overlapped calls to your library) and lead to breakage.

    That being said, IF dist_object construction is the only UPC++ collective in play for this code, then another solution suggests itself - simply construct the sole necessary dist_object in the outermost enclosing call from the client (outside progress), and pass it down for reuse by all the pipelined steps in the linear chain (possibly via a hidden implicit argument or using internal wrappers). Note you might need to change the dist_object datum to be an array with one entry for each step in the pipeline, and pass the step index explicitly in RPCs, rather than relying on dist_id to match related steps. The per-step metadata stored in the dist_object seems tiny compared to the data payloads, so this seems like a reasonable solution. Thoughts?
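
    A hedged sketch of that suggestion (run_pipeline, step_state and num_steps are illustrative names, not upcxx-utils code): the sole dist_object is constructed synchronously outside progress with one slot per pipeline step, and RPCs name the step by explicit index rather than relying on per-step dist_object construction inside callbacks.

    ```cpp
    #include <upcxx/upcxx.hpp>
    #include <cstddef>
    #include <vector>

    struct step_state { double partial = 0.0; };   // illustrative per-step metadata

    void run_pipeline(std::size_t num_steps) {
      // The sole collective call: one dist_object constructed synchronously,
      // outside progress, sized with one entry per pipeline step.
      upcxx::dist_object<std::vector<step_state>> steps{std::vector<step_state>(num_steps)};

      for (std::size_t s = 0; s < num_steps; ++s) {
        int peer = (upcxx::rank_me() + 1) % upcxx::rank_n();
        // Each RPC names its step by explicit index; nothing collective runs
        // inside the callback.
        upcxx::rpc(peer,
            [](upcxx::dist_object<std::vector<step_state>> &st, std::size_t step, double v) {
              (*st)[step].partial += v;
            },
            steps, s, 1.0).wait();
      }
      upcxx::barrier();  // quiesce before steps is destroyed
    }
    ```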

    if there is something (a persona maybe?) that implements a serialized channel that tags the ids when creating dist_objects or any other collective calls

    There is no such serialized channel as you seem to be suggesting. dist_id's are assigned FCFS during dist_object construction using a counter in the parent team data structure (this is just the current implementation and does not represent a guarantee of future behavior). It's intentionally very lightweight and does not track any history information whatsoever. Other collective operations do not carry any kind of ordering "tag" that would provide isolation; they are implicitly ordered by the collective initiation rules, and many places throughout our stack strongly rely on correct injection ordering (even across teams) to ensure they drain without resource deadlock (eg on bounded-size scratch space buffers inside the network).

  3. Rob Egan

    Thanks Dan. You are right that, in my pipelining reduction example, enumerating the necessary dist_object construction up front and passing it to the collective calls should work fine, and my plan is to implement that; I was just pointing out the one example I had where it would be more convenient to recursively call a collective in a dependency chain.

    I’ve never looked into the dist_id creations, but your description of it does seem to me like you do already have a channel/tag/color built into upcxx by way of it deriving its FCFS counter from the parent team.

    So, playing devil's advocate to keep allowing collectives inside progress callbacks, and with the caveat that splitting a team is an expensive and memory-hungry operation, it does seem like creating a new team could be a safe way to call collectives with some mechanism other than blocking to maintain the strict ordering that is required for FCFS dist_id construction. So if there were a derivative "class team_view : public team" and a lightweight team method "upcxx::team_view upcxx::team::clone_view() const" that returned a new team_view whose lifetime and memory are essentially the same as the parent team but which has its own unique FCFS counter, that could be a possible mechanism to safely support this.

  4. Dan Bonachea reporter

    creating a new team could be a safe way to call collectives with some mechanism other than blocking to maintain the strict ordering that is required for FCFS dist_id construction.

    Creating a "fresh" team would create an isolated space of dist_id's under the current implementation only; this is not a guaranteed property. However this only works because dist_object construction happens to be the only collective operation in the library that is guaranteed to never communicate.

    This approach would not help at all with all the other collective operations, which usually do communicate and could be subject to control deadlocks or resource deadlocks on the fixed-size buffer queues in the network if their initiation calls were not properly ordered. Note that teams do NOT provide isolation of collective communication orderings - the job still needs to maintain a partial order over all collective calls that is consistent with the total order of all collective calls in each process (including across teams).

    It's possible we could provide a collective abstraction to create a special case ONLY for dist_object construction (ie provide a collectively created "Foo" to generalize the current role of the team in that call). This would allow us to relax the dist_object construction ordering between distinct Foo's, and between dist_object construction and other collective operations. However even if we did this, this approach only addresses the "modularity" problem (my problem 2 above). Problems 1 and 3 of allowing such calls from progress callbacks would still remain - the client would remain fully responsible for enforcing a jobwide total order of calls to dist_object construction on each given Foo (which remains challenging to construct during progress in the presence of non-collective completions), and the penalty for getting it wrong would remain difficult to debug (mismatched dist_object names leading to silent data corruption). This still seems like a minefield to me, although we could discuss deploying it as an "advanced" feature if there is a good motivating use case.

    To my mind, dist_object was really designed to be the "root" of a collective abstraction - a synchronously created entity that then enables bootstrapping of further (possibly asynchronous) inter-process interaction that would then operate in terms of that root. I would generally encourage designs that synchronously create one dist_object per high-level logical collective entity, and (when needed) use that dist_object to store an unordered registry of other more dynamic operations/interactions that might be created and destroyed asynchronously and non-collectively. IOW the key property of a dist_object is really just the collective dist_id "name" that uses the collective total order of the job to collectively establish an isolated namespace, and one can easily create a second level of naming inside that space indexed in terms of the operation itself (eg the step count in a pipeline, a time step index, a hash table key, etc).
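
    As a hedged sketch of that second level of naming (registry_t and the key format below are illustrative, not an existing API): a single synchronously constructed dist_object holds an unordered registry keyed by an application-level name, so individual entries can be created and retired asynchronously and non-collectively.

    ```cpp
    #include <upcxx/upcxx.hpp>
    #include <string>
    #include <unordered_map>

    using registry_t = std::unordered_map<std::string, double>;  // illustrative payload

    void example() {
      // The only collective naming step, done synchronously outside progress.
      upcxx::dist_object<registry_t> reg{registry_t{}};

      // Entries are then addressed purely by an application-level key (a
      // pipeline step, a time step, a hash-table key, ...); creating or
      // updating an entry is non-collective and needs no further dist_ids.
      int peer = (upcxx::rank_me() + 1) % upcxx::rank_n();
      upcxx::rpc(peer,
          [](upcxx::dist_object<registry_t> &r, const std::string &key, double v) {
            (*r)[key] += v;
          },
          reg, std::string("timestep:42"), 3.14).wait();

      upcxx::barrier();  // all ranks done before reg is destroyed
    }
    ```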

    Finally, it's worth noting that the automated RPC interlock is (I believe) the only part of dist_object that leverages runtime internals, and (with some ugliness) is really just a complicated form of syntactic sugar intended to capture common usage patterns. For a client with special needs, nothing prevents one from developing a custom global naming scheme analogous to dist_object using the same types of algorithms.

  5. Rob Egan

    Okay those are all excellent points. And it does sound like this would be really hard to design and get working.

    What I am really after is a delayed collective so they can safely be done in series, i.e. a reduction then a broadcast (yes, I know about upcxx::reduce_all), or some other collective then a barrier, or maybe a scatter then a gather. So I was thinking that all I really want is to promise a collective, where the collective is constructed synchronously but not started. Then, on some condition, it is started and completes asynchronously.

    As a prototype and to test it I wrote a PromiseBarrier class:

    https://bitbucket.org/berkeleylab/upcxx-utils/src/master/include/upcxx_utils/promise_collectives.hpp

    https://bitbucket.org/berkeleylab/upcxx-utils/src/master/src/promise_collectives.cpp

    In this class, since the barriers are all effectively orchestrated by independent dist_objects, it is safe to construct them in one order, then fulfill them in another order, and then wait for any fulfilled PromiseBarrier in a third order (as long as all the ranks have the same construction order and do not choose a deadlocking order for the fulfilling & waiting).

    Since this seems to work as expected, and I was able to refactor my other algorithms that chained collectives relatively easily, I don’t think that dist_object needs any new features or that there needs to be a serialized channel for collectives to achieve the same results. In fact this keeps the code cleaner, imo.

    I think the only thing that would be helpful is to include within the collectives library a mechanism to delay the execution of the collective (barrier, reduce, broadcast, etc.) with the fulfillment of a promise, like the class that I wrote above. So really it is just one more abstraction: a split-phase collective which would allow the start to happen after a later condition has been met.
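
    A minimal sketch of that kind of delayed barrier built only from existing UPC++ primitives (illustrative; this is neither the upcxx-utils PromiseBarrier nor a proposed UPC++ API): the only collective call is the dist_object construction, done synchronously up front, so instances can then be fulfilled and waited on in any non-deadlocking order.

    ```cpp
    #include <upcxx/upcxx.hpp>

    class promise_barrier {
      struct state { int arrivals = 0; bool released = false; };
      upcxx::dist_object<state> st_;  // the only collective: construct outside progress
    public:
      promise_barrier() : st_(state{}) {}

      // Non-collective: call once on each rank when its local condition is met.
      void fulfill() {
        upcxx::rpc_ff(0, [](upcxx::dist_object<state> &st) {
          if (++(*st).arrivals == upcxx::rank_n()) {
            // Last arrival: release every rank (including rank 0 itself).
            for (int r = 0; r < upcxx::rank_n(); ++r)
              upcxx::rpc_ff(r, [](upcxx::dist_object<state> &s) { (*s).released = true; }, st);
          }
        }, st_);
      }

      // Non-collective: spin in user-level progress until every rank has
      // fulfilled. Must not itself be called from the restricted context.
      void wait() { while (!(*st_).released) upcxx::progress(); }
    };
    ```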

    I took a quick look at upcxx::barrier’s implementation and can clearly see that gasnet does not provide an accessible way to delay the execution, so likely any promise-collectives will be harder to get to work efficiently at the lower hardware levels.

  6. Dan Bonachea reporter

    @Rob Egan thanks for the interesting input. I've taken a look at your library and I think it's a good example of the design principle I mentioned above of synchronously creating a dist_object and using it as the root for coordinating further asynchronous work. As to your library semantics, I'd like to discuss further in a separate venue.

    However I think we're getting off-topic from the question in this issue, which is what to do about our current set of collective operations in progress. Currently we do allow collectives in progress, but their correctness depends strongly on an uncheckable precondition that they are initiated in strict job-wide collective order - as outlined in the original post this requirement turns out to be challenging/problematic in practice. So the question at hand in this issue is whether to continue allowing collective calls in progress (for the hypothetical small set of programs that use them correctly), or just deprecate/disallow that usage entirely.

  7. Rob Egan

    I agree this is getting off topic, but I did want to ensure that, if this proposal is implemented, it would not prohibit the design pattern of dependency-chaining collectives without excessive barrier/sync points. This issue shows that the current implementation can support this, so long as the collective data structure is not constructed within the restricted context of a progress callback.

    So not presuming that my approval is necessary, I have no objections to this proposal.

    Also, note that as a upcxx developer, having user access to detail::the_persona_tls.get_progressing() or something similar to use within assertions would be a nice thing to expose.

  8. Dan Bonachea reporter

    Also, note that as a upcxx developer, having user access to detail::the_persona_tls.get_progressing() or something similar to use within assertions would be a nice thing to expose.

    This query is proposed in spec issue 170 and added in impl PR 284 and spec PR 59. Please provide feedback there on that topic.

  9. Dan Bonachea reporter

    This topic was discussed in our 2020-09-16 meeting.

    The consensus was that the upcoming release will specify that invoking collectives inside progress is a deprecated/obsolescent feature that might be prohibited in a subsequent revision. The implementation will be modified to issue a warning upon the first violation at runtime, silenceable with an environment variable.

    Additionally, invoking a blocking user-level upcxx::barrier() or requesting entry_barrier::user for any collective invocation from inside progress will be specified as erroneous and immediately become a fatal error.

    I'll take care of these tasks.

  10. Dan Bonachea reporter

    Proposed specification updates now in spec pull request 61

    It turns out that other collective operations with user-level progress (notably AD construction) are also a problem within the restricted context, and have been added to the set of prohibited collectives.

    The complete set of operations now strongly prohibited (soon to be a fatal error) within progress callbacks:

    1. Blocking upcxx::barrier()
      • previously deadlocked
      • this one is fundamentally silly/indefensible
    2. upcxx::finalize()
      • previously deadlocked
      • this one is fundamentally silly/indefensible
    3. team::destroy(), atomic_domain::destroy() and cuda_device::destroy() with the default entry_barrier::user argument
      • previously deadlocked
      • This one can already be worked around by passing argument entry_barrier::none (or entry_barrier::internal once impl issue #412 is fixed)
    4. Construction of atomic_domain and cuda_device, team::split() (and likely team constructors added in the future)
      • previously worked in some situations, but could deadlock in others: where any rank's call was control dependent on remote user progress, or even some forms of remote internal progress (eg waiting for an rpc delivery or acknowledgment).
      • These could potentially be relaxed in the future if we added an explicit entry_barrier argument to enable progress-level downgrade, but I would not advocate for that approach without a stakeholder-driven demand.

    All other collective operations (with progress internal or none) remain permitted, but deprecated (with a runtime warning).

  11. Dan Bonachea reporter

    issue #169: collective calls inside restricted context

    • Prohibit collectives with user progress from initiation in the restricted context

    • Deprecate initiation of collectives with progress level internal/none from the restricted context

    Resolves issue #169

    → <<cset caccc2e4f7be>>

  12. Dan Bonachea reporter

    As of Impl PR 426 we now unconditionally prohibit all collective operation initiation inside the restricted context:

    Initiating collective operations with a progress level of internal or none from within the restricted context (within a callback running inside progress), an action deprecated with a runtime warning since 2020.10.0, is now prohibited with a fatal error. For details, see spec issue 169.
