Deprecate/Prohibit collective calls inside progress callbacks

Issue #169 resolved
Dan Bonachea created an issue

There is mounting evidence (eg impl issue 412, impl issue 416) that invoking functions requiring a collective call inside the restricted context (ie in callbacks invoked by user-level progress) is a Very Bad Idea.

Motivation

We don't currently prohibit this practice, but it turns out to be semantically problematic for several reasons:

  1. It's very hard to maintain the collective ordering property in non-trivial codes. All operations specified as "collective" must be initiated according to the collective ordering property, but are NOT guaranteed to generate and signal completions in a collective order. There are two components to this:
    1. Overlapped asynchronous collectives may truly complete in different orders on different processes (eg due to lack of ordering in the underlying network), leading to completion signalling in different orders, and therefore chained callbacks running in non-collective orders.
    2. When two or more completions have been signaled in one process and callbacks are scheduled on each of the now-readied futures, there is no guarantee regarding the relative order in which those callbacks are executed during the next user-level progress call.
      Taken together, the consequence of these is that it's quite difficult to guarantee correct collective ordering in any code that allows more than one asynchronous callback "in-flight" at a time (on any process) that will call a collective. Basically this only works if there is a direct chain of dependencies enforcing a total order between all such callbacks (and any collectives issued synchronously outside progress). A sketch illustrating this hazard appears after this list.
  2. Injection of collective calls into the progress engine breaks modularity.
    1. Once any piece of code has injected an asynchronous callback into the progress engine that will invoke a collective, it has imposed a GLOBAL restriction on the subsequent actions of the master persona. So for example if the client of a library calls in to perform some asynchronous operation and the library implementation injects such a callback into the progress stream and returns control to the caller, then it's erroneous for the caller to make any further collective calls itself until it can prove that all such callbacks implementing the library have "drained". Otherwise it runs a risk of breaking the process-wide collective ordering property.
    2. The problem here is the collective ordering property is (necessarily) a process-wide constraint, and asynchronous injection of collectives causes enforcement of that property to "bleed" across abstraction boundaries.
  3. Debugging violations of the collective ordering property is very hard.
    1. We do not provide any assertions to detect violation of collective ordering preconditions, and deploying such checking would require breaking the UPC++ semantics of some calls, would be very expensive (requiring global communication at every such point), and would strongly perturb the behavior of the application.
    2. The failure modes for breaking collective ordering preconditions commonly include deadlock and silent data corruption (eg mis-associating dist_objects across ranks). Even expert UPC++ programmers find it very difficult to track down such problems to the root cause.
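
To make problem 1 above concrete, here is a hedged sketch (the scenario is illustrative, not taken from the proposal) of two overlapped collectives whose completion callbacks each initiate a further collective. Initiation of f1 and f2 is collectively ordered, but their completions are not, so the broadcasts issued inside the callbacks may be initiated in different orders on different processes:

```cpp
#include <upcxx/upcxx.hpp>

int main() {
  upcxx::init();

  // Two overlapped collectives, initiated in the same order on every rank
  // (satisfying the collective ordering property at initiation time).
  // Roots 0 and 1 assume a job with at least 2 ranks.
  auto f1 = upcxx::broadcast(upcxx::rank_me(), 0);
  auto f2 = upcxx::broadcast(upcxx::rank_me(), 1);

  // HAZARDOUS today and prohibited under this proposal: each callback runs in
  // the restricted context and initiates another collective. f1 and f2 may
  // complete in different orders on different ranks, so the inner broadcast
  // calls can be issued in non-collective order across the job.
  auto g1 = f1.then([](int v) { return upcxx::broadcast(v, 0); });
  auto g2 = f2.then([](int v) { return upcxx::broadcast(v, 1); });

  upcxx::when_all(g1, g2).wait();
  upcxx::finalize();
}
```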

This issue exists to discuss the possibility of simply prohibiting all collective calls from within the restricted context, and enforcing this with assertions (at least in DEBUG mode).

Proposal

Change Section 10.2 "Restricted Context":

User code running in the restricted context must assume that for the duration of the context all other attempts at making user-level progress, from any thread on any process, may result in a no-op every time.

Append:

Furthermore, any call that is specified as "collective" shall not be invoked by a thread while running in the restricted context.

The implementation would be adjusted to generate a fatal error during attempts to issue a collective call within the restricted context. This would happen at least for debug mode, but probably also for production mode to prevent effective creation of a "dialect" where violations of this rule are silently accepted.

Alternative deployment: we could decree this usage as obsolescent in the next release, modify the implementation to issue a non-fatal warning about deprecated functionality (with a request for feedback), and delay the hard prohibition and fatal error enforcement to a subsequent release.

Implications

All of the following would be prohibited inside callbacks running inside progress (a sketch of one now-erroneous pattern follows the list):

  1. Initiation of barriers, reductions, broadcasts, and any collective comms added in the future
  2. Construction of dist_object
  3. team::split() and team::destroy() (and likely team constructors added in the future)
  4. Construction/destruction of atomic_domain
  5. Construction/destruction of cuda_device (and likely memory kind devices added in the future)
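
As a hedged illustration (not part of the proposal text), a pattern like the following, which is permitted today whenever the collective ordering precondition happens to hold, would become erroneous because the dist_object constructor is a collective executed inside a completion callback:

```cpp
#include <upcxx/upcxx.hpp>

int main() {
  upcxx::init();

  // ERRONEOUS under the proposal: the callback runs inside progress
  // (the restricted context) and constructs a dist_object, a collective.
  auto done = upcxx::barrier_async().then([]() {
    upcxx::dist_object<int> d(upcxx::rank_me());
    // ... bootstrap further communication using d ...
    // (illustration only; d is also destroyed as the callback returns)
  });
  done.wait();

  upcxx::finalize();
}
```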

These new restrictions would represent a breaking change that prohibits a class of programs that are currently permitted and correct, although arguably not scalable/maintainable.

The extent to which this would impact deployed and future real codes is a matter for discussion. I'm very curious to hear about any real use cases that rely upon this functionality, and whether that usage is fundamental or could be re-factored/improved to obey the proposed restrictions.

Comments (14)

  1. Rob Egan

    Being the chief culprit here, I’m in favor of this, and have been busy refactoring my code today. So far there has not been any performance hit and the hangs that were revealed in my stress testing have gone away.

    Most of the changes made the code cleaner and easier to read without all the future dependency chaining, indentation and lambda-capture cruft. However, in my refactoring today I did come across one case where future-chaining the collectives is cleaner and simpler… in prefix reduction of a large block, it is convenient to call the prefix reduction recursively for each chunk instead of expanding all the promise dependencies which would be required for the entire interleaved workflow. Additionally, calling wait and barrier at each iteration would add latency and communication to an already complicated orchestration of rpcs that benefits from pipelining/interleaving the data set.

    https://bitbucket.org/berkeleylab/upcxx-utils/src/185bcfe6ed1aec93ea4280eb2a32136e7e436d43/include/upcxx_utils/reduce_prefix.hpp#lines-778

    So, if there is something (a persona maybe?) that implements a serialized channel that tags the ids when creating dist_objects or any other collective calls, then it might be safe to implement a strictly ordered set of collective calls within that context, although that will still be very hard to get perfectly correct.

  2. Dan Bonachea reporter

    in prefix reduction of a large block, it is convenient to call the prefix reduction recursively for each chunk instead of expanding all the promise dependencies which would be required for the entire interleaved workflow. Additionally, calling wait and barrier at each iteration would add latency and communication to an already complicated orchestration of rpcs that benefits from pipelining/interleaving the data set.

    If I understand your example correctly, you are observing that it's problematic to recursively invoke your own collective operation (prefix_reduce) from inside restricted context, and I agree, for all the same reasons described above. For a totally ordered linear chain of collective-invoking callbacks the ordering property is feasible to maintain, but this still represents a modularity problem if control is returned to your library client while such a dependency-chain is in-flight... ie the asynchronous invocation of collective operations inside your implementation (dist_object construction) would interfere with concurrent collectives invoked by the client (including additional overlapped calls to your library) and lead to breakage.

    That being said, IF dist_object construction is the only UPC++ collective in play for this code, then another solution suggests itself - simply construct the sole necessary dist_object in the outermost enclosing call from the client (outside progress), and pass it down for reuse by all the pipelined steps in the linear chain (possibly via a hidden implicit argument or using internal wrappers). Note you might need to change the dist_object datum to be an array with one entry for each step in the pipeline, and pass the step index explicitly in RPCs, rather than relying on dist_id to match related steps. The per-step metadata stored in the dist_object seems tiny compared to the data payloads, so this seems like a reasonable solution. Thoughts?
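
    A hedged sketch of that suggestion (run_pipeline, step_state and num_steps are illustrative names, not upcxx-utils code): the sole dist_object is constructed synchronously outside progress with one slot per pipeline step, and RPCs name the step by explicit index rather than relying on per-step dist_object construction inside callbacks.

    ```cpp
    #include <upcxx/upcxx.hpp>
    #include <cstddef>
    #include <vector>

    struct step_state { double partial = 0.0; };   // illustrative per-step metadata

    void run_pipeline(std::size_t num_steps) {
      // The sole collective call: one dist_object constructed synchronously,
      // outside progress, sized with one entry per pipeline step.
      upcxx::dist_object<std::vector<step_state>> steps{std::vector<step_state>(num_steps)};

      for (std::size_t s = 0; s < num_steps; ++s) {
        int peer = (upcxx::rank_me() + 1) % upcxx::rank_n();
        // Each RPC names its step by explicit index; nothing collective runs
        // inside the callback.
        upcxx::rpc(peer,
            [](upcxx::dist_object<std::vector<step_state>> &st, std::size_t step, double v) {
              (*st)[step].partial += v;
            },
            steps, s, 1.0).wait();
      }
      upcxx::barrier();  // quiesce before steps is destroyed
    }
    ```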

    if there is something (a persona maybe?) that implements a serialized channel that tags the ids when creating dist_objects or any other collective calls

    There is no such serialized channel as you seem to be suggesting. dist_id's are assigned FCFS during dist_object construction using a counter in the parent team data structure (this is just the current implementation and does not represent a guarantee of future behavior). It's intentionally very lightweight and does not track any history information whatsoever. Other collective operations do not carry any kind of ordering "tag" that would provide isolation; they are implicitly ordered by the collective initiation rules, and many places throughout our stack strongly rely on correct injection ordering (even across teams) to ensure they drain without resource deadlock (eg on bounded-size scratch space buffers inside the network).

  3. Rob Egan

    Thanks Dan. You are right that, in my pipelining reduction example, enumerating the necessary dist_object construction up front and passing it to the collective calls should work fine, and my plan is to implement that; I was just pointing out the one example I had where it would be more convenient to recursively call a collective in a dependency chain.

    I’ve never looked into the dist_id creations, but your description of it does seem to me like you do already have a channel/tag/color built into upcxx by way of it deriving its FCFS counter from the parent team.

    So, playing devil's advocate to keep allowing collectives inside progress callbacks, and with the caveat that splitting a team is an expensive and memory-hungry operation, it does seem like creating a new team could be a safe way to call collectives with some mechanism other than blocking to maintain the strict ordering that is required for FCFS dist_id construction. So if there were a derivative "class team_view : public team" and a lightweight team method "upcxx::team_view upcxx::team::clone_view() const" that returned a new team_view whose lifetime and memory are essentially the same as the parent team but which has its own unique FCFS counter, that could be a possible mechanism to safely support this.

  4. Dan Bonachea reporter

    creating a new team could be a safe way to call collectives with some mechanism other than blocking to maintain the strict ordering that is required for FCFS dist_id construction.

    Creating a "fresh" team would create an isolated space of dist_id's under the current implementation only; this is not a guaranteed property. However this only works because dist_object construction happens to be the only collective operation in the library that is guaranteed to never communicate.

    This approach would not help at all with all the other collective operations, which usually do communicate and could be subject to control deadlocks or resource deadlocks on the fixed-size buffer queues in the network if their initiation calls were not properly ordered. Note that teams do NOT provide isolation of collective communication orderings - the job still needs to maintain a partial order over all collective calls that is consistent with the total order of all collective calls in each process (including across teams).

    It's possible we could provide a collective abstraction to create a special case ONLY for dist_object construction (ie provide a collectively created "Foo" to generalize the current role of the team in that call). This would allow us to relax the dist_object construction ordering between distinct Foo's, and between dist_object construction and other collective operations. However even if we did this, this approach only addresses the "modularity" problem (my problem 2 above). Problems 1 and 3 of allowing such calls from progress callbacks would still remain - the client would remain fully responsible for enforcing a jobwide total order of calls to dist_object construction on each given Foo (which remains challenging to construct during progress in the presence of non-collective completions), and the penalty for getting it wrong would remain difficult to debug (mismatched dist_object names leading to silent data corruption). This still seems like a minefield to me, although we could discuss deploying it as an "advanced" feature if there is a good motivating use case.

    To my mind, dist_object was really designed to be the "root" of a collective abstraction - a synchronously created entity that then enables bootstrapping of further (possibly asynchronous) inter-process interaction that would then operate in terms of that root. I would generally encourage designs that synchronously create one dist_object per high-level logical collective entity, and (when needed) use that dist_object to store an unordered registry of other more dynamic operations/interactions that might be created and destroyed asynchronously and non-collectively. IOW the key property of a dist_object is really just the collective dist_id "name" that uses the collective total order of the job to collectively establish an isolated namespace, and one can easily create a second level of naming inside that space indexed in terms of the operation itself (eg the step count in a pipeline, a time step index, a hash table key, etc).
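
    As a hedged sketch of that second level of naming (registry_t and the key format below are illustrative, not an existing API): a single synchronously constructed dist_object holds an unordered registry keyed by an application-level name, so individual entries can be created and retired asynchronously and non-collectively.

    ```cpp
    #include <upcxx/upcxx.hpp>
    #include <string>
    #include <unordered_map>

    using registry_t = std::unordered_map<std::string, double>;  // illustrative payload

    void example() {
      // The only collective naming step, done synchronously outside progress.
      upcxx::dist_object<registry_t> reg{registry_t{}};

      // Entries are then addressed purely by an application-level key (a
      // pipeline step, a time step, a hash-table key, ...); creating or
      // updating an entry is non-collective and needs no further dist_ids.
      int peer = (upcxx::rank_me() + 1) % upcxx::rank_n();
      upcxx::rpc(peer,
          [](upcxx::dist_object<registry_t> &r, const std::string &key, double v) {
            (*r)[key] += v;
          },
          reg, std::string("timestep:42"), 3.14).wait();

      upcxx::barrier();  // all ranks done before reg is destroyed
    }
    ```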

    Finally, it's worth noting that the automated RPC interlock is (I believe) the only part of dist_object that leverages runtime internals, and (with some ugliness) is really just a complicated form of syntactic sugar intended to capture common usage patterns. For a client with special needs, nothing prevents one from developing a custom global naming scheme analogous to dist_object using the same types of algorithms.

  5. Rob Egan

    Okay those are all excellent points. And it does sound like this would be really hard to design and get working.

    What I am really after is a delayed collective so they can safely be done in series, i.e. a reduction then a broadcast (yes, I know about upcxx::reduce_all), or some other collective then a barrier, or maybe a scatter then a gather. So I was thinking that all I really want is to promise a collective, where the collective is constructed synchronously but not started. Then, on some condition, it is started and completes asynchronously.

    As a prototype and to test it I wrote a PromiseBarrier class:

    https://bitbucket.org/berkeleylab/upcxx-utils/src/master/include/upcxx_utils/promise_collectives.hpp

    https://bitbucket.org/berkeleylab/upcxx-utils/src/master/src/promise_collectives.cpp

    In this class, since the barriers are all effectively orchestrated by independent dist_objects, it is safe to construct them in one order, then fulfill them in another order, and then wait for any fulfilled PromiseBarrier in a third order (as long as all the ranks have the same construction order and do not choose a deadlocking order for the fulfilling & waiting).

    Since this seems to work as expected, and I was able to refactor my other algorithms that chained collectives relatively easily, I don’t think that dist_object needs any new features or that there needs to be a serialized channel for collectives to achieve the same results. In fact this keeps the code cleaner, imo.

    I think the only thing that would be helpful is to include within the collectives library a mechanism to delay the execution of the collective (barrier, reduce, broadcast, etc.) with the fulfillment of a promise, like the class that I wrote above. So really it is just one more abstraction: a split-phase collective which would allow the start to happen after a later condition has been met.
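
    A minimal sketch of that kind of delayed barrier built only from existing UPC++ primitives (illustrative; this is neither the upcxx-utils PromiseBarrier nor a proposed UPC++ API): the only collective call is the dist_object construction, done synchronously up front, so instances can then be fulfilled and waited on in any non-deadlocking order.

    ```cpp
    #include <upcxx/upcxx.hpp>

    class promise_barrier {
      struct state { int arrivals = 0; bool released = false; };
      upcxx::dist_object<state> st_;  // the only collective: construct outside progress
    public:
      promise_barrier() : st_(state{}) {}

      // Non-collective: call once on each rank when its local condition is met.
      void fulfill() {
        upcxx::rpc_ff(0, [](upcxx::dist_object<state> &st) {
          if (++(*st).arrivals == upcxx::rank_n()) {
            // Last arrival: release every rank (including rank 0 itself).
            for (int r = 0; r < upcxx::rank_n(); ++r)
              upcxx::rpc_ff(r, [](upcxx::dist_object<state> &s) { (*s).released = true; }, st);
          }
        }, st_);
      }

      // Non-collective: spin in user-level progress until every rank has
      // fulfilled. Must not itself be called from the restricted context.
      void wait() { while (!(*st_).released) upcxx::progress(); }
    };
    ```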

    I took a quick look at upcxx::barrier’s implementation and can clearly see that gasnet does not provide an accessible way to delay the execution, so likely any promise-collectives will be harder to get to work efficiently at the lower hardware levels.

  6. Dan Bonachea reporter

    @Rob Egan thanks for the interesting input. I've taken a look at your library and I think it's a good example of the design principle I mentioned above of synchronously creating a dist_object and using it as the root for coordinating further asynchronous work. As to your library semantics, I'd like to discuss further in a separate venue.

    However I think we're getting off-topic from the question in this issue, which is what to do about our current set of collective operations in progress. Currently we do allow collectives in progress, but their correctness depends strongly on an uncheckable precondition that they are initiated in strict job-wide collective order - as outlined in the original post this requirement turns out to be challenging/problematic in practice. So the question at hand in this issue is whether to continue allowing collective calls in progress (for the hypothetical small set of programs that use them correctly), or just deprecate/disallow that usage entirely.

  7. Rob Egan

    I agree this is getting off topic, but I did want to ensure that, if this proposal is implemented, it would not prohibit the design pattern of dependency-chaining collectives without excessive barrier/sync points. This issue shows that the current implementation can support this, so long as the collective data structure is not constructed within the restricted context of a progress callback.

    So not presuming that my approval is necessary, I have no objections to this proposal.

    Also, note that as a upcxx developer, having user access to detail::the_persona_tls.get_progressing() or something similar to use within assertions would be a nice thing to expose.

  8. Dan Bonachea reporter

    Also, note that as a upcxx developer, having user access to detail::the_persona_tls.get_progressing() or something similar to use within assertions would be a nice thing to expose.

    This query is proposed in spec issue 170 and added in impl PR 284 and spec PR 59. Please provide feedback there on that topic.

  9. Dan Bonachea reporter

    This topic was discussed in our 2020-09-16 meeting.

    The consensus was that the upcoming release will specify that invoking collectives inside progress is a deprecated/obsolescent feature that might be prohibited in a subsequent revision. The implementation will be modified to issue a warning upon the first violation at runtime, silenceable with an environment variable.

    Additionally, invoking a blocking user-level upcxx::barrier() or requesting entry_barrier::user for any collective invocation from inside progress will be specified as erroneous and immediately become a fatal error.

    I'll take care of these tasks.

  10. Dan Bonachea reporter

    Proposed specification updates now in spec pull request 61

    It turns out that other collective operations with user-level progress (notably AD construction) are also a problem within the restricted context, and have been added to the set of prohibited collectives.

    The complete set of operations now strongly prohibited (soon to be a fatal error) within progress callbacks:

    1. Blocking upcxx::barrier()
      • previously deadlocked
      • this one is fundamentally silly/indefensible
    2. upcxx::finalize()
      • previously deadlocked
      • this one is fundamentally silly/indefensible
    3. team::destroy(), atomic_domain::destroy() and cuda_device::destroy() with the default entry_barrier::user argument
      • previously deadlocked
      • This one can already be worked around by passing argument entry_barrier::none (or entry_barrier::internal once impl issue #412 is fixed)
    4. Construction of atomic_domain and cuda_device, team::split() (and likely team constructors added in the future)
      • previously worked in some situations, but could deadlock in others: where any rank's call was control dependent on remote user progress, or even some forms of remote internal progress (eg waiting for an rpc delivery or acknowledgment).
      • These could potentially be relaxed in the future if we added an explicit entry_barrier argument to enable progress-level downgrade, but I would not advocate for that approach without a stakeholder-driven demand.

    All other collective operations (with progress internal or none) remain permitted, but deprecated (with a runtime warning).

  11. Dan Bonachea reporter

    issue #169: collective calls inside restricted context

    • Prohibit collectives with user progress from initiation in the restricted context

    • Deprecate initiation of collectives with progress level internal/none from the restricted context

    Resolves issue #169

    → <<cset caccc2e4f7be>>

  12. Dan Bonachea reporter

    As of Impl PR 426 we now unconditionally prohibit all collective operation initiation inside the restricted context:

    Initiating collective operations with a progress level of internal or none from within the restricted context (within a callback running inside progress), an action deprecated with a runtime warning since 2020.10.0, is now prohibited with a fatal error. For details, see spec issue 169.
