Collective ordering rule with sub-teams is incomplete/insufficient

Issue #181 resolved
Dan Bonachea created an issue

I've just realized there's a problem with our spec's collective ordering requirement, which defines correctness of collective calls in the presence of sub-teams:

[image attachment: collective-order.png — the spec's collective ordering requirement (formula 12.1)]

Consider this program, run with 3 or more ranks:

#include <upcxx/upcxx.hpp>
#include <assert.h>

using namespace upcxx;

int main() {
  // --- setup code ---
  upcxx::init();
  assert(rank_n() >= 3);

  // create 2-process teams, where each proc is a member of two such teams:
  // tmA contains this rank and right neighbor (with wraparound)
  // tmB contains this rank and left neighbor (with wraparound)
  upcxx::team *tmA, *tmB;
  for (int i=0; i < rank_n(); i++) {
    if (i == rank_me()) {
      tmA = new team(world().split(i, rank_me()));
    } else if ( (i+1)%rank_n() == rank_me() ) {
      tmB = new team(world().split(i, rank_me()));
    } else {
      world().split(team::color_none, 0); // not creating a team
    }
  }
  barrier(world());

  // --- problem region ---
  barrier(*tmA); // sync with right neighbor - DEADLOCK
  barrier(*tmB); // sync with left neighbor

  // --- cleanup code ---
  barrier(world());
  upcxx::finalize();
}

This program creates pairwise teams: each rank joins a 2-member team with its left neighbor and a 2-member team with its right neighbor. All ranks then invoke a blocking barrier on the team shared with their right neighbor. This creates a dependency cycle across the job and a guaranteed deadlock.

However, this program has not violated our specified formal (or informal) collective ordering requirement. Specifically, within the "problem region" each pair of distinct collective calls C_1 and C_2 has at most one rank in common, meaning the set Participants(C_1) INTERSECT Participants(C_2) is either empty or a singleton, so the formal condition is vacuously satisfied. With respect to the informal "notion", no two processes both participate in a pair of distinct collectives within the problem region, so that condition is trivially satisfied as well. However (with 3 or more ranks) the program still deadlocks due to a lack of proper collective ordering: for example, with rank_n() == 3, rank 0 blocks in the barrier over {0,1} waiting on rank 1, which blocks in the barrier over {1,2} waiting on rank 2, which blocks in the barrier over {2,0} waiting on rank 0. The problem is that there is a dependency cycle in the collective synchronization ordering, but the cycle arises only transitively across all the ranks and not between any isolated pair of ranks.

I think this means the formalism doesn't really define a rigorous rule for avoiding deadlocks caused by misordered collective operations. There are of course plenty of other ways to create deadlock using point-to-point operations, but this seems to undercut formula 12.1's main (only?) purpose for existence. I think fixing the formalism probably means introducing a requirement that one can construct at least one global total order over all collective calls issued by the program that remains consistent with all the partial orders defined by Precedes() on each process. Do we want to attempt that, or should we drop this formalism and replace it with something more hand-wavy?
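
For concreteness, here is a sketch (illustration only, not proposed spec text) of what such a "total-orderability" requirement would check: treat each collective call as a node, add an edge for every pair ordered by Precedes() on some process, and ask whether the resulting graph admits a topological order. The CallId type and the function name are hypothetical.

#include <map>
#include <queue>
#include <vector>

using CallId = int; // hypothetical identifier naming one collective call in the program

// Returns true iff the per-process Precedes() orders admit a single consistent
// global total order over all collective calls, i.e. the union of the partial
// orders contains no cycle (checked here with Kahn's topological sort).
bool totally_orderable(const std::vector<std::vector<CallId>> &per_process_order) {
  std::map<CallId, std::vector<CallId>> succ; // edge: earlier call -> later call
  std::map<CallId, int> indegree;
  for (auto &order : per_process_order) {
    for (CallId c : order) indegree.emplace(c, 0);
    for (size_t i = 0; i + 1 < order.size(); i++) {
      succ[order[i]].push_back(order[i + 1]);
      indegree[order[i + 1]]++;
    }
  }
  std::queue<CallId> ready;
  for (auto &kv : indegree)
    if (kv.second == 0) ready.push(kv.first);
  size_t emitted = 0;
  while (!ready.empty()) {
    CallId c = ready.front(); ready.pop();
    emitted++;
    for (CallId next : succ[c])
      if (--indegree[next] == 0) ready.push(next);
  }
  return emitted == indegree.size(); // all calls ordered <=> no dependency cycle
}

Applied to the example above with 3 ranks (writing C_r for the barrier over {r, r+1}), the per-process orders are C_0 before C_2 on rank 0, C_1 before C_0 on rank 1, and C_2 before C_1 on rank 2, which form the cycle C_0 -> C_2 -> C_1 -> C_0, so no consistent total order exists.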

Comments (5)

  1. Amir Kamil

    My understanding was that we deliberately avoided imposing a total order on collectives to allow asynchronous collectives on disjoint teams to be initiated in different orders. For instance, if we replace the problem region in your code example with:

    auto fut1 = barrier_async(*tmA);
    auto fut2 = barrier_async(*tmB);
    when_all(fut1, fut2).wait();
    

    Then there wouldn’t be a problem, even though the collectives cannot be totally ordered.

    It seems to me that something like the following may be enough to express what we are currently missing:

    For a collective operation C, it is an error if the completion of the operation (return from synchronous collectives, operation-completion notifications for asynchronous collectives) on at least one participant has a happens-before relationship with the initiation of the operation on another participant.

    Does this capture the problematic behavior?
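
    As an illustration (a hypothetical sketch, not code from the issue; the global go flag and the RPC are just one way to create the dependency), here is a program that this condition would flag: rank 0's return from the barrier would have to happen-before rank 1's initiation of that same barrier.

    #include <upcxx/upcxx.hpp>
    #include <assert.h>

    using namespace upcxx;

    bool go = false; // set on rank 1 by an RPC from rank 0

    int main() {
      init();
      assert(rank_n() >= 2);
      if (rank_me() == 0) {
        barrier();                      // completion of the world barrier on rank 0 ...
        rpc_ff(1, []() { go = true; }); // ... happens-before this RPC is sent ...
      } else if (rank_me() == 1) {
        while (!go) progress();         // ... which happens-before this loop exits ...
        barrier();                      // ... which happens-before initiation on rank 1
      } else {
        barrier();                      // remaining ranks simply participate
      }
      finalize();                       // never reached: guaranteed deadlock
    }

    Under the proposed rule this program is erroneous, which matches its actual behavior: the barrier can never complete on rank 0 precisely because rank 1's initiation is gated on that completion.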

  2. Dan Bonachea reporter

    the following may be enough to express what we are currently missing ... Does this capture the problematic behavior?

    The additional condition Amir proposes is a step in the right direction, and is certainly a prerequisite for correctness. It would rule this particular program erroneous, solving the problem at hand.

    However I'm worried this might not be sufficient in general due to GASNet's current definition of "collective", which includes:

    In addition to the requirement on compatibility of collective calls over any given team, all collective calls over distinct teams must be ordered such that no deadlock would occur if all such calls were replaced by blocking barriers. A formal specification of this constraint will appear in a future revision of this document.

    This means that "asynchronous" collective initiations with internal or user progress are still technically permitted to impose barrier-like synchronization across the team before returning from injection.

    This requirement was imposed to avoid the possibility of requiring unbounded buffering inside GASNet; otherwise one could arrange to initiate an unbounded number of synchronizing collectives on one rank before the other ranks start issuing matching calls, and if each in-flight collective consumes non-zero resources then the resource utilization could grow without bound. I believe one could previously create deadlocks by breaking this rule, but GASNet's collective internals changed somewhat in 2018.12.0 and I'm not sure that still remains true (@Paul Hargrove knows that code best). I tried various experiments on develop and was unable to construct a program that would deadlock ibv-conduit as a result of breaking this rule, so perhaps this detail doesn't matter in practice and we should consider relaxing GASNet's restriction.
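
    For illustration (a hypothetical sketch, not from GASNet or the spec; the count and the sleep are arbitrary), this is the kind of program that restriction guards against: one rank injects many asynchronous barriers long before the other ranks issue the matching calls, so any per-operation buffering grows with the count unless the implementation is permitted to throttle (i.e. synchronize) at injection.

    #include <upcxx/upcxx.hpp>
    #include <unistd.h>
    #include <vector>

    int main() {
      upcxx::init();
      std::vector<upcxx::future<>> futs;
      const int N = 100000;                 // arbitrarily large, for illustration only
      if (upcxx::rank_me() != 0) sleep(30); // lagging ranks issue matching calls much later
      for (int i = 0; i < N; i++)
        futs.push_back(upcxx::barrier_async()); // rank 0 races ahead with N in-flight barriers
      for (auto &f : futs) f.wait();
      upcxx::finalize();
    }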

    In the interests of expediency (for the pending spec release) I suggest we add Amir's proposed condition today. That condition captures an unquestionably fundamental requirement for semantic correctness, and the original report demonstrates a semantically guaranteed deadlock that arises when the condition is violated.

    In the future we can consider further strengthening our definition to require "total-orderability", if that becomes provably necessary for engineering reasons.

  3. Paul Hargrove

    I can't claim to have sufficient recollection of the collectives implementation to answer with certainty whether the current implementation can be deadlocked. However, I suspect that IF it is possible to do, then issuing a sufficient number of collectives, each of which claims a portion of the bounded scratch space, might be a means to do so. That would, IIRC, be easiest to do using the AlltoAll and AllGather in GASNet, which are not exposed in UPC++. Using only the operations we've spec'd in UPC++, I think it might be possible with a reduction on a large-enough user-defined data type, which would force use of scratch rather than eager, but I don't know that UPC++ even uses that mode as opposed to falling back on RPC. Even then, the 64KiB MaxMedium on ibv-conduit makes the eager/rendezvous protocol switch in reduction harder to reach.
    All that said, I cannot construct an actual deadlock in my head using the implementation knowledge I can recall.
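
    For reference, a sketch of the kind of reduction described above (hypothetical; the element size and operator are arbitrary): a user-defined type well past a 64 KiB MaxMedium, reduced with a user-supplied operator, which is where a scratch/rendezvous path would come into play if the implementation uses GASNet reductions for this case at all.

    #include <upcxx/upcxx.hpp>

    // Hypothetical 128 KiB element type, comfortably past a 64 KiB MaxMedium.
    struct blob { double v[16384]; };

    int main() {
      upcxx::init();
      blob in{}, out{};
      // User-defined associative operator over the large element type.
      auto op = [](const blob &a, const blob &b) {
        blob r;
        for (int i = 0; i < 16384; i++) r.v[i] = a.v[i] + b.v[i];
        return r;
      };
      upcxx::reduce_all(&in, &out, 1, op, upcxx::world()).wait();
      upcxx::finalize();
    }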

    Related:
    In the context of GASNet's collectives, we've discussed the possibility of allowing one thread to initiate a single collective operation on multiple endpoints in the same team. IF that is to be done separately/sequentially (without introducing new APIs that accept vectors of endpoints), then I think we (I) must ensure that the (yet to be implemented) support for multiple local endpoints in a common team provides a guarantee of deadlock-free injection (it might block for some resources, but nothing that could introduce a dependency on other collective initiations). So, I think there is hope that later we'd be able to safely support async collectives in GASNet with (only?) a restriction along the lines of what Amir has proposed here.

    For the current spec release I am also in favor of "expediently" adding Amir's additional condition, even if we don't believe it sufficient to fully resolve this issue in its full generality.
