Specify alignment rules for collective operations

With collective operations going into the spec, we need to specify requirements for correct use of collective operations. While teams will not go into the first version of the spec, we should ensure that the rules we specify can be cleanly extended with the addition of teams. As such, the rules I suggest here do include teams, and we can remove the team-specific comments when putting them in the spec. Also, we should err on the side of being more restrictive; we can always loosen the correctness rules without breaking existing programs, but we can't strengthen them without potentially breaking code.

References

[1] A. Aiken and D. Gay. Barrier Inference. POPL'98.

[2] A. Kamil and K. Yelick. Enforcing Textual Alignment of Collectives Using Dynamic Checks. LCPC'09.

[3] A. Kamil. Single Program, Multiple Data Programming for Hierarchical Computations. PhD Thesis.

[4] P. Hilfinger, et al. Titanium Language Reference Manual, Version 2.20.

Wbat we can detect as a library

Since our implementation of UPC++ is as a standard C++ library, we are limited in what rule violations we can detect. We cannot do static analysis or instrument the code, so we cannot fully implement the checking schemes described in the references above. However, this doesn't necessarily mean we should choose a more lenient alignment scheme. We're trying to leave the door open for a UPC++-aware compiler, which could implement more detailed checking. We should choose a scheme that is easy to understand and makes it harder to write incorrect code, even if we can't fully enforce it at the moment.

What we do have control over as a library is what happens when a user calls a collective operation. The alignment detection scheme in [2] has two components: (1) tracking control flow on each thread, and (2) checking for consistency at collective operations. The tracking piece requires instrumenting the source code, which we cannot do in a library. We can, however, implement the checking part of the algorithm, which precedes every collective operation, including the implicit barrier at program exit, with a 32-bit broadcast from rank 0. This broadcast is guaranteed to complete correctly, since it is consistent across all ranks by design. The data it broadcasts is a summary of the sequence of collective operations on rank 0, and each rank compares its own sequence with rank 0's to ensure everything is aligned. If a rank is not aligned with 0, then it spits out an error describing where its alignment differs from rank 0 and then aborts the program. Instead of the sequence summary, we would just use a value that describes the current collective operation and check that it matches rank 0's. This scheme allows us to check for the minimal degree of alignment that guarantees deadlock freedom. It does not allow us to detect violations of the stricter definitions of alignment below.

Team collective operations complicate checking. Consider the following code:

team myteam = ...;
if (myrank() < 5) {
  myteam.mychild().barrier();
}

The problem with this code is that there is no way to instrument the team barrier while guaranteeing that the instrumentation does not deadlock. In the scheme above, we put a consistent broadcast before each collective operation, and the broadcast is guaranteed to complete since every rank does the same broadcast operation. In this code, this is no longer the case, since not every rank will execute the team barrier. We don't even know if every rank in the team will execute the barrier, so we can't guarantee that a team broadcast will complete if we put it before the barrier.

The solution to this problem in [3] is to require that collective operations be on the currently scoped team. This allows us to check for consistency when changing team scope. Consider the following:

team myteam = ...;
teamsplit(myteam) {
  if (global_myrank() < 5) {
    barrier();
  }
}

We can now place a global broadcast before the teamsplit, a team broadcast before the barrier, and another team broadcast when exiting the teamsplit. Even if some team members call the barrier while others don't, they all will execute a consistent broadcast operation that is guaranteed to complete. This allows us to detect the misalignment and abort before the code deadlocks.

Common rules

Extrapolating from the above scheme, the following rules are required for any definition of valid alignment.

A collective operation must be consistent across all ranks involved. This means they must agree on important arguments, such as the root rank of a rooted collective operation (e.g. broadcast, reduce), the type in a reduction, the data size, and so on. We should be careful to specify what it means for each collective operation to be consistent.
Collective operations should only be allowed on the currently scoped team. We should remove the ability to explicitly specify which team a collective operation is over. This will break HPGMG, but it's easily fixable.
Entering a teamsplit is a collective operation over the team in scope before the teamsplit. This allows us to check for alignment when changing the team scope. On the other hand, exiting a teamsplit need not be defined as a collective operation. Though we may instrument the exit with a broadcast, the exit does not need to be aligned across ranks in order to guarantee deadlock freedom.

Alignment options

The following are various options for the alignment of collectives:

1. Textual alignment

This scheme may be more aptly called contextual alignment, since it requires that the program context be consistent across all threads in a collective operation. The basic rule is the following:

If any branch of an explicit or implicit conditional contains a collective operation, then all ranks in the currently scoped team must take the same branch.

Explicit branches include if and switch statements and the ?: operator. Implicit branches include loops and virtual method calls. The following are some examples in this scheme:

void foo() {
  if (myrank() < 5) { // illegal; all ranks must take same branch
    bar();
  } else {
    bar();
  }
  if (ranks() < 5) { // legal; all ranks take same branch
    bar();
  }
  if (myrank() < 5) { // legal; no collective in any branch
    cout << "took first branch" << endl;
  }
}

void bar() {
  barrier();
}

Fully implementing this scheme requires interprocedural analysis to determine whether or not a branch contains a collective operation. In addition, a fully static implementation of this scheme requires a type system to guarantee that conditions are consistent across ranks, as in Titanium [4]. A dynamic implementation requires the ability to instrument conditionals to track which branch is taken, as in [2].

2. Structural correctness

This scheme was proposed in [1] and is somewhat more lenient than textual alignment. Every expression and statement in a program must be structurally correct. In order for a statement or expression E to be structurally correct, all ranks must execute a consistence sequence of collective operations in E. Thus, the following is legal, since the conditional as a whole and each of its substatements are structurally correct:

if (myrank() < 5) {
  barrier();
  ... // some non-collective code
  barrier();
} else {
  ... // other non-collective code
  barrier();
  barrier();
}

The following, however, is illegal, even though it won't deadlock, since neither conditional is structurally correct:

if (myrank() < 5) {
  barrier();
}
if (myrank() >= 5) {
  barrier();
}

Implementing this scheme statically requires the same analysis and type system as in textual alignment. A dynamic implementation would also be very similar to the scheme used for textual alignment.

3. Consistent sequence

The minimal requirement for deadlock freedom is that the sequence of collective operations be consistent across ranks. Let S_{i,t} be the sequence of collective operations on rank i that are over team t. Let T be the set of all teams in the program. We require

$\forall t \in T . \forall i \in t . \forall j \in t .
    |S_{i,t}| = |S_{j,t}| \wedge (\forall k \in |S_{i,t}| . S_{i,t}(k) = S_{j,t}(k))$

To summarize, any two ranks in a team t must perform the same sequence of collective operations over team t. Two collective operations are the same if they are the same collective and if their arguments are consistent.

The last thing required is to define what it means for a collective operation to be over a team t. If we don't restrict collectives to be over the currently scoped team (or if teamsplit is not a collective operation), then we must define a collective to be over team t if it is over t or any ancestors of t. To see why, suppose that ranks 0 and 1 are in team t and perform the following sequence of collectives:

Rank 0                    Rank 1
------                    ------
global_team.barrier()     global_team.barrier()
t.barrier()               global_team.barrier()
t.bcast()                 t.barrier()
global_team.barrier()     t.bcast()

This should be an inconsistent sequence. Clearly the subset of collectives over the global team is consistent. The subset over team t is only inconsistent if we define the global collectives to part of the sequence of collectives over t.

If, on the other hand, we restrict collectives to the currently scoped team and define entry to a teamsplit to be a collective operation, then the analogous sequence would be the following:

Rank 0                    Rank 1
------                    ------
global_team.barrier()     global_team.barrier()
global_team.teamsplit()   global_team.barrier()
t.barrier()               global_team.teamsplit()
t.bcast()                 t.barrier()
global_team.barrier()     t.bcast()

In this case, the subset of collectives over the global team is inconsistent, so we don't have to go out of our way to define the team subset to be inconsistent.

Summary

With collective operations, it's important to define correctness constraints on their use. While we are limited in the amount of checking we can do as a library, we do have multiple options for specifying what is required for collective operations to be correctly aligned. These options impose tradeoffs between intuitiveness and flexibility, so we need to carefully consider which option makes the most sense for UPC++.

Comments (7)