Should finalize perform termination detection? (quiescence)

Issue #53 open
Dan Bonachea created an issue

Consider the following program:

#include <iostream>
#include <upcxx/upcxx.hpp>

int main(int argc, char *argv[]) {
  upcxx::init(); // initialize UPC++
  {
    // send an rpc to rank 0; the returned future completes after a round trip
    auto fut = upcxx::rpc(0, []() { std::cout << "one" << std::endl; });
    // chain a local continuation, discarding the future that .then returns
    fut.then([]() { std::cout << "two" << std::endl; });
  }
  upcxx::finalize(); // finalize UPC++
  return 0;
}

Note this program never calls upcxx::progress, and the future returned by .then is discarded. Assume this is running with multiple ranks connected via an unordered network.

Is this program guaranteed to print anything?

What about if the user inserts a call to upcxx::barrier() before finalize?

If the answer in either case is "yes" then I don't see that guarantee in the spec. If the answer is "no", then I think the semantics of calling finalize while operations are in-flight might be underspecified.

The basic issue here is whether we provide the user with any kind of termination detection in finalize (I think the answer is currently no), or whether they are responsible for tracking all their asynchronous operations and reaching a quiescence consensus before calling finalize(). If it's the latter then the question becomes what happens when they call finalize while operations are in-flight. More generally, finalize is not currently specified as a collective call, so what happens when one rank finalizes while others are still running and injecting?

Comments (23)

  1. Amir Kamil

    In order to do termination detection, we would, at the minimum, need to:

    • Track global completion of all asynchronous operations
    • Make finalize() collective
    • Specify that finalize() makes user-level progress and blocks until global completion of all asynchronous operations on all ranks

    Of course, this would prevent finalization of a subset of ranks, so we'd have to decide if that's something we want to allow.

    We could also abstract some of the above into a separate function, such as some sort of collective that waits on global completion of all outstanding asynchronous operations. For a non-world team, this would only be able to wait for completion of asynchronous operations initiated by a member of the team.
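
    As a strawman, such a collective might look like the following (purely hypothetical, not proposed spec text):

      // Hypothetical: collective over tm; blocks, making user-level progress,
      // until all asynchronous operations initiated by members of tm have
      // completed globally.
      void quiesce(const upcxx::team &tm = upcxx::world());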

    I think @jbachan would have a better idea of how feasible all this would be.

  2. Former user Account Deleted

    I'm against baking in termination detection, solely because of rpc_ff. It would force the implementation to always send internal replies confirming that each rpc_ff was received (or something more complicated involving message counters). I'm a big fan of termination detection, but I think it should be opt-in as a layer over our primitives.

    I think the next-best semantics after termination detection would be for finalize to be collective and to forcibly quiesce the system by dropping all in-flight operations. Since gasnet1 gives us no help with that, we would still need full termination detection in our implementation anyway. Not helpful.

    So thirdly, we can just say finalize is collective but UB if any communication is in-flight. All notifications sitting in progress queues would be destructed, ensuring that memory held by futures and continuations is reclaimed.

  3. Dan Bonachea reporter

    First I should say this question was not motivated by rpc_ff. My code example does not contain that call, nor does the "hello world showing RPC" version on slide 15 of Scott's slides, which suffers from exactly the same class of bug. We could label this as "sloppy" UPC++ programming that generates UB, but I suspect it will be a very easy mistake to make in a large app.

    Due to the lack of ordering guarantees, requiring the user to ensure quiescence before calling finalize() implies it's usually an error to discard any future the library generates from any call (e.g., it means tracking the return value from every .then() call, unless the callback satisfies other dependencies that amount to equivalent accounting). That is not a very user-friendly semantic, especially if we provide no help to detect the flaw. To make matters worse, the probability of encountering a race that generates UB by violating this implicit rule probably increases with job scale.
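
    For illustration, under this regime a version of my original example that can be safely finalized has to track and wait on every future, then barrier so that no rank enters finalize while another rank's rpc is still in flight:

      auto fut  = upcxx::rpc(0, []() { std::cout << "one" << std::endl; });
      auto fut2 = fut.then([]() { std::cout << "two" << std::endl; });
      fut2.wait();       // rpc round trip and local continuation both retired
      upcxx::barrier();  // all ranks have retired their own operations
      upcxx::finalize(); // system is quiescent, so finalize is safe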

    To clarify, I'm definitely not advocating sending any additional steady-state messages to support termination detection (no "extra replies" needed). I believe there are actually very simple (global/collective) termination detection algorithms that send no extra steady-state messages.

    A sketch for one very simple algorithm: all ranks keep a signed counter of operations, initialized to zero, incremented when that rank injects an operation and decremented when one is received/retired (regardless of initiator; all that matters is that each injection is matched to exactly one retirement somewhere in the system). At collective quiesce, the master persona on all ranks alternates between calling progress(user); discharge() and performing a collective allreduce of the current counter values (during which progress on the local queues is temporarily suspended), until the reduction result reaches zero. This algorithm adds very little to the steady-state behavior, and there are many possible sophistications to streamline it further. Other, completely different algorithms are also well-understood.
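
    For concreteness, a minimal sketch of that counting layer, written as a user-level wrapper (counted_rpc_ff and quiesce are names invented here for illustration; none of this is proposed API text):

      #include <upcxx/upcxx.hpp>

      static long ops_counter = 0; // per-rank signed counter, initialized to zero

      // Inject an rpc_ff through the counting layer: one increment at the
      // initiator is matched by exactly one decrement at the target.
      template <typename Fn>
      void counted_rpc_ff(upcxx::intrank_t target, Fn fn) {
        ops_counter++; // injection (assumes fn is serializable for rpc_ff)
        upcxx::rpc_ff(target, [fn]() {
          fn();
          ops_counter--; // retirement, on whichever rank the rpc executes
        });
      }

      // Collective quiesce: alternate progress/discharge with an allreduce of
      // the counters until the global sum reaches zero. (A production version
      // needs more care, e.g. a confirming second wave, since messages can
      // cross a single reduction.)
      void quiesce() {
        long sum;
        do {
          upcxx::progress();  // execute locally queued handlers
          upcxx::discharge(); // push locally buffered injections to the wire
          sum = upcxx::reduce_all(ops_counter, upcxx::op_fast_add).wait();
        } while (sum != 0);
      }

    Note the steady-state cost is just one counter update per injection and one per retirement; no extra messages are sent until quiesce itself runs.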

    Asking the user to implement this in a layer atop UPC++ would be significantly more complicated and ugly, because quiescence is a system-wide property and the client does not have access to the progress engine, internal queue states, or even the init/finalize reference count of the library. In particular, the user has no foolproof method to ensure he adds accounting for every rpc injection and retirement (unless we also add hooks to expose all that, and/or he wraps literally every possible call with his own complete API).

    Lastly, if we're really worried about burdening the client, the option/ability to perform termination detection could be provided either through a separate/optional collective quiesce() call (which might actually be useful for phase changes, independent of finalization), or even as an optional configure-time setting to opt-in to functional quiescence.

    I don't propose we adopt a quiescence solution this week, but longer term I think we should give it serious consideration.

  4. Former user Account Deleted

    I was wrong in saying that rpc_ff was my sole concern. And thanks for the references, Dan. I have lots of experience implementing exactly the counting-and-reducing algorithm you sketched out. Burden of implementation is not an issue, and as you point out, neither are the overheads during steady state. My reluctance about termination detection in finalize stems from:

    1. It's far more useful to have termination detection available as a per-epoch capability, so the user can apply it to the phases of async computation where they need it. This is your quiesce(); if that were available, finalize wouldn't need the smarts, since users could just quiesce before finalize. And since quiesce isn't available, somehow the user is already making sure messages don't bleed between their epochs (ignoring algorithms that use resources in a single-assignment manner to be tolerant of such bleeding). Why can't they keep that up all the way to finalize? (See the usage sketch at the end of this comment.)

    2. Termination detection has multiple degrees of freedom in its feature set: Is it supported over teams? Are multiple classes of messages allowed (say, "probe" messages which don't keep the system alive and are silently discarded if still in-flight during quiescence convergence)? Can a message participate in multiple overlapping regions of detection (I actually have a use case for this)? I wouldn't presume to know the sweet spot yet. This makes me pretty sure that quiescence ought to be a layer built over our rpc. I have demonstrated the feasibility of this layering to myself; the only tricky part is making sure every rpc is sent through the right layer everywhere (as you point out).

    I guess I just feel that making finalize do detection is a waste of our time. Algorithms that need this feature will want it per-epoch. Algorithms that don't need it can bring the system to quiescence just fine.
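
    To illustrate the per-epoch point, a hedged sketch of how a hypothetical quiesce() (like the layer sketched in the previous comment) would be used between phases; launch_async_work and num_epochs are placeholder user code:

      for (int epoch = 0; epoch < num_epochs; epoch++) {
        launch_async_work(epoch); // user code injecting rpc's for this phase
        quiesce();                // collective: no messages bleed into epoch+1
      }
      upcxx::finalize();          // trivially safe: already quiescent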

  5. Dan Bonachea reporter

    It sounds like we should probably discuss this high-level design issue as a group, for action post-release.

    For the purposes of this week's spec, should we add text (probably in the finalize preconditions) clarifying that calling finalize with any operations in-flight (globally?) leads to undefined behavior? (at least for now)

    Also, if this is our position for this spec release (i.e., not providing a quiesce capability, but requiring quiescence before finalize), we should consider also fixing the third example in Section 5.2 and the first example in Section 5.5 to NOT discard the result of future::then(), since that's an unsafe practice in general under this regime (i.e., it leads to a program that cannot be safely finalized).

  6. Dan Bonachea reporter

    c8bea5d clarifies that finalize with in-flight operations leads to undefined behavior.

    This issue remains open to consider the bigger question of whether we provide a quiesce feature (after this week).

  7. BrianS

    Finalize is collective over all ranks. The current draft spec does not provide quiescence. A future UPC++ spec might provide a quiesce feature.

    This issue is to remain open, but for V1.0 Draft 2 this is not a feature.

  8. Dan Bonachea reporter
    • changed status to open

    The issue tracker filters do not include a way to show on-hold issues alongside open issues.

  9. Amir Kamil

    According to the current specification of finalize(), the "Remote Hello World Example" (currently in Section 8.2) has undefined behavior, since it calls finalize() when there may be in-flight ops. This is not fixable by adding a barrier to the code, but it can be fixed by specifying that finalize() performs a barrier on entry.

  10. Dan Bonachea reporter

    At the 8-9-17 Pagoda meeting we resolved that finalize() should be updated to include an implicit entry barrier. The user is still required to ensure quiescence before the last rank enters the finalize() call. This needs to be added to the spec.
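
    Illustratively, the resolved semantics amount to something like the following (not implementation or spec text):

      void upcxx::finalize() {
        // implicit entry barrier: teardown begins only after every rank has
        // entered finalize(); the user must still ensure all operations are
        // retired before the last rank enters
        upcxx::barrier();
        // ... library teardown ...
      }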

    The more general question of providing quiescence tools to help the user with this problem was tabled, likely until after the September release.
