Should finalize perform termination detection? (quiescence)
Consider the following program:
    int main(int argc, char *argv[]) {
      upcxx::init(); // initialize UPC++
      {
        auto fut = upcxx::rpc(0, []() { std::cout << "one" << std::endl; });
        fut.then([]() { std::cout << "two" << std::endl; });
      }
      upcxx::finalize(); // finalize UPC++
      return 0;
    }
Note this program never calls upcxx::progress, and the future returned by .then is discarded. Assume this is running with multiple ranks connected via an unordered network.
Is this program guaranteed to print anything?
What about if the user inserts a call to upcxx::barrier() before finalize?
If the answer in either case is "yes" then I don't see that guarantee in the spec. If the answer is "no", then I think the semantics of calling finalize while operations are in-flight might be underspecified.
The basic issue here is whether we provide the user with any kind of termination detection in finalize (I think the answer is currently no), or whether they are responsible for tracking all their asynchronous operations and reaching a quiescence consensus before calling finalize(). If it's the latter then the question becomes what happens when they call finalize while operations are in-flight. More generally, finalize is not currently specified as a collective call, so what happens when one rank finalizes while others are still running and injecting?
-
Account Deleted I'm against baking in termination detection solely because of rpc_ff. It would force the implementation to always send internal replies confirming that each rpc_ff was received (or something more complicated involving message counters). I'm a big fan of termination detection, but I think it should be opt-in as a layer over our primitives.
The second-nicest semantics after termination detection would be for finalize to be collective and forcibly quiesce the system by dropping all in-flight operations. Since gasnet1 doesn't help us with that, we would still need full termination detection in our implementation. Not helpful.
So thirdly, we can just say finalize is collective but UB if any comm is in-flight. All notifications sitting in progress queues will be destructed to ensure all memory held by futures and continuations is reclaimed.
-
reporter First I should say this question was not motivated by rpc_ff. My code example does not contain that call, nor does the "hello world showing RPC" version on slide 15 of Scott's slides, which suffers from exactly the same class of bug. We could label this as "sloppy" UPC++ programming that generates UB, but I suspect it will be a very easy mistake to make in a large app.
Due to the lack of ordering guarantees, requiring the user to ensure quiescence before calling finalize() implies it's usually an error to discard any future the library generates from any call (e.g. it means tracking the return value from every .then() call, unless the callback satisfies other dependencies that amount to equivalent accounting). Not a very user-friendly semantic, especially if we provide no help to detect the flaw. To make matters worse, the probability of encountering a race that generates UB by violating an implicit rule probably increases with job scale.
To clarify, I'm definitely not advocating sending any additional steady-state messages to support termination detection (no "extra replies" needed). I believe there are actually very simple (global/collective) termination detection algorithms that send no extra steady-state messages.
A sketch for one very simple algorithm: all ranks keep a signed counter of operations, initialized to zero, incremented when that rank injects an operation and decremented when one is received/retired (regardless of initiator; all that matters is that each injection is matched to exactly one retirement somewhere in the system). At collective quiesce, the master persona on all ranks alternates between calling progress(user); discharge() and performing a collective allreduce of the current counter values (during which progress on the local queues is temporarily suspended), until the reduction result reaches zero. This algorithm adds very little to the steady-state behavior, and there are many possible sophistications to streamline it further. Other completely different algorithms are well understood.
Asking the user to implement this in a layer atop UPC++ would be significantly more complicated and ugly, because quiescence is a system-wide property and the client does not have access to the progress engine, internal queue states, or even the init/finalize reference count of the library. In particular, the user has no foolproof method to ensure he adds accounting for every rpc injection and retirement (unless we also add hooks to expose all that, and/or he wraps literally every possible call with his own complete API).
Lastly, if we're really worried about burdening the client, the option to perform termination detection could be provided either through a separate, optional collective quiesce() call (which might actually be useful for phase changes, independent of finalization), or even as an optional configure-time setting to opt in to functional quiescence.
I don't propose we adopt a quiescence solution this week, but longer term I think we should give it serious consideration.
-
Account Deleted I was wrong in saying that rpc_ff was my sole concern. And thanks for the references Dan. I have lots of experience implementing exactly the counting and reducing algorithm you sketched out. Burden of implementation is not an issue, and as you point out, neither are the overheads during steady state. My reluctance about termination detection in finalize stems from:
- It's far more useful to have termination detection available as a per-epoch capability, so the user can apply it to the phases of async computation where they need it. This is your quiesce(); if that were available, finalize wouldn't need the smarts, since they could just quiesce before finalize. And since quiesce isn't available, somehow the user is already making sure messages don't bleed between their epochs (ignoring the case of algorithms which use resources in a single-assignment manner to be tolerant of such bleeding). Why can't they keep that up all the way to finalize?
- Termination detection has multiple degrees of freedom in the feature set: Is it supported over teams? Are multiple classes of messages allowed (say, "probe" messages which don't keep the system alive and are silently discarded if in-flight during quiescence convergence)? Can a message participate in multiple overlapping regions of detection (I actually have a use case for this)? I wouldn't presume to know the sweet spot yet. This makes me pretty sure that quiescence ought to be a layer built over our rpc. I have demonstrated the feasibility of this layering to myself; the only tricky part is making sure you send rpcs using the right layer everywhere (as you point out).
I guess I just feel that making finalize do detection is a waste of our time. Algos that need this feature will want it per-epoch. Algos that don't need it can bring the system to quiescence just fine.
-
reporter It sounds like we should probably discuss this high-level design issue as a group, for action post-release.
For the purposes of this week's spec, should we add text (probably in the finalize preconditions) clarifying that calling finalize with any operations in-flight (globally?) leads to undefined behavior? (at least for now)
Also, if this is our position for this spec release (i.e. not providing a quiesce capability, but requiring quiescence before finalize), we should consider also fixing the third example in section 5.2 and the first example in 5.5 to NOT discard the result of future::then(), since that's an unsafe practice in general under this regime (i.e. it leads to a program that cannot be safely finalized).
-
Account Deleted Agreed on all points.
-
reporter c8bea5d clarifies that finalize with in-flight operations leads to undefined behavior.
This issue remains open to consider the bigger question of whether we provide a quiesce feature (after this week).
-
Finalize is collective over all ranks. The current draft spec does not provide quiescence. A future UPC++ spec might provide a quiesce feature.
This issue is to remain open, but for V1.0 Draft 2 quiescence is not a feature.
-
- changed status to on hold
We will re-open this issue after our ECP Milestone.
-
reporter - changed status to open
The issue tracker filters do not include a way to show on-hold issues together with open issues.
-
According to the current specification of finalize(), the "Remote Hello World Example" (currently in Section 8.2) has undefined behavior, since it calls finalize() when there may be in-flight ops. This is not fixable by adding a barrier to the code, but it can be fixed by specifying that finalize() performs a barrier on entry.
-
reporter At the 8-9-17 Pagoda meeting we resolved that finalize() should be updated to include an implicit entry barrier. The user is still required to ensure quiescence before the last rank enters the finalize() call. This needs to be added to the spec.
The more general question of providing quiescence tools to help the user with this problem was tabled, likely until after the September release.
-
reporter - changed milestone to 2018.03.31 release
- changed component to Init and Finalize
-
reporter - changed milestone to 2017.12.31 release
-
reporter - changed title to Should finalize be collective and perform termination detection? (quiescence)
- changed milestone to 2018.09.30 release
This issue was discussed in the 1/10/18 meeting and we agreed to defer it to the September release.
-
reporter - changed milestone to 2019.09.30 release
- marked as minor
- changed title to Should finalize perform termination detection? (quiescence)
This issue was triaged at the 2018-06-13 Pagoda meeting and assigned a new milestone/priority.
-
reporter - changed milestone to 2020.3.0 release
This issue was triaged at the 2019-07-24 Pagoda issue meeting and assigned a new milestone.
-
reporter - changed milestone to 2020.9.0 release
This was discussed in the 2020-02-12 meeting and deferred to next release milestone.
-
reporter - changed milestone to 2021.3.0 release
Mass roll-over of open issues to next release milestone
-
reporter - changed milestone to 2021.9.0 release
Mass roll-over of open issues to next release milestone
-
reporter - changed milestone to 2022.3.0 release
Mass roll-over of open issues to next release milestone
-
reporter - changed milestone to 2022.9.0 release
Mass roll-over of open issues to next release milestone
-
reporter - changed milestone to Deferred indefinitely
Remove from milestone, since there are no near-term plans to deploy this feature.
In order to do termination detection, we would, at the minimum, need to:
- make finalize() collective
- have finalize() make user-level progress and block until global completion of all asynchronous operations on all ranks
Of course, this would prevent finalization of a subset of ranks, so we'd have to decide if that's something we want to allow.
We could also abstract some of the above into a separate function, such as some sort of collective that waits on global completion of all outstanding asynchronous operations. For a non-world team, this would only be able to wait for completion of asynchronous operations initiated by a member of the team.
I think @jbachan would have a better idea of how feasible all this would be.