Outgoing progress semantic guarantees too weak for practical use
wait()
should probably be defined to include a flush()
, or something similar.
Otherwise wait() on a future created by message injection could deadlock in the presence of outgoing message aggregation that is not advanced by internal progress.
In particular, the "C++14-compliant definition of wait" given in section 5.4 is probably insufficient.
Requiring wait
to include a full flush is sufficient to solve this problem, but not necessary and probably overkill (eg especially if the future argument is already ready). We just need to ensure that wait
will eventually drain any internal queues on this rank that may be blocking fulfillment of its arguments.
Comments (7)
-
reporter -
reporter - changed title to Outgoing progress semantic guarantees too weak for practical use
- marked as major
-
reporter - changed component to Progress
-
Account Deleted It was never the intention that the implementation could implicitly decide to stall things indefinitely until flushed. Indefinitely aggregating operations (bulk_rpc) were meant to be explicit, but we never spec'd any. Hopefully all comm-ops in the spec state that the operation is guaranteed to reach completion eventually with enough internal progress. Since operations requiring flush don't exist, it might be best to drop this function completely to not create confusion.
-
reporter Since operations requiring flush don't exist, it might be best to drop this function completely to not create confusion.
I'm also in favor of deleting the
flush
function and the two paragraphs quoted in comment 1 above from the spec. They just raise too many questions and concerns.If and when we add interfaces to perform explicitly aggregated communication, we can add the corresponding flush calls along with them.
-
reporter - changed milestone to 2017.09.30 release
-
assigned issue to
-
reporter - changed status to resolved
fix issue
#79: Remove flushThe upcxx::flush() operation is removed, along with related spec prose.
→ <<cset 99eca3b85698>>
- Log in to comment
Thinking more about this, I suspect it may be a more general problem with any blocking call, if we permit an implementation that allows outgoing messages to be aggregated and stalled locally over an unbounded number of internal-progress calls.
Consider:
If this rpc_ff is permitted to stall inside outgoing buffers on rank 0 until the next flush operation, I believe this code could technically deadlock, because barrier does not currently guarantee a flush. This is despite the fact this program is fully "attentive" to UPC++.
This particular case could be solved by adding an implicit flush or eventual flush-like behavior in barrier(). However the same problem arises if the barrier above is replaced by team::split(), or any other blocking collective. Similarly if it's replaced by MPI_Barrier(). Similarly if it's replaced by an acknowledgement rpc_ff with a progress spin-loop, eg:
I think the upshot of allowing unbounded outgoing message aggregation is that in general users may need a flush() call before entering any blocking call or any spin-wait loop to avoid deadlock. Do we really want users sprinkling flush() calls all over their application, even in code that is "attentive" to UPC++? If so, we definitely need to clarify that requirement, probably updating this paragraph in 9.3 (emphasis added):
An alternate solution to all these problems would be to strengthen the semantic of internal progress to guarantee that calls to internal progress will eventually attempt to inject any outgoing operations (contrary to what is stated above). The definition of "eventually" could be left as an implementation tuning parameter, provided it is finite in both time and call count.