- attached issue140.cpp
upcxx::discharge() does not discharge remote_cx::as_rpc()
Currently rput(remote_cx::as_rpc()) is implemented using initiator-side chaining in internal progress. However, upcxx::discharge() is not strong enough to ensure that the initiator-side chaining has completed the RPC injection step, which means if the application discharges and then lapses into inattentiveness, the operation will stall at the initiator.
This is a significant impediment to practical use of rput(remote_cx::as_rpc()) for bulk synchronous applications, as it means a "real" application may need to insert periodic internal progress calls in computation loops to ensure a prior outgoing injection makes progress.
Ideally we eventually eliminate initiator-side chaining from the common case paths of rput(remote_cx::as_rpc()), but until/unless that covers all cases I think we need to strengthen upcxx::discharge() to "do what the user means" for this situation.
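The stall described above can be illustrated with a small toy model in plain C++. This is only a sketch under stated assumptions: ToyInitiator, counts_chained_steps and injected_after_discharge are hypothetical names invented for illustration, not UPC++ internals, and the model reduces "initiator-side chaining" to a queue of deferred steps that only run during internal progress.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <utility>

// Toy model only. None of these names are UPC++ internals.
struct ToyInitiator {
    // Hypothetical switch: does progress_required() see chained steps?
    // false models the reported behavior, true models the desired one.
    bool counts_chained_steps;
    // Deferred RPC injection steps waiting for internal progress.
    std::deque<std::function<void()>> chained;
    int rpcs_injected = 0;

    explicit ToyInitiator(bool counts) : counts_chained_steps(counts) {}

    // Models rput(remote_cx::as_rpc(...)) under initiator-side chaining:
    // the RPC injection is parked until internal progress runs it.
    void rput_then_rpc() {
        chained.push_back([this] { ++rpcs_injected; });
    }

    // One round of internal progress drains the parked steps.
    void internal_progress() {
        while (!chained.empty()) {
            std::function<void()> step = std::move(chained.front());
            chained.pop_front();
            step();
        }
    }

    // A false negative here is what lets discharge() return too early.
    bool progress_required() const {
        return counts_chained_steps && !chained.empty();
    }

    // Models discharge() as "progress until nothing is reported pending".
    void discharge() {
        while (progress_required()) internal_progress();
        assert(!progress_required());
    }
};

// Returns how many RPCs left the initiator after discharge() alone,
// with no further progress calls (the "inattentive" application).
int injected_after_discharge(bool counts_chained_steps) {
    ToyInitiator init(counts_chained_steps);
    init.rput_then_rpc();
    init.discharge();
    return init.rpcs_injected;
}
```

In this model, injected_after_discharge(false) returns 0, meaning the RPC is still parked at the initiator, while injected_after_discharge(true) returns 1.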
Comments (6)
-
reporter -
reporter This was discussed at the 2018-05-02 developer meeting.
We resolved that discharge() should be guaranteed to flush the outgoing remote_cx::as_rpc() messages for any outgoing communication initiated by the current persona, even for an implementation relying upon initiator-side chaining or persona handoff (i.e., no further initiator-side internal progress from any local persona is required). Once fixed, the test program should be guaranteed to succeed.
There are several work items here:
- Strengthen the spec with appropriate wording to require this behavior (closing the current loophole via handoff to a master persona which can return false negatives)
- Strengthen the implementation of initiator-side chaining to ensure that discharge/progress_required enforces this property.
- Convert the implementation of rput*(remote_cx::as_rpc()) to stop using initiator-side chaining whenever possible (using AMLong and VIS target notification instead). This change is expected to both improve latency and resolve this initiator attentiveness problem, but will not be applicable to corner cases (e.g., with a very large RPC closure or serialized argument set).
@jdbachan is currently assigned to investigate each of these.
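The third work item amounts to a per-operation path choice, which might be sketched as below. Everything here is an assumption made for illustration: ToyPath, ToyPut, choose_path and the size_limit parameter are hypothetical, the real limits are not stated in this thread, and mapping non-contiguous puts to VIS target notification is one reading of the comment above rather than confirmed implementation detail.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical labels for the three mechanisms named in the work items.
enum class ToyPath { AMLong, VisTargetNotification, InitiatorChaining };

// Hypothetical description of one rput with an as_rpc() completion.
struct ToyPut {
    bool contiguous;        // contiguous rput vs. VIS (non-contiguous) rput
    std::size_t rpc_bytes;  // serialized size of the RPC closure and arguments
};

// Picks a path. size_limit stands in for the unspecified "certain size
// limits" and must be supplied by the caller.
ToyPath choose_path(const ToyPut& op, std::size_t size_limit) {
    assert(size_limit > 0);
    // Corner case: a very large closure or argument set falls back to chaining.
    if (op.rpc_bytes > size_limit) return ToyPath::InitiatorChaining;
    return op.contiguous ? ToyPath::AMLong : ToyPath::VisTargetNotification;
}

// Only the chaining fallback still depends on initiator attentiveness,
// which is why discharge() must also be strengthened.
bool needs_initiator_progress(ToyPath path) {
    return path == ToyPath::InitiatorChaining;
}
```

The point of the sketch is that the fallback path never disappears entirely, so the first two work items remain necessary even after the conversion.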
-
reporter - marked as major
This issue was triaged at the 2018-06-13 Pagoda meeting and assigned a new milestone/priority.
Note closely related issue #147.
-
reporter - changed milestone to 2019.03.31 release
Mass roll-over of unresolved issues to the next milestone.
-
reporter - changed milestone to 2019.09.30 release
This issue was triaged at the 2019-07-24 Pagoda issue meeting and assigned a new milestone.
Once issue #147 is resolved with an AMLong-based implementation of rput-then-rpc, there will no longer be initiator-side chaining for contiguous rput-then-rpc under certain size limits, solving the most important case of this issue (and the reproducer). However, the issue will remain for remote_cx::as_rpc() completions on other operations such as VIS rputs and upcxx::copy().
-
reporter - changed status to resolved
Resolved in pull request #119 merged at 4e577c2
Test program demonstrating the problem.
Deadlocks on network conduits with 2+ ranks unless compiled with -DFORCE_PROGRESS.