upcxx::discharge() does not discharge remote_cx::as_rpc()

Issue #140 resolved
Dan Bonachea created an issue

Currently rput(remote_cx::as_rpc()) is implemented using initiator-side chaining in internal progress.

However, upcxx::discharge() is not strong enough to ensure that the initiator-side chaining has completed the RPC injection step, which means if the application discharges and then lapses into inattentiveness, the operation will stall at the initiator.
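The stall can be illustrated with a toy, single-process model. This is only a sketch: `ToyInitiator` and all of its members are hypothetical names, not UPC++ APIs, and the model assumes a put's acknowledgment becomes visible on the progress call after the one that injected it. It shows a discharge whose exit condition ignores chained work returning while the RPC injection is still queued at the initiator.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>

// Toy model only: none of these names are UPC++ APIs.
struct ToyInitiator {
    std::size_t unsent_puts = 0;     // puts not yet handed to the "network"
    std::size_t puts_in_flight = 0;  // puts handed off, ack not yet processed
    std::deque<std::function<void()>> chained;  // RPC injections awaiting a put ack
    std::size_t rpcs_injected = 0;

    // Models rput(..., remote_cx::as_rpc(...)) under initiator-side chaining:
    // the RPC injection is queued behind the put's acknowledgment.
    void rput_then_rpc() {
        ++unsent_puts;
        chained.push_back([this] { ++rpcs_injected; });
    }

    // Models one round of internal progress (assumption: an ack is seen on
    // the progress call after the one that injected the put).
    void internal_progress() {
        // First, process acks for puts already in flight and run the
        // chained RPC injection for each of them.
        while (puts_in_flight > 0) {
            --puts_in_flight;
            chained.front()();
            chained.pop_front();
        }
        // Then hand any unsent puts to the network.
        puts_in_flight += unsent_puts;
        unsent_puts = 0;
    }

    // Models the weak discharge: it only waits until the puts have been
    // injected and does not look at the chained RPC injections.
    void weak_discharge() {
        while (unsent_puts > 0) internal_progress();
    }

    // True while an RPC injection is still stuck at the initiator.
    bool stalled() const { return !chained.empty(); }
};
```

In this model, calling `rput_then_rpc()` followed by `weak_discharge()` leaves `stalled()` true; only a further `internal_progress()` call injects the RPC, which is exactly the call an inattentive application never makes.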

This is a significant impediment to practical use of rput(remote_cx::as_rpc()) for bulk synchronous applications, as it means a "real" application may need to insert periodic internal progress calls in computation loops to ensure a prior outgoing injection makes progress.
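As a rough sketch of the workaround described above, a computation loop can invoke a polling callback every few iterations. `sum_with_polling` and its `stride` parameter are hypothetical names chosen for illustration. In a UPC++ program the callback would presumably wrap `upcxx::progress(upcxx::progress_level::internal)`; it is left as a generic callable here so the sketch stays self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper, not a UPC++ API: sums `data`, calling `poll` once
// after every `stride` elements so that pending outgoing work can advance.
double sum_with_polling(const std::vector<double>& data,
                        std::size_t stride,
                        const std::function<void()>& poll) {
    assert(stride > 0);
    double total = 0.0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        total += data[i];
        // Periodic progress call inserted into the compute loop.
        if ((i + 1) % stride == 0) poll();
    }
    return total;
}
```

Choosing `stride` trades polling overhead against how long a chained injection can sit idle, which is part of why this is a burden on application code.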

Ideally we eventually eliminate initiator-side chaining from the common-case paths of rput(remote_cx::as_rpc()), but until/unless that covers all cases, I think we need to strengthen upcxx::discharge() to "do what the user means" in this situation.

Comments (6)

  1. Dan Bonachea reporter

    This was discussed at the 2018-05-02 developer meeting.

    We resolved that discharge() should be guaranteed to flush the outgoing remote_cx::as_rpc() messages for any outgoing communication initiated by the current persona, even for an implementation relying upon initiator-side chaining or persona handoff (i.e., no further initiator-side internal progress from any local persona is required). Once fixed, the test program should be guaranteed to succeed.

    There are several work items here:

    1. Strengthen the spec with appropriate wording to require this behavior (closing the current loophole via handoff to a master persona which can return false negatives)
    2. Strengthen the implementation of initiator-side chaining to ensure that discharge/progress_required enforces this property.
    3. Convert the implementation of rput*(remote_cx::as_rpc()) to stop using initiator-side chaining whenever possible (using AMLong and VIS target notification instead). This change is expected to both improve latency and resolve this initiator attentiveness problem, but will not be applicable to corner cases (e.g., with a very large RPC closure or serialized argument set).
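    Work item 2 can be sketched with a toy, single-process model. This is an illustration under stated assumptions, not the actual implementation: `ToyPersona` and its members are hypothetical names, and acknowledgments are assumed to become visible on the progress call after the put is injected. The point is that the progress-required predicate counts pending chained RPC injections, so discharge cannot return while one is still queued.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>

// Toy model only: none of these names are UPC++ APIs.
struct ToyPersona {
    std::size_t unsent_puts = 0;     // puts not yet handed to the "network"
    std::size_t puts_in_flight = 0;  // puts handed off, ack not yet processed
    std::deque<std::function<void()>> chained;  // RPC injections awaiting a put ack
    std::size_t rpcs_injected = 0;

    // Models rput-then-rpc under initiator-side chaining.
    void rput_then_rpc() {
        ++unsent_puts;
        chained.push_back([this] { ++rpcs_injected; });
    }

    // One round of internal progress: process acks (running the chained
    // RPC injections), then inject any unsent puts.
    void internal_progress() {
        while (puts_in_flight > 0) {
            --puts_in_flight;
            chained.front()();
            chained.pop_front();
        }
        puts_in_flight += unsent_puts;
        unsent_puts = 0;
    }

    // Strengthened predicate: a queued chained injection still counts as
    // progress that the initiator is required to make.
    bool progress_required() const {
        return unsent_puts > 0 || !chained.empty();
    }

    // Strengthened discharge: returns only once every RPC is injected.
    void discharge() {
        while (progress_required()) internal_progress();
    }
};
```

    In this model, after `discharge()` returns, every chained RPC has been injected and the caller may stop making progress calls without stalling the operation.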

    @jdbachan is currently assigned to investigate each of these.

  2. Dan Bonachea reporter

    This issue was triaged at the 2019-07-24 Pagoda issue meeting and assigned a new milestone.

    Once issue #147 is resolved with an AMLong-based implementation of rput-then-rpc, there will no longer be initiator-side chaining for contiguous rput-then-rpc under certain size limits, solving the most important case of this issue (and the reproducer). However, the issue will remain for remote_cx::as_rpc() completions on other operations, such as VIS rputs and upcxx::copy().
