Performance bug in behavior of rput(..., remote_cx::as_rpc(...)) with "bare" remote_cx

Issue #455 resolved
Dan Bonachea created an issue

I just happened to be reading the code in rput.hpp looking for something else and came across this unfortunate decision in the implementation of rput(..., remote_cx::as_rpc(...)):

      // We handle absence of source_cx as assuming synchronous completion as
      // opposed to asynchronous (with an ignored notification) since this (naked
      // remote_cx) is a bizarre thing to ask for. So bizarre that I feel its more
      // likely a user mistake rather than they actually have a source buffer
      // that they feel safe giving over to us forever.
      static constexpr bool src_now = !want_src || Traits::src_is_sync;

IIUC the effect of this is that if a user issues rput with only a remote_cx (no source_cx and no operation_cx), the code currently silently promotes this rput into an eventual call to gex_AM_RequestLong(GEX_EVENT_NOW), which (on several native networks) means the RMA component blocks the injecting thread for a synchronous wire-level round-trip on the network to establish source completion, rather than returning immediately after injection as encouraged by the specification.

I think this was a naive/poor decision that was never carefully analyzed, and it would come as a surprise to users trying to implement an efficient fire-and-forget boundary exchange, which is exactly the use case this rput overload is intended to enable. Contrary to the text of the comment, there are absolutely algorithms like halo exchange that want to use rput(..., remote_cx::as_rpc(...)) as a signalling put, but might use some enclosing quiescence synchronization (e.g. at the end of a bulk-synchronous time step) to establish arrival of all the rput remote_cx notifications and thereby infer the source buffers are safe to overwrite. This slightly "bends" our spec (remote_cx doesn't technically guarantee source_cx) but should not be problematic in practice.
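To make the intended pattern concrete, here is a hedged sketch of such a fire-and-forget halo exchange with enclosing quiescence (the buffer names, arrival counter, and neighbor count are illustrative, not from any real application):

```cpp
#include <upcxx/upcxx.hpp>

// Hypothetical per-rank arrival counter, bumped by the remote_cx RPC
// on the target after the put payload has landed.
int64_t arrivals = 0;

void halo_exchange_step(const double *send_buf,                 // local source buffer
                        upcxx::global_ptr<double> nbr_ghost,    // neighbor's ghost region
                        std::size_t n, int expected_neighbors) {
  // Fire-and-forget signalling put: data lands in the neighbor's ghost
  // cells, then the RPC runs there to record the arrival.
  upcxx::rput(send_buf, nbr_ghost, n,
              upcxx::remote_cx::as_rpc([]() { arrivals++; }));

  // Quiescence at the end of the bulk-synchronous time step: once all
  // expected puts have signalled, every remote_cx (and hence, in the
  // "bent" reading above, every source buffer) is known to be complete.
  while (arrivals < expected_neighbors) upcxx::progress();
  arrivals = 0;
  upcxx::barrier();  // now safe to overwrite send_buf next step
}
```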

We should discuss how this could/should be improved.

Temporarily assigning to Paul for the purpose of discussion.

The recommended workaround for the performance bug is to request source_cx and/or operation_cx in addition to remote_cx on the rput, to avoid this case.
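For example (a sketch; the identifiers are illustrative):

```cpp
#include <upcxx/upcxx.hpp>

// Requesting operation_cx alongside remote_cx avoids the silent upgrade
// to blocking source completion; the returned future tracks the put.
void signalled_put(const double *src, upcxx::global_ptr<double> dst,
                   std::size_t n) {
  upcxx::future<> done =
      upcxx::rput(src, dst, n,
                  upcxx::operation_cx::as_future() |
                  upcxx::remote_cx::as_rpc([]() { /* target-side signal */ }));
  done.wait();  // or stash the future and overlap other work before waiting
}
```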

Comments (7)

  1. Paul Hargrove

    Hmm. First reaction is "ugh".

    Second is realization that this might already be performing "OK" for small-enough payloads on some conduits. On aries and ibv we use a "packed Long" up to some threshold, which provides synchronous LC "for free". On aries we have true local completion, and have in the past determined it precedes remote by a measurable amount of time. On ibv we use a pre-pinned bounce buffer for Long payloads up to 64KB precisely because AM Long cares only about LC and never about RC.

    Next thought is that use of NPAM Long and GASNet-allocated buffer (with EX develop, but not yet stable or memory_kinds, IIRC) is now an efficient approach up to "some size". Work in progress for our next release will correctly advertise that size, and enforce it as the maximum. This work will additionally remove an extra copy in ibv-conduit relative to the reference implementation.

    However, beyond the limit of NPAM Long (or if use of NPAM is impractical) we lack a "fire and forget" for AM Long payloads. The closest I can think of would be an ugly hack of using GEX_EVENT_GROUP and simply neglecting to call gex_NBI_Wait() until upcxx::finalize() (or maybe at barriers?). Of course if we go that route, then there is probably little or no motivation to use NPAM Long even for sizes where it is valid.

    All of the previous is based on just looking at the GASNet-EX calls one might make. There is also the option to have the UPC++ runtime manage a temporary buffer it owns.
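    For reference, the GEX_EVENT_GROUP idea might look roughly like this (a hedged sketch using the unsuffixed AM name as shorthand; the team, handler index, and argument list are illustrative):

    ```cpp
    /* Inject the Long with LC folded into the implicit-event group,
     * instead of blocking for GEX_EVENT_NOW at injection time. */
    gex_AM_RequestLong(tm, target_rank, hidx_signal_handler,
                       src_buf, nbytes, dst_addr,
                       GEX_EVENT_GROUP, /*flags=*/0 /*, handler args... */);

    /* ... much later, e.g. in upcxx::finalize() or at a barrier,
     * reap all outstanding AM local completions at once: */
    gex_NBI_Wait(GEX_EC_AM, 0);
    ```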

  2. Dan Bonachea reporter

    The source buffer of the rput is user-owned memory containing the contiguous source data. So NPAM-GB/Long isn't really relevant unless we want to additionally force a memcpy in the Prepare/Commit interval, which (assuming the payload fits in LUBMedium) I think makes it functionally equivalent to an FPAM/Long on a conduit with a packed-long optimization. If the payload exceeds the NPAM-GB limit then we are forced (as you say) to use FPAM. NPAM-CB seems no better or worse than FPAM. So in summary I don't see how NPAM really helps (at least for the native conduits, and the reference ones have synchronous LC for AMLong anyhow).

    I agree we don't have a direct way to express "fire-and-forget" AM Long LC, which is why I wanted your thoughts on this. I agree use of GEX_EVENT_GROUP is slightly ugly, but seems a reasonable resolution to get the desired behavior and should perform better on native conduits for sizes beyond the packed-long threshold. Those longer payloads are exactly where we do NOT want to be stalled for wire-level completion of the RDMA, although I guess it's debatable how much this arises in real applications.

  3. Paul Hargrove

    FPAM Long places a limit on the packed-long optimization which defaults to much less than the Max Medium, because it is tuned to optimize a flood-bandwidth test, not to minimize the delay to LC. So there is a meaningful difference relative to FPAM:

    ibv: MAX_MEDIUM=64KiB vs MAX_PACKED_LONG=4012
    aries: MAX_MEDIUM=8128 vs MAX_PACKED_LONG=2KiB (KNL) or 3KB (Haswell)

    In both cases there is roughly a 4:1 ratio between the respective limits.

    Note that if we were to implement the GEX_FLAG_LC_COPY_YES flag, then I'd imagine one place it would have an effect is in forcing the use of packed-long and bounce buffers to an extent they might not otherwise be used - ignoring the packed-long threshold being one example. So maybe that is work in GEX we might consider motivated by this issue?

  4. Dan Bonachea reporter

    Assuming we accept spec PR 68, which changes remote_cx to strongly guarantee source_cx, then I think we are well-justified in fixing this performance bug and removing the silent upgrade to fully-blocking synchronous source completion on the data payload. That should eliminate the hidden round-trip synchronous latency currently paid in this case for put payloads over a few KB.
