Fix empty transfer ambiguities (count=0 RMA)

Issue #141 resolved
Dan Bonachea created an issue

Many of our RMA functions take an explicit size_t count argument (or equivalent), and include semantic wording such the following:

Precondition: The addresses in [src,src+count) and [dest,dest+count) must not overlap.

Initiates an operation to transfer and store the count items of type T beginning at src to the memory beginning at dest

Remote (completion): Indicates completion of the transfer of the values, implying readiness of the target buffer [dest,dest+count).

In most(all?) cases, the preconditions do not prohibit the case of count == 0, or equivalent means of specifying an "empty transfer". However it's not always clear what semantics are required for this degenerate case. For example, consider the following call:

promise <> prom;
auto fut = 
upcxx::rput<int>(nullptr, global_ptr<int>(nullptr), // src, dst
                 0, // count
                 operation_cx::as_future() |
                 source_cx::as_promise(prom) |
                 remote_cx::as_rpc([]() { cout << "RC" << endl; })
);

I don't think anyone would mistake this for implying a real data transfer (that part of the wording is clear) but it raises a number of questions regarding other aspects of the operation:

  1. Does this call violate the overlap precondition? What if this is upcxx::copy and both addresses are global_ptr<int>(nullptr)?
  2. Is fut.ready() representing this empty transfer still required to report non-ready until at least the next user-level progress?
  3. Similarly, is the call still guaranteed to increment the promise dependency count to represent this empty transfer, and not decrement it until the next user-level progress?
  4. Is the remote completion still guaranteed to eventually run, and (perhaps more importantly) where does it run?

I think we can devise reasonable answers to all of these questions, this issue pertains to the fact that those need to be explicitly specified to avoid ambiguity.

I believe all the following functions in the current draft suffer from these types of degenerate empty-transfer ambiguities:

  • upcxx::rput(count=0) (bulk overload)
  • upcxx::rget(count=0) (bulk overload)
  • upcxx::{rput,rget}_irregular() (several independent ways to say "empty transfer")
  • upcxx::{rput,rget}_regular() (several independent ways to say "empty transfer")
  • upcxx::reduce_{one,all}(count=0) (bulk overload)
  • upcxx::broadcast(count=0) (bulk overload)
  • upcxx::copy() (pull request #18)

One possible resolution to all these ambiguities is to add preconditions everywhere requiring these transfers to always be non-empty. However this seems overly simplistic and I'm worried it may result in corner-case UB (or extraneous checking) in layered user codes that don't think about the empty-transfer case.

Comments (10)

  1. Amir Kamil

    I’d like to hear @john bachan 's opinion on what would be easiest to implement. But here are my thoughts:

    1. Null pointers should be prohibited as either argument, even if the count is 0. This matches the behavior of std::memcpy (https://en.cppreference.com/w/cpp/string/byte/memcpy).
    2. Yes. In general, source/operation completion events should not occur until the next user-level progress.
    3. Yes.
    4. Yes, at dest.where().
  2. Dan Bonachea reporter

    @Amir Kamil I agree that your proposal makes sense as the guiding principles to fix these ambiguities, provided @john bachan confirms this seems reasonable to implement. Obviously we need to iterate on actual spec language.

    I'd also advocate we use language modeled after C memcpy to make it clearer that 0-byte transfers where non-null src==dst do NOT violate the overlap prohibition. I don't like our current reliance on the reader's interpretation of overlap between empty ranges of memory. C's language:

    If copying takes place between objects that overlap, the behavior is undefined.

  3. Dan Bonachea reporter

    On a closely related topic, Scott noticed the spec doesn't appear to explicitly state that remote completion for normal (non-empty) rputs enlists the RPC on the master persona at dest.where(). This should probably be explicitly stated in each remote completion bullet as part of fixing these ambiguities.

    Regarding @Amir Kamil 's proposal above, I think we still have an ambiguity to solve for rput_(ir)regular with empty sequences and remote_cx::as_rpc. Unless we entirely prohibit empty vector/index put sequences (which do not supply any destination pointers), this still leaves us without a well-defined place to run the remote completion RPC.

  4. Amir Kamil

    I suggest we either prohibit remote_cx::as_rpc in the case of an empty VIS transfer, or we say that the location at which the completion runs is unspecified in that case. (The latter would require us to run the completion somewhere, unlike the former.)

  5. Amir Kamil

    My interpretation of the C++ memory ordering for reduce and broadcast is that the call cannot be a no-op in the case of count=0 due to the implied synchronization. Is that what we want?

  6. Dan Bonachea reporter

    My interpretation of the C++ memory ordering for reduce and broadcast is that the call cannot be a no-op in the case of count=0 due to the implied synchronization.

    I agree that's what we want - a zero-byte broadcast and zero-byte reduce_one are both valid ways to synchronize a team (one-to-many and many-to-one) that are not conveniently provided elsewhere.

    However since you mention it, I have a nitpick with the reduce para:

    C++ memory ordering: With respect to all threads participating in this collective, if a thread receives the results of the collective (the calling thread on the root process in the first and third variants; all calling threads in the second and fourth variants), all evaluations which are sequenced-before that thread’s invocation of this call will have a happens-before relationship with all evaluations sequenced after the operation-completion notification actions

    I think "that thread's" should be "each participating thread's" - ie we are stating a synchronization property that crosses threads, not just pointing out ordering within a single thread. Even then the para is a bit awkward because only the portion about operation-completion notification refers to evaluations on the "receiving" threads - the set of evaluations sequenced before the reduce call includes all threads on the team, even if they don't get a result.

  7. Amir Kamil

    I propose the following:

    C++ memory ordering: With respect to all threads participating in this collective, all evaluations which are sequenced-before any thread’s invocation of this call will have a happens-before relationship with all evaluations sequenced after the operation-completion notification actions (future readying, promise fulfillment, or persona LPC enlistment) on the threads that receive the results of the collective (the calling thread on the root process in the first and third variants; all calling threads in the second and fourth variants).

    @john bachan @Dan Bonachea Thoughts?

  8. Dan Bonachea reporter

    @Amir Kamil your proposed ordering wording sounds like a huge improvement and addresses my concern.

  9. Log in to comment