Examine/Optimize handling of RPC argument serialization for l-value refs

Optimize handling of RPC argument serialization for l-value refs

In the 2020-07-29 meeting we had some discussion over RPC serialization behavior with various different input reference types.

John explained the runtime constructs a bind object that wraps up everything going over the wire for the RPC, which is then serialized. RPC takes its arguments by universal reference. We believe for input arguments passed by r-value reference (T&&), the user argument is moved into the bind object and later serialized from there. However for all other cases (including the common cases of T& and T const &) we believe the bind object is actually copying the user argument data, and then serializing from the copy, regardless of source_cx specification. This is problematic for several reasons:

Correctness: it means the input argument must be CopyConstructible. We currently specify this requirement for as_rpc but NOT for rpc/rpc_ff. As a result, the program below fails to typecheck the first call to rpc when T is not CopyConstructible
Performance. We'd like to avoid copying the user data into the bind object, which should always be unnecessary (and might in fact be very costly). Instead, when given an l-value reference the bind object should save a reference to the input argument, and serialize though that reference, removing one data copy.

Now that we have real, generalized Serialization, I think it's inappropriate for RPC argument serialization to continue requiring CopyConstruction. We've provided an entire concept and general infrastructure dedicated specifically to handling the semantic copy between an object and its serialized representation, so RPC should be using only that and not introducing a side-channel reliance on another form of data copying that the client might not consider appropriate for his types.

When implementing this optimization, we need to carefully still ensure that serialization is finished consuming the input argument data before signalling source_cx completion.

@Amir Kamil agreed to look into this, as time allows.

#include <upcxx/upcxx.hpp>
#include <iostream>
#include <assert.h>

struct T {
  int x;
  T(int v) : x(v) {}
  T(T && other) : x(other.x) {} // MoveCons
#if COPYABLE
  T(T const & other) : x(other.x) {} // CopyCons
#endif
  UPCXX_SERIALIZED_VALUES(x);
};

int main() {
  upcxx::init();

  T t(42);

  upcxx::rpc(0, [](T const &t) { assert(t.x == 42); }, t).wait(); // confusing type error
  upcxx::rpc(0, [](T const &t) { assert(t.x == 42); }, std::move(t)).wait(); // works
  upcxx::rpc(0, [](T const &t) { assert(t.x == 42); }, T(42)).wait(); // works

  upcxx::barrier();
  if (!upcxx::rank_me()) {
    std::cout << "SUCCESS" << std::endl;
  }

  upcxx::finalize();
  return 0;
}

Clarifying timing of move/serialization for r-value ref inputs

Tangentially related - In the 2020-07-29 meeting we identified an important case that we think may need to be revisited:

sf,of = rpc(0, source_cx::as_future() | operation_cx::as_future(), 
                     T() ); // arg as r-val reference
// arg is now "dead" as far as the compiler is concerned
sf.wait();

In this case the runtime would need to immediately move or serialize the rvalue reference input (whose lifetime might be ending), because deferring move or serialization past return would be unsafe. Thinking more about this case, I think the current spec rules this to be a user error, because the caller did not ensure the lifetime of the input remained valid until the source completion event.

I see several potential resolutions:

Leave it as-is (i.e., consider this an invalid program) to allow for the move and serialization to happen later. Should probably still clarify the spec to make this explicit.
Add a special case that effectively says that r-value input arguments are conservatively always at least moved before return, allowing them to die in the caller.
Add a special case that effectively says that r-value input arguments are conservatively always handled as if source_cx::as_buffered() was requested (meaning both move AND serialization happen before return), also allowing them to die in the caller.

IIUC @john bachan's explanation of the current implementation, the move is happening before injection return anyhow (although serialization might be deferred?) - ie option 2, regardless of source_cx specification. So specifying the freedom to delay the move (option 1) might not actually buy us anything useful in practice, and creates a landmine for the user if we ever change the implementation to exploit it. The stronger option 3 might require implementation changes, and restricts our freedom in ways we might want to avoid (delaying serialization based on buffer availability is an important future direction).

Optimize handling of RPC argument serialization for l-value refs

Clarifying timing of move/serialization for r-value ref inputs

Comments (2)