Our RPC is currently designed to minimize overheads for passing arguments or captures of small pass-by-value POD types.
Currently we lack a good story for how a UPC++ application can efficiently pass dynamically-sized arguments (think a few KB) to an RPC. Specifically, if the user has a payload of data stored in a reference type that needs to be accessible to the RPC in an anonymous location at the target (akin to GASNet AMMedium), we don't provide a good way to express that. This use case is important because it is the one-sided analogue to a message send.
If the initiator has previously marshalled a landing zone at the target, it could use the rput-then-rpc feature we'll soon be exposing. This issue deals with the case where the initiator has NOT reserved a landing zone at the target suitable for an rput (e.g. due to an irregular comm pattern or scalability constraints), and the overhead of a rendezvous round-trip to establish one would exceed the benefit from an RDMA transfer.
The only technique that comes close in the current implementation is to rely on the undocumented Serializability of std::vector<T> to pass the data argument; however, that has several serious performance drawbacks: (1) if the input data is not already in a vector (or is a subset of a vector), the application needs to create a vector and copy the data once before initiating RPC, and (2) the implementation at the target will perform dynamic allocation and copy the data upon arrival while deserializing into a std::vector, even if the RPC callback does not care about the std::vector container and just needs access to the data elements. These problems arise because std::vector is an "owning" container.
I think what we eventually want here is a Serializable, non-owning container that can be used to describe input data in-place on the initiator, and provide the arrived data in-place in the network buffer at the target. The goal would be the data elements are copied exactly once during Serialization at the initiator (from their location in app data structures to the outgoing network buffer), and made available to the RPC at the target directly from the incoming network buffer (no target-side copies).
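To make the intended data flow concrete, here is a minimal standalone sketch that does not use UPC++ at all. The names `byte_buffer`, `pack_range`, `packed_view` and `sum_via_buffer` are hypothetical, the "network buffer" is just a `std::vector<unsigned char>`, and only trivially copyable element types are handled. It illustrates the copy pattern described above, not how the runtime would implement it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for a network buffer (outgoing at the initiator,
// incoming at the target). Not a UPC++ type.
using byte_buffer = std::vector<unsigned char>;

// Initiator side: walk [first,last) once and copy each element out of the
// application container into the contiguous buffer (buffer growth aside).
template <typename T, typename InputIt>
byte_buffer pack_range(InputIt first, InputIt last) {
  static_assert(std::is_trivially_copyable<T>::value,
                "this sketch only handles trivially copyable element types");
  byte_buffer buf;
  for (; first != last; ++first) {
    const T &elem = *first;
    const unsigned char *bytes = reinterpret_cast<const unsigned char *>(&elem);
    buf.insert(buf.end(), bytes, bytes + sizeof(T));
  }
  return buf;
}

// Target side: a non-owning view that exposes the packed elements in place.
// It allocates nothing and copies nothing; it is only valid while the
// underlying buffer is alive and unmodified.
template <typename T>
class packed_view {
 public:
  explicit packed_view(byte_buffer &buf)
      : first_(reinterpret_cast<T *>(buf.data())),
        count_(buf.size() / sizeof(T)) {
    // Assumption of this sketch: the buffer is suitably aligned for T.
    assert(reinterpret_cast<std::uintptr_t>(buf.data()) % alignof(T) == 0);
  }
  T *begin() const { return first_; }
  T *end() const { return first_ + count_; }
  std::size_t size() const { return count_; }

 private:
  T *first_;
  std::size_t count_;
};

// Usage: pack a non-contiguous list, then read the elements from the buffer.
inline double sum_via_buffer(const std::list<double> &applist) {
  byte_buffer wire = pack_range<double>(applist.begin(), applist.end());
  packed_view<double> view(wire);
  double sum = 0.0;
  for (double &elem : view) sum += elem;  // traverses the buffer in place
  return sum;
}
```

Contrast this with passing a std::vector: there the initiator would first build the vector, and the target would allocate and fill a second one before the callback runs.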
It's possible that once we fully specify and implement user-provided Serialization, the user would be able to express the needed Serializable container as a type they construct. Until then, it may be worth exposing a type to fill this specific use case, which seems likely to be important/useful for app-driven communication pipelining. This feature might even be used to satisfy the upcoming March 18 milestone for "Accumulate APIs".
Here's a sketch of what I'm thinking:
    std::list<T> applist = ...
    // app has T data elements in a (possibly non-contiguous) container
    // that it wants to send via RPC with minimal payload copies
    upcxx::rpc([](serialized_container<T> packedlist) {
        // target side gets object containing iterators
        for (T &elem : packedlist) { // traverse elems stored in incoming network buffer
          process(elem);
        }
      }, serialized_container<T>(applist.begin(), applist.end()));
    // RPC initiator "describes" the input data using iterators
Ideally the code above copies the T elements exactly once on the initiating rank (during serialization into the outgoing message buffer), and makes one or zero copies at the target rank (the iterator provided to the RPC runs over the T elements packed into the message buffer that was enlisted at the target).
The same interface could also be used to send any (possibly irregular) selected subset of elements from an existing container with minimal payload copies, eg:
    std::vector<double> vec = ...
    upcxx::rpc(func,
        serialized_container<double>(vec.begin()+10, vec.begin()+1000));
    // send RPC with vector elements [10]..[999]
    upcxx::rpc(func,
        serialized_container<double>(std::find_if(vec.begin(), vec.end(), ElementFilter), vec.end()));
    // send RPC with the vector elements from the first one satisfying ElementFilter through the end
    // (sending only the matching elements would need a filtering iterator adaptor)
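One caveat on the last example: std::find_if returns an iterator to the first match, so the range from that iterator to vec.end() holds every element from the first match onward, matching or not. Describing an irregular subset with an iterator pair would take a filtering iterator adaptor. Below is a minimal standalone sketch of one; `filter_iterator`, `is_positive`, `from_first_match` and `only_matches` are hypothetical names for illustration and are not part of UPC++ or this proposal.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical input-iterator adaptor that skips elements failing a predicate.
template <typename It, typename Pred>
class filter_iterator {
 public:
  using iterator_category = std::input_iterator_tag;
  using value_type = typename std::iterator_traits<It>::value_type;
  using difference_type = std::ptrdiff_t;
  using pointer = typename std::iterator_traits<It>::pointer;
  using reference = typename std::iterator_traits<It>::reference;

  filter_iterator(It pos, It end, Pred pred)
      : pos_(pos), end_(end), pred_(pred) {
    skip();  // start on the first matching element, if any
  }
  reference operator*() const {
    assert(pos_ != end_);  // dereferencing past-the-end is a caller bug
    return *pos_;
  }
  filter_iterator &operator++() {
    ++pos_;
    skip();
    return *this;
  }
  bool operator==(const filter_iterator &other) const { return pos_ == other.pos_; }
  bool operator!=(const filter_iterator &other) const { return pos_ != other.pos_; }

 private:
  // Advance until the predicate holds or the underlying range is exhausted.
  void skip() {
    while (pos_ != end_ && !pred_(*pos_)) ++pos_;
  }
  It pos_;
  It end_;
  Pred pred_;
};

// Example predicate (hypothetical stand-in for ElementFilter).
inline bool is_positive(double x) { return x > 0.0; }

// What [find_if(...), end) describes: everything from the first match onward.
inline std::vector<double> from_first_match(const std::vector<double> &vec) {
  return std::vector<double>(
      std::find_if(vec.begin(), vec.end(), is_positive), vec.end());
}

// What a filtering iterator pair describes: only the matching elements.
inline std::vector<double> only_matches(const std::vector<double> &vec) {
  using It = std::vector<double>::const_iterator;
  using Pred = bool (*)(double);
  filter_iterator<It, Pred> first(vec.begin(), vec.end(), is_positive);
  filter_iterator<It, Pred> last(vec.end(), vec.end(), is_positive);
  std::vector<double> out;
  for (; first != last; ++first) out.push_back(*first);
  return out;
}
```

Because the adaptor only claims to be an input iterator, a pair of them would fit the InputIterator precondition that the proposed constructor places on its arguments.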
    template<typename T>
    struct upcxx::serialized_container {
      typedef T value_type;
    };

    template<typename T>
    template<typename SrcIter> // constructor
    serialized_container<T>::serialized_container(SrcIter &begin, SrcIter &end);
Preconditions:
- begin and end must satisfy the InputIterator C++ concept.
- *std::declval<SrcIter>() has a return type convertible to T const, for some Serializable type T.
Semantics:
- Construct a serialized_container<T> to describe the data in [begin,end).
- The resulting object contains a reference to the begin/end iterators, which must remain live/usable until this object is destroyed (i.e. after it is used for serialization).
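The lifetime rule in the last bullet is easy to trip over, so here is a small standalone sketch of what "contains a reference to the begin/end iterators" implies. `range_descriptor`, `for_each_element` and `demo_sum` are hypothetical names, and the traversal is only a stand-in for whatever serialization would do. Note that with the reference-taking signature as written, temporaries such as applist.begin() would not bind to SrcIter&, so the sketch names its iterators.

```cpp
#include <cassert>
#include <cstddef>
#include <list>

// Hypothetical initiator-side descriptor. It stores references to the
// caller's iterators and copies no elements at construction time.
template <typename T, typename SrcIter>
class range_descriptor {
 public:
  range_descriptor(SrcIter &begin, SrcIter &end) : begin_(begin), end_(end) {}

  // Stand-in for serialization: visit each element of [begin,end) once,
  // reading the iterators through the stored references at call time.
  template <typename Sink>
  std::size_t for_each_element(Sink sink) const {
    std::size_t visited = 0;
    for (SrcIter it = begin_; it != end_; ++it, ++visited) {
      const T &elem = *it;
      sink(elem);
    }
    return visited;
  }

 private:
  SrcIter &begin_;  // must outlive this descriptor
  SrcIter &end_;    // must outlive this descriptor
};

// Usage: the iterators are named locals that outlive the descriptor.
inline int demo_sum(const std::list<int> &lst) {
  std::list<int>::const_iterator first = lst.begin();
  std::list<int>::const_iterator last = lst.end();
  range_descriptor<int, std::list<int>::const_iterator> desc(first, last);
  int sum = 0;
  std::size_t visited = desc.for_each_element([&sum](const int &v) { sum += v; });
  assert(visited == lst.size());
  return sum;
}
```

A consequence of holding references is that modifying an iterator after building the descriptor changes the range that later gets traversed.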
template<typename T, typename RPCArgIter> // accessors
RPCArgIter serialized_container<T>::begin();
RPCArgIter serialized_container<T>::end();
Preconditions:
- *this was passed as an argument to a callback invoked by the UPC++ runtime to a persona on the stack of the calling thread.
Semantics:
- The returned iterators point to the first and past-the-end elements (respectively) of *this.
- RPCArgIter satisfies C++ concept ContiguousIterator.
- The elements have type T&, and are copies of the elements provided by the iterators at construction time, in the same order.
- The RPCArgIter iterators returned by these accessors only remain valid for the lifetime of *this (which ends when exiting the dynamic scope of the RPC call).
Possible extensions:
- Single-argument constructor that takes any CT with CT::begin() and CT::end() members that satisfy the preconditions of the 2-arg constructor
- [cr]begin()/[cr]end() member functions at the RPC side
- T *data() member function at the RPC side to access the packed elements directly
At the 12-13 Pagoda meeting we resolved to adopt a modified version of @akamil's proposal.
Amir will be writing up the details; here is the sketch copied from the notes: