Specify serializable_view (aka "Accumulate APIs") to support efficiently passing dynamically-sized arguments to RPC

Issue #108 resolved
Dan Bonachea created an issue

Our RPC is designed to minimize overheads for passing arguments or captures of small pass-by-value POD types. We currently lack a good story for how a UPC++ application can efficiently pass dynamically-sized arguments (think a few KB) to an RPC. Specifically, if the user has a payload of data stored in a reference type that needs to be accessible to the RPC in an anonymous location at the target (akin to GASNet AMMedium), we don't provide a good way to express that. This use case is important because it is the one-sided analogue to a message send.

IF the initiator has previously reserved a landing zone at the target, it could use the rput-then-rpc feature we'll soon be exposing, but this issue deals with the case where the initiator has NOT reserved a landing zone at the target suitable for an rput (eg due to an irregular comm pattern or scalability constraints), and the overhead of a rendezvous round-trip to establish one would exceed the benefit from an RDMA transfer.

The only technique that comes close in the current implementation is to rely on the undocumented Serializability of std::vector<T> to pass the data argument; however, that has two serious performance drawbacks: (1) if the input data is not already in a vector (or is a subset of a vector), the application needs to create a vector and copy the data once before initiating the RPC, and (2) the implementation at the target will perform a dynamic allocation and copy the data upon arrival while deserializing into a std::vector, even if the RPC callback does not care about the std::vector container and just needs access to the data elements. These problems arise because std::vector is an "owning" container.
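For concreteness, here is a minimal sketch of that workaround; target_rank and process() are placeholders and not part of any proposal (assumes <list>, <vector> and the upcxx headers):

void process(double);  // placeholder for whatever the callback does with each element

void send_via_vector(upcxx::intrank_t target_rank, std::list<double> const &source) {
  // Extra sender-side copy: marshal the (possibly non-contiguous) app data
  // into a std::vector just so it can be passed as an RPC argument.
  std::vector<double> payload(source.begin(), source.end());

  upcxx::rpc(target_rank,
    [](std::vector<double> v) {
      // Extra target-side cost: deserialization dynamically allocates and
      // copies into a fresh vector, even though this callback only reads it.
      for (double x : v) process(x);
    },
    payload);  // serialization copies the elements into the outgoing buffer
}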

I think what we eventually want here is a Serializable, non-owning container that can be used to describe input data in-place on the initiator, and provide the arrived data in-place in the network buffer at the target. The goal would be the data elements are copied exactly once during Serialization at the initiator (from their location in app data structures to the outgoing network buffer), and made available to the RPC at the target directly from the incoming network buffer (no target-side copies).

It's possible that once we fully specify and implement user-provided Serialization, the user would be able to express the needed Serializable container as a type they construct. Until then, it may be worth exposing a type to fill this specific usage case, which seems likely to be important/useful for app-driven communication pipelining. This feature might even be used to satisfy the upcoming March 18 milestone for "Accumulate APIs".

Here's a sketch of what I'm thinking:

Usage cases

std::list<T> applist = ...  
  // app has T data elements in a (possibly non-contiguous) container
  // that it wants to send via RPC with minimal payload copies

upcxx::rpc([](serialized_container<T> packedlist) { 
  // target side gets object containing iterators
  for (T &elem : packedlist) {  // traverse elems stored in incoming network buffer
     process(elem);
  }
}, serialized_container<T>(applist.begin(), applist.end())); 
   // RPC initiator "describes" the input data using iterators

Ideally the code above copies the T elements exactly once on the initiating rank (during serialization into the outgoing message buffer), and zero or one times at the target rank (the iterator provided to the RPC runs over the T elements packed into the message buffer delivered at the target). The same interface could also be used to send any (possibly irregular) selected subset of elements from an existing container with minimal payload copies, eg:

std::vector<double> vec = ...

upcxx::rpc(func, 
   serialized_container<double>(vec.begin()+10, vec.begin()+1000)); 
   // send RPC with vector elements [10]..[999]
upcxx::rpc(func, 
    serialized_container<double>(std::find_if(vec.begin(), vec.end(), ElementFilter), vec.end())); 
   // send RPC with vector elements from the first element satisfying ElementFilter through the end

Initial strawman spec:

template<typename T>
struct upcxx::serialized_container {
  typedef T value_type;
};

template<typename T, typename SrcIter> // constructor
serialized_container<T>::serialized_container(SrcIter begin, SrcIter end);

Preconditions:

  • begin and end must satisfy the InputIterator C++ concept.
  • *std::declval<SrcIter>() has a return type convertible to const T, and T must be a Serializable type.

Semantics:

  • Construct a serialized_container<T> to describe the data in [begin,end).
  • The resulting object holds copies of the begin/end iterators; the data they denote must remain live/usable until this object is destroyed (ie until after it has been used for serialization).

template<typename T, typename RPCArgIter> // accessors
RPCArgIter serialized_container<T>::begin();
RPCArgIter serialized_container<T>::end();

Preconditions:

  • *this was passed as an argument to a callback invoked by the UPC++ runtime for a persona on the stack of the calling thread.

Semantics:

  • The returned iterators point to the first and past-the-end elements (respectively) of *this.
  • RPCArgIter satisfies C++ concept ContiguousIterator.
  • The elements have type T and are accessed by reference; they are copies of the elements provided by the iterators at construction time, in the same order.
  • The RPCArgIter iterators returned by these accessors only remain valid for the lifetime of *this (which ends when exiting the dynamic scope of the RPC call).

Possible extensions:

  • Single-argument constructor that takes any CT with CT::begin() and CT::end() members that satisfy the 2-arg constructor
  • [cr]begin()/[cr]end() member functions at the RPC side
  • T *data() member function at the RPC side to access the packed elements directly

Official response

  • Dan Bonachea reporter

    At the 12-13 Pagoda meeting we resolved to adopt a modified version of @akamil 's proposal.

    Amir will be writing up the details, here is the sketch copied from the notes:

    template<typename T, typename SrcIter>  // free func, creates a serializable view (usually on sender)
    serializable_view<T, SrcIter> serialized(SrcIter begin, SrcIter end);
      // SrcIter satisfies ForwardIterator
      // For trivially copyable T, user "encouraged" to ensure std::distance(begin,end) 
      //  has constant complexity (possibly by adding a specialization)
    
    template<typename T, typename IterType = RPCArgIter<T>>
    class serializable_view {
    public:
      // accessors for the view (usually invoked at receiver)
      IterType begin();
      IterType end();
    };
    // RPCArgIter<T> provided to RPC callback satisfies RandomAccessIterator
    
    std::list<double> applist = ...
    
    upcxx::rpc([](serializable_view<double> packedlist) { 
      // target side gets object containing iterators
      for (double &elem : packedlist) {  // traverse elems stored in incoming network buffer
         process(elem);
      }
    }, serialized(applist.begin(), applist.end()));  
       // user need not specify <T> here, by relying on {Input,Forward}Iterator::value_type
    

Comments (13)

  1. Dan Bonachea reporter

    We discussed this proposal at the end of the 12-6-17 meeting, and everyone still present (Amir, Dan, Paul, Brian, Mathias) indicated they liked the idea - essentially giving us a hopefully-easy piece of useful functionality we can use to minimally satisfy the "Accumulate API" milestone (although nothing prevents later addition of other complementary features).

    @jdbachan - what are your thoughts on delivering some version of this proposal, especially regarding expected implementation effort?

  2. john bachan

    This seems doable, but there are some thorns I would like to address.

    The first is that the constructor is templated on SrcIter, but serialized_container<T> is not. When the upcxx runtime decides it wants to serialize the data, since the type of the container has been erased, serialized_container will need to contain a function pointer (or virtual, same thing) AND stash the SrcIter values into an object on the heap. This is so upcxx has the state necessary to perform the actual serialization without knowing the underlying iterator type. I find that virtual call and extra heap allocation undesirable. They could be done away with by making serialized_container also be templated on SrcIter. This has problems I will address later. Alternatively, serialized_container could do the serialization directly in the constructor while SrcIter's type is available, but that would require serialized_container to hold a temporary buffer (since there is no AM buffer allocated yet), which incurs an extra data copy (the thing we're most afraid of).

    So it looks like ideally serialized_container should have the signature serialized_container<T,SrcIter>. The problem with that is that the type on the receiver's side will be different, since the items are no longer elements of a SrcIter-like container but are instead reached through RPCArgIter. Usually the args the user passes to rpc are passed to the given function as-is, with tedious details of serialization conveniently hidden. But now serialization is in the user's face, since the type they put in (serialized_container<T,SrcIter>) isn't the type they get out (serialized_container<T,RPCArgIter>). That is probably acceptable, but it doesn't play nice with the serialization framework, which only supports a pipeline that maps values of T to bytes and back. Mapping T to bytes but getting some other type U really complicates our code, since now generic types like vector<T> and pair<A,B> also have to worry about the deserialized type being vector<U> etc. Not impossible, just a lot more implementation template code goop.

    If we can agree we want the type asymmetry of serialized_container's type changing on the receiver, then I think a much better place to put that goop would be into the upcxx::bind template code. This is a layer that gets called by rpc on the func and args before they are passed to serialization (its job is primarily to handle the dist_object& to dist_id on-the-wire magic). It doesn't currently support asymmetry, but I could make it do so, and it would be easier than doing serialization (packing) outright.
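    To illustrate what that asymmetry amounts to (purely hypothetical names, not the actual upcxx::bind internals), it is essentially a per-argument type mapping along these lines:

    // Hypothetical sketch only: not the actual upcxx::bind machinery.
    template<typename T, typename Iter> struct serialized_container;  // the two-parameter form proposed above

    // Maps the type passed at injection to the type handed to the callback;
    // the identity mapping by default.
    template<typename Arg>
    struct bind_mapping { using callback_type = Arg; };

    // Asymmetric case: a sender-side view over SrcIter is delivered to the
    // callback as a view whose iterator points into the incoming AM buffer.
    template<typename T, typename SrcIter>
    struct bind_mapping<serialized_container<T, SrcIter>> {
      using callback_type = serialized_container<T, /*RPCArgIter=*/T*>;
    };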

    Notice this: all of these problems just go away if we restrict serialized_container<T> such that T must be trivially serializable and SrcIter must be a T* into a contiguously packed array. Now we can use the same type on the receiver side to point into the AM buffer, and no virtual function is needed on the sender side since the data is already in a valid-to-transmit format. This also covers most of our users, where T = double.
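    A hedged sketch of what that restricted, symmetric type might look like (hypothetical name; T assumed trivially serializable):

    // The same type works on both sides: the sender points it at the app's
    // contiguous array, the receiver points it into the incoming AM buffer,
    // with no virtual dispatch or heap allocation.
    template<typename T>
    struct contiguous_serialized_view {
      T const *begin_, *end_;
      T const *begin() const { return begin_; }
      T const *end()   const { return end_;   }
    };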

  3. Dan Bonachea reporter

    @jdbachan - thanks for your analysis; you're correct, I had not noticed those thorns.

    Notice this: all of these problems just go away if we restrict serialized_container<T> such that T must be trivially serializable and SrcIter must be T* into contiguously packed array.

    I strongly disagree with that suggestion, especially the second part - in particular I really do want to support efficient serialization of non-contiguous containers (ie my first usage case). Otherwise we could trivially introduce a much simpler (non-iterator) interface that just took a T*. We might also want that, but I'd really like to also accommodate the additional generality.

    I think we can skirt the problems you mention by adjusting the user-facing syntax to call a factory function instead of a constructor, ie replace the public constructor with:

    template<typename T, typename SrcIter>  // object factory
    serialized_container<T> serialize_container(SrcIter begin, SrcIter end);
    

    and tweak the rpc input argument to call the factory instead of the constructor:

    upcxx::rpc([](serialized_container<T> packedlist) { 
      // target side gets object containing iterators
      for (T &elem : packedlist) {  // traverse elems stored in incoming network buffer
         process(elem);
      }
    }, serialize_container<T>(applist.begin(), applist.end()));  // acts like a constructor but actually a factory
    

    I believe this can handle the type-erasure problem you outlined in a way that is minimally invasive for the user and only adds the overhead of a single virtual function call (in particular, the overhead remains constant with varying payload size). It also preserves the normal RPC argument type-matching property. I think everything else in the user-facing API just works in the natural way using this approach.

    I'm attaching a prototype I tossed together that demonstrates these ideas. There are probably minor optimizations and stylistic changes that could/should be made inside the implementation, but hopefully it's enough to convey the idea. I'm also not particularly attached to the serialize(d)_container token naming, if someone has a better suggestion.

  4. john bachan

    Actually @bonachea you misunderstood me. I am not in favor of erasure because of the added virtual function call (actually two, there needs to be a virtual destructor for the heaped object too, your proof-of-concept code misses this because it leaks objects).

    Providing the erasure is easy, as your example proves. But, it actually improves nothing compared to your original strawman. Constructors are allowed to take type parameters beyond those of their owning class:

    template<typename T>
    struct serialized_container {
      struct thunk {
        virtual ~thunk() = default;
        virtual void serialize_me() = 0;
      };
      template<typename SrcIter>
      struct thunk_impl: thunk {
        SrcIter a, b;
        thunk_impl(SrcIter a, SrcIter b): a(a), b(b) {}
        void serialize_me() override { /* ... */ }
      };
    
      std::unique_ptr<thunk> thk_; // pointer that deletes on destruction
    
      template<typename SrcIter>
      serialized_container(SrcIter a, SrcIter b):
        thk_{new thunk_impl<SrcIter>(a, b)} {
      }
    };
    
    // both work
    serialized_container<T> foo1 = serialized_container<T>{vec.begin(), vec.end()};
    serialized_container<T> foo2{vec.begin(), vec.end()};
    

    Another reason I don't like erasure is because, as your implementation demonstrates, serialized_container needs to be two-moded: APIs for the sender side and the receiver side. Internally we'll need a flag indicating which side we're on, because that encodes which fields are valid (which will probably live in a union of two structs). The destructor, for instance, will have to sniff the flag to determine if it's on the sender side, because in that case the heaped thunk will need to be deleted. I would rather this be handled statically with distinct sender and receiver types.

    I am advocating that the upcxx::bind handle the type asymmetry.

  5. Amir Kamil

    I would argue that the type asymmetry is a feature rather than a bug, since the sender and receiver do different operations. I would advocate for a further separation of the two types. Something like the following on the sender:

    template<typename T, typename SrcIter>
    struct serializable_view {
      // no public interface
    };
    
    template<typename T, typename SrcIter>
    serializable_view<T, SrcIter> serialized(SrcIter begin, SrcIter end); // requires *begin to be convertible to T
    template<typename Container>
    serializable_view<typename Container::value_type, typename Container::const_iterator>
      serialized(const Container &container);
    

    Then the only operation supported by a serializable_view<T, SrcIter> would be to pass it to a UPC++ communication op.

    On the receiving end, we would have a type such as serialized_container<T> that implements the Container concept.
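    A hedged usage sketch of this split (target_rank and process() are placeholders):

    std::list<double> applist = /* ... */;

    upcxx::rpc(target_rank,
      // Receiver side: serialized_container<double> implements the Container
      // concept over the elements packed in the incoming network buffer.
      [](serialized_container<double> data) {
        for (double x : data) process(x);
      },
      // Sender side: the single-argument overload deduces value_type and
      // const_iterator from the container; the resulting view supports no
      // operation other than being passed to a UPC++ communication op.
      serialized(applist));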

  6. Dan Bonachea reporter

    @jdbachan : FWIW I think you are vastly overestimating the cost of virtual function dispatch in modern compilers and architectures. In most cases a virtual dispatch is a difference of two memory loads (likely hot in cache) and an indirect jump instruction (which should be well-predicted if it matters). I don't think we should shy away from a design containing a small constant number of virtual function dispatches if it provides a significant improvement in the user experience and the primary use is part of an operation that is already orders of magnitude more expensive (sending an RPC across the network).

    That being said, I do like @akamil 's proposal -- the main downside there seems to be that the end user has to be educated about this new/different special-case RPC argument type (in addition to the other special cases for dist_object and team). With all current RPC arg types (including dist_object, team and my originally proposed serialized_container), we at least preserved the property that the argument type is the same at RPC injection and in the RPC callback (even if the implementation was doing specialized "magic" underneath). However, since in Amir's suggested usage the casual user probably never explicitly names the serializable_view<T, SrcIter> type, they might not notice the discrepancy.

  7. Dan Bonachea reporter

    I wrote a simple microbenchmark that measures a number of interesting common operations in our current (develop head d29da42) UPC++ implementation, and added John's prototype serialized_container implementation from above. The output below shows representative results compiled with g++ 7.2.0 in -O3 mode running over ibv-conduit on two nodes of Dirac (Linux/Xeon 2.4GHz).

    The virtual function call (a call to sc_obj.thk_->serialize_me(), whose body I filled in with a simple counter increment), costs under 3 ns on this platform (total cost from call to return including measurement overhead -- the incremental cost attributable to virtual dispatch is significantly less). Meanwhile a 4KB memcpy (a representative payload for this interface) costs around 121ns (40x longer), and the cost of a synchronous "loopback" 0-argument RPC that runs empty user code is about 1.12us (373x longer). A similar RPC to a peer across the IBV fabric is 4.72us (1573x longer).

    I acknowledge the costs may differ somewhat in a real application compiled against a real implementation (where the static optimization quality and instruction pipeline prediction may degrade), but I don't think we're talking about a heavyweight cost here.

    Running misc performance test with 10000000 iterations...
                                        Total time    Avg. time
                                        ----------    ---------
    
     Serial overhead tests:
     measurement overhead              0.024039 s    0.002404 us
     serialized_container constructor  0.563990 s    0.056399 us
     virtual call direct               0.024090 s    0.002409 us
     virtual call indirect             0.032426 s    0.003243 us
     direct function call              0.028154 s    0.002815 us
     memcpy(4KB)                       1.216699 s    0.121670 us
    
     Local UPC++ tests:
     gasnet_AMPoll                     0.615940 s    0.061594 us
     upcxx::progress                   1.615507 s    0.161551 us
     self.lpc(noop0)                   5.272861 s    0.527286 us
     self.lpc(lamb0)                   5.358035 s    0.535804 us
     upcxx::rpc(self,noop0)           11.223028 s    1.122303 us
     upcxx::rpc(self,noop8)           12.016220 s    1.201622 us
     upcxx::rpc(self,lamb0)           11.070189 s    1.107019 us
     upcxx::rpc(self,lamb8)           11.733648 s    1.173365 us
     upcxx::rput<double>(self)         5.579026 s    0.557903 us
    
     Remote UPC++ tests:
     upcxx::rpc(peer,noop0)           47.285721 s    4.728572 us
     upcxx::rpc(peer,noop8)           59.978561 s    5.997856 us
     upcxx::rpc(peer,lamb0)           47.713990 s    4.771399 us
     upcxx::rpc(peer,lamb8)           58.883915 s    5.888391 us
     upcxx::rput<double>(peer)        24.266550 s    2.426655 us
    
  8. Dan Bonachea reporter

    At the 12-13 Pagoda meeting we resolved to adopt a modified version of @akamil 's proposal.

    Amir will be writing up the details, here is the sketch copied from the notes:

    template<typename T, typename SrcIter>  // free func, creates a serializable view (usually on sender)
    serializable_view<T, SrcIter> serialized(SrcIter begin, SrcIter end);
      // SrcIter satisfies ForwardIterator
      // For trivially copyable T, user "encouraged" to ensure std::distance(begin,end) 
      //  has constant complexity (possibly by adding a specialization)
    
    template<typename T, typename IterType = RPCArgIter<T>>
    class serializable_view {
    public:
      // accessors for the view (usually invoked at receiver)
      IterType begin();
      IterType end();
    };
    // RPCArgIter<T> provided to RPC callback satisfies RandomAccessIterator
    
    std::list<double> applist = ...
    
    upcxx::rpc([](serializable_view<double> packedlist) { 
      // target side gets object containing iterators
      for (double &elem : packedlist) {  // traverse elems stored in incoming network buffer
         process(elem);
      }
    }, serialized(applist.begin(), applist.end()));  
       // user need not specify <T> here, by relying on {Input,Forward}Iterator::value_type
    
  9. Amir Kamil

    How will we be handling memory management in serializable_view? When the user provides the iterator, it is a non-owning container, but for the upcxx-generated one on the remote, it needs to have some form of ownership over the memory so that the memory is eventually freed.

  10. Dan Bonachea reporter

    but for the upcxx-generated one on the remote, it needs to have some form of ownership over the memory so that the memory is eventually freed.

    My proposal was that the data type on the RPC receiver is non-owning, and we specify the lifetime of the storage is delimited by the dynamic lifetime of the RPC invocation that receives it (ie similar to GASNet AMMedium semantics). If the app needs to preserve the data beyond the return of the RPC callback, it should copy the elements to application-owned storage.
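    Under that rule, a callback that needs the payload after it returns would copy it out explicitly; a hedged sketch (enqueue_for_later() is a placeholder):

    upcxx::rpc(target_rank,
      [](serializable_view<double> packed) {
        // The elements live in the incoming network buffer only for the
        // duration of this callback; copy them into application-owned storage
        // if they are needed afterwards.
        std::vector<double> kept(packed.begin(), packed.end());
        enqueue_for_later(std::move(kept));
      },
      serialized(applist.begin(), applist.end()));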

  11. Dan Bonachea reporter

    It's been decided this API will satisfy the March 18 STPM17-7 milestone deliverable:

    Implement new “Accumulate” APIs; write or update the corresponding specification; and provide corresponding tests and/or examples.

    The proposed specification appears in Spec pull request #2
