Serialization: Full Featured

Issue #136 resolved
john bachan created an issue

Goals of HPC caliber serialization:

  • Fast memory movement of trivial types. This means respecting alignment of types so that word granular movement in and out of the byte stream is possible.

  • No unnecessary indirection, esp function dispatch (i.e. no virtual serialize()/deserialize()). Template facilities should be used so that as much dispatch as possible is done at compile time.

  • Nearly all cases I have encountered thus far in HPC have a cheap way of computing a tight upper bound on the number of bytes they need in the stream. When this is available, serialization can have its buffer allocated once up front, and then none of the writes into the buffer during object traversal need to worry about possibly growing the buffer (which would incur a conditional branch at every byte/word appended). We do not want to make this optimization impossible.

  • But still support types that do not have tight and cheap upper bounds available. This requires dynamically growing serialization buffers.

  • Allow the cheap upper bound to be specified at compile time if possible, and at runtime otherwise. Knowing the bound at compile time means I don't have to allocate from the heap; instead it can go on the stack or inline in some other heap object.

  • Support std types out of the box.

  • Provide syntactic sugar for structs of serializable types (like UPCXX_REFLECTED is currently doing for Steve).

  • Support types which can't be default constructed. Boost fails here.

  • Support types which can't be modified after construction (like having const member fields set by the constructor). Boost fails here too.

  • Retain boost's pattern of serialization hooks accepting the byte stream argument as an unconstrained type parameter.

  • For convenience, allow multiple forms of template dispatch. The two that come to mind:

    • Template class specialization of a class in the upcxx namespace. This is the most general as it allows users to supply serialization support for types for which they are not the author.
    • Hooks found via ADL, like friend functions of the user's class. Requires authorship of the type or its enclosing namespace. EDIT: I'm changing my proposal to be a nested class that has the same name and expectations as the above specialization.
  • Asymmetric types. Boost fails. Deserializing a serialized T may produce a different type U. upcxx::view<T,Iter> works this way: no matter what Iter is going in, the deserialized view always has the special network buffer iterator for Iter.

  • Boost-serializable types should work out of the box through some adaptor layer.

Here's a crack at some of the above but missing upper bounds and type asymmetry.

struct UserType {
  Foo foo;
  Bar bar;

  struct serialization {
    // put "me" into "writer"
    template<typename Writer>
    static void serialize(Writer &w, UserType const &me) {
      w.push(me.foo);
      w.push(me.bar);
    }

    // pull "me" out of "reader" by constructing into provided storage which will be
    // guaranteed big enough.
    template<typename Reader>
    static UserType* deserialize(Reader &r, void *spot) {
      // Order of these matters, which is why they can't be done inline in the constructor
      // call below (argument evaluation order is unspecified in C++).
      Foo foo = r.pop<Foo>();
      Bar bar = r.pop<Bar>();
      return ::new(spot) UserType{std::move(foo), std::move(bar)};
    }
  };
};

// Template specialization based, user opens our namespace
namespace upcxx { 
  template<>
  struct serialization<UserType> {
    // same contents as UserType::serialization example above
  };
}

Let's add upper bound support. The above example didn't specify an upper bound, so the Writer passed to serialize would be the dynamically resizing one. Once an upper bound is guaranteed, upcxx will instantiate with a different Writer that elides the buffer growing logic.
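
As a rough sketch of the difference (align_up() and grow() are hypothetical helpers, and storage_size is collapsed to a raw size/align pair for brevity), the bounded flavor simply drops the per-append overflow check:

// Hypothetical sketch only: what a bounded Writer gets to elide.
char* align_up(char *p, std::size_t align); // round p up to the given alignment

struct growing_writer_sketch {
  char *head, *end;
  // Acquire more space without moving bytes already handed out.
  char* grow(std::size_t size, std::size_t align);

  void* place(std::size_t size, std::size_t align) {
    char *p = align_up(head, align);
    if(p + size > end)        // conditional branch on every append
      p = grow(size, align);
    head = p + size;
    return p;
  }
};

struct bounded_writer_sketch {  // buffer pre-sized via ubound(): branch elided entirely
  char *head;
  void* place(std::size_t size, std::size_t align) {
    char *p = align_up(head, align);
    head = p + size;
    return p;
  }
};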

// More helper types for upper-bound support
namespace upcxx {
  template<typename T>
  struct serialization_complete {
    // This class is how someone "calls" serialization. It is NOT specialized by users. For
    // convenience, the "serialization" class can be missing members (which will obtain
    // default behavior). This class is how a user accesses "serialization" but with all
    // defaults filled in.
  };

  // Note: types like storage_size<auto,auto> aren't legal C++, but you know what I mean.

  // storage_size is a pair of size_t's for the size and alignment of a memory block.
  // Has `cat` operation for computing size of concatenated type (this involves
  // padding arithmetic). The neat thing is that we use templates to track if
  // the size and/or alignment is a compile-time known thing, and the `cat`
  // operation propagates as much compile-time info through the type as it can.
  template<std::size_t static_size = std::size_t(-1),  /* default is runtime */
           std::size_t static_align = std::size_t(-1) /* default is runtime */>
  struct storage_size {
    // Each of these is constexpr static iff their corresponding template parameter is
    // not -1. Runtime fields are only needed if the value isn't statically known.
    /* constexpr static */ std::size_t size, align;

    // Computes storage size of concatenation: this + that. Preserve as much static
    // knowledge as possible.
    storage_size<auto,auto> cat(storage_size<auto,auto> that);

    // sugar
    template<typename T>
    constexpr storage_size<auto,auto> cat_size_of() {
      return this->cat(storage_size_of<T>());
    }

    // sugar
    template<typename T>
    storage_size<auto,auto> cat_ubound_of(T const &x) {
      return serialization<T>().ubound(*this, x);
    }

    // Replicate this space for `n` contiguous elements.
    storage_size<auto,auto> array(std::size_t n);
    template<std::size_t n>
    storage_size<auto,auto> array();
  };

  constexpr storage_size<0,1> empty_storage_size = {};

  // Like quiet-NaN for storage_size. Returned by types which don't have a cheap
  // upper bound.
  constexpr storage_size<std::size_t(-2), std::size_t(-2)> invalid_storage_size = {};

  template<typename T>
  constexpr storage_size<sizeof(T),alignof(T)> storage_size_of() {
    return {};
  }
}

// user provided upper-bound example
namespace upcxx {
  template<>
  struct serialization<UserType> { // open our class like before
    // given upper bound size of existing stream (prefix), return that with my upper bound
    // added to it
    static auto ubound(
        storage_size<auto,auto> prefix,
        UserType const &me) {
      return prefix.cat_ubound_of(me.foo).cat_ubound_of(me.bar);
    }

    // ...`serialize()` and `deserialize()` just like before...
  };
}

So here's what std::vector<T> would look like with C++14 for deduced auto return types:

namespace upcxx {
  // Provided helper:
  //  - For T where serialization<T>::ubound(...) returns a purely-static storage_size,
  //    we just return that.
  //  - Otherwise: return invalid_storage_size
  template<typename T>
  constexpr storage_size<auto,auto> static_serialization_ubound_of();

  template<typename T>
  struct serialization<std::vector<T>> {
    // This is elegant. If T doesn't have a purely-static ubound, then our return type will be
    // the invalid storage_size (thanks to its quiet-NaN-like behavior), meaning we can't be
    // cheaply bounded. If T does have a purely-static ubound, then this will return a non-static
    // storage_size since there is a runtime dependency on the vector's size. Ultimate effect:
    // vector<char> is ubound'ed, vector<vector<char>> is not.
    template<typename PrefixSize>
    static auto ubound(PrefixSize prefix, std::vector<T> const &me) {
      return prefix
        /*space for length*/.cat_size_of<std::size_t>()
        /*space for elements*/.cat(upcxx::static_serialization_ubound_of<T>().array(me.size()));
    }

    template<typename Writer>
    static void serialize(Writer &w, std::vector<T> const &me) {
      // length
      w.push<std::size_t>(me.size());
      // elements
      for(T const &elt: me)
        w.push<T>(elt);
    }

    template<typename Reader>
    static std::vector<T>* deserialize(Reader &r, void *spot) {
      std::vector<T> *vec = ::new(spot) std::vector<T>;
      // length
      std::size_t n = r.pop<std::size_t>();
      vec->reserve(n);
      // elements
      for(std::size_t i=0; i != n; i++)
        vec->push_back(r.pop<T>());
      return vec;
    }
  };
}

Finally, generalize std::vector<T> for asymmetry support.

namespace upcxx {
  template<typename T>
  struct serialization<std::vector<T>> {
    // `ubound` and `serialize` are same as above

    // internal helper
    using deserialized_T = typename upcxx::deserialized_of<T>::type;

    // Special. Announces that the deserialized type differs. If the user omits it,
    // symmetry is assumed.
    using deserialized_type = std::vector<deserialized_T>;

    template<typename Reader>
    static deserialized_type* deserialize(Reader &r, void *spot) {
      std::vector<deserialized_T> *vec = ::new(spot) std::vector<deserialized_T>;
      // length
      std::size_t n = r.pop<std::size_t>();
      vec->reserve(n);
      // elements
      for(std::size_t i=0; i != n; i++)
        vec->push_back(r.pop<T>()); // pop a T, not a deserialized_T; the returned value is still a deserialized_T
      return vec;
    }
  };
}

And we also need the default (non-specialized) implementation of upcxx::serialization_complete<T> to query upcxx::is_definitely_trivially_serializable<T>::value and short-circuit to byte copies if the user enables it.
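
A minimal sketch of that short-circuit, assuming a trait spelled is_definitely_trivially_serializable and the serialization_complete role described above (details hypothetical):

namespace upcxx {
  // Hypothetical sketch: the default serialization_complete dispatches on triviality
  // at compile time; the trivial branch degenerates to memcpy.
  template<typename T, bool trivial = is_definitely_trivially_serializable<T>::value>
  struct serialization_complete_default;

  template<typename T>
  struct serialization_complete_default<T, /*trivial=*/true> {
    using deserialized_type = T; // trivial types are always symmetric

    template<typename Prefix>
    static auto ubound(Prefix prefix, T const&) {
      return prefix.template cat_size_of<T>(); // sizeof/alignof is the exact bound
    }
    template<typename Writer>
    static void serialize(Writer &w, T const &x) {
      std::memcpy(w.place(storage_size_of<T>()), &x, sizeof(T));
    }
    template<typename Reader>
    static T* deserialize(Reader &r, void *spot) {
      std::memcpy(spot, r.unplace(storage_size_of<T>()), sizeof(T));
      return reinterpret_cast<T*>(spot); // C++17 would wrap this in std::launder
    }
  };

  template<typename T>
  struct serialization_complete_default<T, /*trivial=*/false>: serialization<T> {
    // Non-trivial: defer to the registered serialization<T>, filling in any missing
    // members (deserialized_type = T, etc.) with defaults.
  };
}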

The above is almost general enough for the user to implement something like upcxx::view efficiently. What's lacking is the ability to skip over a serialized T in a reader without doing the work of building it and throwing it away. But this isn't a show stopper for upcxx::view<UserType>: I can bake in "skip" support for known types, and for other types (like UserType) the view will know to prefix each element with an extra byte-count on the wire so it knows how far to jump; or, if I detect that std::is_trivially_destructible<UserType> is true, I can elide the extra byte count and just deserialize and discard without performance concerns.

And for the super simple case of a struct of serializable fields, we want something like UPCXX_REFLECTED, which already exists in the implementation; I propose a rename:

struct UserType {
  Foo foo;
  Bar bar;

  // Expands to *something impl-defined* that enables the most general junk:
  //   - Upper bounded if possible as sum of field upper bounds.
  //   - Asymmetry *not possible*. All fields must be symmetric.
  // UB if user uses this and also registers more serialization hooks.
  UPCXX_SERIALIZED_FIELDS(foo, bar)
};
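
Purely for illustration, the expansion could be morally equivalent to writing the nested serialization class from the first example by hand (the real expansion is impl-defined):

// Hypothetical sketch of what UPCXX_SERIALIZED_FIELDS(foo, bar) could expand to inside
// UserType; this shows the intended semantics only, not actual macro output.
struct serialization {
  template<typename Prefix>
  static auto ubound(Prefix prefix, UserType const &me) {
    return prefix.cat_ubound_of(me.foo).cat_ubound_of(me.bar);
  }

  template<typename Writer>
  static void serialize(Writer &w, UserType const &me) {
    w.push(me.foo);
    w.push(me.bar);
  }

  template<typename Reader>
  static UserType* deserialize(Reader &r, void *spot) {
    Foo foo = r.pop<Foo>(); // fields must be symmetric, per the note above
    Bar bar = r.pop<Bar>();
    return ::new(spot) UserType{std::move(foo), std::move(bar)};
  }
};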

Proposed Reader/Writer concepts. The API marked ADVANCED is optional, just needed to implement upcxx::view.

struct WriterConcept {
  // invoke serialization of T
  template<typename T>
  void push(T const&);

  // Push sequence to buffer. Guaranteed to produce same buffer as
  // doing `push<T>()` in a loop. Possible internal optimizations:
  //  - Uses bulk memcpy when T is trivially serializable and iterator is contiguous.
  //  - When T has ubound(), growable buffer allocates bigger than standard chunk
  //    and can do elements with overflow checks elided.
  template<typename Iter>
  void push_sequence(Iter elts, std::size_t n);

  // Allocate contiguous uninitialized storage in buffer. Growing writers must ensure
  // this pointer remains stable for lifetime of writer. Naive buffer-doubling is therefore
  // not a possible implementation for a growing writer.
  template<std::size_t S, std::size_t A>
  void* place(storage_size<S,A> size);

  // *** ADVANCED ONLY (upcxx::view) ***

  // total bytes written so far. 
  std::size_t size() const;
};

struct ReaderConcept {
  // deserialize T1 (given input serialized was T) from buffer and return it. Requires T1 be moveable.
  template<typename T, typename T1 = typename serialization_complete<T>::deserialized_type>
  T1 pop();

  // deserialize T1 (given T), construct into storage (for all cases including when T1 not moveable) and return constructed pointer.
  template<typename T, typename T1 = typename serialization_complete<T>::deserialized_type>
  T1* pop_into(void *storage);

  // deserialize T1 elements into uninitialized storage array just as a loop of
  // `pop_into<T>()` would. Return pointer to first element.
  template<typename T, typename T1 = ...goop...>
  T1* pop_sequence_into(void *storage, std::size_t n);

  template<std::size_t S, std::size_t A>
  void* unplace(storage_size<S,A> size);

  // *** ADVANCED ONLY (upcxx::view) ***

  // Current byte position of reader
  char* head() const;
  // Byte offset to add to current position
  void jump(std::intptr_t delta);
};
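
To sanity-check these concepts, here is a sketch of a hypothetical serialization<std::array<T,n>> written against them (T assumed symmetric; object-lifetime formalities around treating `spot` as a T[n] are glossed over):

namespace upcxx {
  template<typename T, std::size_t n>
  struct serialization<std::array<T,n>> {
    template<typename Prefix>
    static auto ubound(Prefix prefix, std::array<T,n> const&) {
      // n is static, so a statically ubound-able T yields a fully static bound.
      return prefix.cat(upcxx::static_serialization_ubound_of<T>().template array<n>());
    }

    template<typename Writer>
    static void serialize(Writer &w, std::array<T,n> const &me) {
      // Contiguous iterator + trivially serializable T can collapse to one memcpy.
      w.push_sequence(me.begin(), n);
    }

    template<typename Reader>
    static std::array<T,n>* deserialize(Reader &r, void *spot) {
      // std::array<T,n> has the layout of T[n]; pop the elements straight into spot.
      r.template pop_sequence_into<T>(spot, n);
      return reinterpret_cast<std::array<T,n>*>(spot);
    }
  };
}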

Comments (13)

  1. BrianS

    Some of these things are easier to do if C++ adopts a Reflection API, would that be correct? We are discussing reflection in the standards committee. I just don't know if the current TS would allow upc++ to do these operations for a user automatically. Reflection is often used for serialization.

  2. john bachan reporter

    Reflection would only make our default strategy for serialization more intelligent, and it might obviate the need for that macro UPCXX_SERIALIZED_FIELDS. The rest of the complexity remains.

  3. Dan Bonachea

    A few "goals" I'd like to see added, as mentioned in our call last week:

    • Serialization routines for type T should not need to make any assumptions about the representation of fields of type S. In particular, catting the ubounds of fields should look the same regardless of the serialization properties (eg TriviallySerializable, or possibly even basic type) of the field type.

    • Support writing an adapter to allow serialization of existing types that support Boost serialization (potentially with some reduction in performance). This is to help ensure interoperability with existing library types, where it may be impractical or impossible for the UPC++ app author to implement our style of serialization for those types (eg because it requires access to private fields of types whose implementation they are not allowed to modify, or knowledge of internal representation the user lacks).

    Re: ubound():

    I think we need to discuss further whether ubound should be taking a prefix argument. The stated motivation for this design was to reduce external padding between types, but for types containing fields with varied alignment requirements I suspect this strategy often just trades external for internal padding, leading to no net savings. Either way we need to advertise a "best practice" for the order in which users should append fields to the stream to reduce wasted bytes on the wire, when there are no other ordering constraints. We agree the serialization "order" of constituent fields needs to be insensitive to the prefix, and probably invariant for most types. Conventions like "largest alignment first" minimize internal padding (and for that reason are a common strategy for in-memory layout), but John indicated he was thinking "smallest alignment first".

    Re WriterConcept::place():

    Growing writers must ensure this pointer remains stable for lifetime of writer. Naive buffer-doubling is therefore not a possible implementation for a growing writer.

    Would it be sufficient to say the void * is only guaranteed to remain valid until the next Writer::push operation? Alternatively, perhaps instead of a void *, return a wrapped pointer that is implemented as a reference to the Writer and an offset into the Writer's buffer, so the wrapped pointer remains valid for the lifetime of the Writer, but the wrapped-ptr dereference operation returns a void * that is only valid until the next push.

    Thinking further (and partially based on our discussion last week) perhaps the Writer::push operations should similarly return a wrapped pointer to the bytes in the storage buffer, allowing modification of serialized bytes later in the serialization routine (eg to record the dynamic properties of some irregular object as discovered during the serialization we just performed, when that information is not normally stored in the main object and hence not known until the end of serialization).
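
    A minimal sketch of the kind of wrapped pointer I have in mind (member names are hypothetical, not something the current WriterConcept exposes):

    // Hypothetical sketch: survives buffer reallocation by storing an offset, and only
    // materializes a raw pointer at dereference time.
    template<typename Writer, typename T>
    struct writer_ptr {
      Writer *w;
      std::size_t offset; // byte offset into the writer's stream

      T* get() const {
        // The raw pointer is only valid until the next push/place moves the buffer.
        return reinterpret_cast<T*>(w->buffer_base() + offset);
      }
    };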

  4. john bachan reporter

    Re re ubound: This API is a little awkward but I still believe it is the best. We should first arrive at consensus that we don't want unnecessary fragmentation if we can avoid it. If the API were simplified to storage_size ubound(T const&), then the following case demonstrates the unwanted fragmentation it would run afoul of (I'm using struct syntax to convey the abstraction boundaries of memory layout):

    struct bloated {
     char a;
     // [...padding inserted here...]
     struct embedded { // nested struct analogous to nested ubound() in a `cat` call
      char b;
      int c;
     } bc;
    }; // sizeof(bloated) == 3*sizeof(int)
    
    struct lean {
     char a;
     char b;
     int c;
    }; // sizeof(lean) == 2*sizeof(int)
    

    The API storage_size ubound(storage_size prefix, T const&) naturally achieves the lean layout by threading the prefix accumulator through the call stack of ubounds. What makes it awkward is that the type signature is too permissive: the space of all functions meeting that signature is larger than the space we desire, hence our need for an "insensitivity" constraint demanding that the ordered list of types cat'ed inside ubound be invariant wrt prefix. A more precise type signature would be for ubound to just take a T const& and return an ordered list of storage_size's for the caller to cat in their own context: e.g. std::vector<storage_size> ubound(T const&), with no extra constraint needed since the caller's context is unavailable. But now programmers are accumulating lists of storage_size's with list concatenation instead of just a single storage_size and storage_size cat. Passing std::vector as the list type is clearly inefficient and makes the template trick of tracking values at compile time when possible totally infeasible. This can be remedied by using a heterogeneous type like std::tuple<storage_size<?,?>...> ubound(T const&) and users using std::tuple_cat to paste together lists of storage_size's:

    auto ubound(bloated const &x) -> decltype(...) {
      return std::tuple_cat(serialization<char>::ubound(x.a),
             std::make_tuple(serialization<bloated::embedded>::ubound(x.bc)));
    }
    
    // compared to passing prefix explicitly
    auto ubound(storage_size<?,?> pre, bloated const &x) -> decltype(...) {
      return pre.cat_ubound_of(x.a).cat_ubound_of(x.bc);
    }
    

    We would want to add some sugar to make the tuple code less noisy. Also we would need to add a "reduction" function that takes in a tuple of storage_size's and cats them all together so the user could, for instance, multiply that by a length to get the size of an array of T. This seems like a lot more spec work, hence my preference for explicit prefix passing.
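
    For reference, that reduction could look roughly like this (sketch only; std::apply is C++17, and C++14 would need an index_sequence helper to unpack the tuple):

    // Hypothetical sketch: reduce a pack of storage_size's with cat().
    auto cat_all() { return empty_storage_size; }

    template<typename Head, typename ...Tail>
    auto cat_all(Head head, Tail ...tail) {
      return head.cat(cat_all(tail...));
    }

    // Applying it to the tuple returned by the alternative ubound signature:
    //   auto total = std::apply([](auto ...ss) { return cat_all(ss...); }, ubound(x));
    // and the size of an array of T would then be total.array(n).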

  5. john bachan reporter

    Re re WriterConcept. Generalizing the API to permit buffer doubling is possible as you demonstrate, but it complicates things to support an implementation that seems always inferior to one that threads fixed-size chunks together in a linked list and compacts them once at the end. They have the same big-O behavior, but chunks are much friendlier to modern mallocs, which tend to use size-specific slabs for exponentially spaced size classes below ~16K (this is the behavior of recent glibc, jemalloc, and tcmalloc). Also, the runtime could elide compacting the chunks before injection if the conduit supported fragmented input buffers (gasnet doesn't but I've heard ibv does?). The only advantage of buffer doubling I can think of is that it maps to realloc, which conceivably could do the growing more efficiently by eliding the copy when the buffer adjoins vacant space, but since we're growing exponentially I'd expect that to be a very rare occurrence. In summary, the generalization of place's return value doesn't seem worth it.
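
    For concreteness, the chunked growing writer I have in mind is roughly (hypothetical sketch, sizes illustrative):

    // Hypothetical sketch: growing writer backed by a linked list of fixed-size chunks.
    // Bytes already handed out by place() never move, unlike with buffer doubling.
    struct chunked_writer {
      struct chunk {
        chunk *next = nullptr;
        std::size_t used = 0;
        alignas(std::max_align_t) char bytes[16*1024 - 64]; // stays under a ~16K size class
      };
      chunk *tail;           // chunk currently being filled
      std::size_t total = 0; // logical stream size, for size()

      void* place(std::size_t size, std::size_t align) {
        std::size_t at = (tail->used + align-1) & ~(align-1);
        if(at + size > sizeof(tail->bytes)) {
          /* allocate a fresh chunk, link it in, recompute `at`; the old chunk is untouched */
        }
        void *spot = tail->bytes + at;
        tail->used = at + size;
        total += size; // (padding accounting elided for brevity)
        return spot;
      }

      // At the end: one pass over the chunk list memcpy's into the final network buffer,
      // unless the conduit accepts a fragmented payload directly.
    };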

    Regarding push returning a byte pointer, I vote no. For those bytes to be modifiable by the caller, the caller would require some knowledge about how the type was serialized, which in general they don't have. We could make some guarantee like std::is_trivial<T> types always byte-copy themselves onto the wire, but then that would rule out an implementation doing fancy things like compressing integers (which do tend to have lots of zeros in their more significant bits). This is why place exists. If I need to throw down some bytes "now" but won't know what their contents will be until "later" (after serializing my sub-objects), then I should do this:

    void serialize(Writer &w, T const &x) {
      // UB: size_t *delta = (size_t*)w.place(storage_size_of<size_t>());
      // UB: for explanation see: https://whereswalden.com/tag/memcpy/
      size_t *delta = ::new(w.place(storage_size_of<size_t>())) size_t;
      size_t size_before = w.size(); // total bytes currently in w
      w.push(x.foo);
      w.push(x.bar);
      size_t size_after = w.size();
      *delta = size_after - size_before;
    }
    
    // use the delta to skip over the whole T without deserializing foo and bar
    void skip(Reader &r) {
      size_t delta;
      memcpy(&delta, r.unplace(storage_size_of<size_t>()), sizeof(delta));
      r.jump(delta);
    }
    

    We could consider adding sugar (esp to remove the UB pitfall, but probably only on the "unplace" side), but it's so rarely needed it doesn't seem worth the bother and could always come later. The above code is readable without it.

  6. john bachan reporter
    • edited description

    Added boost compatibility goal. Dan's other goal of types not needing to know serialization characteristics of sub-objects seems too obvious to add.

    Modified API:

    • ADL lookup removed. Replaced with nested serialization class which mimics the upcxx::serialization specialization. User can use one or the other, but not both.

    • Removed upcxx::raw_storage<T> since it's no longer needed to disambiguate ADL.

    • Added upcxx::serialization_complete<T> as uniform way to invoke serialization. upcxx::serialization<T> is for registration only.

    • deserialize and Reader::pop_into now return T* which will match the void* going in, just like placement new returns its memory pointer upcasted: T* = ::new(void*) T;. Disregarding the return of placement new and just reinterpret-casting the void* is actually UB; C++17 fixes this with T* std::launder(T*). See the sketch below.
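
    For illustration (sketch; `spot` stands for suitably sized and aligned storage):

    // The pointer returned by placement new (which deserialize passes back) is the one
    // to use; re-deriving it by casting the original void* is the UB to avoid.
    void *spot = /* storage big and aligned enough for UserType */;
    UserType *ok  = ::new(spot) UserType{/*...*/};                   // what deserialize returns
    UserType *bad = reinterpret_cast<UserType*>(spot);               // UB before C++17
    UserType *ok2 = std::launder(reinterpret_cast<UserType*>(spot)); // C++17 fix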

  7. Dan Bonachea

    Re: Spec development:

    @jdbachan - you are probably unaware that when you edit the initial issue comment, the email notification we get contains a giant and completely unreadable diff, and there is no revision history. Since we seem to be narrowing in on a design, I'd like to request that we move that proto-spec to a wiki or repo pull request, where we have proper change control tools. It can remain as a stand-alone file in markdown format for now if that's most convenient; we can perform the conversion to latex as a later step in the pull request.

    Re: goals:

    I'd like to add one more:

    • Debuggability - the ability to have a diagnostic mode that reports an error when serialization methods behave inconsistently

    With three separate methods (serialize/deserialize/ubound) involved in serialization of a type's fields, there are three blocks of user code that all need to agree on the serialized representation. For example, it would be very helpful if debug mode could diagnose an error when serialization exceeds a provided ubound. Similarly, it would be nice to diagnose when serialize() and deserialize() disagree on the number/order/types of push/pop and place/unplace operations (possibly implemented by appending a hash to the debug packet); use of Reader::jump would probably need to disable this checking for at least the remainder of the current deserialize() invocation, unless we come up with more clever consistency metadata.

    Re: ubound:

    My main concern is not the prefix argument itself (which is inelegant, but tolerable). I'm more worried about advertising a best-practice of serializing fields in order of increasing alignment, because it can lead to pathological wastes of internal padding in some cases, eg:

    struct mytype {  // assume all fields have alignof(T)==sizeof(T)
      int64_t  i8;
      int32_t  i4;
      int16_t  i2[2];
      int8_t   i1;
    };
    

    This type has 17 bytes of actual data. When laid out in memory as shown above (decreasing alignment, a common programming practice ref1 ref2), there is no padding between fields (although sizeof contains trailing external padding). If serialized into an initially-empty/maximally-aligned serialization buffer in decreasing alignment order, this type should similarly incur no internal padding and comprise 17 bytes of data on the wire (for simplicity assume this is the entire payload, meaning the trailing external padding can be dropped). However if the fields are serialized in increasing alignment order into a maximally aligned prefix, it would incur 7 bytes of internal padding, a 41% overhead that cannot be recovered.
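
    Working that case out explicitly (offsets relative to a maximally aligned prefix):

    // Increasing-alignment serialization order for mytype:
    //   offset  0: i1       1 byte
    //   offset  1: padding  1 byte  (align i2[2] to 2)
    //   offset  2: i2[2]    4 bytes
    //   offset  6: padding  2 bytes (align i4 to 4)
    //   offset  8: i4       4 bytes
    //   offset 12: padding  4 bytes (align i8 to 8)
    //   offset 16: i8       8 bytes
    //   => 24 bytes on the wire for 17 bytes of data: 7 bytes of padding, ~41% overhead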

    This is what I mean by saying we are potentially trading external padding for internal padding. It's true a less-aligned prefix may recover some of that loss, but should we complicate our design for that possibility? Even your "lean" example above contains 2 bytes of non-recoverable internal padding - the globally optimal ordering would place the int field first to eliminate internal padding and thus consume the 6 required bytes in the stream instead of 8 (assuming no subsequent object requiring external padding). The optimal field-ordering packing strategy probably depends on a combination of the prefix alignment and field alignments - I don't think an optimal solution is achievable with any simple fixed-order rule (especially while preserving data abstraction boundaries), so we're really just debating about which cases we think are most common/important. It's possible the serialization infrastructure could re-arrange field packing order on-the-fly following a deterministic algorithm (backfilling padding with later smaller fields, possibly even crossing abstraction boundaries), but that would have runtime overhead and complicate skipping.

    Let's plan to discuss this further tomorrow.
