Provide a dist_array class

Issue #23 resolved
BrianS created an issue

A user can get something like a scalable distributed array object by adding one more member function to dist_object:

template <class T>
class dist_object {
public:
  // asynchronously returns a global_ptr to the neighbor rank's instance
  future<global_ptr<T>> operator[] (rank_t neighbor);
};

Then you can use RDMA on elements in the remote arrays via the global_ptr and our support for pointer arithmetic, e.g. a put to the fifth element of the remote array. The user is responsible for holding the global_ptr locations they need.

rput(15, gptr+5);
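For concreteness, here is a minimal sketch of that bootstrap-then-RDMA pattern written against today's API, using dist_object::fetch in the role of the proposed operator[] (the ring neighbor and array size are only illustrative):

#include <upcxx/upcxx.hpp>
#include <cstddef>

int main() {
  upcxx::init();
  constexpr std::size_t N = 16;
  // each rank publishes the base pointer of an array in its shared heap
  upcxx::global_ptr<double> mine = upcxx::new_array<double>(N);
  upcxx::dist_object<upcxx::global_ptr<double>> arr(mine);

  // one-time bootstrap: fetch a neighbor's base pointer and hold onto it
  upcxx::intrank_t neighbor = (upcxx::rank_me() + 1) % upcxx::rank_n();
  upcxx::global_ptr<double> gptr = arr.fetch(neighbor).wait();

  // critical path: RDMA put to element index 5 of the remote array
  upcxx::rput(15.0, gptr + 5).wait();

  upcxx::barrier();              // don't free memory a peer may still target
  upcxx::delete_array(mine);
  upcxx::finalize();
}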

If users want us to manage the global_ptr table, then:

template<class T, size_t N>
class dist_array: public dist_object<std::array<T,N>>
{
public:
  // check the map, fill it if needed, issue the rput, return a future
  future<> rput(int rank, int element, T val);
  // ...
protected:
  std::map<int, global_ptr<std::array<T,N>>> m_map;
};
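A body for that rput might look roughly like the following. This is purely a sketch: it assumes the operator[] proposed above, assumes the upcxx namespace is in scope (as in the fragments above), and uses upcxx::reinterpret_pointer_cast to step from the std::array pointer to its elements.

template<class T, size_t N>
future<> dist_array<T,N>::rput(int rank, int element, T val) {
  auto it = m_map.find(rank);
  if (it != m_map.end()) {
    // cache hit: we already hold the remote base pointer
    return upcxx::rput(val, upcxx::reinterpret_pointer_cast<T>(it->second) + element);
  }
  // cache miss: fetch the base pointer once, remember it, then issue the put
  // (assumes this dist_array outlives the operation)
  return (*this)[rank].then(
    [this, rank, element, val](global_ptr<std::array<T,N>> base) {
      m_map[rank] = base;
      return upcxx::rput(val, upcxx::reinterpret_pointer_cast<T>(base) + element);
    });
}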

but that might get abused. Like with std::vector, we could have a reserve-like function that tells dist_array to populate its m_map all up front; it depends on the user's use case. A user is free to interpret the linear sequence of ranks and N elements as one big array, which starts looking more like shared_array. Maybe a class like this goes in the tutorials first, and we see whether people cut and paste it into their applications. The augmentation of dist_object seems innocuous.

Comments (14)

  1. Former user Account Deleted

    We could do the operator[] extension. Are we sure we want to? The rput/rget to the gp<T> would only enjoy hw acceleration for trivially serializable T, thus dynamic-length containers are out (this motivates Brian's use of std::array above). And the big question: what value is there in a global_ptr to a scalar? We were thinking of metadata bootstrapping, but this seems like extra trips on the wire. In the gp case you rpc over to get the address, rpc that back, then rget from the gp address after the peer has signalled you it's time to do so (possibly a barrier). Without gp, the user rpc's over, waits on the future, and rpc's the contents back as the result. Sooo much better.
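    For reference, the RPC-only exchange described above is roughly the following under today's API (the payload type is just an example; dist_object::fetch packages up essentially this pattern):

    #include <upcxx/upcxx.hpp>
    #include <array>

    // ask the peer for the *contents* of its dist_object instance directly;
    // no global_ptr round trip, one rpc each way
    upcxx::future<std::array<double,8>>
    fetch_contents(upcxx::dist_object<std::array<double,8>> &data,
                   upcxx::intrank_t peer) {
      return upcxx::rpc(peer,
          [](upcxx::dist_object<std::array<double,8>> &d) { return *d; },
          data);
    }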

    The inheritance seems innocuous, except I was thinking dist_object has pointer semantics and is cheap to copy. The reason is dist_id<T>::when_local() returns a future<dist_object<T>> which can't be the original dist_object, obviously, and I don't want it to be future<dist_object<T>&> (or *) since the original dist_object may have died and only the in-flight rpc's are keeping it alive thanks to a distributed consensus algorithm (like garbage collection).

    The fix: dist_array should probably hold a dist_object internally as a field and abandon the pointer semantics.

  2. Dan Bonachea
    The rput/rget to the gp<T> would only enjoy hw acceleration for trivially serializable T, thus dynamic length containers are out
    

    I think if it can generate a gp that can be used to access elements from a (fixed-length) array of doubles, then that's exactly what you need for bootstrapping something like a regular HALO exchange. Each rank uses the dist_object once at startup to construct gps referencing the border cells of its neighbors, then all the communication in the iterations can use pure zero-copy RDMA to fill ghost cells, with no RPC anywhere in the critical path. A more irregular app like an AMR might use the dist_object to fetch a gp more frequently (e.g. after a regridding) but can still amortize that one RPC over multiple subsequent zero-copy RDMA transfers.
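    A stripped-down 1-D version of that HALO pattern, for concreteness (sizes and the periodic neighbor choice are illustrative; dist_object::fetch stands in for the proposed lookup):

    #include <upcxx/upcxx.hpp>
    #include <cstddef>

    int main() {
      upcxx::init();
      constexpr std::size_t INTERIOR = 128;
      // local cells plus one ghost cell at each end, in the shared heap
      upcxx::global_ptr<double> cells = upcxx::new_array<double>(INTERIOR + 2);
      upcxx::dist_object<upcxx::global_ptr<double>> base(cells);

      upcxx::intrank_t left  = (upcxx::rank_me() + upcxx::rank_n() - 1) % upcxx::rank_n();
      upcxx::intrank_t right = (upcxx::rank_me() + 1) % upcxx::rank_n();

      // one-time bootstrap: gptrs to the neighbors' border cells
      upcxx::global_ptr<double> left_border  = base.fetch(left).wait() + INTERIOR;
      upcxx::global_ptr<double> right_border = base.fetch(right).wait() + 1;
      upcxx::barrier();  // all ranks finish bootstrapping before the loop

      double *mine = cells.local();
      for (int iter = 0; iter < 10; ++iter) {
        // critical path: pure zero-copy RDMA gets, no RPC anywhere
        mine[0]            = upcxx::rget(left_border).wait();
        mine[INTERIOR + 1] = upcxx::rget(right_border).wait();
        // ... stencil update over mine[1..INTERIOR], then synchronize ...
        upcxx::barrier();
      }

      upcxx::barrier();
      upcxx::delete_array(cells);
      upcxx::finalize();
    }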

  3. BrianS reporter

    That's sort of what I'm going for. A user can do the bootstrapping up front, then pound the critical path. When possible, pound the critical path with as much hardware support as is reasonable.

    At the very least, halo exchange should be rput/rget on mdspan. In Multigrid we hit the same ghost transfer dozens of times.

  4. Dan Bonachea

    This issue was triaged at the 2018-06-13 Pagoda meeting and assigned a new milestone/priority.

    We noted this feature has previously been requested by our stakeholders, but the exact requirements were unclear. We also noted this could be provided as a utility library atop UPC++.

    We resolved to approach stakeholders (esp HipMer) to decide on priority for this.

  5. Dan Bonachea

    Here is the current strawman proposal from the AHM:

    namespace upcxx { namespace extras {  // namespace upcxx::extras
    
      template<class T, size_t BS = 0>
      struct dist_array {
    
        // allocate and construct a dist_array of given size/blocking
        // collective over team, but NOT a barrier
        // may eventually add hints to control cache behavior
        dist_array(size_t count, size_t block_size=0,
                   upcxx::team &team = upcxx::world());
    
        // construct an aliased view over dist_array other
        // using a different blocksize
        // collective over other.team(), but NOT a barrier
        template<size_t BS2>
        dist_array(dist_array<T, BS2> &other, size_t block_size=0); 
    
        // indexing operation - asynchronously returns gptr to element
        // cache hit *IS* permitted to return a ready future (common case)
        future<global_ptr<T>> pointer_to(size_t index) const;
    
        // returns base pointer to the slice of elements with
        // affinity to the caller (who must be in this.team())
        T* base_me() const;
    
        // the count of elements with affinity to the caller
        // (who must be in this.team())
        size_t count_me() const;
    
        size_t size() const; // global number of elements
        team &team() const; // The associated team
        dist_id id() const; // TBD: how to convert from id to array?
      };
    }} // namespace
    

    Key features:

    • Written as a contrib library atop UPC++, in a different namespace
      • Does not rely upon UPC++ internals, can be written by any competent UPC++ programmer
      • Gives us the option to change our mind and deploy a completely different distributed array interface later
    • 1-D distributed array with arbitrary blocksize
    • Blocksize can be provided statically (as template parameter BS) or dynamically in the constructor.
    • Default (0) means pure blocked, i.e. a blocksize of ceil( count / ranks_n() )
      • this elides one dynamic mod operation (by ranks_n()) per indexing op
      • a static blocksize may allow the other mod operation (by blocksize) in indexing to be constant-folded away or strength-reduced by the optimizer (see the index-arithmetic sketch after this list)
      • Initial implementation may only support the default layout (Kathy’s requirement)
    • Blocks of the array have a guaranteed layout (same as UPC) allowing well-defined construction of contiguous local views of all local elements, and alternate blocksize views over the same elements
    • Still need to add C++ iterators for the global index space
    • We may want to add a version of pointer_to that avoids the future-construction overhead for the common case of a cache hit
      • Either a "blocking" version that returns global_ptr<T>
      • Or possibly a non-blocking version that returns global_ptr<T> for cache hit and nullptr on cache miss
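    For reference, the index arithmetic behind the blocksize bullets above looks roughly like this (a sketch of the block-cyclic layout only, not the actual contrib implementation):

    #include <cstddef>
    using std::size_t;

    // which rank owns global index i, and where it lands in that rank's slice,
    // for a block-cyclic layout with blocksize BS over R ranks (UPC-style)
    size_t owner(size_t i, size_t BS, size_t R) {
      return (i / BS) % R;      // with pure blocked BS = ceil(count/R),
                                // this reduces to i / BS (no mod by R)
    }
    size_t local_offset(size_t i, size_t BS, size_t R) {
      size_t block = i / BS;    // global block number
      return (block / R) * BS + i % BS;
    }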
  6. john bachan

    I would like to investigate an alternate implementation, one not built over dist_object, to remove the dependence on remote attentiveness for address lookup. Though we are caching addresses, that won't help random-access algorithms. If we build a data structure that is inattentive-tolerant and scalable, we would be offering something the rest of our library does not. This is not a dis of dist_object, which I maintain is still excellent for managing distribution when rpc's are the best communication mechanism.

    So I propose an API possibly similar to the one Dan just posted, but with the following change:

    • Construction becomes an expensive collective that internally distributes the base pointers. At small scale this would be the familiar non-scalable all-to-all that our users usually do. At large scale it would be a sparse all-to-all, in a scheme like the one Dan/Paul plan to use to make team rank translation scalable. This collective need not be blocking; we could just roll the asynchrony of waiting for the initial all-to-all (at least just the part we need for the current lookup) into the future.

    At the heart of this data structure lies a more fundamental building block that solves the unscalable metadata problem. I would like to consider breaking this out as a fundamental upcxx data structure. Semantically it's a read-only co-array: co-array because there is only one value per rank; read-only because it permits us to cache and replicate each rank's value across all peers (unscalable, for small scale) or across just some log(N) peers (scalable).

    template<typename T> // T is trivially serializable
    class dist_meta {
    public:
      dist_meta(team&, T local_value); // communicating collective construction
    
      future<T> operator[](intrank_t peer);
      T const& local_value() const; // & into shared memory
    
      team& team() const;
    
     // NOT NECESSARY: could always wrap in a dist_object<dist_meta<T>> to make rpc-translatable
      dist_meta_id<T> id() const; // like dist_object, participates in rpc translation
    };
    
    // dist_array<T> would then be implemented trivially atop dist_meta<global_ptr<T>>
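    // For illustration only (hypothetical names, pure-blocked layout): one way
    // that might look. dist_meta resolves the owning rank's base pointer and
    // the index math supplies the offset, so pointer_to() needs no rpc on a hit.
    template<class T>
    struct dist_array_via_meta {
      dist_meta<upcxx::global_ptr<T>> bases;  // each rank's local slice base
      std::size_t block;                      // elements per rank (pure blocked)

      dist_array_via_meta(upcxx::team &tm, std::size_t count)
        : bases(tm, upcxx::new_array<T>((count + tm.rank_n() - 1) / tm.rank_n())),
          block((count + tm.rank_n() - 1) / tm.rank_n()) {}

      upcxx::future<upcxx::global_ptr<T>> pointer_to(std::size_t i) {
        upcxx::intrank_t owner = i / block;
        std::size_t off = i % block;
        // may complete immediately when the owner's value is cached/replicated
        return bases[owner].then(
          [off](upcxx::global_ptr<T> base) { return base + off; });
      }
    };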
    
  7. Dan Bonachea

    John and I already discussed this further in person, so just recording additional points here.

    The main observation is that nothing in the dist_meta class John proposes requires special support that isn't already present in the UPC++ public interface for correctness or performance.

    Specifically:

    • The proposed constructor can already exchange metadata densely at small scale (probably using a series of broadcast(team) calls, or eventually a gather_all(team)).
    • At large scale, the constructor uses rpc to exchange global pointers to metadata stored in a shared heap with a scalable number of peers (at least two).
      • That metadata is used to bootstrap later one-sided lookups on cache miss using rget to traverse the distributed data structure.
      • On RDMA-capable networks these lookups will not be sensitive to remote attentiveness
      • (on non-RDMA networks it will rely on remote internal progress, just like any other scheme).
    • In all cases dist_meta should probably use the UPC++ shared-memory bypass features to share the metadata structures between co-located peers (ideally managed with lock-free data structures built on upcxx::atomic_domain)

    All this being said, dist_meta might still be a useful design pattern to embody (i.e. useful for more than just dist_array), but so far nothing prevents all of this from living in the contrib directory.

  8. Dan Bonachea

    Both the spec and the implementation have been merged for the upcoming release.

    Optimization and tuning continues, and new issues should be opened for future enhancements or problems.
