dist_object should provide an accessor for copies of other ranks' portion of the dist_object

Issue #89 resolved
Amir Kamil created an issue

The following is likely to be a common bootstrapping paradigm:

dist_object<global_ptr<T>> pointers(my_gptr);
auto f = rpc(some_rank, [](dist_object<global_ptr<T>> &dobj) {
  return *dobj;  // copy of some_rank's global_ptr
}, pointers);
...
wait(f);
global_ptr<T> remote = f.result();

We should provide a method for doing the RPC to obtain a copy of another rank's piece of the dist_object. Suggestion:

template<typename T>
future<T> dist_object<T>::get(intrank_t rank) const; // or operator[] or at()

Preconditions: rank must be a valid ID in the team associated with the dist_object. T must be Serializable. The dist_object must not have been destroyed on rank.

Returns a future representing the value of rank's portion of the distributed object.
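
To make the proposed semantics concrete, here is a minimal single-process sketch that uses only the standard library. It is not the UPC++ API: `mock_dist_object` is a hypothetical stand-in in which slot i plays the role of rank i's piece, `std::future` substitutes for `upcxx::future`, and a deferred task stands in for the RPC round trip. The mock throws on an invalid rank, whereas the proposal above treats a valid rank as a precondition.

```cpp
#include <cassert>
#include <future>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical single-process stand-in for dist_object<T>.
// Slot i of pieces_ plays the role of rank i's piece. Not the UPC++ API.
template <typename T>
class mock_dist_object {
public:
    explicit mock_dist_object(std::vector<T> pieces) : pieces_(std::move(pieces)) {}

    // Models the proposed get(rank): returns a future holding a copy of
    // rank's piece. The deferred task stands in for the RPC round trip.
    std::future<T> get(int rank) const {
        // Assumption: the mock reports a bad rank by throwing; the proposal
        // only states rank validity as a precondition.
        if (rank < 0 || rank >= static_cast<int>(pieces_.size()))
            throw std::out_of_range("rank is not in the team");
        return std::async(std::launch::deferred,
                          [this, rank] { return pieces_[rank]; });
    }

private:
    std::vector<T> pieces_;
};

// Usage: fetch a copy of rank 2's piece and wait on the future.
inline int fetch_rank2_example() {
    mock_dist_object<int> d(std::vector<int>{10, 20, 30, 40});
    std::future<int> f = d.get(2);
    return f.get();
}
```

The caller's side collapses to one call plus a wait, which is the convenience the accessor is meant to provide over the hand-written RPC.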

We should also consider adding an accessor to obtain a reference to the team over which the dist_object was created.

Comments (6)

  1. Amir Kamil reporter

    When discussing dist_objects with @yelick, she mentioned that she would like us to provide convenience mechanisms for reading and writing remote pieces of the dist_object. She would also like this to turn into RDMA ops where possible. So in addition to get() above (which we should probably rename to rget()), we should provide an rput() method as well:

    template<typename T>
    future<> dist_object<T>::rput(intrank_t rank, T value);
    

    Other things we should consider:

    • Specify that the underlying datum of a dist_object lives in the shared segment and provide a method for obtaining a global_ptr to it.
    • Implement a cache of the translation table, so only the first rput/rget targeting a remote rank requires an AM, with subsequent ops using RDMA.
    • Provide some type of support for arrays, so that dist_object can implement a coarray. Then provide versions of rput/rget that can source or target a rank and dist_object. It's not clear what this should look like.

    The higher-level point is that we want to make dist_object as easy to use as possible and provide good performance, without giving up scalability.
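
The translation-table cache idea in the list above can be sketched in plain C++ without any UPC++ calls. Everything here is an assumption for illustration: `translation_cache` is a hypothetical name, `Addr` stands in for `global_ptr<T>`, and the `slow_lookup` callback stands in for the AM round trip that asks the target rank for its address.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <unordered_map>
#include <utility>

// Hypothetical per-process cache mapping rank -> remote address of its piece.
// Addr stands in for global_ptr<T>; slow_lookup stands in for the AM that
// asks the target rank for its address. Not the UPC++ API.
template <typename Addr>
class translation_cache {
public:
    using lookup_fn = std::function<Addr(int)>;

    explicit translation_cache(lookup_fn slow_lookup)
        : slow_lookup_(std::move(slow_lookup)) {}

    // Returns the address for rank, paying for the slow lookup only on a miss.
    const Addr& resolve(int rank) {
        auto it = table_.find(rank);
        if (it == table_.end()) {
            ++misses_;
            it = table_.emplace(rank, slow_lookup_(rank)).first;
        }
        return it->second;
    }

    std::size_t misses() const { return misses_; }  // number of slow lookups
    std::size_t size() const { return table_.size(); }

private:
    lookup_fn slow_lookup_;
    std::unordered_map<int, Addr> table_;
    std::size_t misses_ = 0;
};

// Usage: two accesses to rank 3 and one to rank 5 cost two slow lookups.
inline std::size_t cache_miss_example() {
    translation_cache<long> cache([](int rank) { return 1000L * rank; });
    cache.resolve(3);
    cache.resolve(3);
    cache.resolve(5);
    return cache.misses();
}
```

Note that this unbounded table grows with the number of distinct ranks contacted, so a real design would need to weigh it against the scalability goal.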

  2. Dan Bonachea

    All of this sounds worth providing to users, but we should consider whether it belongs in a different class. dist_object methods currently never require communication and can be guaranteed to use constant space (per process) and run in constant time, which seems like a nice property to maintain. The class is minimal but fast and can be used to efficiently build higher-level constructs that may be more expensive.

    Perhaps there should be a dist_array<T>, implemented over the dist_object API (probably as a dist_object<global_ptr<T>>), that adds communication operations and caching? This class's methods would often require communication and could consume non-trivial space, depending on the caching mechanism. It could even expose algorithmic options to adjust the space/time/scalability tradeoff (e.g., use a collectively constructed full directory for jobs under P processes, and a cache with Q entries for larger jobs). It should also provide the global naming capability of dist_object (and RPC tie-in), either via inheritance or wrappers.
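
One way to picture the space/time tradeoff described above is a threshold rule plus a bounded cache. This is only a sketch under stated assumptions: `directory_policy`, `choose_policy`, and `bounded_cache` are hypothetical names, least-recently-used eviction is one arbitrary choice of replacement strategy, and `Addr` stands in for `global_ptr<T>`.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <unordered_map>
#include <utility>

enum class directory_policy { full_directory, bounded_cache };

// Hypothetical threshold rule: jobs under P processes replicate every
// address; larger jobs keep a bounded cache per process.
inline directory_policy choose_policy(std::size_t job_size, std::size_t P) {
    return job_size < P ? directory_policy::full_directory
                        : directory_policy::bounded_cache;
}

// LRU cache holding at most `capacity` (Q) rank -> address entries.
template <typename Addr>
class bounded_cache {
public:
    explicit bounded_cache(std::size_t capacity) : capacity_(capacity) {}

    // Returns a pointer to the cached address, or nullptr on a miss.
    const Addr* find(int rank) {
        auto it = index_.find(rank);
        if (it == index_.end()) return nullptr;
        order_.splice(order_.begin(), order_, it->second);  // mark most recent
        return &it->second->second;
    }

    void insert(int rank, Addr addr) {
        auto it = index_.find(rank);
        if (it != index_.end()) {  // refresh an existing entry
            it->second->second = std::move(addr);
            order_.splice(order_.begin(), order_, it->second);
            return;
        }
        if (capacity_ == 0) return;
        if (order_.size() == capacity_) {  // evict the least recently used
            index_.erase(order_.back().first);
            order_.pop_back();
        }
        order_.emplace_front(rank, std::move(addr));
        index_[rank] = order_.begin();
    }

    std::size_t size() const { return order_.size(); }

private:
    using entry_list = std::list<std::pair<int, Addr>>;
    std::size_t capacity_;
    entry_list order_;  // front = most recently used
    std::unordered_map<int, typename entry_list::iterator> index_;
};
```

With Q fixed, per-process space stays constant regardless of job size, at the cost of repeating the lookup for evicted ranks.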

  3. Former user Account Deleted

    So this is fetch(). I think it went unnoticed, but fetch() already exists as a method on dist_object in the implementation. It can be removed if we prefer the free-function form upcxx::fetch, but that is subject to our "using namespace upcxx" calamity of occupying useful words that other libraries might take offense to (see how future.wait() was preferred to upcxx::wait(future)).

    The RDMA-into-a-dist_object capability is not on the table, since it would require us to internally address the bootstrapping problem of sharing the global_ptrs, which is exactly what we're solving with dist_object. As Dan said, let's keep this thing focused.
