Two-Sided Messages

Issue #24 resolved
Former user created an issue

Thinking about the metadata bootstrap problem. Two-sided messages are the way to go when you know:

  1. Who you need info from.
  2. Who needs info from you.
  3. That the communication pattern isn't so dense as to warrant a collective algorithm.

In this scenario a one-sided scheme like dist_object would add an extra round trip: the receivers have to explicitly request (via RPC) data from the sender, even though by point 2 the sender already knows who the data is destined for.

The API I propose for two-sided is:

#include <unordered_map>
#include <upcxx/upcxx.hpp>

using namespace upcxx;

template<typename Tag, typename Val>
struct tag_matcher {
  // Construction is collective over the team.
  tag_matcher(team &tm): _table(table_type{}, tm) {}
  tag_matcher(tag_matcher const&) = delete;

private:
  // One pending promise per tag. An entry is created lazily by whichever
  // side touches the tag first: the arriving send or the local receive.
  using table_type = std::unordered_map<Tag, promise<Val>>;
  dist_object<table_type> _table;

public:
  // Ship val to peer, where it fulfills the promise filed under tag.
  void send(intrank_t peer, Tag tag, Val val) {
    upcxx::rpc_dist(peer,
      [](dist_object<table_type> &tab, Tag tag, Val val) {
        (*tab)[tag].fulfill_result(std::move(val));
      },
      _table, std::move(tag), std::move(val)
    );
  }

  // Returns a future that readies once the matching send has arrived.
  future<Val> receive(Tag tag) {
    return (*_table)[tag].get_future();
  }
};

Used like so:

// NSEW_t and peers[] are application-side. Construction is collective
// over the team, here the world team.
tag_matcher<NSEW_t, int> tm(upcxx::world());

// Send our counts to the north and east neighbors. From the receiver's
// perspective the data arrives from its south/west, hence the tags.
tm.send(peers[NORTH], SOUTH, north_count);
tm.send(peers[EAST],  WEST,  east_count);

future<int> south_count = tm.receive(SOUTH);
future<int> west_count  = tm.receive(WEST);

when_all(south_count, west_count).wait();
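
One caveat with tags in loops: each promise is fulfilled exactly once, so an iterative exchange needs fresh tags every round. A minimal sketch of one workaround, assuming the NSEW_t enumerators take values 0 through 3 (and noting the table grows by one entry per tag, since fulfilled promises are never erased):

// Hypothetical loop usage: fold the iteration number into an integer
// tag (it*4 + direction) so that each round matches uniquely. Assumes
// NSEW_t enumerators have values 0..3; num_iters is application-side.
tag_matcher<int, int> loop_tm(upcxx::world());

for (int it = 0; it < num_iters; ++it) {
  loop_tm.send(peers[NORTH], it*4 + SOUTH, north_count);
  loop_tm.send(peers[EAST],  it*4 + WEST,  east_count);

  future<int> south = loop_tm.receive(it*4 + SOUTH);
  future<int> west  = loop_tm.receive(it*4 + WEST);

  when_all(south, west).wait();
}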

Comments (12)

  1. Dan Bonachea

    It's cute that one can implement a (weak) message-passing abstraction in a few lines of RPC (and possibly an interesting pedagogical example for UPC++), but I don't think send/recv is a style of programming we want to encourage (at least for critical-path comms). I would add two additional caveats to your three points about when message passing works best:

    1. Both sides in each pair know the size of the payload to be sent (or at least an efficient upper-bound)
    2. Neither side is capable of knowing both the source and destination address.

    One of the key ways that one-sided comm abstractions can outperform message passing (i.e. the main property that Pagoda is banking on to "win") is that when initiators can provide full metadata information, the transfer can usually be completed fully asynchronously in RDMA hardware with no remote CPU involvement or extra data copies.

    MPI implementations have to work very hard to overcome this deficiency, doing crazy things like tag matching in hardware and offloading message rendezvous. GASNet does not and will never expose hardware acceleration for the tag matching used in two-sided messaging, so writing a message-passing abstraction like this in UPC++ is unlikely to be competitive with just making the analogous MPI send/recv call, at least for non-trivial payloads. Even for a scalar payload, the additional overhead of RPC dispatch and use of a generalized data structure for tag matching in your example may lose relative to well-tuned MPI implementations. With a large and/or variable-length payload, this statically eager, multi-copy protocol is almost certainly the wrong algorithm.
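
    To make the contrast concrete, a rendezvous-style variant would ship only metadata through the RPC and let the RDMA hardware move the payload in a single rget. A minimal sketch, with all names illustrative rather than proposed API:

    // Illustrative rendezvous variant: the RPC carries only metadata;
    // the payload then moves with one rget, so the transfer completes
    // in RDMA hardware with no intermediate copies.
    template<typename Tag, typename T>
    struct rendezvous_matcher {
      struct meta { global_ptr<T> src; std::size_t len; };
      using table_type = std::unordered_map<Tag, promise<meta>>;
      dist_object<table_type> _table;

      rendezvous_matcher(team &tm): _table(table_type{}, tm) {}

      // Publish (src,len) to peer; the sender must keep src alive until
      // the receiver has pulled it (a real protocol adds that handshake).
      void send(intrank_t peer, Tag tag, global_ptr<T> src, std::size_t len) {
        upcxx::rpc_dist(peer,
          [](dist_object<table_type> &tab, Tag tag, meta m) {
            (*tab)[tag].fulfill_result(m);
          },
          _table, tag, meta{src, len});
      }

      // Ready once the payload has landed in dst (capacity max_len T's).
      future<> receive(Tag tag, T *dst, std::size_t max_len) {
        return (*_table)[tag].get_future().then(
          [=](meta m) { return upcxx::rget(m.src, dst, std::min(m.len, max_len)); }
        );
      }
    };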

    As a side note, this example also fails to enforce unique/ordered delivery, which is usually part of a sane messaging API (i.e. consider what might happen when the send/recv calls in the example code execute in a loop body containing no other synchronization points). This can of course be remedied, at the cost of additional overhead.

  2. BrianS

    I'm not sure having a user-level tag-matcher device is necessary. Bootstrapping is usually a problem of getting to see inside my `this` object on the other ranks to access the required global_ptr, so that I can later pound the global_ptr with either big data or high-rate operations. The bootstrapping is something we want to amortize, delivering global_ptrs that are the fast channel when possible. Asking a remote dist_object for its member global_ptr seems easier. User-space tag matching is just tedious, even when you give people good tools. If we pay a round-trip cost to set up an amortized operation, it is likely not a big cost. Perhaps we keep two-sided in our hip pocket until we discover bootstrapping is the performance bottleneck?
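
    For comparison, the round trip described above might look like this minimal sketch (my_state, ghost_zone, ghost_n, and peers are hypothetical application-side names, and it assumes rpc_dist returns the callback's result as a future):

    // One round trip fetches the neighbor's landing-zone pointer out of
    // its dist_object; every later exchange is a direct rput/rget on it.
    struct my_state {
      global_ptr<double> ghost_zone;
    };

    dist_object<my_state> state(my_state{upcxx::new_array<double>(ghost_n)});

    future<global_ptr<double>> north_zone =
      upcxx::rpc_dist(peers[NORTH],
        [](dist_object<my_state> &s) { return s->ghost_zone; },
        state);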

  3. Former user (Account Deleted), reporter

    This is the problem I'm trying to solve: given the tools GASNet provides, what is the most efficient and intuitive mechanism to distribute metadata so that users can later proceed with one-sided communication.

    I think this solution nails it for the case of nearest neighbors with small metadata.

    Dan, your comments helped very little with my mission. Used in the correct context, this incarnation of two-sided messages uses one AM per message for the transmission, plus two hashtable lookups on the remote side (one at send arrival, the other at receive). How do you suggest I do better than that?

    Only one message may ever be sent to a given choice of (tag_matcher, rank, key). Enforcing that would be cheap and easy in a debug build. Ordering is out the window. Dan, maybe you'd prefer I call this something besides two-sided messaging?
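
    For concreteness, here is one shape that debug-build check could take (a sketch only, assuming the standard NDEBUG convention, <cassert>, and the future::ready() query):

    // Sketch: send() with a debug-only at-most-once check per key.
    void send(intrank_t peer, Tag tag, Val val) {
      upcxx::rpc_dist(peer,
        [](dist_object<table_type> &tab, Tag tag, Val val) {
          promise<Val> &pro = (*tab)[tag];
    #ifndef NDEBUG
          // The promise may exist because receive() ran first, but it
          // must not have been fulfilled by an earlier send to this key.
          assert(!pro.get_future().ready() &&
                 "duplicate send to (tag_matcher, rank, key)");
    #endif
          pro.fulfill_result(std::move(val));
        },
        _table, std::move(tag), std::move(val)
      );
    }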

    @bvstraalen, I think this is vastly easier to use than the round-trip RPCs of dist_object. There are no remotely invoked lambdas! That will please a lot of users. And the fact that the tag type is generic means applications can use a tagging scheme that's comfortable for them (like (i,j,k) tuples) and not have to do the tedium of mapping everything to the integer space.
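
    One practical footnote on generic tags: std::unordered_map needs a hash for the tag type, so an (i,j,k) tag wants a small std::hash specialization. A sketch, with ijk as a stand-in application type:

    // Minimal hashable (i,j,k) tag type for use with tag_matcher.
    struct ijk { int i, j, k; };

    bool operator==(ijk a, ijk b) {
      return a.i == b.i && a.j == b.j && a.k == b.k;
    }

    namespace std {
      template<> struct hash<ijk> {
        size_t operator()(ijk t) const {
          size_t h = hash<int>()(t.i);
          h = h*31 + hash<int>()(t.j);
          return h*31 + hash<int>()(t.k);
        }
      };
    }

    tag_matcher<ijk, int> grid_tm(upcxx::world()); // tags name grid offsets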

  4. b

    This seems pretty similar to HPX's channel/receive_buffer, which is used for things like decentralized async ghost-zone exchange. The two seem pretty isomorphic to me.

    The channel primitive is pretty fundamental in HPX, so I am on board with including this, John.

    I think people are jumping to conclusions here because the word tag was used.

    Dan: RE: ordered delivery - why?

    TL;DR I support this.

  5. Scott Baden

    I think it is safe to assume that we will amortize the cost of bootstrapping, as it is a way of implementing the executor model. That amortization fails if the metadata changes too frequently, but I think the common case for us will be that it doesn't. If we can express bootstrapping through UPC++ features that we've already proposed, I would be in favor of going this route, as Brian mentioned. We can always change our minds. HPX receive buffers rely on mutual exclusion, and presumably our solutions will incur some equivalent cost.

    We are performing a sparse gatherv, and perhaps we can take advantage of some common cases:

    https://htor.inf.ethz.ch/publications/img/hoefler-sparsecolls.pdf

    Running at scale, even a 5x5x5 stencil neighborhood would be sparse, though I don't know how the network will view that.

    But I think we should stay away from using the term "2-sided", as that will confuse our users.

  6. BrianS

    I would like to put "two-sided" or "channels" or the desired abstraction into V2.0. There are indeed maximal-performance designs that are two-sided. In Chombo we choreograph two-sided asynchronous communication. From the MPI side it is optimal, but on the application side we replicate metadata and perform a LOT of redundant computation to manage this. With current-generation interconnects it is worth it, since chip clocks stayed far ahead of network latency. As Dennard scaling ends and HPC interconnect latencies approach microseconds, this trade-off is less obvious.

    For trivial compute geometry (PDE in a cube), the 27-neighbor special case can be profiled and optimized. I agree with Scott: there are specific computational patterns we need to enable. We currently lack a mechanism to amortize domain-specific patterns. That is an optimization I would like to see laid over our RMA library.

  7. Former user (Account Deleted), reporter

    I'm all for deferring two-sided to a later spec version since it is separate from the core, but I would like to re-emphasize that I see it mostly as a programmer-productivity feature for sharing the metadata (global_ptrs, sizes, etc.) that is needed before RDMAs can happen. The nice things are that it is devoid of function shipping, devoid of the race condition on dist_object creation, and it generalizes a universally understood idiom from MPI. There are apps where knowing which receives to post poses no additional burden on their metadata facilities (mesh-based PDEs), hence I would like to be as accommodating as possible.
