RPC targeting via (team, rank)

Issue #21 resolved
BrianS created an issue

Referencing a remote team: when do we want it, and how do we do it? One place we might need it is to access the atomic domain of a team in order to issue a remote atomic inside an RPC call. Anything else? Maybe not. I would hope a user would not issue a collective call inside an RPC. Perhaps in a continuation? Compute a bunch of values; when they are ready, take their norm across your team, compute your next stable timestep, execute further in time... we are starting to look like a scheduler, or at least a DAG builder, again.

Atomic domain creation might require collective construction, and collectives might require state data to be created at team creation time.

Comments (20)

  1. Dan Bonachea

    I was assuming that RPC send would name the target thread by a tuple of (team, rank) - to support modular programming, so you can write a library to compute something in parallel (possibly using RPCs) over a team provided by the caller, without that library needing to know or care about the global ids of the threads that are members of the provided team. This encapsulation of the naming for thread subsets is one of the main motivations for teams (in addition to the more obvious ones of team collectives and team allocation).

    In this case any RPC lambda that sends an additional RPC within its body needs a way to reference a team while running at the remote target.
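    For illustration, a nested RPC might look something like the sketch below. None of these names are settled API: the (team, rank) rpc overload, the serializable team_id handle, and its here() lookup are all assumptions made for the sake of the example.

```cpp
// Sketch only -- assumes an rpc overload taking (team, rank), plus a
// hypothetical serializable team_id whose here() resolves to the local
// representative of the team on the target process.
upcxx::team_id id = my_team.id();          // hypothetical globally-meaningful name
upcxx::rpc(my_team, peer_rank,
  [](upcxx::team_id id) {
    upcxx::team &t = id.here();            // hypothetical lookup at the target
    // second hop, named relative to the same team:
    upcxx::rpc(t, (t.rank_me() + 1) % t.rank_n(), []() { /* ... */ });
  },
  id);
```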

  2. Former user Account Deleted

    Can I ask for some of the justifications for why gasnetex is going with the tupled API (team,ix) instead of the more traditional flat (global-ix) API for peer-to-peer ops? Considering that we aren't exposing multiple AM-addressable endpoints per rank, flat might still make the most sense.

  3. Dan Bonachea

    Actually the (team, rank) API is the more 'traditional' one, in that it's exactly what MPI uses, and we are adopting it for many of the same underlying reasons. This EX decision has already been finalized.

  4. Former user Account Deleted

    That's great that it's finalized, but it would still be a big help if you could share the motivating concerns that brought you to this decision. I would like to walk through the same mental exercise for upcxx.

  5. Former user Account Deleted

    To comment on @bvstraalen's points:

    Like block-cyclic arrays, teams are collectively constructed distributed objects. Users who write algorithms leveraging RPCs within a team will need a universal id (or at least universal within the team) for naming the team, so that local instances can be found. Since teams are definitely going into the spec, this probably places high priority on resolving how we deal with distributed objects. But I vote that team be a subclass of dist_object, so we should head back to issue 17 and nail that down.

    My gut tells me to avoid including teams in RPCs and other p2p ops, since it would force the user to deal with shipping around team identifiers, or to put a bunch of upcxx::world_team() calls in unnecessarily (of course, there are always default values for function arguments). Also, (team,ix) names can alias each other non-trivially. Flat integers don't, and they're really convenient for use as keys in hashtables, serializing to bytes, etc.

  6. Dan Bonachea

    Not sure this is the right forum to discuss EX design.

    However, here are some GASNet-client-facing reasons:

    1. The most important user-facing reason why p2p comms use teams is to provide encapsulation of the naming for thread subsets for composability and modularity of user-written library code, as described in my first comment.
    2. Makes it easy for an application to re-order its ranks (or a subset of its ranks) to match the problem.
    3. It's especially nice for divide and conquer, for example after creating a team for your sub-cube in a 3D domain decomposition you can trivially iterate over the other members of your partition without any fancy/error-prone arithmetic to compute absolute node indexes.
    4. Apps that use subset collectives already need a team for naming their subset, and it makes sense for p2p comms to use the same rank-naming abstraction.
    5. It enables "flexible" jobs, where a team creation can alter the number of p2p addressable ranks. This is crucial for fault recovery, as it provides the ability to "shrink" a job and/or renumber ranks during recovery from a node failure. (not part of UPC++'s current roadmap, but EX is designing for the future)
    6. More importantly for current ECP goals, it gives a natural way to express "growing" the number of p2p addressable ranks beyond the primordial team. We expect UPC++ to use exactly this mechanism to create a team including the GPU endpoints (which UPC++ will need to use at least under the covers in p2p operations that want to take advantage of GPUDirect acceleration). It will also be needed if UPC++ ever creates endpoints for threads (eg an endpoint for a hidden progress thread).
    7. Provides natural expression of the local endpoint to be used for a comm (although this is not fundamental).

    Inside GASNet it also makes a lot of sense, for various reasons I won't go into here.

  7. Dan Bonachea

    Finally GASNet clients who aren't interested in any of the above benefits can put the primordial team handle in a global variable and macro it away to simulate the global-id behavior (this is exactly what the GASNet-1-over-EX compatibility layer will do).

    UPC++ could take that approach, at least at the user-level, but I'd encourage you to at least consider exposing the (team,rank) naming for the productivity benefits I enumerated.

    As far as the indexing benefits of a simple integer, any team-specific data structures can be indexed simply by the rank integer. The same is true for an app that chooses not to use teams. Also given an appropriate team representation, UPC++ could generate a unique integer id that corresponds to each team, and possibly even use that as the user-level team identifier (so the key is 2 ints). Alternatively, UPC++ could provide a utility that maps a (team,rank) back to the rank in the primordial team.
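    To make the hashtable-key point concrete, here is a minimal standalone sketch (plain C++, no upcxx) of the proposed two-int (team id, rank) key. The small-integer team id is an assumption for illustration, not a spec'd API:

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Hypothetical 2-int key: (team id, rank within team). The team id here is
// an assumed small integer assigned at team construction, not a spec'd API.
using team_rank_key = std::pair<std::uint32_t, std::uint32_t>;

struct team_rank_hash {
  std::size_t operator()(team_rank_key const &k) const {
    // Pack the two 32-bit ints into one 64-bit value and hash that.
    std::uint64_t packed = (std::uint64_t(k.first) << 32) | k.second;
    return std::hash<std::uint64_t>{}(packed);
  }
};

// Example table keyed by (team id, rank), e.g. per-peer bookkeeping.
using peer_table = std::unordered_map<team_rank_key, int, team_rank_hash>;
```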

  8. BrianS reporter

    If the local team is lambda-captured for an rpc, then it needs to be serializable. The team_t in GASNet is the likely desired value to transmit, which means we still need the lookup on the remote side if we want to reference the correct team, or the atomic domain within that team.

  9. Dan Bonachea
    Brian said: "The team_t in GASNet is the likely desired value to transmit."

    This won't work, for the same reason that you can't send an MPI_Comm on the wire: team_t names the local representative of the global team, and is not meaningful on another process, even if the data structure were serializable (which it won't be). What you need is a globally meaningful team name, which I think UPC++ will need to construct during team construction.

  10. Former user Account Deleted

    I was thinking of this syntax:

    // in upcxx::
    struct team {
      // returns primordial index of given team-mate index
      intrank_t operator[](intrank_t index) const { /* ... */ }
    };
    
    // user code
    upcxx::rpc(my_team[ix], <lambda>);
    
  11. Former user Account Deleted

    My proposal for the team API follows. I thought carefully about how much of dist_object a team can leverage. We want to maximize this so that dist_object utilities (like rpc_dist) immediately apply to teams.

    namespace upcxx {
    
    // Users can see this type but should never utter it. I still want it in the
    // spec, since we can reduce the semantics of teams to dist_object</*some-type*/>
    // and save a lot of redundant semantic description.
    struct team_state;
    
    using team = dist_object<team_state>;
    using team_id = dist_id<team_state>;
    
    struct team_state {
    private:
      team_state(); // upcxx runtime only
    public:
      ~team_state();
      team_state(team_state const&) = delete; // can't copy
      team_state(team_state&&); // move is ok
    
      intrank_t rank_n() const;
      intrank_t rank_me() const;
    
      // to world index
      intrank_t operator[](intrank_t mate_index) const;
      // from world index
      intrank_t from_world(intrank_t world_index) const;
    
      // not necessary, provided by dist_object<team_state> (aka team)
      //team_id id() const;
    
      // split_strided: a restricted form of mpi_comm_split that can be implemented
      //   without communication.
      // Collective over parent (this).
      // All ranks must supply same argument values.
      // To translate this to a mpi_comm_split(color,key):
      //   color = (rank_me() / div) % mod
      //   key = rank_me()
      // General mapping from (color,key) back to (div,mod) not possible.
      // For non-rectangular sized teams, this will put all the remainder ranks in a
      //   dangling team on the end. We'll probably want an alternative that packs
      //   some number of leading subteams with a +1 rank.
      // Yes, return is a dist_object by value. Thank you move semantics.
      team split_strided(intrank_t div, intrank_t mod);
    
      // we will likely want more forms of split...
    };
    
    // specialize dist_object<team_state> (aka team)
    template<>
    struct dist_object<team_state> {
      // Includes everything from regular dist_object<T> with T=team_state
    
      // Then re-exposes all of team_state as regular methods so users can use dot "."
      // instead of arrow "->".
      intrank_t operator[](intrank_t mate) const {
        return (*this)->operator[](mate);
      }
      // etc...
    };
    
    // upcxx:: free functions
    team& world();
    intrank_t rank_me(); // world implicit
    intrank_t rank_n(); // world implicit
    
    // barriers
    void barrier(team& = world()); // sync
    future<> barrier_async(team& = world()); // async
    
    // Async barriers are less common than sync barriers. Users expecting barrier()
    // to be sync when it actually returns a future would write buggy code like:
    //   upcxx::barrier(); // user accidentally drops the future
    // which would be terribly misleading.
    
    // more collectives
    // ...
    
    } // namespace upcxx
    
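    The (div, mod) to (color, key) translation documented in split_strided can be sanity-checked with plain integer arithmetic. This standalone sketch (no upcxx needed) just evaluates the formula from the comment:

```cpp
#include <cassert>

// color = (rank_me / div) % mod, per the split_strided comment above.
// Ranks with the same color land in the same subteam; key = rank_me keeps
// their relative order. Standalone sketch; not the upcxx implementation.
int split_color(int rank_me, int div, int mod) {
  return (rank_me / div) % mod;
}
```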
    We want team construction to be cheap whenever possible. Users that employ teams just to manage the index space for point-to-point calls would ideally experience no communication overhead for using teams. How far towards this goal will gasnet_team_t take us?

    Supposing that gasnet_team_t construction is expensive (i.e., it communicates), I see two mitigation strategies:

    1. If gasnet team construction is only collective over the subteam, then we can defer gasnet_team_t construction until the first collective call on the upcxx::team. All point-to-point would be done with the primordial gasnet_team_t.

    2. If gasnet team construction is split-phase, then we can defer blocking for its construction until the first collective call on the upcxx::team. Again, point-to-point is done against the primordial team.

  12. Dan Bonachea

    Paul and I discussed this proposal briefly this week, a few notes:

    1. GASNet team construction (e.g. team split, team create, etc.) will always be collective over the parent team, just like in MPI, and for the same semantic reasons. The special case of "additive" team construction will probably be collective over the primordial team0.
    2. Except for a few special cases, team construction will generally require communication (usually something like a small payload gather-all over the parent team).
    3. Once a subteam is constructed, using it for collectives or p2p should be just as efficient as using team0.
    4. Bottom line, teams are somewhat expensive to construct and have a non-empty memory footprint (eg dedicated network-level buffers) - they are intended to amortize overheads into construction that are then leveraged to optimize many uses in the critical path. They should be lighter-weight than an MPI communicator (mostly because they avoid overheads needed to implement message-passing isolation), but thinking of them as very "cheap" objects is probably the wrong model.
    5. Team construction will not be split-phase (at least in the near-term), due to the implementation and semantic complexities that would entail. In fact, team construction where the parent team's local endpoint is multi-threaded is likely to enforce mutual exclusion to prevent semantic ambiguity, although that's probably not an issue for UPCXX's intended usage.

    Other comments:

    1. We should eventually discuss how you plan to implement team.operator[] and from_world, since that needs to be done carefully at scale. We may expose a gasnet-level query to perform rank translation, but in general it will need the ability to communicate at scale (eg it wouldn't be callable in AM handler context).
    2. Your proposed barrier_async looks suspiciously like a clock. Hopefully the semantics will prohibit one thread from issuing a second outstanding async barrier over a given team before waiting on the future of the first. GASNet's async barriers will definitely require that.
  13. Former user Account Deleted

    Response to "Other comments":

    1. For now I was hoping to restrict upcxx teams to those constructed from bijective maps between (parent_ix) <-> (sub_team, sub_ix). These maps would have low information content (just a few stride vals, offsets, etc.), so replicating them everywhere is cheap. Nesting teams would compose these maps. Mapping to primordial/ancestral indices is then simple.

    2. I would like multiple barrier_asyncs to be overlappable; perhaps that is a clock. I was hoping named barriers with pollable handles would give me this nicely.
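    The affine-map idea in point 1 can be sketched in a few lines of standalone C++ (no upcxx; names are illustrative): each subteam carries parent_ix = offset + stride * sub_ix, and nesting composes two such maps into another affine map, so index translation up to an ancestor stays O(1) with no communication:

```cpp
#include <cassert>

// Sketch of the "low information content" index maps described above:
// a subteam stores an affine map  parent_ix = offset + stride * sub_ix.
// Illustrative only, not the upcxx API.
struct strided_map {
  int offset, stride;
  int to_parent(int sub_ix) const { return offset + stride * sub_ix; }
};

// Compose child->parent with parent->grandparent into child->grandparent;
// the result is again an affine map with the same tiny representation.
strided_map compose(strided_map parent, strided_map child) {
  return { parent.to_parent(child.offset), parent.stride * child.stride };
}
```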

  14. Amir Kamil

    We should also consider adding team scopes, as in Titanium and adopted by DASH and HCAF. This supports the use case of handing a team of ranks off to a library, which then treats it as the world, without requiring the library to actually be team aware. It also makes it easier to avoid erroneous use of team collectives.

    With team scoping, it would make sense to add an implicit team argument to p2p ops, which would default to the current team. This is what we prototyped in UPC++ v0.1.

  15. Former user Account Deleted

    Seems to me the semantics of this get really gnarly in the presence of RPCs and continuations.

  16. Paul Hargrove

    GASNet-EX team support has reached the point that we have an initial specification and implementation of team support analogous to UPC++'s team::from_world. One important observation: team::from_world has progress level "unspecified between none and internal", but the corresponding GASNet-EX interface specification explicitly reserves the right to perform blocking communication. That would make use of team::from_world in a callback a problem. Side-stepping that is a (strong, IMHO) motivation to add interfaces for rpc taking a (team, rank) pair (which GASNet-EX will support by the end of this month).

  17. Dan Bonachea

    This issue was triaged at the 2018-06-13 Pagoda meeting and assigned a new milestone/priority.

    In a discussion after the meeting, Dan, Paul, and John agreed we should add these overloads in the upcoming release of the spec/impl, even if the current implementation is the trivial one-liner. The rationale is that we believe (team, rank) naming will be important for the performance of apps mixing RPC with teams at exascale, and we want to provide app programmers with a scalable interface immediately, even if the scalable implementation is deferred.
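    For illustration, the trivial one-liner could look something like the sketch below (signatures illustrative, not the final spec): translate the team-relative rank to a world rank via the team's indexing operator and forward to the existing world-rank overload.

```cpp
// Sketch of the trivial implementation: t[rank] maps the team-relative
// rank to a world rank, and the call forwards to the existing overload.
// Names and signatures are illustrative assumptions, not the spec.
template<typename Fn, typename ...Args>
auto rpc(upcxx::team &t, upcxx::intrank_t rank, Fn &&fn, Args &&...args) {
  return upcxx::rpc(t[rank], std::forward<Fn>(fn), std::forward<Args>(args)...);
}
```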
