Document UPC++ equivalent to MPI_COMM_SELF

Issue #360 resolved
Max Grossman created an issue

I’m pretty sure the answer is no and I can’t find any existing issues, but please correct me if I’m wrong.

This is mostly an issue when porting legacy MPI code that uses MPI_COMM_SELF.

Today, I’m using this logic to create an equivalent team, which I believe should work:

self_team = upcxx::world().split(upcxx::world().rank_me(), 0);

Comments (13)

  1. Dan Bonachea

    I think you've provided both the question and the answer. I can confirm we do not have a macro or predefined team called "..._SELF", but the one-liner you provided constructs such a team. As written it's not a drop-in replacement, but with a small adjustment you can do something like this:

    #include <assert.h>
    #include <upcxx/upcxx.hpp>
    using namespace upcxx;
    
    team *team_self_p;
    #define UPCXX_TEAM_SELF (*team_self_p)
    
    int main() {
       init();
       team_self_p = new team(local_team().split(rank_me(),0));  // setup a self-team
    
       // use the new team
       assert(UPCXX_TEAM_SELF.rank_me() == 0);
       assert(UPCXX_TEAM_SELF.rank_n() == 1);   
       barrier(UPCXX_TEAM_SELF); 
       dist_object<int> dobj(world().rank_me(), UPCXX_TEAM_SELF);
       assert(dobj.fetch(0).wait() == world().rank_me());   
    
       finalize();
    }
    

    You've marked this issue as a proposal, but haven't presented a strong argument for providing syntactic sugar around this easily-constructed team. To my eyes MPI_COMM_SELF itself is pretty weakly motivated, and mostly involves features like message-passing, ordered delivery and I/O, none of which are relevant to our library.

    I'll also note that under our current GEX architecture, providing this sugar would impose a small but non-zero memory and time cost on upcxx::init() calls for all applications, and I don't see it as sufficiently motivated.

  2. Max Grossman reporter

    I think those are all fair pushbacks. Though MPI_COMM_SELF isn’t commonly used, I do suspect users will run into this again in the future when porting other MPI applications – hopefully this issue can serve as informal guidance on how to implement UPCXX_TEAM_SELF. Is there another document where it would be appropriate to stick this answer? The programming guide?

  3. Dan Bonachea

    I agree that documenting this informal guidance for users could be of interest to a limited audience.

    I think the potential audience is narrow enough that this seems like the kind of thing that might belong in a FAQ, and perhaps that FAQ might eventually become a section in the guide.

    We actually have the beginnings of a FAQ in our wiki, but it's currently incomplete/outdated and thus intentionally orphaned. If someone had available cycles to update this FAQ and make it ready for user consumption, the Q&A in this issue would be a great addition.

  4. Rob Egan

    So, my $0.02 as a developer: if such a concept as upcxx::self_team() is adopted in the API, then it could explicitly implement some of the optimizations for rpc calls to rank_me() which Steve brought up in Slack recently, such as calling the lambda directly and immediately, and a new, special self type of upcxx::view::iterator that passes by reference and does not require a serialization copy, etc.

    Essentially the proposal would be to have the upcxx API implement these optimizations within the rpc* functions, instead of the programmer having to implement them every time an rpc is called in the application, saving lines of code for the developer and avoiding the pitfalls of manually optimizing for the self case.

    Dan’s comments in that slack thread state that such transformations “would be prohibited by the library API semantics (i.e., because it's not transparent)”, but if it is explicitly documented and implemented for upcxx::self_team, where every possible target_rank is rank_me, then it is a small jump to also perform the transformation for every case where the target_rank is the calling process (potentially enabled only by env option or in release builds).

  5. Rob Egan

    And one more, slightly tangential comment: if considering adding additional teams like “self_team” to the spec for upc++, this should be done only if the memory overhead of constructing such a team is a small constant, instead of proportional to the scale of the job. I can imagine that, with some careful work, this “self_team” could be implemented with nothing more than a placeholder and a specialized implementation of upcxx::team, but if it has the same overhead as any other user-created team, then it would not be worth it, in my opinion.

    That being said, if it is possible to make specialized implementations of a team without this excessive overhead, then this “self_team” and possibly even a “node_team” (i.e. team::split(local_team().rank_me(), world().rank_me())) could both be very useful as something baked into the spec.

  6. Dan Bonachea
    • changed component to Teams

    @Rob Egan thanks for your thoughts, we should discuss these design possibilities in more detail, possibly in a different venue. I think there are at least two orthogonal design issues here:

    1. The desire for "lightweight" creation of teams - Rob mentions the possibility of the runtime expanding local_team() with a self_team() or node_team() or others. We could specify the creation of additional special-case teams at UPC++ init time, but I'd much rather add a lighter-weight team constructor (not based on team::split()) that the application can use to construct whatever team it wants with overhead that scales with the new team size instead of the parent team size (as with split()). This already exists in GASNet (we added it to fix the scalability problems with local_team() creation) and I'd like to explore the best way to expose this directly up to UPC++ application code.

    2. Mechanisms to modify/abridge RPC serialization/invocation semantics for the special case of "send-to-self". This requires an in-depth discussion of how exactly the semantics would change and how the application would request the modified behavior.

    Regarding #2, I should note that one can already get pretty close to what I expect is "the best we can do" by replacing:

    auto f = upcxx::rpc(upcxx::rank_me(), callable, args...);
    

    with

    auto f = upcxx::master_persona().lpc([=]() { return callable(args...); });
    

    This bypasses all the serialization overheads and enqueues the callback to run during the next master progress (with a future for completion that can be elided by using lpc_ff instead). The final semantic step would of course be to just synchronously invoke the callback:

    callable(args...);
    

    These are already available.

    If we are going to add syntactic sugar to emulate one or both of these semantic behaviors, I'd need to be convinced that it provides value relative to the existing formulations shown above. I'd also heavily prefer a syntax that does NOT require adding a dynamic "is this loopback" (or equivalently, "is this a special team") branch to the existing RPC critical path, unless the caller has explicitly opted-in to that extra branch with something statically detectable (e.g., a new completion argument, a new team subclass, or a different function name).

    CC: @Steven Hofmeyr
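    The "statically detectable opt-in" idea above could look something like the following plain-C++ sketch. This is illustrative only: loopback_t and invoke_rpc are hypothetical names, not part of the UPC++ API, and the "remote" branch is a stub. The point is that overload resolution on a tag type selects the loopback path at compile time, so the ordinary path carries no extra runtime branch.

    ```cpp
    #include <cassert>

    // Hypothetical opt-in tag (not a UPC++ type): requesting loopback
    // semantics is visible statically, at the call site.
    struct loopback_t {};
    inline constexpr loopback_t loopback{};

    // Ordinary path: a real implementation would serialize and enqueue an
    // rpc here; stubbed as a direct call for this self-contained sketch.
    template <typename Fn>
    auto invoke_rpc(int /*target_rank*/, Fn&& fn) {
      return fn();
    }

    // Loopback path, chosen by overload resolution: synchronous invocation,
    // no serialization, no dynamic "is this self?" check on the other path.
    template <typename Fn>
    auto invoke_rpc(loopback_t, Fn&& fn) {
      return fn();
    }

    int main() {
      assert(invoke_rpc(loopback, [] { return 7; }) == 7);  // static opt-in
      assert(invoke_rpc(3, [] { return 8; }) == 8);         // ordinary path
      return 0;
    }
    ```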

  7. Dan Bonachea

    As of 2021.9.0, we now have team::create() which can implement a self-team more efficiently than team::split().

    Here is the updated example:

    #include <assert.h>
    #include <upcxx/upcxx.hpp>
    using namespace upcxx;
    
    team *team_self_p;
    #define UPCXX_TEAM_SELF (*team_self_p)
    
    int main() {
       init();
       team_self_p = new team(local_team().create(std::vector<int>{local_team().rank_me()}));  // setup a self-team (create() takes ranks relative to the parent team)
    
       // use the new team
       assert(UPCXX_TEAM_SELF.rank_me() == 0);
       assert(UPCXX_TEAM_SELF.rank_n() == 1);
       barrier(UPCXX_TEAM_SELF);
       dist_object<int> dobj(world().rank_me(), UPCXX_TEAM_SELF);
       assert(dobj.fetch(0).wait() == world().rank_me());
    
       finalize();
    }
    

    This issue remains open to document this guidance.

  8. Rob Egan

    for completeness you should include “delete team_self_p;” before finalize(), right? And if one, say instead had a scoped unique_ptr<team> team_self_p, would it cause an error if it were to be destroyed after the finalize()?

  9. Dan Bonachea

    for completeness you should include “delete team_self_p;” before finalize(), right?

    That's optional, but sure. Although before finalize() it would need to be UPCXX_TEAM_SELF.destroy(); delete team_self_p; to avoid an error.

    And if one, say instead had a scoped unique_ptr<team> team_self_p, would it cause an error if it were to be destroyed after the finalize()?

    C++ destruction after finalize is not a problem. However allowing the unique_ptr to destruct the team before finalize() without a call to team::destroy() is an error.

    From the spec:

    [image: team.png]
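    One way to avoid the ordering pitfall discussed above when using a scoped smart pointer is a custom deleter that calls destroy() before deallocation, mirroring "UPCXX_TEAM_SELF.destroy(); delete team_self_p;". The sketch below is plain C++ with a stand-in Team type (not the real upcxx::team, which additionally requires destroy() to be called collectively and before finalize()); it only demonstrates the deleter pattern and verifies the ordering with a counter.

    ```cpp
    #include <cassert>
    #include <memory>

    static int destroy_calls = 0;

    // Stand-in for upcxx::team, to illustrate the ordering pattern only.
    struct Team {
      void destroy() { ++destroy_calls; }  // models upcxx::team::destroy()
    };

    // Custom deleter: call destroy() before the plain C++ deallocation.
    struct TeamDeleter {
      void operator()(Team* t) const {
        t->destroy();
        delete t;
      }
    };

    int main() {
      {
        std::unique_ptr<Team, TeamDeleter> self_team(new Team);
        // ... use the team ...
      }  // deleter runs here: destroy() first, then delete
      assert(destroy_calls == 1);
      // In a real upcxx program, finalize() would follow here, after the
      // scoped pointer (and hence destroy()) has already run.
      return 0;
    }
    ```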
