Document UPC++ equivalent to MPI_COMM_SELF
I’m pretty sure the answer is no and I can’t find any existing issues, but please correct if I’m wrong.
This is mostly an issue when porting legacy MPI code that uses MPI_COMM_SELF.
Today, I’m using this logic to create an equivalent team, which I believe should work:
```cpp
self_team = upcxx::world().split(upcxx::world().rank_me(), 0);
```
Comments (13)
-
reporter I think those are all fair pushbacks. Though MPI_COMM_SELF isn’t commonly used, I do suspect users will run into this again in the future when porting other MPI applications – hopefully this issue can serve as informal guidance on how to implement UPCXX_TEAM_SELF. Is there another document where it would be appropriate to put this answer? The programming guide?
-
I agree that documenting this informal guidance for users could be of interest to a limited audience.
I think the potential audience is narrow enough that this seems like the kind of thing that might belong in a FAQ, and perhaps that FAQ might eventually become a section in the guide.
We actually have the beginnings of a FAQ in our wiki, but it's currently incomplete/outdated and thus intentionally orphaned. If someone had available cycles to update this FAQ and make it ready for user consumption, the Q&A in this issue would be a great addition.
-
- changed component to Documentation
-
- changed milestone to 2021.3.0 release
Mass roll-over of open issues to next release milestone
-
So my $0.02 as a developer: if such a concept of upcxx::self_team() is adopted in the API, then it could explicitly implement some of the optimizations for rpc calls to rank_me() which Steve brought up in Slack recently, such as calling the lambda directly and immediately, and a new, special self type of upcxx::view::iterator that passes by reference and does not require a serialization copy, etc.
Essentially the proposal would be to have the upcxx API implement these optimizations within the rpc* functions, instead of the programmer having to implement them every time an rpc is called in the application, saving lines of code for the developer and avoiding the pitfalls of manually optimizing for the self case.
Dan’s comments in that Slack thread state that such transformations “would be prohibited by the library API semantics (i.e., because it's not transparent)”, but if it is explicitly documented and implemented for upcxx::self_team, where every possible target_rank is rank_me, then it is a small jump to also perform the transformation for every case where the target_rank is the calling process (potentially enabled only by an env option or in release builds).
-
And one more slightly tangential comment: additional teams like “self_team” should be added to the spec for upc++ if and only if the memory overhead of constructing such a team can be a small constant factor, instead of proportional to the scale of the job. I can imagine that, with some careful work, this “self_team” could be implemented with nothing more than a place-holder and a specialized implementation of upcxx::team, but if it has the same overhead as any other user-created team, then it would not be worth it, in my opinion.
That being said, if it is possible to make specialized implementations of a team without this excessive overhead, then this “self_team”, and possibly even a “node_team” (i.e. team::split(local_team().rank_me(), world().rank_me())), could both be very useful as something baked into the spec.
-
- changed component to Teams
@Rob Egan thanks for your thoughts, we should discuss these design possibilities in more detail, possibly in a different venue. I think there are at least two orthogonal design issues here:
1. The desire for "lightweight" creation of teams - Rob mentions the possibility of the runtime expanding local_team() with a self_team() or node_team() or others. We could specify the creation of additional special-case teams at UPC++ init time, but I'd much rather add a lighter-weight team constructor (not based on team::split()) that the application can use to construct whatever team it wants, with overhead that scales with the new team size instead of the parent team size (as with split()). This already exists in GASNet (we added it to fix the scalability problems with local_team() creation) and I'd like to explore the best way to expose this directly up to UPC++ application code.
2. Mechanisms to modify/abridge RPC serialization/invocation semantics for the special case of "send-to-self". This requires an in-depth discussion of how exactly the semantics would change and how the application would request the modified behavior.
Regarding #2, I should note that one can already get pretty close to what I expect is "the best we can do" by replacing:

```cpp
auto f = upcxx::rpc(upcxx::rank_me(), callable, args...);
```

with

```cpp
auto f = upcxx::master_persona().lpc([=]() { return callable(args...); });
```

This bypasses all the serialization overheads and enqueues the callback to run during the next master progress (with a future for completion that can be elided by using lpc_ff instead). The final semantic step would of course be to just synchronously invoke the callback:

```cpp
callable(args...);
```

These are already available.
If we are going to add syntactic sugar to emulate one or both of these semantic behaviors, I'd need to be convinced that it provides value relative to the existing formulations shown above. I'd also heavily prefer a syntax that does NOT require adding a dynamic "is this loopback" (or equivalently, "is this a special team") branch to the existing RPC critical path, unless the caller has explicitly opted in to that extra branch with something statically detectable (e.g. a new completion argument, a new team subclass, or a different function name).
CC: @Steven Hofmeyr
-
- changed milestone to 2021.9.0 release
-
assigned issue to
Mass roll-over of open issues to next release milestone
-
- changed title to Document UPC++ equivalent to MPI_COMM_SELF
- changed component to Documentation
- marked as task
As of 2021.9.0, we now have team::create(), which can implement a self-team more efficiently than team::split(). Here is the updated example:

```cpp
#include <assert.h>
#include <upcxx/upcxx.hpp>

using namespace upcxx;

team *team_self_p;
#define UPCXX_TEAM_SELF (*team_self_p)

int main() {
  init();
  // setup a self-team
  team_self_p = new team(local_team().create(std::vector<int>{rank_me()}));

  // use the new team
  assert(UPCXX_TEAM_SELF.rank_me() == 0);
  assert(UPCXX_TEAM_SELF.rank_n() == 1);
  barrier(UPCXX_TEAM_SELF);
  dist_object<int> dobj(world().rank_me(), UPCXX_TEAM_SELF);
  assert(dobj.fetch(0).wait() == world().rank_me());

  finalize();
}
```
This issue remains open to document this.
-
for completeness you should include “delete team_self_p;” before finalize(), right? And if one, say instead had a scoped unique_ptr<team> team_self_p, would it cause an error if it were to be destroyed after the finalize()?
-
> for completeness you should include “delete team_self_p;” before finalize(), right?

That's optional, but sure. Although before finalize() it would need to be

```cpp
UPCXX_TEAM_SELF.destroy();
delete team_self_p;
```

to avoid an error.

> And if one, say instead had a scoped unique_ptr<team> team_self_p, would it cause an error if it were to be destroyed after the finalize()?

C++ destruction after finalize is not a problem. However, allowing the unique_ptr to destruct the team before finalize() without a call to team::destroy() is an error. From the spec:
-
- changed status to resolved
issue #360: Document UPC++ equivalent to MPI_COMM_SELF

Add an example for creation of a singleton team. A similar example has also been added to the FAQ.

Resolves issue #360. → <<cset c7f67770056b>>
I think you've provided both a question and the answer. I can confirm we do not have a macro or predefined team called ..._SELF, but the one-liner you provided constructs such a team. As written it's not a drop-in replacement, but with a small adjustment you can do something like this:

You've marked this issue as a proposal, but haven't presented a strong argument for providing syntactic sugar around this easily-constructed team. To my eyes MPI_COMM_SELF itself is pretty weakly motivated, and mostly involves features like message-passing, ordered delivery and I/O, none of which are relevant to our library.
I'll also note that under our current GEX architecture, providing this sugar would impose a small but non-zero memory and time cost on upcxx::init() calls for all applications, and I don't see it as sufficiently motivated.