UPC++ Version 1.0
Last November we froze the main UPC++ repository as part of a transition phase. Today we would like to inform you of coming changes. The main UPC++ repository will remain frozen in maintenance-only mode.
By June 30, 2017, we will post a specification for UPC++ that differs substantially from version 0.1. Even more than before, UPC++ will aim to expose the performance that the hardware can deliver, and it will leverage GASNet-EX to deliver on this commitment. UPC++ is a high-productivity communication library. It will be designed to interoperate smoothly and efficiently with MPI, OpenMP, CUDA and AMTs.
UPC++ is also a sounding board for new ideas that may be incorporated in C++20 and beyond, or that may influence the direction of those efforts. UPC++ v1.0 will deploy new capabilities, some of which were experimental in v0.1, remove some and modify others. The table at the end of this document lists the UPC++ features for v0.1 (left) and planned additions, deletions and changes coming in v1.0.
As always, UPC++ exposes a PGAS memory model, including 1-sided communication (RMA and RPC). However, there are two major changes. These changes reflect a design philosophy that encourages the UPC++ programmer to directly express what can be implemented efficiently (i.e., without a need for parallel compiler analysis):
Most operations are non-blocking, and the powerful synchronization mechanisms encourage applications to design for aggressive asynchrony.
All communication is explicit - there is no implicit data motion.
What New Features are Coming?
Futures, promises and continuations. Futures are central to handling asynchronous operations: RMA and RPC. Futures are free-standing in that they do not depend on other parts of the library. Whereas v0.1 used an event-based mechanism for expressing task dependencies, v1.0 relies on a continuation-based model instead.
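To make the continuation-based model concrete, the following is a minimal single-threaded sketch in standard C++. It is not the UPC++ API, which has not been specified yet: `SharedState`, `Future`, `Promise`, `then`, `fulfill` and `chained_example` are hypothetical names chosen for illustration, and the sketch assumes that continuations return a non-void, default-constructible value.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical single-threaded stand-ins for illustration; NOT the UPC++ API.
namespace sketch {

template <typename T>
struct SharedState {
    bool ready = false;
    T value{};
    std::vector<std::function<void(const T&)>> callbacks;
};

// Store the value, mark the state ready, then run every waiting continuation.
template <typename T>
void complete(SharedState<T>& s, T v) {
    s.value = std::move(v);
    s.ready = true;
    auto waiting = std::move(s.callbacks);
    s.callbacks.clear();
    for (auto& cb : waiting) cb(s.value);
}

template <typename T>
class Future {
public:
    explicit Future(std::shared_ptr<SharedState<T>> s) : state_(std::move(s)) {}
    bool ready() const { return state_->ready; }
    const T& result() const { return state_->value; }  // meaningful only once ready()

    // Attach a continuation and get back a future for its (non-void) result.
    template <typename F>
    auto then(F f) -> Future<decltype(f(std::declval<const T&>()))> {
        using R = decltype(f(std::declval<const T&>()));
        auto next = std::make_shared<SharedState<R>>();
        auto run = [next, f](const T& v) mutable { complete<R>(*next, f(v)); };
        if (state_->ready) run(state_->value);   // value already here: run now
        else state_->callbacks.push_back(run);   // otherwise run on completion
        return Future<R>(next);
    }

private:
    std::shared_ptr<SharedState<T>> state_;
};

template <typename T>
class Promise {
public:
    Promise() : state_(std::make_shared<SharedState<T>>()) {}
    Future<T> get_future() const { return Future<T>(state_); }
    void fulfill(T v) { complete<T>(*state_, std::move(v)); }

private:
    std::shared_ptr<SharedState<T>> state_;
};

// Chain two continuations on a value that arrives later.
inline int chained_example() {
    Promise<int> arrival;  // stands in for an operation still in flight
    Future<int> doubled = arrival.get_future().then([](const int& v) { return v * 2; });
    Future<int> plus_one = doubled.then([](const int& v) { return v + 1; });
    assert(!plus_one.ready());  // nothing runs until the value arrives
    arrival.fulfill(20);        // completion triggers both continuations in order
    return plus_one.result();
}

}  // namespace sketch
```

The dependency between the two steps is expressed by chaining `then` calls rather than by registering and waiting on separate event objects, which is the contrast the paragraph above draws between v0.1 and v1.0.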
Progress guarantees. Because UPC++ has no internal service threads, the library makes progress only when a core enters an active UPC++ call. UPC++ v1.0 will have better-defined progress semantics than v0.1, especially in multithreaded scenarios.
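The consequence of having no service threads can be modeled in a few lines of standard C++. This is a hypothetical model, not the UPC++ API: `Runtime`, `enqueue`, `progress` and `progress_example` are names assumed for illustration. Queued handlers run only when the caller explicitly enters the library.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical model for illustration; NOT the UPC++ API.
namespace progress_sketch {

class Runtime {
public:
    // Queue library-internal work, e.g. a completion handler for an arrived message.
    void enqueue(std::function<void()> work) { pending_.push_back(std::move(work)); }

    // The only place queued work runs: the calling thread lends itself to the library.
    std::size_t progress() {
        std::size_t ran = 0;
        while (!pending_.empty()) {
            std::function<void()> work = std::move(pending_.front());
            pending_.pop_front();
            work();
            ++ran;
        }
        return ran;
    }

    std::size_t pending() const { return pending_.size(); }

private:
    std::deque<std::function<void()>> pending_;
};

// Returns how many handlers had run before and after the explicit progress call.
inline std::pair<int, int> progress_example() {
    Runtime rt;
    int handled = 0;
    rt.enqueue([&handled] { ++handled; });
    rt.enqueue([&handled] { ++handled; });
    int before = handled;  // no service thread exists, so nothing has run yet
    rt.progress();
    int after = handled;   // both handlers ran inside progress()
    return std::make_pair(before, after);
}

}  // namespace progress_sketch
```

In this model, an application that computes for a long time without calling into the library leaves queued work unattended, which is why precise progress semantics matter.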
Remote atomics were experimental in v0.1 and did not necessarily utilize available hardware support. Any available hardware support will now be leveraged, and the user will see significant performance benefits in certain combinations of hardware and applications. Remote atomics will use the C++11 memory model and a free-function API. We will restrict atomics to fetch-and-add in the near term, but may consider others later.
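As a local analogy only, standard C++11 already offers fetch-and-add as a free function with an explicit memory-order argument. The sketch below uses `std::atomic_fetch_add_explicit` on an in-process counter; it is not the UPC++ remote-atomics interface, and `claim_slot` and `claim_example` are hypothetical helper names.

```cpp
#include <atomic>
#include <cassert>
#include <vector>

// Local std::atomic analogy; NOT the UPC++ remote-atomics API.
namespace atomics_sketch {

// Claim the next free slot index: fetch-and-add returns the value before the add.
inline long claim_slot(std::atomic<long>& next_slot) {
    return std::atomic_fetch_add_explicit(&next_slot, 1L, std::memory_order_relaxed);
}

// Claim one slot per caller, then append the final counter value.
inline std::vector<long> claim_example(int callers) {
    std::atomic<long> next_slot(0);
    std::vector<long> claimed;
    for (int i = 0; i < callers; ++i) claimed.push_back(claim_slot(next_slot));
    claimed.push_back(next_slot.load(std::memory_order_acquire));
    return claimed;
}

}  // namespace atomics_sketch
```

Because each call returns the previous value, every caller receives a distinct slot index without any lock. A remote version would apply the same operation to a counter owned by another rank.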
Teams are a mechanism for grouping ranks, and are similar to MPI_Group. Teams play a role in collective communication and also in storage allocation. Initially, we plan to support barriers and reductions for specialized types supported in hardware. Others (such as the vector ‘v’ variants of alltoall) will be added over time.
Distributed objects. UPC++ will enable a C++ object of any type to be made into a distributed object, with one instance on every rank of a team. RPC can be used to scalably access remote instances within a team.
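The idea can be pictured with a single-process simulation in standard C++. Everything here is an assumption made for illustration (`SimDistObject`, `rpc`, `gather_example`); it is not the UPC++ API, and all instances live in one address space rather than one per rank.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Single-process simulation for illustration; NOT the UPC++ API.
namespace dist_sketch {

// Holds one instance of T per simulated rank of a team.
template <typename T>
class SimDistObject {
public:
    SimDistObject(std::size_t team_size, const T& init) : instances_(team_size, init) {}
    std::size_t team_size() const { return instances_.size(); }

    // Stand-in for an RPC: run f on the instance owned by `target`, return its result.
    template <typename F>
    auto rpc(std::size_t target, F f) -> decltype(f(std::declval<T&>())) {
        return f(instances_.at(target));
    }

private:
    std::vector<T> instances_;
};

// Every simulated rank records a local count; the total is then gathered via rpc.
inline int gather_example(std::size_t team_size) {
    SimDistObject<int> counts(team_size, 0);
    for (std::size_t r = 0; r < team_size; ++r) {
        int local_work = static_cast<int>(r) + 1;
        counts.rpc(r, [local_work](int& mine) { mine = local_work; });
    }
    int total = 0;
    for (std::size_t r = 0; r < team_size; ++r) {
        total += counts.rpc(r, [](int& theirs) { return theirs; });
    }
    return total;
}

}  // namespace dist_sketch
```

The point of the sketch is the access pattern: code names a rank and supplies a function, and that function runs against the instance belonging to that rank. The sequential gather loop is only for demonstration and is not itself a scalable pattern.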
Memory kinds. UPC++ will support global operations on memory with different kinds of access methods or performance properties, such as GPUs, HBM, NUMA and NVRAM, while providing a uniform interface for transfers between such memories.
|Feature|Version 0.1|Version 1.0|
|---|---|---|
|Futures, Continuations, Promises| |✔|
|Events|✔|Subsumed by futures, continuations, promises|
|Put and Get|✔|✔|
|Distributed 1D Arrays|✔|Subsumed by distributed objects|
|Global Pointer Dereference|✔ (Implicit blocking)| |
|Memory Kinds (e.g. GPU)| |✔|
|Shared Scalar Variables|✔ (Little use)| |
|Non-Distributed MD Arrays|✔ (ndarray prototype)| |
|Progress Guarantees|✔|✔ (More rigorous)|
What Will Be Removed from UPC++?
In developing UPC++ v1.0 we also strove for simplicity, and we have removed some obsolete features that were present in v0.1:
Multidimensional arrays (local only). We plan to interoperate with 3rd party solutions for multidimensional arrays.
Distributed shared arrays - this functionality has been subsumed by generalized distributed objects, which provide a more scalable solution.
Blocking communication (e.g. implicit global pointer dereference).