Clone wiki

upcxx / FAQ


Below are some Frequently Asked Questions (FAQs) for UPC++.

All answers refer to the latest release of the UPC++ implementation and specification, which are available at the UPC++ home page.


What's the best way to get started learning UPC++?

For training materials to get started using UPC++, see the UPC++ Training Page

What systems support UPC++?

UPC++ runs on a variety of systems, ranging from laptops to custom supercomputers. The complete list of officially supported platforms (and C++ compilers) appears in the UPC++ INSTALL document. This currently includes systems with x86-64, ARM64 and PowerPC64 architectures running Linux, macOS or Windows (with WSL). It's also been successfully used on other POSIX-like systems, despite not being officially supported.

How do I download/install UPC++ for my system?

There are several ways to access a working UPC++ environment:

Installs on the major DOE centers

We maintain public installs of UPC++ at several DOE centers.

See UPC++ at NERSC, OLCF, and ALCF

Download and install from source

UPC++ supports a wide range of UNIX-like systems and architectures, ranging from supercomputers to laptops, including Linux, macOS and Windows (via WSL).

Visit the UPC++ download page for links to download and install UPC++ on your own system.

Docker containers with UPC++

We maintain a Linux Docker container with UPC++.

Assuming you have a Linux-compatible Docker environment, you can get a working UPC++ environment in seconds with the following command:

docker run -it --rm upcxx/linux-amd64
The UPC++ commands inside the container are upcxx (compiler wrapper) and upcxx-run (run wrapper), and the home directory contains some example codes. Note this container is designed for test-driving UPC++ on a single node, and is not intended for long-term development or use in production environments.

The container above supports the SMP and UDP backends, and does not include an MPI install. If you additionally need MPI support, there is also a larger "full" variant of the container that includes both UPC++ and MPI:

docker run -it --rm upcxx/linux-amd64-full
For details on hybrid programming with MPI, see Mixing UPC++ with MPI.

Is UPC++ available in any containers?

We provide minimal Docker containers intended for test-driving the latest UPC++ release (see above answer).

The E4S project provides a variety of full-featured containers in Docker, Singularity, OVA and EC2 that include UPC++, along with the entire E4S SDK. See documentation at the E4S website for usage instructions.

How should I cite UPC++ in publications?

The UPC++ publications page provides the recommended general citations for citing UPC++ and/or GASNet-EX, along with a BibTex file you can import directly to make it easy.

We're especially interested to hear about work-in-progress publications using UPC++, please contact us to start a discussion!

Futures and completion

Are UPC++ futures thread-safe?

No. See this guide section for more details.

Does a barrier synchronize RMA operations (rput/rget)?

No. UPC++ provides per-operation completions to this end. See this guide section

Will calling sleep() flip a future to the ready state?

No. Futures returned by communication operations will only become readied during upcxx::progress() or other library calls with user-level progress such as future::wait(). No amount of chrono sleeping will flip a future even if the transfer has actually completed.

When I conjoin futures using when_all(), in what order do the operations complete? Is there a possibility of a race condition?

Futures progress sequentially on their owning thread and in unspecified order. UPC++ futures are not a tool for expressing parallelism, but instead non-deterministic sequential behavior. Futures allow a thread to react to events in the order they actually occur, since the order in which communication operations complete is not deterministic. While UPC++ does support multiple threads within each process, and each thread can be managing its own futures, the same future must never be accessed concurrently by multiple threads. Also, UPC++ will never "spawn" new threads. See the section "Futures and Promises" of the UPC++ specification.

Remote Procedure Call (RPC)

When I send a lambda expression to RPC, can I safely use lambda captures?

The short answer is: NEVER use by-reference lambda captures for RPC, and by-value captures are ONLY safe if the types are TriviallySerializable (TriviallyCopyable). See this guide section for more details.

Can I safely send a pointer to static data variables and dereference them in an RPC callback?

In general this is NOT guaranteed to work correctly. In particular, systems using Address space layout randomization (ASLR) (a security feature which is enabled by default on many Linux and macOS systems) will load static data and possibly even code segments at different randomized base addresses in virtual memory for each process in the job.

Note your callbacks can safely refer to static/global variables by name, which will be resolved correctly wherever the callback runs. For variables with a dynamic lifetime, the recommended solution is to use a dist_object wrapper; sending a dist_object reference as an RPC argument automatically ensures the callback receives a reference to the corresponding local representative of the dist_object.

Do RPC callbacks incur concurrency or thread-safety issues?


Remote Procedure Call (RPC) callbacks always execute sequentially and synchronously inside user-level progress for the thread holding the unique "master persona" (a special persona that is initially granted to the thread calling upcxx::init() on each process). User-level progress only occurs inside a handful of explicit UPC++ library calls (e.g. upcxx::progress() and upcxx::future::wait()).

For more details, see this guide section.

Can you make an RPC call from within a RPC callback? When I try to wait on the nested RPC I get an error/hang.

You can absolutely launch RPC, RMA or other communication from within an RPC callback. However you cannot wait on the completion of communication from within the RPC callback. If you run in debug mode you'll get an error message explaining that, e.g.: "You have attempted to wait() on a non-ready future within upcxx progress, this is prohibited because it will never complete."

So if you launch communication from RPC callback context you need to use UPC++'s other completion and asynchrony features to synchronize completion instead of calling future::wait(). There are many different ways to do this, and the best choice depends on the details of what the code is doing. For example, one could request future-based completion on that op and either stash the resulting future in storage outside the callback's stack frame, or better yet schedule a completion callback on the future using future::then() with a continuation of whatever you would have done after wait (more details). There are also operations (like rpc_ff) and completion types (remote_cx::as_rpc) that signal completion at the remote side (more details), so you perform synchronization at the target instead of the initiator.

Does upcxx do any optimization if I happen to call an RPC destined to rank_me()? Do I need to explicitly do anything to make that more efficient?

There is an optimization applied to all targets in the local_team() (shared-memory bypass avoids one payload copy for large payloads using the rendezvous-get protocol), but nothing specific to same-process "loopback" (rank_me).

If you think same-process/loopback RPCs comprise a significant fraction of your RPC traffic, you should probably consider that serialization semantics require copying the argument payload at least once, even for a loopback RPC. Also, the progress semantics require deferring callback invocation until the next user-level progress, so if the input arguments are hot in-cache then you may lose that locality by the time the RPC callback runs. There is also some intrinsic cost associated with callback deferment that makes it significantly more expensive than a synchronous direct function call, even for lightweight RPC arguments.

So if the RPC arguments are large (e.g., a large upcxx::view) or otherwise expensive to serialize (e.g., lots of std::string or other containers requiring individual fine-grained allocations), and/or are called with very high frequency, then you might consider converting a loopback RPC into a direct function call, ideally passing the arguments by reference (assuming the callback can safely share the input argument data without copying it). Such a transformation would be prohibited by the library API semantics (i.e., because it's not transparent), but a user with knowledge of the application and callback semantics could do it safely in many circumstances and potentially get a big win under the right circumstances.


How should I handle serialization of user-defined classes?

Several of the STL container types are automatically serialized, as are TriviallyCopyable types. For more complicated types there are several mechanisms for customizing serialization - see this guide section. Note that if you are transferring a container of data elements "consumed" by the callback, you should consider using view-based serialization to transfer those elements to reduce extraneous data copies (see spec for details).

I have defined some types that are not trivially copyable and thus need to be serialized. When I try to RMA an instance of my class, static_assert for TriviallySerializable concept fails.

The error you are getting refers to TriviallySerializable because the RMA operation you are attempting does not support the serialization you seek. Operations which are intended to be accelerated by network RMA hardware (such as rput, rget) can only move flat byte sequences, hence we assert the type given can be meaningfully moved that way. RPC's are not restricted in the types they transmit since the CPU is always involved, and we do indeed accept and serialize most std containers when given as RPC arguments. There are also now interfaces for defining custom serialization on your own types for use with RPC - see this guide section.

How does the efficiency of passing a upcxx::view<T> argument to RPC compare to other mechanisms for passing the same TriviallySerializable T elements to an RPC callback?

Background info on view-based serialization. A view argument over a TriviallySerializable type adds 8 bytes of serialized data payload to store the number of elements. Aside from that there are no additional overheads to using a view. There are no incremental costs associated with direct access to the network buffer from the RPC callback at the target.

If your alternative is to send those elements in a container like a std::vector, std::list, std::set, etc those also all add 8 bytes of element count in the serialized representation (std::array notably does not, as the element count is static info), but these containers all also add additional element data copies and allocation calls to construct the container at the target. So a view should never be worse than a container over the same elements, and can be significantly better.

If your alternative is to send each element as an individual unboxed RPC argument, that might be comparable to a view for a very small number of elements. But the view should quickly win once you amortize the extra 8 bytes on-the-wire and reap the efficiency benefit of passing the arguments to the callback function via pointer indirection instead of forcing the compiler to copy/move each element to the program stack (beyond the point where they all fit in registers).

Remote Memory Access (RMA)

What are the benefits of using upcxx::rput and upcxx::rget RMA communication?

The main benefits are:

  1. The data transfer is performed in a zero-copy manner on most HPC-relevant hardware (e.g. Cray networks or InfiniBand). Fewer payload copies means less CPU overheads tied up in communication, and these costs are especially relevant when the payload is large.
  2. On HPC-relevant hardware, the transfer is one-sided down to the hardware level. Specifically, the data transfer is offloaded to the network hardware on the remote end, meaning there are no communication delays associated with involvement of the remote CPU.

These same benefits also apply to upcxx::copy when applied to host memory or device memory with native memory kinds support.

Are data transfers still one-sided when using RMA with remote completion?

Yes, with a caveat. UPC++ allows RMA transfers (rput and copy) to request target-side remote completion notification, sometimes known as "signaling put". This feature is described in this guide section.

Here's a simple example:

double *lp_src = ...;
global_ptr<double> gp_dst = ...;
upcxx::rput(lp_src, gp_dst, count, 
            remote_cx::as_rpc([](t1 f1, t2 f2){ ... }, a1, a2));

The RMA data payload transfer (from lp_src to gp_dst) remains fully one-sided and zero-copy using RDMA hardware on networks that provide that capability (eg. Cray Aries and InfiniBand).

The RPC arguments (a1, a2 and the lambda itself) are subject to serialization semantics, so that part of the payload is not zero-copy; however in a well-written application this is typically small (under a few cache lines) so the extra copies are irrelevant to performance. Also as usual, the RPC callback will not execute until the target process enters UPC++ user-level progress.

What happens if there are conflicting RMA accesses to the same memory location?

If two or more non-atomic RMA operations concurrently access the same shared memory location with no intervening synchronization and at least one of them is a write/put, that constitutes a data race, just like in thread-style programming. This property holds whether the accesses are performed via RMA function calls (e.g. upcxx::rput) or via deference operations on raw C++ pointers by a process local to the shared memory. An RMA data race is not immediately fatal, but any values returned and/or deposited in memory will be indeterminate, i.e. potentially garbage. Well-written programs should use synchronization to avoid data races.

UPC++ also provides remote atomic operations that guarantee atomicity in the presence of conflicting RMAs to locations in shared memory. These can be used in ways analogous to atomic operations in thread-style programming to ensure deterministically correct behavior.

Memory Kinds and GPU support

How does UPC++ support GPU accelerators?

UPC++ provides the Memory Kinds interface which enables RMA communication between GPU memories and host memories anywhere in the system, all within a convenient and uniform abstraction. UPC++ global_ptr's can reference memory in host or GPU memory, local or remote, and the upcxx::copy() function moves data between them.

For more details see this guide section

What GPUs does UPC++ support?

The UPC++ Memory Kinds interface currently supports:

  • CUDA-compatible devices, including all recent NVIDIA-branded GPUs
  • ROCm/HIP-compatible devices, including all recent AMD-branded GPUs

On systems with Mellanox-branded InfiniBand network hardware, this additionally includes native RMA support implemented using offload technologies such as GPUDirect RDMA and ROCmRDMA. Forthcoming releases will add memory kinds supporting GPUs manufactured by Intel, and native RMA support for GPU memory on additional networks.

How can I share GPU device memory between UPC++ and other libraries?

UPC++ provides several options for constructing a device segment, which is managed by a upcxx::device_allocator object. The simplest way to construct a device segment is the upcxx::make_gpu_allocator() factory function, which is discussed in this guide section.

By default, device segment creation allocates a segment of the requested size on the device (i.e. UPC++ performs the allocation). However there is also an argument that optionally accepts a pointer to previously allocated device memory (e.g. allocated using cudaMalloc or some other GPU-enabled library).

Once constructed, the device_allocator abstraction can optionally be used to carve up the segment into individual objects using device_allocator::::allocate<T>().

Alternatively you may construct a UPC++ device segment around a provided region of device memory that already contains objects and data (e.g. created using a different programming model). Constructing a device_allocator in this case allows UPC++ to register the device memory with the network for RMA and provides the means to create a global_ptr referencing the device memory that can be shared with other processes.

A local pointer into the device segment can be converted into a global_ptr of appropriate memory kind via device_allocator::to_global_ptr(), and back again via device_allocator::local(). The global_ptr version of the pointer can be shared with other processes and used in the upcxx::copy communication calls, and the local version of the device pointer can be passed to GPU compute kernels and other GPU-aware libraries.

Are UPC++ Memory Kinds compatible with CUDA Managed Memory (aka Unified Memory)?

CUDA Managed Memory (sometimes referred to as "Unified Memory") is a feature that allows memory allocated in a special way to transparently migrate back-and-forth between physical pages resident in GPU memory and system DRAM, based on accesses. This migration uses page fault hardware to detect accesses and perform page migration on-demand, which can have a substantial performance cost.

The upcxx::device_allocator constructors either allocate a (non-managed) device segment in GPU memory or accept a pre-allocated device-resident memory segment. In both cases the entire device segment must be resident in GPU memory. When native Memory Kinds are active (enabled by default for supported networks) the pages comprising the cuda_device memory segment are pinned/locked in physical memory and registered with the network adapter as a prerequisite to enabling offloaded RDMA access (e.g. GPUDirect RDMA) on behalf of remote processes. This currently precludes Managed Memory on device segments, as it prevents migration of pages in the memory kinds segment (prohibiting direct/implicit access from the host processor). The host processor can explicitly copy data to/from a memory kinds device segment using upcxx::copy, or (for segments it owns) using an appropriate cudaMemcpy call.

Nothing prevents use of CUDA Managed Memory for objects outside a UPC++ device segment.

Running UPC++ Programs

Is a UPC++ rank a thread or a process? How does UPC++ allow one rank to directly access the memory of another rank?

Like MPI, a UPC++ rank is a process. How they are created is platform dependent, but you can count on fork() being a popular case for non-supercomputers. Typically, UPC++ processes use process shared memory (POSIX shm_***) to communicate within a node. Several other OS-level shared memory bypass mechanisms are also supported - see GASNet documentation on "GASNet inter-Process SHared Memory (PSHM)" for details.

How can I control the size of the shared heap segment?

The shared heap segment size for each process is determined at job creation and remains fixed for the lifetime of the job. The easiest way to control the shared segment size is the upcxx-run -shared-heap=HEAPSZ argument, where HEAPSZ can be either an absolute per-process size (e.g., -shared-heap=512MB, -shared-heap=2GB, etc) or a percentage of physical memory (e.g., -shared-heap=50%, which devotes half of compute-node physical memory to shared heap, divided evenly between co-located processes).

For details on upcxx-run see this guide section

What is the strategy for using a debugger with UPC++?

See docs/debugging for tips on debugging UPC++ codes.

How can I use Valgrind with UPC++?

See docs/debugging for information about using the Valgrind instrumentation framework in UPC++ programs.

Can a UPC++ program dynamically add more processes?

Like MPI and other SPMD models, the maximum amount of potential parallelism has to be specified at job launch. Dynamically asking for more processes to join the job is not possible, but asking for more than you need up front and "waking" them dynamically is doable. Consider having all but rank=0 spinning in a loop on upcxx::progress(). When rank 0 wants to offload work to rank 1, it can send an rpc which could kick it out of its spin loop to do real work.

How do I launch a UPC++ program on multiple nodes, and particularly in the case when the nodes are connected by IB?

The UPC++ configure script should autodetect the availability of InfiniBand (done during a step called "configuring gasnet"). To build a program to run over InfiniBand, make sure that:

  • CXX=mpicxx is in your environment when calling the configure script.
  • Compile your program using upcxx -network=ibv

Then to launch just use the upcxx-run script.

How do I launch distributed UPC++ jobs with Infiniband?

You have 3 options:

a. upcxx-run (which internally invokes gasnetrun_ibv) to perform ssh-based spawning.

  • This option requires you to correctly setup password-less SSH authentication from at least your head node to all the compute nodes - this document describes how to do that in the context of BUPC (which also uses GASNet) and the information is analogous for UPC++
  • It additionally requires that you pass the host names into the environment, e.g. GASNET_SSH_SERVERS="host1 host2 host3...
  • The gasnetrun_ibv -v option is often useful for troubleshooting site-specific problems that may arise here.
  • You can see more details of what upcxx-run is doing by passing the -v option or setting UPCXX_VERBOSE=1 in the environment, and even more using upcxx-run -vv, which passes -v to the underlying gasnetrun_ibv command.

b. mpirun (possibly invoked from upcxx-run) - uses MPI for job spawn ONLY, then IBV for communication

  • This requires UPC++/GASNet was configured, built and installed with MPI support (usually by setting CXX=mpicxx)
  • Also requires that (non-GASNet) MPI programs spawn correctly via mpirun (and any MPI-implementation-specific tweaking required to make that work)
  • It's also best to use TCP-based MPI if possible for this purpose, to prevent the MPI library from consuming IBV resources that won't be used by the app. There is more info on that topic in this document:
  • mpirun often has the -v option to provide spawn status for troubleshooting

c. PMI spawning.

For more details about job spawning see this guide section

How to I bind processes to cores and ensure the binding is correct to maximize NUMA hardware resources?

The best means of core-binding is system-specific. Parallel job spawners such as SLURM srun and IBM jsrun have options to control core binding as part of parallel process launch, and on systems with those spawners that's usually the best place to start. This might mean a trial run with upcxx-run -show to get the necessary "starting point" for envvars and spawner command, before manually adding some core-binding options. Failing those, tools like hwloc's hwloc-bind can be invoked as a wrapper around your application to achieve specific binding, although it might require some scripting trickery to bind different ranks to distinct resources.

Either way on a system with the hwloc package installed, you can programmatically verify cores were bound as expected using code like the following, which prints the current core binding to stdout:

  std::string s("echo ");
  s += std::to_string(upcxx::rank_me()) + ": `hwloc-bind --get`";

For more details about job spawning see this guide section

Teams and Collectives

I ran the following program on my cluster and the output arrived in an unexpected order. Does this mean barrier is broken?

printf("Hello world from rank %d\n", upcxx::rank_me());
printf("Goodbye world from rank %d\n", upcxx::rank_me());

The issue here is that while upcxx::barrier() DOES synchronize the execution of all the processes, it does NOT necessarily flush all their stdout/stderr output all the way back to the central console on a cluster. The call to fflush(stdout) ensures that each process has flushed output from its private buffers to the file descriptor owned by each local kernel, however in a distributed-memory run this output stream still needs to travel over the network (eg via a socket connection) from the kernel of each node back to the login/batch node running the upcxx-run command before it reaches the console and is multiplexed with output from the other compute nodes to appear on your screen. upcxx::barrier() does not synchronize any part of that output-forwarding channel, so it's common to see such output "races" on distributed-memory systems. The same is true of MPI_Barrier() for the same reasons.

The only way to reliably workaround this issue is to perform all console output from a single process (conventionally rank 0). This is also a good idea for scalability reasons, as generally nobody wants to scroll through 100k copies of "hello world" ;-). It's also worth noting that console output usually vastly underperforms file system I/O with a good parallel file system, so the latter should always be preferred for large-volume I/O.

Is there a UPC++ equivalent to MPI_COMM_SELF?

UPC++ provides built-in teams for the entire job (upcxx::world()) and for the processes co-located on the same physical node (upcxx::local_team()). There is no built-in singleton team for the calling process, but it's easy to construct one using team::create() -- see example below.

#include <upcxx/upcxx.hpp>

upcxx::team *team_self_p;
inline upcxx::team &self() { return *team_self_p; }

int main() {
   int world_rank = upcxx::rank_me();
   // setup a self-team:
   team_self_p = new upcxx::team(upcxx::world().create(std::vector<int>{world_rank})); 

   // use the new team
   UPCXX_ASSERT(self().rank_me() == 0);
   UPCXX_ASSERT(self().rank_n() == 1);
   upcxx::dist_object<int> dobj(world_rank, self());
   UPCXX_ASSERT(dobj.fetch(0).wait() == world_rank);

   // cleanup (optional)
   delete team_self_p;


For more info on teams, see this guide section.


Can I mix UPC++ with C++ threads and/or POSIX threads?

Absolutely! The main caveat is that if more than one thread will make UPC++ calls then you need to build using the thread-safe version of UPC++. This usually just means compiling with upcxx -threadmode=par (or with envvar UPCXX_THREADMODE=par).

The UPC++ model is fully thread-aware, and even provides LPC messaging queues between threads as a convenience. For details, see this guide section.

My program has multiple threads, but only one of them makes UPC++ calls. Can I use the default threadmode=seq UPC++ library?

If exactly one thread per process ever makes UPC++ calls over the lifetime of the program, then yes. This is analogous to MPI_THREAD_FUNNELED, and is permitted under UPC++ threadmode=seq. However threadmode=seq does NOT allow UPC++ communication calls from multiple threads per process, even if those calls are serialized by the caller to avoid concurrency (the analog of MPI_THREAD_SERIALIZED).

For more details on what is permitted in threadmode=seq see this document.

Can I use UPC++ with Kokkos?

Yes! The communication facilities provided by UPC++ nicely complement the on-node computational abstractions provided by Kokkos. Here's a simple example showing UPC++ with Kokkos. You can even integrate UPC++ memory kinds (GPU support) with GPU computation managed by Kokkos - here's an example of that, which is also the topic of a recent paper showing how this code can outperform an equivalent Kokkos+MPI program.


How do I report a problem or suspected bug in UPC++?

Use the UPC++ Issue Tracker to report issues or enhancement requests for the implementation or documentation.

Where else can I find help with UPC++?

The UPC++ Support Forum (email) is a public forum where you can ask questions or browse prior discussions.

You may also contact us via the staff email for private communication or non-technical questions.

Where can I find documentation for the GASNet-EX network layer?

The GASNet website includes links to documentation relevant to the networking layer inside UPC++. This includes:

  1. The GASNet README provides detailed information on running jobs using GASNet (every UPC++ program is also a GASNet program). This notably includes documentation on environment variables that influence the operation of UPC++/GASNet jobs.
  2. The GASNet ChangeLog documents detailed changes deployed at the network level in each release (GASNet-EX releases are generally synchronized with UPC++ releases).
  3. GASNet conduit READMEs provide detailed network-specific documentation regarding UPC++ jobs using that particular network backend. These notably include network-specific environment variables that influence the operation of UPC++/GASNet jobs. For example:

Where can I read about known issues in the GASNet-EX network layer?

The GASNet Issue Tracker contains the latest detailed information about known issues or enhancement requests in the GASNet-EX networking layer. Note that GASNet is an internal component of UPC++ that is also used by other programming models, so many of the issues there may not be relevant to UPC++.

What does the ECONGESTION error from the UDP network backend mean?

Under some circumstances with udp-conduit you might see a runtime error like this:

*** FATAL ERROR(Node 0): An active message was returned to sender,
    and trapped by the default returned message handler (handler 0):
Error Code: ECONGESTION: Congestion at destination endpoint 

ECONGESTION is the error that indicates a sending process timed out and gave up trying to send a packet to a receiver who never answered. Under default settings, this means the udp-conduit retransmit protocol gave up after around 60 seconds with no network attentiveness at the target process. The most common way this occurs is when the target process is genuinely frozen; generally because the target process is either (1) stuck in a prolonged compute loop with no UPC++ internal progress or communication calls from any thread, or (2) stopped by a debugger. The other thing we've occasionally seen is some DDoS filters or firewalls will step in and stop delivering UDP traffic after a certain point if it exceeds an activity threshold.

The udp-conduit retransmission protocol parameters can be tweaked using environment variables, as described in the udp-conduit README. So for example if you are using a debugger to temporarily freeze some processes, you probably want to set the timeout to infinite via envvar export GASNET_REQUESTTIMEOUT_MAX=0 which tells sending processes to never time out waiting for a network-level response from a potentially frozen process.