This is a work-in-progress and not maintained. Answers may be out-of-date.
1Q. How should I handle serialization of user-defined classes, since the upcxx interface for user-defined serialization has not yet been implemented (or even fully specified)?
1A. Several of the STL container types are automatically serialized, but for a class that is not TriviallySerializable we suggest the following. Send the STL container fields as separate arguments to the rpc call. Note that if this communication is in your critical path and the transferred data is "consumed" by the callback, you should consider using view-based serialization to transfer those elements to reduce extraneous data copies (see spec section 6.5 for details).
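A minimal sketch of the field-by-field approach, written in plain C++ so that it compiles without UPC++ installed. The `Particle` class and the helpers `to_rpc_args` and `from_rpc_args` are hypothetical names introduced only for illustration, and the 3-D coordinate check is an assumption of this sketch. The commented-out call shows roughly where `upcxx::rpc` would receive the fields as separate arguments.

```cpp
#include <cassert>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical user class: not TriviallySerializable because of its containers.
struct Particle {
    int id;
    std::string label;
    std::vector<double> coords;
};

// Split the object into fields that can each be passed as an RPC argument.
inline std::tuple<int, std::string, std::vector<double>>
to_rpc_args(const Particle& p) {
    return std::make_tuple(p.id, p.label, p.coords);
}

// Rebuild the object on the receiving side; an RPC callback would do this.
inline Particle from_rpc_args(int id, std::string label,
                              std::vector<double> coords) {
    assert(coords.size() == 3 && "this sketch assumes 3-D coordinates");
    return Particle{id, std::move(label), std::move(coords)};
}

// With UPC++ available, the send might look roughly like this (not compiled here):
//   upcxx::rpc(target,
//              [](int id, std::string label, std::vector<double> coords) {
//                  Particle q = from_rpc_args(id, std::move(label), std::move(coords));
//                  /* ... use q ... */
//              },
//              p.id, p.label, p.coords).wait();
```

Keeping the split and rebuild steps in two small helpers means the list of fields lives in one place, which makes it easier to switch to a proper serialization interface later.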
2Q. I have defined some types that are not trivially copyable (they contain std containers) and thus they need to be serialized. In MPI I could do that easily by implementing a serialize method for each class. The upcxx specification says that I should be able to do the same, but when I try to rget an instance of my class, the static_assert for the TriviallySerializable concept fails. Does upcxx support boost::serialization, or is it just a documented but not supported feature?
2A. We intend to support boost serialization eventually, but this has yet to be implemented. Since the error you are getting refers to TriviallySerializable, the operation you are attempting will likely never support the serialization you seek. Operations which are intended to be accelerated by network RMA hardware (such as rput, rget) can only move flat byte sequences, hence we assert that the given type can be meaningfully moved that way. RPCs are not restricted in the types they transmit since the CPU is always involved, and we do indeed accept and serialize many std containers (though this is largely undocumented) when given as RPC arguments. However, that does not mean your own serializable types will work as RPC arguments. Beware, though: I believe the latest release had a serious bug which broke much of the std container support. If you want to attempt this, perhaps try the v2017.9.0 release.
3Q. Is a rank a thread or a process? Are ranks created using Pthreads or the Linux system call fork()? If each rank is a process, and I run UPC++ on a machine which has security enforced (accessing another process's memory is not allowed), does UPC++ crash when one rank tries to access the memory of another rank?
3A. Like MPI, a rank is a process. How they are created is platform dependent, but you can count on fork() being a popular case for non-supercomputers. Typically, UPC++ processes use process shared memory (POSIX shm_***) to communicate. If that isn't available, there is an alternate configuration which will use UDP sockets (not a high performance implementation though).
4Q. How do I launch a UPC++ program on multiple nodes, and particularly in the case when the nodes are connected by IB?
4A. The UPC++ install script should autodetect the availability of InfiniBand (done during a step called "configuring gasnet"). To build a program to run over InfiniBand, make sure that:
- `CXX=mpicxx` is in your environment when calling the install script.
- `UPCXX_GASNET_CONDUIT=ibv` is in your environment when calling the upcxx-meta scripts that produce compiler flags.
Then to launch just use the upcxx-run script. Let me know if you run into problems.
5Q. When I conjoin futures using when_all(), it appears that progress is made only when wait() is called and that the operations complete in an arbitrary order. In what order do the operations complete? Why do the conjoined operations complete in less time than an explicitly serial ordering? Is there a possibility of a race condition? Does each future-bearing operation execute in its own thread?
5A. Futures progress sequentially on the calling thread and in unspecified order. UPC++ futures are not a tool for expressing parallelism, but instead non-deterministic sequential behavior. Futures allow a thread to react to events in the order they actually occur, since the order in which communication operations complete is not deterministic. While UPC++ does support multiple threads within each process, and each thread can be managing its own futures, the same future must never be accessed concurrently by multiple threads. Also, UPC++ will never "spawn" new threads. See the section "Futures and Promises" of the UPC++ specification.
6Q. Does a barrier synchronize RMA operations (rput/rget)?
6A. No. UPC++ provides generalized completions to this end.
7Q. Are UPC++ futures thread-safe?
7A. No.
8Q. Is there any good unit test framework that is nicely compatible with developing UPC++ codes? How does the development team write their tests? I started off using the C++ testing framework CATCH, but it will be a bit hacky to use this serial test framework for a parallel code.
8A. ? [Mail chain from Greg Meyer]
9Q. What is the strategy for using a debugger with UPC++?
9A. Always build in debug mode when debugging by setting the environment variable
If your problem is simple enough that a crash stack might solve it, set `GASNET_BACKTRACE=1` and you will get a backtrace from any rank crash.
Otherwise, if the problem occurs with a single rank, you can spawn smp-conduit jobs in gdb just like any other process (i.e
If you need multiple ranks and/or a distributed conduit, we recommend setting one or more of the following variables, and then following the on-screen instructions to attach a debugger to the failing rank and resume the process:
- `GASNET_FREEZE`: set to 1 to make GASNet pause and wait for a debugger to attach on startup.
- `GASNET_FREEZE_ON_ERROR`: set to 1 to make GASNet pause and wait for a debugger to attach on any fatal errors or fatal signals.
- `GASNET_FREEZE_SIGNAL`: set to a signal name (e.g. `SIGUSR1`) to specify a signal that will cause the process to freeze and await debugger attach.
10Q. How can I alter the GASNet and UPC++ configuration/compilation process, for example to enable debugging output?
10A. Please see the INSTALL.md file at
For a normal UPC++ install, GASNet is installed with both debug and non-debug libraries, and you select which mode you want at app compile time by setting UPCXX_CODEMODE when running upcxx-meta.
`UPCXX_CODEMODE=[O3|debug]`: `O3` is for highly compiler-optimized code. `debug` produces unoptimized code, includes extra error checking assertions, and is annotated with the symbol tables needed by debuggers. The default value is always `O3`.
11Q. Can a UPC++ program dynamically add more processes?
11A. Like MPI and other SPMD models, the maximum amount of potential parallelism has to be specified at job launch. Dynamically asking for more processes to join the job is not possible, but asking for more than you need up front and "waking" them dynamically is doable. Consider having all but rank=0 spinning in a loop on `upcxx::progress()`. When rank 0 wants to offload work to rank 1, it can send an rpc which could kick it out of its spin loop to do real work.
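The idle-worker pattern described above can be sketched as follows. To keep the block compilable without UPC++, the progress call is injected as a callable. In a real program it would be something like `[]{ upcxx::progress(); }`, and the flag would be set by the body of an RPC sent from rank 0. The names `WorkerState` and `spin_until_woken` are hypothetical, and the "task id" payload is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical per-process state that an incoming RPC would modify.
struct WorkerState {
    bool woken = false;  // set to true by the "wake up" RPC from rank 0
    int  task  = -1;     // payload delivered by that RPC
};

// Spin on the progress function until an RPC callback has flipped `woken`.
// Returns how many times progress was polled before waking.
template <class ProgressFn>
std::size_t spin_until_woken(WorkerState& st, ProgressFn&& progress) {
    std::size_t polls = 0;
    while (!st.woken) {
        progress();      // in UPC++: upcxx::progress(), which runs pending RPCs
        ++polls;
    }
    assert(st.task >= 0 && "the wake-up RPC is expected to deliver a task id");
    return polls;
}

// With UPC++ available (not compiled here), rank 0 might wake rank 1 with:
//   upcxx::rpc_ff(1, [](int task) { state.woken = true; state.task = task; }, 42);
// where `state` is a global WorkerState on the target process.
```

Because the RPC body runs inside the worker's own progress call, the flag is read and written on the same thread, so no atomics are needed in this single-threaded sketch.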
12Q. How do I launch distributed UPC++ jobs with Infiniband?
12A. You have 3 options:
a. upcxx-run (which internally invokes `gasnetrun_ibv`) to perform ssh-based spawning.
- This option requires you to correctly set up password-less SSH authentication from at least your head node to all the compute nodes; this document describes how to do that in the context of BUPC (which also uses GASNet), and the information is analogous for UPC++.
- It additionally requires that you pass the host names into the environment, e.g. `GASNET_SSH_SERVERS="host1 host2 host3..."`
- The `gasnetrun_ibv -v` option is often useful for troubleshooting site-specific problems that may arise here.
- You can see more details of what upcxx-run is doing by setting `UPCXX_VERBOSE=1` in the environment, and even more by manually passing -v to the underlying
b. mpirun (possibly invoked from upcxx-run) - uses MPI for job spawn ONLY, then IBV for communication
* This requires that UPC++/GASNet was configured, built and installed with MPI support (usually by setting `CXX=mpicxx`).
* It also requires that (non-GASNet) MPI programs spawn correctly via *mpirun* (and any MPI-implementation-specific tweaking required to make that work).
* It's also best to use TCP-based MPI if possible for this purpose, to prevent the MPI library from consuming IBV resources that won't be used by the app. There is more info on that topic in this document: [https://gasnet.lbl.gov/dist/other/mpi-spawner/README](https://gasnet.lbl.gov/dist/other/mpi-spawner/README)
* *mpirun* often has the -v option to provide spawn status for troubleshooting.
c. PMI spawning.
13Q. Will sleeping (e.g. with std::chrono durations) flip a future to the ready state?
13A. It is a requirement that `upcxx::progress()` be called for futures to flip to their ready state. No amount of chrono sleeping will flip a future even if the transfer has actually completed.
14Q. How do I bind processes to cores and ensure the binding is correct to maximize NUMA hardware resources?
14A. The best means of core-binding is system-specific. Parallel job spawners such as SLURM `srun` and IBM `jsrun` have options to control core binding as part of parallel process launch, and on systems with those spawners that's usually the best place to start. This might mean a trial run with `upcxx-run -show` to get the necessary "starting point" for envvars and spawner command, before manually adding some core-binding options. Failing those, tools like hwloc's `hwloc-bind` can be invoked as a wrapper around your application to achieve specific binding, although it might require some scripting trickery to bind different ranks to distinct resources.
Either way, on a system with the `hwloc` package installed, you can programmatically verify cores were bound as expected using code like the following, which prints the current core binding to stdout:
    std::string s("echo ");
    s += std::to_string(upcxx::rank_me()) + ": `hwloc-bind --get`";
    system(s.c_str());
Q. Does upcxx do any optimization if I happen to call an rpc destined to `rank_me()`? Do I need to explicitly do anything to make that more efficient?
A. There is an optimization applied to all targets in the `local_team()` (shared-memory bypass avoids one payload copy for large payloads using the rendezvous-get protocol), but nothing specific to same-process "loopback" (
If you think same-process/loopback RPCs comprise a significant fraction of your RPC traffic, you should probably consider that serialization semantics require copying the argument payload at least once, even for a loopback RPC. Also, the progress semantics require deferring callback invocation until the next user-level progress, so if the input arguments are hot in-cache then you may lose that locality by the time the RPC callback runs. There is also some intrinsic cost associated with callback deferment that makes it significantly more expensive than a synchronous direct function call, even for lightweight RPC arguments.
So if the RPC arguments are large (e.g., a large `upcxx::view`) or otherwise expensive to serialize (e.g., lots of `std::string` or other containers requiring individual fine-grained allocations), and/or the RPCs are called with very high frequency, then you might consider converting a loopback RPC into a direct function call, ideally passing the arguments by reference (assuming the callback can safely share the input argument data without copying it). The library itself is prohibited from making such a transformation by its API semantics (i.e., because it's not transparent), but a user with knowledge of the application and callback semantics could do it safely in many circumstances and potentially get a big win.
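One way to express the loopback-to-direct-call conversion is a small dispatch helper, sketched here in plain C++ so that it compiles without UPC++. The name `dispatch` and its parameters are hypothetical. `send_remote` stands in for whatever RPC call the application already uses, and `me` would come from `upcxx::rank_me()` in a real program.

```cpp
#include <cassert>
#include <vector>

// Call `handler` directly when the target is this process; otherwise hand the
// payload to `send_remote` (a stand-in for an RPC injection).
// Returns true if the loopback fast path was taken.
template <class Handler, class SendRemote>
bool dispatch(int target, int me, const std::vector<double>& payload,
              Handler&& handler, SendRemote&& send_remote) {
    assert(target >= 0 && me >= 0 && "ranks are non-negative");
    if (target == me) {
        handler(payload);          // direct call by reference: no serialization, no copy
        return true;
    }
    send_remote(target, payload);  // in UPC++: e.g. an upcxx::rpc_ff to `target`
    return false;
}
```

Note that the direct call runs synchronously inside `dispatch`, whereas an RPC callback would be deferred until the next user-level progress. This is only safe if the handler does not rely on that deferral and can share the caller's data without copying it.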