Site-specific Documentation for Public UPC++ Installs

This document provides usage instructions for installations of UPC++ at various computing centers.

This document is a continuous work-in-progress, the purpose of which is to provide up-to-date information on public installs maintained by (or in collaboration with) the UPC++ team. However, systems are constantly changing, so please report any errors or omissions in the issue tracker.

Typically, installs of UPC++ are maintained only for the current default versions of the system-provided environment modules such as compilers and CUDA. If you find one of the installs described in this document to be out-of-date with respect to the current defaults, please report it using the issue tracker link above.

This document is not a replacement for the documentation provided by the centers, and assumes general familiarity with the use of the systems.

This document is intended to describe use of existing UPC++ installations and is not a guide to configuring or installing UPC++.


Table of contents

  • NERSC Cori (Haswell and KNL nodes)
  • OLCF Summit
  • ALCF Theta
  • NERSC Cori GPU nodes

NERSC Cori (Haswell and KNL nodes)

Stable installs are available through environment modules. A wrapper is used to transparently dispatch commands such as upcxx to an install appropriate to the currently loaded PrgEnv-{intel,gnu,cray}, craype-{haswell,mic-knl} and compiler (intel, gcc, or cce) environment modules.

On Cori, the UPC++ environment modules select a default network of aries. You can optionally specify this explicitly on the compile line with upcxx -network=aries ....

Batch jobs

By default, batch jobs on Cori inherit both $PATH and the $MODULEPATH from the environment at the time the job is submitted/requested using sbatch or salloc. So, no additional steps are needed to use upcxx-run if a upcxx environment module was loaded when sbatch or salloc ran.
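For reference, here is a minimal sketch of a batch script that relies on this inheritance; the constraint, queue, and time options shown are illustrative only, not prescriptive:

#!/bin/bash
#SBATCH -C knl
#SBATCH -q debug
#SBATCH --nodes=2
#SBATCH -t 10
# upcxx-run is already on $PATH because the upcxx module was loaded at submission time
upcxx-run -n 4 -N 2 ./hello-world.x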

Interactive example:

cori$ module load upcxx

cori$ module switch craype-haswell craype-mic-knl # both work

cori$ upcxx --version
UPC++ version 2020.10.0  / gex-2020.10.0
Copyright (c) 2020, The Regents of the University of California,
through Lawrence Berkeley National Laboratory.
https://upcxx.lbl.gov

icpc (ICC) 19.0.3.199 20190206
Copyright (C) 1985-2019 Intel Corporation.  All rights reserved.

cori$ upcxx -O hello-world.cpp -o hello-world.x

cori$ salloc -C knl -q interactive --nodes 2
salloc: Granted job allocation 28703076
salloc: Waiting for resource configuration
salloc: Nodes nid0[2350-2351] are ready for job

nid02350$ upcxx-run -n 4 -N 2 ./hello-world.x
Hello world from process 0 out of 4 processes
Hello world from process 2 out of 4 processes
Hello world from process 1 out of 4 processes
Hello world from process 3 out of 4 processes

CMake

A UPCXX CMake package is provided in the UPC++ install on Cori, as described in README.md. Thus with the upcxx environment module loaded, CMake should "just work" on Cori. However, /usr/bin/cmake on Cori is fairly old and users may want to use a newer version via module load cmake.
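As a sketch, assuming a project whose CMakeLists.txt calls find_package(UPCXX REQUIRED) as described in README.md, a build might look like the following (the -S/-B options require a reasonably recent CMake, hence the cmake module):

cori$ module load upcxx cmake
cori$ cmake -S . -B build      # the loaded upcxx module lets find_package(UPCXX) succeed
cori$ cmake --build build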

Running 64-PPN on Haswell Nodes

Running 64 UPC++ processes per node on Cori Haswell nodes (using both hardware threads of all 32 cores) requires the default "hugepage" size to be set to a non-default value of 4M or larger. This can be achieved by loading an appropriate craype-hugepages[size] environment module at run time, or by setting the environment variable $HUGETLB_DEFAULT_PAGE_SIZE to a supported value of 4M or larger.
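For example, either of the following satisfies this requirement before job launch; 4M is used here, but any supported value of 4M or larger works:

cori$ module load craype-hugepages4M          # option 1: hugepages module at run time
cori$ export HUGETLB_DEFAULT_PAGE_SIZE=4M     # option 2: environment variable (alternative to option 1)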

For more information on hugepages, run man intro_hugepages on a Cori login node. However, one should disregard the text describing PGAS models (and $XT_SYMMETRIC_HEAP_SIZE in particular) as these apply to the Cray-provided PGAS implementations, and not to GASNet-based ones such as UPC++.


OLCF Summit

Stable installs are available through environment modules. A wrapper is used to transparently dispatch commands such as upcxx to an install appropriate to the currently loaded compiler environment module.

There are two distinct environment modules available for UPC++:

  • upcxx-cuda
    This module supports "memory kinds", a UPC++ feature that enables transparent communication to/from CUDA memory on Summit's GPUs. The default version uses GPUDirect RDMA capabilities of the GPU and NIC on Summit to perform GPU memory transfers at a speed comparable to host memory. The default version currently supports only the gcc compiler family. However, older versions are still available which support both gcc and pgi at the expense of lacking acceleration for GPU memory transfers.
  • upcxx
    This module supports the gcc and pgi compiler families, but lacks support for GPU memory kinds.

On Summit, the UPC++ environment modules select a default network of ibv. You can optionally specify this explicitly on the compile line with upcxx -network=ibv ....

Please note that UPC++ does not yet work with the IBM XL compilers (the default compiler family on Summit).

Setting MODULEPATH

Because the UPC++ installation is not yet as well-integrated as on Cori, one must run module use /gpfs/alpine/world-shared/csc296/summit/modulefiles to add a non-default directory to the MODULEPATH before the UPC++ environment modules will be accessible. We recommend including this command in one's shell startup files, such as $HOME/.login or $HOME/.bash_profile.

If the command is not added to one's shell startup files, the module use ... command will be required once per login shell in which you need a upcxx environment module.
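For example, a bash user could append the command to their startup file once (the path is the one given above):

summit$ echo 'module use /gpfs/alpine/world-shared/csc296/summit/modulefiles' >> ~/.bash_profile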

Batch jobs

By default, batch jobs on Summit inherit both $PATH and the $MODULEPATH from the environment at the time the job is submitted using bsub. So, no additional steps are needed in batch jobs using upcxx-run if a upcxx or upcxx-cuda environment module was loaded when the job was submitted.
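For reference, here is a minimal sketch of an LSF batch script relying on this inheritance; the wall time, node count, and project options are illustrative only:

#!/bin/bash
#BSUB -W 5
#BSUB -nnodes 2
#BSUB -P [project]
# upcxx-run is already on $PATH because a upcxx or upcxx-cuda module was loaded at submission time
upcxx-run -n 4 ./hello-world.x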

Interactive example (assuming module use in shell startup files):

summit$ module load gcc   # since default `xl` is not supported

summit$ module load upcxx-cuda

summit$ upcxx -V
UPC++ version 2020.11.0  / gex-2020.11.0-memory_kinds
Copyright (c) 2020, The Regents of the University of California,
through Lawrence Berkeley National Laboratory.
https://upcxx.lbl.gov

g++ (GCC) 6.4.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

summit$ upcxx -O hello-world.cpp -o hello-world.x

summit$ bsub -W 5 -nnodes 4 -P [project] -Is bash
Job <714297> is submitted to default queue <batch>.
<<Waiting for dispatch ...>>
<<Starting on batch2>>

bash-4.2$ upcxx-run -n4 ./hello-world.x
Hello world from process 0 out of 4 processes
Hello world from process 2 out of 4 processes
Hello world from process 3 out of 4 processes
Hello world from process 1 out of 4 processes

Single-node runs on Summit

On a system configured as Summit is, launching executables compiled for -network=smp involves multiple complications, such that no use of jsrun (or simple wrappers around it) can provide a satisfactory solution in general. Therefore, the provided installations on Summit do not support -network=smp. We recommend that for single-node (shared-memory) application runs on Summit, one compile for the default network (ibv). It is also acceptable to use -network=mpi, such as may be required for some hybrid applications (UPC++ and MPI in the same executable).

Using PGI compilers on Summit

By default, the installation of PGI compilers on Summit uses the libstdc++ from the (extremely old) /usr/bin/g++. This has been seen to lead to errors compiling and linking modern C++ code. If you must use PGI compilers, we strongly recommend only doing so with the additional pgi-cxx14 environment module loaded:

summit$ module load pgi pgi-cxx14

CMake

A UPCXX CMake package is provided in the UPC++ install on Summit, as described in README.md. CMake is available on Summit via module load cmake. With the upcxx and cmake environment modules both loaded, CMake will additionally require either CXX=mpicxx in the environment or -DCMAKE_CXX_COMPILER=mpicxx on the command line.
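As a sketch, assuming a project whose CMakeLists.txt calls find_package(UPCXX REQUIRED), a configure-and-build step might look like this:

summit$ module load gcc upcxx cmake
summit$ cmake -S . -B build -DCMAKE_CXX_COMPILER=mpicxx
summit$ cmake --build build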

Prototype support for GPUDirect RDMA

The default version of the upcxx-cuda environment module (but not the upcxx one) includes prototype support for the GPUDirect RDMA (GDR) capabilities of the GPUs and InfiniBand hardware on Summit. This enables communication to and from GPU memory without use of intermediate buffers in host memory. This delivers significantly faster GPU memory transfers via upcxx::copy() than previous releases without GDR support. However, there are currently some outstanding known issues.

The upcxx-cuda environment module will initialize your environment with settings intended to provide correctness by default, compensating for the known issues in GDR support. This is true even where this may come at the expense of performance. At this time we strongly advise against changing any GASNET_* or UPCXX_* environment variables set by the upcxx-cuda environment module unless you are certain you know what you are doing. (Running module show upcxx-cuda will show what it sets).

Job launch on Summit

The upcxx-run utility provided with UPC++ is a relatively simple wrapper around the jsrun job launcher on Summit. Since the majority of the resource allocation/placement capabilities of jsrun have no equivalent in upcxx-run, and due to the complexity of a Summit compute node, we strongly recommend using jsrun directly for all but the simplest cases. This is especially important when using GPUs, since it is impractical to coerce upcxx-run into passing the appropriate arguments to jsrun on your behalf.

For instructions on launch of UPC++ applications using a system-provided "native" spawner, such as IBM's jsrun on Summit, see the section Advanced Job Launch in the UPC++ Programmer's Guide.

To become familiar with use of jsrun on Summit, you should read the Summit User Guide.

If you would normally have passed -shared-heap to upcxx-run, then you should set the environment variable UPCXX_SHARED_HEAP_SIZE instead. Other relevant environment variables set (or inherited) by upcxx-run can be listed by adding -show to your upcxx-run command. Additional information is available in the Advanced Job Launch chapter of the programmer's guide.
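As a minimal sketch of a direct jsrun launch (the resource-set options shown are illustrative only; real applications will want the finer control described in the Summit User Guide):

summit$ export UPCXX_SHARED_HEAP_SIZE=256MB   # in place of `upcxx-run -shared-heap 256MB`
summit$ jsrun -n 4 -a 1 -c 7 -g 1 ./hello-world.x   # 4 resource sets: 1 process, 7 cores, 1 GPU each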

Network ports on Summit

Each Summit compute node has two POWER9 CPUs, each with its own I/O bus. Each I/O bus has a connection to the single InfiniBand Host Channel Adapter (HCA). The HCA is connected to two "rails" (network ports). This combination of two I/O buses and two network rails results in four distinct paths between memory and network. The software stack exposes these paths as four (logical) HCAs named mlx5_0 through mlx5_3.

HCA       I/O bus   Rail
mlx5_0    CPU 0     A
mlx5_1    CPU 0     B
mlx5_2    CPU 1     A
mlx5_3    CPU 1     B

Which HCAs are used in a UPC++ application is determined at run time by the GASNET_IBV_PORTS environment variable. Which ports are used can have a measurable impact on network performance, but unfortunately there is no "one size fits all" optimal setting. For instance, the lowest latency is obtained by having each process use only the two HCAs on the I/O bus of the CPU where it is executing. Meanwhile, obtaining the maximum bandwidth of a given network rail from a single CPU requires use of both I/O buses.

By default, the upcxx and upcxx-cuda environment modules will set GASNET_IBV_PORTS=mlx5_0+mlx5_1. This utilizes both network rails, but only the I/O bus of CPU 0. It penalizes transfers initiated by CPU 1 and cannot reach peak bandwidth, since only a single I/O bus is used. However, for latency-sensitive applications running only on CPU 0, this is a good default. Additionally, this setting is believed to be free of a corner-case problem described at the end of this section.

The following are some scenarios and their recommended settings, based on slides which describe the analogous situation for MPI applications on Summit. However, the manner in which MPI and UPC++ use multiple HCAs differs, which accounts for small differences in the recommendations made below.

  • Processes each bound to a single CPU -- latency-sensitive.
    To get the best latency from both CPU sockets requires different settings for processes running on each, in order to use both network rails and the I/O bus nearest to the CPU.

    • CPU0: GASNET_IBV_PORTS=mlx5_0+mlx5_1
    • CPU1: GASNET_IBV_PORTS=mlx5_2+mlx5_3
  • Processes each bound to a single CPU -- bandwidth-sensitive.
    How to get the full bandwidth from both CPU sockets depends on the communication behavior of the application. If both CPUs are communicating at the same time, then the latency-optimized settings immediately above are typically sufficient to achieve peak aggregate bandwidth. However, if a single communicating CPU (at a given time) is to achieve the peak bandwidth, a different pair of process-specific settings is required (which comes at the cost of slightly increased mean latency).

    • CPU0: GASNET_IBV_PORTS=mlx5_0+mlx5_3
    • CPU1: GASNET_IBV_PORTS=mlx5_1+mlx5_2
  • Processes each bound to a single CPU -- mixed or unknown behavior.
    In relative terms, the bandwidth penalty is greater when using only a single I/O bus than is the latency penalty for use of the farther I/O bus. For this reason the bandwidth-optimizing settings (immediately above) are the nearest thing to a "generic" application recommendation.

  • Processes unbound or individually spanning both CPUs.
    In this case the best average performance comes from using only half of the available paths (and using all four incurs a measurable penalty). This setting also provides a reasonable balance when one is unable to establish per-CPU settings (see below).

    • GASNET_IBV_PORTS=mlx5_0+mlx5_3

Multi-process port selection

The recommendations above include cases in which one should provide distinct environment variables to different processes. In the future we hope this can be automated. However, until that happens one can use a simple bash shell script such as the following:

#!/bin/bash
# Determine which NUMA node (CPU socket) this process is bound to, then
# select GASNET_IBV_PORTS accordingly and exec the real application command.
socket=$(hwloc-calc -I Node $(hwloc-bind --get))
case $socket in
 1) export GASNET_IBV_PORTS=mlx5_2+mlx5_3 ;;  # bound to CPU 1
 *) export GASNET_IBV_PORTS=mlx5_0+mlx5_1 ;;  # bound to CPU 0, or spanning both CPUs
esac
exec "$@"

This example script implements the latency-optimizing settings (see below for an analogous bandwidth-optimizing version) with the additional behavior of assigning the CPU 0 setting to processes which span CPUs. To demonstrate use of this example script, let us assume it has been saved as wrapper.sh in the current directory and made executable (as with chmod +x wrapper.sh). You can then use it to prefix the executable (./my_app in the following) when running with jsrun (see also section "Job launch on Summit", above):

$ jsrun [jsrun options] ./wrapper.sh ./my_app [application args]

Correctness with multiple I/O buses

As mentioned briefly above, the default was chosen in part to avoid a corner-case correctness problem. The latency-optimizing settings are believed to be highly resistant (but not entirely immune) to this problem. However, the other settings described above (ones involving mlx5_0+mlx5_3 or mlx5_1+mlx5_2) use both I/O buses in a single process, which can lead to data corruption in some cases.

The issue is that, by default, the use of multiple I/O buses may permit an rput which has signaled operation completion to be overtaken by a subsequent rput, rget or rpc. When an rput is overtaken by another rput to the same location, the earlier value may be stored rather than the later one. When an rget overtakes an rput targeting the same location, it may fail to observe the value stored by the rput. When an rpc overtakes an rput, CPU accesses to the location targeted by the rput are subject to both of the preceding problems.

If you suspect your application is seeing such data corruption (or just want to be certain that it cannot), we recommend setting GASNET_USE_FENCED_PUTS=1 in your environment at run time. This introduces a penalty in both latency and bandwidth, but the bandwidth penalty is tiny when compared to the increase due to using both I/O buses. With this in mind, the following is the example wrapper script for bandwidth-optimized runs.

#!/bin/bash
# Bandwidth-optimized variant: each process uses one HCA on each I/O bus,
# and fenced puts guard against the data-corruption corner case described above.
socket=$(hwloc-calc -I Node $(hwloc-bind --get))
case $socket in
 1) export GASNET_IBV_PORTS=mlx5_1+mlx5_2 ;;  # bound to CPU 1
 *) export GASNET_IBV_PORTS=mlx5_0+mlx5_3 ;;  # bound to CPU 0, or spanning both CPUs
esac
export GASNET_USE_FENCED_PUTS=1
exec "$@"

ALCF Theta

Stable installs are available through environment modules. A wrapper is used to transparently dispatch commands such as upcxx to an install appropriate to the currently loaded PrgEnv-{intel,gnu,cray} and compiler (intel, gcc, or cce) environment modules.

On Theta, the UPC++ environment modules select a default network of aries. You can optionally specify this explicitly on the compile line with upcxx -network=aries ....

Setting MODULEPATH

Because the UPC++ installation is not yet as well-integrated as on Cori, one must run module use ... to add a non-default directory to the MODULEPATH before the UPC++ environment modules will be accessible. We recommend including the required command in one's $HOME/.modulerc to make it persistent. This file must begin with a #%Module line to be accepted by the module command.

A complete .modulerc suitable for Theta:

#%Module
module use /projects/CSC250STPM17/modulefiles

If not using .modulerc, the module use ... command will be required once per login shell in which you need a upcxx environment module.

Batch jobs

COBALT jobs (both batch and interactive) do not inherit the necessary settings from the submit-time environment, meaning both the module use ... and module load upcxx may be required in batch jobs which use upcxx-run. This is shown in the example below.
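In addition to the interactive demonstration below, here is a minimal sketch of a Cobalt batch script body; the module use path is the one given above for Theta:

#!/bin/bash
# Neither MODULEPATH nor loaded modules are inherited, so both commands are repeated here.
module use /projects/CSC250STPM17/modulefiles
module load upcxx
upcxx-run -n 4 -N 2 ./hello-world.x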

Interactive example (assuming use of .modulerc):

theta$ module load upcxx

theta$ upcxx --version
UPC++ version 2020.10.0  / gex-2020.10.0
Copyright (c) 2020, The Regents of the University of California,
through Lawrence Berkeley National Laboratory.
https://upcxx.lbl.gov

icpc (ICC) 19.1.0.166 20191121
Copyright (C) 1985-2019 Intel Corporation.  All rights reserved.

theta$ upcxx -O hello-world.cpp -o hello-world.x

theta$ qsub -q debug-cache-quad -t 10 -n 2 -A CSC250STPM17 -I
Connecting to thetamom3 for interactive qsub...
Job routed to queue "debug-cache-quad".
Memory mode set to cache quad for queue debug-cache-quad
Wait for job 418194 to start...
Opening interactive session to 3833,3836
thetamom3$ # Note that modules have reset to defaults
thetamom3$ module load upcxx
thetamom3$ upcxx-run -n 4 -N 2 ./hello-world.x
Hello world from process 0 out of 4 processes
Hello world from process 1 out of 4 processes
Hello world from process 3 out of 4 processes
Hello world from process 2 out of 4 processes

CMake

A UPCXX CMake package is provided in the UPC++ install on Theta, as described in README.md. While /usr/bin/cmake is too old, sufficiently new CMake versions are available on Theta via module load cmake. With the upcxx and cmake environment modules both loaded, CMake should "just work" on Theta.


NERSC Cori GPU nodes

In addition to their primary Cray XC system, Cori, NERSC maintains a small non-production cluster of GPU-equipped nodes connected by multirail InfiniBand. While they share the same home directories and login nodes as the Cray XC system, the GPU nodes are not binary compatible with the XC nodes. The following assumes you have been granted access to the GPU nodes, and that you have read and understand the online documentation for their use.

Though covered in the online documentation, it is worth repeating here that by default allocations of Cori GPU nodes are shared -- you will be running on a system with multiple users and therefore must not trust performance numbers unless you explicitly request an exclusive node allocation.

Stable installs are available through the upcxx-gpu environment modules. A wrapper is used to transparently dispatch commands such as upcxx to an install appropriate to the currently loaded compiler modules. Since these installs do not use Cray's cc and CC wrappers, a loaded intel or gcc environment module will determine which compiler family is used. Note there is no support for cray, pgi, or nvhpc compiler families on the GPU nodes.

Due to differences in the environments (installed networking libraries in particular), the upcxx-gpu environment module can only be loaded on a cgpu node, not on a Cori login node. This means that compilation of UPC++ applications to be run on the Cori GPU nodes cannot be done on a login node as one would for the Cray XC nodes. The cgpu and cuda environment modules, plus a compiler environment module, must all be loaded before loading the upcxx-gpu environment module.

Since the upcxx-gpu environment module can only be loaded on the cgpu nodes, compilation is typically done in an interactive session launched using salloc. Since the Slurm configuration does change occasionally, one should consult NERSC's online documentation for the proper command, and especially for the options related to allocation of GPUs.

The upcxx-gpu environment module selects a default network of ibv. You can optionally specify this explicitly on the compile line with upcxx -network=ibv ....

Interactive example:

Please note that, unlike in all prior examples, the compilation (the upcxx command) takes place inside the interactive session on the compute nodes.

cori$ module purge
cori$ module load cgpu

cori$ salloc -N2 -C gpu -p gpu --gpus-per-node=1 -t 10
salloc: Pending job allocation 1149547
salloc: job 1149547 queued and waiting for resources
salloc: job 1149547 has been allocated resources
salloc: Granted job allocation 1149547
salloc: Waiting for resource configuration
salloc: Nodes cgpu[02,13] are ready for job

cgpu02$ module load cuda

cgpu02$ module load intel

cgpu02$ module load upcxx-gpu

cgpu02$ upcxx --version
UPC++ version 2020.11.0  / gex-2020.11.0-memory_kinds
Copyright (c) 2020, The Regents of the University of California,
through Lawrence Berkeley National Laboratory.
https://upcxx.lbl.gov

icpc (ICC) 19.0.3.199 20190206
Copyright (C) 1985-2019 Intel Corporation.  All rights reserved.

cgpu02$ upcxx -O hello-world.cpp -o hello-world.x

cgpu02$ upcxx-run -n 4 -N 2 ./hello-world.x
Hello world from process 0 out of 4 processes
Hello world from process 2 out of 4 processes
Hello world from process 1 out of 4 processes
Hello world from process 3 out of 4 processes

CMake

A UPCXX CMake package is provided in the UPC++ install on the Cori GPU nodes as described in README.md. Thus with the upcxx-gpu environment module loaded, CMake should "just work". However, /usr/bin/cmake is fairly old and users may want to use a newer version via module load cmake.

Multirail networking on Cori GPU Nodes

Each Cori GPU node has five Mellanox InfiniBand Host Channel Adapters (HCAs) providing a total of nine network ports. Of those, as many as seven are potentially usable for UPC++. The upcxx-gpu environment module will initialize your environment with settings which emphasize correctness over network performance. At this time we strongly advise against changing any GASNET_* or UPCXX_* environment variables set by the upcxx-gpu environment module unless you are certain you know what you are doing. (Running module show upcxx-gpu on a GPU node will show what it sets).

Prototype support for GPUDirect RDMA

The default upcxx-gpu environment module includes prototype support for the GPUDirect RDMA (GDR) capabilities of the GPUs and InfiniBand hardware on the Cori GPU nodes. This enables communication to and from GPU memory without use of intermediate buffers in host memory. This delivers significantly faster GPU memory transfers via upcxx::copy() than previous releases without GDR support. However, there are currently some outstanding known issues.

The upcxx-gpu environment module will initialize your environment with settings intended to provide correctness by default, compensating for the known issues in GDR support, even where this may come at the expense of performance. As noted in the previous section, we strongly advise against changing any GASNET_* or UPCXX_* environment variables set by the upcxx-gpu environment module unless you are certain you know what you are doing.

Job launch on Cori GPU nodes

The upcxx-run utility provided with UPC++ is a relatively simple wrapper, which in the case of the Cori GPU nodes simply runs srun. To have full control over process placement, thread pinning and GPU allocation, users are advised to consider launching their UPC++ applications directly with srun. However, one should do so only with the upcxx-gpu environment module loaded due to the importance of the environment variable settings for use of multiple InfiniBand ports, alluded to above.

If you would normally have passed -shared-heap to upcxx-run, then you should set the environment variable UPCXX_SHARED_HEAP_SIZE instead. Other relevant environment variables set (or inherited) by upcxx-run can be listed by adding -show to your upcxx-run command. Additional information is available in the Advanced Job Launch chapter of the programmer's guide.
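As a minimal sketch of a direct srun launch (the Slurm options shown are illustrative only), run with the upcxx-gpu environment module loaded so its environment settings remain in effect:

cgpu02$ export UPCXX_SHARED_HEAP_SIZE=256MB   # in place of `upcxx-run -shared-heap 256MB`
cgpu02$ srun -n 4 -N 2 ./hello-world.x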
