Using UPC++ on ALCF Polaris

ALCF Polaris

This document is a continual work-in-progress, intended to provide up-to-date information on a public install maintained by (or in collaboration with) the UPC++ team. However, systems are constantly changing, so please report any errors or omissions in the issue tracker.

Typically, installs of UPC++ are maintained only for the current default versions of the system-provided environment modules, such as those for PrgEnv, CUDA and the compiler.

This document is not a replacement for the documentation provided by the centers, and assumes general familiarity with the use of the system.

General

Stable installs are available through environment modules. A wrapper is used to transparently dispatch commands such as upcxx to an install appropriate to the currently loaded PrgEnv-{gnu,cray,nvidia} and compiler (gcc, cce, or nvidia) environment modules.

Environment Modules

In order to access the UPC++ installation on Polaris, one must run

$ module use /lus/eagle/projects/CSC250STPM17/polaris/modulefiles
to add a non-default directory to the MODULEPATH before the UPC++ environment modules become accessible. We recommend adding this command to your shell startup files, such as $HOME/.login or $HOME/.bash_profile.
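For example, a bash user might append it to $HOME/.bash_profile as follows (adjust the file name for your shell):

$ echo 'module use /lus/eagle/projects/CSC250STPM17/polaris/modulefiles' >> $HOME/.bash_profile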

If you do not add the command to your shell startup files, the module use ... command will be required once per login shell or PBS job in which you need an upcxx environment module.

Environment modules provide two alternative configurations of the UPC++ library:

  • upcxx-cuda
    This module supports memory kinds, a UPC++ feature that enables communication to/from GPU memory via upcxx::copy on upcxx::global_ptr<T, memory_kind::cuda_device>. When using this module, copy operations on cuda_device memory leverage GPUDirect RDMA ("native" memory kinds).
  • upcxx
    This module omits support for constructing an active upcxx::device_allocator<upcxx::cuda_device> object, resulting in a small potential speed-up for applications that do not require a "CUDA-aware" build of UPC++. (A brief load example follows this list.)
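As a small illustration (assuming the module use command above has already been issued), load exactly one of the two configurations:

polaris$ module load upcxx-cuda

or, for applications with no GPU communication:

polaris$ module load upcxx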

By default, each module above selects the latest recommended version of the UPC++ library. One can see the installed versions with a command like module avail upcxx, and can optionally select a particular version explicitly with a command of the form: module load upcxx/20XX.YY.ZZ.
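For instance (the version string shown here is purely illustrative; use one actually reported by module avail):

polaris$ module avail upcxx
polaris$ module load upcxx/2023.9.0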

On Polaris, the UPC++ environment modules select a default network of ofi. You can optionally specify this explicitly on the compile line with upcxx -network=ofi ....
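For example, the following is equivalent to compiling for the default network (using the hello-world source from the example further below):

polaris$ upcxx -network=ofi -O hello-world.cpp -o hello-world.x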

Caveats

The installs provided on Polaris utilize the Cray Programming Environment, and the cc and CC compiler wrappers in particular. It is possible to use upcxx (or CC and upcxx-meta) to link code compiled with the "native compilers" such as g++ and nvc++ (provided they match the PrgEnv-* module). However, direct use of the native compilers to link UPC++ code is not supported with these installs.
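As a sketch of one supported workflow under PrgEnv-gnu (the file names are hypothetical): non-UPC++ objects may be compiled with the native compiler, but the final link goes through upcxx:

polaris$ g++ -O2 -c util.cpp -o util.o      # non-UPC++ code, built with the native compiler
polaris$ upcxx -O -c main.cpp -o main.o     # UPC++ code, built via the upcxx wrapper
polaris$ upcxx -O main.o util.o -o app.x    # final link via upcxx, not the native compiler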

Job launch

The upcxx-run utility provided with UPC++ is a relatively simple wrapper, which in the case of Polaris uses aprun. To have full control over process placement, thread pinning and GPU allocation, users are advised to consider launching their UPC++ applications directly with aprun. However, one should do so only with the upcxx or upcxx-cuda environment module loaded to ensure the appropriate environment variable settings.

If you would normally have passed -shared-heap to upcxx-run, then it is particularly important that both UPCXX_SHARED_HEAP_SIZE and GASNET_MAX_SEGSIZE be set accordingly. The values of those and other potentially relevant environment variables set (or inherited) by upcxx-run can be listed by adding -show to your upcxx-run command (which will print useful information but not run anything). Additional information is available in the Advanced Job Launch chapter of the UPC++ v1.0 Programmer's Guide.
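For example, the following prints the relevant settings for an illustrative 512 MB shared heap without launching anything; the reported UPCXX_SHARED_HEAP_SIZE and GASNET_MAX_SEGSIZE values can then be exported in the environment of a direct launch:

x3005c0s37b0n0$ upcxx-run -show -shared-heap 512MB -n 4 -N 2 ./hello-world.x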

Single-node runs

On a system like Polaris, launching executables compiled for -network=smp involves several complications, such that no use of aprun (or simple wrappers around it) can provide a satisfactory solution in general. Therefore, we recommend that for single-node (shared memory) application runs on Polaris, one should compile for the default network (ofi). It is also acceptable to use -network=mpi, such as may be required for some hybrid applications (UPC++ and MPI in the same executable). However, note that in multi-node runs -network=mpi imposes a significant performance penalty.
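For instance, a single-node run of four processes, compiled for the default ofi network as recommended above, might look like this:

polaris$ upcxx -O hello-world.cpp -o hello-world.x

x3005c0s37b0n0$ upcxx-run -n 4 -N 1 ./hello-world.x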

Batch jobs

By default, PBS jobs (both batch and interactive) do not inherit the necessary settings from the submit-time environment, meaning both the module use ... and module load upcxx commands may be required in batch jobs which use upcxx-run. This is shown in the examples below.
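A minimal batch script might look like the following sketch, where the PBS options mirror the interactive qsub line in the next example and the project name is a placeholder:

#!/bin/bash
#PBS -q debug
#PBS -l walltime=10:00
#PBS -l filesystems=home
#PBS -l select=2:system=polaris
#PBS -A YourProjectName

module use /lus/eagle/projects/CSC250STPM17/polaris/modulefiles
module load upcxx

cd $PBS_O_WORKDIR
upcxx-run -n 4 -N 2 ./hello-world.x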

Interactive example:

polaris$ module use /lus/eagle/projects/CSC250STPM17/polaris/modulefiles

polaris$ module load upcxx

polaris$ upcxx --version
UPC++ version 2023.9.0  / gex-2023.9.0-0-g5b1e532
Citing UPC++ in publication? Please see: https://upcxx.lbl.gov/publications
Copyright (c) 2023, The Regents of the University of California,
through Lawrence Berkeley National Laboratory.
https://upcxx.lbl.gov

nvc++ 21.9-0 64-bit target on x86-64 Linux -tp zen-64
NVIDIA Compilers and Tools
Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

polaris$ upcxx -O hello-world.cpp -o hello-world.x

polaris$ qsub -q debug -l walltime=10:00,filesystems=home -A ... -l select=2:system=polaris -I

x3005c0s37b0n0$ module use /lus/eagle/projects/CSC250STPM17/polaris/modulefiles
x3005c0s37b0n0$ module load upcxx

x3005c0s37b0n0$ upcxx-run -n 4 -N 2 ./hello-world.x
Hello world from process 0 out of 4 processes
Hello world from process 1 out of 4 processes
Hello world from process 2 out of 4 processes
Hello world from process 3 out of 4 processes

CMake

A UPCXX CMake package is provided in the UPC++ install on Polaris, as described in README.md. Thus with the upcxx environment module loaded, CMake should "just work".
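For example, assuming a project whose CMakeLists.txt already calls find_package(UPCXX) as described in that README (the project layout here is hypothetical), an out-of-source build could be as simple as:

polaris$ module load upcxx
polaris$ cmake -S . -B build
polaris$ cmake --build build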

Known Issues

Correctness problems with intensive communication on HPE Slingshot-11

Currently, there are known issues with the vendor's communications software stack below UPC++ and GASNet-EX which may negatively impact certain communication-intensive UPC++ applications (e.g. those concurrently sending large numbers of RPCs to one or more processes).

Impacts observed have included crashes and hangs of correct UPC++ applications. Of course, either of those failure modes can be the result of other issues. If you believe your application is impacted, please follow the steps below.

  1. Try running your application on a system with a network other than Slingshot-11 (but not Slingshot-10 which has a similar, but distinct, issue). If the failures persist, then the problem is not the one described here. You should look for defects in your application, or for other defects in UPC++ or external software.
  2. If you have observed crashes, but not hangs, then try running your application with GASNET_OFI_RECEIVE_BUFF_SIZE=recv in the environment. This disables use of a feature linked to the known source of crashes, but may result in a small reduction in RPC performance.
  3. If you have observed hangs, then try running your application with all of the following environment variable settings:
    GASNET_OFI_RECEIVE_BUFF_SIZE=recv
    FI_OFI_RXM_RX_SIZE=8192
    FI_CXI_DEFAULT_CQ_SIZE=13107200
    FI_MR_CACHE_MONITOR=memhooks
    FI_CXI_RX_MATCH_MODE=software
    FI_CXI_REQ_BUF_MIN_POSTED=10
    FI_CXI_REQ_BUF_SIZE=25165824
    These settings will have a negative impact on both performance and memory use. However, in most cases they have been seen to be sufficient to eliminate the problem(s). (A ready-to-copy version of these settings appears after this list.)
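As a convenience, the settings from step 3 can be applied as follows before launching (the values are copied verbatim from the list above; the launch line is just the hello-world example):

export GASNET_OFI_RECEIVE_BUFF_SIZE=recv
export FI_OFI_RXM_RX_SIZE=8192
export FI_CXI_DEFAULT_CQ_SIZE=13107200
export FI_MR_CACHE_MONITOR=memhooks
export FI_CXI_RX_MATCH_MODE=software
export FI_CXI_REQ_BUF_MIN_POSTED=10
export FI_CXI_REQ_BUF_SIZE=25165824
upcxx-run -n 4 -N 2 ./hello-world.x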

If none of the options above resolves crashes or hangs of your communication-intensive UPC++ application, you can seek assistance using the issue tracker.


Information about UPC++ installs on other production systems

Please report any errors or omissions in the issue tracker.
