UPC++ FAQs

Below are some Frequently Asked Questions (FAQs) for UPC++.

All answers refer to the latest release of the UPC++ implementation and specification, which are available at the UPC++ home page.


General

What's the best way to get started learning UPC++?

For training materials to get started using UPC++, see the UPC++ Training Page.


What systems support UPC++?

UPC++ runs on a variety of systems, ranging from laptops to custom supercomputers. The complete list of officially supported platforms (and C++ compilers) appears in the UPC++ INSTALL document. This currently includes systems with x86_64, ARM64 and PowerPC64 architectures running Linux, macOS or Windows (with WSL). It's also been successfully used on other POSIX-like systems, despite not being officially supported.


How do I download/install UPC++ for my system?

There are several ways to access a working UPC++ environment:

Installs on the major DOE centers

We maintain public installs of UPC++ at several DOE centers.

See UPC++ at NERSC, OLCF, and ALCF

Download and install from source

UPC++ supports a wide range of UNIX-like systems and architectures, ranging from supercomputers to laptops, including Linux, macOS and Windows (via WSL).

Visit the UPC++ download page for links to download and install UPC++ on your own system.

Docker containers with UPC++

We maintain a Linux Docker container with UPC++.

Assuming you have a Linux-compatible Docker environment, you can get a working UPC++ environment in seconds with the following command:

docker run -it --rm upcxx/linux-amd64
The UPC++ commands inside the container are upcxx (compiler wrapper) and upcxx-run (run wrapper), and the home directory contains some example codes. Note this container is designed for test-driving UPC++ on a single node, and is not intended for long-term development or use in production environments.

The container above supports the SMP and UDP backends, and does not include an MPI install. If you additionally need MPI support, there is also a larger "full" variant of the container that includes both UPC++ and MPI:

docker run -it --rm upcxx/linux-amd64-full
For details on hybrid programming with MPI, see Mixing UPC++ with MPI.


Is UPC++ available in any containers?

We provide minimal Docker containers intended for test-driving the latest UPC++ release (see above answer).

The E4S project provides a variety of full-featured containers in Docker, Singularity, OVA and EC2 that include UPC++, along with the entire E4S SDK. See documentation at the E4S website for usage instructions.


How should I cite UPC++ in publications?

The UPC++ publications page provides the recommended general citations for citing UPC++ and/or GASNet-EX, along with a BibTeX file you can import directly to make it easy.

We're especially interested to hear about work-in-progress publications using UPC++; please contact us to start a discussion!


Futures and completion

Are UPC++ futures thread-safe?

No. See the guide section on Personas for more details.


Does a barrier synchronize RMA operations (rput/rget)?

No. UPC++ provides per-operation completions to this end. See the guide section on the Interaction of Collectives and Operation Completion for more details.


Will calling sleep() flip a future to the ready state?

No. Futures returned by communication operations will only become readied during upcxx::progress() or other library calls with user-level progress, such as future::wait(). No amount of sleeping (sleep(), usleep(), std::this_thread::sleep_for(), etc.) will ready a future, even if the underlying transfer has actually completed.
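For example, a minimal sketch (gptr here is a hypothetical upcxx::global_ptr<int> referencing a peer's shared memory):

upcxx::future<int> f = upcxx::rget(gptr); // start an asynchronous get
sleep(5);             // does NOT advance UPC++ progress; f may well still be un-ready
while (!f.ready())    // correct: explicitly advance the runtime until the future is readied
  upcxx::progress();
int v = f.result();   // or equivalently, just: int v = f.wait();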


Are asynchronous RMA/RPC operations guaranteed to happen in the order they were initiated?

No. There are no implicit ordering guarantees between any concurrent/overlapped communication operations. The only ordering guarantees are those explicitly enforced through the use of operation completions and synchronization. See the guide section on Completions for more details.

In particular, there are no implicit point-to-point ordered delivery guarantees for overlapped RPC operations between the same pair of processes (in contrast to, for example, the message ordering semantics provided by MPI_Isend/MPI_Recv). Similarly, two concurrent RMA put operations issued by the same process are not guaranteed to update the target memory and complete in the order they were issued.

For example:

// UNORDERED puts:
auto f1 = upcxx::rput(1, gptr1); // first put begins
auto f2 = upcxx::rput(2, gptr2); // second put begins
// both puts are *concurrently* "in-flight"
// They are not guaranteed to update memory or signal completion in any particular order
f1.wait(); // returns when first put is completed
f2.wait(); // returns when second put is completed

// Explicitly ORDERED puts (using operation completion and program order):
auto f1 = upcxx::rput(1, gptr1); // first put begins
f1.wait(); // returns when first put is completed
auto f2 = upcxx::rput(2, gptr2); // second put begins
f2.wait(); // returns when second put is completed

// Explicitly ORDERED puts (using future continuations):
auto f = upcxx::rput(1, gptr1) // first put begins
  .then([=]() {
    // continuation runs after first put is completed
    return upcxx::rput(2, gptr2); // second put begins
  });
f.wait(); // returns when both puts are completed

These same examples also apply analogously to upcxx::rpc() operations.

The lack of implicit ordering guarantees enables more efficient communication, because it helps ensure one only incurs synchronization overheads where explicitly required for program correctness.


When I conjoin futures using when_all(), in what order do the operations complete? Is there a possibility of a race condition?

Futures progress sequentially on their owning thread and in unspecified order. UPC++ futures are not a tool for expressing parallelism, but rather for expressing non-deterministic sequential behavior. Futures allow a thread to react to events in the order they actually occur, since the order in which communication operations complete is not deterministic. While UPC++ does support multiple threads within each process, and each thread can be managing its own futures, the same future must never be accessed concurrently by multiple threads. Also, UPC++ will never "spawn" new threads. See the section "Futures and Promises" of the UPC++ specification.
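For illustration, a sketch of conjoining two gets (gp1 and gp2 are hypothetical global pointers) and reacting once both have completed, in whatever order that happens:

upcxx::future<int>    fa = upcxx::rget(gp1);
upcxx::future<double> fb = upcxx::rget(gp2);
// fboth becomes ready only after both underlying operations have completed
upcxx::future<int, double> fboth = upcxx::when_all(fa, fb);
fboth.then([](int a, double b) {
  // executes sequentially on this thread during progress; it never races
  // with other continuations owned by this thread
  printf("sum = %g\n", a + b);
}).wait();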


Do some completion events imply other completion events?

Yes. Here is the exact wording from the UPC++ Specification:

Operation completion implies both source and remote completion. However, it does not imply that notifications associated with source and remote completion have occurred. Similarly, remote completion implies source completion, but it does not imply that notifications associated with source completion have occurred.

So for example in an RMA injection function call of the abstract form: 

upcxx::rput(psrc, pdst, sz, 
            operation_cx::N1 | remote_cx::N2 | source_cx::N3)

The operation completion event implies the other two completion events have also occurred, and the remote completion event implies the source completion event has also occurred. However, all of these events reflect the state of rput's underlying RMA data transfer (i.e. the state of the memory referenced by psrc and pdst), and do NOT imply that any notifications (N1,N2,N3) associated with those events have been signaled yet. So for example if the actual call was:

upcxx::rput(psrc, pdst, sz, 
            operation_cx::as_promise(p1) | remote_cx::as_rpc(C2) | source_cx::as_promise(p2))

Observing a signal on promise p1 does not guarantee that callback C2 has executed at the destination process, nor that promise p2 has been signaled yet. It only implies that those event notifications are queued with the UPC++ runtime at the appropriate process and will eventually be delivered during a call with user-level progress. The order in which those notifications are actually delivered to the application is unspecified, and may depend on both system-dependent details and execution timing. The same remains true when the as_promise() notification is replaced by other notification mechanisms including as_future() or as_lpc().

The discussion above is focused on bulk rput(), because it is one of the few communication injection calls that support all three types of completion event. Other put-like RMA calls are similar. The same properties also apply in the obvious way to communication functions that only support two types of completion event. However, note the discussion above should not be misread with regard to round-trip RPC injection, e.g. upcxx::rpc(tgtrank, operation_cx::N4, C4); in that particular case, operation completion event N4 does imply that remote callback C4 has completed execution at the target process and delivered its return value (if any) back to the initiator.


Remote Procedure Call (RPC)

When I send a lambda expression to RPC, can I safely use lambda captures?

The short answer is: NEVER use by-reference lambda captures for RPC, and by-value captures are ONLY safe if the types are TriviallySerializable (TriviallyCopyable). See the guide section on RPC and Lambda Captures for more details.
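For example (target is a hypothetical destination rank):

int x = 42;                        // TriviallyCopyable
std::vector<double> v(1000, 1.0);  // NOT TriviallyCopyable

// OK: by-value capture of a TriviallyCopyable int travels safely with the lambda
upcxx::rpc(target, [x]() { return x + 1; }).wait();

// WRONG: by-reference capture -- &x is meaningless in the target process:
//   upcxx::rpc(target, [&x]() { return x + 1; }).wait();

// For non-trivial data, pass it as an RPC *argument* so it goes through
// UPC++ serialization rather than a lambda capture:
upcxx::rpc(target, [](const std::vector<double> &arg) { return arg.size(); }, v).wait();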


Can I safely send a pointer to static data variables and dereference them in an RPC callback?

In general this is NOT guaranteed to work correctly. In particular, systems using Address space layout randomization (ASLR) (a security feature which is enabled by default on many Linux and macOS systems) will load static data and possibly even code segments at different randomized base addresses in virtual memory for each process in the job.

Note your callbacks can safely refer to static/global variables by name, which will be resolved correctly wherever the callback runs. For variables with a dynamic lifetime, the recommended solution is to use a dist_object wrapper; sending a dist_object reference as an RPC argument automatically ensures the callback receives a reference to the corresponding local representative of the dist_object.
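A sketch of the dist_object approach (the variable names and target_rank are illustrative):

// each process constructs its own local representative of the distributed object
upcxx::dist_object<std::vector<double>> my_data(std::vector<double>(100, upcxx::rank_me()));

// passing my_data as an RPC argument ships only a lightweight handle;
// the callback parameter binds to the *target's* local representative
double first = upcxx::rpc(target_rank,
    [](upcxx::dist_object<std::vector<double>> &d) { return (*d)[0]; },
    my_data).wait();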


Do RPC callbacks incur concurrency or thread-safety issues?

No.

Remote Procedure Call (RPC) callbacks always execute sequentially and synchronously inside user-level progress for the thread holding the unique "master persona" (a special persona that is initially granted to the thread calling upcxx::init() on each process). User-level progress only occurs inside a handful of explicit UPC++ library calls (e.g. upcxx::progress() and upcxx::future::wait()).

For more details, see the guide section on the Master Persona.


Can you make an RPC call from within a RPC callback? When I try to wait on the nested RPC I get an error/hang.

You can absolutely launch RPC, RMA or other communication from within an RPC callback. However you cannot wait on the completion of communication from within the RPC callback. If you run in debug mode you'll get an error message explaining that, e.g.: "You have attempted to wait() on a non-ready future within upcxx progress, this is prohibited because it will never complete."

So if you launch communication from RPC callback context, you need to use UPC++'s other completion and asynchrony features to synchronize completion instead of calling future::wait(). There are many different ways to do this, and the best choice depends on the details of what the code is doing. For example, one could request future-based completion on that operation and either stash the resulting future in storage outside the callback's stack frame, or better yet schedule a completion callback on the future using future::then() with a continuation of whatever you would have done after the wait (see the guide section on Asynchronous Computation for more details). There are also operations (like rpc_ff) and completion types (remote_cx::as_rpc) that signal completion at the remote side (see the guide section on Remote Completions for more details), so you can perform synchronization at the target instead of the initiator.
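For example, a sketch of one continuation-based approach (r1 and r2 are placeholder ranks): the callback returns the nested operation's future, so the outer RPC's result is not delivered until the nested RPC has also completed.

// WRONG: calling wait() inside the callback (which runs inside progress) errors or hangs.

// OK: return the nested future; the outer rpc's future only becomes ready
// (here with value 42) once the nested RPC has also completed:
upcxx::future<int> f = upcxx::rpc(r1, [=]() {
  return upcxx::rpc(r2, [](int v) { return 2 * v; }, 21);
});
int result = f.wait();  // 42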


Does upcxx do any optimization if I happen to call an RPC destined to rank_me()? Do I need to explicitly do anything to make that more efficient?

There is an optimization applied to all targets in the local_team() (shared-memory bypass avoids one payload copy for large payloads using the rendezvous-get protocol), but nothing specific to same-process "loopback" (rank_me).

If you think same-process/loopback RPCs comprise a significant fraction of your RPC traffic, you should probably consider that serialization semantics require copying the argument payload at least once, even for a loopback RPC. Also, the progress semantics require deferring callback invocation until the next user-level progress, so if the input arguments are hot in-cache then you may lose that locality by the time the RPC callback runs. There is also some intrinsic cost associated with callback deferment that makes it significantly more expensive than a synchronous direct function call, even for lightweight RPC arguments.

So if the RPC arguments are large (e.g., a large upcxx::view) or otherwise expensive to serialize (e.g., lots of std::string or other containers requiring individual fine-grained allocations), and/or are invoked with very high frequency, then you might consider converting a loopback RPC into a direct function call, ideally passing the arguments by reference (assuming the callback can safely share the input argument data without copying it). This is not a transformation the library can perform automatically, because it would not be semantically transparent, but a user with knowledge of the application and callback semantics can often do it safely and potentially get a big win under the right circumstances.
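A hedged sketch of such an application-level special case (process(), Payload, arg and target are hypothetical application names):

if (target == upcxx::rank_me()) {
  // loopback: call directly, avoiding serialization and callback deferral
  // (only valid if the callback may run immediately and share arg without a copy)
  process(arg);
} else {
  upcxx::rpc_ff(target, [](const Payload &p) { process(p); }, arg);
}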


Is a dist_object<T> a shared object?

In general NO - these concepts are orthogonal.

The upcxx::dist_object<T> comprising a distributed object need not involve any storage in the shared memory segment. In fact, in many common use cases for dist_object<T> these objects live on the program stack, meaning both the dist_object and embedded T object reside entirely in private memory (and attempting to call upcxx::try_global_pointer() on the address of such an object would yield a null pointer). You can of course explicitly construct a upcxx::dist_object<T> object in the shared heap using upcxx::new_<upcxx::dist_object<T>>(...), but this is probably not a very useful pattern.

A more common and useful pattern is dist_object<global_ptr<T>>, where the dist_object and global pointer objects themselves usually live in private memory, but the contained global pointer references an object allocated in the shared heap.

It might be helpful to think of distributed objects as a collective abstraction which is mostly useful for RPC and dist_object::fetch(). This abstraction is mostly orthogonal to object storage in the shared segment, which is mostly used for RMA operations (rput/rget/copy) or local_team shared-memory bypass.
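For example, a sketch of that common pattern:

// allocate an array in this process's shared segment and publish its location
upcxx::global_ptr<double> gp = upcxx::new_array<double>(1024);
upcxx::dist_object<upcxx::global_ptr<double>> directory(gp);

// any process can fetch a peer's pointer and target it with RMA
int neighbor = (upcxx::rank_me() + 1) % upcxx::rank_n();
upcxx::global_ptr<double> remote = directory.fetch(neighbor).wait();
upcxx::rput(3.14, remote).wait();   // writes into the neighbor's shared array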


Serialization

How should I handle serialization of user-defined classes?

Many of the standard library container types are automatically serialized, as are TriviallyCopyable types. For more complicated types there are several mechanisms for customizing serialization: see the guide section on Serialization for more details. Note that if you are transferring a container of data elements "consumed" by the callback, you should consider using view-based serialization for transferring those elements to reduce extraneous data copies: see the guide section on View-Based Serialization for details.
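For instance, a minimal sketch of field-based custom serialization, one of the mechanisms described in the guide (the Particle type, my_particle and target rank are illustrative):

struct Particle {
  double mass;
  std::vector<int> neighbors;   // not TriviallyCopyable, so Particle needs explicit support
  UPCXX_SERIALIZED_FIELDS(mass, neighbors)  // serialize exactly these members
};

// Particle (and containers of Particle) can now be passed as RPC arguments:
upcxx::rpc(target, [](const Particle &p) { /* consume p */ }, my_particle).wait();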


Why does RMA on a non-trivially-copyable class generate a compile error about TriviallySerializable?

The error you are getting refers to TriviallySerializable because the RMA operation you are attempting does not support the serialization you seek. Operations which are intended to be accelerated by network RMA hardware (such as rput, rget) can only move flat byte sequences, hence we assert the type given can be meaningfully moved that way. RPC's are not restricted in the types they transmit since the CPU is always involved, and we do indeed accept and serialize most std:: containers when given as RPC arguments. There are also now interfaces for defining custom serialization on your own types for use with RPC: see the guide section on Serialization for more details.


How does the efficiency of a upcxx::view<T> argument compare to other mechanisms for passing the same TriviallySerializable T elements to an RPC?

See the guide section on View-Based Serialization for background. A view argument over a TriviallySerializable type adds 8 bytes of serialized data payload to store the number of elements. Aside from that, there are no additional overheads to using a view: the RPC callback at the target can access the elements directly in the network buffer at no incremental cost.

If your alternative is to send those elements in a container like a std::vector, std::list, std::set, etc., those also all add 8 bytes of element count to the serialized representation (std::array notably does not, as the element count is static information), but these containers all also add extra element data copies and allocation calls to construct the container at the target. So a view should never be worse than a container over the same elements, and can be significantly better.

If your alternative is to send each element as an individual unboxed RPC argument, that might be comparable to a view for a very small number of elements. But the view should quickly win once you amortize the extra 8 bytes on-the-wire and reap the efficiency benefit of passing the arguments to the callback function via pointer indirection instead of forcing the compiler to copy/move each element to the program stack (beyond the point where they all fit in registers).
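For example, a sketch of shipping a vector's elements through a view (target and data are placeholders):

std::vector<double> data(1000, 1.0);

double sum = upcxx::rpc(target,
    [](upcxx::view<double> v) {
      // the callback reads the elements directly out of the network buffer;
      // no std::vector is reconstructed at the target
      double s = 0;
      for (double x : v) s += x;
      return s;
    },
    upcxx::make_view(data.begin(), data.end())).wait();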


Remote Memory Access (RMA)

What are the benefits of using upcxx::rput and upcxx::rget RMA communication?

The main benefits are:

  1. The data transfer is performed in a zero-copy manner on most HPC-relevant hardware (e.g. Cray networks or InfiniBand). Fewer payload copies mean less CPU overhead tied up in communication, and these costs are especially relevant when the payload is large.
  2. On HPC-relevant hardware, the transfer is one-sided down to the hardware level. Specifically, the data transfer is offloaded to the network hardware on the remote end, meaning there are no communication delays associated with involvement of the remote CPU.

These same benefits also apply to upcxx::copy when applied to host memory, or to device memory with native memory kinds support.


Are data transfers still one-sided when using RMA with remote completion?

Yes, with a caveat. UPC++ allows RMA transfers (rput and copy) to request target-side remote completion notification, sometimes known as "signaling put". This feature is described in the guide section on Remote Completions.

Here's a simple example:

double *lp_src = ...;
global_ptr<double> gp_dst = ...;
upcxx::rput(lp_src, gp_dst, count, 
            remote_cx::as_rpc([](t1 f1, t2 f2){ ... }, a1, a2));

The RMA data payload transfer (from lp_src to gp_dst) remains fully one-sided and zero-copy using RDMA hardware on networks that provide that capability (e.g., InfiniBand and Cray networks).

The RPC arguments (a1, a2 and the lambda itself) are subject to serialization semantics, so that part of the payload is not zero-copy; however in a well-written application this is typically small (under a few cache lines) so the extra copies are usually irrelevant to performance. Also as usual, the RPC callback will not execute until the target process enters UPC++ user-level progress.


What happens if there are conflicting RMA accesses to the same memory location?

If two or more non-atomic RMA operations concurrently access the same shared memory location with no intervening synchronization and at least one of them is a write/put, that constitutes a data race, just like in thread-style programming. This property holds whether the accesses are performed via RMA function calls (e.g. upcxx::rput) or via dereference operations on raw C++ pointers by a process local to the shared memory. An RMA data race is not immediately fatal, but any values returned and/or deposited in memory will be indeterminate, i.e. potentially garbage. Well-written programs should use synchronization to avoid data races.

UPC++ also provides remote atomic operations that guarantee atomicity and data coherence in the presence of conflicting RMAs to locations in shared memory. These can be used in ways analogous to atomic operations in thread-style programming to ensure deterministically correct behavior.
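For example, a sketch of a shared counter updated with remote atomics (the counter is allocated by rank 0; the exact set of requested operations is up to the application):

// construct the domain once (collective), listing the operations to be used
upcxx::atomic_domain<int64_t> ad({upcxx::atomic_op::load, upcxx::atomic_op::fetch_add});

// rank 0 allocates the counter in its shared segment and broadcasts its address
upcxx::global_ptr<int64_t> counter =
    upcxx::broadcast(upcxx::rank_me() == 0 ? upcxx::new_<int64_t>(0)
                                           : upcxx::global_ptr<int64_t>(), 0).wait();

ad.fetch_add(counter, 1, std::memory_order_relaxed).wait(); // safe concurrent increments
upcxx::barrier();
if (upcxx::rank_me() == 0)
  printf("final count = %lld\n", (long long)ad.load(counter, std::memory_order_relaxed).wait());

ad.destroy(); // collective, required before upcxx::finalize()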


Memory Kinds and GPU support

How does UPC++ support GPU accelerators?

UPC++ provides the Memory Kinds interface which enables RMA communication between GPU memories and host memories anywhere in the system, all within a convenient and uniform abstraction. UPC++ global_ptr's can reference memory in host or GPU memory, local or remote, and the upcxx::copy() function moves data between them.

For more details see the guide section on Memory Kinds.


What GPUs does UPC++ support?

The UPC++ Memory Kinds interface currently supports:

  • CUDA-compatible devices, including all recent NVIDIA-branded GPUs
  • ROCm/HIP-compatible devices, including all recent AMD-branded GPUs
  • oneAPI Level Zero compatible devices, including all recent Intel-branded GPUs

On systems with Mellanox/NVIDIA-branded InfiniBand or HPE Slingshot network hardware, this additionally includes native RMA acceleration for GPU memory implemented using technologies such as GPUDirect RDMA and ROCmRDMA. Forthcoming releases will add native RMA acceleration for GPU memory on additional platforms.


How can I share GPU device memory between UPC++ and other libraries?

UPC++ provides several options for constructing a device segment, which is managed by a upcxx::device_allocator object. The upcxx::make_gpu_allocator() factory function is usually the most convenient way to construct a device segment, and is discussed in the guide section on Memory Kinds.

By default, device segment creation allocates a segment of the requested size on the device (i.e. the UPC++ library performs the allocation). However there is also an argument that optionally accepts a pointer to previously allocated device memory (e.g. allocated using cudaMalloc or some other GPU-enabled library).

Once constructed, the device_allocator abstraction can optionally be used to carve up the segment into individual objects using device_allocator::allocate<T>().

Alternatively you may construct a UPC++ device segment around a provided region of device memory that already contains objects and data (e.g. created using a different programming model). Constructing a device_allocator in this case allows UPC++ to register the device memory with the network for RMA and provides the means to create a global_ptr referencing the device memory that can be shared with other processes.

A local pointer into the device segment can be converted into a global_ptr of appropriate memory kind via device_allocator::to_global_ptr(), and back again via device_allocator::local(). The global_ptr version of the pointer can be shared with other processes and used in the upcxx::copy communication calls, and the local version of the device pointer can be passed to GPU compute kernels and other GPU-aware libraries.
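As an illustration, here is a hedged sketch of creating and using a device segment (it assumes a UPC++ build with GPU support and uses the default device type; sizes and names are arbitrary):

// create a 4 MiB segment on the default GPU (UPC++ performs the device allocation)
auto gpu_alloc = upcxx::make_gpu_allocator(4*1024*1024);

// carve out an array of doubles in device memory
auto gpu_array = gpu_alloc.allocate<double>(1024); // global_ptr with a device memory kind

// raw device pointer for GPU kernels or other GPU-aware libraries
double *dev_ptr = gpu_alloc.local(gpu_array);

// move data between host memory and (possibly remote) device memory
std::vector<double> host(1024, 1.0);
upcxx::copy(host.data(), gpu_array, 1024).wait();

gpu_alloc.deallocate(gpu_array);
gpu_alloc.destroy(); // collective, before upcxx::finalize()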


Is UPC++ compatible with CUDA Managed Memory (aka Unified Memory)?

CUDA Managed Memory (sometimes referred to as "Unified Memory") is a feature that allows memory allocated in a special way to transparently migrate back-and-forth between physical pages resident in GPU memory and system DRAM, based on accesses. This migration uses page fault hardware to detect accesses and perform page migration on-demand, which can have a substantial performance cost.

The upcxx::device_allocator constructors either allocate a (non-managed) device segment in GPU memory or accept a pre-allocated device-resident memory segment. In both cases the entire device segment must be resident in GPU memory. When native Memory Kinds are active (enabled by default for supported networks), the pages comprising the cuda_device memory segment are pinned/locked in physical memory and registered with the network adapter as a prerequisite to enabling accelerated RDMA access (e.g. GPUDirect RDMA) on behalf of remote processes. This pinning currently precludes the use of Managed Memory within device segments, because it prevents migration of the pages in the memory kinds segment (and thus prohibits direct/implicit access from the host processor). The host processor can explicitly copy data to/from a memory kinds device segment using upcxx::copy, or (for segments it owns) using an appropriate cudaMemcpy call.

Nothing prevents use of CUDA Managed Memory for objects outside a UPC++ device segment.


Building and Running UPC++ Programs

The upcxx compiler wrapper is the best and recommended way to compile most programs using UPC++ from the command line or Makefiles. It's a script that works analogously to the familiar mpicxx wrapper, in that it invokes the underlying C++ compiler with the provided options, and automatically prepends the options necessary for including/linking the UPC++ library.

For details on upcxx, see the guide section on Compiling UPC++ Programs.

For CMake projects, a UPCXX CMake package is provided in the installation directory. To use it in a CMake project, append the UPC++ installation directory to the CMAKE_PREFIX_PATH variable (cmake ... -DCMAKE_PREFIX_PATH=/path/to/upcxx/install/prefix ...), then use find_package(UPCXX) in the CMakeLists.txt file of the project.

Some advanced use cases (notably when composing a hybrid of UPC++ with certain other parallel programming models) might require more manual control of the underlying compiler commands. For those use cases there is also a upcxx-meta script that directly supplies the relevant compiler flags, analogously to the pkg-config tool. For details on upcxx-meta, see the README section.


Is a UPC++ rank a thread or a process? How does UPC++ allow one rank to directly access the memory of another rank?

Like MPI, a UPC++ rank is a process. How they are created is platform dependent, but you can count on fork() being a popular case for non-supercomputers. Typically, UPC++ processes use process shared memory (POSIX shm_*) to communicate within a node. Several other OS-level shared memory bypass mechanisms are also supported: see GASNet documentation on "GASNet inter-Process SHared Memory (PSHM)" for details.


How can I control the size of the shared heap segment?

The shared heap segment size for each process is determined at job creation and remains fixed for the lifetime of the job. The easiest way to control the shared segment size is the upcxx-run -shared-heap=HEAPSZ argument, where HEAPSZ can be either an absolute per-process size (e.g., -shared-heap=512MB, -shared-heap=2GB, etc) or a percentage of physical memory (e.g., -shared-heap=50%, which devotes half of compute-node physical memory to shared heap, divided evenly between co-located processes).

For details on upcxx-run see the guide section on Running UPC++ Programs.


What is the best way to debug UPC++ programs?

See docs/debugging for tips on debugging UPC++ codes.


How can I use Valgrind with UPC++?

See docs/debugging for information about using the Valgrind instrumentation framework in UPC++ programs.


Can a UPC++ program dynamically add more processes?

Like MPI and other SPMD models, the maximum amount of potential parallelism has to be specified at job launch. Dynamically asking for more processes to join the job is not possible, but asking for more than you need up front and "waking" them dynamically is doable. Consider having all ranks except rank 0 spin in a loop on upcxx::progress(). When rank 0 wants to offload work to rank 1, it can send an RPC that kicks rank 1 out of its spin loop to do real work.
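A minimal sketch of that pattern (do_real_work() is a hypothetical application routine; the flag is a file-scope global so the RPC callback can name it directly at the target):

bool woken = false;   // file-scope global, flipped at the target by an incoming RPC

void park_or_recruit() {
  if (upcxx::rank_me() != 0) {
    while (!woken) upcxx::progress();  // RPC callbacks execute inside progress()
    do_real_work();
  } else {
    // ... later, when rank 0 decides it needs rank 1:
    upcxx::rpc_ff(1, []() { woken = true; });
  }
}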


How do I launch a UPC++ program on multiple nodes, and particularly in the case when the nodes are connected by InfiniBand?

The UPC++ configure script should auto-detect the availability of InfiniBand (done during a step called "configuring gasnet"). To build a program to run over InfiniBand, make sure that:

  • CXX=mpicxx is in your environment when calling the configure script.
  • Compile your program using upcxx -network=ibv

Then to launch just use the upcxx-run script.


How do I launch distributed UPC++ jobs with InfiniBand?

You have three options:

  1. upcxx-run (which internally invokes gasnetrun_ibv) to perform ssh-based spawning.
  • This option requires you to correctly set up password-less SSH authentication from at least your head node to all the compute nodes - this document describes how to do that in the context of BUPC (which also uses GASNet), and the information is analogous for UPC++
  • It additionally requires that you pass the host names into the environment, e.g. GASNET_SSH_SERVERS="host1 host2 host3..."
  • The gasnetrun_ibv -v option is often useful for troubleshooting site-specific problems that may arise here.
  • You can see more details of what upcxx-run is doing by passing the -v option one or more times (for increasing levels of verbosity).
  2. mpirun (possibly invoked from upcxx-run) - uses MPI for job spawn ONLY, then IBV for communication
  • This requires that UPC++/GASNet was configured, built and installed with MPI support (usually by setting CXX=mpicxx)
  • Also requires that (non-GASNet) MPI programs spawn correctly via mpirun (and any MPI-implementation-specific tweaking required to make that work)
  • It's also best to use TCP-based MPI if possible for this purpose, to prevent the MPI library from consuming IBV resources that won't be used by the app. There is more info on that topic in this document: https://gasnet.lbl.gov/dist/other/mpi-spawner/README
  • mpirun often has a -v option to provide spawn status for troubleshooting
  3. PMI spawning.

For more details about job spawning see the guide section on Advanced Job Launch.


How can I bind processes to cores and ensure the binding is correct to maximize NUMA hardware resources?

Appropriate binding of processes to CPU cores can be crucial to maximizing computational performance on dedicated computational resources. Core binding helps to prevent performance loss stemming from executing processes periodically migrating between cores, sacrificing cache state and sharing CPU core resources in other undesirable ways. Process binding can also help ensure processes acquire and maintain affinity to physical memory pages that are "nearest" to the cores executing the process; this can be especially important on modern many-core/multi-socket systems that often exhibit highly non-uniform memory access (NUMA) performance characteristics.

The best means of core-binding is system-specific, and documentation from your system administrator may provide the most accurate resource. Parallel job spawners such as SLURM srun and IBM jsrun have options to control core binding as part of parallel process launch, and on systems with those spawners that's often the best place to start. This might mean a trial run with upcxx-run -show to get the necessary "starting point" for envvars and spawner command, before manually adding some core-binding options. Failing those, tools like hwloc's hwloc-bind can be invoked as a wrapper around your application to achieve specific binding, although it might require some scripting trickery to bind different ranks to distinct resources.

Either way on a system with the hwloc package installed, you can programmatically verify cores were bound as expected using code like the following, which prints the current core binding to stdout:

  std::string s("echo ");
  s += std::to_string(upcxx::rank_me()) + ": `hwloc-bind --get`";
  system(s.c_str());

For more details about job spawning see the guide section on Advanced Job Launch.


How do I solve run-time errors about missing shared libraries?

Sometimes job startup can fail with shared library loading errors of the form:

/lib64/libstdc++.so.6: version `GLIBCXX_3.4.26' not found

These are especially likely to crop up in multi-node runs, where a job spawning program is starting processes across remote nodes on your behalf, but those processes might not see appropriate LD_LIBRARY_PATH/LD_RUN_PATH environment variables. There are many different possible causes for such errors, and the identity of the missing shared library is usually the best clue of how to proceed.

One common case of the error above is configuring UPC++ with a GNU C++ compiler that is not the default compiler for the Linux distro, meaning the supporting shared libraries are not present in the loader's default shared library search path (e.g. /lib64). Please see docs/local-gcc for tips on using UPC++ with a non-default GCC compiler on Linux.

Similar issues can arise from configuring UPC++ to use a non-GNU compiler that is not fully integrated with the system programming environment. Please see docs/alt-compilers for tips on using UPC++ with non-GNU compilers on Linux.

Another common failure mode in multi-node runs is a failure to find networking or GPU libraries, e.g.:

error while loading shared libraries: libfabric.so.1: 
   cannot open shared object file: No such file or directory

This particular failure can often be solved by configuring with --enable-rpath, which requests GASNet to add the link-time shared library locations to the executable RPATH. Alternatively most of the same solutions suggested for the C++ shared library in docs/local-gcc can also be applied to other missing shared libraries.


Teams and Collectives

The console output from different processes in my UPC++ program exhibits unexpected order. Does this mean barrier is broken?

Example program:

printf("Hello world from rank %d\n", upcxx::rank_me());
fflush(stdout);
upcxx::barrier();
printf("Goodbye world from rank %d\n", upcxx::rank_me());
fflush(stdout);

Possible output from a 2-node job, as seen on the console:

$ upcxx-run -n 2 -N 2 ./a.out
Hello world from rank 1
Goodbye world from rank 1
Hello world from rank 0
Goodbye world from rank 0

The issue here is that while upcxx::barrier() DOES synchronize the execution of all the processes, it does NOT necessarily flush all their stdout/stderr output all the way back to the central console on a multi-node system. The call to fflush(stdout) ensures that each process has flushed output from its private buffers to the file descriptor owned by each local kernel, however in a distributed-memory run this output stream still needs to travel over the network (e.g., via a socket connection) from the kernel of each node back to the login/batch node running the upcxx-run command before it reaches the console and is multiplexed with output from the other compute nodes to appear on your screen. upcxx::barrier() does not synchronize any part of that output-forwarding channel, so it's common to see such output "races" on distributed-memory systems. The same is true of MPI_Barrier() for the same reasons.

The only way to reliably work around this issue is to perform all console output from a single process (conventionally rank 0). This is also a good idea for scalability reasons, as generally nobody wants to scroll through 100k copies of "hello world" ;-). It's also worth noting that console output usually vastly underperforms file system I/O with a good parallel file system, so the latter should always be preferred for large-volume I/O.


How can I optimize for node-level locality?

UPC++ supports several orthogonal optimizations for systems with hierarchical locality. Most modern systems have multi-core CPUs, and it's common (although not required) to lay out distributed UPC++ jobs with multiple processes sharing a physical node. In such a layout, processes sharing a physical node (and hence a coherent physical memory domain) are "closer" to each other (both in terms of wire distance and access time) than processes running on different physical nodes connected by a network.

By default, UPC++/GASNet-EX will optimize communication between processes sharing a physical node to use a fast shared-memory transport, bypassing the network adapter. This optimization is automatic (no code changes required) and can result in orders-of-magnitude improvement in communication latency and bandwidth for peers sharing a physical node, relative to processes communicating over a network.

UPC++ provides a special team called upcxx::local_team() that contains all the processes co-located on a physical node. Processes sharing a local_team() additionally have direct load/store access to each other's UPC++ shared segment, via virtual memory cross-mappings established at job creation.

UPC++ applications can leverage upcxx::local_team() to explicitly optimize their performance in several ways:

  1. Because processes sharing a local_team() (a physical node) communicate much more efficiently than those which do not, application data distribution decisions can sometimes be made to favor communication between local_team() peers.
  2. Processes sharing a local_team() can create raw C++ pointers and references directly to objects in each other's shared segments, allowing for direct load/store access without the need for explicit UPC++ RMA calls (see the sketch below). This technique eliminates overheads associated with library calls, and potentially eliminates the overhead of an in-memory copy for the accessed data (i.e. instead moving data directly between a peer's shared memory and local CPU registers).
  3. Processes sharing a local_team() can arrange to maintain a single node-wide copy of any replicated data, potentially improving memory scalability by a factor of local_team().rank_n().

See the guide section on local_team for further details and an example of optimization using local_team().
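As a concrete sketch of item 2 above, a process can downcast a global_ptr into a co-located peer's shared segment to a raw C++ pointer (array size and neighbor choice are arbitrary here):

// allocate in the shared segment and publish the pointer
upcxx::global_ptr<double> mine = upcxx::new_array<double>(1024);
upcxx::dist_object<upcxx::global_ptr<double>> dir(mine);

// pick the next member of local_team() and translate its team rank to a world rank
int lpeer = (upcxx::local_team().rank_me() + 1) % upcxx::local_team().rank_n();
int wpeer = upcxx::local_team()[lpeer];
upcxx::global_ptr<double> gp = dir.fetch(wpeer).wait();

if (gp.is_local()) {        // true for pointers into any local_team() peer's segment
  double *p = gp.local();   // raw pointer: ordinary loads/stores, no RMA calls needed
  p[0] = 42.0;
}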


Is there a UPC++ equivalent to MPI_COMM_SELF?

UPC++ provides built-in teams for the entire job (upcxx::world()) and for the processes co-located on the same physical node (upcxx::local_team()). There is no built-in singleton team for the calling process, but it's easy to construct one using team::create() -- see example below.

#include <upcxx/upcxx.hpp>

upcxx::team *team_self_p;
inline upcxx::team &self() { return *team_self_p; }

int main() {
   upcxx::init();
   int world_rank = upcxx::rank_me();
   // setup a self-team:
   team_self_p = new upcxx::team(upcxx::world().create(std::vector<int>{world_rank})); 

   // use the new team
   UPCXX_ASSERT(self().rank_me() == 0);
   UPCXX_ASSERT(self().rank_n() == 1);
   upcxx::barrier(self());
   upcxx::dist_object<int> dobj(world_rank, self());
   UPCXX_ASSERT(dobj.fetch(0).wait() == world_rank);

   // cleanup (optional)
   self().destroy();
   delete team_self_p;

   upcxx::finalize();
}

For more info on teams, see the guide section on Teams.


Interoperability

Can I mix UPC++ with C++ threads and/or POSIX threads?

Absolutely! The main caveat is that if more than one thread will make UPC++ calls then you probably need to build using the thread-safe version of UPC++. This usually just means compiling with upcxx -threadmode=par (or with envvar UPCXX_THREADMODE=par).

The UPC++ model is fully thread-aware, and even provides LPC messaging queues between threads as a convenience. For details, see the guide section on Personas.


Can I mix UPC++ with OpenMP?

Absolutely! You'll usually need to build using the thread-safe version of UPC++, which just means compiling with upcxx -threadmode=par (or with envvar UPCXX_THREADMODE=par). The other main caveat is you may need to be careful with blocking synchronization operations (such as the implicit thread barrier at the end of an OpenMP parallel region). Specifically, blocking OpenMP operations don't advance UPC++ progress, so incautious mixing of OpenMP and UPC++ synchronization can sometimes lead to deadlock.

For details, see the guide section on OpenMP Interoperability.


My program has multiple threads, but only one of them makes UPC++ calls. Can I use the default threadmode=seq UPC++ library?

If exactly one thread per process ever makes UPC++ calls over the lifetime of the program, then yes. This is analogous to MPI_THREAD_FUNNELED, and is permitted under UPC++ threadmode=seq. However threadmode=seq does NOT allow UPC++ communication calls from multiple threads per process, even if those calls are serialized by the caller to avoid concurrency (the analog of MPI_THREAD_SERIALIZED).

For more details on what is permitted in threadmode=seq see docs/implementation-defined.


Can I use UPC++ with Kokkos?

Yes! The communication facilities provided by UPC++ nicely complement the on-node computational abstractions provided by Kokkos. Here's a simple example showing UPC++ with Kokkos. You can even integrate UPC++ memory kinds (GPU support) with GPU computation managed by Kokkos - here's an example of that, which is also the topic of a recent paper showing how this code can outperform an equivalent Kokkos+MPI program.


Troubleshooting

How do I report a problem or suspected bug in UPC++?

Use the UPC++ Issue Tracker to report issues or enhancement requests for the implementation or documentation.


Where else can I find help with UPC++?

The UPC++ Support Forum (email) is a public forum where you can ask questions or browse prior discussions.

You may also contact us via the staff email for private communication or non-technical questions.


Where can I find documentation for the GASNet-EX network layer?

The GASNet website includes links to documentation relevant to the networking layer inside UPC++. This includes:

  1. The GASNet README provides detailed information on running jobs using GASNet (every UPC++ program is also a GASNet program). This notably includes documentation on environment variables that influence the operation of UPC++/GASNet jobs.
  2. The GASNet ChangeLog documents detailed changes deployed at the network level in each release (GASNet-EX releases are often synchronized with UPC++ releases).
  3. GASNet conduit READMEs provide detailed network-specific documentation regarding UPC++ jobs using that particular network backend. These notably include network-specific environment variables that influence the operation of UPC++/GASNet jobs.


Where can I read about known issues in the GASNet-EX network layer?

The GASNet Issue Tracker contains the latest detailed information about known issues or enhancement requests in the GASNet-EX networking layer. Note that GASNet is an internal component of UPC++ that is also used by other programming models, so many of the issues there may not be relevant to UPC++.


What does the ECONGESTION error from the UDP network backend mean?

Under some circumstances with udp-conduit you might see a runtime error like this:

*** FATAL ERROR(Node 0): An active message was returned to sender,
    and trapped by the default returned message handler (handler 0):
Error Code: ECONGESTION: Congestion at destination endpoint  

ECONGESTION is the error that indicates a sending process timed out and gave up trying to send a packet to a receiver that never answered. Under default settings, this means the udp-conduit retransmit protocol gave up after around 60 seconds with no network attentiveness at the target process. The most common way this occurs is when the target process is genuinely frozen, generally because it is either (1) stuck in a prolonged compute loop with no UPC++ internal progress or communication calls from any thread, or (2) stopped by a debugger. We've also occasionally seen DDoS filters or firewalls step in and stop delivering UDP traffic after a certain point if it exceeds an activity threshold.

The udp-conduit retransmission protocol parameters can be tweaked using environment variables, as described in the udp-conduit README. So for example if you are using a debugger to temporarily freeze some processes, you probably want to set the timeout to infinite via envvar export GASNET_REQUESTTIMEOUT_MAX=0 which tells sending processes to never time out waiting for a network-level response from a potentially frozen process.
