Recommendations for hybrid UPC++/MPI applications
With care, it is possible to construct applications which use both UPC++ and MPI explicitly in user code (note that this is different from UPC++ programs which do not explicitly use MPI, but are compiled with UPCXX_NETWORK=mpi: such programs do not require any special treatment). Note, however, that the strict programming conventions below must be adhered to when switching between MPI and UPC++ network communication; otherwise deadlock can result on many systems.
In general, mixed MPI/UPC++ applications must be linked with an MPI C++ compiler. This may be named mpicxx or mpic++, among other possible names. However, on Cray systems CC is both the regular C++ compiler and the MPI C++ compiler. You may need to pass this same compiler as $CXX when installing UPC++ to ensure object compatibility.
Certain UPC++ network types (currently mpi and ibv) may use MPI internally. For this reason, MPI objects should be compiled with the same MPI compiler that was used when UPC++ itself was built (normally the mpicc in one's $PATH, unless some action is taken to override that default). Additionally, the MPI portion of an application should make use of MPI_Initialized() to ensure exactly one call is made to initialize MPI. See the 'Correct library initialization' section below.
Both MPI and UPC++ cause network communication, and the respective runtimes do so without any coordination. As a result, it is quite easy to cause network deadlock when mixing MPI and UPC++, unless the following protocol is strictly observed:
- When the application starts, the first MPI or UPC++ call (i.e. MPI_Init() or upcxx::init()) which may result in network traffic from any thread should be considered to put the entire job in 'MPI' or 'UPC++' mode, respectively.
- When an application is in 'MPI' mode and needs to switch to using UPC++, it should quiesce all in-flight MPI operations and then collectively execute an MPI_Barrier() as the last MPI call before causing any UPC++ communication. Once any UPC++ communication has occurred from any rank, the program should be considered to have switched to 'UPC++' mode.
- When an application is in 'UPC++' mode and an MPI call that may communicate is needed, the application must quiesce all UPC++ communication and then execute a upcxx::barrier() before any MPI calls are made. Once any MPI functions have been called from any thread, the program should be considered to be in 'MPI' mode.
If this simple discipline of alternating MPI and UPC++ phases is observed, then it should be possible to avoid deadlock.
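For illustration, here is a minimal sketch of one such alternation, assuming both libraries are already initialized (see the following section); the buffers, counts and global_ptr used here are illustrative placeholders, not part of any real API:

```cpp
// Sketch of one MPI phase followed by one UPC++ phase, following the protocol above.
// 'counts', 'n', 'src_buf' and 'remote_gptr' are illustrative placeholders.

// --- MPI phase ---
MPI_Allreduce(MPI_IN_PLACE, counts, n, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
// Quiesce MPI: complete all outstanding non-blocking MPI operations, then:
MPI_Barrier(MPI_COMM_WORLD);   // last MPI call before any UPC++ communication

// --- UPC++ phase ---
upcxx::rput(src_buf, remote_gptr, n).wait();
// Quiesce UPC++: wait for all outstanding UPC++ operations to complete, then:
upcxx::barrier();              // last UPC++ call before any further MPI communication

// --- the next MPI phase may begin here ---
```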
Correct library initialization
When writing your hybrid MPI / UPC++ program, it's important to correctly initialize both layers before using them. When doing so, it's crucial to realize that UPC++ might also be initializing MPI inside upcxx::init(), e.g., in order to control MPI-based job spawning. Whether or not this happens depends on your site configuration (see next section).
The recommended method for hybrids with no special MPI thread safety requirements is a formula such as:
```cpp
#include <upcxx/upcxx.hpp>
#include <mpi.h>
#include <iostream> // optional

int main(int argc, char **argv) {
  // init UPC++
  upcxx::init();

  // init MPI, if necessary
  int mpi_already_init;
  MPI_Initialized(&mpi_already_init);
  if (!mpi_already_init) MPI_Init(&argc, &argv);

  /* program goes here */
  int mpi_rankme, mpi_rankn;
  MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rankme);
  MPI_Comm_size(MPI_COMM_WORLD, &mpi_rankn);
  char hostname[MPI_MAX_PROCESSOR_NAME];
  int junk;
  MPI_Get_processor_name(hostname, &junk);
  std::cout << "Hello world from UPC++ rank " << upcxx::rank_me() << "/" << upcxx::rank_n()
            << ", MPI rank " << mpi_rankme << "/" << mpi_rankn
            << " : " << hostname << std::endl;

  // Finalize MPI only if we initted it
  if (!mpi_already_init) MPI_Finalize();

  // finalize UPC++
  upcxx::finalize();
  return 0;
}
```
MPI thread-safety considerations
UPC++ configurations that initialize MPI generally do so with the minimal thread-safety guarantees needed for correct UPC++ operation. For UPCXX_THREADMODE=seq (the default) this usually means single-threaded MPI mode, MPI_THREAD_SINGLE. For UPCXX_THREADMODE=par, this usually means MPI_THREAD_SERIALIZED (where the application is responsible for serializing multi-threaded calls to MPI and UPC++ operations that may invoke MPI). If your MPI program needs full MPI thread safety (e.g., MPI_THREAD_MULTIPLE), there are two possible approaches:
- If your current UPC++ configuration is initializing MPI for you, set export GASNET_MPI_THREAD=MULTIPLE to request that it does so using the stronger thread-safety mode. Note this only has any effect if upcxx::init indeed initializes MPI and does so before any other MPI_Init* calls in the process. You can confirm the setting by running with upcxx-run -vv.
- Call MPI_Init_thread() explicitly before upcxx::init() with the desired thread-safety level, as shown in the sketch below. UPC++ configurations using MPI will detect that MPI has already been initialized and act accordingly. Consult MPI documentation for the details of using this call.
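As a rough sketch of the second approach (error checking omitted; a real application should verify the thread level actually provided):

```cpp
#include <upcxx/upcxx.hpp>
#include <mpi.h>

int main(int argc, char **argv) {
  // Initialize MPI first, requesting full thread safety.
  int provided;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
  // A robust program should check that provided >= MPI_THREAD_MULTIPLE here.

  // UPC++ configurations that use MPI internally will see that MPI is already
  // initialized and act accordingly.
  upcxx::init();

  /* hybrid program goes here */

  // Finalize in reverse order of initialization.
  upcxx::finalize();
  MPI_Finalize();
  return 0;
}
```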
Job spawning
The details of correctly spawning parallel jobs are often site-specific, especially for distributed systems. Correctly spawning a hybrid MPI / UPC++ application carries additional wrinkles on some systems. The details also differ based on which GASNet conduit your UPC++ application was compiled for (via $UPCXX_NETWORK or the default value determined at installation time).
aries-conduit for Cray XC systems
The native GASNet conduits on Cray are fully compatible with the PMI-based ALPS and SLURM spawners used at most sites. Run your job using the normal aprun or srun command recommended for MPI programs at your site.
ibv-conduit for InfiniBand networks
On ibv-conduit, the use of the MPI-based spawner is recommended for best results. To enable this support, make sure to set CXX=mpicxx when installing UPC++. You may additionally need to set $MPI_CC at install time to the location of the MPI C compiler, if the correct mpicc is not first in your $PATH. This should enable MPI-based spawning by default when using upcxx-run; otherwise you can force this behavior at run-time by setting GASNET_IBV_SPAWNER=mpi.
This assumes that job spawn support for regular MPI programs is already correctly installed and configured on your system. Consult the documentation for your MPI distribution if this is not the case.
udp-conduit for laptops and Ethernet networks
When running hybrid applications on a single-node system (e.g., your laptop) or on Ethernet-based clusters, the recommended GASNet backend is udp-conduit (i.e., UPCXX_NETWORK=udp at app compile time). A few additional run-time settings are recommended for MPI integration, and then the job can be spawned using upcxx-run:
```
export GASNET_SPAWNFN='C'
export GASNET_CSPAWN_CMD='mpirun -np %N %C'
export GASNET_WORKER_RANK=OMPI_COMM_WORLD_RANK   # optional, assumes Open MPI
upcxx-run -np 2 hello-world
```
This assumes the mpirun command is in your $PATH and takes the usual arguments; otherwise adjust the above settings accordingly. It also assumes that job spawn support for regular MPI programs is already correctly installed and configured on your system. Consult the documentation for your MPI distribution if this is not the case.
The (optional) GASNET_WORKER_RANK=OMPI_COMM_WORLD_RANK setting above assumes an Open-MPI-based MPI implementation, and names an environment variable provided by the mpirun job spawner in the worker process environment that identifies the MPI process rank. This requests udp-conduit to assign the UPC++/GASNet rank identifiers identically to the MPI_COMM_WORLD rank numbering for the corresponding process. For MPICH-based MPI implementations, you may want the setting GASNET_WORKER_RANK=PMI_RANK. For other MPI implementations, consult their documentation for the corresponding variable name.
If GASNET_WORKER_RANK is unset (or names a non-existent variable) then udp-conduit may number your UPC++ ranks differently from MPI ranks, so you may want to perform an MPI rank renumbering after startup, e.g.:
```cpp
// Create a communicator whose rank numbering matches upcxx::rank_me(),
// by splitting MPI_COMM_WORLD with the UPC++ rank as the sort key.
MPI_Comm newcomm;
MPI_Comm_split(MPI_COMM_WORLD, 0, upcxx::rank_me(), &newcomm);
```
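With this split, each process's rank in newcomm equals its UPC++ rank, so (as a sketch) MPI calls that should follow UPC++ rank numbering can use newcomm in place of MPI_COMM_WORLD:

```cpp
// Sketch: ranks in newcomm now match the UPC++ rank numbering
// (assumes 'newcomm' from the MPI_Comm_split call above; requires <cassert>).
int mpi_rank;
MPI_Comm_rank(newcomm, &mpi_rank);
assert(mpi_rank == (int)upcxx::rank_me());
```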
mpi-conduit portable MPI-based backend
On mpi-conduit, your UPC++ application is an MPI program, so job spawning works exactly like it would for any other MPI program. Note that this conduit exists purely for portability and is essentially never the highest-performance choice for distributed systems.
smp-conduit for single-node systems
smp-conduit is currently incompatible with hybrid MPI applications, due to job spawning limitations. Please use one of the other conduits listed above, with appropriate spawn arguments, to make use of your single node. By default UPC++ will use process shared memory and efficient inter-process communication to implement all on-node communication (bypassing the network adapter). High-quality MPI implementations will do the same.
Troubleshooting
If your application follows the recommendation above but still has problems, here are some things to consider:
- Some HPC network APIs are not well-suited to hosting multiple runtimes. They may permit a process to "open" the network adapter at most once, or they may permit multiple opens but allow them to block one another. Additionally, in some cases one network runtime may monopolize the network resources, leading to abnormally slow performance or complete failure of the other. The solutions available for these cases are either to use distinct networks for UPC++ and MPI, or to use mpi-conduit for UPC++. These have adverse performance impacts on different portions of your code, and the best choice will depend on the specifics of your code.
  - Use TCP-based communication in MPI: We have compiled a list of the relevant configuration parameters for several common MPI implementations, which can be found here (look for the phrase "hybrid GASNet+MPI").
  - Use UDP for communication in UPC++: Set export UPCXX_NETWORK=udp when compiling UPC++ app code.
  - Use MPI for communication in UPC++: Set export UPCXX_NETWORK=mpi when compiling UPC++ app code.
- The Aries network adapter on the Cray XC platform has approximately 120 hardware contexts for communication. With MPI and UPC++ each consuming one per process, a 64-process-per-node run of a hybrid application exceeds the available resources. The solution is to set the following two environment variables at run time to instruct both libraries to request virtualized contexts:
```
GASNET_GNI_FMA_SHARING=1
MPICH_GNI_FMA_SHARING=enabled
```