{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### CUDA-aware MPI" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "MPI helped clean up much of the boilerplate we used when managing multiple devices explicitly. But we also gave up the benefit of multiple GPUs talking to each other directly. MPI is a [distributed memory parallel programming model](https://en.wikipedia.org/wiki/Distributed_memory), where each processor has its own (virtual) memory and address space, even if all ranks are on the same server and thus share the same physical memory. (This is typically contrasted with [shared memory parallel programming models](https://en.wikipedia.org/wiki/Shared_memory), where each processing thread has access to the same memory space, like [OpenMP](https://en.wikipedia.org/wiki/OpenMP), and also like traditional single-GPU CUDA programming where all threads have access to global memory.) So we copied the result for each GPU to the CPU and then summed the results on the CPU.\n", "\n", "But as long as we're staying on a single server the rules of the [CUDA universal address space](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#unified-virtual-address-space) still hold, so all *CUDA* allocations result in virtual addresses that can be meaningfully shared across processes (even if the normal CPU dynamic memory allocations cannot be). As a result, it's possible for MPI to directly implement peer memory copies under the hood. For communication among remote servers this is not possible, but there are other technologies that allow direct GPU to GPU communication through a network interface, in particular [GPUDirect RDMA](https://docs.nvidia.com/cuda/gpudirect-rdma/index.html). Recognizing the value in leveraging these technologies for efficient communication, many MPI implementations (including OpenMPI) provide [CUDA-aware MPI](https://developer.nvidia.com/blog/introduction-cuda-aware-mpi/), which allows the programmer to provide an address to an MPI communication routine which may reside on a device. The MPI implementation is then free to use whatever scheme it desires to transmit the data from one GPU to another, including the use of GPUDirect P2P and GPUDirect RDMA where appropriate. (Note that [GPUDirect](https://developer.nvidia.com/gpudirect) refers to a family of technologies while CUDA-aware MPI refers to an API which may use those technologies under the hood, although it is common to see the two terms incorrectly conflated.)\n", "\n", "