{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### CUDA-aware MPI" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "MPI helped clean up much of the boilerplate we used when managing multiple devices explicitly. But we also gave up the benefit of multiple GPUs talking to each other directly. MPI is a [distributed memory parallel programming model](https://en.wikipedia.org/wiki/Distributed_memory), where each process has its own (virtual) memory and address space, even if all ranks are on the same server and thus share the same physical memory. (This is typically contrasted with [shared memory parallel programming models](https://en.wikipedia.org/wiki/Shared_memory), such as [OpenMP](https://en.wikipedia.org/wiki/OpenMP), in which every thread has access to the same memory space; traditional single-GPU CUDA programming is similar in that all threads have access to global memory.) So we copied the result from each GPU to the CPU and then summed the results on the CPU.\n", "\n", "But as long as we stay on a single server, the rules of the [CUDA unified virtual address space](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#unified-virtual-address-space) still hold, so all *CUDA* allocations result in virtual addresses that can be meaningfully shared across processes (even if normal CPU dynamic memory allocations cannot be). As a result, it's possible for MPI to implement peer memory copies directly under the hood. For communication among remote servers this is not possible, but there are other technologies that allow direct GPU-to-GPU communication through a network interface, in particular [GPUDirect RDMA](https://docs.nvidia.com/cuda/gpudirect-rdma/index.html). Recognizing the value of leveraging these technologies for efficient communication, many MPI implementations (including OpenMPI) provide [CUDA-aware MPI](https://developer.nvidia.com/blog/introduction-cuda-aware-mpi/), which allows the programmer to pass an address that may reside on a device directly to an MPI communication routine. The MPI implementation is then free to use whatever scheme it desires to transmit the data from one GPU to another, including GPUDirect P2P and GPUDirect RDMA where appropriate. (Note that [GPUDirect](https://developer.nvidia.com/gpudirect) refers to a family of technologies, while CUDA-aware MPI refers to an API that may use those technologies under the hood, although it is common to see the two terms incorrectly conflated.)" ] },
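{ "cell_type": "markdown", "metadata": {}, "source": [ "To make this concrete, here is a minimal sketch of what passing device pointers to MPI looks like with a CUDA-aware implementation. This is not part of the exercise code; the buffer names, the element count, and the two-rank exchange pattern are made up for illustration.\n", "\n", "```\n", "// Illustrative sketch only: assumes a CUDA-aware MPI build, exactly two ranks,\n", "// and that MPI_Init() and cudaSetDevice() have already been called.\n", "const int N = 1024;\n", "\n", "float *d_send, *d_recv;\n", "cudaMalloc(&d_send, N * sizeof(float));\n", "cudaMalloc(&d_recv, N * sizeof(float));\n", "\n", "// ... fill d_send on the device ...\n", "\n", "// Device pointers are handed directly to MPI; no staging copy through\n", "// a host buffer is required.\n", "int rank;\n", "MPI_Comm_rank(MPI_COMM_WORLD, &rank);\n", "int peer = 1 - rank;\n", "\n", "MPI_Sendrecv(d_send, N, MPI_FLOAT, peer, 0,\n", "             d_recv, N, MPI_FLOAT, peer, 0,\n", "             MPI_COMM_WORLD, MPI_STATUS_IGNORE);\n", "```\n", "\n", "With an MPI that is not CUDA-aware, we would first have to copy `d_send` into a host buffer with `cudaMemcpy`, exchange the host buffers, and then copy the received data back to the device. CUDA-aware MPI removes that boilerplate and leaves the choice of transport (for example, GPUDirect P2P within a node) to the implementation." ] },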
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So CUDA-aware MPI provides the benefit of simplified programming while retaining the performance benefit of avoiding unnecessary copies to CPU memory. With that in mind, one way to write the final reduction is:\n", "\n", "```\n", " MPI_Reduce(d_hits, total_hits, 1, MPI_INT, MPI_SUM, root, MPI_COMM_WORLD);\n", "```\n", "\n", "where MPI automatically detects that the send buffer `d_hits` resides on the device while the receive buffer `total_hits` resides on the host and does the right thing behind the scenes to enable this copy. First, let's verify this works:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Estimated value of pi = 3.14072\n", "Error = 0.000277734\n" ] } ], "source": [ "!nvcc -ccbin=mpicxx -x cu -arch=sm_70 -rdc=true -o monte_carlo_mgpu_cuda_mpi_cuda_aware exercises/monte_carlo_mgpu_cuda_mpi_cuda_aware.cpp\n", "!mpirun -np $NUM_DEVICES ./monte_carlo_mgpu_cuda_mpi_cuda_aware" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, as a simple exercise, rewrite this application to do the reduction entirely in GPU memory, and then explicitly copy the result back to the host on rank 0 at the end. You can see the solution in [solutions/monte_carlo_mgpu_cuda_mpi_cuda_aware.cpp](solutions/monte_carlo_mgpu_cuda_mpi_cuda_aware.cpp)." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Estimated value of pi = 3.14072\n", "Error = 0.000277734\n" ] } ], "source": [ "!nvcc -ccbin=mpicxx -x cu -arch=sm_70 -rdc=true -o monte_carlo_mgpu_cuda_mpi_cuda_aware solutions/monte_carlo_mgpu_cuda_mpi_cuda_aware.cpp\n", "!mpirun -np $NUM_DEVICES ./monte_carlo_mgpu_cuda_mpi_cuda_aware" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You've now seen multiple methods for managing multiple GPUs to distribute a parallel workload, including looping over devices explicitly and using MPI to use multiple devices implicitly, and multiple methods for handling data transfer between GPUs, including direct peer memory access, peer to peer CUDA memory copies, and copies that go through the CPU. Hopefully this gives you a better feel for the pros and cons of each method. In particular, the CUDA-aware MPI method is very nice and can be highly performant, but we also have to return to the CPU to initiate the transfer. For many applications, in particular ones that hit strong scaling limits due to how efficient GPUs are at traditional calculations, the latency of returning to the CPU can be a significant performance penalty. Wouldn't it be nice if we could have the SPMD-like characteristics of MPI that result in highly understandable programmings, while retaining the potential performance benefit of initiating transfers directly from the kernel? Next, we're going to introduce NVSHMEM and demonstrate that it provides exactly that benefit.\n", "\n", "
" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.5" } }, "nbformat": 4, "nbformat_minor": 4 }