{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Monte Carlo Approximation of $\\pi$ - CUDA-Aware MPI" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook we will introduce CUDA-aware MPI, which will grant us the benefits of the SPMD paradigm while retaining the ability to utilize direct peer-to-peer memory with multiple GPUs." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Objectives" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By the time you complete this notebook you will:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Be able to use CUDA-aware MPI to run multiple copies of a CUDA application on multiple GPUs while leveraging direct peer-to-peer memory access between GPUs." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## CUDA-Aware MPI" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "MPI helped clean up much of the boilerplate we used when managing multiple devices explicitly. But we also gave up the benefit of multiple GPUs talking to each other directly. MPI is a [distributed memory parallel programming model](https://en.wikipedia.org/wiki/Distributed_memory), where each processor has its own (virtual) memory and address space, even if all ranks are on the same server and thus share the same physical memory. (This is typically contrasted with [shared memory parallel programming models](https://en.wikipedia.org/wiki/Shared_memory), where each processing thread has access to the same memory space, like [OpenMP](https://en.wikipedia.org/wiki/OpenMP), and also like traditional single-GPU CUDA programming where all threads have access to global memory.) So we copied the result for each GPU to the CPU and then summed the results on the CPU.\n", "\n", "But as long as we're staying on a single server the rules of the [CUDA unified virtual address space](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#unified-virtual-address-space) still hold, so all *CUDA* allocations result in virtual addresses that can be meaningfully shared across processes (even if the normal CPU dynamic memory allocations cannot be). As a result, it's possible for MPI to directly implement peer memory copies under the hood. For communication among remote servers this is not possible, but there are other technologies that allow direct GPU to GPU communication through a network interface, in particular [GPUDirect RDMA](https://docs.nvidia.com/cuda/gpudirect-rdma/index.html). Recognizing the value in leveraging these technologies for efficient communication, many MPI implementations (including OpenMPI, which we are using in this workshop) provide [CUDA-aware MPI](https://developer.nvidia.com/blog/introduction-cuda-aware-mpi/), which allows the programmer to provide an address to an MPI communication routine which may reside on a device. The MPI implementation is then free to use whatever scheme it desires to transmit the data from one GPU to another, including the use of GPUDirect P2P and GPUDirect RDMA where appropriate. (Note that [GPUDirect](https://developer.nvidia.com/gpudirect) refers to a family of technologies while CUDA-aware MPI refers to an API which may use those technologies under the hood, although it is common to see the two terms incorrectly conflated.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So CUDA-aware MPI provides the benefit of simplified programming while retaining the performance benefit of avoiding unnecessary copies to CPU memory. With that in mind, one way to write the final reduction is..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```cpp\n", "MPI_Reduce(d_hits, total_hits, 1, MPI_INT, MPI_SUM, root, MPI_COMM_WORLD);\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "...where MPI automatically detects that the send buffer `d_hits` resides on the device while the receive buffer `total_hits` resides on the host and does the right thing behind the scenes to enable this copy." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Check for CUDA-Aware MPI" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In [exercises/monte_carlo_mgpu_cuda_mpi_cuda_aware.cpp](exercises/monte_carlo_mgpu_cuda_mpi_cuda_aware.cpp), we do the reduction as just described, reducing the device array `d_hits` into the CPU array `total_hits`. Before we proceed to an exercise, let's first confirm that this implementation, which we are stating should be able to read directly from GPU memory, works as expected." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!nvcc -ccbin=mpicxx -x cu -arch=sm_70 -o monte_carlo_mgpu_cuda_mpi_cuda_aware exercises/monte_carlo_mgpu_cuda_mpi_cuda_aware.cpp\n", "!mpirun -np $NUM_DEVICES ./monte_carlo_mgpu_cuda_mpi_cuda_aware" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise: Perform Reduction Entirely on GPUs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, as a simple exercise, rewrite this application to do the reduction entirely in GPU memory.\n", "\n", "In order to do this you will need to create a device array for storing the hits total, and use this device array in the call to `MPI_Reduce`, and then explicitly copy the result back to the host on rank 0 at the end.\n", "\n", "If you need, you can see the solution in [solutions/monte_carlo_mgpu_cuda_mpi_cuda_aware.cpp](solutions/monte_carlo_mgpu_cuda_mpi_cuda_aware.cpp)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Run the Code" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!nvcc -ccbin=mpicxx -x cu -arch=sm_70 -o monte_carlo_mgpu_cuda_mpi_cuda_aware exercises/monte_carlo_mgpu_cuda_mpi_cuda_aware.cpp\n", "!mpirun -np $NUM_DEVICES ./monte_carlo_mgpu_cuda_mpi_cuda_aware" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Next" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You've now seen multiple methods for managing multiple GPUs to distribute a parallel workload, including looping over devices explicitly and using MPI to use multiple devices implicitly. You have also seen multiple methods for handling data transfer between GPUs, including direct peer memory access, peer to peer CUDA memory copies, and copies that go through the CPU.\n", "\n", "Hopefully this gives you a better feel for the pros and cons of each method. In particular, the CUDA-aware MPI method is very nice and can be highly performant.\n", "\n", "But even with CUDA-aware MPI we have to return to the CPU to initiate the transfer. For many applications, in particular ones that hit strong scaling limits due to how efficient GPUs are at traditional calculations, the latency of returning to the CPU can be a significant performance penalty. Wouldn't it be nice if we could have the SPMD-like characteristics of MPI that result in highly understandable programmings, while retaining the potential performance benefit of initiating transfers directly from the kernel? Next, we're going to introduce NVSHMEM and demonstrate that it provides exactly that benefit.\n", "\n", "Please open the next notebook: [_Monte Carlo Approximation of $\\pi$ - NVSHMEM_](07_MCπ-NVSHMEM-Dup.ipynb)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" } }, "nbformat": 4, "nbformat_minor": 4 }