{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### Adding MPI" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Managing all devices from a single program can be cumbersome. It often looks like what we've been doing so far: we loop over all available devices and take the same action on each (e.g. launch a kernel). The program can be simplified greatly with MPI (the [Message Passing Interface](https://en.wikipedia.org/wiki/Message_Passing_Interface)). With MPI, we launch the *same* program multiple times as independent processes (the [single-program, multiple-data](https://en.wikipedia.org/wiki/SPMD) paradigm). In the most common use case, we launch as many copies as there are GPUs in the server, and each copy works with exactly one GPU.\n", "\n", "
\n", "\n", "Each process has a unique numerical identifier (its **rank**) and knows how many processes are running in total. We can obtain the rank programmatically with `MPI_Comm_rank()` and the number of processes with `MPI_Comm_size()`. With this information, each rank can make its own processing decisions while all ranks still share a single copy of the source code. For example, we can (arbitrarily) pass the rank to `cudaSetDevice()` so that each rank uses the GPU whose index equals its rank (assuming there are no more ranks than GPUs)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With MPI, it's straightforward for every rank to independently do its *N / M* calculations and then sum the results, and indeed this is the most common way to write this style of program with MPI. In [exercises/monte_carlo_mgpu_cuda_mpi.cpp](exercises/monte_carlo_mgpu_cuda_mpi.cpp) we've sketched out how this application would look with MPI (again, leaving a couple of simple `FIXME` tasks for you to do).\n", "\n", "Some differences include:\n", "\n", "- MPI must be initialized and finalized as the first and last thing (respectively) in an MPI program.\n", "\n", "```\n", " // Initialize MPI\n", " MPI_Init(&argc, &argv);\n", " ...\n", " // Finalize MPI\n", " MPI_Finalize();\n", "```\n", "\n", "- We select one GPU for each rank.\n", "\n", "```\n", " // Each rank (arbitrarily) chooses the GPU corresponding to its rank\n", " int dev = rank;\n", " cudaSetDevice(dev);\n", "```\n", "\n", "- We sum up the results across all ranks and store the total on a single rank, which then does the final calculation and prints the result. 
By convention, this is rank 0 (often called the \"root\" process).\n", "\n", "```\n", " // Sum the hits from all ranks into total_hits on rank 0\n", " int* total_hits;\n", " total_hits = (int*) malloc(sizeof(int));\n", "\n", " int root = 0;\n", " MPI_Reduce(hits, total_hits, 1, MPI_INT, MPI_SUM, root, MPI_COMM_WORLD);\n", " \n", " if (rank == root) {\n", " // Calculate final value of pi and print out result\n", " ...\n", " }\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once you're ready to go, execute the next cell (note that we defined `NUM_DEVICES` earlier as the number of GPUs in this system). The `-np` option sets how many independent copies of the program (MPI ranks) are launched. As written, the cell compiles the version in `solutions/`; point it at `exercises/monte_carlo_mgpu_cuda_mpi.cpp` to build your own edits." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Estimated value of pi = 3.14072\n", "Error = 0.000277734\n" ] } ], "source": [ "!nvcc -ccbin=mpicxx -x cu -arch=sm_70 -rdc=true -o monte_carlo_mgpu_cuda_mpi solutions/monte_carlo_mgpu_cuda_mpi.cpp\n", "!mpirun -np $NUM_DEVICES ./monte_carlo_mgpu_cuda_mpi" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.5" } }, "nbformat": 4, "nbformat_minor": 4 }