"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Monte Carlo Approximation of $\\pi$ - MPI"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this notebook we will introduce the single-program, multiple-data paradigm with MPI."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Objectives"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By the time you complete this notebook you will:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Be able to use MPI to run multiple copies of a CUDA application on multiple GPUs."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## MPI"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Managing all devices in a single program can be cumbersome. It often looks like what we've been doing so far, where we loop over all available devices and then take the same action (e.g. launch a kernel). The program can be simplified greatly with MPI (the [Message Passing Interface](https://en.wikipedia.org/wiki/Message_Passing_Interface)). When using MPI, we launch the *same* program multiple times independently (the [single-program, multiple-data](https://en.wikipedia.org/wiki/SPMD) paradigm). In the most common use case, we launch as many independent copies of the process as there are GPUs in your server, and each copy works with exactly one GPU."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## MPI Ranks"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each independent process has a unique numerical identifier associated with it (called its **rank**) as well as information about how many total processes are running. We can programmatically obtain the rank for each process using `MPI_Comm_rank()`, and the number of processes using `MPI_Comm_size()`. With this information, we can have each rank make independent processing decisions (while still using only one copy of the source code). For example, we can (arbitrarily) set the GPU to be used with `cudaSetDevice()` to be equal to the MPI rank (assuming there are at most the same number of ranks as GPUs)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With MPI, it's straightforward for every rank to independently do its *N / number_of_gpus* calculations and then sum them up, and indeed this is the most common way to write this style of program when using MPI."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using MPI"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here are some details for using MPI in your code."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Initializing and Finalizing MPI"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"MPI must be initialized and finalized as the first and last thing (respectively) in an MPI program."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```cpp\n",
"// Initialize MPI\n",
"MPI_Init(&argc, &argv);\n",
"...\n",
"// Finalize MPI\n",
"MPI_Finalize();\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Obtaining Rank and Number of Ranks"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We obtain a rank and the total number of ranks."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```cpp\n",
"int rank, num_ranks;\n",
"\n",
"// MPI_COMM_WORLD means that we want to include all processes.\n",
"// It is possible in MPI to create \"communicators\" that only include some of the ranks.\n",
"MPI_Comm_rank(MPI_COMM_WORLD, &rank);\n",
"MPI_Comm_size(MPI_COMM_WORLD, &num_ranks);\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Associate a GPU with a Rank"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We select one GPU for each rank."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```cpp\n",
"// Each rank (arbitrarily) chooses the GPU corresponding to its rank\n",
"int dev = rank;\n",
"cudaSetDevice(dev);\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Gather Results"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In our program we will sum up (reduce) the results across all ranks and store the result in a single rank, which will arbitrarily do the final calculation and print. By convention, this is rank 0 (often called the \"root\" processor).\n",
"\n",
"For this reduction we will use [`MPI_Reduce`](https://www.open-mpi.org/doc/v4.1/man3/MPI_Reduce.3.php) which expects where to make the reductions from (`hits`), where to make reductions to (`total_hits`), the count of items to reduce (`1`), the data type being reduced (`MPI_INT`), the operation to perform for the reduction (`MPI_SUM`), the rank where to store the results (`root`), and the communicator with the processes to involve (`MPI_COMM_WORLD`)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```cpp\n",
"// Accumulate the results across all ranks to the result on rank 0\n",
"int* total_hits;\n",
"total_hits = (int*) malloc(sizeof(int));\n",
"\n",
"int root = 0;\n",
"MPI_Reduce(hits, total_hits, 1, MPI_INT, MPI_SUM, root, MPI_COMM_WORLD);\n",
"\n",
"if (rank == root) {\n",
" // Calculate final value of pi and print out result\n",
" ...\n",
"}\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercise: Use MPI"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In [exercises/monte_carlo_mgpu_cuda_mpi.cpp](exercises/monte_carlo_mgpu_cuda_mpi.cpp) we've sketched out how the monte-carlo $\\pi$ approximation application would look with MPI. We have once again left a couple simple `FIXME` tasks for you to do."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Check out the [solution](solutions/monte_carlo_mgpu_cuda_mpi.cpp) if you get stuck."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Run the Code"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once you're ready to go, execute the next cell. `NUM_DEVICES` has been already been set by us to reflect the number of GPUs on this system. The parameter `-np` determines how many independent copies of the process (MPI ranks) are used."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!nvcc -ccbin=mpicxx -x cu -arch=sm_70 -o monte_carlo_mgpu_cuda_mpi exercises/monte_carlo_mgpu_cuda_mpi.cpp\n",
"!mpirun -np $NUM_DEVICES ./monte_carlo_mgpu_cuda_mpi"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Next"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"MPI helped clean up much of the boilerplate we used when managing multiple devices explicitly, but we also gave up the benefit of multiple GPUs talking to each other directly. In the next notebook we will look at CUDA-aware MPI which will give us the benefits of the SPMD programming model, while retaining the ability to use direct peer-to-peer memory.\n",
"\n",
"Please open the next notebook: [_Monte Carlo Approximation of $\\pi$ - CUDA-Aware MPI_](06_MCπ-CUDA-MPI.ipynb)."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}