{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div align=\"center\"><img src=\"./images/DLI_Header.png\"></div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Monte Carlo Approximation of $\\pi$ - Multiple GPUs with Peer Access"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this notebook we will introduce direct peer-to-peer memory access across GPUs, and refactor the multi-GPU code from the previous notebook to use it."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Objectives"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By the time you complete this notebook you will:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Understand how to check for and enable direct peer-to-peer memory for applications running on multiple GPUs."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Unified Virtual Address Space"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "CUDA uses a [unified virtual address (UVA) space](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#unified-virtual-address-space). All CUDA allocations (including both `cudaMalloc` and `cudaMallocHost`) that occur on CPUs and GPUs in this UVA space are guaranteed to have unique virtual addresses. This is, for example, what allows you to allocate pinned host memory with `cudaMallocHost` or `cudaHostAlloc` and take its address directly in device code (along with the virtual-to-physical address translation being fixed so that the GPU does not need to talk to the CPU's memory management unit).\n",
    "\n",
    "In the UVA paradigm, CUDA knows which device a given address belongs to because by construction the same address is not used for different allocations on different devices."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<center><img src=\"images/UVA.png\" width=\"1000\"></center>\n",
    "\n",
    "<p style=\"text-align:center;\"><em><b>Note:</b> the image above depicts the GPUs as connected via PCIe, but when UVA is supported it works over NVLink and/or NVSwitch as well.</em></p>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Direct Peer Memory Access"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "UVA also supports [direct access of peer GPU memory](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#peer-to-peer-memory-access), which is sometimes called _GPUDirect Peer-to-Peer (P2P)_. GPU Direct P2P, which is possible when multiple GPUs are connected to the same PCI-e tree or via NVLINK, is a distinct concept from UVA, but is orthogonal to and facilitated by it."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<center><img src=\"images/GPUDirectP2P.png\" width=\"1000\"></center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Enabling Direct Peer Memory Access"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With some exceptions (depending on the system PCIe, NVLink, or NVSwitch topology), one GPU can directly read from and write to an address on another GPU on the same server. We use the CUDA API call [cudaDeviceCanAccessPeer()](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__PEER.html) to check if this is possible on a given configuration. Assuming it is, we enable this peer access at the beginning of a program with [cudaDeviceEnablePeerAccess()](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__PEER.html)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```cpp\n",
    "int this_device = 0;\n",
    "int peer_device = 1;\n",
    "\n",
    "cudaSetDevice(this_device);\n",
    "\n",
    "int can_access_peer;\n",
    "\n",
    "cudaDeviceCanAccessPeer(&can_access_peer, this_device, peer_device);\n",
    "\n",
    "if (can_access_peer) {\n",
    "    cudaDeviceEnablePeerAccess(peer_device, 0); // Note: `0` is the required value passed to this 2nd positional argument which is being reserved for future use.\n",
    "}\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise: Enable Direct Peer Memory Access"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's try this out on our application. Our strategy will be for every thread to update the *same* hits counter, rather than having one counter per GPU. We'll arbitrarily place this counter on GPU 0.\n",
    "\n",
    "This allows the application to look more like the original single GPU case since we no longer need to allocate and copy memory for each available GPU. On the flip-side, at least for this application, we are increasing the number of possible atomic collisions on the counter.\n",
    "\n",
    "Open [exercises/monte_carlo_mgpu_cuda_peer.cpp](exercises/monte_carlo_mgpu_cuda_peer.cpp), and as before, look for `FIXME` in the code for the parts you should write yourself. You should get the same result as the previous exercise -- we're not doing different work, we're just updating the results to a different memory location, so the answer should be identical.\n",
    "\n",
    "Consult [the solution](solutions/monte_carlo_mgpu_cuda_peer.cpp) if you need help."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Run the Code"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!nvcc -x cu -arch=sm_70 -o monte_carlo_mgpu_cuda_peer exercises/monte_carlo_mgpu_cuda_peer.cpp"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "!./monte_carlo_mgpu_cuda_peer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Next"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the next notebook we will look at a different paradigm for managing multiple GPUs: the single-program multiple-data (SPMD) paradigm, as offered by MPI.\n",
    "\n",
    "Please open the next notebook: [_Monte Carlo Approximation of $\\pi$ - MPI_](05_MCπ-MPI.ipynb)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}