{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Welcome to NVIDIA's NVSHMEM training! In this course you will learn how to use [NVSHMEM](https://developer.nvidia.com/nvshmem), a parallel programming model for efficient and scalable communication across multiple NVIDIA GPUs. NVSHMEM, which is based on [OpenSHMEM](http://openshmem.org/site/), provides a global address space for data that spans the memory of multiple GPUs and can be accessed with fine-grained GPU-initiated operations, CPU-initiated operations, and operations on CUDA streams. NVSHMEM offers a compelling alternative to other multi-GPU programming models for many application use cases, and in this course you will compare these multi-GPU programming models and learn about the cases where NVSHMEM makes sense to use." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll talk about the details later. For now, we can say that NVSHMEM is especially valuable on modern GPU servers that have a high density of GPUs per server node and complex interconnects such as [NVIDIA NVSwitch](https://www.nvidia.com/en-us/data-center/nvlink/) on the [NVIDIA DGX A100 server](https://www.nvidia.com/en-us/data-center/dgx-a100/).\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Motivation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Traditionally, communication patterns involving GPUs on multiple servers may look like the following: compute happens on the GPU, while communication happens on the CPU after synchronizing the GPU (to ensure that the data we send is valid). While this approach is very easy to program, it places the latency of initiating the communication and/or launching the kernel on the application's critical path. We lose the ability to overlap communication with compute. If we do overlap communication with compute by pipelining the work, we can partially hide the latency, but at the cost of increased application complexity.\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By contrast, in a model with GPU-initiated rather than CPU-initiated communication, we do *both* compute and communication directly from the GPU. We can write extremely fine-grained communication patterns this way, and we can hide communication latency by the very nature of the GPU architecture (where warps that are computing can continue while other warps are stalled waiting for data).\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Logistics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this course you will have access to multiple NVIDIA GPUs. To see which ones are available on this node, execute the following cell (by selecting it and clicking the Run button above, or by selecting it and typing Shift + Enter). Note that a command starting with \"!\" is run as if we were in a terminal." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!nvidia-smi" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's store the number of devices in a variable for easy reference later." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of devices = 4\n" ] } ], "source": [ "NUM_DEVICES = !nvidia-smi -L | wc -l\n", "NUM_DEVICES = int(NUM_DEVICES[0])\n", "print(\"Number of devices = {}\".format(NUM_DEVICES))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## An example to warm up" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's start with a simple parallel programming example and implement it several different ways; this will serve as a warmup exercise and give us our first introduction to NVSHMEM." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The example we'll use is the parallel calculation of the value of $\\pi$. A [well-known technique](https://en.wikipedia.org/wiki/Approximations_of_%CF%80#Summing_a_circle's_area) for numerically estimating $\\pi$ is to select a large number of random points within the unit square and count the fraction that fall within the quarter of the unit circle contained in that square. Since the area of the square is 1 and the area of the quarter circle is $\\pi / 4$, the fraction of points that fall in the circle (multiplied by 4) is a good approximation of $\\pi$.\n", "\n", "
\n", "\n", "© [User:nicoguaro](https://commons.wikimedia.org/wiki/User:Nicoguaro) / [Wikimedia Commons](https://commons.wikimedia.org/wiki/Main_Page) / [CC-BY-3.0](https://creativecommons.org/licenses/by/3.0/deed.en)\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A nice property of this problem from the perspective of parallel programming is that each random point can be evaluated independently. We only need to know its coordinates to evaluate whether it falls within the circle; that is, given a point with coordinates $(x, y)$, all we need to do is check whether $x^2 + y^2 \\le 1$. If so, we increment the counter that tracks the number of points in the circle. This can be done independently of every other point (although we do have to avoid race conditions in updating the counter)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With that in mind, let's see how this looks in CUDA (starting off with just a single GPU). We've provided a sample implementation; click on [code/monte_carlo_pi.cpp](code/monte_carlo_pi.cpp) to open it in a new tab and review the code. Note that this code is meant for instructional purposes only; it is not meant to be especially high performance. In particular:\n", "\n", "- We're using the [device-side API](https://docs.nvidia.com/cuda/curand/device-api-overview.html) of [cuRAND](https://developer.nvidia.com/curand) to generate random numbers directly in the kernel. It's OK if you're unfamiliar with cuRAND; just know that every CUDA thread will have its own unique random numbers.\n", "- Every thread evaluates only a single point, so the arithmetic intensity is quite low.\n", "- We'll have a lot of atomic collisions while updating the `hits` counter.\n", "\n", "Nevertheless, we can quickly estimate $\\pi$ using one million sample points, and the relative error compared to the correct value should be only about 0.05%.\n", "\n", "To run the code, execute the next cell." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Estimated value of pi = 3.14319\n", "Error = 0.000507708\n" ] } ], "source": [ "!nvcc -x cu -arch=sm_70 -rdc=true -o monte_carlo code/monte_carlo_pi.cpp\n", "!./monte_carlo" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.5" } }, "nbformat": 4, "nbformat_minor": 4 }