{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### Peer access" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "CUDA uses a [universal virtual address (UVA) space](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#unified-virtual-address-space). All CUDA allocations (including both `cudaMalloc` and `cudaMallocHost`) that occur on CPUs and GPUs in this UVA space are guaranteed to have unique virtual addresses. This is, for example, what allows you to allocate pinned host memory with `cudaMallocHost` or `cudaHostAlloc` and take its address directly in device code (along with the virtual-to-physical address translation being fixed so that the GPU does not need to talk to the CPU's memory management unit). That is, in the UVA paradigm, CUDA knows which device a given address belongs to because by construction the same address is not used for different allocations on different devices.\n", "\n", "
\n", "\n", "(Note the image above depicts the GPUs as connected via PCIe, but when UVA is supported it works over NVLink and/or NVSwitch as well.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This also supports [direct access of peer GPU memory](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#peer-to-peer-memory-access), which is sometimes called \"GPUDirect Peer-to-Peer (P2P).\" From one GPU we can directly read from and write to an address on another GPU on the same server. (This is not true for every server; it depends on the system PCIe, NVLink, or NVSwitch topology. You can use the CUDA API call [cudaDeviceCanAcessPeer()](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__PEER.html) to check if this is possible on your configuration.) The only requirement is that you enable this peer access at the beginning of your program with [cudaDeviceEnablePeerAccess()](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__PEER.html).\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a simple example of how this works, let's try this out on our application, in [exercises/monte_carlo_mgpu_cuda_peer.cpp](exercises/monte_carlo_mgpu_cuda_peer.cpp). Our strategy will be for every thread to update the *same* hits counter, rather than having one counter per GPU. We'll arbitrarily place this counter on GPU 0. This allows the application to look more like the original case (at the expense of increasing the number of possible atomic collisions on the counter). As before, look for `FIXME` in the code for the parts you should write yourself. You can consult [the solution](solutions/monte_carlo_mgpu_cuda_peer.cpp) if you need help. Check to make sure that we get the same result as above -- we're not doing different work, we're just updating the results to a different memory location, so the answer should be identical." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "!nvcc -x cu -arch=sm_70 -rdc=true -o monte_carlo_mgpu_cuda_peer solutions/monte_carlo_mgpu_cuda_peer.cpp" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Estimated value of pi = 3.14072\n", "Error = 0.000277734\n", "CPU times: user 12.8 ms, sys: 16.4 ms, total: 29.2 ms\n", "Wall time: 2.05 s\n" ] } ], "source": [ "%%time\n", "!./monte_carlo_mgpu_cuda_peer" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.5" } }, "nbformat": 4, "nbformat_minor": 4 }