"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# NVSHMEM Histogram: Distributed Approach"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this notebook instead of duplicating the histogram across GPUs and then reducing it, you will distribute parts of the histogram to each GPU and concatenate them. In addition to increasing your NVSHMEM capabilities, the refactor will also give you a chance to observe performance trade-offs you should consider when distributing work to multiple GPUs."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Objectives"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By the time you complete this notebook you will:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Be able to write multiple fully functional histogram programs that operate over multiple GPUs using NVSHMEM.\n",
"- Be able to empirically experiment with multi GPU algorithm design to find the best approach."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## NVSHMEM Implementation for the Distributed Approach"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's look at another way to solve this problem. A feature of the previous solution is that all of the histogram calculations are done locally. We then have a barrier across processes and a final reduction.\n",
"\n",
"Another way to approach it would be to partition the histogram itself across GPUs. When an entry in the input data belongs to a histogram location that is not on the resident GPU, we will atomically increment the relevant histogram entry in a remote PE. We then have to concatenate the histogram partitions at the end. We'll call this the \"distributed\" approach."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Trade-offs Between the Duplicated and Distributed Approaches"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the distributed approach, compared to the duplicated approach, we decrease the amount of GPU memory needed for the histogram. We also decrease the local atomic pressure on the histogram, but in return, increase message passing pressure as well as atomic pressure on the remote GPU."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercise: Refactor to the Distributed Approach"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's take a look at this in [exercises/histogram_step2.cpp](exercises/histogram_step2.cpp). We'll arbitrarily divide up the histogram into as many segments as there are GPUs and distribute the partitions sequentially across them. We'll also assume the histogram partitioning inside the kernel so that we can arithmetically calculate the PE to send the data to (though it would be straightforward to generalize this to the case where this information is not known *a priori* and needs to be sent to the kernel as input data).\n",
"\n",
"In order to update the histogram on a remote PE, we want to do the equivalent of the cuda `atomicAdd()` function. The relevant NVSHMEM function is [nvshmem_int_atomic_add()](https://docs.nvidia.com/hpc-sdk/nvshmem/api/gen/api/amo.html#nvshmem-atomic-add):"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```cpp\n",
"nvshmem_int_atomic_add(destination, value, target_pe);\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Where `value` is the amount to add and `target_pe` is the remote PE to update. The `destination` must be a symmetric address (e.g. allocated with `nvshmem_malloc()`).\n",
"\n",
"For the combination step that concatenates the histogram, we'll use the handy [nvshmem_int_collect()](https://docs.nvidia.com/hpc-sdk/nvshmem/api/gen/api/collectives.html#nvshmem-fcollect) API which concatenates an array across all PEs, placing the array from PE 0 in the first section, the array from PE 1 in the second section, etc."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```cpp\n",
"nvshmem_int_collect(team, destination, source, nelems);\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Where `destination` is the concatenated array (will be the same on all PEs), and `source` is the source array with length `nelems`. Since the histogram is evenly distributed among PEs, the length of the destination array should be `n_pes * nelems`, which should match the length of the full histogram. Remember that for global collectives we use the team `NVSHMEM_TEAM_WORLD` which contains all PEs.\n",
"\n",
"Look for the FIXME steps and consult [the solution](solutions/histogram_step2.cpp) if you need help."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!nvcc -x cu -arch=sm_70 -rdc=true -I $NVSHMEM_HOME/include -L $NVSHMEM_HOME/lib -lnvshmem -lcuda -o histogram_step2 exercises/histogram_step2.cpp\n",
"!nvshmrun -np $NUM_DEVICES ./histogram_step2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Comparing the Duplicated and Distributed Approaches"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Until now we've been focusing on writing syntactically correct code and haven't looked at performance. Now let's examine performance for the distributed and duplicated approaches. Vary the `NUM_BUCKETS` parameters and `NUM_INPUTS` parameters in both cases and note both the histogram tabulation and combination time. Is one approach generally faster than the other? If so, is there a case where the performance ratio reverses?\n",
"\n",
"For convenience we provide the solutions for both implementations below."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Duplicated Approach"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[Duplicated approach source code](solutions/histogram_step1.cpp)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!nvcc -x cu -arch=sm_70 -rdc=true -I $NVSHMEM_HOME/include -L $NVSHMEM_HOME/lib -lnvshmem -lcuda -o histogram_step1 solutions/histogram_step1.cpp\n",
"!nvshmrun -np $NUM_DEVICES ./histogram_step1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Distributed Approach"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[Distributed approach source code](solutions/histogram_step2.cpp)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!nvcc -x cu -arch=sm_70 -rdc=true -I $NVSHMEM_HOME/include -L $NVSHMEM_HOME/lib -lnvshmem -lcuda -o histogram_step2 solutions/histogram_step2.cpp\n",
"!nvshmrun -np $NUM_DEVICES ./histogram_step2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Next"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"NVSHMEM enables easy distribution of a problem across an arbitrary number of GPUs. The decomposition strategy you use will often require experimentation to find the best approach."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the next 2 notebooks we will look at a Laplace equation solver that utilizes Jacobi iteration. Because of its demands to exchange boundary points between partitions of distributed data, it will serve as an excellent next step in our exploration of multi GPU coding with NVSHMEM.\n",
"\n",
"Please open the next notebook: [_Jacobi Iteration_](12_Jacobi.ipynb)."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.12"
}
},
"nbformat": 4,
"nbformat_minor": 4
}