{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div align=\"center\"><img src=\"./images/DLI_Header.png\"></div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Jacobi Iteration"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this notebook and the next we will look at a Laplace equation solver using Jacobi iteration. This problem will give us excellent opportunity to explore the common motif of handling data communication at boundary points between distributed data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Objectives"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By the time you complete this notebook you will:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Be able to use NVSHMEM to handle boundary point communications in multi GPU algorithms with distributed data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Introduction to the Algorithm"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By no means do you need to fully understand this algorithm to be able to accomplish the objectives of the course. However, for the curious, and for context, we will start by spending some time discussing the algorithm."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### The Laplace Equation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A common motif in finite-element/finite-volume/finite-difference applications is the solution of elliptic partial differential equations with relaxation methods. Perhaps the simplest elliptic PDE is the Laplace equation:\n",
    "\n",
    "$$\n",
    "\\nabla^2\\, f = 0\n",
    "$$\n",
    "\n",
    "where $\\nabla^2 = \\nabla \\cdot \\nabla$ is the Laplacian operator (sum of second derivatives for all coordinate directions) and $f = f(\\mathbf{r})$ is a scalar field as a function of the spatial vector coordinate $\\mathbf{r}$. The Laplace equation can be used to solve, for example, the equilibrium distribution of temperature on a metal plate that is heated to a fixed temperature on its edges.\n",
    "\n",
    "In one dimension, where $f = f(x)$, this equation is:\n",
    "\n",
    "$$\n",
    "\\frac{\\partial^2}{\\partial x^2} f = 0\n",
    "$$\n",
    "\n",
    "Suppose we want to solve this equation over a domain $x = [0, L]$, given fixed boundary conditions $f(0) = T_{\\rm left}$ and $f(L) = T_{\\rm right}$. That is, we want to know what the temperature distribution looks like in the interior of the domain as a function of $x$. A common approach is to discretize space into a set of $N$ points, located at $0, L\\, /\\, (N - 1),\\, 2L\\,/\\,(N - 1),\\, ...,\\, (N - 2)\\,L\\, /\\, (N - 1),\\, L$. The leftmost and rightmost points will remain at the fixed temperatures $T_{\\rm left}$ and $T_{\\rm right}$ respectively, while the interior $N-2$ points are the unknowns we need to solve for. The distance between the points is $\\Delta x = L\\, /\\, (N - 1)$, and we will store the points in an array of length $N$. For each index $i$ in the (zero-indexed) array, the coordinate position is $i\\, L\\, /\\, (N - 1) = i\\, \\Delta x$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In a discretized spatial domain, the derivatives of the field at index $i$ are some function of the nearby points. For example, a simple discretization of the first derivative would be:\n",
    "\n",
    "$$\n",
    "\\frac{\\partial}{\\partial x} f_i = (f_{i+1} - f_{i-1})\\ /\\ (2\\, \\Delta x)\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<center><img src=\"images/1D_finite_differencing.png\" width=\"700\"></center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "While a simple discretization of the second derivative would be:\n",
    "\n",
    "$$\n",
    "\\frac{\\partial^2}{\\partial x^2} f_i = (f_{i+1} - 2\\, f_{i} + f_{i-1})\\ /\\ (\\Delta x^2)\n",
    "$$\n",
    "\n",
    "If we set this expression equal to zero to satisfy the Laplace equation, we get:\n",
    "\n",
    "$$\n",
    "f_{i+1} - 2\\, f_{i} + f_{i-1} = 0\n",
    "$$\n",
    "\n",
    "Solving this for $f_{i}$, we get:\n",
    "\n",
    "$$\n",
    "f_{i} = (f_{i+1} + f_{i-1})\\ / \\ 2\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Jacobi Iteration to Solve"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Although $f_{i+1}$ and $f_{i-1}$ are also varying (except at the boundary points $i == 0$ and $i == N-1$), it turns out that we can simply *iterate* on this solution for $f_{i}$ many times until the solution is sufficiently equilibrated. That is, if in every iteration we take the old solution to $f$, and then at every point in the new solution set it equal to the average of the two neighboring points from the old solution, we will eventually solve for the equilibrium distribution of $f$.\n",
    "\n",
    "Depicting this approach in (serial) pseudo-code we get:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```\n",
    "while (error > tolerance):\n",
    "    l2_norm = 0\n",
    "    for i = 1, N-2:\n",
    "        f[i] = 0.5 * (f_old[i-1] + f_old[i+1])\n",
    "        l2_norm += (f[i] - f_old[i]) * (f[i] - f_old[i])\n",
    "    error = sqrt(l2_norm / N)\n",
    "    swap(f_old, f)\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### A Single GPU CUDA Implementation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This example is implemented in standard CUDA for a single GPU in [code/jacobi.cpp](code/jacobi.cpp). Take some time to review the algorithm and its parallel implementation. As before, we're not aiming for the highest-performing solution, just something that sketches out the basic idea."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!nvcc -x cu -arch=sm_70 -o jacobi code/jacobi.cpp\n",
    "!./jacobi"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise: Distributing with NVSHMEM"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A very simple distribution strategy to multiple GPUs is to divide the domain into $M$ chunks (where $M$ is the number of GPUs). PE 0 will have points $[0, N\\, /\\, M - 1]$, PE 1 will have points $[N\\, /\\, M,\\, 2\\, N\\, /\\, M - 1]$, etc. In this approach, the communication between PEs needs to happen at the boundary points between PEs. For example, the update at point $i = N\\, /\\, M - 1$ on PE 0 is:\n",
    "\n",
    "$f[N\\, /\\, M - 1] = (f[N\\, /\\, M] + f[N\\, /\\, M-2])\\ /\\ 2$\n",
    "\n",
    "But this PE doesn't own the data point at $i = N\\, /\\, M$, it is owned by PE 1. So we will need to get that data point from the remote PE. To do so, we can use the [nvshmem_float_g()](https://docs.nvidia.com/hpc-sdk/nvshmem/api/gen/api/rma.html#nvshmem-get) API to get a scalar quantity on the remote PE.\n",
    "\n",
    "```\n",
    "float r = nvshmem_float_g(source, target_pe);\n",
    "```\n",
    "\n",
    "This then looks like the following. Note that with respect to PE 0, location `N / M` corresponds to index `0` of PE 1.\n",
    "\n",
    "```\n",
    "f_left = f[N / M - 2]\n",
    "f_right = nvshmem_float_g(&f[0], 1)\n",
    "f[N / M - 1] = (f_right + f_left) / 2\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's implement this in [exercises/jacobi_step1.cpp](exercises/jacobi_step1.cpp), dealing with the FIXME statements. If you get stuck, consult the [solution](solutions/jacobi_step1.cpp)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!nvcc -x cu -arch=sm_70 -rdc=true -I $NVSHMEM_HOME/include -L $NVSHMEM_HOME/lib -lnvshmem -lcuda -o jacobi_step1 exercises/jacobi_step1.cpp\n",
    "!nvshmrun -np $NUM_DEVICES ./jacobi_step1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Next"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the next, and final notebook prior to the assessment, we will take a brief aside to introduce kernel profiling with Nsight Compute, and will look at using the [*cub*](https://docs.nvidia.com/cuda/cub/index.html) library to improve the performance of the reductions in our NVSHMEM distributed Jacobi algorithm.\n",
    "\n",
    "Please open the next notebook: [_Improving the Reduction Performance with `cub`_](13_Jacobi-cub.ipynb)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}