<div align="center"><img src="./images/DLI_Header.png"></div>

# Monte Carlo Approximation of $\pi$ - Multiple GPUs

In this notebook we will refactor the single GPU implementation of the Monte Carlo approximation of $\pi$ to run on multiple GPUs by looping over the available GPU devices and performing work on each. While this is a perfectly valid technique, it can quickly add significant complexity to your code, as this notebook begins to demonstrate.

## Objectives

By the time you complete this notebook you will:

- Be able to utilize multiple GPUs by looping over them to perform work on each.

## Extending to Multiple GPUs

A simple way to extend our example to multiple GPUs is to use a single host process that manages all of them. If we have *M* GPUs and *N* sample points to evaluate, we can distribute *N/M* points to each GPU and, in principle, compute the result up to *M* times faster.

To enact this approach, we:
- Use `cudaGetDeviceCount` to ascertain the number of available GPUs.
- Loop over the number of GPUs, using `cudaSetDevice` in each loop iteration.
- Perform the correct fraction of the work on the currently selected GPU.

```cpp
int device_count;
cudaGetDeviceCount(&device_count);

for (int i = 0; i < device_count; ++i) {
    cudaSetDevice(i);
    // Do a single GPU's worth of work.
}
```
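The third step, giving each GPU the correct fraction of the work, needs some care when *N* is not evenly divisible by *M*. The following is a minimal host-side sketch of one way to handle that, not the approach taken in the exercise code. The helper name `samples_for_device` and the policy of handing the leftover samples to the lowest-numbered devices are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of the exercise code): the number of samples
// assigned to `device` when `total_samples` points are split across
// `device_count` devices. The first (total_samples % device_count) devices
// take one extra sample, so every point is evaluated exactly once.
std::uint64_t samples_for_device(std::uint64_t total_samples,
                                 int device_count,
                                 int device) {
    assert(device_count > 0 && device >= 0 && device < device_count);
    const std::uint64_t m = static_cast<std::uint64_t>(device_count);
    const std::uint64_t base = total_samples / m;       // floor(N / M)
    const std::uint64_t remainder = total_samples % m;  // leftover samples
    return base + (static_cast<std::uint64_t>(device) < remainder ? 1 : 0);
}
```

Inside the device loop, a call such as `samples_for_device(N, device_count, i)` would give the sample count for the GPU selected by `cudaSetDevice(i)`. With 10 samples and 4 devices, for example, devices 0 and 1 receive 3 samples each and devices 2 and 3 receive 2 each.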

## Exercise: Complete the Refactor to Multiple GPUs

[exercises/monte_carlo_mgpu_cuda.cpp](exercises/monte_carlo_mgpu_cuda.cpp) is an incomplete example of this approach. Note that in this example we give each GPU a different seed for the random number generator so that each GPU does different work. As a result, the estimate of $\pi$ will differ slightly from the single GPU result.
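To see why distinct seeds matter and how the per-GPU results are combined, here is a CPU-only sketch in which an ordinary function stands in for the work done on one GPU. It is an illustration under stated assumptions, not the exercise's implementation: `count_hits`, `estimate_pi`, the use of `std::mt19937_64`, and the scheme of offsetting a base seed by the device index are all hypothetical choices.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// CPU stand-in for the work one GPU would do: draw `samples` points in the
// unit square and count how many fall inside the quarter circle.
std::uint64_t count_hits(std::uint64_t samples, std::uint64_t seed) {
    std::mt19937_64 rng(seed);
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    std::uint64_t hits = 0;
    for (std::uint64_t s = 0; s < samples; ++s) {
        const double x = dist(rng);
        const double y = dist(rng);
        if (x * x + y * y <= 1.0) ++hits;
    }
    return hits;
}

// Combine the per-device hit counts into one estimate: pi ~= 4 * hits / samples.
double estimate_pi(int device_count, std::uint64_t samples_per_device,
                   std::uint64_t base_seed) {
    assert(device_count > 0 && samples_per_device > 0);
    std::uint64_t total_hits = 0;
    for (int i = 0; i < device_count; ++i) {
        // Assumed seeding scheme: offset the base seed by the device index,
        // so that no two "devices" replay the same random sequence.
        const std::uint64_t seed = base_seed + static_cast<std::uint64_t>(i);
        total_hits += count_hits(samples_per_device, seed);
    }
    const double total_samples = static_cast<double>(device_count) *
                                 static_cast<double>(samples_per_device);
    return 4.0 * static_cast<double>(total_hits) / total_samples;
}
```

If every device used the same seed, each would evaluate the identical set of points, and the combined estimate would be no more accurate than that of a single device. Changing the seeds changes which points are sampled, which is why the final value shifts slightly.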

We've given you a few simple tasks in the code, focused on the extra work this approach requires to give each GPU the correct amount of work. Look for the `FIXME` comments, which mark where you should work.

If you get stuck, you can consult [the solution](solutions/monte_carlo_mgpu_cuda.cpp).

### Run the Code

After completing your work, compile and run the code using the following cells.

In [None]:
!nvcc -x cu -arch=sm_70 -o monte_carlo_mgpu_cuda exercises/monte_carlo_mgpu_cuda.cpp

In [None]:
%%time
!./monte_carlo_mgpu_cuda

## Next

In the next notebook you will refactor this loop-over-the-GPUs code to use _GPUDirect Peer-to-Peer_, which enables GPU-to-GPU memory copies as well as loads and stores directly over the memory fabric. It can substantially improve application performance and, as you will see, can also simplify memory management by allowing several GPUs to work with a single allocation of memory.

Please open the next notebook: [_Monte Carlo Approximation of $\pi$ - Multiple GPUs with Peer Access_](04_MCπ-P2P.ipynb).