CUDA Pro Tip: Always Set the Current Device to Avoid Multithreading Bugs

We often say that to reach high performance on GPUs you should expose as much parallelism in your code as possible, and we don’t mean just parallelism within one GPU, but also across multiple GPUs and CPUs. It’s common for high-performance software to parallelize across multiple GPUs by assigning one or more CPU threads to each GPU. In this post I’ll cover a common but subtle bug and a simple rule that will help you avoid it within your own software (spoiler alert: it’s in the title!).
Let’s review how to select which GPU to execute CUDA calls on. The CUDA runtime API is state-based: host threads execute cudaSetDevice() to set the current GPU.

cudaError_t cudaSetDevice(int device)

After this call, all CUDA API calls go to the current device until cudaSetDevice() is called again with a different device ID. The CUDA runtime API is thread-safe: it maintains per-thread state about the current device. This is very important because it allows threads to concurrently submit work to different devices, but forgetting to set the current device in each thread can lead to subtle and hard-to-find bugs like the following example.

cudaSetDevice(1);
cudaMalloc(&a, bytes);
#pragma omp parallel
{
    kernel<<<blocks, threads>>>(a);
}

While at first glance this code may seem bug-free, it is incorrect. The problem is that we set device 1 current on the OpenMP master thread, but then used OpenMP to spawn more threads, which use the default device (device 0) because they never call cudaSetDevice(). This code would actually launch multiple kernels that run on device 0 but access memory allocated on device 1. The result is either invalid memory access errors or, in the case where peer-to-peer access is enabled, kernels whose accesses to the array a are limited by low PCIe memory bandwidth.
Here is a correct implementation of the code, where every thread sets the correct device.

cudaSetDevice(1);
cudaMalloc(&a, bytes);
#pragma omp parallel
{
    cudaSetDevice(1);
    kernel<<<blocks, threads>>>(a);
}

If it’s not obvious from the title of this post, there’s a simple rule to follow to avoid bugs like this…

Always Set the Device in New Host Threads

Make it a habit to call cudaSetDevice() wherever your code could potentially spawn new host threads. The following example has a potential bug depending on whether the OpenMP library chooses to spawn new threads or reuse old ones.

cudaSetDevice(1);
cudaMalloc(&a, bytes);
#pragma omp parallel
{
    cudaSetDevice(1);
    kernel<<<blocks, threads>>>(a);
}
#pragma omp parallel
{
    kernel<<<blocks, threads>>>(a);
}

In this example, threads in the second omp parallel region don’t set the current device, so there is no guarantee that it is set correctly for each thread. This problem is not restricted to OpenMP; it can easily happen with any threading library, and in any CUDA-accelerated language.
To save yourself from a variety of multithreading bugs, remember: always call cudaSetDevice() first when you spawn a new host thread.


About Justin Luitjens


Justin Luitjens is a member of the Developer Technology team at NVIDIA, where he works on accelerating applications on GPUs. He holds a Ph.D. in Scientific Computing from the University of Utah.

http://devblogs.nvidia.com/parallelforall/cuda-pro-tip-always-set-current-device-avoid-multithreading-bugs/
