CUDA Dynamic Parallelism

 
Dynamic parallelism was introduced in CUDA 5.0. It allows threads of a kernel running on the device to launch kernels themselves, just as the host does, and is only supported on devices of compute capability 3.5 or higher. Dynamic parallelism uses the CUDA Device Runtime library (cudadevrt), a subset of the CUDA runtime that can be called from device code.

Compilation and Linking

To support dynamic parallelism, a two-step process of separate compilation and linking is required. First, pass -c and -rdc=true (--relocatable-device-code=true) to generate relocatable device code for later linking; the -dc (--device-c) option combines the two. Then link the resulting object file against the cudadevrt library (-lcudadevrt) to produce the executable. The process is shown in Figure 4: The Separate Compilation and Linking Process for Dynamic Parallelism.

nvcc -arch=sm_35 -dc myprog.cu -o myprog.o
nvcc -arch=sm_35 myprog.o -lcudadevrt -o myprog

Or, simplified into a single step:

nvcc -arch=sm_35 -rdc=true myprog.cu -lcudadevrt -o myprog

Execution and Synchronization

In the CUDA programming model, the group of thread blocks executing a kernel is called a grid. With CUDA dynamic parallelism, a parent grid can launch child grids. A child grid inherits certain attributes and limits from its parent grid, such as the L1 cache/shared memory configuration and the stack size. If a parent grid has M blocks of N threads each and the child launch is not guarded, M*N child kernels will be launched. To launch only one child kernel per block, let a single thread of the block perform the launch, as follows:

if(threadIdx.x == 0) {
  child_k <<< (n + bs - 1) / bs, bs >>> ();
}

Grid launches are fully nested: child grids always complete before the parent grid that launched them finishes, which can be viewed as an implicit synchronization.

If the parent kernel needs the results computed by its child kernels, it can also synchronize explicitly with cudaDeviceSynchronize(void); this call waits for all child kernels launched by the thread block to finish. Since you usually do not know which threads of a block have launched child kernels, a block-wide synchronization can be written as follows:

void threadBlockDeviceSynchronize(void) {
  __syncthreads();
  if(threadIdx.x == 0)
    cudaDeviceSynchronize();
  __syncthreads();
}

cudaDeviceSynchronize(void) is expensive to call, so use it only when necessary; in particular, do not call it right before the parent kernel exits, since the implicit synchronization described above already happens there.

Memory Consistency

The parent and child grids have a fully consistent view of global memory at the moment the child grid starts and at the moment it completes.

While a child kernel is running concurrently with its parent, however, the view of global memory is not consistent:

__device__ int v = 0;
__global__ void child_k(void) {
  printf("v = %d\n", v);
}
__global__ void parent_k(void) {
  v = 1;
  child_k <<< 1, 1 >>> ();
  v = 2; // RACE CONDITION
  cudaDeviceSynchronize();
}

Between launching a child kernel and synchronizing explicitly, the parent grid must not write to memory that the child grid reads; otherwise a race condition results, as in the example above.
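One way to avoid the race, sketched from the example above (same v, child_k, and parent_k), is to delay the second write until after the explicit synchronization:

__global__ void parent_k(void) {
  v = 1;                    // consistent view: the child will see v == 1
  child_k <<< 1, 1 >>> ();
  cudaDeviceSynchronize();  // wait until the child has finished reading v
  v = 2;                    // safe: the child can no longer observe this write
}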

Passing Pointers to Child Grids

There are restrictions on which pointers can be passed:

  • Pointers that can be passed: global memory (including __device__ variables and memory allocated with malloc), zero-copy host memory, and constant memory.
  • Pointers that cannot be passed: shared memory (__shared__ variables) and local memory (including stack variables); see the sketch after this list.
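A small sketch of these rules (the array size and kernel names are made up for illustration): a pointer to a __device__ global array may be passed to a child grid, while a pointer to a __shared__ array may not.

__device__ int g_data[256];            // global memory: legal to pass to a child grid

__global__ void child(int *p) {
  p[threadIdx.x] += 1;
}

__global__ void parent(void) {
  __shared__ int s_data[256];          // shared memory: must NOT be passed to a child grid
  s_data[threadIdx.x] = threadIdx.x;
  __syncthreads();
  if (threadIdx.x == 0) {
    child <<< 1, 256 >>> (g_data);     // OK: pointer to global memory
    // child <<< 1, 256 >>> (s_data);  // ILLEGAL: pointer to shared memory
  }
}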

Device Streams and Events

All streams created on the device are non-blocking; the implicit synchronization of the default NULL stream is not supported. A device stream is created as follows:

cudaStream_t s;
cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);

Once a device stream has been created, it can be used by any thread in the same thread block, but it cannot be used by other thread blocks or by the host; likewise, host streams cannot be used on the device.
Events are also supported, with restrictions: they can only be used with cudaStreamWaitEvent() to order work between streams, not for timing or synchronization.
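As a sketch of how these pieces fit together in device code (childA, childB, and the launch configurations are made up for illustration), two child kernels can be placed in separate non-blocking streams and ordered with cudaStreamWaitEvent():

__global__ void childA(void) { /* ... */ }
__global__ void childB(void) { /* ... */ }

__global__ void parent(void) {
  if (threadIdx.x == 0) {
    cudaStream_t s1, s2;
    cudaEvent_t e;
    cudaStreamCreateWithFlags(&s1, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&s2, cudaStreamNonBlocking);
    cudaEventCreateWithFlags(&e, cudaEventDisableTiming);  // device-side events must disable timing

    childA <<< 1, 32, 0, s1 >>> ();
    cudaEventRecord(e, s1);            // mark the point after childA in s1
    cudaStreamWaitEvent(s2, e, 0);     // s2 will not start childB until the event is reached
    childB <<< 1, 32, 0, s2 >>> ();

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaEventDestroy(e);
  }
}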

Recursion Depth and Device Limits

Recursion depth involves two notions:

  • nesting depth: the maximum nesting level of recursively launched grids, with the host counted as depth 0;
  • synchronization depth: the deepest nesting level at which cudaDeviceSynchronize() may be called, with the host counted as depth 1. cudaLimitDevRuntimeSyncDepth should be set to the maximum synchronization depth plus 1, e.g. cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, 4).

The maximum nesting depth has a hardware limit: on devices of compute capability 3.5 it is 24, and the synchronization depth is bounded by the same limit.
From the outermost level inward, up to the maximum synchronization depth, each level reserves memory for saving the parent block's context, even if that memory is never used, so the recursion depth you configure should take this per-level reservation into account.
A further limit is the number of pending child grids: the pending launch buffer maintains the launch queue and tracks the state of currently running kernels. Set an appropriate limit with

cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 32768);

Otherwise, a later call to cudaGetLastError() may return cudaErrorLaunchPendingCountExceeded.
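A hedged sketch of how such a failure might be detected inside a parent kernel (the child kernel and launch configuration are placeholders):

__global__ void child(void) { /* ... */ }

__global__ void parent(void) {
  child <<< 1, 32 >>> ();     // may fail if the pending launch buffer is already full
  if (cudaGetLastError() == cudaErrorLaunchPendingCountExceeded) {
    // the launch was dropped; handle it, e.g. do the child's work in this thread instead
  }
}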
Dynamic parallelism produces an execution structure that resembles a tree, although it differs from tree processing on the CPU: shallow, bushy trees with a high branching factor suit dynamic parallelism well, whereas deep trees with few nodes per level do not.

characteristic    tree processing         dynamic parallelism
node              thin (1 thread)         thick (many threads)
branch degree     small (usually < 10)    large (usually > 100)
depth             large                   small

Reference: http://devblogs.nvidia.com/parallelforall/cuda-dynamic-parallelism-api-principles/

CUDA synchronization and timing

Synchronizing a block

__syncthreads()

Synchronizing kernels

cudaDeviceSynchronize()
waits until all preceding commands in all streams of all host threads have completed.

Synchronizing a stream

cudaStreamSynchronize()
takes a stream as a parameter and waits until all preceding commands in the given stream have completed. It can be used to synchronize the host with a specific stream, allowing other streams to continue executing on the device.

Although CUDA kernel launches are asynchronous, all GPU-related tasks placed in one stream (which is default behaviour) are executed sequentially.
If printf is used inside a kernel, cudaDeviceSynchronize() must be called afterwards because kernel launches are asynchronous; otherwise no output appears.
CUDA provides two ways to synchronize with a kernel:

  • Call cudaThreadSynchronize() for explicit synchronization: the host blocks and waits until all previously submitted kernels have finished executing.
  • Use cudaMemcpy() for a blocking data transfer, which internally performs the equivalent of cudaThreadSynchronize().
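For timing, the usual host-side pattern uses CUDA events. The sketch below is illustrative (myKernel, grid, and block are placeholders); cudaEventSynchronize() blocks the host only until the recorded event has completed:

__global__ void myKernel(float *data) { /* placeholder */ }

void timeKernel(float *d_data, dim3 grid, dim3 block) {
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaEventRecord(start, 0);                 // enqueue 'start' in the default stream
  myKernel<<<grid, block>>>(d_data);
  cudaEventRecord(stop, 0);                  // enqueue 'stop' right after the kernel

  cudaEventSynchronize(stop);                // block the host until 'stop' has completed
  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);    // elapsed GPU time in milliseconds

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
}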

 
 

Installing and using marvin

 
Installing cuDNN

cp lib* cudnn_dir/lib64/
cp cudnn.h cudnn_dir/include/
cd cudnn_dir
export LD_LIBRARY_PATH=`pwd`:$LD_LIBRARY_PATH

If the error error while loading shared libraries: libcudnn.so.4: cannot open shared object file: No such file or directory appears, it is a file permission problem; fix it as follows:

cd cudnn_dir
rm -rf libcudnn.so libcudnn.so.4
chmod u=rwx,g=rx,o=rx libcudnn.so.4.0.4
ln -s libcudnn.so.4.0.4 libcudnn.so.4
ln -s libcudnn.so.4 libcudnn.so

 
marvin requires CUDA 7.5 and cuDNN 4rc:
curl -L https://github.com/PrincetonVision/marvin/tarball/master | tar zx
mv PrincetonVision* marvin && cd marvin
./compile.sh

Bitonic sort

Bitonic sequence. A sequence (a_1, a_2, ..., a_m) is said to be bitonic if and only if:
(a) either there is an integer j, 1 ≤ j ≤ m, such that a_1 ≤ a_2 ≤ ... ≤ a_j ≥ a_{j+1} ≥ a_{j+2} ≥ ... ≥ a_m;
(b) or the sequence does not initially satisfy condition (a), but can be shifted cyclically until it does. For example, (2, 5, 8, 6, 3) satisfies (a) directly, while (6, 3, 2, 5, 8) becomes (2, 5, 8, 6, 3) after a cyclic shift.

/*
 * bitonic_sort.h
 *
 *  Created on: May 17, 2015
 *      Author: zhangjun
 */
#ifndef BITONIC_SORT_H_
#define BITONIC_SORT_H_
#include<iostream>
#include<iterator>
#include<algorithm>
using namespace std;
class bitonic_sorter
{
public:
	bitonic_sorter(int a[], int len);
	void sort(bool direction = true);
	void sort_for_arbitary_length(bool direction = true);
private:
	int *array;
	int length;
	void bitonic_sort(int low, int len, bool direction);
	void bitonic_sort_for_arbitary_length(int low, int len, bool direction);
	void bitonic_merge(int low, int len, bool direction);
	void bitonic_merge_for_arbitary_length(int low, int len, bool direction);
	void compare_and_swap(int i, int j, bool direction);
	int greatest_power_of_2_lessthan(int len);
};
#endif /* BITONIC_SORT_H_ */

 

/*
 * bitonic_sort.cpp
 *
 *  Created on: May 17, 2015
 *      Author: zhangjun
 */
#include "bitonic_sort.h"
bitonic_sorter::bitonic_sorter(int a[], int len)
{
	array = a;
	length = len;
}
void bitonic_sorter::sort(bool direction)
{
	bitonic_sort(0, length, direction);
}
void bitonic_sorter::sort_for_arbitary_length(bool direction)
{
	bitonic_sort_for_arbitary_length(0, length, direction);
}
void bitonic_sorter::bitonic_sort(int low, int len, bool direction)                   // bitonic_sort
{
	if(len > 1)
	{
		int m = len/2;
		bitonic_sort(low, m, direction);
		bitonic_sort(low+m, m, !direction);
		bitonic_merge(low, len, direction);
	}
}
void bitonic_sorter::bitonic_sort_for_arbitary_length(int low, int len, bool direction)               // bitonic_sort_for_arbitary
{
	if(len > 1)
	{
		int m = len/2;
		if(direction == true)
		{
			bitonic_sort_for_arbitary_length(low, m, !direction);                                                                   // len-m > m
			bitonic_sort_for_arbitary_length(low+m, len-m, direction);                                                       // the big end
			bitonic_merge_for_arbitary_length(low, len, direction);
		}
		else
		{
			int half = greatest_power_of_2_lessthan(len);
			bitonic_sort_for_arbitary_length(low, len-half, !direction);                                                        // half >= len - half
			bitonic_sort(low+len-half, half, direction);                                                // the big end
			bitonic_merge_for_arbitary_length(low, len, direction);
		}
	}
}
void bitonic_sorter::bitonic_merge(int low, int len, bool direction)
{
	if(len > 1)
	{
		int m = len/2;
		for( int i = low; i < low + m; i++)
			compare_and_swap(i, i+m, direction);
		bitonic_merge(low, m, direction);
		bitonic_merge(low+m, m, direction);
	}
}
void bitonic_sorter::bitonic_merge_for_arbitary_length(int low, int len, bool direction)
{
	if(len > 1)
	{
		int m = greatest_power_of_2_lessthan(len);                                             // low+m >= low+len-m
		for( int i = low; i < low + len - m; i++)
			compare_and_swap(i, i+m, direction);
		bitonic_merge(low, m, direction);                                                                   // m is a power of 2
		bitonic_merge_for_arbitary_length(low+m, len-m, direction);                                         // len-m may not be a power of 2
	}
}
void bitonic_sorter::compare_and_swap(int i, int j, bool direction)
{
	if(direction ==(array[i]>array[j]))
		std::swap(array[i], array[j]);
}
int bitonic_sorter::greatest_power_of_2_lessthan(int len)
{
	int p = 1;
	while(p<len)
		p = p<<1;
	return p>>1;
}

 

/*
 * test.cpp
 *
 *  Created on: May 6, 2015
 *      Author: zhangjun
 */
#include<stdio.h>
#include<string.h>
#include"bitonic_sort.h"
int main()
{
	int num1[8] = {3, 67, 3, 5, 8, 4, 7, 9};
	int num2[34] = {7, 5, 8, 3, 5, 78, 9, 5, 6, 23,24,1,8,10,32, 2, 3, 8, 9, 21, 15, 3, 4, 8, 9, 6, 3, 2, 1,78,43, 56, 23, 41};
	bitonic_sorter s1(num1, 8);
	bitonic_sorter s2(num2, 34);
	s1.sort(false);
	std::copy(num1, num1+8, std::ostream_iterator<int>(cout, " "));
	std::cout<<"n";
	s2.sort_for_arbitary_length(true);
	std::copy(num2, num2+34, std::ostream_iterator<int>(cout, " "));
	std::cout<<"n";
	return 0;
}

 
CUDA bitonic sort code

// NUM is the number of elements to sort (value chosen for illustration); it must equal
// the number of threads in the block, and the launch must provide NUM * sizeof(int)
// bytes of dynamic shared memory.
#define NUM 256

__device__ inline void swap(int &a, int &b)
{
    int tmp = a;
    a = b;
    b = tmp;
}

__global__ static void bitonicSort(int * values)
{
    extern __shared__ int shared[];
    const int tid = threadIdx.x;
    // Copy input to shared mem.
    shared[tid] = values[tid];
    __syncthreads();
    // Parallel bitonic sort.
    for (int k = 2; k <= NUM; k *= 2)                   // k: length of the sorted subsequences, from 2 up to NUM
    {
        // Bitonic merge:
        for (int j = k / 2; j > 0; j /= 2)              // j: compare distance, from k/2 down to 1
        {
            int ixj = tid ^ j;
            if (ixj > tid)           // bit j of tid is 0, bit j of ixj is 1
            {
                if ((tid & k) == 0)               // bit k of tid is 0: sort this subsequence ascending
                {
                    if (shared[tid] > shared[ixj])
                    {
                        swap(shared[tid], shared[ixj]);
                    }
                }
                else                             // bit k of tid is 1: sort this subsequence descending
                {
                    if (shared[tid] < shared[ixj])
                    {
                        swap(shared[tid], shared[ixj]);
                    }
                }
            }
            __syncthreads();
        }
    }
    // Write result.
    values[tid] = shared[tid];
}
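A sketch of launching this kernel from the host (assuming NUM as defined above; h_values is an illustrative host array of NUM integers): one block of NUM threads with NUM * sizeof(int) bytes of dynamic shared memory.

int h_values[NUM];                      // host data to sort (assumed to be filled in)
int *d_values;
cudaMalloc(&d_values, NUM * sizeof(int));
cudaMemcpy(d_values, h_values, NUM * sizeof(int), cudaMemcpyHostToDevice);

// One block, one thread per element, shared memory sized for the whole array.
bitonicSort<<<1, NUM, NUM * sizeof(int)>>>(d_values);

cudaMemcpy(h_values, d_values, NUM * sizeof(int), cudaMemcpyDeviceToHost);
cudaFree(d_values);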

 
 

cuda 性能优化

 
1. Minimize the tail effect
A kernel that uses fewer registers allows more threads and thread blocks to be resident on each streaming multiprocessor, which can improve performance. The __launch_bounds__ qualifier lets you tell the compiler the launch configuration to optimize register usage for:

__global__ void
__launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor)
MyKernel(...) {
 ...
}
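For example (the numbers are illustrative, not a recommendation), a kernel that is always launched with at most 256 threads per block, and for which we want at least 4 resident blocks per multiprocessor, could be declared as:

__global__ void
__launch_bounds__(256, 4)   // maxThreadsPerBlock = 256, minBlocksPerMultiprocessor = 4
myBoundedKernel(float *data)
{
  // The compiler limits register use so that 4 blocks of 256 threads can be
  // resident on one multiprocessor, which helps reduce the tail effect.
}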

 

Setting up a deep learning environment (caffe + cuDNN, without root privileges)

(The installation here is on Red Hat 6.3 without root privileges; everything is built from source.)
1. System environment and privileges: GTX 780, Red Hat 6.3, gcc 4.4.6, no root privileges
2. Required dependencies:
(1) CUDA 6.5 (recommended), 6.0, 5.5, or 5.0, together with the driver matching CUDA 6.0, or driver 319.* for CUDA 5 (not 331.*)
Note: installing the CUDA driver requires root privileges. I installed CUDA 6.5, which needs at least a 340.* driver (6.0 and below untested).
(2) BLAS: one of ATLAS, MKL, or OpenBLAS. I chose the Intel MKL that the system administrator had already installed.
(3) OpenCV (>= 2.4)
(4) Boost (>= 1.55) (in practice only 1.55 works, as discussed later)
(5) glog, gflags, protobuf, leveldb, snappy, hdf5, lmdb
(6) Python 2.7, numpy (>= 1.7)
(7) MATLAB
(Python and MATLAB should be optional; I only installed Python.)
3. Installing the dependencies
CUDA and Python are straightforward to install, and Intel MKL is already present, so they are not covered here.

#protobuf
tar zxvf protobuf-2.6.1.tar.gz
cd protobuf-2.6.1
chmod a+x autogen.sh
./autogen.sh
./configure --prefix=install_dir
make && make install
#leveldb
unzip leveldb-master.zip
cd leveldb-master
chmod a+x build_detect_platform
./build_detect_platform
make
#snappy
unzip snappy-master.zip
cd snappy-master
chmod a+x autogen.sh
./autogen.sh
./configure --prefix=install_dir
make && make install
#hdf5
tar zxvf hdf5-1.8.14.tar.gz
cd hdf5-1.8.14
./configure --prefix=install_dir
make && make install
# glog
tar zxvf glog-0.3.3.tar.gz
cd glog-0.3.3
./configure --prefix=install_dir
make && make install
# gflags
tar zxvf gflags-2.1.1.tar.gz
cd gflags-2.1.1
mkdir build && cd build
export CXXFLAGS="-fPIC" && cmake -DCMAKE_INSTALL_PREFIX=install_dir .. && make VERBOSE=1
make && make install
# lmdb
git clone git://gitorious.org/mdb/mdb.git
cd mdb/libraries/liblmdb
make && make install
#opencv
unzip opencv-2.4.10.zip
cd opencv-2.4.10
mkdir build && cd build
cmake -D CMAKE_BUILD_TYPE=RELEASE -D CMAKE_INSTALL_PREFIX=install_dir ..
make && make install

4. Installing cuDNN
First download the cuDNN Deep Neural Network library from https://developer.nvidia.com/cuDNN (CUDA Registered Developers can obtain the cuDNN library for free).

tar -zxvf cudnn-6.5-linux-R1.tgz
cd cudnn-6.5-linux-R1
# cp cudnn.h to CUDA_DIR/include; CUDA_DIR is the directory where the CUDA toolkit is installed
cp cudnn.h CUDA_DIR/include
# cp cudnn library to CUDA_DIR/lib64
cp libcudnn* CUDA_DIR/lib64
# create symlinks for the cudnn library in CUDA_DIR/lib64
cd CUDA_DIR/lib64
ln -s libcudnn.so.6.5.18 libcudnn.so.6.5
ln -s libcudnn.so.6.5 libcudnn.so

5. caffe
Download the caffe package as described at http://caffe.berkeleyvision.org/installation.html.
After extracting it, enter the caffe directory and first make a copy of Makefile.config:

cp Makefile.config.example Makefile.config

Then edit Makefile.config; several places need changes.
(1) cuDNN
To use cuDNN acceleration, uncomment USE_CUDNN := 1. Also set the CUDA installation path and the CUDA architecture (just uncomment CUDA_ARCH):

# cuda_install_dir is the CUDA installation path
CUDA_DIR := cuda_install_dir

Without a GPU, enable CPU_ONLY := 1 instead.
(2) BLAS library
I use the Intel MKL that is already installed:

BLAS := mkl

(3) Python
Set the Python directories:

# to find python.h and numpy/arrayobject.h
PYTHON_INCLUDE := python_install_dir \
        python_install_dir/dist-packages/numpy/core/include
# to find libpythonX.X.so or .dylib
PYTHON_LIB := python_install_dir/lib

At this point you might think everything is ready, but it is too early to celebrate. The dependencies above were built from source as a normal user, so even with PATH and LD_LIBRARY_PATH configured, the build still fails with a pile of errors like:

/usr/bin/ld: cannot find -lgflags
collect2: ld returned 1 exit status
make: *** [.build_release/lib/libcaffe.so] Error 1

The following additions are also needed:

#=======================================================
PROTO_INCLUDE := /home/usrname/software/protobuf/include
PROTO_LIB := /home/usrname/software/protobuf/lib
GLOG_INCLUDE := /home/usrname/software/glog/include
GLOG_LIB := /home/usrname/software/glog/lib
GFLAGS_INCLUDE := /home/usrname/software/gflags/include
GFLAGS_LIB := /home/usrname/software/gflags/lib
HDF5_INCLUDE := /home/usrname/software/hdf5/include
HDF5_LIB := /home/usrname/software/hdf5/lib
LEVELDB_INCLUDE := /home/usrname/software/leveldb-master/include
LEVELDB_LIB := /home/usrname/software/leveldb-master
LMDB_INCLUDE := /home/usrname/software/lmdb/include
LMDB_LIB := /home/usrname/software/lmdb/lib
OPENCV_INCLUDE := /home/usrname/software/opencv/include
OPENCV_LIB := /home/usrname/software/opencv/lib
SNAPPY_INCLUDE := /home/usrname/software/snappy/include
SNAPPY_LIB := /home/usrname/software/snappy/lib
BOOST_INCLUDE := /home/usrname/boost_1_57_0
BOOST_LIB := /home/usrname/boost_1_57_0/stage/lib
#======================================================================
# Whatever else you find you need goes here.
INCLUDE_DIRS := $(PYTHON_INCLUDE) $(PROTO_INCLUDE) $(GLOG_INCLUDE) $(GFLAGS_INCLUDE) $(HDF5_INCLUDE) $(LEVELDB_INCLUDE) $(LMDB_INCLUDE) $(OPENCV_INCLUDE) $(SNAPPY_INCLUDE) $(BOOST_INCLUDE) /usr/local/include
LIBRARY_DIRS := $(PYTHON_LIB) $(PROTO_LIB) $(GLOG_LIB) $(GFLAGS_LIB) $(HDF5_LIB) $(LEVELDB_LIB) $(LMDB_LIB) $(OPENCV_LIB) $(SNAPPY_LIB) $(BOOST_LIB) /usr/local/lib /usr/lib

(This cleared up something I had never paid attention to before: -L tells the linker where to search for .o/.so files when linking the executable, while LD_LIBRARY_PATH is an environment variable used at run time to locate .so files. See http://bbs.csdn.net/topics/330189724)
At this point the build should go through.

make all
# or make all -j12 to speed up the build; the number after -j is the number of parallel build jobs, ideally the number of cores on the machine

However, after make all and make test succeed, make runtest fails:

[ PASSED ] 832 tests.
[ FAILED ] 6 tests, listed below:
[ FAILED ] PowerLayerTest/0.TestPowerGradientShiftZero, where TypeParam = caffe::FloatCPU
[ FAILED ] PowerLayerTest/1.TestPowerGradientShiftZero, where TypeParam = caffe::DoubleCPU
[ FAILED ] PowerLayerTest/1.TestPowerGradient, where TypeParam = caffe::DoubleCPU
[ FAILED ] PowerLayerTest/2.TestPowerGradientShiftZero, where TypeParam = caffe::FloatGPU
[ FAILED ] PowerLayerTest/3.TestPowerGradientShiftZero, where TypeParam = caffe::DoubleGPU
[ FAILED ] PowerLayerTest/3.TestPowerGradient, where TypeParam = caffe::DoubleGPU

The fix is to replace Boost 1.57 with Boost 1.55 and rebuild (1.56 does not work either; see http://blog.csdn.net/danieljianfeng/article/details/42836167).

CUDA compilation notes

 
Device code linking

nvcc -arch=sm_20 -dlink v3.o particle.o main.o -o gpuCode.o

This links all device object code into gpuCode.o; no CPU object code is linked.

g++ gpuCode.o main.o particle.o v3.o -lcudart -o app

This links in the remaining CPU object code to produce the executable.
The -x cu option makes nvcc treat .cpp files as .cu files.
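For example (a hedged sketch; the file name is made up):

nvcc -x cu -arch=sm_20 -dc particle_utils.cpp -o particle_utils.o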

Align your accesses to GPU global memory

You should simply align your accesses to GPU global memory. An aligned address is a multiple of the size of the object you are reading or writing; e.g. if you read or write an integer, the address should be a multiple of 4, while reading or writing a char is always aligned.
Suppose you have a big space called dummySpace allocated with cudaMalloc; this pseudo kernel code would probably result in CUDA_EXCEPTION_6 Warp Misaligned Address:

__global__ void func (char* stringInput, int stringSize, int* integerInput, char* dummySpace) //input: a string, an integer, output: a big space with that string and integer in it
{
//dummySpace is created by cudaMalloc, so it is aligned to at least 256 bytes
   int counter = 0;
   for(int i=0;i<stringSize;i++)
   dummySpace[counter++] = stringInput[i]; //==>this is copying several chars, sizeof(char) is one, so they are always aligned
   for(int i=0;i<sizeof(int);i++)
   dummySpace[counter++] = ((char*)integerInput)[i];   //==> this is going to be a problem because the first loop has advanced the counter by stringSize, which is unknown and can make the address unaligned
}

The fixed one:

__global__ void func (char* stringInput, int stringSize, int* integerInput, char* dummySpace)
{
    int counter = 0;
    for(int i=0;i<stringSize;i++)
    dummySpace[counter++] = stringInput[i];
    int sub = counter % 4; // or 8 or 16, depending on the type being stored
    if (sub) counter += 4 - sub;   // pad up to the next aligned address
    for(int i=0;i<sizeof(int);i++)
   dummySpace[counter++] = ((char*)integerInput)[i];   //==> everything is ok as you are saving an integer in an aligned address
}

 

CUDA memory and memory copies

CUDA memory

CUDA memory allocations come in three kinds: pageable host memory, page-locked (pinned) host memory, and device memory.

  • Pageable host memory is virtual memory obtained from the operating system through C/C++ calls such as malloc and new; it can be paged out, so its physical address is not fixed.
  • Page-locked (pinned) host memory is allocated with cudaMallocHost or cudaHostAlloc; it always resides in physical memory at a fixed address and can be transferred via DMA for higher bandwidth, but allocation and deallocation are relatively slow. It is freed with cudaFreeHost. cudaHostRegister and cudaHostUnregister pin/unpin existing pageable host memory, but they are slow and should not be called frequently.
  • Device memory is allocated with cudaMalloc (or cudaMallocPitch()/cudaMalloc3D()) and cannot be paged.
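A small sketch contrasting the three kinds (N and the variable names are arbitrary):

const int N = 1 << 20;
float *h_pageable = (float *)malloc(N * sizeof(float));   // pageable host memory
float *h_pinned;
cudaMallocHost(&h_pinned, N * sizeof(float));             // page-locked (pinned) host memory
float *d_data;
cudaMalloc(&d_data, N * sizeof(float));                   // device memory, cannot be paged

// ... use the buffers ...

free(h_pageable);
cudaFreeHost(h_pinned);
cudaFree(d_data);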

CUDA memory copies

  • cudaMemcpy() uses the default stream and copies synchronously.
  • cudaMemcpyAsync(..., stream) transfers on the given stream asynchronously and returns immediately. To achieve concurrency, the transfer should not be in the default stream and the host memory must be pinned.
  • In the Thrust API, data movement between vectors is expressed by assignment.

Conditions for overlapping memory copies:

  • the copies are issued in different non-default streams;
  • the host memory involved is pinned;
  • the asynchronous copy API is used;
  • only one copy can be in flight per direction at a time (a sketch follows this list).
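Putting these conditions together, a sketch of overlapping transfers with computation might look like the following (the stream count, sizes, and kernel are illustrative):

const int N = 1 << 20;
const int half = N / 2;
float *h_a, *d_a;
cudaMallocHost(&h_a, N * sizeof(float));      // pinned host memory
cudaMalloc(&d_a, N * sizeof(float));

cudaStream_t s[2];
for (int i = 0; i < 2; i++)
  cudaStreamCreate(&s[i]);                    // non-default streams

for (int i = 0; i < 2; i++) {
  int off = i * half;
  // Asynchronous copy in its own stream; the two H2D copies still serialize on
  // the copy engine, but each can overlap with kernels running in the other stream.
  cudaMemcpyAsync(d_a + off, h_a + off, half * sizeof(float),
                  cudaMemcpyHostToDevice, s[i]);
  // myKernel<<<blocks, threads, 0, s[i]>>>(d_a + off, half);  // optional per-chunk work
}

for (int i = 0; i < 2; i++)
  cudaStreamSynchronize(s[i]);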

CUDA Pro Tip: Always Set the Current Device to Avoid Multithreading Bugs

We often say that to reach high performance on GPUs you should expose as much parallelism in your code as possible, and we don’t mean just parallelism within one GPU, but also across multiple GPUs and CPUs. It’s common for high-performance software to parallelize across multiple GPUs by assigning one or more CPU threads to each GPU. In this post I’ll cover a common but subtle bug and a simple rule that will help you avoid it within your own software (spoiler alert: it’s in the title!).
Let’s review how to select which GPU to execute CUDA calls on. The CUDA runtime API is state-based, and threads execute cudaSetDevice() to set the current GPU.

cudaError_t cudaSetDevice(int device)

After this call all CUDA API commands go to the current set device until cudaSetDevice() is called again with a different device ID. The CUDA runtime API is thread-safe, which means it maintains per-thread state about the current device. This is very important as it allows threads to concurrently submit work to different devices, but forgetting to set the current device in each thread can lead to subtle and hard-to-find bugs like the following example.

cudaSetDevice(1);
cudaMalloc(&a,bytes);
#pragma omp parallel
 {
       kernel<<<blocks,threads>>>(a);
 }

While at first glance this code may seem bug free, it is incorrect. The problem here is that we have set device 1 current on the OpenMP master thread but then used OpenMP to spawn more threads which will use the default device (device 0) because they never call cudaSetDevice(). This code would actually launch multiple kernels that run on device 0 but access memory allocated on device 1. This will cause either invalid memory access errors or (in the case where peer-to-peer access is enabled) it will be limited by low PCIe memory bandwidth to the array a.
Here is a correct implementation of the code, where every thread sets the correct device.

cudaSetDevice(1);
 cudaMalloc(&a,bytes);
 #pragma omp parallel
 {
         cudaSetDevice(1);
         kernel<<<blocks,threads>>>(a);
 }

If it’s not obvious from the title of this post, there’s a simple rule to follow to avoid bugs like this…

Always Set the Device in New Host Threads

Make it a habit to call cudaSetDevice() wherever your code could potentially spawn new host threads. The following example has a potential bug depending on whether the OpenMP library chooses to spawn new threads or reuse old ones.

cudaSetDevice(1);
cudaMalloc(&a,bytes);
#pragma omp parallel
 {
       cudaSetDevice(1);
        kernel<<<blocks,threads>>>(a);
}
#pragma omp parallel
 {
        kernel<<<blocks,threads>>>(a);
}

In this example, threads in the second omp parallel region don’t set the current device so there is no guarantee that it is set for each thread. This problem is not restricted to OpenMP; it can easily happen with any threading library, and in any CUDA-accelerated language.
To save yourself from a variety of multithreading bugs, remember: always call cudaSetDevice() first when you spawn a new host thread.


 

About Justin Luitjens


Justin Luitjens is a member of the Developer Technology team at NVIDIA where he works on accelerating applications on GPUs. He holds a Ph.D in Scientific Computing from the University of Utah.

http://devblogs.nvidia.com/parallelforall/cuda-pro-tip-always-set-current-device-avoid-multithreading-bugs/