CUDA Dynamic Parallelism

 
Dynamic parallelism was introduced in CUDA 5.0. It lets threads of a kernel executing on the device launch kernels themselves, just as the host does; it is supported only on devices of compute capability 3.5 or higher. Dynamic parallelism uses the CUDA Device Runtime library (cudadevrt), a subset of the CUDA runtime that can be called from device code.

Compilation and Linking

To use dynamic parallelism, a two-step separate compilation and linking process is required. First, pass -c and -rdc=true (--relocatable-device-code=true) to generate relocatable device code for later linking; -dc (--device-c) combines the two options. Then link the resulting object file against the cudadevrt library (-lcudadevrt) to produce the executable. The process is shown in the figure "The Separate Compilation and Linking Process for Dynamic Parallelism" in the referenced NVIDIA blog post.

nvcc -arch=sm_35 -dc myprog.cu -o myprog.o
nvcc -arch=sm_35 myprog.o -lcudadevrt -o myprog

Or, combined into a single step:

nvcc -arch=sm_35 -rdc=true myprog.cu -lcudadevrt -o myprog

Execution and Synchronization

In the CUDA programming model, the group of thread blocks running a kernel is called a grid. With CUDA dynamic parallelism, a parent grid can launch child grids. A child grid inherits certain attributes and limits from its parent grid, such as the L1 cache / shared memory configuration and the stack size. If a parent grid has M blocks of N threads each and child launches are not guarded, M*N child kernel launches will be issued. To launch only one child kernel per block, let a single thread in the block perform the launch, as follows:

if(threadIdx.x == 0) {
  child_k <<< (n + bs - 1) / bs, bs >>> ();
}

Grid launches are fully nested: child grids always complete before the parent grid that launched them is considered finished, which can be seen as an implicit synchronization.

If the parent kernel needs the results computed by a child kernel, it can also synchronize explicitly with cudaDeviceSynchronize(void), which waits for all child kernels launched so far by the calling thread block to complete. Since it is usually unknown which child kernels other threads in the block have already launched, a block-wide synchronization can be written as follows:

__device__ void threadBlockDeviceSynchronize(void) {
  __syncthreads();
  if(threadIdx.x == 0)
    cudaDeviceSynchronize();
  __syncthreads();
}

cudaDeviceSynchronize() is expensive, so call it only when necessary, and do not call it right before the parent kernel exits: the implicit synchronization described above already happens there.

Memory Consistency

The parent grid and a child grid have a fully consistent view of global memory at only two points: when the child grid is launched (all prior writes of the parent thread are visible to the child) and when the child grid completes, as observed by the parent after it has synchronized on the child.

While the child grid is running, concurrent accesses by parent and child are not guaranteed to be consistent:

__device__ int v = 0;
__global__ void child_k(void) {
  printf("v = %d\n", v);
}
__global__ void parent_k(void) {
  v = 1;
  child_k <<< 1, 1 >>> ();
  v = 2; // RACE CONDITION
  cudaDeviceSynchronize();
}

After a child kernel is launched and before explicit synchronization, the parent grid must not write to memory that the child grid reads; otherwise a race condition results, as marked in the example above.
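
As a hedged sketch (not from the original post), here is a race-free variant of parent_k: the parent writes v only before the launch and after synchronizing on the child.

__global__ void parent_k(void) {
  v = 1;                      // write before the launch: visible to the child
  child_k <<< 1, 1 >>> ();
  cudaDeviceSynchronize();    // wait for the child grid to complete
  v = 2;                      // safe: the child has already read v
}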

Passing Pointers to Child Grids

There are restrictions on which pointers may be passed (see the sketch below):

  • Pointers that may be passed: global memory (including __device__ variables and memory allocated with device-side malloc), zero-copy host memory, and constant memory.
  • Pointers that may not be passed: shared memory (__shared__ variables) and local memory (including stack variables).
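
A minimal sketch of these rules; the kernel and variable names here are illustrative, not from the original text:

__device__ int g_data[256];                        // global memory: may be passed

__global__ void use_ptr_k(int *p) { p[threadIdx.x] += 1; }

__global__ void parent_ptr_k(void) {
  __shared__ int s_data[256];
  int l_data[4];                                   // local memory (stack)
  use_ptr_k<<< 1, 256 >>>(g_data);                 // OK: global memory pointer
  // use_ptr_k<<< 1, 256 >>>(s_data);              // NOT allowed: shared memory pointer
  // use_ptr_k<<< 1, 4 >>>(l_data);                // NOT allowed: local (stack) pointer
  cudaDeviceSynchronize();
}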

Device Streams and Events

All streams created on the device are non-blocking; the implicit synchronization of the default NULL stream is not supported. Streams are created as follows:

cudaStream_t s;
cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);

Once a device stream is created, it can be used by any thread in the same thread block, but it is only valid within that block: it cannot be used by other thread blocks or by the host (and, conversely, host streams cannot be used in device code).
Events are also supported, but with restrictions: only cudaStreamWaitEvent() is available to order work between streams; events cannot be used for timing or general synchronization. An example follows.
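
A hedged sketch of device-side stream and event usage (the kernel names are illustrative; only cudaEventDisableTiming events are permitted in device code):

__global__ void step1_k(void) { /* ... */ }
__global__ void step2_k(void) { /* ... */ }

__global__ void parent_streams_k(void) {
  if (threadIdx.x == 0) {
    cudaStream_t s1, s2;
    cudaEvent_t  e;
    cudaStreamCreateWithFlags(&s1, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&s2, cudaStreamNonBlocking);
    cudaEventCreateWithFlags(&e, cudaEventDisableTiming);   // timing events are not supported on the device

    step1_k<<< 1, 32, 0, s1 >>>();
    cudaEventRecord(e, s1);               // mark completion of the work queued in s1
    cudaStreamWaitEvent(s2, e, 0);        // s2 waits for that work
    step2_k<<< 1, 32, 0, s2 >>>();

    cudaDeviceSynchronize();
    cudaEventDestroy(e);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
  }
}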

Recursion Depth and Device Limits

Recursion depth involves two concepts:

  • Nesting depth: the deepest nesting level of recursive grid launches; the host is at depth 0.
  • Synchronization depth: the deepest nesting level at which cudaDeviceSynchronize() may be called; the host is at depth 1. cudaLimitDevRuntimeSyncDepth should be set to the maximum synchronization depth plus 1, for example cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, 4).

The maximum nesting depth is limited by hardware: on CC 3.5 it is 24, and the synchronization depth is limited to the same value.
For every level from the outermost one down to the maximum synchronization depth, some memory is reserved to save the parent block's context, even if it is never used, so the per-level reservation must be taken into account when choosing the recursion depth.
There is also a limit on the number of pending child grids. The pending launch buffer maintains the launch queue and tracks the state of currently running kernels. Set an appropriate limit with:

cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 32768);

If the limit is exceeded, cudaGetLastError() will report the error cudaErrorLaunchPendingCountExceeded.
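
Putting the two limits together, a hedged host-side sketch (parent_k stands for the parent kernel from the earlier example):

// Host code, before the first launch of a kernel that uses dynamic parallelism.
cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, 4);                // deepest level that calls cudaDeviceSynchronize()
cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 32768);   // room for pending child launches
parent_k<<< 1, 1 >>>();
cudaDeviceSynchronize();
if (cudaGetLastError() != cudaSuccess) {
    // e.g. cudaErrorLaunchPendingCountExceeded if the pending-launch limit was too small
}
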
Dynamic parallelism execution resembles processing a tree, but it differs from tree processing on a CPU. Execution structures shaped like shallow, bushy trees (small depth, high branching factor) suit dynamic parallelism well; deep trees with few nodes per level do not.

characteristic     tree processing          dynamic parallelism
node               thin (1 thread)          thick (many threads)
branch degree      small (usually < 10)     large (usually > 100)
depth              large                    small

Reference: http://devblogs.nvidia.com/parallelforall/cuda-dynamic-parallelism-api-principles/

CUDA Synchronization and Timing

Synchronizing a block

__syncthreads()

Synchronizing kernels

cudaDeviceSynchronize()
waits until all preceding commands in all streams of all host threads have completed.

Synchronizing a stream

cudaStreamSynchronize()
takes a stream as a parameter and waits until all preceding commands in the given stream have completed. It can be used to synchronize the host with a specific stream, allowing other streams to continue executing on the device.

Although CUDA kernel launches are asynchronous, all GPU-related tasks placed in one stream (which is default behaviour) are executed sequentially.
If printf is used inside a kernel, cudaDeviceSynchronize() must be called afterwards because kernel launches are asynchronous; otherwise the output may never appear.
CUDA provides two ways to synchronize the host with a kernel (a sketch follows the list):

  • Call cudaThreadSynchronize() (superseded by cudaDeviceSynchronize()) for explicit synchronization: the host blocks until all previously submitted kernels have finished.
  • Use cudaMemcpy() for a blocking data transfer; internally it performs the same device-wide synchronization.
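
A minimal sketch of the printf case described above (hello_k is an illustrative name):

#include <cstdio>

__global__ void hello_k(void) {
  printf("hello from thread %d\n", threadIdx.x);
}

int main(void) {
  hello_k<<< 1, 4 >>>();
  cudaDeviceSynchronize();  // without this the process may exit before the device printf buffer is flushed
  return 0;
}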

 
 

Intel Intrinsics: A Brief Introduction to the SSE, AVX, and MMX Instruction Sets

The MMX instruction set supports operations on several integer types. MMX defines a 64-bit packed integer type, corresponding to the __m64 intrinsic type, which can process two 32-bit integers at a time.

  • Eight 64-bit MMX registers, aliased onto the low bits of the x87 floating-point registers (shared with x87, so there are no MMX floating-point instructions)
  • SIMD operations on packed bytes, words, and doublewords
  • MMX instructions target multimedia and communications software

SSE extends MMX. The SSE instruction set supports only single-precision floating-point operations; double-precision support arrived with SSE2. SSE2 defines a 128-bit packed integer type, corresponding to the __m128i intrinsic type, which can process four 32-bit integers at a time.

  • 70 instructions in total: 50 SIMD floating-point instructions, 12 MMX integer enhancement instructions, and 8 instructions for streaming transfers of contiguous memory blocks
  • 8 new XMM registers (XMM0-XMM7)
  • 8 additional registers (XMM8-XMM15) in x86_64
SSE2 instruction set:

  • 144 new instructions
  • Packed operations extended from 64 bits to 128 bits
  • Double-precision floating-point support

SSE3 instruction set:

  • 13 new instructions (horizontal operations within a register, floating-point to integer conversion)
  • Instructions that improve Hyper-Threading performance

SSSE3 instruction set:

  • Extends SSE3 with 16 new instructions
  • Absolute value, negation, and similar operations

SSE4 instruction set:

  • 47 new instructions, later extended through SSE4.2

The AVX instruction set supports only single- and double-precision floating-point operations; integer operations arrived with the AVX2 instruction set in the 2013 Haswell architecture.

  • Data width extended from 128 bits to 256 bits
  • Operands increased from two to three
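
As an illustration of the packed types mentioned above (not part of the original article; builds with any compiler that supports AVX, e.g. gcc/clang with -mavx):

#include <immintrin.h>   // AVX; also pulls in the SSE/SSE2 headers
#include <cstdio>

int main(void) {
  // SSE2: __m128i holds four packed 32-bit integers.
  __m128i a = _mm_set_epi32(4, 3, 2, 1);
  __m128i b = _mm_set_epi32(40, 30, 20, 10);
  __m128i c = _mm_add_epi32(a, b);
  alignas(16) int r[4];
  _mm_store_si128((__m128i*)r, c);
  printf("%d %d %d %d\n", r[0], r[1], r[2], r[3]);          // 11 22 33 44

  // AVX: __m256d holds four packed double-precision values.
  __m256d x = _mm256_set_pd(4.0, 3.0, 2.0, 1.0);
  __m256d y = _mm256_mul_pd(x, x);
  alignas(32) double d[4];
  _mm256_store_pd(d, y);
  printf("%.1f %.1f %.1f %.1f\n", d[0], d[1], d[2], d[3]);  // 1.0 4.0 9.0 16.0
  return 0;
}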

 

Compiler Auto Vectorization

The -x flag tells the compiler to generate a specific set of vectorization instructions.
Using the -xHost flag enables the highest level of vectorization supported on the processor on which the user compiles. Note that the Intel compiler will try to vectorize a code with SSE2 instructions at optimizations of -O2 or higher. Disable this by specifying -no-vec.
The Intel compiler can generate a single executable with multiple levels of vectorization with the -ax flag, which takes the same options as the -x flag (i.e., AVX, …, SSE2). This flag will generate run-time checks to determine the level of vectorization support on the processor and will then choose the optimal execution path for that processor. It will also generate a baseline execution path that is taken if the -ax level of vectorization specified is not supported.
The -vec-report flag generates diagnostic information about vectorization on stdout. It takes an optional parameter between 0 and 5 (e.g., -vec-report0), where 0 disables diagnostics and 5 produces the most detailed report about which loops were vectorized, which were not, and why.
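
As a hypothetical example of combining these flags with the Intel compiler driver (file and program names are placeholders):

icc -O3 -xHost -vec-report2 -o myapp myapp.c    # vectorize for the build host and report which loops were vectorized
icc -O3 -axAVX -vec-report2 -o myapp myapp.c    # AVX code path plus a baseline path selected at run time
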
Intel intrinsics guide:
https://software.intel.com/sites/landingpage/IntrinsicsGuide/

Bitonic Sort

Bitonic sequence. A sequence (a_1, a_2, ..., a_m) is said to be bitonic if and only if:
(a) Either there is an integer j, 1 ≤ j ≤ m, such that a_1 ≤ a_2 ≤ ... ≤ a_j ≥ a_{j+1} ≥ a_{j+2} ≥ ... ≥ a_m (the sequence first rises, then falls), for example (1, 4, 6, 8, 3, 2);
(b) Or the sequence does not initially satisfy the condition in (a), but can be shifted cyclically until the condition is satisfied.

/*
 * bitonic_sort.h
 *
 *  Created on: May 17, 2015
 *      Author: zhangjun
 */
#ifndef BITONIC_SORT_H_
#define BITONIC_SORT_H_
#include<iostream>
#include<iterator>
#include<algorithm>
using namespace std;
class bitonic_sorter
{
public:
	bitonic_sorter(int a[], int len);
	void sort(bool direction = true);
	void sort_for_arbitary_length(bool direction = true);
private:
	int *array;
	int length;
	void bitonic_sort(int low, int len, bool direction);
	void bitonic_sort_for_arbitary_length(int low, int len, bool direction);
	void bitonic_merge(int low, int len, bool direction);
	void bitonic_merge_for_arbitary_length(int low, int len, bool direction);
	void compare_and_swap(int i, int j, bool direction);
	int greatest_power_of_2_lessthan(int len);
};
#endif /* BITONIC_SORT_H_ */

 

/*
 * bitonic_sort.cpp
 *
 *  Created on: May 17, 2015
 *      Author: zhangjun
 */
#include "bitonic_sort.h"
bitonic_sorter::bitonic_sorter(int a[], int len)
{
	array = a;
	length = len;
}
void bitonic_sorter::sort(bool direction)
{
	bitonic_sort(0, length, direction);
}
void bitonic_sorter::sort_for_arbitary_length(bool direction)
{
	bitonic_sort_for_arbitary_length(0, length, direction);
}
void bitonic_sorter::bitonic_sort(int low, int len, bool direction)                   // bitonic_sort
{
	if(len > 1)
	{
		int m = len/2;
		bitonic_sort(low, m, direction);
		bitonic_sort(low+m, m, !direction);
		bitonic_merge(low, len, direction);
	}
}
void bitonic_sorter::bitonic_sort_for_arbitary_length(int low, int len, bool direction)               // bitonic_sort_for_arbitary
{
	if(len > 1)
	{
		int m = len/2;
		if(direction == true)
		{
			bitonic_sort_for_arbitary_length(low, m, !direction);                                                                   // smaller first half, opposite direction (len-m >= m)
			bitonic_sort_for_arbitary_length(low+m, len-m, direction);                                                       // the big end, final direction
			bitonic_merge_for_arbitary_length(low, len, direction);
		}
		else
		{
			int half = greatest_power_of_2_lessthan(len);
			bitonic_sort_for_arbitary_length(low, len-half, !direction);                                                        // smaller first part, opposite direction (half >= len-half)
			bitonic_sort(low+len-half, half, direction);                                                // the big end (power of two), final direction
			bitonic_merge_for_arbitary_length(low, len, direction);
		}
	}
}
void bitonic_sorter::bitonic_merge(int low, int len, bool direction)
{
	if(len > 1)
	{
		int m = len/2;
		for( int i = low; i < low + m; i++)
			compare_and_swap(i, i+m, direction);
		bitonic_merge(low, m, direction);
		bitonic_merge(low+m, m, direction);
	}
}
void bitonic_sorter::bitonic_merge_for_arbitary_length(int low, int len, bool direction)
{
	if(len > 1)
	{
		int m = greatest_power_of_2_lessthan(len);                                             // low+m >= low+len-m
		for( int i = low; i < low + len - m; i++)
			compare_and_swap(i, i+m, direction);
		bitonic_merge_for_arbitary_length(low, m, direction);                                    // m is a power of two
		bitonic_merge_for_arbitary_length(low+m, len-m, direction);                              // len-m may not be a power of two
	}
}
void bitonic_sorter::compare_and_swap(int i, int j, bool direction)
{
	if(direction ==(array[i]>array[j]))
		std::swap(array[i], array[j]);
}
int bitonic_sorter::greatest_power_of_2_lessthan(int len)
{
	int p = 1;
	while(p<len)
		p = p<<1;
	return p>>1;
}

 

/*
 * test.cpp
 *
 *  Created on: May 6, 2015
 *      Author: zhangjun
 */
#include<stdio.h>
#include<string.h>
#include"bitonic_sort.h"
int main()
{
	int num1[8] = {3, 67, 3, 5, 8, 4, 7, 9};
	int num2[34] = {7, 5, 8, 3, 5, 78, 9, 5, 6, 23,24,1,8,10,32, 2, 3, 8, 9, 21, 15, 3, 4, 8, 9, 6, 3, 2, 1,78,43, 56, 23, 41};
	bitonic_sorter s1(num1, 8);
	bitonic_sorter s2(num2, 34);
	s1.sort(false);
	std::copy(num1, num1+8, std::ostream_iterator<int>(cout, " "));
	std::cout<<"n";
	s2.sort_for_arbitary_length(true);
	std::copy(num2, num2+34, std::ostream_iterator<int>(cout, " "));
	std::cout<<"n";
	return 0;
}

 
CUDA bitonic sort code

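The kernel below relies on two things assumed to be defined elsewhere: NUM, the number of elements (a power of two equal to the number of threads in the single launched block), and a device-side swap helper. A minimal sketch of those assumed definitions:

#define NUM 512                              // element count: power of two, equal to blockDim.x

__device__ inline void swap(int &a, int &b)  // simple device-side swap helper
{
    int t = a; a = b; b = t;
}
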
__global__ static void bitonicSort(int * values)
{
    extern __shared__ int shared[];
    const int tid = threadIdx.x;
    // Copy input to shared mem.
    shared[tid] = values[tid];
    __syncthreads();
    // Parallel bitonic sort.
    for (int k = 2; k <= NUM; k *= 2)                   // stage: bitonic sequence length, from 2 up to NUM
    {
        // Bitonic merge:
        for (int j = k / 2; j>0; j /= 2)                // from k/2 to 1
        {
            int ixj = tid ^ j;
            if (ixj > tid)           // bit j of tid is 0, bit j of ixj is 1
            {
                if ((tid & k) == 0)               // bit k of tid is 0: sort this subsequence ascending
                {
                    if (shared[tid] > shared[ixj])
                    {
                        swap(shared[tid], shared[ixj]);
                    }
                }
                else                             // bit k of tid is 1: sort this subsequence descending
                {
                    if (shared[tid] < shared[ixj])
                    {
                        swap(shared[tid], shared[ixj]);
                    }
                }
            }
            __syncthreads();
        }
    }
    // Write result.
    values[tid] = shared[tid];
}
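
A hedged host-side sketch of launching this kernel with one block of NUM threads and NUM ints of dynamic shared memory (h_values is an assumed host array of NUM ints):

int *d_values;
cudaMalloc(&d_values, NUM * sizeof(int));
cudaMemcpy(d_values, h_values, NUM * sizeof(int), cudaMemcpyHostToDevice);
bitonicSort<<< 1, NUM, NUM * sizeof(int) >>>(d_values);
cudaMemcpy(h_values, d_values, NUM * sizeof(int), cudaMemcpyDeviceToHost);
cudaFree(d_values);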

 
 

How to run Intel MPI on Xeon Phi

Overview

The Intel® MPI Library supports the Intel® Xeon Phi™ coprocessor in 3 major ways:

  • The offload model where all MPI ranks are run on the main Xeon host, and the application utilizes offload directives to run on the Intel Xeon Phi coprocessor card,
  • The native model where all MPI ranks are run on the Intel Xeon Phi coprocessor card, and
  • The symmetric model where MPI ranks are run on both the Xeon host and the Xeon Phi coprocessor card.

This article will focus on the native and symmetric models only. If you’d like more information on the offload model, this article gives a great overview and even more details are available in the Intel® Compiler documentation.

Prerequisites

The most important thing to remember is that we’re treating the Xeon Phi coprocessor cards as simply another node in a heterogeneous cluster. To that effect, running an MPI job in either the native and symmetric modes is very similar to running a regular Xeon MPI job. On the flip side, that does require some prerequisites to be fulfilled for each coprocessor card to be completely accessible via MPI.
Uniquely accessible hosts
All coprocessor cards on the system need to have a unique IP address that’s accessible from the local host, other Xeon hosts on the system, and other Xeon Phi cards attached to those hosts.  Again, think of simply adding another node to an existing cluster.  A very simple test of this will be the ability to ssh from one Xeon Phi coprocessor (let’s call it node0-mic0) to its own Xeon host (node0), as well as ssh to any other Xeon host on the cluster (node1) and their respective Xeon Phi cards (node1-mic0).  Here’s a quick example:

[user@node0-mic0 user]$ ssh node1-mic0 hostname
node1-mic0

Access to necessary libraries
Make sure all MPI libraries are accessible from the Xeon Phi card. There are a couple of ways to do this:

  • Set up an NFS share between the Xeon host where the Intel MPI Library is installed, and the Xeon Phi coprocessor card.
  • Manually copy all Xeon Phi-specific MPI libraries to the card.  More details on which libraries to copy and where are available here.

Assuming both of those requirements have been met, you’re ready to start using the Xeon Phi coprocessors in your MPI jobs.

Running natively on the Xeon Phi coprocessor

The set of steps to run on the Xeon Phi coprocessor card exclusively can be boiled down to the following:
1. Set up the environment
Use the appropriate scripts to set your runtime environment. The following assumes all Intel® Software Tools are installed in the /opt/intel directory.

# Set your compiler
[user@host] $ source /opt/intel/composer_xe_<version>/bin/compilervars.sh intel64
#Set your MPI environment
[user@host] $ source /opt/intel/impi/<version>/bin64/mpivars.sh

2. Compile for the Xeon Phi coprocessor card
Use the -mmic option for the Intel Compiler to build your MPI sources for the card.

[user@host] $ mpiicc -mmic -o test_hello.MIC test.c

3. Copy the Xeon Phi executables to the card
Transfer the executable that you just created to the card for execution.

[user@host] $ scp ./test_hello.MIC node0-mic0:~/test_hello

This step is not required if your host and card are NFS-shared. Also note that we’re renaming this executable during the copy process. This helps us use the same mpirun command for both native and symmetric modes.
4. Launch the application
Simply use the mpirun command to start the executable remotely on the card. Note that if you’re planning on using a Xeon Phi coprocessor in your MPI job, you have to let us know by setting the I_MPI_MIC environment variable. This is a required step.

[user@host] $ export I_MPI_MIC=enable
[user@host] $ cat mpi_hosts
node0-mic0
[user@host] $ mpirun -f mpi_hosts -n 2 ~/test_hello
Hello world: rank 0 of 2 running on node0-mic0
Hello world: rank 1 of 2 running on node0-mic0

Running symmetrically on both the Xeon host and the Xeon Phi coprocessor

You’re now trying to utilize both the Xeon hosts on your cluster, and the Xeon Phi coprocessor cards attached to them.
1. Set up the environment
This step is the same as for the native model.
2. Compile for the Xeon Phi coprocessor card and for the Xeon host
You’re now going to have compile two different sets of binaries:

# for the Xeon Phi coprocessor
[user@host] $ mpiicc -mmic -o test_hello.MIC test.c
# for the Xeon host
[user@host] $ mpiicc -o test_hello test.c

3. Copy the Xeon Phi executables to the card
Here, we still have to transfer the Xeon Phi coprocessor-compiled executables to the card.  And again, we’re renaming the executable during the transfer:

[user@host] $ scp ./test_hello.MIC node0-mic0:~/test_hello

Now, this will not work if your $HOME directory (where the executables live) is NFS-shared between host and card.  For more tips on what to do in NFS-sharing cases, check out this article.
4. Launch the application
Finally, you run the MPI job.  Your only difference here would be edits in your hosts file as you now have to add the Xeon hosts to the list.

[user@host] $ export I_MPI_MIC=enable
[user@host] $ cat mpi_hosts
node0
node0-mic0
[user@host] $ mpirun -f mpi_hosts -perhost 1 -n 2 ~/test_hello
Hello world: rank 0 of 2 running on node0
Hello world: rank 1 of 2 running on node0-mic0

https://software.intel.com/en-us/articles/how-to-run-intel-mpi-on-xeon-phi
https://software.intel.com/en-us/articles/using-the-intel-mpi-library-on-intel-xeon-phi-coprocessor-systems
https://software.intel.com/en-us/articles/using-xeon-phi-prefixes-and-extensions-for-intel-mpi-jobs-in-nfs-shared-environment
http://www.hpc.mcgill.ca/index.php/81-doc-pages/256-using-xeon-phis-on-guillimin

CUDA Performance Optimization

 
1. Minimize the tail effect
A kernel that uses fewer registers allows more threads and thread blocks to be resident on each streaming multiprocessor, which can improve performance. Use the __launch_bounds__ qualifier to let the compiler optimize register usage:

__global__ void
__launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor)
MyKernel(...) {
 ...
}

 

CUDA Compilation Summary

 
Device code linking

nvcc -arch=sm_20 -dlink v3.o particle.o main.o -o gpuCode.o

This links all of the device object code into gpuCode.o; no CPU object code is linked.

g++ gpuCode.o main.o particle.o v3.o -lcudart -o app

This links in the remaining CPU object code to produce the executable.
The -x cu option makes nvcc treat .cpp files as .cu files.
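
For example, a hypothetical compile line (particle.cpp is a placeholder file name):

nvcc -arch=sm_20 -dc -x cu particle.cpp -o particle.o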

Align your accesses to GPU global memory

You should simply align your accesses to GPU global memory. An aligned address is a multiple of the size of the object you are reading or writing; e.g., if you want to read or write an integer, the address should be a multiple of 4, while reading or writing a char is always aligned.
Suppose you have a big buffer allocated with cudaMalloc called dummySpace; the pseudo-kernel below would probably result in
CUDA_EXCEPTION_6 Warp Misaligned Address:

__global__ void func (char* stringInput, int stringSize, int* integerInput, char* dummySpace) //input: a string, an integer, output: a big space with that string and integer in it
{
//dummySpace is created by cudaMalloc, so it is aligned to at least 256 bytes
   int counter = 0;
   for(int i=0;i<stringSize;i++)
   dummySpace[counter++] = stringInput[i]; //==>this is copying several chars, sizeof(char) is one, so they are always aligned
   for(int i=0;i<sizeof(int);i++)
   dummySpace[counter++] = ((char*)integerInput)[i];   //==> this is going to be a problem because the first for has advanced the counter by stringSize, which is unknown and can make the address unaligned
}

The fixed one:

__global__ void func (char* stringInput, int stringSize, int* integerInput, char* dummySpace)
{
    int counter = 0;
    for(int i=0;i<stringSize;i++)
    dummySpace[counter++] = stringInput[i];
    int sub = counter % 4; // or 8 or 16..
    counter += (4 - sub) % 4; // round counter up to the next multiple of 4 (no padding when already aligned)
    for(int i=0;i<sizeof(int);i++)
   dummySpace[counter++] = ((char*)integerInput)[i];   //==> everything is ok as you are saving an integer in an aligned address
}

 

Running MPI across Multiple Machines

Steps:
1. Set PATH on both machines so that mpicc --version can be run on each of them.
2. Configure password-less SSH login between the two machines.
3. Run your program.
To configure password-less login between machine A (192.168.1.1) and machine B (192.168.1.2):
Log in to machine A and run:

ssh-keygen -t rsa
ssh username@192.168.1.2 mkdir -p .ssh
cat .ssh/id_rsa.pub | ssh username@192.168.1.2 'cat >> .ssh/authorized_keys'

Log in to machine B and repeat the same steps; the two machines can then log in to each other without a password. A run example is shown below.
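
A hedged example of step 3 with an MPICH-style (Hydra) launcher; the host IPs and program name are placeholders:

$ cat hosts
192.168.1.1
192.168.1.2
$ mpiexec -f hosts -n 8 ./your_mpi_program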
 
Reference: http://blog.csdn.net/bendanban/article/details/40710217

CUDA Memory and Memory Copies

CUDA memory

CUDA allocations fall into three categories: pageable host memory, page-locked (pinned) host memory, and device memory.

  • Pageable host memory is virtual memory requested from the OS through C/C++ functions such as malloc and new; it may be paged out, so its physical address is not fixed.
  • Page-locked (pinned) host memory is allocated with cudaMallocHost or cudaHostAlloc; it always resides in physical memory at a fixed address and can be transferred by DMA for higher throughput, but allocation and deallocation are relatively slow. It is freed with cudaFreeHost. cudaHostRegister and cudaHostUnregister can pin/unpin pageable host memory, but they are slow and should not be used frequently.
  • Device memory is allocated with cudaMalloc, or with cudaMallocPitch() and cudaMalloc3D(); it cannot be paged.

CUDA memory copies

  • cudaMemcpy(): uses the default stream and copies synchronously.
  • cudaMemcpyAsync(..., stream): transfers on the given stream and returns immediately. For real concurrency the transfer must not be in the default stream and the host memory must be pinned.
  • The thrust API moves data between vectors through assignment.

Conditions for concurrent (overlapping) memory copies (a sketch follows the list):

  • The copies are issued in different non-default streams.
  • The host memory involved is pinned.
  • The asynchronous copy API is used.
  • Only one copy per direction is in flight at a time.
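
A minimal host-side sketch that satisfies all four conditions (buffer names and sizes are illustrative):

const size_t N = 1 << 20;
float *host_a, *host_b, *dev_a, *dev_b;
cudaMallocHost(&host_a, N * sizeof(float));   // pinned host memory
cudaMallocHost(&host_b, N * sizeof(float));
cudaMalloc(&dev_a, N * sizeof(float));
cudaMalloc(&dev_b, N * sizeof(float));

cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);

// Different non-default streams, pinned host memory, async API, one copy per direction:
// the host-to-device and device-to-host transfers may overlap.
cudaMemcpyAsync(dev_a, host_a, N * sizeof(float), cudaMemcpyHostToDevice, s1);
cudaMemcpyAsync(host_b, dev_b, N * sizeof(float), cudaMemcpyDeviceToHost, s2);

cudaStreamSynchronize(s1);
cudaStreamSynchronize(s2);
cudaStreamDestroy(s1); cudaStreamDestroy(s2);
cudaFreeHost(host_a);  cudaFreeHost(host_b);
cudaFree(dev_a);       cudaFree(dev_b);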