PyTorch C++ compilation

Recently I have been working on PyTorch C++ inference, which requires building the PyTorch libraries.

After a few rounds of fiddling it finally worked, so I am recording the steps here.

Clone the source

git clone https://github.com/pytorch/pytorch.git

git checkout tags/v1.3.1

git checkout -b v1.3.1

git submodule sync

git submodule update --init --recursive

Prepare the environment: miniconda, cmake (3.5+), CUDA 9.0+

Install miniconda

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

sh Miniconda3-latest-Linux-x86_64.sh

Install cmake

curl -OL https://github.com/Kitware/CMake/releases/download/v3.15.5/cmake-3.15.5.tar.gz
tar zxvf cmake-3.15.5.tar.gz && cd cmake-3.15.5
./configure --prefix=install_dir
make && make install

Install CUDA via yum

Install devtoolset-7. The PyTorch build is picky about the gcc version and fails with compiler errors otherwise; see https://github.com/pytorch/builder/blob/master/update_compiler.sh

yum install -y -q devtoolset-7-gcc devtoolset-7-gcc-c++ devtoolset-7-gcc-gfortran devtoolset-7-binutils

source /opt/rh/devtoolset-7/enable

Build from source. You can simply run make in the repository root; here I install manually into a chosen directory instead.

Enter the pytorch root directory, then:

mkdir build && cd build

cmake .. -DPYTHON_EXECUTABLE:FILEPATH=/search/privateData/tools/miniconda3/bin/python -DPYTHON_INCLUDE_DIR=/search/privateData/tools/miniconda3/include/python3.7m -DCMAKE_INSTALL_PREFIX=/search/privateData/tools/libtorch-1.3.0
make -j$(nproc) && make install
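
With make install done, the headers and libraries under CMAKE_INSTALL_PREFIX can be consumed from C++. As a smoke test, here is a minimal sketch of TorchScript inference with libtorch (the model file name and input shape are placeholders; build it against the install prefix, e.g. via find_package(Torch) with CMAKE_PREFIX_PATH pointing at it):

#include <torch/script.h>
#include <iostream>
#include <vector>

int main() {
    // "model.pt" is a placeholder for a module exported with torch.jit.trace/script
    torch::jit::script::Module module = torch::jit::load("model.pt");

    // dummy input; the 1x3x224x224 shape is only an example
    std::vector<torch::jit::IValue> inputs;
    inputs.push_back(torch::ones({1, 3, 224, 224}));

    // run forward and print a slice of the output
    at::Tensor output = module.forward(inputs).toTensor();
    std::cout << output.slice(/*dim=*/1, /*start=*/0, /*end=*/5) << '\n';
    return 0;
}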

CUDA Dynamic Parallelism

Dynamic parallelism was introduced in CUDA 5.0: threads of a kernel executing on the device can launch kernels themselves, just as on the host. It is only supported on devices of compute capability 3.5 or higher. Dynamic parallelism uses the CUDA Device Runtime library (cudadevrt), a subset of the CUDA runtime that can be called from device code.

Compilation and linking

To support dynamic parallelism, a two-step process of separate compilation and linking must be used. First, compile with -c and -rdc=true (--relocatable-device-code=true) to generate relocatable device code for later linking; the two options can be combined as -dc (--device-c). Then link the resulting object file against the cudadevrt library with -lcudadevrt to produce the executable. The process is illustrated in Figure 4: The Separate Compilation and Linking Process for Dynamic Parallelism.

nvcc -arch=sm_35 -dc myprog.cu -o myprog.o
nvcc -arch=sm_35 myprog.o -lcudadevrt -o myprog

Or, in one step:

nvcc -arch=sm_35 -rdc=true myprog.cu -lcudadevrt -o myprog

Execution and synchronization

In the CUDA programming model, the group of thread blocks executing a kernel is called a grid. With CUDA dynamic parallelism, a parent grid can launch child grids. A child grid inherits certain attributes and limits from its parent, such as the L1 cache / shared memory configuration and the stack size. Note that if a parent grid has M blocks of N threads each and the child launch is not guarded, M*N child grids will be launched. To launch only one child grid per block, let just one thread of the block perform the launch, as below:

if(threadIdx.x == 0) {
  child_k <<< (n + bs - 1) / bs, bs >>> ();
}

Grid launches are fully nested: child grids always complete before the parent grid that launched them is considered finished, which acts as an implicit synchronization.

If the parent kernel needs the results computed by a child kernel, it can synchronize explicitly with cudaDeviceSynchronize(void), which waits for all kernels previously launched by the thread block to finish. Since you usually do not know which threads of the block have already launched child kernels, a block-level synchronization can be done as follows:

__device__ void threadBlockDeviceSynchronize(void) {
  __syncthreads();
  if(threadIdx.x == 0)
    cudaDeviceSynchronize();
  __syncthreads();
}

cudaDeviceSynchronize() is expensive; avoid it unless it is really needed, and do not call it right before the parent kernel exits, since the implicit synchronization described above already happens there.
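
Putting the pieces together, here is a minimal end-to-end sketch (kernel names and sizes are illustrative); compile it with the -rdc=true and -lcudadevrt flags shown earlier:

#include <cstdio>

__global__ void child_k(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = 2 * i;                  // the child grid does the actual work
}

__global__ void parent_k(int *data, int n) {
    if (threadIdx.x == 0) {                      // one thread launches the child grid
        child_k<<<(n + 255) / 256, 256>>>(data, n);
        cudaDeviceSynchronize();                 // needed: the parent reads the child's results
        printf("data[1] = %d\n", data[1]);
    }
}

int main() {
    const int n = 1024;
    int *d_data;
    cudaMalloc(&d_data, n * sizeof(int));
    parent_k<<<1, 32>>>(d_data, n);
    cudaDeviceSynchronize();                     // the host waits for the whole launch tree
    cudaFree(d_data);
    return 0;
}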

Memory consistency

The parent and the child grid have a fully consistent view of global memory only when the child grid starts and when it completes.

While the child grid runs concurrently with its parent, the view of global memory is not guaranteed to be consistent:

__device__ int v = 0;
__global__ void child_k(void) {
  printf("v = %d\n", v);
}
__global__ void parent_k(void) {
  v = 1;
  child_k <<< 1, 1 >>> ();
  v = 2; // RACE CONDITION
  cudaDeviceSynchronize();
}

After launching a child kernel and before synchronizing explicitly, the parent grid must not write to memory that the child grid reads; doing so creates a race condition.

Passing pointers to child grids

There are restrictions on which pointers may be passed:

  • Pointers that can be passed: global memory (including __device__ variables and memory allocated with malloc), zero-copy host memory, and constant memory.
  • Pointers that cannot be passed: shared memory (__shared__ variables) and local memory (including stack variables). A sketch follows this list.
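
A minimal sketch of the rule (kernel names are illustrative); the illegal cases are shown commented out because the compiler will not necessarily reject them:

__device__ int g_buf[256];                     // global memory: may be passed

__global__ void child(int *p) { p[threadIdx.x] += 1; }

__global__ void parent(void) {
    __shared__ int s_buf[256];                 // shared memory: must not be passed
    int l_var = 0;                             // local/stack memory: must not be passed

    child<<<1, 256>>>(g_buf);                  // OK: global memory pointer
    // child<<<1, 256>>>(s_buf);               // invalid: child cannot dereference parent's shared memory
    // child<<<1, 256>>>(&l_var);              // invalid: child cannot dereference parent's local memory
    cudaDeviceSynchronize();
}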

Device streams and events

All streams created on the device are non-blocking: the implicit synchronization with the default NULL stream is not supported. Create a stream as follows:

cudaStream_t s;
cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);

Once a device stream is created, it can be used by any thread within the creating thread block, but not by other thread blocks or by the host (and, conversely, host streams cannot be used on the device).
Events are also supported, with restrictions: they can only be used to order work between streams via cudaStreamWaitEvent(); using events for timing or for synchronization is not supported.
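
For example, a sketch of device-side ordering between two streams (assuming the device runtime's event API with a cudaEventDisableTiming event, and placeholder producer/consumer kernels):

__global__ void producer(void) { /* write some data */ }
__global__ void consumer(void) { /* read that data  */ }

__global__ void parent(void) {
    if (threadIdx.x == 0) {
        cudaStream_t s1, s2;
        cudaStreamCreateWithFlags(&s1, cudaStreamNonBlocking);
        cudaStreamCreateWithFlags(&s2, cudaStreamNonBlocking);

        cudaEvent_t e;
        cudaEventCreateWithFlags(&e, cudaEventDisableTiming); // timing events are unsupported on device

        producer<<<1, 32, 0, s1>>>();
        cudaEventRecord(e, s1);              // marks the point after producer in s1
        cudaStreamWaitEvent(s2, e, 0);       // s2 will not run consumer before the event
        consumer<<<1, 32, 0, s2>>>();
    }
}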

Recursion depth and device limits

Recursion depth involves two notions:

  • nesting depth: the maximum nesting level of recursively launched grids, where kernels launched from the host have depth 0;
  • synchronization depth: the deepest level at which cudaDeviceSynchronize() is called, where the host has depth 1. cudaLimitDevRuntimeSyncDepth should be set to the maximum synchronization depth plus 1, e.g. cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, 4).

The maximum nesting depth has a hardware limit: on CC 3.5 it is 24, and the same limit applies to the synchronization depth.
From the outermost level down to the maximum synchronization depth, each level reserves memory to back up the parent block's context, even if that memory is never used, so the configured recursion depth has to account for the memory reserved per level.
There is an additional limit on the number of pending child grids: the pending launch buffer maintains the launch queue and tracks the state of currently executing kernels. Set it to an appropriate value via

cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 32768);

Otherwise, a cudaGetLastError() call will return cudaErrorLaunchPendingCountExceeded.
Dynamic parallelism executes in a tree-like fashion, though it differs from tree processing on a CPU: shallow, bushy trees with many branches per node suit dynamic parallelism well, while deep trees with few nodes per level do not.

characteristic   tree processing        dynamic parallelism
node             thin (1 thread)        thick (many threads)
branch degree    small (usually < 10)   large (usually > 100)
depth            large                  small

Reference: http://devblogs.nvidia.com/parallelforall/cuda-dynamic-parallelism-api-principles/

CUDA synchronization and timing

Synchronizing a block

__syncthreads()

Synchronizing kernels

cudaDeviceSynchronize()
waits until all preceding commands in all streams of all host threads have completed.

Synchronizing a stream

cudaStreamSynchronize()
takes a stream as a parameter and waits until all preceding commands in the given stream have completed. It can be used to synchronize the host with a specific stream, allowing other streams to continue executing on the device.

Although CUDA kernel launches are asynchronous, all GPU-related tasks placed in one stream (which is default behaviour) are executed sequentially.
If you use printf inside a kernel, remember the launch is asynchronous: call cudaDeviceSynchronize() afterwards, or you will see no output.
CUDA provides two ways to synchronize with kernels:

  • call cudaThreadSynchronize() (deprecated in favor of cudaDeviceSynchronize()) for explicit synchronization: the host blocks until all previously submitted kernels have finished;
  • use cudaMemcpy() for a blocking data transfer, which internally performs the same synchronization first.
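
For the timing half of this section's title, the usual host-side pattern is event-based; a minimal sketch (my_kernel and its launch configuration are placeholders):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);                    // marker enqueued before the work
my_kernel<<<grid, block>>>(d_data);        // the work being timed
cudaEventRecord(stop);                     // marker enqueued after the work

cudaEventSynchronize(stop);                // block the host until 'stop' completes
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);    // elapsed GPU time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);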


marvin installation and usage

Install cuDNN

cp lib* cudnn_dir/lib64/
cp cudnn.h cudnn_dir/include/
cd cudnn_dir
export LD_LIBRARY_PATH=`pwd`:$LD_LIBRARY_PATH

If you see "error while loading shared libraries: libcudnn.so.4: cannot open shared object file: No such file or directory", it is a file permission problem; fix it as follows:

cd cudnn_dir
rm -rf libcudnn.so libcudnn.so.4
chmod u=rwx,g=rx,o=rx libcudnn.so.4.0.4
ln -s libcudnn.so.4.0.4 libcudnn.so.4
ln -s libcudnn.so.4 libcudnn.so

marvin depends on cuda 7.5 and cuDNN 4rc:
curl -L https://github.com/PrincetonVision/marvin/tarball/master | tar zx
mv PrincetonVision* marvin && cd marvin
./compile.sh

Bitonic sort

Bitonic sequence. A sequence (a_1, a_2, ..., a_m) is said to be bitonic if and only if:
(a) either there is an integer j, 1 ≤ j ≤ m, such that a_1 ≤ a_2 ≤ ... ≤ a_j ≥ a_{j+1} ≥ a_{j+2} ≥ ... ≥ a_m,
(b) or the sequence does not initially satisfy condition (a), but can be shifted cyclically until it does.

/*
 * bitonic_sort.h
 *
 *  Created on: 2015-05-17
 *      Author: zhangjun
 */
#ifndef BITONIC_SORT_H_
#define BITONIC_SORT_H_
#include<iostream>
#include<iterator>
#include<algorithm>
using namespace std;
class bitonic_sorter
{
public:
	bitonic_sorter(int a[], int len);
	void sort(bool direction = true);
	void sort_for_arbitary_length(bool direction = true);
private:
	int *array;
	int length;
	void bitonic_sort(int low, int len, bool direction);
	void bitonic_sort_for_arbitary_length(int low, int len, bool direction);
	void bitonic_merge(int low, int len, bool direction);
	void bitonic_merge_for_arbitary_length(int low, int len, bool direction);
	void compare_and_swap(int i, int j, bool direction);
	int greatest_power_of_2_lessthan(int len);
};
#endif /* BITONIC_SORT_H_ */


/*
 * bitonic_sort.cpp
 *
 *  Created on: 2015-05-17
 *      Author: zhangjun
 */
#include "bitonic_sort.h"
bitonic_sorter::bitonic_sorter(int a[], int len)
{
	array = a;
	length = len;
}
void bitonic_sorter::sort(bool direction)
{
	bitonic_sort(0, length, direction);
}
void bitonic_sorter::sort_for_arbitary_length(bool direction)
{
	bitonic_sort_for_arbitary_length(0, length, direction);
}
void bitonic_sorter::bitonic_sort(int low, int len, bool direction)                   // bitonic_sort
{
	if(len > 1)
	{
		int m = len/2;
		bitonic_sort(low, m, direction);
		bitonic_sort(low+m, m, !direction);
		bitonic_merge(low, len, direction);
	}
}
void bitonic_sorter::bitonic_sort_for_arbitary_length(int low, int len, bool direction)               // bitonic_sort_for_arbitary
{
	if(len > 1)
	{
		int m = len/2;
		if(direction == true)
		{
			bitonic_sort_for_arbitary_length(low, m, !direction);                                                                   // len-m > m
			bitonic_sort_for_arbitary_length(low+m, len-m, direction);                                                       // the big end
			bitonic_merge_for_arbitary_length(low, len, direction);
		}
		else
		{
			int half = greatest_power_of_2_lessthan(len);
			bitonic_sort_for_arbitary_length(low, len-half, !direction);                                                        // half >= len - half
			bitonic_sort(low+len-half, half, direction);                                                // the big end
			bitonic_merge_for_arbitary_length(low, len, direction);
		}
	}
}
void bitonic_sorter::bitonic_merge(int low, int len, bool direction)
{
	if(len > 1)
	{
		int m = len/2;
		for( int i = low; i < low + m; i++)
			compare_and_swap(i, i+m, direction);
		bitonic_merge(low, m, direction);
		bitonic_merge(low+m, m, direction);
	}
}
void bitonic_sorter::bitonic_merge_for_arbitary_length(int low, int len, bool direction)
{
	if(len > 1)
	{
		int m = greatest_power_of_2_lessthan(len);                                             // low+m >= low+len-m
		for( int i = low; i < low + len - m; i++)
			compare_and_swap(i, i+m, direction);
		bitonic_merge(low, m, direction);                                                                   // m is a power of 2
		bitonic_merge_for_arbitary_length(low+m, len-m, direction);                         // len-m may not be a power of 2
	}
}
void bitonic_sorter::compare_and_swap(int i, int j, bool direction)
{
	if(direction ==(array[i]>array[j]))
		std::swap(array[i], array[j]);
}
int bitonic_sorter::greatest_power_of_2_lessthan(int len)
{
	int p = 1;
	while(p<len)
		p = p<<1;
	return p>>1;
}


/*
 * test.cpp
 *
 *  Created on: 2015-05-06
 *      Author: zhangjun
 */
#include<stdio.h>
#include<string.h>
#include"bitonic_sort.h"
int main()
{
	int num1[8] = {3, 67, 3, 5, 8, 4, 7, 9};
	int num2[34] = {7, 5, 8, 3, 5, 78, 9, 5, 6, 23,24,1,8,10,32, 2, 3, 8, 9, 21, 15, 3, 4, 8, 9, 6, 3, 2, 1,78,43, 56, 23, 41};
	bitonic_sorter s1(num1, 8);
	bitonic_sorter s2(num2, 34);
	s1.sort(false);
	std::copy(num1, num1+8, std::ostream_iterator<int>(cout, " "));
	std::cout<<"\n";
	s2.sort_for_arbitary_length(true);
	std::copy(num2, num2+34, std::ostream_iterator<int>(cout, " "));
	std::cout<<"\n";
	return 0;
}

CUDA bitonic sort code

// NUM and swap() are assumed by this snippet; minimal definitions:
#define NUM 256   // element count; must match the thread count of the single block

__device__ inline void swap(int &a, int &b)
{
    int tmp = a; a = b; b = tmp;
}

__global__ static void bitonicSort(int * values)
{
    extern __shared__ int shared[];
    const int tid = threadIdx.x;
    // Copy input to shared mem.
    shared[tid] = values[tid];
    __syncthreads();
    // Parallel bitonic sort.
    for (int k = 2; k <= NUM; k *= 2)                   // sorted subsequence length: 2, 4, ..., NUM
    {
        // Bitonic merge:
        for (int j = k / 2; j>0; j /= 2)                // from k/2 to 1
        {
            int ixj = tid ^ j;
            if (ixj > tid)           // bit j of tid is 0, bit j of ixj is 1
            {
                if ((tid & k) == 0)               // bit k of tid is 0: ascending
                {
                    if (shared[tid] > shared[ixj])
                    {
                        swap(shared[tid], shared[ixj]);
                    }
                }
                else                             // bit k of tid is 1: descending
                {
                    if (shared[tid] < shared[ixj])
                    {
                        swap(shared[tid], shared[ixj]);
                    }
                }
            }
            __syncthreads();
        }
    }
    // Write result.
    values[tid] = shared[tid];
}
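
The kernel assumes one block of NUM threads and NUM*sizeof(int) bytes of dynamic shared memory. A host-side launch sketch (d_values is assumed to be a device array of NUM ints):

bitonicSort<<<1, NUM, NUM * sizeof(int)>>>(d_values);
cudaDeviceSynchronize();   // wait for the sort before copying the results back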


Setting up a deep learning environment (caffe + cudnn as the example, no root required)

(Installed here on Red Hat 6.3 without root; everything is built from source.)
1. System and permissions: GTX 780, Red Hat 6.3, gcc 4.4.6, no root access
2. Required libraries:
(1) CUDA 6.5 (recommended), 6.0, 5.5 or 5.0, together with the driver matching CUDA 6.0, or driver 319.* (not 331.*) for CUDA 5
Note: installing the CUDA driver requires root. I installed CUDA 6.5, which needs at least a 340.* driver (6.0 and below untested).
(2) BLAS: one of ATLAS, MKL or OpenBLAS. I used the Intel MKL already installed by the system administrator.
(3) OpenCV (>= 2.4)
(4) Boost (>= 1.55) (in fact only 1.55 works, more on that later)
(5) glog, gflags, protobuf, leveldb, snappy, hdf5, lmdb
(6) Python 2.7, numpy (>= 1.7)
(7) MATLAB
(Python and MATLAB should be optional; I only installed Python.)
3. Building the dependencies
CUDA and Python are straightforward to install and Intel MKL was preinstalled, so they are not covered here.

#protobuf
tar zxvf protobuf-2.6.1.tar.gz
cd protobuf-2.6.1
chmod a+x autogen.sh
./autogen.sh
./configure --prefix=install_dir
make && make install
#leveldb
unzip leveldb-master.zip
cd leveldb-master
chmod a+x build_detect_platform
./build_detect_platform
make
#snappy
unzip snappy-master.zip
cd snappy-master
chmod a+x autogen.sh
./autogen.sh
./configure --prefix=install_dir
make && make install
#hdf5
tar zxvf hdf5-1.8.14.tar.gz
cd hdf5-1.8.14
./configure --prefix=install_dir
make && make install
# glog
tar zxvf glog-0.3.3.tar.gz
cd glog-0.3.3
./configure --prefix=install_dir
make && make install
# gflags
tar zxvf gflags-2.1.1.tar.gz
cd gflags-2.1.1
mkdir build && cd build
export CXXFLAGS="-fPIC" && cmake -DCMAKE_INSTALL_PREFIX=install_dir .. && make VERBOSE=1
make && make install
# lmdb
git clone git://gitorious.org/mdb/mdb.git
cd mdb/libraries/liblmdb
make && make install
#opencv
unzip opencv-2.4.10.zip
cd opencv-2.4.10
mkdir build && cd build
cmake -D CMAKE_BUILD_TYPE=RELEASE -D CMAKE_INSTALL_PREFIX=install_dir ..
make && make install

4. Installing cuDNN
First download the cuDNN Deep Neural Network library from https://developer.nvidia.com/cuDNN (CUDA Registered Developers can obtain cuDNN for free).

tar -zxvf cudnn-6.5-linux-R1.tgz
cd cudnn-6.5-linux-R1
# cp cudnn.h to CUDA_DIR/include; CUDA_DIR is the directory where the CUDA toolkit is installed
cp cudnn.h CUDA_DIR/include
# cp cudnn library to CUDA_DIR/lib64
cp libcudnn* CUDA_DIR/lib64
# also create symlinks for the cudnn library
ln -s libcudnn.so.6.5.18 libcudnn.so.6.5
ln -s libcudnn.so.6.5 libcudnn.so

5. caffe
Download the caffe package from http://caffe.berkeleyvision.org/installation.html
After unpacking, enter the caffe directory and first make a copy of Makefile.config:

cp Makefile.config.example Makefile.config

Then edit Makefile.config. A few changes are needed.
(1) cudnn
To accelerate with cuDNN, uncomment USE_CUDNN := 1. Also set the CUDA install path and the CUDA architecture (just uncomment CUDA_ARCH):

# cuda_install_dir is the CUDA install path
CUDA_DIR := cuda_install_dir

Without a GPU, use CPU_ONLY := 1 instead.
(2) BLAS
I use the preinstalled Intel MKL:

BLAS := mkl

(3) Python
Set the Python directories:

# to find Python.h and numpy/arrayobject.h
PYTHON_INCLUDE := python_install_dir \
        python_install_dir/dist_packages/numpy/core/include
# to find libpythonX.X.so or .dylib
PYTHON_LIB := python_install_dir/lib

That should be everything, right? I celebrated too soon! The dependencies above were built from source under an ordinary user account, and even with PATH and LD_LIBRARY_PATH configured, the build still fails with a pile of errors like:

/usr/bin/ld: cannot find -lgflags
collect2: ld returned 1 exit status
make: *** [.build_release/lib/libcaffe.so] Error 1

The following additions are also needed:

#=======================================================
PROTO_INCLUDE := /home/usrname/software/protobuf/include
PROTO_LIB := /home/usrname/software/protobuf/lib
GLOG_INCLUDE := /home/usrname/software/glog/include
GLOG_LIB := /home/usrname/software/glog/lib
GFLAGS_INCLUDE := /home/usrname/software/gflags/include
GFLAGS_LIB := /home/usrname/software/gflags/lib
HDF5_INCLUDE := /home/usrname/software/hdf5/include
HDF5_LIB := /home/usrname/software/hdf5/lib
LEVELDB_INCLUDE := /home/usrname/software/leveldb-master/include
LEVELDB_LIB := /home/usrname/software/leveldb-master
LMDB_INCLUDE := /home/usrname/software/lmdb/include
LMDB_LIB := /home/usrname/software/lmdb/lib
OPENCV_INCLUDE := /home/usrname/software/opencv/include
OPENCV_LIB := /home/usrname/software/opencv/lib
SNAPPY_INCLUDE := /home/usrname/software/snappy/include
SNAPPY_LIB := /home/usrname/software/snappy/lib
BOOST_INCLUDE := /home/usrname/boost_1_57_0
BOOST_LIB := /home/usrname/boost_1_57_0/stage/lib
#======================================================================
# Whatever else you find you need goes here.
INCLUDE_DIRS := $(PYTHON_INCLUDE) $(PROTO_INCLUDE) $(GLOG_INCLUDE) $(GFLAGS_INCLUDE) $(HDF5_INCLUDE) $(LEVELDB_INCLUDE) $(LMDB_INCLUDE) $(OPENCV_INCLUDE) $(SNAPPY_INCLUDE) $(BOOST_INCLUDE) /usr/local/include
LIBRARY_DIRS := $(PYTHON_LIB) $(PROTO_LIB) $(GLOG_LIB) $(GFLAGS_LIB) $(HDF5_LIB) $(LEVELDB_LIB) $(LMDB_LIB) $(OPENCV_LIB) $(SNAPPY_LIB) $(BOOST_LIB) /usr/local/lib /usr/lib

(This taught me something I had never paid attention to: -L tells the linker where to look for .o/.so files at build time when producing the executable, while LD_LIBRARY_PATH is an environment variable searched at run time to locate .so files. See http://bbs.csdn.net/topics/330189724)
Now the build should go through.

make all
# or make all -j12 to speed up compilation; the number after -j is the parallel job count, ideally the machine's core count

But after make all and make test succeeded, make runtest failed:

[ PASSED ] 832 tests.
[ FAILED ] 6 tests, listed below:
[ FAILED ] PowerLayerTest/0.TestPowerGradientShiftZero, where TypeParam = caffe::FloatCPU
[ FAILED ] PowerLayerTest/1.TestPowerGradientShiftZero, where TypeParam = caffe::DoubleCPU
[ FAILED ] PowerLayerTest/1.TestPowerGradient, where TypeParam = caffe::DoubleCPU
[ FAILED ] PowerLayerTest/2.TestPowerGradientShiftZero, where TypeParam = caffe::FloatGPU
[ FAILED ] PowerLayerTest/3.TestPowerGradientShiftZero, where TypeParam = caffe::DoubleGPU
[ FAILED ] PowerLayerTest/3.TestPowerGradient, where TypeParam = caffe::DoubleGPU

The fix is to replace boost 1.57 with boost 1.55 and rebuild (1.56 does not work either; see http://blog.csdn.net/danieljianfeng/article/details/42836167).

CUDA compilation notes

Device code linking

nvcc -arch=sm_20 -dlink v3.o particle.o main.o -o gpuCode.o

This links all the device object code into gpuCode.o, without linking any CPU object code.

g++ gpuCode.o main.o particle.o -lcudart -o app

This links in the remaining CPU object code to produce the executable.
The -x option makes nvcc treat a .cpp file as a .cu file.
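
For example, to compile one of the files above as CUDA code even if it were named particle.cpp (a sketch of the flag's use):

nvcc -arch=sm_20 -dc -x cu particle.cpp -o particle.o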

Align your accesses on GPU global memory

You should simply align your accesses to GPU global memory. An aligned address is a multiple of the size of the object you are reading or writing; e.g. to read or write an integer, the address should be a multiple of 4, while reading or writing a char is always aligned.
Suppose you have a big buffer allocated with cudaMalloc called dummySpace; this pseudo-kernel would probably result in CUDA_EXCEPTION_6 (Warp Misaligned Address):

__global__ void func (char* stringInput, int stringSize, int* integerInput, char* dummySpace) // input: a string and an integer; output: a big buffer with both in it
{
//dummySpace is created by cudaMalloc, so it is aligned to at least 256 bytes
   int counter = 0;
   for(int i=0;i<stringSize;i++)
   dummySpace[counter++] = stringInput[i]; //==>this is copying several chars, sizeof(char) is one, so they are always aligned
   for(int i=0;i<sizeof(int);i++)
   dummySpace[counter++] = ((char*)integerInput)[i];   // ==> this is a problem: the first loop advanced counter by stringSize, which is unknown and can leave the address unaligned
}

The fixed one:

__global__ void func (char* stringInput, int stringSize, int* integerInput, char* dummySpace)
{
    int counter = 0;
    for(int i=0;i<stringSize;i++)
    dummySpace[counter++] = stringInput[i];
    int sub = counter % 4;   // or 8 or 16...
    if (sub) counter += 4 - sub;   // round counter up to the next multiple of 4
    for(int i=0;i<sizeof(int);i++)
   dummySpace[counter++] = ((char*)integerInput)[i];   //==> everything is ok as you are saving an integer in an aligned address
}


CUDA memory and memory copies

CUDA memory

CUDA distinguishes three kinds of memory: pageable host memory, page-locked (pinned) host memory, and device memory. (A sketch of all three allocations follows the list below.)

  • Pageable host memory is ordinary virtual memory obtained from the OS via C/C++ malloc, new, etc.; it can be paged out, so its physical address is not fixed.
  • Page-locked (pinned) host memory is allocated with cudaMallocHost or cudaHostAlloc; it always resides in physical memory at a fixed address and speeds up transfers via direct memory access (DMA), but allocation and deallocation are comparatively slow. Free it with cudaFreeHost. cudaHostRegister and cudaHostUnregister can pin/unpin existing pageable host memory, but they are slow and should not be called frequently.
  • Device memory is allocated with cudaMalloc (or cudaMallocPitch() / cudaMalloc3D()) and cannot be paged.
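
A short sketch of the three allocation kinds (buffer size is arbitrary):

const size_t n = 1 << 20;

// pageable host memory: ordinary allocation, may be paged out by the OS
float *h_pageable = (float *)malloc(n * sizeof(float));

// pinned host memory: fixed physical address, enables DMA transfers
float *h_pinned = NULL;
cudaMallocHost(&h_pinned, n * sizeof(float));

// device memory: resides on the GPU and cannot be paged
float *d_buf = NULL;
cudaMalloc(&d_buf, n * sizeof(float));

// matching deallocations
free(h_pageable);
cudaFreeHost(h_pinned);
cudaFree(d_buf);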

CUDA memory copies

  • cudaMemcpy() copies synchronously on the default stream.
  • cudaMemcpyAsync(…, stream) copies on the given stream and returns immediately. To actually achieve concurrency, the copy must not run on the default stream and the host memory must be pinned.
  • With the thrust API, data moves between vectors via plain assignment.

Conditions for concurrent memory copies (see the sketch after this list):

  • the copies run in different, non-default streams;
  • the host memory involved is pinned;
  • the asynchronous copy API is used;
  • at most one copy can be in flight per direction at a time.
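
A minimal sketch that satisfies these conditions, issuing two host-to-device copies with pinned buffers and two non-default streams (sizes and names are illustrative):

const size_t N = 1 << 20;
float *h_a, *h_b, *d_a, *d_b;
cudaMallocHost(&h_a, N * sizeof(float));   // pinned host buffers
cudaMallocHost(&h_b, N * sizeof(float));
cudaMalloc(&d_a, N * sizeof(float));
cudaMalloc(&d_b, N * sizeof(float));

cudaStream_t s1, s2;
cudaStreamCreate(&s1);                     // non-default streams
cudaStreamCreate(&s2);

// asynchronous copies; same-direction copies are serialized on the copy engine,
// but each can overlap with kernels or opposite-direction copies in other streams
cudaMemcpyAsync(d_a, h_a, N * sizeof(float), cudaMemcpyHostToDevice, s1);
cudaMemcpyAsync(d_b, h_b, N * sizeof(float), cudaMemcpyHostToDevice, s2);

cudaStreamSynchronize(s1);
cudaStreamSynchronize(s2);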