Intel intrinsics: a brief introduction to the SSE, AVX, and MMX instruction sets

The MMX instruction set supports operations on several integer types. MMX defines a 64-bit packed integer type, corresponding to the __m64 type in the intrinsics, which can process two 32-bit integers at once.

  • Eight 64-bit MMX registers (aliased onto the mantissa bits of the x87 floating-point registers, so they are shared with x87; no floating-point instructions)
  • SIMD operations on packed bytes, words, and doublewords
  • MMX instructions target multimedia and communications software

SSE is a superset of MMX. The SSE instruction set supports only single-precision floating-point arithmetic; double-precision support did not arrive until SSE2. SSE2 defines a 128-bit packed integer type, corresponding to the __m128i type in the intrinsics, which can process four 32-bit integers at once (see the sketch after the SSE2 list below).

  • 70 instructions in total: 50 SIMD floating-point instructions, 12 MMX integer enhancement instructions, and 8 instructions for streaming transfers of contiguous data blocks to and from memory
  • 8 new XMM registers (XMM0-XMM7)
  • 8 additional registers (XMM8-XMM15) in x86_64
SSE2 instruction set:

  • Adds 144 new instructions
  • Extends the register width from 64 bits to 128 bits
  • Adds double-precision floating-point support
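
A minimal sketch of the SSE2 integer intrinsics mentioned above (the values and file name are illustrative; SSE2 is enabled by default on any x86_64 compiler such as gcc or icc):

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdio.h>

int main(void)
{
    /* Two __m128i vectors, each holding four 32-bit integers. */
    __m128i a = _mm_set_epi32(4, 3, 2, 1);
    __m128i b = _mm_set_epi32(40, 30, 20, 10);

    /* A single instruction adds all four lanes at once. */
    __m128i c = _mm_add_epi32(a, b);

    int out[4];
    _mm_storeu_si128((__m128i *)out, c);
    printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);   /* 11 22 33 44 */
    return 0;
}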

SSE3 instruction set:

  • Adds 13 instructions (horizontal operations within a register, floating-point-to-integer conversion)
  • Includes instructions that improve the processor's Hyper-Threading performance

SSSE3 instruction set:

  • Extends SSE3 with 16 new instructions
  • Absolute value, negation, and similar operations

SSE4 instruction set:

  • Adds 47 new instructions, updated through SSE4.2

The AVX instruction set supports only single- and double-precision floating-point arithmetic; integer operations were not supported until the AVX2 instruction set in the 2013 Haswell architecture. A short AVX sketch follows the list below.

  • Data width extended from 128 bits to 256 bits
  • Operand count increased from two to three
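
A minimal AVX sketch (illustrative values; build with an AVX-enabled flag such as -mavx for gcc or -xAVX for icc):

#include <immintrin.h>   /* AVX intrinsics */
#include <stdio.h>

int main(void)
{
    /* Two __m256 vectors, each holding eight single-precision floats. */
    __m256 a = _mm256_set1_ps(1.5f);
    __m256 b = _mm256_set1_ps(2.0f);

    /* Three-operand form: the destination is distinct from both sources. */
    __m256 c = _mm256_add_ps(a, b);

    float out[8];
    _mm256_storeu_ps(out, c);
    printf("%f\n", out[0]);   /* 3.500000 */
    return 0;
}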

 

Compiler Auto Vectorization

The Intel compiler provides the -x flag, which tells the compiler to generate specific vectorization instructions.
Using the -xHost flag enables the highest level of vectorization supported on the processor on which the user compiles. Note that the Intel compiler will try to vectorize a code with SSE2 instructions at optimizations of -O2 or higher. Disable this by specifying -no-vec.
The Intel compiler can generate a single executable with multiple levels of vectorization with the -ax flag, which takes the same options as the -x flag (i.e., AVX, …, SSE2). This flag will generate run-time checks to determine the level of vectorization support on the processor and will then choose the optimal execution path for that processor. It will also generate a baseline execution path that is taken if the -ax level of vectorization specified is not supported.
The -vec-report flag generates diagnostic information regarding vectorization to stdout. It takes an optional parameter between 0 and 5 (e.g., -vec-report0), with 0 disabling diagnostics and 5 providing the most detailed diagnostics about which loops were optimized, which loops were not optimized, and why.
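
As a sketch, a simple loop like the following is a typical candidate for auto-vectorization (the file name and flag combinations in the comments are illustrative):

/* saxpy.c - independent iterations over contiguous data, ideal for SIMD */
void saxpy(int n, float a, const float * restrict x, float * restrict y)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Illustrative Intel compiler invocations:
 *   icc -std=c99 -O2 -xHost -vec-report2 -c saxpy.c   (vectorize for the build machine, report results)
 *   icc -std=c99 -O2 -no-vec -c saxpy.c               (disable auto-vectorization)
 */
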
Intel intrinsics guide:
https://software.intel.com/sites/landingpage/IntrinsicsGuide/

How to run Intel MPI on Xeon Phi

Overview

The Intel® MPI Library supports the Intel® Xeon Phi™ coprocessor in 3 major ways:

  • The offload model where all MPI ranks are run on the main Xeon host, and the application utilizes offload directives to run on the Intel Xeon Phi coprocessor card,
  • The native model where all MPI ranks are run on the Intel Xeon Phi coprocessor card, and
  • The symmetric model where MPI ranks are run on both the Xeon host and the Xeon Phi coprocessor card.

This article will focus on the native and symmetric models only. If you’d like more information on the offload model, this article gives a great overview and even more details are available in the Intel® Compiler documentation.

Prerequisites

The most important thing to remember is that we’re treating the Xeon Phi coprocessor cards as simply another node in a heterogeneous cluster. To that effect, running an MPI job in either the native or symmetric mode is very similar to running a regular Xeon MPI job. On the flip side, that does require some prerequisites to be fulfilled for each coprocessor card to be completely accessible via MPI.
Uniquely accessible hosts
All coprocessor cards on the system need to have a unique IP address that’s accessible from the local host, other Xeon hosts on the system, and other Xeon Phi cards attached to those hosts.  Again, think of simply adding another node to an existing cluster.  A very simple test of this will be the ability to ssh from one Xeon Phi coprocessor (let’s call it node0-mic0) to its own Xeon host (node0), as well as ssh to any other Xeon host on the cluster (node1) and their respective Xeon Phi cards (node1-mic0).  Here’s a quick example:

[user@node0-mic0 user]$ ssh node1-mic0 hostname
node1-mic0

Access to necessary libraries
Make sure all MPI libraries are accessible from the Xeon Phi card. There are a couple of ways to do this:

  • Set up an NFS share between the Xeon host where the Intel MPI Library is installed and the Xeon Phi coprocessor card.
  • Manually copy all Xeon Phi-specific MPI libraries to the card.  More details on which libraries to copy and where are available here.

Assuming both of those requirements have been met, you’re ready to start using the Xeon Phi coprocessors in your MPI jobs.

Running natively on the Xeon Phi coprocessor

The set of steps to run on the Xeon Phi coprocessor card exclusively can be boiled down to the following:
1. Set up the environment
Use the appropriate scripts to set your runtime environment. The following assumes all Intel® Software Tools are installed in the /opt/intel directory.

# Set your compiler
[user@host] $ source /opt/intel/composer_xe_<version>/bin/compilervars.sh intel64
# Set your MPI environment
[user@host] $ source /opt/intel/impi/<version>/bin64/mpivars.sh

2. Compile for the Xeon Phi coprocessor card
Use the -mmic option for the Intel Compiler to build your MPI sources for the card.

[user@host] $ mpiicc -mmic -o test_hello.MIC test.c
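
The contents of test.c are not shown in the original article; a minimal MPI hello-world consistent with the sample output later in this section (the message format is an assumption) could look like this:

/* test.c - assumed minimal MPI hello world */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);
    printf("Hello world: rank %d of %d running on %s\n", rank, size, name);
    MPI_Finalize();
    return 0;
}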

3. Copy the Xeon Phi executables to the card
Transfer the executable that you just created to the card for execution.

[user@host] $ scp ./test_hello.MIC node0-mic0:~/test_hello

This step is not required if your host and card are NFS-shared. Also note that we’re renaming this executable during the copy process. This helps us use the same mpirun command for both native and symmetric modes.
4. Launch the application
Simply use the mpirun command to start the executable remotely on the card. Note that if you’re planning on using a Xeon Phi coprocessor in your MPI job, you have to tell the library by setting the I_MPI_MIC environment variable. This is a required step.

[user@host] $ export I_MPI_MIC=enable
[user@host] $ cat mpi_hosts
node0-mic0
[user@host] $ mpirun -f mpi_hosts -n 2 ~/test_hello
Hello world: rank 0 of 2 running on node0-mic0
Hello world: rank 1 of 2 running on node0-mic0

Running symmetrically on both the Xeon host and the Xeon Phi coprocessor

You’re now trying to utilize both the Xeon hosts on your cluster, and the Xeon Phi coprocessor cards attached to them.
1. Set up the environment
This step is the same as in the native case above.
2. Compile for the Xeon Phi coprocessor card and for the Xeon host
You’re now going to have to compile two different sets of binaries:

# for the Xeon Phi coprocessor
[user@host] $ mpiicc -mmic -o test_hello.MIC test.c
# for the Xeon host
[user@host] $ mpiicc -o test_hello test.c

3. Copy the Xeon Phi executables to the card
Here, we still have to transfer the Xeon Phi coprocessor-compiled executables to the card.  And again, we’re renaming the executable during the transfer:

[user@host] $ scp ./test_hello.MIC node0-mic0:~/test_hello

Now, this will not work if your $HOME directory (where the executables live) is NFS-shared between host and card.  For more tips on what to do in NFS-sharing cases, check out this article.
4. Launch the application
Finally, you run the MPI job.  The only difference here is in your hosts file, as you now have to add the Xeon host to the list.

[user@host] $ export I_MPI_MIC=enable
[user@host] $ cat mpi_hosts
node0
node0-mic0
[user@host] $ mpirun -f mpi_hosts -perhost 1 -n 2 ~/test_hello
Hello world: rank 0 of 2 running on node0
Hello world: rank 1 of 2 running on node0-mic0

https://software.intel.com/en-us/articles/how-to-run-intel-mpi-on-xeon-phi
https://software.intel.com/en-us/articles/using-the-intel-mpi-library-on-intel-xeon-phi-coprocessor-systems
https://software.intel.com/en-us/articles/using-xeon-phi-prefixes-and-extensions-for-intel-mpi-jobs-in-nfs-shared-environment
http://www.hpc.mcgill.ca/index.php/81-doc-pages/256-using-xeon-phis-on-guillimin

Running MPI across multiple machines

Steps:
1. Set PATH on both machines so that mpicc --version runs successfully on each of them.
2. Configure passwordless SSH login between the two machines.
3. Run your program (see the example at the end of this section).
To configure passwordless login between machine A (192.168.1.1) and machine B (192.168.1.2):
Log in to machine A and run:

# Generate a key pair on A (accept the defaults)
ssh-keygen -t rsa
# Make sure the .ssh directory exists on B
ssh username@192.168.1.2 mkdir -p .ssh
# Append A's public key to B's authorized_keys
cat ~/.ssh/id_rsa.pub | ssh username@192.168.1.2 'cat >> .ssh/authorized_keys'

Log in to machine B and repeat the same steps in the other direction; the two machines can then log in to each other without a password.
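
With passwordless login in place, step 3 mirrors a single-machine run; a hedged example (host IPs, process count, and executable name are all illustrative):

[user@A] $ cat hosts
192.168.1.1
192.168.1.2
[user@A] $ mpirun -f hosts -n 4 ./my_mpi_program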
 
Reference: http://blog.csdn.net/bendanban/article/details/40710217

CUDA memory and memory copies

CUDA memory

CUDA programs deal with three kinds of memory: pageable host memory, page-locked (pinned) host memory, and device memory (a short allocation sketch follows the list below).

  • Pageable host memory is virtual memory requested from the operating system through C/C++ calls such as malloc and new; under memory pressure it may be paged out, so its physical address is not fixed.
  • Page-locked (pinned) host memory is allocated with cudaMallocHost or cudaHostAlloc; it always resides in physical memory at a fixed address and can be transferred via direct memory access (DMA), which speeds up transfers, but allocation and deallocation are relatively expensive. cudaFreeHost releases it. cudaHostRegister and cudaHostUnregister can pin/unpin existing pageable host memory, but they are slow and should not be used frequently.
  • Device memory is allocated with cudaMalloc, cudaMallocPitch(), or cudaMalloc3D(), and cannot be paged.
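
A minimal sketch of the three allocation kinds (the buffer size is illustrative; error checking omitted for brevity):

#include <cuda_runtime.h>
#include <stdlib.h>

int main(void)
{
    const size_t bytes = 1 << 20;

    /* Pageable host memory: ordinary malloc/new. */
    float *h_pageable = (float *)malloc(bytes);

    /* Page-locked (pinned) host memory: fixed physical address, DMA-friendly. */
    float *h_pinned;
    cudaMallocHost((void **)&h_pinned, bytes);

    /* Device memory: cannot be paged. */
    float *d_buf;
    cudaMalloc((void **)&d_buf, bytes);

    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
    free(h_pageable);
    return 0;
}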

CUDA memory copies

  • cudaMemcpy() uses the default stream and copies synchronously
  • cudaMemcpyAsync(…, stream) performs the transfer on the given stream and returns immediately (asynchronous); to actually overlap with other work it must not be issued in the default stream, and the host memory must be pinned
  • In the Thrust API, data movement between vectors is expressed as assignment (see the sketch below)
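
A minimal Thrust sketch of copy-by-assignment (values are illustrative):

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>

int main(void)
{
    thrust::host_vector<int> h(4, 7);    /* four elements with value 7, on the host */
    thrust::device_vector<int> d = h;    /* assignment copies host -> device */
    h = d;                               /* assignment copies device -> host */
    return 0;
}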

Conditions for overlapping (concurrent) memory copies, illustrated in the sketch after this list:

  • The copies are issued in different non-default streams
  • The host memory involved is pinned
  • The asynchronous copy API is used
  • Only one copy can be in flight in each direction at a time
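
A sketch that satisfies these conditions (buffer size is illustrative; error checking omitted; true overlap also requires hardware with separate copy engines):

#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = 1 << 20;
    float *h_in, *h_out, *d_in, *d_out;
    cudaStream_t s0, s1;

    /* Pinned host buffers are required for truly asynchronous copies. */
    cudaMallocHost((void **)&h_in, bytes);
    cudaMallocHost((void **)&h_out, bytes);
    cudaMalloc((void **)&d_in, bytes);
    cudaMalloc((void **)&d_out, bytes);
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    /* Two asynchronous copies in different non-default streams and in
       opposite directions, so they can proceed concurrently. */
    cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, s0);
    cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, s1);

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(d_in);
    cudaFree(d_out);
    cudaFreeHost(h_in);
    cudaFreeHost(h_out);
    return 0;
}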