Software Archive
Read-only legacy content
17061 Discussions

list of vector instructions

aketh_t_
Beginner
514 Views

Hi all,

I am currently working on MIC (Xeon Phi), and I have four questions related to vectorization.

Is there a document (white paper) from Intel where I could find the list of instructions that can be performed on a vector?

I guess +, -, *, and / are supported,

but what about >=, <=, <, and >?

1) In short, would logical operators vectorize?

2) Would this code vectorize?

#pragma simd
for (i = 0; i < n; i++)
{
   a[i] = b[i] <= c;
}
 
3) How do constants affect vectorization? Are they detrimental to it?
 
i.e., is this code vectorizable?
 
#pragma simd
for (i = 0; i < n; i++)
{
   a[i] = (b[i] + 2) * c - q0;
}
 
4) How would operators applied to combinations of float and int affect vectorization?
 
 
0 Kudos
4 Replies
TimP
Honored Contributor III

You may be better off writing your own test code, if published ones like mine don't answer your questions.

Your first example should probably be vectorizable with certain data types and #pragma simd or #pragma vector always.

Your example with constants might be more efficient if rewritten so that it translates to a single FMA (fused multiply-add).

Mixed data types depend on the availability of suitable instructions.  Presumably a future MIC with AVX-512 will give more coverage.

McCalpinJohn
Honored Contributor III

If you want a full list of vector instructions on Xeon Phi you should look at the Xeon Phi Instruction Set Reference Manual (Intel document 327364), which is currently available at https://software.intel.com/sites/default/files/forum/278102/327364001en.pdf

The short answer is that the Xeon Phi ISA has a reasonably full set of vector instructions for vectors of 32-bit floats, vectors of 64-bit floats, and vectors of 32-bit integers.  For vectors of 64-bit integers the bitwise operations are supported, but not the standard arithmetic operations.  There is no support for vectors of 8-bit or 16-bit integers, or for vectors of any data type bigger than 64 bits.
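The summary above can be illustrated with three loops (function names are mine): 32-bit integer arithmetic and 64-bit bitwise operations have direct vector instructions on the first-generation Xeon Phi, while 64-bit integer multiply does not.

```c
#include <stdint.h>

void add_i32(int32_t *a, const int32_t *b, const int32_t *c, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];     /* vectorizable: 32-bit integer add */
}

void and_i64(int64_t *a, const int64_t *b, const int64_t *c, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = b[i] & c[i];     /* vectorizable: 64-bit bitwise AND */
}

void mul_i64(int64_t *a, const int64_t *b, const int64_t *c, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = b[i] * c[i];     /* no 64-bit vector multiply on KNC:
                                   likely left scalar or emulated */
}
```

All three are correct C everywhere; the difference is only in what the Xeon Phi vectorizer can do with each.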

Frances_R_Intel
Employee

Tim, you mentioned that you have published test codes that might be used to explore this issue. Could you post a pointer to it?

As John points out, you can look at the ISA for the coprocessor to see what vector instructions exist. This will give you part of your answer.

Aketh, you are specifically interested in whether the compiler can vectorize some particular operation or not. If you look in the instruction set manual and see an instruction which does just exactly what you want, then it is a good bet that the compiler will vectorize that code. However, remember that life is never as simple as your little test cases. Real code can bring in issues that cause the compiler to worry about aliasing and alignment, or about whether there will be enough work to make vectorization worthwhile, among other things. Also, just because there is no single vector instruction that does exactly what you want, that doesn't mean the compiler won't recognize a pattern in your code and cobble together multiple vector instructions that work for your particular loop.
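A minimal sketch of the aliasing point (the function is mine): without restrict the compiler must assume a and b might overlap and may refuse to vectorize; C99 restrict promises they do not. With the Intel compilers of this era, a vectorization report option (e.g. -vec-report2 for icc) shows whether each loop vectorized and, if not, why.

```c
/* restrict asserts that a and b do not alias, removing one of the
 * compiler's main reasons to keep this loop scalar. */
void saxpy(int n, float s, const float *restrict b, float *restrict a)
{
    for (int i = 0; i < n; i++)
        a[i] = s * b[i] + a[i];
}
```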

TimP
Honored Contributor III

Frances Roth (Intel) wrote:

Tim, you mentioned that you have published test codes that might be used to explore this issue. Could you post a pointer to it?


https://github.com/tprince/lcd contains the current "lcd" benchmark for Fortran (optimum mixture of F77/F95), Fortran 95, C, C++, and Cilk(tm) Plus (all with ifort or gfortran drivers).

Through the use of various directives (and, as John mentioned, the default signed int size), the compilers effectively vectorize practically all cases that can be expected to vectorize.  You could add cases using data types other than float/default real if those are of interest.

These cases demonstrate some peculiarities of Intel and gnu compilers with respect to vectorization.  The directives included work with Intel 14.0 through 16.0 compilers, the latter being less dependent on them in some cases.  Directives are not used for those cases where neither Intel nor gnu compilers need them, as they are sometimes counter-productive.

What I refer to as peculiarities in particular are those cases where Intel compilers need a non-standard pragma or directive (only a few such left in 16.0 beta compiler) or require changing pragma between MIC and host, for example.  Prior to 16.0, there were cases which icpc could vectorize only with source code modified to use std::max or min (some, with icpc 15.0, under omp simd), while gcc/g++ could vectorize only with fmaxf/fminf coupled with -ffast-math.

Cases which can vectorize only by breaking the floating-point exceptions model (that is, when the result of an arithmetic expression is used conditionally) will require #pragma [omp] simd or vector always.  It's not simply a question of whether appropriate SIMD instructions are available (although AVX2 and MIC made useful additions in this respect).  Setting the Intel option -fp-model strict, or normal gnu options, will prevent vectorization of these.  Microsoft compilers have no options to vectorize these other than writing intrinsics.

In case you are interested, the biggest peculiarity in gcc/g++ is the need to split files into code which requires -ffast-math to optimize (but that is too risky to use along with OpenMP) or can optimize with normal options equivalent to Intel -fp-model source along with OpenMP.  Intel -fp-model fast=1 is a better balance between standards compliance and performance than the gnu options, but it may need -assume protect_parens, which Intel offers only for Fortran, while gcc/g++ also offer an equivalent.  I don't know whether anyone has tested (unreleased) MIC hardware which could work with an optimized gcc, besides the gcc offered in MPSS which doesn't use VPU instructions.

I use __restrict throughout for C++ rather than depending on ivdep directives, which take the place of restrict but do so only in a compiler-dependent way. Of course, __restrict also lies outside the C++ standard, although it is the same as C99 restrict and the Fortran default.

Combined simd and openmp parallel optimizations are demonstrated where they are useful.  Some of the omp parallel simd combinations aren't useful on less than 12 cores, but can show an advantage for Intel compilers given enough cores.

You may have noticed a recent forum thread of Jim Dempsey's on vectorizing partial_sum() by intrinsics.  Although there is a gnu parallel implementation of it, that doesn't show any advantage in my tests, nor does icpc improve on g++.  I don't think any of the usual STL implementations are parallelized either with threads or simd.
