Knowing whether AVX or AVX2 is used

Po-yen_C_Intel · ‎01-13-2016

Hi,

Recently I ran a VTune analysis on a machine that supports both AVX and AVX2 instructions.

In the VTune profile I saw the function __intel_avx_rep_memcpy. I later disassembled its binary and saw instructions like these:

14a3c3: c5 fe 6f be c0 00 00 vmovdqu 0xc0(%rsi),%ymm7
14a3ca: 00
14a3cb: c5 7e 6f 86 e0 00 00 vmovdqu 0xe0(%rsi),%ymm8
14a3d2: 00
14a3d3: c5 fd 7f 0f           vmovdqa %ymm1,(%rdi)
14a3d7: c5 fd 7f 57 20        vmovdqa %ymm2,0x20(%rdi)
14a3dc: c5 fd 7f 5f 40        vmovdqa %ymm3,0x40(%rdi)

I know that an instruction starting with "v" is an AVX instruction, but how do I tell whether I am using AVX or AVX2 instructions?

Thank you.

Po-Yen Chou

TimP · ‎01-13-2016

AVX2 simply added some instructions to the AVX ISA, notably the more general vperm instructions and a bunch of 256-bit integer ones. The FMA instructions (the vfmadd family) are formally a separate extension, but they arrived on the same processors. Code compiled for AVX2 will still use a large number of AVX instructions. Sometimes identical code would be chosen for AVX or AVX2, and functions like your avx memcpy may be used in either case.

If the compiler is targeting AVX, it may be more likely to use 128-bit mov instructions (those using xmm registers) where misalignment is expected, while AVX2 targets handle misaligned 256-bit instructions (ymm registers) more favorably.
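
One rough way to see which register width a piece of code favors is to count xmm versus ymm operands in its disassembly. The Python sketch below assumes AT&T-syntax objdump output like the listing in the first post. The function name and the third sample line are made up for illustration.

```python
import re
from collections import Counter

# Matches AT&T-syntax vector register names such as %xmm3 or %ymm12.
VECTOR_REG = re.compile(r"%([xyz]mm)\d+")
WIDTH_BITS = {"xmm": 128, "ymm": 256, "zmm": 512}


def vector_width_histogram(disassembly: str) -> Counter:
    """Count instructions by the widest vector register each one touches."""
    counts = Counter()
    for line in disassembly.splitlines():
        regs = VECTOR_REG.findall(line)
        if regs:
            counts[max(WIDTH_BITS[r] for r in regs)] += 1
    return counts


# First two lines come from the listing above; the last one is a
# hypothetical 128-bit load added only to show the contrast.
sample = """\
14a3c3: c5 fe 6f be c0 00 00 vmovdqu 0xc0(%rsi),%ymm7
14a3d3: c5 fd 7f 0f           vmovdqa %ymm1,(%rdi)
14a3e0: c5 fa 6f 06           vmovdqu (%rsi),%xmm0
"""

if __name__ == "__main__":
    for width, n in sorted(vector_width_histogram(sample).items()):
        print(f"{width}-bit: {n} instruction(s)")
```

A histogram dominated by 128-bit entries in code that handles possibly misaligned data would be consistent with an AVX target, but it is only a hint, not proof of the flags used.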

Po-yen_C_Intel · ‎01-13-2016

Hi Tim,

Thank you for the explanation. In that case, can I determine this from the compile command? I remember using the -axAVX flag when compiling with icpc. Does that mean the compiler will use AVX instead of AVX2?

Bernard · ‎01-14-2016

To check for the presence of AVX2-era instructions such as the FMA family (VFMADD*), write code such as d = a*b+c, which can benefit from a fused multiply-add instruction, then compile it and disassemble the binary.

I was surprised when ifort emitted VFMA* instructions while compiling the Fortran 77 library RKSUITE.

MarkC_Intel · ‎01-14-2016

FWIW, if you download Intel® XED (http://www.intel.com/software/xed) and you disassemble your object file or executable (xed -i foo.exe > dis), there will be a column for each instruction that indicates if instructions are AVX, or AVX2, etc.

McCalpinJohn · ‎01-14-2016

I don't know how it might work on Windows, but on Linux systems the Intel compiler will generate code in the prolog to the executable that compares the ISA selection from the compilation to the ISA support (both HW and SW) on the current platform. If the executable was compiled with "base" ISA options that the HW/SW does not support, the job will abort with a message that (fairly clearly) describes the inconsistency.

This is far better than aborting on an "illegal instruction" fault, since differences in the input data may cause the illegal code path to not be executed during testing.
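
To illustrate the idea of checking up front rather than faulting later, here is a minimal Python sketch that compares a list of required features against the `flags` line of Linux's /proc/cpuinfo. It is not what the Intel compiler does internally (that check is built into the executable and uses CPUID); the function names and the abort message are my own assumptions.

```python
def cpu_flags(cpuinfo_text: str) -> set:
    """Return the feature flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def missing_features(required, cpuinfo_text: str) -> list:
    """List the required features that the platform does not report."""
    return sorted(set(required) - cpu_flags(cpuinfo_text))


def check_or_abort(required, cpuinfo_path="/proc/cpuinfo"):
    """Exit with a clear message if a required feature is absent (Linux only)."""
    with open(cpuinfo_path) as f:
        missing = missing_features(required, f.read())
    if missing:
        raise SystemExit(
            "This program requires CPU features that are not available: "
            + ", ".join(missing)
        )


# Trimmed, made-up cpuinfo excerpt for a processor with AVX but not AVX2/FMA.
SAMPLE_CPUINFO = """\
processor\t: 0
model name\t: Example CPU
flags\t\t: fpu sse sse2 ssse3 sse4_1 sse4_2 avx
"""

if __name__ == "__main__":
    print(missing_features(["avx", "avx2", "fma"], SAMPLE_CPUINFO))
```

Calling `check_or_abort(["avx", "avx2", "fma"])` at program start would stop with a readable message on a machine that lacks those features, instead of depending on whether the input data happens to reach an unsupported instruction.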

Tim P. alluded to this above, but the Intel target ISA options are used both to define the ISA and to define the target platform for tuning. Under gcc these can be specified separately using "-march" and "-mtune", respectively, if I recall correctly. So the target "-xAVX" will both select the AVX instruction set *and* tune for the particular performance characteristics of the Sandy Bridge/Ivy Bridge processor cores. Similarly, the AVX2 target (spelled "-xCORE-AVX2" in the Intel compilers) will select the AVX2 instruction set *and* tune for the performance characteristics of the Haswell/Broadwell processor cores.

Code compiled with "-xAVX" will certainly run on a Haswell/Broadwell processor, but for this target the compiler will often choose 128-bit vectors over 256-bit vectors for data that is not guaranteed to be 32-byte-aligned. This is because Sandy Bridge/Ivy Bridge have relatively high penalties for unaligned 256-bit loads/stores (compared to 128-bit loads/stores) and only a small benefit in load/store performance for aligned 256-bit loads/stores (compared to 128-bit loads/stores).

Sometimes it is hard to tell if an instruction is an AVX or AVX2 instruction -- I usually have a copy of Volume 2 of the Intel Architecture Software Developer's Manual open on one of my monitors so I can look it up. The easy differences to remember are:

Processors supporting AVX2 also support FMA instructions.
AVX2 includes 256-bit packed integer instructions (AVX only supported 32-bit and 64-bit floating-point in the 256-bit instructions)

The permute instructions (VPERM*) are harder to remember because AVX includes several permute instructions (with more limited functionality), while the more general permutation instructions are limited to AVX2. There are also several special cases that are not easy to remember because the same mnemonic exists in AVX and AVX2, but AVX2 supports more options on what types of operands are allowed. For example, the VBROADCASTSD instruction broadcasts a 64-bit (double-precision) floating-point value across the 4 fields of a 256-bit register. The AVX version only allows the input to be from memory, while AVX2 systems allow the input to be either a memory location or (the low 64 bits of) an AVX register.
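
The rules of thumb above can be turned into a rough classifier. This Python sketch is a deliberately incomplete heuristic, assuming VEX-encoded instructions in AT&T syntax and no AVX-512. The function name and mnemonic tables are my own, and the manual or XED remains the authority. In particular it misses AVX2-only instructions that operate on xmm registers.

```python
# Permutes (and one integer-named test) that already exist in AVX.
AVX_EXCEPTIONS = {"vpermilps", "vpermilpd", "vperm2f128", "vptest"}
# Mnemonics that exist only in AVX2, whatever the register width.
AVX2_ONLY = {
    "vpermd", "vpermq", "vpermps", "vpermpd", "vperm2i128",
    "vbroadcasti128", "vinserti128", "vextracti128",
}
AVX2_PREFIXES = ("vpbroadcast", "vgather", "vpgather")
FMA_PREFIXES = ("vfmadd", "vfmsub", "vfnmadd", "vfnmsub")


def classify(mnemonic: str, operands: str = "") -> str:
    """Guess the ISA extension of one VEX-encoded instruction."""
    m = mnemonic.lower()
    if not m.startswith("v"):
        return "non-VEX"
    if m.startswith(FMA_PREFIXES):
        return "FMA"
    if m in AVX_EXCEPTIONS:
        return "AVX"
    if m in AVX2_ONLY or m.startswith(AVX2_PREFIXES):
        return "AVX2"
    if m in ("vbroadcastss", "vbroadcastsd"):
        # AT&T syntax lists the source first; a register source needs AVX2.
        source = operands.split(",")[0].strip()
        return "AVX2" if source.startswith("%xmm") else "AVX"
    if m.startswith("vp") and "%ymm" in operands:
        # 256-bit packed-integer arithmetic arrived with AVX2.
        return "AVX2"
    return "AVX"


if __name__ == "__main__":
    for mnem, ops in [
        ("vmovdqu", "0xc0(%rsi),%ymm7"),
        ("vpaddd", "%ymm1,%ymm2,%ymm3"),
        ("vbroadcastsd", "%xmm0,%ymm1"),
        ("vfmadd231pd", "%ymm1,%ymm2,%ymm3"),
    ]:
        print(f"{mnem:14s} {classify(mnem, ops)}")
```

Under this heuristic the vmovdqu/vmovdqa instructions from the original listing classify as plain AVX, since 256-bit integer loads and stores already exist in AVX.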

Bernard · ‎01-14-2016

@John,

IIRC, and if my assumption is not wrong, the same ISA-checking code is inserted by the compiler into Windows executables.

TimP · ‎01-14-2016

Intel's Advisor tool categorizes compiled loops according to whether they use AVX2 instructions. Even when the compilation was done by the Intel compiler with AVX2 code generation, it will report AVX, for example, if there was no occasion to use the additional AVX2 instructions.

Intel compilers on Windows have much the same multiple code path options as John describes for Linux. When the -ax option is set for the main program, an internal flag is set during program initialization for later use in selecting among multiple code paths. It relies on an undocumented combination of CPU feature flags and CPU model numbers. You may be able to infer which flags were set by examining the code path versions, but this is written out explicitly only when you save the .asm file (the Windows .asm file is less useful than the Linux .s).

The choice of language and standards version doesn't directly affect the choice of ISA. There are lower-level implications of array assignments: Fortran 90 array assignments tend to resolve suspected data overlaps by allocating a temporary result array and copying it afterward, so that parallel SIMD can be used safely (but sometimes with a performance penalty), while CEAN (the C/C++ array notation extension) in Intel compilers implies the most aggressive ivdep and ignore-exception settings so as to use SIMD vectorization.

As John said, gcc, g++ and gfortran offer possibilities such as setting -march to an AVX target and -mtune to an AVX2 target (for example -march=sandybridge -mtune=haswell), which might involve more use of 256-bit instructions than an AVX target alone.

Po-yen_C_Intel · ‎01-14-2016

Hi,

Thanks for supplying such helpful answers, especially those from Tim and John! It took me a while to Google some of the terms you mentioned and understand the whole picture.

Thank you so much

Po-Yen Chou
