With _mm256 instructions, does it matter whether I pass -xAVX to the compiler?

missing__zlw · ‎05-10-2013

I wrote some AVX instructions like:

    __m256 x0 = _mm256_load_ps(f);
    __m256 y0 = _mm256_load_ps(f+8);
    __m256 z0 = _mm256_add_ps(x0, y0);
    _mm256_store_ps( s, z0);

When I compile this, does it matter whether I use the -xAVX compiler option or not? I am using icpc 2013 on Linux.

TimP · ‎05-11-2013

As opposed to -mavx?

Bernard · ‎05-11-2013

Maybe this link will be helpful: software.intel.com/en-us/articles/how-to-compile-for-intel-avx

missing__zlw · ‎05-11-2013

Sorry, I meant using -xavx versus not using any option.

I compared the generated assembly code, and it seems that even without -xavx the AVX assembly code is there. I wonder whether that is because I use AVX instructions like _mm256_add_ps.

SergeyKostrov · ‎05-11-2013

>>...I compared the generated assembly code and it seems even without -xavx, the AVX assemble code are there.
>>I wonder whether it is because I use AVX instruction like mm256_add...

Intel C++ compiler should compile it without additional command line options. Another thing is you're actually talking about intrinsic functions, not instructions:

    ...
    /*
     * Add Packed Double Precision Floating-Point Values
     * **** VADDPD ymm1, ymm2, ymm3/m256
     * Performs an SIMD add of the four packed double-precision floating-point
     * values from the first source operand to the second source operand, and
     * stores the packed double-precision floating-point results in the
     * destination
     */
    extern __m256d __ICL_INTRINCC _mm256_add_pd( __m256d, __m256d );
    ...

missing__zlw · ‎05-11-2013

So using -xAVX and not using it will generate the same code for those intrinsic functions, right? Thanks

SergeyKostrov · ‎05-11-2013

You could look at the compiler options to verify what it actually does. For example, the help of the Intel C++ compiler for Windows shows:

    ...
    Code Generation

    /Qx<code>
              generate specialized code to run exclusively on processors
              indicated by <code> as described below
    ...
                AVX     May generate Intel(R) Advanced Vector Extensions (Intel(R)
                        AVX), Intel(R) SSE4.2, SSE4.1, SSSE3, SSE3,
                        SSE2, and SSE instructions for Intel(R) processors.
                        Optimizes for a future Intel processor.
    ...
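The help text describes what the option allows the compiler to generate, not what happens to explicit intrinsics. One hedged way to see whether a translation unit is being compiled with AVX as its target is to test the predefined macros `__AVX__` and `__AVX2__`, which GCC and Clang define under `-mavx` / `-mavx2`; the assumption here is that the Intel compiler defines them in the same way under `-xAVX` / `-xCORE-AVX2`. The function name `simd_target` is made up for this sketch.

```cpp
#include <string>

// Hypothetical helper: reports which instruction set the compiler was told
// to target for this translation unit, based on predefined macros.
// Assumption: the compiler defines __AVX__ / __AVX2__ when the matching
// option (-xAVX, -mavx, -xCORE-AVX2, -mavx2, ...) is in effect.
std::string simd_target() {
#if defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#else
    return "baseline";  // e.g. the SSE2 default on x86-64
#endif
}
```

If this returns "baseline" while the file still compiles `_mm256_*` intrinsics, the intrinsics are being accepted without AVX being the target for the rest of the code, which is the situation the question is about.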

TimP · ‎05-12-2013

You still haven't stated your purpose, but you could compare the preprocessed source and the .s and .o files generated with your choices of options.

missing__zlw · ‎05-12-2013

I don't get a straight yes or no answer.

My purpose is to find out, for explicit AVX intrinsic functions, whether compiling with -xAVX and compiling without this option will generate the same binary.

bronxzv · ‎05-13-2013

zlw wrote:

I wrote some AVX instructions like:

    __m256 x0 = _mm256_load_ps(f);
    __m256 y0 = _mm256_load_ps(f+8);
    __m256 z0 = _mm256_add_ps(x0, y0);
    _mm256_store_ps( s, z0);

When I compile this, does it matter whether I use the -xAVX compiler option or not? I am using icpc 2013 on Linux.

AFAIK, -xAVX vs. no flag will typically not change the generated code that maps directly to the intrinsics, but it will ensure that the proper "zeroing" instructions are inserted when calling functions, to avoid the SSE to/from AVX transition penalties. Also, if you have regular C++ code without intrinsics, it will target AVX instead of SSE2, which is the default fallback. Even for scalar code it's better to use AVX instead of SSE, particularly when mixing scalar and vector code.
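As a minimal sketch of the "zeroing" point, the snippet from the question can be wrapped in a function that clears the upper halves of the YMM registers by hand with `_mm256_zeroupper()` (the intrinsic for `vzeroupper`). The function name `add8` and the `AVX_TARGET` macro are assumptions made for this example; the macro is only there so that GCC and Clang accept the intrinsics without `-mavx`, which the Intel compiler reportedly does not need.

```cpp
#include <immintrin.h>

// Assumption: GCC/Clang need a per-function target attribute to accept AVX
// intrinsics when no AVX option is given; other compilers get an empty macro.
#if (defined(__GNUC__) || defined(__clang__)) && !defined(__INTEL_COMPILER)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

// Hypothetical helper: s[i] = f[i] + f[i + 8] for i in 0..7.
// Both f and s must be 32-byte aligned, as _mm256_load_ps/_mm256_store_ps require.
AVX_TARGET
void add8(const float* f, float* s) {
    __m256 x0 = _mm256_load_ps(f);
    __m256 y0 = _mm256_load_ps(f + 8);
    __m256 z0 = _mm256_add_ps(x0, y0);
    _mm256_store_ps(s, z0);
    // Explicitly clear the upper 128 bits of all YMM registers before
    // returning to code that may use legacy SSE encodings.
    _mm256_zeroupper();
}
```

Writing the call by hand makes the function independent of whether the compiler inserts `vzeroupper` itself; comparing the assembly produced with and without -xAVX would show whether a second one is added.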

also note that the very same source code with intrinsics may generate a completely different binary with -xCORE-AVX2

for example a pair of _mm256_add_ps + _mm256_mul_ps may generate a single FMA instruction, it's actually a very convenient way to generate FMA code from the same source

another example: _mm256_loadu_ps will generate a series of instructions with -xAVX (to handle split loads) and a single instruction with -xCORE-AVX2
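A small test function in the spirit of the two examples above could look like the following sketch. It uses unaligned loads plus a `_mm256_mul_ps` / `_mm256_add_ps` pair, so its assembly can be compared under no flag, -xAVX and -xCORE-AVX2. The name `mul_add8` and the `AVX_TARGET` macro are assumptions for illustration; whether a given compiler version splits the loads or fuses the pair into an FMA is exactly what the comparison would have to show.

```cpp
#include <immintrin.h>

// Assumption: GCC/Clang need a per-function target attribute to accept AVX
// intrinsics when no AVX option is given; other compilers get an empty macro.
#if (defined(__GNUC__) || defined(__clang__)) && !defined(__INTEL_COMPILER)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

// Hypothetical helper: out[i] = a[i] * b[i] + c[i] for i in 0..7.
// No alignment requirement, since only unaligned loads/stores are used.
AVX_TARGET
void mul_add8(const float* a, const float* b, const float* c, float* out) {
    __m256 va = _mm256_loadu_ps(a);   // candidate for split 128-bit loads
    __m256 vb = _mm256_loadu_ps(b);
    __m256 vc = _mm256_loadu_ps(c);
    __m256 prod = _mm256_mul_ps(va, vb);
    __m256 sum = _mm256_add_ps(prod, vc);  // mul + add: candidate for FMA contraction
    _mm256_storeu_ps(out, sum);
    _mm256_zeroupper();
}
```

Note that a fused multiply-add rounds once instead of twice, so results can differ in the last bit between the fused and unfused variants for inputs that are not exactly representable.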

conclusion: it is sensible practice to specify -xAVX for Sandy Bridge (and Ivy Bridge) targets and -xCORE-AVX2 for Haswell

it may be advantageous to specify -xCORE-AVX-I for Ivy Bridge but I personally don't use it

SergeyKostrov · ‎05-13-2013

>>...My purpose is to find for explicit AVX intrinsic functions, whether using -xAVX and not using this options
>>will generate same binary.

Take a look at the immintrin.h header file. There are 2801 lines of code and more than 250 intrinsic functions with the prefix _mm256_..._...( ... ). So it is simply a matter of time to create a test case for all of them, and that will answer your question with 100% certainty.

bronxzv · ‎05-14-2013

zlw wrote:

When I compile this, does it matter whether I use the -xAVX compiler option or not? I am using icpc 2013 on Linux.

I have written a very simple example showing the difference. I tested with the Windows version of the C++ compiler (Intel(R) 64 Compiler XE Version 13.1.1.171 Build 20130). As you can see in the attached file, there may indeed be a huge difference.

without the /QxAVX flag there is no final vzeroupper instruction and 256-bit unaligned moves aren't split in two parts, unlike with the flag

as you can also see the code generated with /QxCORE-AVX2 is yet another variant with vfmadd213ps used instead of vmulps + vaddps

TimP · ‎05-14-2013

bronxzv wrote:

also note that the very same source code with intrinsics may generate a completely different binary with -xCORE-AVX2

for example a pair of _mm256_add_ps + _mm256_mul_ps may generate a single FMA instruction, it's actually a very convenient way to generate FMA code from the same source

another example: _mm256_loadu_ps will generate a series of instructions with -xAVX (to handle split loads) and a single instruction with -xCORE-AVX2

conclusion: it is sensible practice to specify -xAVX for Sandy Bridge (and Ivy Bridge) targets and -xCORE-AVX2 for Haswell

it may be advantageous to specify -xCORE-AVX-I for Ivy Bridge but I personally don't use it

Interesting points. I haven't seen CORE-AVX-I change code generation away from splitting loads, nor have I observed any performance gains over CORE-AVX. This is not to say that the Ivy Bridge platforms I've tried are representative of production-quality stability or performance.

After reading these informative posts, I still believe the OP would need to compare the code generation if, for some reason, s/he still wants to use AVX intrinsics without the matching compile option (perhaps to look for performance stalls or security loopholes associated with a missing vzeroupper?).