I wrote some code using AVX intrinsics, like this:
#include <immintrin.h>  // AVX intrinsics
__m256 x0 = _mm256_load_ps(f);      // load 8 floats from f (32-byte aligned)
__m256 y0 = _mm256_load_ps(f + 8);  // load the next 8 floats
__m256 z0 = _mm256_add_ps(x0, y0);  // packed single-precision add
_mm256_store_ps(s, z0);             // store 8 floats to s (32-byte aligned)
When I compile this, does it matter whether I use the -xAVX compiler option or not? I am using icpc 2013 on Linux.
As opposed to -mavx?
Sorry, I meant comparing -xavx against not using any option at all.
I compared the generated assembly code, and it seems that even without -xavx the AVX instructions are there. I wonder whether that is because I use AVX intrinsics like _mm256_add_ps.
So using -xAVX and not using it will generate the same code for those intrinsic functions, right? Thanks.
From the icpc documentation for the -x option:
"... may generate specialized code to run exclusively on processors indicated by <code> as described below.
...
AVX   May generate Intel(R) Advanced Vector Extensions (Intel(R) AVX), Intel(R) SSE4.2, SSE4.1, SSSE3, SSE3, SSE2, and SSE instructions for Intel(R) processors. Optimizes for a future Intel processor.
..."
You still haven't stated your purpose, but you could compare the preprocessed, .s, and .o files generated with your choices of options.
I didn't get a straight yes-or-no answer.
My purpose is to find out whether, for explicit AVX intrinsic functions, using -xAVX and not using this option will generate the same binary.
zlw wrote:
I wrote some code using AVX intrinsics, like this:
__m256 x0 = _mm256_load_ps(f);
__m256 y0 = _mm256_load_ps(f+8);
__m256 z0 = _mm256_add_ps(x0, y0);
_mm256_store_ps(s, z0);
When I compile this, does it matter whether I use the -xAVX compiler option or not? I am using icpc 2013 on Linux.
AFAIK, -xAVX vs. no flag will typically not change the code generated directly from the intrinsics, but it will ensure that the proper "zeroing" instructions (vzeroupper) are inserted around function calls to avoid the SSE-to/from-AVX transition penalties. Also, if you have regular C++ code without intrinsics, it will target AVX instead of SSE2, which is the default fallback. Even for scalar code it's better to use AVX instead of SSE, particularly when mixing scalar and vector code.
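To make the transition point concrete, here is a minimal sketch; the function names add8 and consume are my own, purely for illustration, and the commented vzeroupper placement is what bronxzv describes, not something verified here:

#include <immintrin.h>

void consume(const float* p);  // hypothetical external function, possibly built with legacy SSE code generation

void add8(const float* f, float* s)
{
    __m256 x0 = _mm256_load_ps(f);       // 32-byte-aligned load of 8 floats
    __m256 y0 = _mm256_load_ps(f + 8);
    _mm256_store_ps(s, _mm256_add_ps(x0, y0));
    // With -xAVX the compiler is expected to emit vzeroupper before this call
    // (and before returning), clearing the upper 128-bit halves of the YMM
    // registers so a callee using legacy SSE encodings pays no transition
    // penalty. Without the flag, that cleanup may be omitted.
    consume(s);
}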
Also note that the very same source code with intrinsics may generate a completely different binary with -xCORE-AVX2.
For example, a pair of _mm256_add_ps + _mm256_mul_ps may generate a single FMA instruction; it's actually a very convenient way to generate FMA code from the same source.
Another example: _mm256_loadu_ps will generate a series of instructions with -xAVX (to handle split loads) and a single instruction with -xCORE-AVX2.
Conclusion: it is sensible practice to specify -xAVX for Sandy Bridge (and Ivy Bridge) targets and -xCORE-AVX2 for Haswell.
It may be advantageous to specify -xCORE-AVX-I for Ivy Bridge, but I personally don't use it.
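A small sketch of the kind of code-generation differences described above; the function name madd8 is mine, for illustration, and the comments follow bronxzv's description rather than measured output:

#include <immintrin.h>

// Compiled e.g. as "icpc -xAVX ..." vs. "icpc -xCORE-AVX2 ...":
// - with -xAVX, each unaligned load may be split into several instructions and
//   the mul/add stay as separate vmulps + vaddps;
// - with -xCORE-AVX2, each load may become a single vmovups and the
//   mul + add pair may be contracted into one FMA instruction.
void madd8(const float* a, const float* b, float* acc)
{
    __m256 va = _mm256_loadu_ps(a);     // unaligned 256-bit load
    __m256 vb = _mm256_loadu_ps(b);
    __m256 vc = _mm256_loadu_ps(acc);
    __m256 t  = _mm256_mul_ps(va, vb);  // _mm256_mul_ps ...
    __m256 r  = _mm256_add_ps(t, vc);   // ... + _mm256_add_ps: FMA candidate
    _mm256_storeu_ps(acc, r);
}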
zlw wrote:
When I compile this, does it matter whether I use the -xAVX compiler option or not? I am using icpc 2013 on Linux.
I have written a very simple example showing the difference. I tested with the Windows version of the Intel C++ compiler (Intel(R) 64 Compiler XE Version 13.1.1.171 Build 20130); as you can see in the attached file, there may indeed be a huge difference.
Without the /QxAVX flag there is no final vzeroupper instruction, and 256-bit unaligned moves aren't split in two parts, unlike with the flag.
As you can also see, the code generated with /QxCORE-AVX2 is yet another variant, with vfmadd213ps used instead of vmulps + vaddps.
bronxzv wrote:
Also note that the very same source code with intrinsics may generate a completely different binary with -xCORE-AVX2.
For example, a pair of _mm256_add_ps + _mm256_mul_ps may generate a single FMA instruction; it's actually a very convenient way to generate FMA code from the same source.
Another example: _mm256_loadu_ps will generate a series of instructions with -xAVX (to handle split loads) and a single instruction with -xCORE-AVX2.
Conclusion: it is sensible practice to specify -xAVX for Sandy Bridge (and Ivy Bridge) targets and -xCORE-AVX2 for Haswell.
It may be advantageous to specify -xCORE-AVX-I for Ivy Bridge, but I personally don't use it.
Interesting points. I haven't seen CORE-AVX-I change code generation with respect to splitting loads, nor have I observed any performance gains over -xAVX. This is not to say that the Ivy Bridge platforms I've tried are representative of production-quality stability or performance.
After reading these informative posts, I still believe the OP would need to compare the generated code if for some reason s/he still wants to use AVX intrinsics without the matching compile option (perhaps to look for performance stalls or security loopholes associated with a missing vzeroupper?).
