Hi,
I have a piece of code that I cannot disclose right now (I will try to reproduce it in a shorter example). The thing is, when I compile it with /QxAVX, it generates this code:
Address     Line  Assembly                              Clockticks (Total / Self)  Instr. Retired (Total / Self)  CPI
(The remaining per-instruction microarchitecture metric columns of the VTune grid, almost all 0.000, are omitted here and in the second listing below.)
0x100b931e  962   vmovdqu xmm5, xmmword ptr [ebx]       0.0% / 4,983,341           0.0% / 6,386,398               0.780
0x100b9322  962   vmovdqu xmmword ptr [esp+0xa0], xmm5  1.8% / 554,297,692         1.4% / 698,936,006             0.793
0x100b932b  962   vmovdqu xmm0, xmmword ptr [ebx+0x10]  0.1% / 22,394,000          0.1% / 28,464,010              0.787
0x100b9330  962   vmovdqu xmmword ptr [esp+0xb0], xmm0  0.0% / 6,031,618           0.0% / 8,129,718               0.742
0x100b9339  962   test esi, esi                         0.0% / 14,815,452          0.0% / 16,931,739              0.875
0x100b933b  962   jz 0x100b941e <Block 50>              0.0% / 1,904,153           0.0% / 2,000,911               0.952
0x100b9341        Block 49:
0x100b9341  962   mov edi, dword ptr [esp+0x54]
0x100b9345  962   vpaddw xmm3, xmm5, xmm4               0.0% / 3,352,093           0.0% / 4,271,375               0.785
0x100b9349  962   vpaddw xmm2, xmm0, xmm4               0.0% / 2,928,452           0.0% / 3,751,111               0.781
0x100b934d  962   mov eax, dword ptr [edi+0x4]          0.0% / 1,376,764           0.0% / 1,987,314               0.693
0x100b9350  962   vpextrw edi, xmm3, 0x0
0x100b9355  962   vmovdqu xmm7, xmmword ptr [esp+0x80]  0.0% / 4,592,970           0.0% / 5,534,012               0.830
0x100b935e  962   vmovdqu xmm1, xmmword ptr [esp+0x90]
When I compile it with /QxSSE4.1 /QaxAVX, the AVX code path looks like this:
Address     Line  Assembly                              Clockticks (Total / Self)  Instr. Retired (Total / Self)  CPI
0x100b9594        Block 47:
0x100b9594  962   mov dword ptr [esp+0x8], eax          0.0% / 2,106,301           0.0% / 2,485,774               0.847
0x100b9598  962   xor edx, edx
0x100b959a  962   mov esi, dword ptr [esp+0x28]
0x100b959e        Block 48:
0x100b959e  962   movdqu xmm5, xmmword ptr [ebx]        0.0% / 6,178,264           0.0% / 7,860,224               0.786
0x100b95a2  962   movdqa xmmword ptr [esp+0xa0], xmm5   1.7% / 518,966,716         1.3% / 644,970,879             0.805
0x100b95ab  962   movdqu xmm0, xmmword ptr [ebx+0x10]   0.1% / 18,849,972          0.0% / 22,961,024              0.821
0x100b95b0  962   movdqa xmmword ptr [esp+0xb0], xmm0   0.0% / 11,067,064          0.0% / 14,796,040              0.748
0x100b95b9  962   test esi, esi                         0.0% / 13,519,245          0.0% / 16,668,546              0.811
0x100b95bb  962   jz 0x100b96a1 <Block 50>              0.0% / 1,146,115           0.0% / 1,366,417               0.839
0x100b95c1        Block 49:
0x100b95c1  962   mov edi, dword ptr [esp+0x54]         0.0% / 7,793,002           0.0% / 9,372,588               0.831
0x100b95c5  962   vzeroupper
0x100b95c8  962   vpaddw xmm3, xmm5, xmm4               0.0% / 11,755,160          0.0% / 15,379,015              0.764
0x100b95cc  962   vpaddw xmm2, xmm0, xmm4               0.0% / 2,202,864           0.0% / 2,551,664               0.863
0x100b95d0  962   mov eax, dword ptr [edi+0x4]          0.0% / 1,229,030           0.0% / 1,534,467               0.801
0x100b95d3  962   vpextrw edi, xmm3, 0x0                0.0% / 4,683,935           0.0% / 6,584,619               0.711
0x100b95d8  962   vmovdqa xmm7, xmmword ptr [esp+0x80]  0.0% / 5,390,732           0.0% / 7,425,202               0.726
0x100b95e1  962   vmovdqa xmm1, xmmword ptr [esp+0x90]  0.0% / 4,387,615           0.0% / 5,483,645               0.800
Note that I had to insert the _mm256_zeroupper() manually to work around the huge penalty caused by this generated code.
I think in that case the compiler should only generate VEX-encoded instructions...
I'll try to reproduce it, but you really should investigate it.
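To make the problem concrete, here is a minimal sketch of the pattern involved and of where the manual _mm256_zeroupper() has to go (hypothetical function and variable names, not my real code):

#include <immintrin.h>

// Hypothetical illustration: a 256-bit AVX section followed by 128-bit SSE
// intrinsics that the compiler may emit with legacy (non-VEX) encodings.
void MixedKernel(float *dst, float const *src, short *fx, int n /* multiple of 8 */)
{
    for (int i = 0; i < n; i += 8) {
        __m256 v = _mm256_loadu_ps(src + i);        // dirties the ymm upper halves
        _mm256_storeu_ps(dst + i, _mm256_add_ps(v, v));
    }

    _mm256_zeroupper(); // manual workaround: clear the upper halves before any
                        // legacy-encoded SSE instruction executes, otherwise the
                        // hardware pays a costly AVX<->SSE state transition

    for (int i = 0; i < n; i += 8) {
        __m128i w = _mm_loadu_si128((__m128i const *)(fx + i));
        w = _mm_add_epi16(w, _mm_set1_epi16(1));    // may be emitted non-VEX under /QaxAVX
        _mm_storeu_si128((__m128i *)(fx + i), w);
    }
}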
Best regards
- Tags:
- Intel® Advanced Vector Extensions (Intel® AVX)
- Intel® Streaming SIMD Extensions
- Parallel Computing
Here is a code snippet that reproduces the suspected bug.
#include <immintrin.h>
#include <malloc.h> // _aligned_malloc
#include <stdlib.h>
#include <stdio.h>

typedef unsigned short int16_t;

__declspec(noinline) static bool SomeFunction(int16_t * outputPtr, int16_t const * inputPtr)
{
    for (int x = 0; x < 16; x++)
    {
        __m256 input = _mm256_loadu_ps((float *)(inputPtr + 16 * x));
        __m128i in_low  = _mm_castps_si128(_mm256_extractf128_ps(input, 0));
        __m128i in_high = _mm_castps_si128(_mm256_extractf128_ps(input, 1));
        __m128i out_low  = _mm_add_epi16(in_low, _mm_set1_epi16(1));
        __m128i out_high = _mm_add_epi16(in_low, _mm_set1_epi16(1)); // note: in_low in both lanes; the assembly below matches this
        __m256 output = _mm256_insertf128_ps(
            _mm256_insertf128_ps(_mm256_undefined_ps(), _mm_castsi128_ps(out_low), 0),
            _mm_castsi128_ps(out_high), 1);
        _mm256_store_ps((float *)(outputPtr + 16 * x), output);
    }
    return true;
}

int main(int argc, char ** argv)
{
    int16_t * dest = (int16_t *)_aligned_malloc(65536 * sizeof(int16_t), 32);
    int16_t * src  = (int16_t *)_aligned_malloc(65536 * sizeof(int16_t), 32);

    // Prevent input optimisations
    for (int i = 0; i < 32768; i++)
        src[i] = rand();

    SomeFunction(dest, src);

    // Prevent output optimisations
    for (int i = 0; i < 32768; i++)
        printf("%d\n", dest[i]);
}
Here is the assembly code generated with /QxAVX for SomeFunction:
01271090  push        esi
01271091  sub         esp,18h
01271094  xor         ecx,ecx
01271096  vmovdqu     xmm0,xmmword ptr [___xi_z+34h (1273120h)]
0127109E  mov         esi,eax
012710A0  xor         eax,eax
012710A2  vmovups     xmm1,xmmword ptr [eax+edx]
012710A7  inc         ecx
012710A8  vinsertf128 ymm2,ymm1,xmmword ptr [eax+edx+10h],1
012710B0  vpaddw      xmm4,xmm2,xmm0
012710B4  vinsertf128 ymm5,ymm4,xmm4,1
012710BA  vmovups     ymmword ptr [eax+esi],ymm5
012710BF  add         eax,20h
012710C2  cmp         ecx,10h
012710C5  jl          SomeFunction+12h (12710A2h)
012710C7  vzeroupper
012710CA  add         esp,18h
012710CD  pop         esi
012710CE  ret
Now with /QxSSE4.1 /QaxAVX (in the AVX code path):
SomeFunction:
008D1090  push        esi
008D1091  sub         esp,18h
008D1094  xor         ecx,ecx
008D1096  movdqa      xmm0,xmmword ptr [___xi_z+34h (8D3120h)]
008D109E  mov         esi,eax
008D10A0  xor         eax,eax
008D10A2  vmovups     ymm1,ymmword ptr [eax+edx]
008D10A7  inc         ecx
008D10A8  vpaddw      xmm3,xmm1,xmm0
008D10AC  vinsertf128 ymm4,ymm3,xmm3,1
008D10B2  vmovaps     ymmword ptr [eax+esi],ymm4
008D10B7  add         eax,20h
008D10BA  cmp         ecx,10h
008D10BD  jl          SomeFunction+12h (8D10A2h)
008D10BF  add         esp,18h
008D10C2  pop         esi
008D10C3  ret
There SHOULD not be a MOVDQA but a VMOVUPS (or VMOVDQA) when loading the constant from _mm_set1_epi16(1) into a register.
If I change SomeFunction by removing the for loop (so there is no hoisting of the _mm_set1_epi16(1) constant out of the loop), the generated code is normal again:
SomeFunction:
01101090  sub         esp,1Ch
01101093  vmovups     ymm0,ymmword ptr [edx]
01101097  vpaddw      xmm2,xmm0,xmmword ptr [___xi_z+34h (1103120h)]
0110109F  vinsertf128 ymm3,ymm2,xmm2,1
011010A5  vmovaps     ymmword ptr [eax],ymm3
011010A9  add         esp,1Ch
011010AC  ret
This bug is really annoying, especially since /QaxAVX is mandatory when we want to mix SSE4 and AVX code in the same binary: with current versions of the Intel Compiler it is not possible to compile just one .cpp file with /QxAVX without "border effects" (it pollutes with AVX every STL class, for instance, or any class shared between the two .cpp files).
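(For reference, one way to confine AVX to specific functions instead of whole files is the compiler's manual CPU dispatch. A rough sketch, assuming the __declspec(cpu_dispatch)/__declspec(cpu_specific) keywords of the Windows Intel Compiler; function names are hypothetical:)

#include <immintrin.h>

// Empty stub; the compiler replaces its body with the dispatch logic.
__declspec(cpu_dispatch(generic, core_2nd_gen_avx))
void AddOne(short *dst, short const *src) {}

// Baseline variant, legal on any processor.
__declspec(cpu_specific(generic))
void AddOne(short *dst, short const *src)
{
    for (int i = 0; i < 8; i++)
        dst[i] = (short)(src[i] + 1);
}

// AVX (Sandy Bridge class) variant.
__declspec(cpu_specific(core_2nd_gen_avx))
void AddOne(short *dst, short const *src)
{
    __m128i v = _mm_add_epi16(_mm_loadu_si128((__m128i const *)src),
                              _mm_set1_epi16(1));
    _mm_storeu_si128((__m128i *)dst, v);
}

This does not remove the transition penalties by itself, but it keeps the AVX-compiled code out of shared translation units.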
It took some time to reproduce it, so I hope it will be taken into consideration.
Best Regards
And I almost forgot:
The workaround is to insert _mm256_zeroupper() calls at places that depend on where the AVX and SSE instructions happen to be emitted (so it is a bit empirical...).
But really, when using intrinsics with the Intel Compiler, we shouldn't have to add these (plus vzeroupper throws away the upper halves of the ymm registers, data the processor might otherwise have kept and must now reload).
Hello Emma,
You are right.
As to "it's not possible with current versions of the Intel Compiler to compile just one .cpp file with /QxAVX without 'border effects' (since it pollutes with AVX every STL class, for instance, or any class shared between the two .cpp files)":
When using -xavx or /arch:AVX, it is known that "A disadvantage of this method is that it requires access to the relevant source files, so it cannot avoid AVX-SSE transitions resulting from calls to functions that are not compiled with the –xavx or –mavx flag. Another possible disadvantage is that all Intel® SSE code within a file compiled with the –xavx or –mavx flag will be converted to VEX format and will only run on Intel® AVX supported processors." (https://software.intel.com/sites/default/files/m/d/4/1/d/8/11MC12_Avoiding_2BAVX-SSE_2BTransition_2BPenalties_2Brh_2Bfinal.pdf)
The Intel compiler, when /arch:AVX is set so that AVX intrinsics are supported, generates equivalent AVX-128 code from SSE intrinsics, so there should be no transition penalty. The "There SHOULD not be a MOVDQA but a VMOVUPS" case therefore looks like the compiler failing to perform this conversion; I will investigate it and file an internal bug report if it is confirmed.
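As a small illustration of the expected behaviour (my own example; the encodings in the comments are what should be generated, not what you observed):

#include <emmintrin.h>

// Pure SSE2 intrinsics. Compiled with /arch:AVX or /QxAVX, this is expected
// to come out VEX-encoded, e.g.
//     vpaddw xmm0, xmm0, xmm1
// rather than the legacy encoding
//     paddw  xmm0, xmm1
// so no AVX<->SSE state transition should be possible.
__m128i AddOne128(__m128i v)
{
    return _mm_add_epi16(v, _mm_set1_epi16(1));
}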
Thank you.
--
QIAOMIN.Q
Intel Developer Support
Please participate in our redesigned community support web site:
User forums: http://software.intel.com/en-us/forums/
Hi,
I find your example code interesting, as I've kind of assumed, possibly wrongly, that it's bad to mix SSE with AVX code.
If I don't have all the functionality I want in AVX, then I use only SSE for the whole piece of code; I never mix, precisely because of issues like the SSE performance penalty you get if you don't clear the AVX upper registers.
Hi Richard,
I have code mixing float and fixed-point precision, and there mixing AVX and SSE offers a very good performance improvement that I cannot give up by moving back to SSE4.1 only.
The problem is that I am trying to ship a single binary with both an SSE4.1-optimized code path and an SSE4.1/AVX hybrid code path (a customer requirement), which is tedious (it's not my first problem with the compiler). FYI, the hybrid code is 20% faster, to give you an idea (once all the penalties are fixed, of course).
And this allowed me to notice that the /QxSSE4.1 /QaxAVX combination is very much flawed... (I guess IPO wrongly mixes parts of the code that should be kept isolated...).
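For what it's worth, here is a sketch of how the top-level selection between the two code paths can look, assuming the _may_i_use_cpu_feature() intrinsic of recent Intel compilers (function names are hypothetical):

#include <immintrin.h>

// The two implementations live in separate .cpp files: the hybrid one is
// built with /QaxAVX, the baseline with /QxSSE4.1 only.
extern void FilterSSE41(short *dst, short const *src, int n);
extern void FilterHybridAVX(short *dst, short const *src, int n);

void Filter(short *dst, short const *src, int n)
{
    // Unlike a raw CPUID test, _may_i_use_cpu_feature() also verifies that
    // the OS saves/restores the ymm state (XGETBV).
    if (_may_i_use_cpu_feature(_FEATURE_AVX))
        FilterHybridAVX(dst, src, n);
    else
        FilterSSE41(dst, src, n);
}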
Best Regards
Hi Tim,
This issue is specific to the fact that AVX is an alternate code path.
Regards
