Hi,
I have a piece of code that I cannot disclose right now (I will try to reproduce it in a shorter example). The thing is, when I compile it with /QxAVX, it generates this code:
Address     Line  Assembly                              Clockticks (Total / Self)  Instr. Retired (Total / Self)  CPI
(The remaining per-instruction microarchitecture metric columns of the VTune grid, almost all 0.000, are omitted here and in the second listing below.)
0x100b931e  962   vmovdqu xmm5, xmmword ptr [ebx]       0.0% / 4,983,341           0.0% / 6,386,398               0.780
0x100b9322  962   vmovdqu xmmword ptr [esp+0xa0], xmm5  1.8% / 554,297,692         1.4% / 698,936,006             0.793
0x100b932b  962   vmovdqu xmm0, xmmword ptr [ebx+0x10]  0.1% / 22,394,000          0.1% / 28,464,010              0.787
0x100b9330  962   vmovdqu xmmword ptr [esp+0xb0], xmm0  0.0% / 6,031,618           0.0% / 8,129,718               0.742
0x100b9339  962   test esi, esi                         0.0% / 14,815,452          0.0% / 16,931,739              0.875
0x100b933b  962   jz 0x100b941e <Block 50>              0.0% / 1,904,153           0.0% / 2,000,911               0.952
0x100b9341        Block 49:
0x100b9341  962   mov edi, dword ptr [esp+0x54]
0x100b9345  962   vpaddw xmm3, xmm5, xmm4               0.0% / 3,352,093           0.0% / 4,271,375               0.785
0x100b9349  962   vpaddw xmm2, xmm0, xmm4               0.0% / 2,928,452           0.0% / 3,751,111               0.781
0x100b934d  962   mov eax, dword ptr [edi+0x4]          0.0% / 1,376,764           0.0% / 1,987,314               0.693
0x100b9350  962   vpextrw edi, xmm3, 0x0
0x100b9355  962   vmovdqu xmm7, xmmword ptr [esp+0x80]  0.0% / 4,592,970           0.0% / 5,534,012               0.830
0x100b935e  962   vmovdqu xmm1, xmmword ptr [esp+0x90]
When I compile it with /QxSSE4.1 /QaxAVX, the AVX code path looks like this:
Address     Line  Assembly                              Clockticks (Total / Self)  Instr. Retired (Total / Self)  CPI
0x100b9594        Block 47:
0x100b9594  962   mov dword ptr [esp+0x8], eax          0.0% / 2,106,301           0.0% / 2,485,774               0.847
0x100b9598  962   xor edx, edx
0x100b959a  962   mov esi, dword ptr [esp+0x28]
0x100b959e        Block 48:
0x100b959e  962   movdqu xmm5, xmmword ptr [ebx]        0.0% / 6,178,264           0.0% / 7,860,224               0.786
0x100b95a2  962   movdqa xmmword ptr [esp+0xa0], xmm5   1.7% / 518,966,716         1.3% / 644,970,879             0.805
0x100b95ab  962   movdqu xmm0, xmmword ptr [ebx+0x10]   0.1% / 18,849,972          0.0% / 22,961,024              0.821
0x100b95b0  962   movdqa xmmword ptr [esp+0xb0], xmm0   0.0% / 11,067,064          0.0% / 14,796,040              0.748
0x100b95b9  962   test esi, esi                         0.0% / 13,519,245          0.0% / 16,668,546              0.811
0x100b95bb  962   jz 0x100b96a1 <Block 50>              0.0% / 1,146,115           0.0% / 1,366,417               0.839
0x100b95c1        Block 49:
0x100b95c1  962   mov edi, dword ptr [esp+0x54]         0.0% / 7,793,002           0.0% / 9,372,588               0.831
0x100b95c5  962   vzeroupper
0x100b95c8  962   vpaddw xmm3, xmm5, xmm4               0.0% / 11,755,160          0.0% / 15,379,015              0.764
0x100b95cc  962   vpaddw xmm2, xmm0, xmm4               0.0% / 2,202,864           0.0% / 2,551,664               0.863
0x100b95d0  962   mov eax, dword ptr [edi+0x4]          0.0% / 1,229,030           0.0% / 1,534,467               0.801
0x100b95d3  962   vpextrw edi, xmm3, 0x0                0.0% / 4,683,935           0.0% / 6,584,619               0.711
0x100b95d8  962   vmovdqa xmm7, xmmword ptr [esp+0x80]  0.0% / 5,390,732           0.0% / 7,425,202               0.726
0x100b95e1  962   vmovdqa xmm1, xmmword ptr [esp+0x90]  0.0% / 4,387,615           0.0% / 5,483,645               0.800
Note that I had to insert the _mm256_zeroupper() manually to work around the huge penalty caused by this generated code.
I think in that case the compiler should only generate VEX-encoded instructions...
I'll try to reproduce it, but you really should investigate it.
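To make the problem concrete, here is a minimal sketch of the pattern involved and of where the manual _mm256_zeroupper() has to go (hypothetical function and variable names, not my real code):

#include <immintrin.h>

// Hypothetical illustration: a 256-bit AVX section followed by 128-bit SSE
// intrinsics that the compiler may emit with legacy (non-VEX) encodings.
void MixedKernel(float *dst, float const *src, short *fx, int n /* multiple of 8 */)
{
    for (int i = 0; i < n; i += 8) {
        __m256 v = _mm256_loadu_ps(src + i);        // dirties the ymm upper halves
        _mm256_storeu_ps(dst + i, _mm256_add_ps(v, v));
    }

    _mm256_zeroupper(); // manual workaround: clear the upper halves before any
                        // legacy-encoded SSE instruction executes, otherwise the
                        // hardware pays a costly AVX<->SSE state transition

    for (int i = 0; i < n; i += 8) {
        __m128i w = _mm_loadu_si128((__m128i const *)(fx + i));
        w = _mm_add_epi16(w, _mm_set1_epi16(1));    // may be emitted non-VEX under /QaxAVX
        _mm_storeu_si128((__m128i *)(fx + i), w);
    }
}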
Best regards
- Tags:
- Intel® Advanced Vector Extensions (Intel® AVX)
- Intel® Streaming SIMD Extensions
- Parallel Computing
Here is a code snippet that reproduces the suspected bug.
#include <immintrin.h>
#include <malloc.h> // _aligned_malloc
#include <stdlib.h>
#include <stdio.h>

typedef unsigned short int16_t;

__declspec(noinline) static bool SomeFunction(int16_t * outputPtr, int16_t const * inputPtr)
{
    for (int x = 0; x < 16; x++)
    {
        __m256 input = _mm256_loadu_ps((float *)(inputPtr + 16 * x));
        __m128i in_low  = _mm_castps_si128(_mm256_extractf128_ps(input, 0));
        __m128i in_high = _mm_castps_si128(_mm256_extractf128_ps(input, 1));
        __m128i out_low  = _mm_add_epi16(in_low, _mm_set1_epi16(1));
        __m128i out_high = _mm_add_epi16(in_low, _mm_set1_epi16(1)); // note: in_low in both lanes; the assembly below matches this
        __m256 output = _mm256_insertf128_ps(
            _mm256_insertf128_ps(_mm256_undefined_ps(), _mm_castsi128_ps(out_low), 0),
            _mm_castsi128_ps(out_high), 1);
        _mm256_store_ps((float *)(outputPtr + 16 * x), output);
    }
    return true;
}

int main(int argc, char ** argv)
{
    int16_t * dest = (int16_t *)_aligned_malloc(65536 * sizeof(int16_t), 32);
    int16_t * src  = (int16_t *)_aligned_malloc(65536 * sizeof(int16_t), 32);

    // Prevent input optimisations
    for (int i = 0; i < 32768; i++)
        src[i] = rand();

    SomeFunction(dest, src);

    // Prevent output optimisations
    for (int i = 0; i < 32768; i++)
        printf("%d\n", dest[i]);
}
Here is the assembly code generated with /QxAVX for SomeFunction:
01271090  push        esi
01271091  sub         esp,18h
01271094  xor         ecx,ecx
01271096  vmovdqu     xmm0,xmmword ptr [___xi_z+34h (1273120h)]
0127109E  mov         esi,eax
012710A0  xor         eax,eax
012710A2  vmovups     xmm1,xmmword ptr [eax+edx]
012710A7  inc         ecx
012710A8  vinsertf128 ymm2,ymm1,xmmword ptr [eax+edx+10h],1
012710B0  vpaddw      xmm4,xmm2,xmm0
012710B4  vinsertf128 ymm5,ymm4,xmm4,1
012710BA  vmovups     ymmword ptr [eax+esi],ymm5
012710BF  add         eax,20h
012710C2  cmp         ecx,10h
012710C5  jl          SomeFunction+12h (12710A2h)
012710C7  vzeroupper
012710CA  add         esp,18h
012710CD  pop         esi
012710CE  ret
Now with /QxSSE4.1 /QaxAVX (in the AVX code path):
SomeFunction:
008D1090  push        esi
008D1091  sub         esp,18h
008D1094  xor         ecx,ecx
008D1096  movdqa      xmm0,xmmword ptr [___xi_z+34h (8D3120h)]
008D109E  mov         esi,eax
008D10A0  xor         eax,eax
008D10A2  vmovups     ymm1,ymmword ptr [eax+edx]
008D10A7  inc         ecx
008D10A8  vpaddw      xmm3,xmm1,xmm0
008D10AC  vinsertf128 ymm4,ymm3,xmm3,1
008D10B2  vmovaps     ymmword ptr [eax+esi],ymm4
008D10B7  add         eax,20h
008D10BA  cmp         ecx,10h
008D10BD  jl          SomeFunction+12h (8D10A2h)
008D10BF  add         esp,18h
008D10C2  pop         esi
008D10C3  ret
There SHOULD not be a MOVDQA but a VMOVUPS (or VMOVDQA) when loading the constant from _mm_set1_epi16(1) into a register.
If I change SomeFunction by removing the for loop (so there is no hoisting of the _mm_set1_epi16(1) constant out of the loop), the generated code is normal again:
SomeFunction:
01101090  sub         esp,1Ch
01101093  vmovups     ymm0,ymmword ptr [edx]
01101097  vpaddw      xmm2,xmm0,xmmword ptr [___xi_z+34h (1103120h)]
0110109F  vinsertf128 ymm3,ymm2,xmm2,1
011010A5  vmovaps     ymmword ptr [eax],ymm3
011010A9  add         esp,1Ch
011010AC  ret
This bug is really annoying, especially since /QaxAVX is mandatory when we want to mix SSE4 and AVX code in the same binary: with current versions of the Intel Compiler it is not possible to compile just one .cpp file with /QxAVX without "border effects" (it pollutes with AVX every STL class, for instance, or any class shared between the two .cpp files).
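(For reference, one way to confine AVX to specific functions instead of whole files is the compiler's manual CPU dispatch. A rough sketch, assuming the __declspec(cpu_dispatch)/__declspec(cpu_specific) keywords of the Windows Intel Compiler; function names are hypothetical:)

#include <immintrin.h>

// Empty stub; the compiler replaces its body with the dispatch logic.
__declspec(cpu_dispatch(generic, core_2nd_gen_avx))
void AddOne(short *dst, short const *src) {}

// Baseline variant, legal on any processor.
__declspec(cpu_specific(generic))
void AddOne(short *dst, short const *src)
{
    for (int i = 0; i < 8; i++)
        dst[i] = (short)(src[i] + 1);
}

// AVX (Sandy Bridge class) variant.
__declspec(cpu_specific(core_2nd_gen_avx))
void AddOne(short *dst, short const *src)
{
    __m128i v = _mm_add_epi16(_mm_loadu_si128((__m128i const *)src),
                              _mm_set1_epi16(1));
    _mm_storeu_si128((__m128i *)dst, v);
}

This does not remove the transition penalties by itself, but it keeps the AVX-compiled code out of shared translation units.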
It took some time to reproduce it, so I hope it will be taken into consideration.
Best Regards
And I almost forgot:
The workaround is to insert _mm256_zeroupper() calls at places that depend on where the AVX and SSE instructions happen to be emitted (so it is a bit empirical...).
But really, when using intrinsics with the Intel Compiler, we shouldn't have to add these (plus vzeroupper throws away the upper halves of the ymm registers, data the processor might otherwise have kept and must now reload).
Hello Emma,
You are right.
As to "it's not possible with current versions of the Intel Compiler to compile just one .cpp file with /QxAVX without 'border effects' (since it pollutes with AVX every STL class, for instance, or any class shared between the two .cpp files)":
When using -xavx or /arch:AVX, it is known that "A disadvantage of this method is that it requires access to the relevant source files, so it cannot avoid AVX-SSE transitions resulting from calls to functions that are not compiled with the –xavx or –mavx flag. Another possible disadvantage is that all Intel® SSE code within a file compiled with the –xavx or –mavx flag will be converted to VEX format and will only run on Intel® AVX supported processors." (https://software.intel.com/sites/default/files/m/d/4/1/d/8/11MC12_Avoiding_2BAVX-SSE_2BTransition_2BPenalties_2Brh_2Bfinal.pdf)
The Intel compiler, when /arch:AVX is set so that AVX intrinsics are supported, generates equivalent AVX-128 code from SSE intrinsics, so there should be no transition penalty. The "There SHOULD not be a MOVDQA but a VMOVUPS" case therefore looks like the compiler failing to perform this conversion; I will investigate it and file an internal bug report if it is confirmed.
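As a small illustration of the expected behaviour (my own example; the encodings in the comments are what should be generated, not what you observed):

#include <emmintrin.h>

// Pure SSE2 intrinsics. Compiled with /arch:AVX or /QxAVX, this is expected
// to come out VEX-encoded, e.g.
//     vpaddw xmm0, xmm0, xmm1
// rather than the legacy encoding
//     paddw  xmm0, xmm1
// so no AVX<->SSE state transition should be possible.
__m128i AddOne128(__m128i v)
{
    return _mm_add_epi16(v, _mm_set1_epi16(1));
}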
Thank you.
--
QIAOMIN.Q
Intel Developer Support
Please participate in our redesigned community support web site:
User forums: http://software.intel.com/en-us/forums/
Hi,
I find your example code interesting, as I've kind of assumed, possibly wrongly, that it's bad to mix SSE with AVX code.
If I don't have all the functionality I want in AVX, then I use only SSE for the whole piece of code; I never mix, precisely because of issues like the SSE performance penalty you get if you don't clear the AVX upper registers.
Hi Richard,
I have code mixing float and fixed-point precision, and there mixing AVX and SSE offers a very good performance improvement that I cannot give up by moving back to SSE4.1 only.
The problem is that I am trying to ship a single binary with both an SSE4.1-optimized code path and an SSE4.1/AVX hybrid code path (a customer requirement), which is tedious (it's not my first problem with the compiler). FYI, the hybrid code is 20% faster, to give you an idea (once all the penalties are fixed, of course).
And this allowed me to notice that the /QxSSE4.1 /QaxAVX combination is very much flawed... (I guess IPO wrongly mixes parts of the code that should be kept isolated...).
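For what it's worth, here is a sketch of how the top-level selection between the two code paths can look, assuming the _may_i_use_cpu_feature() intrinsic of recent Intel compilers (function names are hypothetical):

#include <immintrin.h>

// The two implementations live in separate .cpp files: the hybrid one is
// built with /QaxAVX, the baseline with /QxSSE4.1 only.
extern void FilterSSE41(short *dst, short const *src, int n);
extern void FilterHybridAVX(short *dst, short const *src, int n);

void Filter(short *dst, short const *src, int n)
{
    // Unlike a raw CPUID test, _may_i_use_cpu_feature() also verifies that
    // the OS saves/restores the ymm state (XGETBV).
    if (_may_i_use_cpu_feature(_FEATURE_AVX))
        FilterHybridAVX(dst, src, n);
    else
        FilterSSE41(dst, src, n);
}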
Best Regards
Hi Tim,
This issue is specific to the fact that AVX is an alternate code path.
Regards
