Intel® ISA Extensions

AVX Performance Reduction Vs 128 SIMD

inteleverywhere
Beginner
Dear All,
Legacy 128-bit SIMD code was replaced with AVX code, expecting a performance gain or at least the status quo. But performance has fallen to 10% of the SSE code's. I can't figure out why.
128 Bit SIMD
=========
{
Load 128 Bit
Load 128 Bit
Store 128 Bit
Store 128 Bit
}
AVX code
======
{
VzeroUpper
Load 256 Bit
Store 256 Bit
VzeroUpper
}
Is there anything else that needs to be taken care of? How come the performance is only 10%?
Thanks
D
13 Replies
Matthias_Kretz
New Contributor I
The Sandy Bridge load-store unit can issue only one 128-bit store per cycle (and two 128-bit loads), so the code you show can't get faster with AVX. By adding the vzeroupper instructions you can only make it slower (are you sure you really need them there? The one at the beginning of the block, especially, should be superfluous).
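
To illustrate (a minimal sketch with a hypothetical float-copy helper, not your exact case): vzeroupper belongs once at the end of the 256-bit block, before execution returns to code that may run legacy SSE instructions:

#include <immintrin.h>
#include <stddef.h>

/* hypothetical helper: copy n floats (n a multiple of 8, both pointers
   32-byte aligned) with 256-bit moves */
static void copy_avx(float *dst, const float *src, size_t n)
{
    for (size_t i = 0; i < n; i += 8) {
        const __m256 v = _mm256_load_ps(src + i);  /* one 256-bit load  */
        _mm256_store_ps(dst + i, v);               /* one 256-bit store */
    }
    /* clear the upper YMM halves once, on exit, so later legacy SSE
       code avoids the AVX->SSE transition penalty; a vzeroupper at the
       start of the block buys nothing */
    _mm256_zeroupper();
}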
inteleverywhere
Beginner
Thanks.

The answer to the second part: it is clearly stated in the AVX instruction set reference that performance is guaranteed to improve when vzeroall/vzeroupper is used when jumping from 256-bit to 128-bit mode. Removing them does not improve performance either.

The first part: if the performance does not improve with one 256-bit load in place of two 128-bit loads, that is still acceptable, but as I mentioned earlier, the AVX time taken is 10 times more than the SSE time. Something must be wrong!
bronxzv
New Contributor II
On Sandy Bridge, 256-bit unaligned moves are significantly slower than 2x 128-bit unaligned moves. For example, the Intel compiler expands _mm256_loadu_ps and _mm256_storeu_ps into a series of instructions (128-bit moves / inserts).

It may be part of the explanation if you work with unaligned data.
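
Roughly the kind of expansion I mean (a sketch of the idea, not the compiler's literal output):

#include <immintrin.h>

/* one unaligned 256-bit load rewritten as two 128-bit loads + insert */
static __m256 loadu_256_via_2x128(const float *p)
{
    const __m128 lo = _mm_loadu_ps(p);      /* low 128 bits  */
    const __m128 hi = _mm_loadu_ps(p + 4);  /* high 128 bits */
    return _mm256_insertf128_ps(_mm256_castps128_ps256(lo), hi, 1);
}

/* the matching unaligned 256-bit store split into two 128-bit stores */
static void storeu_256_via_2x128(float *p, __m256 v)
{
    _mm_storeu_ps(p, _mm256_castps256_ps128(v));        /* low half  */
    _mm_storeu_ps(p + 4, _mm256_extractf128_ps(v, 1));  /* high half */
}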

For better advice, I'd suggest posting the actual code of your inner loop / hotspot.
inteleverywhere
Beginner
Hi,
The code is more or less a replica of the pseudocode in my first post: a loop with a series of loads and stores, with a few basic operations in between. For the sake of simplicity I have removed those operations and am testing the code just for loads/stores. The question is not about the performance improvement with AVX; rather, it is about why the performance degrades to half or a quarter of the 128-bit SIMD performance. Here it is:

Loop_count
=========
{
//Reg = _mm_loadu_si128(POINTER);
//_mm_storeu_si128(POINTER, Reg);

Reg = _mm_load_si128(POINTER);
_mm_store_si128(POINTER, Reg);

POINTER = POINTER + 8;
}

Loop_count/2
=========
{
//Reg = _mm256_loadu_si256(POINTER);
//_mm256_storeu_si256(POINTER, Reg);

Reg = _mm256_load_si256(POINTER);
_mm256_store_si256(POINTER, Reg);

POINTER = POINTER + 16;
}

bronxzv
New Contributor II

It would be more useful to have full (compilable) code for your example. I'd like to see your memory allocation, the exact type of "POINTER" (I suppose the casts are missing in your example code), the typical loop counts (i.e. cache footprint), ...

Anyway, do you work with 32-byte-aligned addresses? I ask because it's possible that _mm256_load_si256 is generating VMOVDQU instead of VMOVDQA (seen with the Intel compiler), so 32-byte alignment isn't certain simply from looking at your example.
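
A trivial check to rule this out (standard C, nothing Intel-specific):

#include <stdint.h>

/* nonzero if p is 32-byte aligned, i.e. safe for VMOVDQA */
static int is_32B_aligned(const void *p)
{
    return ((uintptr_t)p & 31u) == 0;
}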

inteleverywhere
Beginner

The code is placed below. Is there anything else required? Why is the AVX performance worse than the SSE-128 performance?

#include <immintrin.h>
#include <malloc.h>

int main()
{
    long int Buf1_Size = 1024, Buf2_Size = 1024;   // sizes in bytes
    short *Buf1_ptr = (short *) _aligned_malloc(Buf1_Size, 32);
    short *Buf2_ptr = (short *) _aligned_malloc(Buf2_Size, 32);
    int counter;

    //SSE-128
    // note: counter indexes shorts (2 bytes each) while Buf1_Size is in
    // bytes, so these loops actually walk 2 KB through 1 KB buffers
    for (counter = 0; counter < Buf1_Size; counter += 8)
    {
        __m128i A;
        A = _mm_load_si128((__m128i *) (Buf1_ptr + counter));
        _mm_store_si128((__m128i *) (Buf2_ptr + counter), A);
    }

    //AVX-256
    for (counter = 0; counter < Buf1_Size; counter += 16)
    {
        __m256i B;
        B = _mm256_load_si256((__m256i *) (Buf1_ptr + counter));
        _mm256_store_si256((__m256i *) (Buf2_ptr + counter), B);
    }

    return 0;
}

inteleverywhere
Beginner

Sorry, there were a lot of errors in the code sent previously. You can use this one.

#include <immintrin.h>
#include <malloc.h>

int main()
{
    long int Buf1_Size = 1024*1024, Buf2_Size = 1024*1024;
    short *Buf1_ptr = (short *) _aligned_malloc(Buf1_Size, 16);
    short *Buf2_ptr = (short *) _aligned_malloc(Buf2_Size, 16);
    int counter;

    //SSE-128
    for (counter = 0; counter < 1024; counter += 8)
    {
        __m128i A;
        A = _mm_load_si128((__m128i *) (Buf1_ptr + counter));
        _mm_store_si128((__m128i *) (Buf2_ptr + counter), A);
    }

    _aligned_free(Buf1_ptr);
    _aligned_free(Buf2_ptr);

    Buf1_ptr = (short *) _aligned_malloc(Buf1_Size, 32);
    Buf2_ptr = (short *) _aligned_malloc(Buf2_Size, 32);

    //AVX-256
    for (counter = 0; counter < 1024; counter += 16)
    {
        __m256i B;
        B = _mm256_load_si256((__m256i *) (Buf1_ptr + counter));
        _mm256_store_si256((__m256i *) (Buf2_ptr + counter), B);
    }

    _aligned_free(Buf1_ptr);
    _aligned_free(Buf2_ptr);

    return 0;
}

bronxzv
New Contributor II
Thanks, I'll test it later today. From the look of your code the two paths should run at the same speed (store bandwidth to the L1D cache being the main limiter); I would be really stunned to see the AVX path 10x slower as you said.

Just a question: why do you allocate 1 MB buffers and iterate over only 2 KB?
bronxzv
New Contributor II
I just plugged your examples into a test framework; I slightly modified your code this way:

void Copy128(const short *Buf1_ptr, short *Buf2_ptr)
{
for (int counter=0; counter<1024; counter+=8)
{
const __m128i A = _mm_load_si128((const __m128i *)(Buf1_ptr + counter));
_mm_store_si128((__m128i *) (Buf2_ptr + counter), A);
}
}

void Copy256(const short *Buf1_ptr, short *Buf2_ptr)
{
for (int counter=0; counter<1024; counter+=16)
{
const __m256i B = _mm256_load_si256((const __m256i *)(Buf1_ptr + counter));
_mm256_store_si256((__m256i *)(Buf2_ptr + counter),B);
}
}

Here is the ASM generated by the Intel compiler (32-bit mode):


SSE :

.B1.2: ; Preds .B1.2 .B1.1
;;; {
movdqa xmm0, XMMWORD PTR [ecx+eax*2] ;303.56
movdqa XMMWORD PTR [edx+eax*2], xmm0 ;304.34
add eax, 8 ;301.37
cmp eax, 1024 ;301.31
jl .B1.2 ; Prob 99% ;301.31

AVX-128 :

.B1.2: ; Preds .B1.2 .B1.1
;;; for (int counter=0; counter<1024; counter+=8)
;;; {
mov esi, eax ;303.23
inc eax ;301.3
shl esi, 5 ;303.23
cmp eax, 64 ;301.3
vmovdqu xmm0, XMMWORD PTR [ecx+esi] ;299.6
vmovdqu XMMWORD PTR [edx+esi], xmm0 ;299.6
vmovdqu xmm1, XMMWORD PTR [16+ecx+esi] ;299.6
vmovdqu XMMWORD PTR [16+edx+esi], xmm1 ;299.6
jb .B1.2 ; Prob 98% ;301.3


AVX-256 :
.B2.2: ; Preds .B2.2 .B2.1
mov esi, eax ;313.22
inc eax ;311.2
shl esi, 6 ;313.22
cmp eax, 32 ;311.2
vmovdqu ymm0, YMMWORD PTR [ecx+esi] ;313.22
vmovdqu YMMWORD PTR [edx+esi], ymm0 ;314.4
vmovdqu ymm1, YMMWORD PTR [32+ecx+esi] ;313.22
vmovdqu YMMWORD PTR [32+edx+esi], ymm1 ;314.4
jb .B2.2 ; Prob 96% ;311.2


I got these timings on a Core i7-2600K at stock clock with SpeedStep and Turbo disabled.
Runtime for 10,000,000 runs:

SSE: 818 ms
AVX-128: 654 ms
AVX-256: 420 ms

So AVX-256 is nearly twice as fast as SSE in this particular example. I'm pretty surprised, since I expected the difference to be smaller due to the L1D cache bandwidth limitation.
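
As a rough sanity check on these numbers (my arithmetic): 420 ms over 10,000,000 runs is 42 ns per 2 KB copy, about 143 clocks at 3.4 GHz, i.e. 2048 B / 143 clocks ≈ 14.3 B of store traffic per clock, already close to the one 16-byte store per clock that Sandy Bridge can sustain to the L1D.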




inteleverywhere
Beginner
Thanks for the analysis from your side. I am not getting the same performance you report. However, there is a drastic improvement in the performance of the AVX code when only the load/store functions are retained in the entire source (Time(Copy128()) ≈ Time(Copy256()) * 0.9). This is more realistic. Whenever there is an additional set of operations, the performance falls big time. This I am studying.

Thanks Again.
bronxzv
New Contributor II
> I am not getting the same performance as mentioned by you.

The timings I get are very repeatable (note that I run the outer loop 10 million times), so I suppose you're using a different compiler. It would be interesting to see the ASM for your inner loops and to know the absolute timings you get (same array size, 10,000,000 iterations at 3.4 GHz for an easy comparison).
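
For reference, a minimal sketch of the kind of timing harness I mean (a simplified stand-in, not my actual framework; Copy128/Copy256 as defined above):

#include <chrono>

// declared in the earlier post
void Copy128(const short *Buf1_ptr, short *Buf2_ptr);
void Copy256(const short *Buf1_ptr, short *Buf2_ptr);

// run one copy routine 'runs' times and return elapsed milliseconds;
// check the generated ASM to be sure the compiler didn't elide the
// calls, otherwise the measurement is meaningless
template <class CopyFn>
long long TimeCopyMs(CopyFn copy, const short *src, short *dst, int runs)
{
    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < runs; ++i)
        copy(src, dst);                     // 2 KB copied per call
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
}

e.g. TimeCopyMs(Copy256, Buf1_ptr, Buf2_ptr, 10000000) should land near the 420 ms I quoted above.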

NB: the throughput I measured for the Copy256 case is very near the theoretical peak L1D$ write bandwidth of 16 B/clock.
inteleverywhere
Beginner
It is also better to include immintrin.h and not avxintrin_emu.h. The latter emulates the AVX intrinsics and will reduce performance on a real processor; it is meant only for emulation. Performance improves when immintrin.h is used, since the actual hardware is then in use.
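
In other words (a minimal illustration, assuming the emulation header was being included):

// #include "avxintrin_emu.h"  // emulation: 256-bit intrinsics implemented
//                             // with slower 128-bit/scalar code
#include <immintrin.h>         // real AVX intrinsics, compiled to VEX-encoded
                               // hardware instructions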

Thanks.
D
bronxzv
New Contributor II

Oughhh, if you were using an emulation library, that is clearly the explanation of your strange findings. Hint: it's always a good idea to watch the ASM dumps when investigating performance issues.
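(With the Microsoft compiler, /FA writes an assembly listing; gcc and the Intel compiler accept -S. With the emulation header you would typically see function calls and plain SSE code in the hot loop instead of VEX-encoded ymm moves.)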
