Intel® ISA Extensions

AVX sometimes slower than SSE

Eric_Nuckols
Beginner
Has anyone experienced a slowdown by a factor of around 2 for certain functions that are converted from SSE to AVX-128?
My setup:
Intel compiler: icc V12.0.0.20101116
Linux kernel: 2.6.32-71.el6.x86_64
Processor: Intel Core i7-2600K CPU @ 3.4 GHz
Intel SpeedStep: *DISABLED*
Affinity: locked to one core
Memory: allocated 32-byte aligned
My compiler flags:
SSE: -m64 -msse3 -axSSE3 -align
AVX: -m64 -xavx -align
I have compiled the following function:
inline void vec_vec_add_overwrite( float *vec1, float *vec2, int n )
{
    long ii;
    for( ii = 0; ii < n; ii++ )
    {
        vec1[ii] += vec2[ii];
    }
}
My test proceeds as follows:

SetAffinity( core 0 )
overhead = GetClockOverhead( NUMTESTS )
memset( clocks, 0, NUMTESTS * sizeof(clocks[0]) )
n = 5123  /* vector length */
for( i = 0; i < NUMTESTS; i++ )
{
    vec1 = malloc( 32-byte aligned, n floats )
    vec2 = malloc( 32-byte aligned, n floats )
    fill_with_random( vec1 )
    fill_with_random( vec2 )
    _mm_clflush( vec1 );
    _mm_clflush( vec2 );
    _mm_mfence();
    before = ReadTSC()  /* uses an assembly CPUID call */
    vec_vec_add_overwrite( vec1, vec2, n );
    clocks[i] = ReadTSC() - before;
}
RemoveOverhead( clocks, NUMTESTS, overhead )
print average( clocks[IGNORED_START_INDEX : END] )  /* I throw out a handful of results at the start to remove initial transients */
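In actual C the harness is roughly the following. This is only a sketch, not the real test bed: read_tsc uses __rdtsc() bracketed by _mm_lfence() instead of the CPUID-serialized routine, _mm_malloc stands in for the aligned allocator, and NUMTESTS is an illustrative count.

#include <immintrin.h>   /* _mm_malloc, _mm_clflush, _mm_mfence, _mm_lfence; __rdtsc may need x86intrin.h with gcc */
#include <stdlib.h>

#define NUMTESTS 200     /* illustrative count */
#define N        5123

void vec_vec_add_overwrite( float *vec1, float *vec2, int n );   /* the function under test */

static inline unsigned long long read_tsc( void )
{
    _mm_lfence();                                /* keep the TSC read ordered */
    unsigned long long t = __rdtsc();
    _mm_lfence();
    return t;
}

int main( void )
{
    static unsigned long long clocks[NUMTESTS];
    for( int i = 0; i < NUMTESTS; i++ )
    {
        float *vec1 = _mm_malloc( N * sizeof(float), 32 );   /* 32-byte aligned */
        float *vec2 = _mm_malloc( N * sizeof(float), 32 );
        for( int j = 0; j < N; j++ )
        {
            vec1[j] = (float)rand() / RAND_MAX;
            vec2[j] = (float)rand() / RAND_MAX;
        }
        _mm_clflush( vec1 );      /* note: only flushes the first cache line of each array */
        _mm_clflush( vec2 );
        _mm_mfence();
        unsigned long long before = read_tsc();
        vec_vec_add_overwrite( vec1, vec2, N );
        clocks[i] = read_tsc() - before;
        _mm_free( vec1 );
        _mm_free( vec2 );
    }
    /* subtract the measured overhead and average, discarding the first few transients */
    return 0;
}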
The SSE version looks roughly like this (Unix-style assembly, destination on the right):
movss vec1, xmm1
addss vec2, xmm1
movss xmm1, vec1
...
...
.L_aligned:
movaps vec1, xmm1
addps vec2, xmm1
movaps xmm1, vec1 /* this block unrolled twice */
...
...
.L_unaligned:
movups vec1, xmm1
movups vec2, xmm2 /* this block unrolled twice */
addps xmm2, xmm1
movups xmm1, vec1
.L_finishup:
...
movss vec1, xmm1
addss vec2, xmm1
movss xmm1, vec1
...
ret
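In intrinsics form, the aligned part of that SSE loop corresponds roughly to this (a hand-written sketch, not the compiler's actual output; the function name is made up):

#include <xmmintrin.h>

/* rough intrinsics equivalent of the aligned SSE block above, unrolled by 2 */
static void vec_vec_add_sse_aligned( float *vec1, const float *vec2, int n )
{
    int ii;
    for( ii = 0; ii + 8 <= n; ii += 8 )
    {
        __m128 a0 = _mm_load_ps( vec1 + ii );
        __m128 a1 = _mm_load_ps( vec1 + ii + 4 );
        a0 = _mm_add_ps( a0, _mm_load_ps( vec2 + ii ) );
        a1 = _mm_add_ps( a1, _mm_load_ps( vec2 + ii + 4 ) );
        _mm_store_ps( vec1 + ii,     a0 );
        _mm_store_ps( vec1 + ii + 4, a1 );
    }
    for( ; ii < n; ii++ )            /* scalar remainder */
        vec1[ii] += vec2[ii];
}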
The AVX version looks roughly like this (Unix-style assembly, destination on the right):
vmovss vec1, xmm1
vaddss vec2, xmm1, xmm2
vmovss xmm2, vec1
...
...
.L_partially_aligned:
...
vmovups vec1[0:3], xmm0
vinsertf128 $1, vec1[4:7], ymm0, ymm1 /* this block unrolled twice*/
vaddps vec2[0:7], ymm1, ymm2
vmovups ymm2, vec1[0:7]
....
.L_finishup:
...
vmovss vec1, xmm1
vaddss vec2, xmm1, xmm2
vmovss xmm2, vec1
...
vzeroupper
ret
Matthias_Kretz
New Contributor I
Your C code is a little vague. Anyway, some questions:
  1. You're benchmarking on a dataset size of 5123 bytes?
  2. Why do you flush the first cache line of each array before you start your measurement?
  3. How do you actually measure the overhead? My experience with rdtscp is that it is better to use a long-running test (>= 1 ms); overhead subtraction always gave me funny numbers.
And some answers:
  1. Sandy Bridge can do two 128-bit loads plus one 128-bit store per cycle. Thus, with perfect ILP and unrolling, both loops (SSE and AVX) top out at 128 bits of store throughput per cycle.
  2. You're doing one FLOP per value against a load and a store of that value (plus a second load from vec2). Even if your problem fits in the L1 cache, you're therefore limited by the number of stores Sandy Bridge can do; the core could do four times more FLOPs with AVX than this code can possibly reach.
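As a rough back-of-envelope check of that store limit, assuming the 5123-float length from the original post and data resident in L1:

5123 floats * 4 bytes            = 20492 bytes stored per call
20492 bytes / 16 bytes per cycle ≈ 1281 cycles as a lower bound, for SSE and AVX alike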
TimP
Honored Contributor III
As Matthias pointed out, the upper limit for store performance with AVX on Sandy Bridge is the same as for SSE. That limit is approached only with nontemporal stores, which aren't applicable to your code anyway; in any case the AVX compilation doesn't use nontemporal stores. I've asked for that to change, but it isn't likely to change in the foreseeable future.
Since, as you found, your code is unrolled by 2, the maximum length of the scalar remainder loop has increased from 7 to 15 elements. With the Intel compilers, AVX performance requires attention to alignment and to making the loop end, as well as begin, on a cache-line boundary. Other compilers may produce more efficient remainder loops.
You haven't communicated the alignment you assert to the compiler (e.g. by #pragma vector aligned). Using unaligned moves on aligned data carries no performance penalty in itself, but you could avoid the vinsertf128 step if you persuaded the compiler to specialize for 32-byte alignment.
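For illustration, a minimal sketch of the two usual ways to give icc that alignment information (the pragma applies per loop; __assume_aligned is an icc builtin whose availability depends on the compiler version, and the second function name is made up):

/* Option 1: per-loop pragma asserting that all array accesses in the loop are aligned */
inline void vec_vec_add_overwrite( float *vec1, float *vec2, int n )
{
    long ii;
#pragma vector aligned
    for( ii = 0; ii < n; ii++ )
    {
        vec1[ii] += vec2[ii];
    }
}

/* Option 2: per-pointer assertion (hypothetical variant of the same function) */
inline void vec_vec_add_overwrite_assumed( float *vec1, float *vec2, int n )
{
    long ii;
    __assume_aligned( vec1, 32 );
    __assume_aligned( vec2, 32 );
    for( ii = 0; ii < n; ii++ )
    {
        vec1[ii] += vec2[ii];
    }
}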
Eric_Nuckols
Beginner
Answers/Reasoning
1. That particular number, 5123, was the length of the floating-point buffers. I chose smaller array lengths for a few reasons:
a. I don't want to worry about any kind of pre-emption or OS-related noise in my results, so I want to get in and out quickly.
b. I have a lot of other functions, and variations of each of them, that I test repeatedly in a test bed, and many times I just want to see whether quick compiler-option changes have any significant effect.
2. I flush cache lines because I am trying to get an apples-to-apples comparison between my functions written as C code/compiler-generated assembly, hand-coded assembly, the MKL API, the IPP API, etc., and I am also comparing gcc and icc performance. Basically I am attempting to eliminate variables in the benchmark.
3. I am using the ideas from Agner Fog's optimization guide and samples for removing overhead and measuring clock cycles. It consists of something like this:
loop
{
    before = ReadTSC()
    after = ReadTSC() - before
}
overhead = max of (after buffer)
I used to get funny numbers, but that seemed to be corrected by following Mr. Fog's directions and disabling SpeedStep and any other dynamic clocking features in the BIOS.
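Concretely, a minimal sketch of that calibration (reusing the read_tsc helper from the harness sketch above; Mr. Fog's actual routines serialize with CPUID):

/* time an empty measurement numtests times and keep the worst case, as in the pseudocode above */
static unsigned long long tsc_overhead( int numtests )
{
    unsigned long long worst = 0;
    for( int i = 0; i < numtests; i++ )
    {
        unsigned long long before = read_tsc();
        unsigned long long delta  = read_tsc() - before;
        if( delta > worst )
            worst = delta;
    }
    return worst;   /* later subtracted from each measured kernel time */
}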
@Tim:
I noticed that the compiler definitely did not do the same alignment tricks for AVX as it did for SSE.
I hand-coded an AVX version with somewhat better alignment handling and was able to make it faster, but I was never able to approach the SSE speed. I am obviously still learning.
@All...
In my tests, for operations that are heavy on streaming data and light on actually doing math on that data, the effort required to jump from SSE to AVX does not currently bring reasonable returns (i.e. the two loads but only one store per cycle, not to mention the burden of devising new alignment tricks).
Can we expect the Intel compilers and the MKL/IPP libraries to change quickly in the near future to provide better alignment handling, so that at the least the auto-generated AVX code doesn't drop below SSE performance, or so that the compiler is sharp enough (without #pragma help) to use SSE where it is as fast or faster?
TimP
Honored Contributor III
I've already hinted that I expect "#pragma vector aligned" to be necessary to take advantage of 32-byte alignment, along with measures to avoid spending more time in remainder loops when the loop count isn't a multiple of 16. I've heard of efforts to improve the performance of the remainder loops, but there is no assurance that they will appear in the "near future."
As you've seen, the compiler drops back to 128-bit memory accesses when it doesn't know the alignment.
I haven't checked Sandy Bridge for an effect which is prominent on Westmere, where alignment at odd multiples of 8 bytes is handled much better by the gnu compilers (by splitting the memory access into 64-bit moves, similar to the way your access is split explicitly by the compiler into 128-bit moves). For double precision, this can make icc -msse2 faster than icc -xhost. From what I've heard of the architecture presentations, the compiler team has been directed not to look for or optimize for this situation.
Eric_Nuckols
Beginner
I have noticed that the #pragma vector aligned statement doesn't produce much faster code than leaving it out.
The only significant difference I can see is that without the pragma the compiler uses the aliased xmm* registers, and with the pragma it uses full ymm* moves.
It never generates separate loops for different alignment cases.
I guess I am just confused about why the performance wouldn't closely match that of SSE when all buffers are aligned and the SSE code is directly translated from movaps to vmovaps, with a vzeroupper added before the ret.
It doesn't seem to be related to remainder loops, because I have set the vector length to a multiple of 32 bytes, so all of the work is done in the primary loop. Additionally, there is less up-front logic in this version than in the SSE version, because there is only one big loop followed by the remainder loop.
There are about 42 bytes worth of instructions in the loop. It's unrolled twice (16 bytes worth of data in, 8 bytes out).
Is there a glaring error in my approach? I know that I am throughput-limited on the stores, but that limitation should be the same regardless of SSE or AVX, and the vzeroupper call is supposed to carry no time penalty.
Thanks for the comments and help so far, and for any further responses. If I'm doing everything right, I will stop beating on this and just fall back to SSE for the time being.
bronxzv
New Contributor II

.L_partially_aligned:

...
vmovups vec1[0:3], xmm0
vinsertf128 $1, vec1[4:7], ymm0, ymm1 /* this block unrolled twice*/
vaddps vec2[0:7], ymm1, ymm2
vmovups ymm2, vec1[0:7]
....

if "vec1" isn't 32B aligned (and it looks like it's the case since you do 2 128-bit loads) it should be significantly faster to also split the final store in 2 x 128-bit store
Eric_Nuckols
Beginner
Yeah, the vinsertf128 was compiler-generated; the compiler doesn't do those 32-byte alignment optimizations without #pragma vector aligned.

My arrays are allocated on 32-byte boundaries, and I've made the lengths multiples of 32 bytes to avoid remainder loops.
I have also changed the code to:
.L_aligned:
vmovaps (vec1), ymm0
vaddps (vec2), ymm0, ymm1
vmovaps ymm1, (vec1)
vmovaps 32(vec1), ymm0
vaddps 32(vec2), ymm0, ymm1
vmovaps ymm1, 32(vec1)
and have seen only a slight improvement...
On this particular box, my cycle count for SSE is ~2800, and my fastest AVX loop so far is ~3500.
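For reference, in intrinsics that hand-coded aligned loop is roughly the following (a sketch; the function name is made up, and the scalar remainder is unused when n is a multiple of 16):

#include <immintrin.h>

/* rough intrinsics form of the hand-coded aligned AVX loop, unrolled by 2 */
static void vec_vec_add_avx_aligned( float *vec1, const float *vec2, int n )
{
    int ii;
    for( ii = 0; ii + 16 <= n; ii += 16 )
    {
        __m256 a0 = _mm256_add_ps( _mm256_load_ps( vec1 + ii ),
                                   _mm256_load_ps( vec2 + ii ) );
        __m256 a1 = _mm256_add_ps( _mm256_load_ps( vec1 + ii + 8 ),
                                   _mm256_load_ps( vec2 + ii + 8 ) );
        _mm256_store_ps( vec1 + ii,     a0 );
        _mm256_store_ps( vec1 + ii + 8, a1 );
    }
    for( ; ii < n; ii++ )            /* scalar remainder */
        vec1[ii] += vec2[ii];
}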
bronxzv
New Contributor II
From my experience, the speedup from SSE to AVX-256 for this kind of code is at best 1.5x with a 100% L1D hit rate and something like 1.25x with a dataset fitting in L2; I'm not sure about the LLC, and obviously there is no speedup at all if you're RAM-bandwidth bound.

You may be able to improve it slightly by grouping the adjacent moves like this:

vmovaps (vec1), ymm0
vmovaps 32(vec1), ymm2
vaddps (vec2), ymm0, ymm1
vaddps 32(vec2), ymm2, ymm3
vmovaps ymm1, (vec1)
vmovaps ymm3, 32(vec1)
