Solved: Comparing scalar, SSE and AVX basics....poor performances ??

Joaquin_Tarraga · ‎06-06-2012

Hi,

Please, look at these pieces of code, consisting of three versions to

calculate the length of a set of 3-D vectors.

Let's assume, vector v, with components x, y, z.

The length (l) of vector v, is l = sqrt((x*x) + (y*y) + (z*z))

I implemented three versions based on scalar, SSE and AVX instruction, to compute the

length of 90 000 000 vectors. I hope to get much better performance using SSE,

and AVX, but no...., here the results:

=======================================

TEST 0: l = sqrt((x*x) + (y*y) + (z*z))

=======================================

Scalar time: 0.46051

SSE time : 0.18613

AVX time : 0.19043

Speed-up Scalar vs SSE : 2.47

Speed-up Scalar vs AVX : 2.42

I hope a speed-up of 4 when using SSE, a much more with AVX,

but there is no difference between SSE and AVX.

Target architecture:

Intel Xeon CPU E31245 @ 3.30GHz
4 CPU dual-core (but I only use one core)

Command line to compile:

gcc -O3 -std=c99 -mavx main.c -o main -lm

And the code:

Allocating memory for the SSE version:

x = (float*)_mm_malloc(len * sizeof(float), 16);

y =(float*)_mm_malloc(len * sizeof(float), 16);

....

Allocating memory for the AVX version:

x = (float*)_mm_malloc(len * sizeof(float), 32);

y =(float*)_mm_malloc(len * sizeof(float), 32);

....

//----------------------------------------------------------------------------------------------------------------------

void length_scalar(float *x, float *y, float *z, float *l, unsigned int length) {

for (int i = 0; i

l = sqrt((x*x) + (y*y) + (z*z));

}

//----------------------------------------------------------------------------------------------------------------------

void length_sse(float *x, float *y, float *z, float *l, unsigned int length) {

__m128 xmm0, xmm1, xmm2, xmm3;

for (int i = 0; i

xmm0 = _mm_load_ps(&x);

xmm1 = _mm_load_ps(&y);

xmm2 = _mm_load_ps(&z);

xmm3 = _mm_add_ps(_mm_mul_ps(xmm0, xmm0), _mm_mul_ps(xmm1, xmm1));

xmm3 = _mm_add_ps(_mm_mul_ps(xmm2, xmm2), xmm3);

xmm3 = _mm_sqrt_ps(xmm3);

_mm_store_ps(&l, xmm3);

}

//----------------------------------------------------------------------------------------------------------------------

void length_avx(float *x, float *y, float *z, float *l, unsigned int length) {

__m256 ymm0, ymm1, ymm2, ymm3;

for (int i = 0; i

ymm0 = _mm256_load_ps(&x);

ymm1 = _mm256_load_ps(&y);

ymm2 = _mm256_load_ps(&z);

ymm3 = _mm256_add_ps(_mm256_mul_ps(ymm0, ymm0), _mm256_mul_ps(ymm1, ymm1));

ymm3 = _mm256_add_ps(_mm256_mul_ps(ymm2, ymm2), ymm3);

ymm3 = _mm256_sqrt_ps(ymm3);

_mm256_store_ps(&l, ymm3);

}

//----------------------------------------------------------------------------------------------------------------------

Could you, please, give me some hints, suggestions....to explain that?

I think it is due to the 4 instructions to move data (memory /register, i.e., the load and store instructions),

what do you think?

If I ran a example more simple (addition of the 3 components of a vector, for 90 000 000 vectors)

and I got worse results:

=======================================

TEST 1: l = x + y + z

=======================================

Scalar time: 0.61573

SSE time : 0.34304

AVX time : 0.34770

Speed-up Scalar vs SSE : 1.79

Speed-up Scalar vs AVX : 1.77

Any idea?

Thanks a lot

--

Joaqun

bronxzv · ‎06-06-2012

I have tested yoursimple add example and I find a very good 2x speedup(AVX-256 vs AVX-128) for the workloads fitting in the L1D cache, that's a lot better than I was expecting, when the L1D cache is overflowed timings are sometimes worse for AVX-256 than for AVX-128, though, then for big workloads timings are much the same (as expected) since we are mostly RAM bandwidth bound

my timings are as follows:

Core i7 3770K@ 3.5 GHz, enhanced speedstep disabled, turbo off
woking set size: AVX-128 time AVX-256 time
[plain] 128: 22.5 ms 12.9 ms 256: 19.3 ms 11.2 ms 512: 19.4 ms 9.63 ms 1024: 19.3 ms 9.84 ms 2048: 19.2 ms 9.72 ms 4096: 20.8 ms 10.1 ms 8192: 20.1 ms 10.7 ms 16384: 19.7 ms 10.1 ms 32768: 19.5 ms 17.6 ms 65536: 24.5 ms 28.8 ms 131072: 23.6 ms 28.3 ms 262144: 28.4 ms 34 ms 524288: 38.8 ms 40.6 ms 1048576: 39 ms 41.5 ms 2097152: 39.1 ms 41.2 ms 4194304: 41 ms 43.1 ms 8388608: 53.6 ms 50.8 ms 16777216: 94.6 ms 85.9 ms 33554432: 110 ms 109 ms 67108864: 115 ms 113 ms 134217728: 118 ms 113 ms[/plain]
source code:

[cpp]template inline T *AAlloc(size_t size) { return (T *)_aligned_malloc(sizeof(T)*size,32); } inline void AFree(void *p) { if (p) _aligned_free(p); } void AddTestAVX128(const float *x, const float *y, const float *z, float *l, unsigned int length) { for (unsigned int i=0; i(chunkSize), *y = AAlloc(chunkSize), *z = AAlloc(chunkSize), *l = AAlloc(chunkSize); for (int j=0; j = j*1.0; y = j*2.0; z = j*3.0;} Chrono chrono(""); const float start = chrono.getTime(); for (int i=0; i
// main call:
for (int chunkSize=8; chunkSize<10000000; chunkSize<<=1) JTTest(chunkSize);[/cpp]

ASM dumps :

[plain].B51.3:: ; Preds .B51.3 .B51.2 ;;; { ;;; const __m128 px = _mm_load_ps(x+i), py = _mm_load_ps(y+i), pz = _mm_load_ps(z+i); vmovups xmm0, XMMWORD PTR [rcx+r10*4] ;478.35 add eax, 4 ;476.36 ;;; _mm_store_ps(l+i,_mm_add_ps(_mm_add_ps(px,py),pz)); vaddps xmm1, xmm0, XMMWORD PTR [rdx+r10*4] ;479.33 vaddps xmm2, xmm1, XMMWORD PTR [r8+r10*4] ;479.22 vmovups XMMWORD PTR [r9+r10*4], xmm2 ;479.18 mov r10d, eax ;476.36 cmp eax, r11d ;476.28 jb .B51.3 ; Prob 82% ;476.28[/plain]

[plain] .B52.3:: ; Preds .B52.3 .B52.2 ;;; { ;;; const __m256 px = _mm256_load_ps(x+i), py = _mm256_load_ps(y+i), pz = _mm256_load_ps(z+i); vmovups ymm0, YMMWORD PTR [rcx+r10*4] ;487.38 add eax, 8 ;485.36 ;;; _mm256_store_ps(l+i,_mm256_add_ps(_mm256_add_ps(px,py),pz)); vaddps ymm1, ymm0, YMMWORD PTR [rdx+r10*4] ;488.39 vaddps ymm2, ymm1, YMMWORD PTR [r8+r10*4] ;488.25 vmovups YMMWORD PTR [r9+r10*4], ymm2 ;488.21 mov r10d, eax ;485.36 cmp eax, r11d ;485.28 jb .B52.3 ; Prob 82% ;485.28 [/plain]

View solution in original post

bronxzv · ‎06-06-2012

if you do that with 90M vectors your application is more a memory bandwidth testthan anything else I'll say

what I'llsuggest you to do totest the speed of the kernel in isolation:
- work with a small working set that fit in the L1D cache, for example 1000 vectors only
- repeat the test a lof of time with an outer loop (for example executed1 million times) to have accurate timings

then the speedupvs the scalar version shouldbe better (provided that the scalar version is really scalar, i.e. not vectorized by the compiler),forAVX vs SSE the botlleneckis clearly the(high latency / low throughput) sqrt whichhas the same throughput on Sandy Bridge with SSE and AVX-256, Ivy Bridge will enjoy a better speedup heresince the throughput is doubled for AVX-256

if youtry a fast sqrt approximation such as _mm256_mul_ps(m,_mm256_rsqrt_ps(m)) you should see a betterAVX-256vs SSE speedupon Sandy Bridge

now this code is heavily load port bound apparently so I'll not expect better than 1.3x speedup from AVX-256 vs. SSE even with a workload that fit at 100% in theL1D cache

just one more thing, you can also help your compiler a bit by using the const keyword when it applies such as :

void length_sse(const float *x, const float *y, const float *z, float *l, unsigned int length)

EDIT : I see that you use in fact AVX-128 (!) so the first thing to do is to switch to AVX-256 to hope for anyspeedup, i.e. use _mm256_load_ps, _mm256_mul_ps etc.

TimP · ‎06-06-2012

As the AVX-256 sqrt is sequenced as 2 128-bit sqrt instructions, you could expect little difference in performance between SSE and AVX-256. Given that the SSE parallel sqrt is reasonably efficient, it may seem a backward step to use the iterative method in an attempt to improve throughput on AVX-256.

bronxzv · ‎06-06-2012

In practice on Sandy Bridge a 2nd order Newton-Raphson is clearlyfaster for AVX-256 (1 rsqrt + 5 mul + 1 sub instead of sqrt with its 28 clock rcp throughput), so something to consider using if it's in a hotspot

it's even more important when you normalize vectors (a very common use case for 3D applications) to useNewton-Raphson since in this case it's 1 rsqrt + 4 mul + 1 sub instead of 1 sqrt + 1 div (i.e. a chain of very high latency / low throughput instructions)

TimP · ‎06-06-2012

OP didn't even tell us whether gcc is generating AVX-256 instructions, and we can't see his source code. I'm skeptical whether the _mm_malloc(...,32) would be sufficient to push gcc into that mode. Even if it did so, OP seems to be ignoring the architectural requirement for the splitting of AVX-256 memory accesses as well as sqrt.

bronxzv · ‎06-06-2012

note that he has now updated his code with an AVX-256 path, the previous version was with 2x the SSE path probably due to a copy&paste error, isn't it Joaquin?

bronxzv · ‎06-06-2012

one more thing Joaquin:

the way you name your variables is very confusing, the code generator automatically choose the register it wants to use so it may well use ymm6 in 32-bit and ymm13 in 64-bit mode for what you call "ymm0" in your code, it may be confusing if you have to do low level debugging and it's not very readable for the maintener of this code

I'll advise to use another notation such as :

const __m256 px = _mm256_load_ps(x+i);

"px" for "packed x", other ideas include"ox" for "octo x", etc. your code will then look like my example below, for any complex project I'll advise to use at least operator overloading if your vectorizer isn't up to the task

[cpp]inline __m256 Sqr(const __m256 &px) {return _mm256_mul_ps(px,px);} void length_avx(const float *x, const float *y, const float *z, float *l, unsigned int length) { for (unsigned int i=0; i

bronxzv · ‎06-06-2012

which splitting of memory accesses are you refering to ? it looks alright to me like this for 32B aligned arrays

Joaquin_Tarraga · ‎06-06-2012

Thanks for your comments and suggestions, and for the included code (bronxzv)

If I work by chunks of 1024 floats, i.e., now I have two loops, the outer one is: for (int i = 0; i < 90000000; i += 1024),

I get a speed-up of 4 (aprox.) for, both, SSE and AVX (see results below).

TimP (Intel) commented: "As the AVX-256 sqrt is sequenced as 2 128-bit sqrt instructions, you could expect little difference in performance between SSE and AVX-256". It couldexplain why there's no difference between SSE and AVX in test0 (sqrt), but in test1 (where only three additions are performed) the speed-up is the same for AVX and SSE !

By executing "objdump -S ", I checked the assembly instructions are AVX, vaddps, vmovaps,vmulps, vsqrtps..., for both the SSE and the AVX functions, the difference cames from the inner loop 'step/stride' value, 4 for SSE and 8 for AVX.

=======================================

TEST 0: l = sqrt((x*x) + (y*y) + (z*z))

=======================================

Seq time: 3.681436e-01

SSE time: 9.068346e-02

AVX time: 9.062290e-02

Speed-up Seq vs SSE : 4.06

Speed-up Seq vs AVX : 4.06

=======================================

TEST 1: l = x + y + z

=======================================

Seq time: 3.898194e-01

SSE time: 1.120391e-01

AVX time: 1.076577e-01

Speed-up Seq vs SSE : 3.48

Speed-up Seq vs AVX : 3.62

Joaquin_Tarraga · ‎06-06-2012

it's true, sorry !

bronxzv · ‎06-06-2012

If I work by chunks of 1024 floats, i.e., now I have two loops, the outer one is: for (int i = 0; i < 90000000; i += 1024),

it's not perfectly clear reading this that you work with a L1 cache-blocked buffer (i.e. that you do the same computations a lot of times redundantly to measure the timings withhigh L1D hit %), I'll advise to post full source code

Joaquin_Tarraga · ‎06-06-2012

sorry, see next comment!

Joaquin_Tarraga · ‎06-06-2012

Exactly.

What I want to know is why there's no difference between SSE and AVX for two simple functions, the first calculates sqrt((x*x) + (y*y) + (z*z)), and the second calculates x + y + z, in both cases, I get the same speed-up for SSE and AVX, when AVX speed-up should be two times SSE speed-up, right?

Playing with the 'chunk' value (see code below), when chunk > 8 * 1024, the speed-up decreases from around 4.19 to around 2.33

len = 90000000;

chunk = 1024;

// AVX

x = (float*)_mm_malloc(chunk * sizeof(float), 32);

y = (float*)_mm_malloc(chunk * sizeof(float), 32);

z = (float*)_mm_malloc(chunk * sizeof(float), 32);

l = (float*)_mm_malloc(chunk * sizeof(float), 32);

for(int j = 0; j < chunk; j++) {

x = j*1.0;

y = j*2.0;

z = j*3.0;

}

partial_t = tic();

for (int i = 0; i < len; i += chunk) {

add_avx1(x, y, z, l, chunk);

}

avx_t += toc(partial_t);

_mm_free(x); _mm_free(y); _mm_free(z); _mm_free(l);

Thanks again for your comments

bronxzv · ‎06-06-2012

for big chunks you are clearly memory bandwidth bound, it's normal thatyou see drastic changes when you overflow the L1D cache (chunk > ~ 2000), then the L2 cache (chunk> ~16000) and eventually the LLC (not sure about your Xeon LLC capacity)

now, for chunk = 1000 you should see better speedups from AVX-256 vs SSE for the cases not using VSQRTPS, I'll expect something like 1.3x speedup

to ensure good timings I'll advise to disable enhanced speedstep and the turbo mode

now that we have agreed on the test procedure I suggest to post an ASM dump of the code of the two inner loops you are comparing, the SSE and AVX-256 version of the simple 3 x add case

Joaquin_Tarraga · ‎06-06-2012

That's right, bronxzv. But why I can't get the ideal speed-up of 8 (or close) for AVX, the maximum, I had, was 3.97, too far, even when I run with little chunks (such as 32, 64, 128, 256, 512)? Using SSE, I'm close to the ideal speed-up of 4 with chunks of 512.

Joaquin_Tarraga · ‎06-06-2012

Here you are (sorry but I don't know how to put it in a fancy format like you do):

00000000004020a0 :

4020a0: 45 85 c0 test %r8d,%r8d

4020a3: 74 2c je 4020d1

4020a5: 31 c0 xor %eax,%eax

4020a7: 45 31 c9 xor %r9d,%r9d

4020aa: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)

4020b0: c5 f8 28 04 07 vmovaps (%rdi,%rax,1),%xmm0

4020b5: 41 83 c1 04 add $0x4,%r9d

4020b9: c5 f8 58 04 06 vaddps (%rsi,%rax,1),%xmm0,%xmm0

4020be: c5 f8 58 04 02 vaddps (%rdx,%rax,1),%xmm0,%xmm0

4020c3: c5 f8 29 04 01 vmovaps %xmm0,(%rcx,%rax,1)

4020c8: 48 83 c0 10 add $0x10,%rax

4020cc: 45 39 c8 cmp %r9d,%r8d

4020cf: 77 df ja 4020b0

4020d1: f3 c3 repz retq

4020d3: 66 66 66 66 2e 0f 1f data32 data32 data32 nopw %cs:0x0(%rax,%rax,1)

4020da: 84 00 00 00 00 00

00000000004020e0 :

4020e0: 55 push %rbp

4020e1: 48 89 e5 mov %rsp,%rbp

4020e4: 48 83 e4 e0 and $0xffffffffffffffe0,%rsp

4020e8: 48 83 c4 10 add $0x10,%rsp

4020ec: 45 85 c0 test %r8d,%r8d

4020ef: 74 30 je 402121

4020f1: 31 c0 xor %eax,%eax

4020f3: 45 31 c9 xor %r9d,%r9d

4020f6: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)

4020fd: 00 00 00

402100: c5 fc 28 04 07 vmovaps (%rdi,%rax,1),%ymm0

402105: 41 83 c1 08 add $0x8,%r9d

402109: c5 fc 58 04 06 vaddps (%rsi,%rax,1),%ymm0,%ymm0

40210e: c5 fc 58 04 02 vaddps (%rdx,%rax,1),%ymm0,%ymm0

402113: c5 fc 29 04 01 vmovaps %ymm0,(%rcx,%rax,1)

402118: 48 83 c0 20 add $0x20,%rax

40211c: 45 39 c8 cmp %r9d,%r8d

40211f: 77 df ja 402100

402121: c9 leaveq

402122: c5 f8 77 vzeroupper

402125: c3 retq

402126: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)

40212d: 00 00 00

bronxzv · ‎06-06-2012

a big limiter is the fact that the 2 load ports and the store port are only 128-bit wide, it's a perfetct match for SSE and AVX-128 but usually a bottleneck for AVX-256

bronxzv · ‎06-06-2012

I note that you are effectively not comparing SSE with AVX but AVX-128 with AVX-256.

I don't see obvious problems with your code but since youhave only 2 fast computation instructionsfor 3 loads+ 1 store I'm afraid the load/store bottleneck I was refering to is the cause of the deceptive speedup

out of curiosity I'll study the timings on my own

if you want more convincing speedup from AVX-256 you can try to do more work on registers within the same loop, for example compute a bounding box for your vectors, you'll have typically instructions like

vminps ymm0,ymm6
vmaxps ymm0,ymm7

with ymm0 already loaded for the other computation and no more load/store

in my case it's the use case with the best observed speedup, more than 1.8 x for AVX-256 vs SSE

btw to add code snippets to this forum without going crazy simply write it in your favorite text editor then click on the icon with the orange pen then paste it in the edit box and specifythe syntax, for ex. C++

bronxzv · ‎06-06-2012

I have tested yoursimple add example and I find a very good 2x speedup(AVX-256 vs AVX-128) for the workloads fitting in the L1D cache, that's a lot better than I was expecting, when the L1D cache is overflowed timings are sometimes worse for AVX-256 than for AVX-128, though, then for big workloads timings are much the same (as expected) since we are mostly RAM bandwidth bound

my timings are as follows:

Core i7 3770K@ 3.5 GHz, enhanced speedstep disabled, turbo off
woking set size: AVX-128 time AVX-256 time
[plain] 128: 22.5 ms 12.9 ms 256: 19.3 ms 11.2 ms 512: 19.4 ms 9.63 ms 1024: 19.3 ms 9.84 ms 2048: 19.2 ms 9.72 ms 4096: 20.8 ms 10.1 ms 8192: 20.1 ms 10.7 ms 16384: 19.7 ms 10.1 ms 32768: 19.5 ms 17.6 ms 65536: 24.5 ms 28.8 ms 131072: 23.6 ms 28.3 ms 262144: 28.4 ms 34 ms 524288: 38.8 ms 40.6 ms 1048576: 39 ms 41.5 ms 2097152: 39.1 ms 41.2 ms 4194304: 41 ms 43.1 ms 8388608: 53.6 ms 50.8 ms 16777216: 94.6 ms 85.9 ms 33554432: 110 ms 109 ms 67108864: 115 ms 113 ms 134217728: 118 ms 113 ms[/plain]
source code:

[cpp]template inline T *AAlloc(size_t size) { return (T *)_aligned_malloc(sizeof(T)*size,32); } inline void AFree(void *p) { if (p) _aligned_free(p); } void AddTestAVX128(const float *x, const float *y, const float *z, float *l, unsigned int length) { for (unsigned int i=0; i(chunkSize), *y = AAlloc(chunkSize), *z = AAlloc(chunkSize), *l = AAlloc(chunkSize); for (int j=0; j = j*1.0; y = j*2.0; z = j*3.0;} Chrono chrono(""); const float start = chrono.getTime(); for (int i=0; i
// main call:
for (int chunkSize=8; chunkSize<10000000; chunkSize<<=1) JTTest(chunkSize);[/cpp]

ASM dumps :

[plain].B51.3:: ; Preds .B51.3 .B51.2 ;;; { ;;; const __m128 px = _mm_load_ps(x+i), py = _mm_load_ps(y+i), pz = _mm_load_ps(z+i); vmovups xmm0, XMMWORD PTR [rcx+r10*4] ;478.35 add eax, 4 ;476.36 ;;; _mm_store_ps(l+i,_mm_add_ps(_mm_add_ps(px,py),pz)); vaddps xmm1, xmm0, XMMWORD PTR [rdx+r10*4] ;479.33 vaddps xmm2, xmm1, XMMWORD PTR [r8+r10*4] ;479.22 vmovups XMMWORD PTR [r9+r10*4], xmm2 ;479.18 mov r10d, eax ;476.36 cmp eax, r11d ;476.28 jb .B51.3 ; Prob 82% ;476.28[/plain]

[plain] .B52.3:: ; Preds .B52.3 .B52.2 ;;; { ;;; const __m256 px = _mm256_load_ps(x+i), py = _mm256_load_ps(y+i), pz = _mm256_load_ps(z+i); vmovups ymm0, YMMWORD PTR [rcx+r10*4] ;487.38 add eax, 8 ;485.36 ;;; _mm256_store_ps(l+i,_mm256_add_ps(_mm256_add_ps(px,py),pz)); vaddps ymm1, ymm0, YMMWORD PTR [rdx+r10*4] ;488.39 vaddps ymm2, ymm1, YMMWORD PTR [r8+r10*4] ;488.25 vmovups YMMWORD PTR [r9+r10*4], ymm2 ;488.21 mov r10d, eax ;485.36 cmp eax, r11d ;485.28 jb .B52.3 ; Prob 82% ;485.28 [/plain]

Joaquin_Tarraga · ‎06-07-2012

I think it's the best speed-up we can get.

Thanks a lot, bronxzv, for all your time and posts,

bronxzv · ‎06-07-2012

nothing, actually I learned something from this experiment: loads/stores are far less a bottleneck than I was expecting and AVX-256 can be actually slower than AVX-128 for some working set sizes!

sinceyou basically test the same example you should be able to see the save nice speedup for small chunks (chunk size =~ 1000, working set =~ 16 KB), i.e. nearly a 8x speedup vs. your scalar path

all of this show very well, one more time,how important it is to use cache blocking techniques whenever it's possible