- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Hi,

Please, look at these pieces of code, consisting of three versions to

calculate the length of a set of 3-D vectors.

Let's assume, vector v, with components x, y, z.

The length (l) of vector v, is l = sqrt((x*x) + (y*y) + (z*z))

I implemented three versions based on scalar, SSE and AVX instruction, to compute the

length of 90 000 000 vectors. I hope to get much better performance using SSE,

and AVX, but no...., here the results:

=======================================

TEST 0: l = sqrt((x*x) + (y*y) + (z*z))

=======================================

Scalar time: 0.46051

SSE time : 0.18613

AVX time : 0.19043

Speed-up Scalar vs SSE : 2.47

Speed-up Scalar vs AVX : 2.42

I hope a speed-up of 4 when using SSE, a much more with AVX,

but there is no difference between SSE and AVX.

Target architecture:

- Intel Xeon CPU E31245 @ 3.30GHz
- 4 CPU dual-core (but I only use one core)

Command line to compile:

gcc -O3 -std=c99 -mavx main.c -o main -lm

And the code:

Allocating memory for the SSE version:

x = (float*)_mm_malloc(len * sizeof(float), 16);

y =(float*)_mm_malloc(len * sizeof(float), 16);

....

Allocating memory for the AVX version:

x = (float*)_mm_malloc(len * sizeof(float), 32);

y =(float*)_mm_malloc(len * sizeof(float), 32);

....

//----------------------------------------------------------------------------------------------------------------------

void length_scalar(float *x, float *y, float *z, float *l, unsigned int length) {

for (int i = 0; i

l

*= sqrt((x***x**) + (y***y**) + (z***z**));* }

}

//----------------------------------------------------------------------------------------------------------------------

void length_sse(float *x, float *y, float *z, float *l, unsigned int length) {

__m128 xmm0, xmm1, xmm2, xmm3;

for (int i = 0; i //----------------------------------------------------------------------------------------------------------------------

xmm0 = _mm_load_ps(&x

*);* xmm1 = _mm_load_ps(&y

*);* xmm2 = _mm_load_ps(&z

*);* xmm3 = _mm_add_ps(_mm_mul_ps(xmm0, xmm0), _mm_mul_ps(xmm1, xmm1));

xmm3 = _mm_add_ps(_mm_mul_ps(xmm2, xmm2), xmm3);

xmm3 = _mm_sqrt_ps(xmm3);

_mm_store_ps(&l

*, xmm3);* }

}

//----------------------------------------------------------------------------------------------------------------------

void length_avx(float *x, float *y, float *z, float *l, unsigned int length) {

__m256 ymm0, ymm1, ymm2, ymm3;

for (int i = 0; i

ymm0 = _mm256_load_ps(&x

*);* ymm1 = _mm256_load_ps(&y

*);* ymm2 = _mm256_load_ps(&z

*);* ymm3 = _mm256_add_ps(_mm256_mul_ps(ymm0, ymm0), _mm256_mul_ps(ymm1, ymm1));

ymm3 = _mm256_add_ps(_mm256_mul_ps(ymm2, ymm2), ymm3);

ymm3 = _mm256_sqrt_ps(ymm3);

_mm256_store_ps(&l

*, ymm3);* }

Could you, please, give me some hints, suggestions....to explain that?

I think it is due to the 4 instructions to move data (memory /register, i.e., the load and store instructions),

what do you think?

If I ran a example more simple (addition of the 3 components of a vector, for 90 000 000 vectors)

and I got worse results:

=======================================

TEST 1: l = x + y + z

=======================================

Scalar time: 0.61573

SSE time : 0.34304

AVX time : 0.34770

Speed-up Scalar vs SSE : 1.79

Speed-up Scalar vs AVX : 1.77

Any idea?

Thanks a lot

--

Joaqun

1 Solution

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

my timings are as follows:

Core i7 3770K@ 3.5 GHz, enhanced speedstep disabled, turbo off

woking set size: AVX-128 time AVX-256 time

[plain] 128: 22.5 ms 12.9 ms 256: 19.3 ms 11.2 ms 512: 19.4 ms 9.63 ms 1024: 19.3 ms 9.84 ms 2048: 19.2 ms 9.72 ms 4096: 20.8 ms 10.1 ms 8192: 20.1 ms 10.7 ms 16384: 19.7 ms 10.1 ms 32768: 19.5 ms 17.6 ms 65536: 24.5 ms 28.8 ms 131072: 23.6 ms 28.3 ms 262144: 28.4 ms 34 ms 524288: 38.8 ms 40.6 ms 1048576: 39 ms 41.5 ms 2097152: 39.1 ms 41.2 ms 4194304: 41 ms 43.1 ms 8388608: 53.6 ms 50.8 ms 16777216: 94.6 ms 85.9 ms 33554432: 110 ms 109 ms 67108864: 115 ms 113 ms 134217728: 118 ms 113 ms[/plain]

source code:

[cpp]template

// main call:

for (int chunkSize=8; chunkSize<10000000; chunkSize<<=1) JTTest(chunkSize);[/cpp]

ASM dumps :

[plain].B51.3:: ; Preds .B51.3 .B51.2 ;;; { ;;; const __m128 px = _mm_load_ps(x+i), py = _mm_load_ps(y+i), pz = _mm_load_ps(z+i); vmovups xmm0, XMMWORD PTR [rcx+r10*4] ;478.35 add eax, 4 ;476.36 ;;; _mm_store_ps(l+i,_mm_add_ps(_mm_add_ps(px,py),pz)); vaddps xmm1, xmm0, XMMWORD PTR [rdx+r10*4] ;479.33 vaddps xmm2, xmm1, XMMWORD PTR [r8+r10*4] ;479.22 vmovups XMMWORD PTR [r9+r10*4], xmm2 ;479.18 mov r10d, eax ;476.36 cmp eax, r11d ;476.28 jb .B51.3 ; Prob 82% ;476.28[/plain]

[plain] .B52.3:: ; Preds .B52.3 .B52.2 ;;; { ;;; const __m256 px = _mm256_load_ps(x+i), py = _mm256_load_ps(y+i), pz = _mm256_load_ps(z+i); vmovups ymm0, YMMWORD PTR [rcx+r10*4] ;487.38 add eax, 8 ;485.36 ;;; _mm256_store_ps(l+i,_mm256_add_ps(_mm256_add_ps(px,py),pz)); vaddps ymm1, ymm0, YMMWORD PTR [rdx+r10*4] ;488.39 vaddps ymm2, ymm1, YMMWORD PTR [r8+r10*4] ;488.25 vmovups YMMWORD PTR [r9+r10*4], ymm2 ;488.21 mov r10d, eax ;485.36 cmp eax, r11d ;485.28 jb .B52.3 ; Prob 82% ;485.28 [/plain]

Link Copied

21 Replies

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

what I'llsuggest you to do totest the speed of the kernel in isolation:

- work with a small working set that fit in the L1D cache, for example 1000 vectors only

- repeat the test a lof of time with an outer loop (for example executed1 million times) to have accurate timings

then the speedupvs the scalar version shouldbe better (provided that the scalar version is really scalar, i.e. not vectorized by the compiler),forAVX vs SSE the botlleneckis clearly the(high latency / low throughput) sqrt whichhas the same throughput on Sandy Bridge with SSE and AVX-256, Ivy Bridge will enjoy a better speedup heresince the throughput is doubled for AVX-256

if youtry a fast sqrt approximation such as _mm256_mul_ps(m,_mm256_rsqrt_ps(m)) you should see a betterAVX-256vs SSE speedupon Sandy Bridge

now this code is heavily load port bound apparently so I'll not expect better than 1.3x speedup from AVX-256 vs. SSE even with a workload that fit at 100% in theL1D cache

just one more thing, you can also help your compiler a bit by using the const keyword when it applies such as :

void length_sse(const float *x, const float *y, const float *z, float *l, unsigned int length)

EDIT : I see that you use in fact AVX-128 (!) so the first thing to do is to switch to AVX-256 to hope for anyspeedup, i.e. use _mm256_load_ps, _mm256_mul_ps etc.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

it's even more important when you normalize vectors (a very common use case for 3D applications) to useNewton-Raphson since in this case it's 1 rsqrt + 4 mul + 1 sub instead of 1 sqrt + 1 div (i.e. a chain of very high latency / low throughput instructions)

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

the way you name your variables is very confusing, the code generator automatically choose the register it wants to use so it may well use ymm6 in 32-bit and ymm13 in 64-bit mode for what you call "ymm0" in your code, it may be confusing if you have to do low level debugging and it's not very readable for the maintener of this code

I'll advise to use another notation such as :

const __m256 px = _mm256_load_ps(x+i);

"px" for "packed x", other ideas include"ox" for "octo x", etc. your code will then look like my example below, for any complex project I'll advise to use at least operator overloading if your vectorizer isn't up to the task

[cpp]inline __m256 Sqr(const __m256 &px) {return _mm256_mul_ps(px,px);} void length_avx(const float *x, const float *y, const float *z, float *l, unsigned int length) { for (unsigned int i=0; i

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Thanks for your comments and suggestions, and for the included code (bronxzv)

If I work by chunks of 1024 floats, i.e., now I have two loops, the outer one is: for (int i = 0; i < 90000000; i += 1024),

I get a speed-up of 4 (aprox.) for, both, SSE and AVX (see results below).

TimP (Intel) commented: "As the AVX-256 sqrt is sequenced as 2 128-bit sqrt instructions, you could expect little difference in performance between SSE and AVX-256". It couldexplain why there's no difference between SSE and AVX in test0 (sqrt), but in test1 (where only three additions are performed) the speed-up is the same for AVX and SSE !

By executing "objdump -S ", I checked the assembly instructions are AVX, vaddps, vmovaps,vmulps, vsqrtps..., for both the SSE and the AVX functions, the difference cames from the inner loop 'step/stride' value, 4 for SSE and 8 for AVX.

=======================================

TEST 0: l = sqrt((x*x) + (y*y) + (z*z))

=======================================

Seq time: 3.681436e-01

SSE time: 9.068346e-02

AVX time: 9.062290e-02

Speed-up Seq vs SSE : 4.06

Speed-up Seq vs AVX : 4.06

=======================================

TEST 1: l = x + y + z

=======================================

Seq time: 3.898194e-01

SSE time: 1.120391e-01

AVX time: 1.076577e-01

Speed-up Seq vs SSE : 3.48

Speed-up Seq vs AVX : 3.62

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

it's true, sorry !

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

it's not perfectly clear reading this that you work with a L1 cache-blocked buffer (i.e. that you do the same computations a lot of times redundantly to measure the timings withhigh L1D hit %), I'll advise to post full source codeIf I work by chunks of 1024 floats, i.e., now I have two loops, the outer one is: for (int i = 0; i < 90000000; i += 1024),

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

sorry, see next comment!

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Exactly.

What I want to know is why there's no difference between SSE and AVX for two simple functions, the first calculates sqrt((x*x) + (y*y) + (z*z)), and the second calculates x + y + z, in both cases, I get the same speed-up for SSE and AVX, when AVX speed-up should be two times SSE speed-up, right?

Playing with the 'chunk' value (see code below), when chunk > 8 * 1024, the speed-up decreases from around 4.19 to around 2.33

len = 90000000;

chunk = 1024;

// AVX

x = (float*)_mm_malloc(chunk * sizeof(float), 32);

y = (float*)_mm_malloc(chunk * sizeof(float), 32);

z = (float*)_mm_malloc(chunk * sizeof(float), 32);

l = (float*)_mm_malloc(chunk * sizeof(float), 32);

for(int j = 0; j < chunk; j++) {

x = j*1.0;

y = j*2.0;

z = j*3.0;

}

partial_t = tic();

for (int i = 0; i < len; i += chunk) {

add_avx1(x, y, z, l, chunk);

}

avx_t += toc(partial_t);

_mm_free(x); _mm_free(y); _mm_free(z); _mm_free(l);

Thanks again for your comments

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

now, for chunk = 1000 you should see better speedups from AVX-256 vs SSE for the cases not using VSQRTPS, I'll expect something like 1.3x speedup

to ensure good timings I'll advise to disable enhanced speedstep and the turbo mode

now that we have agreed on the test procedure I suggest to post an ASM dump of the code of the two inner loops you are comparing, the SSE and AVX-256 version of the simple 3 x add case

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Here you are (sorry but I don't know how to put it in a fancy format like you do):

00000000004020a0 :

4020a0: 45 85 c0 test %r8d,%r8d

4020a3: 74 2c je 4020d1

4020a5: 31 c0 xor %eax,%eax

4020a7: 45 31 c9 xor %r9d,%r9d

4020aa: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)

4020b0: c5 f8 28 04 07 vmovaps (%rdi,%rax,1),%xmm0

4020b5: 41 83 c1 04 add $0x4,%r9d

4020b9: c5 f8 58 04 06 vaddps (%rsi,%rax,1),%xmm0,%xmm0

4020be: c5 f8 58 04 02 vaddps (%rdx,%rax,1),%xmm0,%xmm0

4020c3: c5 f8 29 04 01 vmovaps %xmm0,(%rcx,%rax,1)

4020c8: 48 83 c0 10 add $0x10,%rax

4020cc: 45 39 c8 cmp %r9d,%r8d

4020cf: 77 df ja 4020b0

4020d1: f3 c3 repz retq

4020d3: 66 66 66 66 2e 0f 1f data32 data32 data32 nopw %cs:0x0(%rax,%rax,1)

4020da: 84 00 00 00 00 00

00000000004020e0 :

4020e0: 55 push %rbp

4020e1: 48 89 e5 mov %rsp,%rbp

4020e4: 48 83 e4 e0 and $0xffffffffffffffe0,%rsp

4020e8: 48 83 c4 10 add $0x10,%rsp

4020ec: 45 85 c0 test %r8d,%r8d

4020ef: 74 30 je 402121

4020f1: 31 c0 xor %eax,%eax

4020f3: 45 31 c9 xor %r9d,%r9d

4020f6: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)

4020fd: 00 00 00

402100: c5 fc 28 04 07 vmovaps (%rdi,%rax,1),%ymm0

402105: 41 83 c1 08 add $0x8,%r9d

402109: c5 fc 58 04 06 vaddps (%rsi,%rax,1),%ymm0,%ymm0

40210e: c5 fc 58 04 02 vaddps (%rdx,%rax,1),%ymm0,%ymm0

402113: c5 fc 29 04 01 vmovaps %ymm0,(%rcx,%rax,1)

402118: 48 83 c0 20 add $0x20,%rax

40211c: 45 39 c8 cmp %r9d,%r8d

40211f: 77 df ja 402100

402121: c9 leaveq

402122: c5 f8 77 vzeroupper

402125: c3 retq

402126: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)

40212d: 00 00 00

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

I don't see obvious problems with your code but since youhave only 2 fast computation instructionsfor 3 loads+ 1 store I'm afraid the load/store bottleneck I was refering to is the cause of the deceptive speedup

out of curiosity I'll study the timings on my own

if you want more convincing speedup from AVX-256 you can try to do more work on registers within the same loop, for example compute a bounding box for your vectors, you'll have typically instructions like

vminps ymm0,ymm6

vmaxps ymm0,ymm7

with ymm0 already loaded for the other computation and no more load/store

in my case it's the use case with the best observed speedup, more than 1.8 x for AVX-256 vs SSE

btw to add code snippets to this forum without going crazy simply write it in your favorite text editor then click on the icon with the orange pen then paste it in the edit box and specifythe syntax, for ex. C++

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

my timings are as follows:

Core i7 3770K@ 3.5 GHz, enhanced speedstep disabled, turbo off

woking set size: AVX-128 time AVX-256 time

[plain] 128: 22.5 ms 12.9 ms 256: 19.3 ms 11.2 ms 512: 19.4 ms 9.63 ms 1024: 19.3 ms 9.84 ms 2048: 19.2 ms 9.72 ms 4096: 20.8 ms 10.1 ms 8192: 20.1 ms 10.7 ms 16384: 19.7 ms 10.1 ms 32768: 19.5 ms 17.6 ms 65536: 24.5 ms 28.8 ms 131072: 23.6 ms 28.3 ms 262144: 28.4 ms 34 ms 524288: 38.8 ms 40.6 ms 1048576: 39 ms 41.5 ms 2097152: 39.1 ms 41.2 ms 4194304: 41 ms 43.1 ms 8388608: 53.6 ms 50.8 ms 16777216: 94.6 ms 85.9 ms 33554432: 110 ms 109 ms 67108864: 115 ms 113 ms 134217728: 118 ms 113 ms[/plain]

source code:

[cpp]template

// main call:

for (int chunkSize=8; chunkSize<10000000; chunkSize<<=1) JTTest(chunkSize);[/cpp]

ASM dumps :

[plain].B51.3:: ; Preds .B51.3 .B51.2 ;;; { ;;; const __m128 px = _mm_load_ps(x+i), py = _mm_load_ps(y+i), pz = _mm_load_ps(z+i); vmovups xmm0, XMMWORD PTR [rcx+r10*4] ;478.35 add eax, 4 ;476.36 ;;; _mm_store_ps(l+i,_mm_add_ps(_mm_add_ps(px,py),pz)); vaddps xmm1, xmm0, XMMWORD PTR [rdx+r10*4] ;479.33 vaddps xmm2, xmm1, XMMWORD PTR [r8+r10*4] ;479.22 vmovups XMMWORD PTR [r9+r10*4], xmm2 ;479.18 mov r10d, eax ;476.36 cmp eax, r11d ;476.28 jb .B51.3 ; Prob 82% ;476.28[/plain]

[plain] .B52.3:: ; Preds .B52.3 .B52.2 ;;; { ;;; const __m256 px = _mm256_load_ps(x+i), py = _mm256_load_ps(y+i), pz = _mm256_load_ps(z+i); vmovups ymm0, YMMWORD PTR [rcx+r10*4] ;487.38 add eax, 8 ;485.36 ;;; _mm256_store_ps(l+i,_mm256_add_ps(_mm256_add_ps(px,py),pz)); vaddps ymm1, ymm0, YMMWORD PTR [rdx+r10*4] ;488.39 vaddps ymm2, ymm1, YMMWORD PTR [r8+r10*4] ;488.25 vmovups YMMWORD PTR [r9+r10*4], ymm2 ;488.21 mov r10d, eax ;485.36 cmp eax, r11d ;485.28 jb .B52.3 ; Prob 82% ;485.28 [/plain]

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

I think it's the best speed-up we can get.

Thanks a lot, bronxzv, for all your time and posts,

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

sinceyou basically test the same example you should be able to see the save nice speedup for small chunks (chunk size =~ 1000, working set =~ 16 KB), i.e. nearly a 8x speedup vs. your scalar path

all of this show very well, one more time,how important it is to use cache blocking techniques whenever it's possible

Topic Options

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page