Intel® ISA Extensions

Comparing scalar, SSE and AVX basics... poor performance?

Joaquin_Tarraga
Beginner
Hi,
Please look at these pieces of code, consisting of three versions to
calculate the length of a set of 3-D vectors.
Let's assume a vector v with components x, y, z.
The length l of vector v is l = sqrt((x*x) + (y*y) + (z*z)).
I implemented three versions based on scalar, SSE and AVX instructions to compute the
length of 90,000,000 vectors. I expected to get much better performance using SSE
and AVX, but no... here are the results:
=======================================
TEST 0: l = sqrt((x*x) + (y*y) + (z*z))
=======================================
Scalar time: 0.46051
SSE time : 0.18613
AVX time : 0.19043
Speed-up Scalar vs SSE : 2.47
Speed-up Scalar vs AVX : 2.42
I expected a speed-up of 4 when using SSE, and much more with AVX,
but there is no difference between SSE and AVX.
Target architecture:
  • Intel Xeon CPU E3-1245 @ 3.30 GHz
  • 4 cores, 2 threads per core (but I only use one core)
Command line to compile:
gcc -O3 -std=c99 -mavx main.c -o main -lm
And the code:
Allocating memory for the SSE version:
x = (float*)_mm_malloc(len * sizeof(float), 16);
y = (float*)_mm_malloc(len * sizeof(float), 16);
....
Allocating memory for the AVX version:
x = (float*)_mm_malloc(len * sizeof(float), 32);
y = (float*)_mm_malloc(len * sizeof(float), 32);
....
//----------------------------------------------------------------------------------------------------------------------
void length_scalar(float *x, float *y, float *z, float *l, unsigned int length) {
  // one vector per iteration
  for (int i = 0; i < length; i++) {
    l[i] = sqrt((x[i] * x[i]) + (y[i] * y[i]) + (z[i] * z[i]));
  }
}
//----------------------------------------------------------------------------------------------------------------------
void length_sse(float *x, float *y, float *z, float *l, unsigned int length) {
  __m128 xmm0, xmm1, xmm2, xmm3;
  // 4 vectors per iteration: a 128-bit register holds 4 floats
  for (int i = 0; i < length; i += 4) {
    xmm0 = _mm_load_ps(&x[i]);
    xmm1 = _mm_load_ps(&y[i]);
    xmm2 = _mm_load_ps(&z[i]);
    xmm3 = _mm_add_ps(_mm_mul_ps(xmm0, xmm0), _mm_mul_ps(xmm1, xmm1));
    xmm3 = _mm_add_ps(_mm_mul_ps(xmm2, xmm2), xmm3);
    xmm3 = _mm_sqrt_ps(xmm3);
    _mm_store_ps(&l[i], xmm3);
  }
}
//----------------------------------------------------------------------------------------------------------------------
void length_avx(float *x, float *y, float *z, float *l, unsigned int length) {
  __m256 ymm0, ymm1, ymm2, ymm3;
  // 8 vectors per iteration: a 256-bit register holds 8 floats
  for (int i = 0; i < length; i += 8) {
    ymm0 = _mm256_load_ps(&x[i]);
    ymm1 = _mm256_load_ps(&y[i]);
    ymm2 = _mm256_load_ps(&z[i]);
    ymm3 = _mm256_add_ps(_mm256_mul_ps(ymm0, ymm0), _mm256_mul_ps(ymm1, ymm1));
    ymm3 = _mm256_add_ps(_mm256_mul_ps(ymm2, ymm2), ymm3);
    ymm3 = _mm256_sqrt_ps(ymm3);
    _mm256_store_ps(&l[i], ymm3);
  }
}
//----------------------------------------------------------------------------------------------------------------------
Could you please give me some hints or suggestions to explain this?
I think it is due to the 4 data-movement instructions per iteration (memory/register, i.e., the load and store instructions).
What do you think?
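One way to sanity-check that hunch is to work out the memory bandwidth these timings imply. A small back-of-the-envelope sketch (my own estimate, not measured code: it assumes 16 bytes of traffic per vector, i.e. x, y, z read plus l written, and ignores the extra read-for-ownership traffic on the stores):
[cpp]
#include <stdio.h>

int main(void) {
  /* timings reported above for TEST 0 */
  const double t_scalar = 0.46051, t_sse = 0.18613, t_avx = 0.19043;
  /* 90,000,000 vectors, 4 floats touched per vector (x, y, z, l) */
  const double bytes = 90e6 * 4.0 * sizeof(float);
  printf("data streamed per run: %.2f GB\n", bytes / 1e9);
  printf("scalar: %.2f GB/s\n", bytes / 1e9 / t_scalar);
  printf("SSE   : %.2f GB/s\n", bytes / 1e9 / t_sse);
  printf("AVX   : %.2f GB/s\n", bytes / 1e9 / t_avx);
  return 0;
}
[/cpp]
If the SSE and AVX versions both land at roughly the same GB/s figure, that would point to the loads and stores (i.e., memory bandwidth) being the limit rather than the arithmetic.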
I also ran a simpler example (adding the 3 components of each vector, again for 90,000,000 vectors)
and the results were even worse:
=======================================
TEST 1: l = x + y + z
=======================================
Scalar time: 0.61573
SSE time : 0.34304
AVX time : 0.34770
Speed-up Scalar vs SSE : 1.79
Speed-up Scalar vs AVX : 1.77
Any idea?
Thanks a lot
--
Joaquin
Accepted solution
bronxzv
New Contributor II
I have tested your simple add example and I find a very good 2x speedup (AVX-256 vs. AVX-128) for the workloads fitting in the L1D cache, which is a lot better than I was expecting. When the L1D cache is overflowed, timings are sometimes worse for AVX-256 than for AVX-128, though. Then, for big workloads, timings are much the same (as expected) since we are mostly RAM-bandwidth bound.

my timings are as follows:

Core i7 3770K @ 3.5 GHz, enhanced SpeedStep disabled, turbo off
working set size : AVX-128 time / AVX-256 time
[plain]
      128 :  22.5 ms   12.9 ms
      256 :  19.3 ms   11.2 ms
      512 :  19.4 ms   9.63 ms
     1024 :  19.3 ms   9.84 ms
     2048 :  19.2 ms   9.72 ms
     4096 :  20.8 ms   10.1 ms
     8192 :  20.1 ms   10.7 ms
    16384 :  19.7 ms   10.1 ms
    32768 :  19.5 ms   17.6 ms
    65536 :  24.5 ms   28.8 ms
   131072 :  23.6 ms   28.3 ms
   262144 :  28.4 ms   34 ms
   524288 :  38.8 ms   40.6 ms
  1048576 :  39 ms     41.5 ms
  2097152 :  39.1 ms   41.2 ms
  4194304 :  41 ms     43.1 ms
  8388608 :  53.6 ms   50.8 ms
 16777216 :  94.6 ms   85.9 ms
 33554432 :  110 ms    109 ms
 67108864 :  115 ms    113 ms
134217728 :  118 ms    113 ms
[/plain]
source code:

[cpp]
template <class T> inline T *AAlloc(size_t size) { return (T *)_aligned_malloc(sizeof(T)*size, 32); }

inline void AFree(void *p) { if (p) _aligned_free(p); }

void AddTestAVX128(const float *x, const float *y, const float *z, float *l, unsigned int length) {
  for (unsigned int i = 0; i < length; i += 4) {
    const __m128 px = _mm_load_ps(x+i), py = _mm_load_ps(y+i), pz = _mm_load_ps(z+i);
    _mm_store_ps(l+i, _mm_add_ps(_mm_add_ps(px, py), pz));
  }
}

void AddTestAVX256(const float *x, const float *y, const float *z, float *l, unsigned int length) {
  for (unsigned int i = 0; i < length; i += 8) {
    const __m256 px = _mm256_load_ps(x+i), py = _mm256_load_ps(y+i), pz = _mm256_load_ps(z+i);
    _mm256_store_ps(l+i, _mm256_add_ps(_mm256_add_ps(px, py), pz));
  }
}

void JTTest(int chunkSize) {
  float *x = AAlloc<float>(chunkSize), *y = AAlloc<float>(chunkSize),
        *z = AAlloc<float>(chunkSize), *l = AAlloc<float>(chunkSize);
  for (int j = 0; j < chunkSize; j++) { x[j] = j*1.0; y[j] = j*2.0; z[j] = j*3.0; }
  Chrono chrono("");
  const float start = chrono.getTime();
  // ... (the timed loops calling AddTestAVX128/AddTestAVX256 and the timing
  //      printout were mangled by the forum formatting and are not recoverable)
}

// main call:
for (int chunkSize = 8; chunkSize < 10000000; chunkSize <<= 1) JTTest(chunkSize);
[/cpp]

ASM dumps:

[plain]
.B51.3::                         ; Preds .B51.3 .B51.2
;;;     {
;;;       const __m128 px = _mm_load_ps(x+i), py = _mm_load_ps(y+i), pz = _mm_load_ps(z+i);
        vmovups   xmm0, XMMWORD PTR [rcx+r10*4]         ;478.35
        add       eax, 4                                ;476.36
;;;       _mm_store_ps(l+i,_mm_add_ps(_mm_add_ps(px,py),pz));
        vaddps    xmm1, xmm0, XMMWORD PTR [rdx+r10*4]   ;479.33
        vaddps    xmm2, xmm1, XMMWORD PTR [r8+r10*4]    ;479.22
        vmovups   XMMWORD PTR [r9+r10*4], xmm2          ;479.18
        mov       r10d, eax                             ;476.36
        cmp       eax, r11d                             ;476.28
        jb        .B51.3        ; Prob 82%              ;476.28
[/plain]


[plain]
.B52.3::                         ; Preds .B52.3 .B52.2
;;;     {
;;;       const __m256 px = _mm256_load_ps(x+i), py = _mm256_load_ps(y+i), pz = _mm256_load_ps(z+i);
        vmovups   ymm0, YMMWORD PTR [rcx+r10*4]         ;487.38
        add       eax, 8                                ;485.36
;;;       _mm256_store_ps(l+i,_mm256_add_ps(_mm256_add_ps(px,py),pz));
        vaddps    ymm1, ymm0, YMMWORD PTR [rdx+r10*4]   ;488.39
        vaddps    ymm2, ymm1, YMMWORD PTR [r8+r10*4]    ;488.25
        vmovups   YMMWORD PTR [r9+r10*4], ymm2          ;488.21
        mov       r10d, eax                             ;485.36
        cmp       eax, r11d                             ;485.28
        jb        .B52.3        ; Prob 82%              ;485.28
[/plain]


Max_L
Employee
Note: on Sandy Bridge you can recover, and even improve, the performance of 256-bit loads that miss L1 relative to 128-bit ones (256-bit loads are indeed slower than 2x 128-bit loads on Sandy Bridge when missing L1, especially if the data is actually misaligned) by issuing prefetches to the cache lines ahead of the loads.

-Max
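To illustrate the suggestion above: a rough, untested sketch of the original AVX kernel with software prefetches issued a few cache lines ahead of the 256-bit loads. The prefetch distance PF_DIST and the T0 hint are placeholders that would need tuning on the actual hardware.
[cpp]
#include <immintrin.h>

#define PF_DIST 64  /* 64 floats = 256 bytes = 4 cache lines ahead; a guess, tune it */

void length_avx_prefetch(const float *x, const float *y, const float *z,
                         float *l, unsigned int length) {
  for (unsigned int i = 0; i + 8 <= length; i += 8) {
    /* prefetch the lines the loads will need a few iterations from now
       (prefetching slightly past the end of the arrays is harmless:
        prefetch instructions do not fault) */
    _mm_prefetch((const char *)&x[i + PF_DIST], _MM_HINT_T0);
    _mm_prefetch((const char *)&y[i + PF_DIST], _MM_HINT_T0);
    _mm_prefetch((const char *)&z[i + PF_DIST], _MM_HINT_T0);

    const __m256 vx = _mm256_load_ps(&x[i]);
    const __m256 vy = _mm256_load_ps(&y[i]);
    const __m256 vz = _mm256_load_ps(&z[i]);
    const __m256 s  = _mm256_add_ps(_mm256_add_ps(_mm256_mul_ps(vx, vx),
                                                  _mm256_mul_ps(vy, vy)),
                                    _mm256_mul_ps(vz, vz));
    _mm256_store_ps(&l[i], _mm256_sqrt_ps(s));
  }
}
[/cpp]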