Intel® ISA Extensions

Comparing scalar, SSE and AVX basics... poor performance?

Joaquin_Tarraga
Beginner
Hi,
Please look at these pieces of code, consisting of three versions to
calculate the length of a set of 3-D vectors.
Let's assume a vector v with components x, y, z.
The length l of vector v is l = sqrt((x*x) + (y*y) + (z*z)).
I implemented three versions based on scalar, SSE and AVX instructions to compute the
length of 90,000,000 vectors. I expected to get much better performance using SSE
and AVX, but no... here are the results:
=======================================
TEST 0: l = sqrt((x*x) + (y*y) + (z*z))
=======================================
Scalar time: 0.46051
SSE time : 0.18613
AVX time : 0.19043
Speed-up Scalar vs SSE : 2.47
Speed-up Scalar vs AVX : 2.42
I expected a speed-up of 4 when using SSE, and much more with AVX,
but there is no difference between SSE and AVX.
Target architecture:
  • Intel Xeon CPU E3-1245 @ 3.30 GHz
  • 4 cores, 2 threads per core (but I only use one core)
Command line to compile:
gcc -O3 -std=c99 -mavx main.c -o main -lm
And the code:
Allocating memory for the SSE version:
x = (float*)_mm_malloc(len * sizeof(float), 16);
y = (float*)_mm_malloc(len * sizeof(float), 16);
....
Allocating memory for the AVX version:
x = (float*)_mm_malloc(len * sizeof(float), 32);
y = (float*)_mm_malloc(len * sizeof(float), 32);
....
//----------------------------------------------------------------------------------------------------------------------
void length_scalar(float *x, float *y, float *z, float *l, unsigned int length) {
  // one vector per iteration
  for (int i = 0; i < length; i++) {
    l[i] = sqrt((x[i] * x[i]) + (y[i] * y[i]) + (z[i] * z[i]));
  }
}
//----------------------------------------------------------------------------------------------------------------------
void length_sse(float *x, float *y, float *z, float *l, unsigned int length) {
  __m128 xmm0, xmm1, xmm2, xmm3;
  // 4 vectors per iteration: a 128-bit register holds 4 floats
  for (int i = 0; i < length; i += 4) {
    xmm0 = _mm_load_ps(&x[i]);
    xmm1 = _mm_load_ps(&y[i]);
    xmm2 = _mm_load_ps(&z[i]);
    xmm3 = _mm_add_ps(_mm_mul_ps(xmm0, xmm0), _mm_mul_ps(xmm1, xmm1));
    xmm3 = _mm_add_ps(_mm_mul_ps(xmm2, xmm2), xmm3);
    xmm3 = _mm_sqrt_ps(xmm3);
    _mm_store_ps(&l[i], xmm3);
  }
}
//----------------------------------------------------------------------------------------------------------------------
void length_avx(float *x, float *y, float *z, float *l, unsigned int length) {
  __m256 ymm0, ymm1, ymm2, ymm3;
  // 8 vectors per iteration: a 256-bit register holds 8 floats
  for (int i = 0; i < length; i += 8) {
    ymm0 = _mm256_load_ps(&x[i]);
    ymm1 = _mm256_load_ps(&y[i]);
    ymm2 = _mm256_load_ps(&z[i]);
    ymm3 = _mm256_add_ps(_mm256_mul_ps(ymm0, ymm0), _mm256_mul_ps(ymm1, ymm1));
    ymm3 = _mm256_add_ps(_mm256_mul_ps(ymm2, ymm2), ymm3);
    ymm3 = _mm256_sqrt_ps(ymm3);
    _mm256_store_ps(&l[i], ymm3);
  }
}
//----------------------------------------------------------------------------------------------------------------------
Could you please give me some hints or suggestions to explain this?
I think it is due to the 4 data-movement instructions per iteration (memory/register, i.e., the load and store instructions).
What do you think?
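One way to sanity-check that hunch is to work out the memory bandwidth these timings imply. A small back-of-the-envelope sketch (my own estimate, not measured code: it assumes 16 bytes of traffic per vector, i.e. x, y, z read plus l written, and ignores the extra read-for-ownership traffic on the stores):
[cpp]
#include <stdio.h>

int main(void) {
  /* timings reported above for TEST 0 */
  const double t_scalar = 0.46051, t_sse = 0.18613, t_avx = 0.19043;
  /* 90,000,000 vectors, 4 floats touched per vector (x, y, z, l) */
  const double bytes = 90e6 * 4.0 * sizeof(float);
  printf("data streamed per run: %.2f GB\n", bytes / 1e9);
  printf("scalar: %.2f GB/s\n", bytes / 1e9 / t_scalar);
  printf("SSE   : %.2f GB/s\n", bytes / 1e9 / t_sse);
  printf("AVX   : %.2f GB/s\n", bytes / 1e9 / t_avx);
  return 0;
}
[/cpp]
If the SSE and AVX versions both land at roughly the same GB/s figure, that would point to the loads and stores (i.e., memory bandwidth) being the limit rather than the arithmetic.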
I also ran a simpler example (adding the 3 components of each vector, again for 90,000,000 vectors)
and the results were even worse:
=======================================
TEST 1: l = x + y + z
=======================================
Scalar time: 0.61573
SSE time : 0.34304
AVX time : 0.34770
Speed-up Scalar vs SSE : 1.79
Speed-up Scalar vs AVX : 1.77
Any idea?
Thanks a lot
--
Joaquin
Accepted solution
bronxzv
New Contributor II
I have tested your simple add example and I find a very good 2x speedup (AVX-256 vs. AVX-128) for the workloads fitting in the L1D cache, which is a lot better than I was expecting. When the L1D cache is overflowed, timings are sometimes worse for AVX-256 than for AVX-128, though. Then, for big workloads, timings are much the same (as expected) since we are mostly RAM-bandwidth bound.

my timings are as follows:

Core i7 3770K @ 3.5 GHz, enhanced SpeedStep disabled, turbo off
working set size : AVX-128 time / AVX-256 time
[plain]
      128 :  22.5 ms   12.9 ms
      256 :  19.3 ms   11.2 ms
      512 :  19.4 ms   9.63 ms
     1024 :  19.3 ms   9.84 ms
     2048 :  19.2 ms   9.72 ms
     4096 :  20.8 ms   10.1 ms
     8192 :  20.1 ms   10.7 ms
    16384 :  19.7 ms   10.1 ms
    32768 :  19.5 ms   17.6 ms
    65536 :  24.5 ms   28.8 ms
   131072 :  23.6 ms   28.3 ms
   262144 :  28.4 ms   34 ms
   524288 :  38.8 ms   40.6 ms
  1048576 :  39 ms     41.5 ms
  2097152 :  39.1 ms   41.2 ms
  4194304 :  41 ms     43.1 ms
  8388608 :  53.6 ms   50.8 ms
 16777216 :  94.6 ms   85.9 ms
 33554432 :  110 ms    109 ms
 67108864 :  115 ms    113 ms
134217728 :  118 ms    113 ms
[/plain]
source code:

[cpp]
template <class T> inline T *AAlloc(size_t size) { return (T *)_aligned_malloc(sizeof(T)*size, 32); }

inline void AFree(void *p) { if (p) _aligned_free(p); }

void AddTestAVX128(const float *x, const float *y, const float *z, float *l, unsigned int length) {
  for (unsigned int i = 0; i < length; i += 4) {
    const __m128 px = _mm_load_ps(x+i), py = _mm_load_ps(y+i), pz = _mm_load_ps(z+i);
    _mm_store_ps(l+i, _mm_add_ps(_mm_add_ps(px, py), pz));
  }
}

void AddTestAVX256(const float *x, const float *y, const float *z, float *l, unsigned int length) {
  for (unsigned int i = 0; i < length; i += 8) {
    const __m256 px = _mm256_load_ps(x+i), py = _mm256_load_ps(y+i), pz = _mm256_load_ps(z+i);
    _mm256_store_ps(l+i, _mm256_add_ps(_mm256_add_ps(px, py), pz));
  }
}

void JTTest(int chunkSize) {
  float *x = AAlloc<float>(chunkSize), *y = AAlloc<float>(chunkSize),
        *z = AAlloc<float>(chunkSize), *l = AAlloc<float>(chunkSize);
  for (int j = 0; j < chunkSize; j++) { x[j] = j*1.0; y[j] = j*2.0; z[j] = j*3.0; }
  Chrono chrono("");
  const float start = chrono.getTime();
  // ... (the timed loops calling AddTestAVX128/AddTestAVX256 and the timing
  //      printout were mangled by the forum formatting and are not recoverable)
}

// main call:
for (int chunkSize = 8; chunkSize < 10000000; chunkSize <<= 1) JTTest(chunkSize);
[/cpp]

ASM dumps:

[plain]
.B51.3::                         ; Preds .B51.3 .B51.2
;;;     {
;;;       const __m128 px = _mm_load_ps(x+i), py = _mm_load_ps(y+i), pz = _mm_load_ps(z+i);
        vmovups   xmm0, XMMWORD PTR [rcx+r10*4]         ;478.35
        add       eax, 4                                ;476.36
;;;       _mm_store_ps(l+i,_mm_add_ps(_mm_add_ps(px,py),pz));
        vaddps    xmm1, xmm0, XMMWORD PTR [rdx+r10*4]   ;479.33
        vaddps    xmm2, xmm1, XMMWORD PTR [r8+r10*4]    ;479.22
        vmovups   XMMWORD PTR [r9+r10*4], xmm2          ;479.18
        mov       r10d, eax                             ;476.36
        cmp       eax, r11d                             ;476.28
        jb        .B51.3        ; Prob 82%              ;476.28
[/plain]


[plain]
.B52.3::                         ; Preds .B52.3 .B52.2
;;;     {
;;;       const __m256 px = _mm256_load_ps(x+i), py = _mm256_load_ps(y+i), pz = _mm256_load_ps(z+i);
        vmovups   ymm0, YMMWORD PTR [rcx+r10*4]         ;487.38
        add       eax, 8                                ;485.36
;;;       _mm256_store_ps(l+i,_mm256_add_ps(_mm256_add_ps(px,py),pz));
        vaddps    ymm1, ymm0, YMMWORD PTR [rdx+r10*4]   ;488.39
        vaddps    ymm2, ymm1, YMMWORD PTR [r8+r10*4]    ;488.25
        vmovups   YMMWORD PTR [r9+r10*4], ymm2          ;488.21
        mov       r10d, eax                             ;485.36
        cmp       eax, r11d                             ;485.28
        jb        .B52.3        ; Prob 82%              ;485.28
[/plain]


Max_L
Employee
Note: on Sandy Bridge you can recover, and even improve, the performance of 256-bit loads that miss L1 relative to 128-bit ones (256-bit loads are indeed slower than 2x 128-bit loads on Sandy Bridge when missing L1, especially if the data is actually misaligned) by issuing prefetches to the cache lines ahead of the loads.

-Max
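To illustrate the suggestion above: a rough, untested sketch of the original AVX kernel with software prefetches issued a few cache lines ahead of the 256-bit loads. The prefetch distance PF_DIST and the T0 hint are placeholders that would need tuning on the actual hardware.
[cpp]
#include <immintrin.h>

#define PF_DIST 64  /* 64 floats = 256 bytes = 4 cache lines ahead; a guess, tune it */

void length_avx_prefetch(const float *x, const float *y, const float *z,
                         float *l, unsigned int length) {
  for (unsigned int i = 0; i + 8 <= length; i += 8) {
    /* prefetch the lines the loads will need a few iterations from now
       (prefetching slightly past the end of the arrays is harmless:
        prefetch instructions do not fault) */
    _mm_prefetch((const char *)&x[i + PF_DIST], _MM_HINT_T0);
    _mm_prefetch((const char *)&y[i + PF_DIST], _MM_HINT_T0);
    _mm_prefetch((const char *)&z[i + PF_DIST], _MM_HINT_T0);

    const __m256 vx = _mm256_load_ps(&x[i]);
    const __m256 vy = _mm256_load_ps(&y[i]);
    const __m256 vz = _mm256_load_ps(&z[i]);
    const __m256 s  = _mm256_add_ps(_mm256_add_ps(_mm256_mul_ps(vx, vx),
                                                  _mm256_mul_ps(vy, vy)),
                                    _mm256_mul_ps(vz, vz));
    _mm256_store_ps(&l[i], _mm256_sqrt_ps(s));
  }
}
[/cpp]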