Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Comparing scalar, SSE and AVX... poor performance??

Joaquin_Tarraga
Beginner
Hi,
Please look at these pieces of code: three versions that
calculate the lengths of a set of 3-D vectors.
Let's assume a vector v with components x, y, z.
The length l of vector v is l = sqrt((x*x) + (y*y) + (z*z)).
I implemented three versions, based on scalar, SSE and AVX instructions, to compute the
lengths of 90,000,000 vectors. I expected much better performance from SSE
and AVX, but no... here are the results:
=======================================
TEST 0: l = sqrt((x*x) + (y*y) + (z*z))
=======================================
Scalar time: 0.46051
SSE time : 0.18613
AVX time : 0.19043
Speed-up Scalar vs SSE : 2.47
Speed-up Scalar vs AVX : 2.42
I expected a speed-up of about 4 with SSE, and much more with AVX,
but there is no difference between SSE and AVX.
Target architecture:
  • Intel Xeon CPU E31245 @ 3.30GHz
  • 4 cores, 2 threads per core (but I only use one core)
Command line to compile:
gcc -O3 -std=c99 -mavx main.c -o main -lm
(with the Intel compiler, SSE and AVX are similar too)
And the code:
Allocating memory for the SSE version:
x = (float*)_mm_malloc(len * sizeof(float), 16);
y = (float*)_mm_malloc(len * sizeof(float), 16);
....
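Note that _mm_load_ps is fine with a 16-byte boundary, but _mm256_load_ps and
_mm256_store_ps require 32-byte alignment, so for the AVX version the arrays
need a wider boundary. A minimal sketch of the adjusted allocation:

// 32 bytes satisfies both the SSE (16-byte) and AVX (32-byte) aligned accesses;
// _mm256_load_ps on an address that is only 16-byte aligned faults.
x = (float*)_mm_malloc(len * sizeof(float), 32);
y = (float*)_mm_malloc(len * sizeof(float), 32);
z = (float*)_mm_malloc(len * sizeof(float), 32);
l = (float*)_mm_malloc(len * sizeof(float), 32);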

//----------------------------------------------------------------------------------------------------------------------
void length_scalar(float *x, float *y, float *z, float *l, unsigned int length) {
  for (int i = 0; i < length; i++) {
    // one vector length per iteration
    l[i] = sqrt((x[i]*x[i]) + (y[i]*y[i]) + (z[i]*z[i]));
  }
}
//----------------------------------------------------------------------------------------------------------------------
void length_sse(float *x, float *y, float *z, float *l, unsigned int length) {
  __m128 xmm0, xmm1, xmm2, xmm3;
  for (int i = 0; i < length; i += 4) {  // 4 floats per 128-bit register
    xmm0 = _mm_load_ps(&x[i]);
    xmm1 = _mm_load_ps(&y[i]);
    xmm2 = _mm_load_ps(&z[i]);
    xmm3 = _mm_add_ps(_mm_mul_ps(xmm0, xmm0), _mm_mul_ps(xmm1, xmm1));
    xmm3 = _mm_add_ps(_mm_mul_ps(xmm2, xmm2), xmm3);
    xmm3 = _mm_sqrt_ps(xmm3);
    _mm_store_ps(&l[i], xmm3);
  }
}
//----------------------------------------------------------------------------------------------------------------------
void length_avx(float *x, float *y, float *z, float *l, unsigned int length) {
  __m256 ymm0, ymm1, ymm2, ymm3;
  for (int i = 0; i < length; i += 8) {  // 8 floats per 256-bit register
    ymm0 = _mm256_load_ps(&x[i]);
    ymm1 = _mm256_load_ps(&y[i]);
    ymm2 = _mm256_load_ps(&z[i]);
    ymm3 = _mm256_add_ps(_mm256_mul_ps(ymm0, ymm0), _mm256_mul_ps(ymm1, ymm1));
    ymm3 = _mm256_add_ps(_mm256_mul_ps(ymm2, ymm2), ymm3);
    ymm3 = _mm256_sqrt_ps(ymm3);
    _mm256_store_ps(&l[i], ymm3);
  }
}

//----------------------------------------------------------------------------------------------------------------------

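For completeness, a minimal sketch of the timing harness (the timer choice and
the test values here are just illustrative, not the exact code I ran):

#include <stdio.h>
#include <math.h>
#include <sys/time.h>
#include <immintrin.h>

#define LEN 90000000

// assumes the three length_* functions above are defined in the same file

static double seconds(struct timeval t0, struct timeval t1) {
  return (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) * 1e-6;
}

int main(void) {
  float *x = (float*)_mm_malloc(LEN * sizeof(float), 32);
  float *y = (float*)_mm_malloc(LEN * sizeof(float), 32);
  float *z = (float*)_mm_malloc(LEN * sizeof(float), 32);
  float *l = (float*)_mm_malloc(LEN * sizeof(float), 32);
  for (unsigned int i = 0; i < LEN; i++) {  // arbitrary test data
    x[i] = 1.0f; y[i] = 2.0f; z[i] = 3.0f;
  }

  struct timeval t0, t1;

  gettimeofday(&t0, NULL);
  length_scalar(x, y, z, l, LEN);
  gettimeofday(&t1, NULL);
  printf("Scalar time: %.5f\n", seconds(t0, t1));

  gettimeofday(&t0, NULL);
  length_sse(x, y, z, l, LEN);
  gettimeofday(&t1, NULL);
  printf("SSE time   : %.5f\n", seconds(t0, t1));

  gettimeofday(&t0, NULL);
  length_avx(x, y, z, l, LEN);
  gettimeofday(&t1, NULL);
  printf("AVX time   : %.5f\n", seconds(t0, t1));

  _mm_free(x); _mm_free(y); _mm_free(z); _mm_free(l);
  return 0;
}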
Could you please give me some hints or suggestions to explain this?
I think it is due to the four data-movement instructions per iteration (the three loads and the store);
what do you think?
I also ran a simpler example (adding the 3 components of each vector, for 90,000,000 vectors)
and got even worse speed-ups:
=======================================
TEST 1: l = x + y + z
=======================================
Scalar time: 0.61573
SSE time : 0.34304
AVX time : 0.34770
Speed-up Scalar vs SSE : 1.79
Speed-up Scalar vs AVX : 1.77
Any idea?
Thanks a lot
--
Joaquin


Patrick_F_Intel1
Employee
Hello Joaquin,
Let me see if I understand correctly... each array is of length 90 million, times 4 arrays, times 4 bytes per float gives a total size of about 1.44 billion bytes.
I suspect that the cpu is mostly bound (restricted) by the speed at which it can read from (and write to) memory.
Once a cacheline has been brought in from memory, AVX and SSE work faster, but it is possible that you are mostly waiting for cachelines to be fetched.
Can you try running the code with a total size that fits easily into the L3 cache and see what performance you get?
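Rough arithmetic: each element needs 3 float loads plus 1 float store, i.e. 16 bytes of traffic, so 90 million elements move about 1.44 GB (write-allocate on the output array adds a bit more). At your measured SSE time of 0.186 s that works out to roughly 7.7 GB/s, which is in the neighborhood of what a single core can sustain from DRAM, so the wider registers have little left to speed up.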
I'm not a compiler expert, but I would also look at the generated assembly and check whether AVX instructions are actually being generated.
Pat
Joaquin_Tarraga
Beginner
Thanks for your comment, Pat,
If I work in chunks of 1024 floats, i.e., for (int i = 0; i < 90000000; i += 1024),
I get a speed-up of approximately 4 for both SSE and AVX (see results below).

TimP (Intel) told me: "As the AVX-256 sqrt is sequenced as 2 128-bit sqrt instructions, you could expect little difference in performance between SSE and AVX-256." That explains why there is no difference between SSE and AVX in TEST 0, but why is the speed-up the same in TEST 1, where only additions are performed?
By running "objdump -S" I checked that the generated instructions are AVX (vaddps, vmovaps, vmulps, vsqrtps, ...) for both the SSE function and the AVX function; the difference comes from the loop step value: 4 for SSE and 8 for AVX.
=======================================
TEST 0: l = sqrt((x*x) + (y*y) + (z*z))
=======================================
Seq time: 3.681436e-01
SSE time: 9.068346e-02
AVX time: 9.062290e-02
Speed-up Seq vs SSE : 4.06
Speed-up Seq vs AVX : 4.06
=======================================
TEST 1: l = x + y + z
=======================================
Seq time: 3.898194e-01
SSE time: 1.120391e-01
AVX time: 1.076577e-01
Speed-up Seq vs SSE : 3.48
Speed-up Seq vs AVX : 3.62
Patrick_F_Intel1
Employee
Well Tim certainly knows more about compilers, SSE & AVX than I do.
My main point was to have the test run from cache rather than memory.
So if you did something like (just for testing timing):
for (int j = 0; j < 90000; j++) {
  length_sse(x, y, z, l, 1000);
}
so you are still doing 90 million operations... but now everything is in cache.
Yes, this does not solve your problem if you really have vector lengths of 90 million, but it is not clear to me whether you are just timing things or trying to solve an actual problem.
Pat
TimP
Honored Contributor III
Pat is right about memory-access-limited situations, of course. In fact, the current Intel AVX platforms also narrow the data path to 128 bits beyond L1, so, as Pat is hinting, AVX will give a significant gain in those situations only with data locality in L1.
Joaquin_Tarraga
Beginner
It was only for timing. bronxzv got a speed-up close to 2 (AVX-256 vs AVX-128) for TEST 1, playing with different chunk sizes:
http://software.intel.com/en-us/forums/showthread.php?t=105767&o=a&s=lr
Thanks a lot for your help.