Joaquin_Tarraga

Beginner

06-06-2012
02:23 AM


Comparing scalar, SSE and AVX....poor performances ??

Hi,

Please look at these pieces of code: three versions of a routine that computes the lengths of a set of 3-D vectors.

Let's assume a vector v with components x, y, z.

The length (l) of vector v is l = sqrt((x*x) + (y*y) + (z*z))

I implemented three versions, based on scalar, SSE and AVX instructions, to compute the lengths of 90 000 000 vectors. I expected much better performance from SSE and AVX, but no... here are the results:

=======================================

TEST 0: l = sqrt((x*x) + (y*y) + (z*z))

=======================================

Scalar time: 0.46051

SSE time : 0.18613

AVX time : 0.19043

Speed-up Scalar vs SSE : 2.47

Speed-up Scalar vs AVX : 2.42

I expected a speed-up of 4 with SSE, and much more with AVX, but there is no difference between SSE and AVX.

Target architecture:

- Intel Xeon CPU E31245 @ 3.30GHz
- 4 CPU dual-core (but I only use one core)

Command line to compile:

gcc -O3 -std=c99 -mavx main.c -o main -lm

(with the Intel compiler, icc, the SSE and AVX results are similar too)

And the code:

Allocating memory for the SSE version:

x = (float*)_mm_malloc(len * sizeof(float), 16);

y = (float*)_mm_malloc(len * sizeof(float), 16);

....
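A side note on the allocation: `_mm_load_ps` needs 16-byte-aligned addresses, while the `_mm256_load_ps` calls in the AVX version need 32-byte alignment. The following is only a minimal sketch of an allocation that satisfies both. It uses C11 `aligned_alloc` instead of `_mm_malloc` (so it needs `-std=c11` or later rather than `-std=c99`), and `alloc_floats` and `is_aligned` are hypothetical helper names, not part of the original code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper (not from the original post): allocate `len` floats
 * aligned to `alignment` bytes using C11 aligned_alloc.
 * Release the buffer with free(). */
static float *alloc_floats(size_t len, size_t alignment) {
    /* The alignment must be a non-zero power of two. */
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    size_t bytes = len * sizeof(float);
    /* C11 asks for a size that is a multiple of the alignment: round up. */
    size_t rounded = (bytes + alignment - 1) / alignment * alignment;
    return (float *)aligned_alloc(alignment, rounded);
}

/* Returns 1 if pointer p is aligned to `alignment` bytes, 0 otherwise. */
static int is_aligned(const void *p, size_t alignment) {
    return ((uintptr_t)p % alignment) == 0;
}
```

A buffer obtained with `alloc_floats(len, 32)` is also 16-byte aligned, so the same arrays could be passed to both the SSE and the AVX functions.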

//----------------------------------------------------------------------------------------------------------------------

void length_scalar(float *x, float *y, float *z, float *l, unsigned int length) {
    for (int i = 0; i < length; i++) {
        l[i] = sqrt((x[i]*x[i]) + (y[i]*y[i]) + (z[i]*z[i]));
    }
}

//----------------------------------------------------------------------------------------------------------------------

void length_sse(float *x, float *y, float *z, float *l, unsigned int length) {
    __m128 xmm0, xmm1, xmm2, xmm3;
    // 4 floats per iteration
    for (int i = 0; i < length; i += 4) {
        xmm0 = _mm_load_ps(&x[i]);
        xmm1 = _mm_load_ps(&y[i]);
        xmm2 = _mm_load_ps(&z[i]);
        xmm3 = _mm_add_ps(_mm_mul_ps(xmm0, xmm0), _mm_mul_ps(xmm1, xmm1));
        xmm3 = _mm_add_ps(_mm_mul_ps(xmm2, xmm2), xmm3);
        xmm3 = _mm_sqrt_ps(xmm3);
        _mm_store_ps(&l[i], xmm3);
    }
}

//----------------------------------------------------------------------------------------------------------------------

void length_avx(float *x, float *y, float *z, float *l, unsigned int length) {
    __m256 ymm0, ymm1, ymm2, ymm3;
    // 8 floats per iteration
    for (int i = 0; i < length; i += 8) {
        ymm0 = _mm256_load_ps(&x[i]);
        ymm1 = _mm256_load_ps(&y[i]);
        ymm2 = _mm256_load_ps(&z[i]);
        ymm3 = _mm256_add_ps(_mm256_mul_ps(ymm0, ymm0), _mm256_mul_ps(ymm1, ymm1));
        ymm3 = _mm256_add_ps(_mm256_mul_ps(ymm2, ymm2), ymm3);
        ymm3 = _mm256_sqrt_ps(ymm3);
        _mm256_store_ps(&l[i], ymm3);
    }
}

//----------------------------------------------------------------------------------------------------------------------

Could you please give me some hints or suggestions to explain that?

I think it is due to the 4 instructions that move data between memory and registers (i.e., the load and store instructions). What do you think?

I also ran a simpler example (adding the 3 components of each vector, for 90 000 000 vectors) and got worse results:

=======================================

TEST 1: l = x + y + z

=======================================

Scalar time: 0.61573

SSE time : 0.34304

AVX time : 0.34770

Speed-up Scalar vs SSE : 1.79

Speed-up Scalar vs AVX : 1.77

Any idea?

Thanks a lot

--

Joaquín


5 Replies

Patrick_F_Intel1

Employee


06-06-2012
04:46 AM


Let me see if I understand correctly... each array is of length 90 million, times 4 arrays, times 4 bytes per float gives a total size of about 1.44 billion bytes.
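The size estimate above can be written out as a small calculation. This is only a sketch: `working_set_bytes` and `elems_that_fit` are hypothetical helper names, a 4-byte `float` is assumed, and the 8 MiB cache size in the usage note below is an assumed L3 size, not a figure taken from this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Total bytes touched by one pass over n_arrays arrays of n_elems floats. */
static uint64_t working_set_bytes(uint64_t n_elems, uint64_t n_arrays) {
    return n_elems * n_arrays * (uint64_t)sizeof(float);
}

/* Largest per-array element count such that all n_arrays arrays
 * together still fit into cache_bytes. */
static uint64_t elems_that_fit(uint64_t cache_bytes, uint64_t n_arrays) {
    assert(n_arrays > 0);
    return cache_bytes / (n_arrays * (uint64_t)sizeof(float));
}
```

`working_set_bytes(90000000, 4)` gives 1 440 000 000 bytes. Under the assumed 8 MiB last-level cache, `elems_that_fit(8 * 1024 * 1024, 4)` gives 524 288 floats per array as an upper limit, so a test meant to run from cache should use clearly fewer elements than that.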

I suspect that the CPU is mostly bound (restricted) by the speed at which it can read from (and write to) memory.

Once a cache line has been brought in from memory, AVX and SSE work faster, but it is possible that you spend much of the time waiting for cache lines to be fetched.
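One rough way to check the memory-bound hypothesis is to turn a measured time into an effective throughput. The sketch below assumes that one pass moves the 1.44 billion bytes estimated above; `effective_gb_per_s` is a hypothetical helper name.

```c
#include <assert.h>

/* Effective memory throughput in GB/s (1 GB = 1e9 bytes), given the
 * number of bytes moved and the elapsed time in seconds. */
static double effective_gb_per_s(double bytes_moved, double seconds) {
    assert(seconds > 0.0);
    return bytes_moved / seconds / 1e9;
}
```

With the SSE time from the first post (0.18613 s) this gives roughly 7.7 GB/s, and with the AVX time (0.19043 s) roughly 7.6 GB/s. The real traffic is probably somewhat higher, because ordinary stores to `l` usually read the destination cache line first, so these figures are best read as lower bounds. Both versions land on nearly the same number, which is what one would expect if memory rather than arithmetic is the limit.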

Can you try running the code with a total size that fits easily into the L3 and see what performance you get?

I'm not a compiler expert, but I would also look at the generated assembly code to check whether AVX instructions are actually being generated.

Pat

Joaquin_Tarraga

Beginner


06-06-2012
07:55 AM


If I work in chunks of 1024 floats, i.e., for (int i = 0; i < 90000000; i += 1024), I get a speed-up of approximately 4 for both SSE and AVX (see results below).

TimP (Intel) told me: "As the AVX-256 sqrt is sequenced as 2 128-bit sqrt instructions, you could expect little difference in performance between SSE and AVX-256". That can explain why there is no difference between SSE and AVX in test 0, but why is the speed-up also the same in test 1, where only the three components are added?

By executing "objdump -S", I checked that the assembly instructions are AVX (vaddps, vmovaps, vmulps, vsqrtps, ...) for both the SSE function and the AVX function; the difference comes from the loop 'step' value, 4 for SSE and 8 for AVX.

=======================================

TEST 0: l = sqrt((x*x) + (y*y) + (z*z))

=======================================

Seq time: 3.681436e-01

SSE time: 9.068346e-02

AVX time: 9.062290e-02

Speed-up Seq vs SSE : 4.06

Speed-up Seq vs AVX : 4.06

=======================================

TEST 1: l = x + y + z

=======================================

Seq time: 3.898194e-01

SSE time: 1.120391e-01

AVX time: 1.076577e-01

Speed-up Seq vs SSE : 3.48

Speed-up Seq vs AVX : 3.62

Patrick_F_Intel1

Employee


06-06-2012
08:52 AM


My main point was to have the test run from cache rather than memory.

So if you did something like (just for testing timing):

for (int j = 0; j < 90000; j++)
{
    length_sse(x, y, z, l, 1000);
}

So you are still doing 90 million operations... but now everything is in cache.
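The repeat-over-a-small-block idea can be wrapped in a small timing helper. The sketch below is not the code used in this thread: it times the simpler TEST 1 kernel (l = x + y + z) in plain scalar C so that it builds without SSE/AVX flags, and `sum_scalar`, `kernel_fn` and `time_repeated` are hypothetical names. An SSE or AVX kernel with the same signature could be passed in the same way.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Scalar version of TEST 1: l[i] = x[i] + y[i] + z[i]. */
static void sum_scalar(const float *x, const float *y, const float *z,
                       float *l, size_t n) {
    for (size_t i = 0; i < n; i++)
        l[i] = x[i] + y[i] + z[i];
}

/* Any kernel with this signature can be timed by time_repeated(). */
typedef void (*kernel_fn)(const float *, const float *, const float *,
                          float *, size_t);

/* Run fn `reps` times over the same n-element block and return the
 * elapsed CPU time in seconds, as reported by clock().
 * Caveat: an aggressive optimizer may remove repeated identical work,
 * so it is worth checking the generated assembly before trusting the
 * numbers. */
static double time_repeated(kernel_fn fn, const float *x, const float *y,
                            const float *z, float *l,
                            size_t n, size_t reps) {
    assert(fn != NULL);
    clock_t start = clock();
    for (size_t r = 0; r < reps; r++)
        fn(x, y, z, l, n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling `time_repeated(sum_scalar, x, y, z, l, 1000, 90000)` performs the same 90 million element operations as the loop above. With 1000 floats per array, the four arrays occupy 16 000 bytes, which should fit comfortably in a typical 32 KB L1 data cache.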

Yes, this is not solving your problem if you have vector lengths of 90 million but it is not clear to me if you are just timing stuff or trying to solve a problem.

Pat

TimP

Black Belt


06-06-2012
10:50 AM


Joaquin_Tarraga

Beginner


06-07-2012
12:31 AM


It was only for timing. bronxzv got a speed-up close to 2 (AVX-256 vs AVX-128, for test 1, playing with different 'chunk sizes'):

http://software.intel.com/en-us/forums/showthread.php?t=105767&o=a&s=lr

Thanks a lot for your help.
