nik0las
Beginner
134 Views

No speedup AVX over SSE

Hi. I'm trying to speed up some serial code using SSE and AVX (computational code with SoA data structures). The SSE version gives a good speedup: up to 2x with double and somewhat more with float. But when I use AVX the same way, I get the same speed as with SSE. Searching on Google suggests that the problem is memory speed. Is it possible to speed up this code using AVX?
OS: Linux (Ubuntu), x86_64
CPU: i7-2670QM
Compilers: gcc and icc
Code: http://code.google.com/p/le2d/source/browse/#git%2Fsrc
Compile: cd src && make
Run: cd tests/sse2 && ./test.sh
See result: cd tests/sse2 && gnuplot -p plot1.gnu
Thanks.

6 Replies
TimP
Black Belt
134 Views
You're probably aware, as you implied you researched the subject, that speedup from SSE to AVX often depends on several factors, including 32-byte data alignment, L1 cache locality, and optimum number of operations per loop. We've seen cases where stuffing lots of operations into a loop in order to optimize SSE performance could bring SSE up to the performance of AVX.
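For illustration, the 32-byte alignment mentioned above can be requested explicitly. This is a minimal sketch, assuming a C11 toolchain that provides `aligned_alloc`; the helper names `alloc_avx_doubles` and `is_avx_aligned` are hypothetical and are not taken from the le2d code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define AVX_ALIGN 32  /* bytes in one 256-bit AVX register */

/* Allocate room for n doubles on a 32-byte boundary (hypothetical helper).
   C11 aligned_alloc expects the size to be a multiple of the alignment,
   so the byte count is rounded up. Release the memory with free(). */
static double *alloc_avx_doubles(size_t n)
{
    assert(n <= (SIZE_MAX - AVX_ALIGN) / sizeof(double)); /* no overflow */
    size_t bytes = n * sizeof(double);
    size_t padded = (bytes + AVX_ALIGN - 1) / AVX_ALIGN * AVX_ALIGN;
    if (padded == 0)
        padded = AVX_ALIGN;
    return aligned_alloc(AVX_ALIGN, padded);
}

/* Returns 1 if p sits on a 32-byte boundary, 0 otherwise. */
static int is_avx_aligned(const void *p)
{
    return ((uintptr_t)p % AVX_ALIGN) == 0;
}
```

With SoA data, each component array would be allocated this way separately, so that every array starts on a 32-byte boundary.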
nik0las
Beginner
134 Views
Thanks for your comment. This program uses float and double numbers, 32-byte memory alignment, SoA data structures, and a lot of computation per element. I have also done some work to make the code more cache friendly. Maybe it's possible to speed up the AVX version using software prefetch? As I understand it, AVX would be faster only when all the data is in L1 cache.
TimP
Black Belt
135 Views
Software prefetch may help if the data don't remain local to L1, but in that case a 2x speedup over SSE is unlikely. I've found it difficult to predict the usefulness of software prefetch. Sometimes (and this may be the case in your example) SSE code can already take full advantage of L1 performance, even on AVX-capable CPUs.
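As a rough illustration of what software prefetch looks like in practice, here is a minimal sketch using `__builtin_prefetch`, a builtin available in gcc (icc and clang accept it as well). The loop, the function name `axpy_prefetch`, and the distance of 64 elements are assumptions chosen for the example, not values from the le2d code; the distance has to be tuned by measurement, and the prefetch may just as well make no difference.

```c
#include <assert.h>
#include <stddef.h>

#define PREFETCH_DISTANCE 64  /* elements ahead; hypothetical, needs tuning */
#define DOUBLES_PER_LINE  8   /* assumes 64-byte cache lines */

/* y[i] += a * x[i], hinting the cache about data a fixed distance ahead. */
static void axpy_prefetch(size_t n, double a, const double *x, double *y)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; ++i) {
        /* One hint per cache line is enough; skip hints past the end. */
        if (i % DOUBLES_PER_LINE == 0 && i + PREFETCH_DISTANCE < n) {
            __builtin_prefetch(&x[i + PREFETCH_DISTANCE], 0, 3); /* for read  */
            __builtin_prefetch(&y[i + PREFETCH_DISTANCE], 1, 3); /* for write */
        }
        y[i] += a * x[i];
    }
}
```

The result is identical with or without the hints; only timing can show whether they help on a given CPU and data size.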


yuriisig
Beginner
134 Views
Using AVX speeds up matrix multiplication by almost 2x.
TimP
Black Belt
134 Views
YuriiSig wrote:

Using AVX speeds up matrix multiplication by almost 2x.

That's with careful hand coding which, among other things, maximizes register and L1 cache data locality, as in the new versions of MKL.
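To give an idea of what register data locality means here, the following is a minimal scalar sketch of register blocking for matrix multiplication. It is not how MKL is implemented; the function name `matmul_blocked_2x2` and the 2x2 block size are assumptions made for the example, and a tuned kernel would use vector registers and larger blocks.

```c
#include <assert.h>
#include <stddef.h>

/* C += A * B for row-major n x n matrices, with n even.
   Each 2x2 block of C is accumulated in local variables, so the compiler
   can keep it in registers for the whole k loop instead of reloading and
   storing C on every iteration. */
static void matmul_blocked_2x2(size_t n, const double *A, const double *B,
                               double *C)
{
    assert(n % 2 == 0);
    for (size_t i = 0; i < n; i += 2) {
        for (size_t j = 0; j < n; j += 2) {
            double c00 = 0.0, c01 = 0.0, c10 = 0.0, c11 = 0.0;
            for (size_t k = 0; k < n; ++k) {
                /* Two loads from A and two from B feed four multiply-adds. */
                double a0 = A[i * n + k];
                double a1 = A[(i + 1) * n + k];
                double b0 = B[k * n + j];
                double b1 = B[k * n + j + 1];
                c00 += a0 * b0;
                c01 += a0 * b1;
                c10 += a1 * b0;
                c11 += a1 * b1;
            }
            C[i * n + j]           += c00;
            C[i * n + j + 1]       += c01;
            C[(i + 1) * n + j]     += c10;
            C[(i + 1) * n + j + 1] += c11;
        }
    }
}
```

The point of the blocking is the ratio of arithmetic to memory traffic: four loads per step of the inner loop serve four multiply-adds, where the unblocked loop needs two loads for one.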