Solved: No speedup AVX over SSE

nik0las · ‎09-02-2012

Hi. I'm trying to speed up some serial code using SSE and AVX (computational code with an SoA data structure). The SSE version gives a good speedup: up to 2x with double and somewhat more with float. But when I use AVX the same way, I get the same speed as with SSE. Searching on Google suggests that the bottleneck is memory speed. Is it possible to speed up this code using AVX?

OS: Linux (Ubuntu), x86_64
CPU: i7-2670QM
Compilers: gcc and icc
Code: http://code.google.com/p/le2d/source/browse/#git%2Fsrc
Compile: cd src && make
Run: cd tests/sse2 && ./test.sh
See result: cd tests/sse2 && gnuplot -p plot1.gnu

Thanks.

TimP · ‎09-03-2012

You're probably aware, as you implied you researched the subject, that speedup from SSE to AVX often depends on several factors, including 32-byte data alignment, L1 cache locality, and optimum number of operations per loop. We've seen cases where stuffing lots of operations into a loop in order to optimize SSE performance could bring SSE up to the performance of AVX.
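The 32-byte alignment point can be illustrated with a minimal C sketch. It assumes a C11 toolchain that provides aligned_alloc; the helper name alloc_f64_aligned32 is hypothetical and is not taken from the le2d code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: allocate n doubles on a 32-byte boundary,
   the natural alignment for 256-bit AVX loads and stores. */
double *alloc_f64_aligned32(size_t n)
{
    size_t bytes = n * sizeof(double);
    /* C11 aligned_alloc expects the size to be a multiple of the alignment,
       so round the request up to the next multiple of 32. */
    size_t padded = (bytes + 31u) & ~(size_t)31u;
    if (padded == 0)
        padded = 32;
    double *p = aligned_alloc(32, padded);
    /* Check the property the vector code will rely on. */
    assert(p == NULL || ((uintptr_t)p % 32u) == 0);
    return p;
}
```

In an SoA layout, each field array would be allocated this way and released with free().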

nik0las · ‎09-03-2012

Thanks for your comment. This program uses float and double numbers, 32-byte memory alignment, SoA data structures and a lot of computation per element. I have also done some work to make the code more cache friendly. Maybe it's possible to speed up the AVX version using software prefetch? As I understand it, AVX would be faster only when all the data is stored in L1 cache.
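One way to reason about when wider vectors can help is a roofline-style estimate: throughput is capped by the lower of the compute peak and memory bandwidth times arithmetic intensity. The sketch below is only an illustration. The function name and all numbers passed to it are assumptions, not measurements of this program or of the i7-2670QM.

```c
#include <assert.h>

/* Roofline-style bound: a kernel cannot run faster than the compute peak,
   nor faster than memory bandwidth times its arithmetic intensity.
   All three inputs are values the caller has to supply or measure. */
double attainable_gflops(double peak_gflops, double bandwidth_gbs,
                         double flops_per_byte)
{
    assert(peak_gflops > 0.0 && bandwidth_gbs > 0.0 && flops_per_byte > 0.0);
    double memory_bound = bandwidth_gbs * flops_per_byte;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}
```

With placeholder values, attainable_gflops(10.0, 20.0, 0.25) and attainable_gflops(20.0, 20.0, 0.25) both return 5.0: doubling the compute peak changes nothing while the memory term is the smaller one. Only when the working set stays in cache (higher effective bandwidth) or the code does more arithmetic per byte does the wider unit pay off.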

TimP · ‎09-03-2012

Software prefetch may help if the data don't remain local to L1, but in that case performance of 2x SSE is unlikely. I've found it difficult to predict the usefulness of software prefetch. It's sometimes possible (and may be the case in your example) for SSE code to take full advantage of L1 performance even on AVX-capable CPUs.
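For readers who want to try software prefetch, here is a minimal sketch using the __builtin_prefetch builtin available in GCC and compatible compilers. The loop, the function name axpy_prefetch, the 64-byte cache line size and the prefetch distance are all assumptions made for illustration. As noted above, whether it helps at all has to be measured.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed tuning knob: how many elements ahead to prefetch
   (64 doubles = 512 bytes = 8 cache lines of 64 bytes). */
#define PREFETCH_DISTANCE 64

/* y[i] += a * x[i], with software prefetch hints for data needed later. */
void axpy_prefetch(double *y, const double *x, double a, size_t n)
{
    assert(n == 0 || (x != NULL && y != NULL));
    for (size_t i = 0; i < n; ++i) {
        /* One hint per 8 doubles (one assumed 64-byte line) is enough;
           the bounds check keeps the address inside the arrays. */
        if ((i % 8) == 0 && i + PREFETCH_DISTANCE < n) {
            __builtin_prefetch(&x[i + PREFETCH_DISTANCE], 0, 3); /* for read  */
            __builtin_prefetch(&y[i + PREFETCH_DISTANCE], 1, 3); /* for write */
        }
        y[i] += a * x[i];
    }
}
```

The hints do not change the result, only (possibly) the timing, so the prefetch distance can be tuned freely while benchmarking.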

yuriisig · ‎09-14-2012

Using AVX speeds up matrix multiplication by almost a factor of 2.

TimP · ‎09-14-2012

YuriiSig wrote:
Using AVX speeds up matrix multiplication by almost a factor of 2.

That's with careful hand coding, among other things gaining maximum register and L1 cache data locality, as in the new versions of MKL.
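The register and L1 locality idea can be sketched in plain C with a small register-tiled kernel. This is not how MKL is implemented. It is a simplified illustration in which the function name, the 4x4 tile size and the requirement that n be a multiple of 4 are assumptions. Each tile of C is accumulated in local variables across the whole k loop, so the compiler can keep it in registers and C is written only once per tile.

```c
#include <assert.h>
#include <stddef.h>

/* C += A * B for row-major n x n matrices; n must be a multiple of 4
   (an assumption made to keep the sketch short). */
void matmul_tiled4(double *C, const double *A, const double *B, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        for (size_t j = 0; j < n; j += 4) {
            /* 4x4 accumulator tile, intended to live in registers. */
            double acc[4][4] = {{0.0}};
            for (size_t k = 0; k < n; ++k) {
                for (size_t ii = 0; ii < 4; ++ii) {
                    double a = A[(i + ii) * n + k];
                    /* Contiguous run of B: a natural target for SSE/AVX. */
                    for (size_t jj = 0; jj < 4; ++jj)
                        acc[ii][jj] += a * B[k * n + j + jj];
                }
            }
            /* Write the finished tile back to C once. */
            for (size_t ii = 0; ii < 4; ++ii)
                for (size_t jj = 0; jj < 4; ++jj)
                    C[(i + ii) * n + j + jj] += acc[ii][jj];
        }
    }
}
```

A production kernel would add blocking for L1 and L2 and hand-written vector code on top of this, which is where most of the tuning effort goes.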

No speedup AVX over SSE