Hi. I'm trying to speed up some serial computational code using SSE and AVX; the data is stored as a structure of arrays (SoA). The SSE version gives a good speedup: up to 2x with double and somewhat more with float. But when I use AVX the same way, I get the same speed as with SSE. Searching Google suggests that the bottleneck is memory speed.
Is it possible to speed up this code using AVX?
OS: linux, ubuntu, x86_64
CPU: i7-2670QM
Compilers: gcc and icc
Code: http://code.google.com/p/le2d/source/browse/#git%2Fsrc
Compile: cd src && make
Run: cd tests/sse2 && ./test.sh
See result: cd tests/sse2 && gnuplot -p plot1.gnu
Thanks.
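To make "the same way" concrete, here is a minimal sketch of one SoA-style loop written with SSE2 and with AVX intrinsics. It is not taken from le2d: the kernel and the names `axpy_sse2` and `axpy_avx` are hypothetical, and the `target("avx")` attribute assumes GCC or Clang. Both versions read and write exactly the same bytes, so if the loop is limited by memory traffic rather than by arithmetic, the wider AVX registers would not be expected to help.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>

/* Hypothetical kernel: y[i] = a * x[i] + y[i] over two SoA arrays.
 * The aligned loads below require aligned pointers, so that is asserted. */

/* SSE2 version: 2 doubles per instruction, 16-byte aligned data. */
void axpy_sse2(double a, const double *x, double *y, size_t n)
{
    assert((uintptr_t)x % 16 == 0 && (uintptr_t)y % 16 == 0);
    const __m128d va = _mm_set1_pd(a);
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128d vx = _mm_load_pd(x + i);
        __m128d vy = _mm_load_pd(y + i);
        _mm_store_pd(y + i, _mm_add_pd(_mm_mul_pd(va, vx), vy));
    }
    for (; i < n; ++i)              /* scalar tail */
        y[i] = a * x[i] + y[i];
}

/* AVX version: 4 doubles per instruction, 32-byte aligned data. */
__attribute__((target("avx")))
void axpy_avx(double a, const double *x, double *y, size_t n)
{
    assert((uintptr_t)x % 32 == 0 && (uintptr_t)y % 32 == 0);
    const __m256d va = _mm256_set1_pd(a);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m256d vx = _mm256_load_pd(x + i);
        __m256d vy = _mm256_load_pd(y + i);
        _mm256_store_pd(y + i, _mm256_add_pd(_mm256_mul_pd(va, vx), vy));
    }
    for (; i < n; ++i)              /* scalar tail */
        y[i] = a * x[i] + y[i];
}
```

Timing both functions on arrays much larger than the last-level cache, and again on arrays that fit in L1, is one way to see whether a given loop is memory-bound.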
1 Solution
Software prefetch may help if the data don't stay resident in L1, but in that case a 2x speedup over SSE is unlikely. I've found it difficult to predict when software prefetch will be useful.
SSE code can sometimes take full advantage of L1 performance even on AVX-capable CPUs, and that may be the case in your example.
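For reference, a minimal sketch of what software prefetch looks like in an AVX loop. The kernel, the name `scale_avx_prefetch`, and the `dist` parameter are hypothetical; the prefetch distance in particular is an assumption that has to be tuned by measurement, and the `target("avx")` attribute assumes GCC or Clang.

```c
#include <stddef.h>
#include <immintrin.h>

/* Hypothetical kernel: y[i] = a * x[i], with a prefetch hint issued
 * `dist` elements ahead of the element currently being loaded. */
__attribute__((target("avx")))
void scale_avx_prefetch(double a, const double *x, double *y,
                        size_t n, size_t dist)
{
    const __m256d va = _mm256_set1_pd(a);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        /* A prefetch is only a hint and never changes results.
         * The bounds check keeps the pointer arithmetic inside x. */
        if (i + dist < n)
            _mm_prefetch((const char *)(x + i + dist), _MM_HINT_T0);
        __m256d vx = _mm256_loadu_pd(x + i);
        _mm256_storeu_pd(y + i, _mm256_mul_pd(va, vx));
    }
    for (; i < n; ++i)              /* scalar tail */
        y[i] = a * x[i];
}
```

For a simple sequential stream like this one the hardware prefetcher may already do the job, which is one reason the benefit is hard to predict.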
6 Replies
You're probably aware, since you say you researched the subject, that the speedup from SSE to AVX depends on several factors, including 32-byte data alignment, L1 cache locality, and the number of operations per loop iteration.
We've seen cases where packing many operations into a loop to optimize SSE performance brought SSE up to the performance of AVX.
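One rough way to reason about operations per loop versus memory traffic is a roofline-style bound, which nobody in this thread used but which captures the same idea: a loop cannot run faster than the arithmetic peak, nor faster than memory bandwidth times its flops-per-byte ratio. The sketch below is plain C; the function name `attainable_gflops` is hypothetical, and any peak or bandwidth figures passed to it are assumptions that should be replaced with measured values.

```c
/* Roofline-style upper bound on throughput, in GFLOP/s.
 * peak_gflops:    arithmetic peak of the instruction set in use
 * bandwidth_gbs:  sustained memory bandwidth in GB/s
 * flops_per_byte: floating-point operations per byte moved by the loop */
double attainable_gflops(double peak_gflops, double bandwidth_gbs,
                         double flops_per_byte)
{
    double memory_bound = bandwidth_gbs * flops_per_byte;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}
```

With placeholder numbers, say 16 GB/s of bandwidth and a loop doing 0.125 flops per byte, the bound is 2 GFLOP/s whether the peak is 8 (SSE-like) or 16 (AVX-like). At 4 flops per byte the bound becomes the peak itself, and only then does the doubled AVX peak show up.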
Thanks for your comment.
This program uses float and double numbers, 32-byte memory alignment, SoA data structures, and a lot of computation per element.
I have also done some work to make the code more cache-friendly.
Could software prefetch speed up the AVX version? As far as I can see, AVX is faster only when all the data is in the L1 cache.
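As an illustration of one common cache-friendly pattern (not necessarily what le2d does), the sketch below runs two dependent passes block by block, so the intermediate values are reused while they are still in L1 instead of being streamed through memory. The function name `two_pass_blocked`, the computation, and the block size are all hypothetical; 1024 doubles per array is an assumption chosen so that three such blocks (24 KiB) fit in a 32 KiB L1 data cache.

```c
#include <stddef.h>

enum { BLOCK = 1024 };  /* doubles per block; assumed to fit in L1 */

/* Hypothetical two-pass computation: tmp = x * x, then y = tmp + x.
 * Both passes run on one block before moving on to the next. */
void two_pass_blocked(const double *x, double *y, size_t n)
{
    double tmp[BLOCK];  /* small scratch buffer, reused for every block */

    for (size_t start = 0; start < n; start += BLOCK) {
        size_t len = n - start < BLOCK ? n - start : BLOCK;

        /* Pass 1: fill the scratch buffer for this block. */
        for (size_t i = 0; i < len; ++i)
            tmp[i] = x[start + i] * x[start + i];

        /* Pass 2: consume it while the block is still cached. */
        for (size_t i = 0; i < len; ++i)
            y[start + i] = tmp[i] + x[start + i];
    }
}
```

The inner loops are simple enough for a compiler to vectorize, or they can be replaced by SSE or AVX intrinsics without changing the blocking.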
Using AVX speeds up matrix multiplication by almost 2x.
YuriiSig wrote: Using AVX speeds up matrix multiplication by almost 2x.
That's with careful hand coding that, among other things, maximizes register and L1 cache data locality, as in the new versions of MKL.
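To give a rough idea of what register locality means here, the following is a simplified sketch of a register-blocked matrix multiply. It is not MKL's implementation: the name `matmul_avx_4x4` is hypothetical, the matrices are assumed to be square, row-major, and of a size that is a multiple of 4, and the `target("avx")` attribute assumes GCC or Clang. Each 4x4 tile of C is held in four AVX registers for the whole inner loop, so every value loaded from B is used four times before the next load.

```c
#include <stddef.h>
#include <immintrin.h>

/* C += A * B for row-major n x n matrices of doubles.
 * Assumption: n is a multiple of 4. */
__attribute__((target("avx")))
void matmul_avx_4x4(const double *A, const double *B, double *C, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        for (size_t j = 0; j < n; j += 4) {
            /* Load the 4x4 tile of C into registers, one row each. */
            __m256d c0 = _mm256_loadu_pd(C + (i + 0) * n + j);
            __m256d c1 = _mm256_loadu_pd(C + (i + 1) * n + j);
            __m256d c2 = _mm256_loadu_pd(C + (i + 2) * n + j);
            __m256d c3 = _mm256_loadu_pd(C + (i + 3) * n + j);

            for (size_t k = 0; k < n; ++k) {
                /* One row segment of B, reused for all four rows of the tile. */
                __m256d b = _mm256_loadu_pd(B + k * n + j);
                c0 = _mm256_add_pd(c0, _mm256_mul_pd(
                         _mm256_broadcast_sd(A + (i + 0) * n + k), b));
                c1 = _mm256_add_pd(c1, _mm256_mul_pd(
                         _mm256_broadcast_sd(A + (i + 1) * n + k), b));
                c2 = _mm256_add_pd(c2, _mm256_mul_pd(
                         _mm256_broadcast_sd(A + (i + 2) * n + k), b));
                c3 = _mm256_add_pd(c3, _mm256_mul_pd(
                         _mm256_broadcast_sd(A + (i + 3) * n + k), b));
            }

            /* Write the finished tile back once. */
            _mm256_storeu_pd(C + (i + 0) * n + j, c0);
            _mm256_storeu_pd(C + (i + 1) * n + j, c1);
            _mm256_storeu_pd(C + (i + 2) * n + j, c2);
            _mm256_storeu_pd(C + (i + 3) * n + j, c3);
        }
    }
}
```

A production kernel would add cache-level blocking, packing of A and B, and a larger register tile; this sketch only shows the register-reuse idea.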