I am just starting with SSE optimizations. I tried a very simple task of adding an array
of 2-dimensional vectors. I made three versions - without utilizing
any sort of SIMD instructions (http://pastebin.com/m3e8838c2), using
SSE2 instructions via intel intrinsics (http://pastebin.com/m783f8e7d)
and using SSE2 instructions through GCC vector intrinsics
(http://pastebin.com/m6f36194e). The best times obtained were without
using any SIMD instructions. I used the gcc 4.2 compiler with -march=prescott and -O3.
When I tried compiling without the -O3 flag, the code with the gcc
vector intrinsics was 1.5 times faster than the one without SIMD
instructions, and the Intel intrinsics code was the slowest :-(.
Any help will be greatly appreciated.
I doubt this was the original intended topic of this forum, but maybe it's good preparation for AVX.
Your Intel intrinsics code forces the use of more instructions than the gcc vector intrinsics version. That might be OK on the old NetBurst processors, including Prescott, if you have one of the highest clock speeds, so I'm guessing you may not have met all of those qualifications. As -march=prescott would be a reasonable choice for this code on Core Duo, for example, I can't infer what CPU you chose. Also, it's probably unrolled beyond the optimum.
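For readers without access to the pastebin links, here is a minimal sketch of what the two SIMD styles might look like side by side. It is not the original poster's code. It assumes single-precision floats stored interleaved as x0,y0,x1,y1,..., and the function names add_intel and add_gccvec are made up for illustration. Unaligned loads are used so the sketch works on any buffer, and neither loop is manually unrolled.

```c
#include <stddef.h>
#include <string.h>
#include <emmintrin.h>  /* SSE2 intrinsics; in gcc this also provides the SSE float ops */

/* Assumed layout (not from the original post): n 2D float vectors
 * stored interleaved as x0,y0,x1,y1,..., i.e. 2*n floats in total. */

/* Intel intrinsics: each 128-bit register holds two 2D vectors. */
void add_intel(float *dst, const float *a, const float *b, size_t n)
{
    size_t i, len = 2 * n;
    for (i = 0; i + 4 <= len; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* unaligned load of 4 floats */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
    }
    for (; i < len; ++i)                   /* scalar tail for leftover floats */
        dst[i] = a[i] + b[i];
}

/* GCC vector extension: the compiler picks the instructions for '+'. */
typedef float v4sf __attribute__((vector_size(16)));

void add_gccvec(float *dst, const float *a, const float *b, size_t n)
{
    size_t i, len = 2 * n;
    for (i = 0; i + 4 <= len; i += 4) {
        v4sf va, vb, vr;
        memcpy(&va, a + i, sizeof va);     /* alignment-safe load */
        memcpy(&vb, b + i, sizeof vb);
        vr = va + vb;                      /* element-wise add of 4 floats */
        memcpy(dst + i, &vr, sizeof vr);
    }
    for (; i < len; ++i)                   /* scalar tail for leftover floats */
        dst[i] = a[i] + b[i];
}
```

Comparing the assembly the compiler emits for the two functions (gcc -S) is a quick way to see whether one of them really generates more instructions per element.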
There were a lot of gcc 4.2 compilers. I'm not sure whether any of them enabled auto-vectorization at -O3, as 4.3 did. If so, that would involve SIMD instructions, and could demonstrate that a compiler can do a better job of auto-vectorization than you do when you tie it down to specific instructions.
I think you are saying that your plain C code, compiled with -O3, is faster than your other versions. When you dictate use of SSE intrinsics, chances are there isn't enough change in the generated code with -O level to make a difference.
Your gcc should support the flag -ftree-vectorize to perform auto-vectorization. The additional flag -ftree-vectorizer-verbose=2 will report what the vectorizer did with each loop.
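As a rough illustration of a loop the auto-vectorizer has a good chance of handling, here is a plain C sketch. The float type, the interleaved layout, the function name add_plain, and the file name in the comment are all assumptions rather than details from the original post. The __restrict qualifiers promise gcc that the arrays do not overlap, which usually makes vectorization easier.

```c
#include <stddef.h>

/* Assumed layout (not from the original post): n 2D float vectors
 * stored interleaved as x0,y0,x1,y1,..., i.e. 2*n floats in total.
 *
 * Possible build line, using the flags mentioned in this thread
 * (add_plain.c is a hypothetical file name):
 *   gcc -O3 -march=prescott -ftree-vectorize -ftree-vectorizer-verbose=2 -c add_plain.c
 */
void add_plain(float *__restrict dst,
               const float *__restrict a,
               const float *__restrict b,
               size_t n)
{
    size_t i;
    /* One flat loop over all components; no intrinsics, no manual unrolling. */
    for (i = 0; i < 2 * n; ++i)
        dst[i] = a[i] + b[i];
}
```

If the verbose output says the loop was vectorized, the plain version is already using SIMD instructions, which would fit the timings reported in the question.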