Intel® ISA Extensions

performance issues with SSE2

I am just starting with SSE optimizations. I tried a very simple task: adding an array
of 2-dimensional vectors. I made three versions: one without any SIMD instructions,
one using SSE2 instructions via Intel intrinsics, and one using SSE2 instructions
through GCC vector intrinsics. The best times were obtained without any SIMD
instructions. I used the gcc 4.2 compiler with the -march=prescott and -O3 flags.
When I compiled without the -O3 flag, the code with the GCC vector intrinsics
was 1.5 times faster than the one without SIMD instructions, and the Intel
intrinsics code was the slowest :-(.
Any help will be greatly appreciated.
Black Belt

I doubt this was the original intended topic of this forum, but maybe it's good preparation for AVX.

Your Intel intrinsics code forces use of more instructions than the gcc vector intrinsics. That might be OK on the old NetBurst processors, including Prescott, if you have one of the highest clock speeds, so I'm guessing you may not have met all of those qualifications. As -march=prescott would be a reasonable choice for this code on Core Duo, for example, I can't infer what CPU you chose. Also, it's probably unrolled beyond the optimum.
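For readers comparing the two styles, here is a minimal sketch of the three variants under discussion. It is not the code from the question: the `vec2` layout (two doubles), the function names, and the use of unaligned loads are all assumptions made for illustration. How many instructions each variant compiles to depends on the compiler version and flags, so inspect the assembler output (`gcc -S`) rather than trusting the source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 2-D vector type; the layout used in the question is unknown. */
typedef struct { double x, y; } vec2;

/* GCC vector extension: two doubles packed into one 16-byte value. */
typedef double v2df __attribute__((vector_size(16)));

/* Variant 1: plain C, no explicit SIMD. */
void add_scalar(vec2 *dst, const vec2 *a, const vec2 *b, size_t n)
{
    assert(n == 0 || (dst && a && b));
    for (size_t i = 0; i < n; i++) {
        dst[i].x = a[i].x + b[i].x;
        dst[i].y = a[i].y + b[i].y;
    }
}

#ifdef __SSE2__
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Variant 2: Intel intrinsics. Unaligned loads and stores are used so that
 * any vec2 array works; aligned data could use _mm_load_pd/_mm_store_pd. */
void add_intrin(vec2 *dst, const vec2 *a, const vec2 *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        __m128d va = _mm_loadu_pd((const double *)&a[i]);
        __m128d vb = _mm_loadu_pd((const double *)&b[i]);
        _mm_storeu_pd((double *)&dst[i], _mm_add_pd(va, vb));
    }
}
#endif

/* Variant 3: GCC vector extension. The compiler picks the instructions;
 * arrays of v2df are 16-byte aligned by their type. */
void add_gccvec(v2df *dst, const v2df *a, const v2df *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

In the vector-extension variant the `+` operator leaves instruction selection to the compiler, whereas the intrinsics variant spells out each load, add, and store.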

There were a lot of gcc 4.2 compilers. I'm not sure whether any of them enabled auto-vectorization at -O3, as 4.3 did. If so, that would involve SIMD instructions, and could demonstrate that a compiler can do a better job of auto-vectorization than you do when you tie it down to specific instructions.

I have an Intel Core Duo at 1.6 GHz. I am using GCC 4.2.1 (Ubuntu 4.2.1-5ubuntu4) on Ubuntu 7.10. I checked the assembler output of the "normal" code (the one without Intel or GCC intrinsics), and it did not contain any SIMD instructions, so I guess there is no auto-vectorization going on. So I still can't understand the absence of any speed-up with the -O3 flag.

PS - Could you suggest an appropriate forum for this type of query?
Black Belt

For questions about usage of gcc, you can go to find the mailing list reference and subscribe to

I think you are saying that your plain C code, compiled with -O3, is faster than your other versions. When you dictate use of SSE intrinsics, chances are there isn't enough change in the generated code with -O level to make a difference.

Your gcc should support the flag -ftree-vectorize to perform auto-vectorization. The additional flag -ftree-vectorizer-verbose=2 makes the compiler report its vectorization actions.
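As a rough illustration of trying those flags, here is a plain C loop written so the vectorizer has a fair chance at it. The function name, the interleaved x,y float layout, and the file name `add.c` are assumptions for the sketch, not taken from the original code. Whether the loop is actually vectorized depends on the gcc build, so check the verbose report or the assembler output.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: adds n 2-D vectors stored as interleaved x,y floats.
 * `restrict` promises that the arrays do not overlap, which removes one
 * common obstacle to auto-vectorization.
 *
 * Possible compile line (assumed file name add.c):
 *   gcc -std=c99 -O3 -march=prescott -ftree-vectorize \
 *       -ftree-vectorizer-verbose=2 -c add.c
 */
void add_vec2_arrays(float *restrict dst, const float *restrict a,
                     const float *restrict b, size_t n)
{
    assert(n == 0 || (dst && a && b));
    /* One flat loop over 2*n floats: x and y components are treated alike. */
    for (size_t i = 0; i < 2 * n; i++)
        dst[i] = a[i] + b[i];
}
```

Newer gcc releases report the same kind of information through -fopt-info-vec instead of -ftree-vectorizer-verbose.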