"#pragma vector aligned" should be sufficient performance-wise. Some extra code will still be generated, but it will never be executed. To remove that extra code you can manually strip-mine the loop by a factor of four, although that seems to interfere with loop unrolling. At the end of the day, you can always use intrinsics.
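For what it's worth, here is a minimal sketch of what I mean by strip-mining by four. The kernel (saxpy_strip_mined) and its parameters are made up for illustration, and it assumes that n is a multiple of 4 and that x and y are 16-byte aligned. The guard macro only covers the classic Intel compiler, so adjust it for your toolchain.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel: y[i] += a * x[i].
   Assumptions (the caller's promise): n is a multiple of 4 and
   x, y are 16-byte aligned. */
void saxpy_strip_mined(float *y, const float *x, float a, size_t n)
{
    assert(n % 4 == 0);

    /* Outer loop: one strip of four floats per iteration. */
    for (size_t i = 0; i < n; i += 4) {
        /* Intel-specific hint; skipped on other compilers. */
#ifdef __INTEL_COMPILER
#pragma vector aligned
#endif
        /* Inner loop: fixed trip count of four, so there is no tail to handle. */
        for (size_t j = 0; j < 4; ++j)
            y[i + j] += a * x[i + j];
    }
}
```

Whether the compiler then drops the remainder code, and what it does to unrolling, is something you would have to check in the generated assembly.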
1. Did you try the '__fastcall' calling convention (supported on Windows only; option /Gr)?
2. I recommend you look at the MSDN article "Considerations for Writing Prolog/Epilog Code".
3. I use a _declspec( naked ) attribute on the declarations of some functions in order to make them as
small as possible, but this is Microsoft-specific.
From styc: Seriously, a handful of extra bytes of code can cause what you put in the brackets?
Sorry, I was referring to the vectorization prologue/epilogue on each loop rather than the function prologue/epilogue. I can actually get rid of the prologue with proper alignment directives and attributes, but I can't get rid of the vectorization epilogue, no matter what I do. A vectorization epilogue is a code fragment that handles the case where fewer than a SIMD vector's worth of floating-point operations remain at the tail end of my array segment. Since my lower and upper loop bounds are multiples of the SIMD-vector length, it is impossible for this condition to occur.
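To make the terminology concrete, here is a rough scalar model of the loop structure I am talking about. It is only a sketch of what a vectorizer typically emits, not actual compiler output. VLEN (assumed to be 4 floats, as with SSE) and the function name are my own inventions, and the alignment prologue is left out.

```c
#include <stddef.h>

#define VLEN 4  /* assumed SIMD width in floats; an illustration only */

/* Scales a[lo..hi) by s and returns how many iterations ran in the
   epilogue (remainder) loop. */
size_t scale_with_epilogue(float *a, float s, size_t lo, size_t hi)
{
    size_t i = lo;
    size_t tail_iters = 0;

    /* Main body: each pass stands in for one SIMD operation on VLEN floats. */
    for (; i + VLEN <= hi; i += VLEN)
        for (size_t j = 0; j < VLEN; ++j)
            a[i + j] *= s;

    /* Vectorization epilogue: picks up the fewer-than-VLEN leftover elements. */
    for (; i < hi; ++i) {
        a[i] *= s;
        ++tail_iters;
    }
    return tail_iters;
}
```

With lo and hi both multiples of VLEN the function returns 0, meaning the second loop is dead code for my use case. That dead loop is the part I would like the compiler to stop emitting.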