AVX slower than SSE?

skycle · 12-09-2011

Hello,

I am attempting to use SSE and AVX instructions to optimise my program. I have 3 versions of my code: scalar, SSE, and AVX.

After much optimization, my SSE version is pretty much 4x as fast as my scalar code. This was actually quite surprising; I did not expect to get so close to a 4x improvement.

However, my AVX version is 20% slower than my scalar code!

The program is operating on SoA data, so the difference between SSE and AVX versions is very small (just dividing the upper bound of the loop by 2, and incrementing the pointer by 2x).

If I write a simple test program that sums two arrays, I can indeed see that AVX is 2x as fast as SSE, and 8x as fast as scalar code.
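A minimal sketch of such a test, not the actual program: the function names are illustrative, unaligned loads are assumed, and the array length is assumed to be a multiple of the vector width. It shows how small the SSE-to-AVX change is (same loop, twice the step).

```c
#include <assert.h>
#include <stddef.h>
#include <immintrin.h>  /* SSE and AVX intrinsics */

/* GCC/Clang need AVX enabled per function (or -mavx for the whole file).
   MSVC accepts the intrinsics as they are, so the macro is empty there. */
#if defined(__GNUC__) || defined(__clang__)
#define TARGET_AVX __attribute__((target("avx")))
#else
#define TARGET_AVX
#endif

/* Scalar reference: one float per iteration. */
void add_scalar(const float *a, const float *b, float *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}

/* SSE: four floats per iteration (n must be a multiple of 4). */
void add_sse(const float *a, const float *b, float *out, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
}

/* AVX: eight floats per iteration (n must be a multiple of 8).
   Only call this on a CPU and OS that support AVX. */
TARGET_AVX
void add_avx(const float *a, const float *b, float *out, size_t n)
{
    assert(n % 8 == 0);
    for (size_t i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
    }
}
```

All three functions should produce identical results for the same input; only the number of elements handled per iteration differs.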

My actual algorithm is fairly benign in terms of instructions; I do not use many exotic ones. It is mostly mulps, addps, and rcpps.

I'm using intrinsic functions in VS2010 SP1, and I have an i5 2500 CPU.

I am wondering if there is something subtle that I might be doing wrong?

Thanks in advance for any ideas.

Maxym_D_Intel · 12-10-2011

It would be hard to recommend anything without being able to see your code.
As a suggestion for now, check whether you are hitting the so-called AVX/SSE transition case.

For this, you can use the simple and powerful tool described here:
http://software.intel.com/en-us/articles/intel-software-development-emulator/#TRANSITION
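
For illustration, a transition case can arise when 256-bit AVX code is followed by legacy (non-VEX) SSE code while the upper halves of the YMM registers are still in use. The sketch below is only an assumption about what such code might look like, not the poster's program: `scale_avx` and `offset_sse` are hypothetical names. The usual remedy is to call `_mm256_zeroupper()` at the end of the AVX section, and with Visual Studio to build with /arch:AVX so that the 128-bit intrinsics are also VEX-encoded.

```c
#include <assert.h>
#include <stddef.h>
#include <immintrin.h>  /* SSE and AVX intrinsics */

/* GCC/Clang need AVX enabled per function (or -mavx for the whole file).
   MSVC accepts the intrinsics as they are, so the macro is empty there. */
#if defined(__GNUC__) || defined(__clang__)
#define TARGET_AVX __attribute__((target("avx")))
#else
#define TARGET_AVX
#endif

/* Hypothetical AVX routine: x[i] *= k, eight floats per iteration.
   Only call this on a CPU and OS that support AVX. */
TARGET_AVX
void scale_avx(float *x, float k, size_t n)
{
    assert(n % 8 == 0);
    __m256 vk = _mm256_set1_ps(k);
    for (size_t i = 0; i < n; i += 8)
        _mm256_storeu_ps(x + i, _mm256_mul_ps(_mm256_loadu_ps(x + i), vk));

    /* Clear the upper 128 bits of all YMM registers so that legacy SSE
       code executed afterwards does not pay the transition penalty. */
    _mm256_zeroupper();
}

/* Hypothetical SSE routine that may be compiled with legacy (non-VEX)
   encoding: x[i] += d, four floats per iteration. */
void offset_sse(float *x, float d, size_t n)
{
    assert(n % 4 == 0);
    __m128 vd = _mm_set1_ps(d);
    for (size_t i = 0; i < n; i += 4)
        _mm_storeu_ps(x + i, _mm_add_ps(_mm_loadu_ps(x + i), vd));
}
```

Calling `scale_avx` and then `offset_sse` on the same buffer gives the same results with or without the `_mm256_zeroupper()` call; only the timing is affected.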