Why is using SIMD so much slower than not using SIMD?

Raymond_S_ · ‎04-25-2016

Dear all:

I compared two versions of my special convolution algorithm: one using SIMD and one normal (no SIMD). Strangely, the SIMD version is slower.

Test result:

Using SIMD: 26ms

Not Using SIMD: 42ms.

Both algorithms are single-threaded. The source code is attached.

Help!

Vladimir_P_1234567890 · ‎09-07-2016

As far as I can see "Using SIMD:" is faster, isn't it?

--Vladimir

Vladimir_P_1234567890 · ‎09-07-2016

OK, it looks like there is a mistake in your description. The "normal" algorithm is faster because it was autovectorized, and the autovectorization was more efficient than the manual vectorization.

LOOP BEGIN at C:\temp\pyramid\pyramid_fma_intrinsic\main.cpp(29,3) inlined into C:\temp\pyramid\pyramid_fma_intrinsic\main.cpp(144,2)
C:\temp\pyramid\pyramid_fma_intrinsic\main.cpp(29,3):remark #15300: LOOP WAS VECTORIZED
LOOP END
===========================================================================
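To illustrate the point, here is a minimal sketch of the two styles being compared. It is not the attached code: the kernel (dst[i] += src[i] * w) and the names axpy_scalar and axpy_manual are hypothetical, chosen only to show a plain loop that a compiler may autovectorize next to a hand-written AVX/FMA version. The intrinsic path is compiled only when the compiler defines __AVX__ and __FMA__; otherwise both functions run scalar code.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__AVX__) && defined(__FMA__)
#include <immintrin.h>
#endif

// Hypothetical kernel, not taken from the attachment: dst[i] += src[i] * w.
// A plain loop like this is a typical candidate for autovectorization.
void axpy_scalar(const float* src, float* dst, std::size_t n, float w) {
    assert(src != nullptr && dst != nullptr);
    for (std::size_t i = 0; i < n; ++i)
        dst[i] += src[i] * w;
}

// The same kernel vectorized by hand, 8 floats per iteration.
void axpy_manual(const float* src, float* dst, std::size_t n, float w) {
    assert(src != nullptr && dst != nullptr);
    std::size_t i = 0;
#if defined(__AVX__) && defined(__FMA__)
    const __m256 vw = _mm256_set1_ps(w);           // broadcast w to all lanes
    for (; i + 8 <= n; i += 8) {
        __m256 s = _mm256_loadu_ps(src + i);       // unaligned loads
        __m256 d = _mm256_loadu_ps(dst + i);
        d = _mm256_fmadd_ps(s, vw, d);             // d = s * vw + d
        _mm256_storeu_ps(dst + i, d);
    }
#endif
    // Scalar tail (or the whole range when AVX/FMA is not enabled).
    for (; i < n; ++i)
        dst[i] += src[i] * w;
}
```

Whether the first loop is actually vectorized depends on the compiler and its options; a vectorization report such as the one above is the way to check.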

With autovectorization disabled, I got:

pyramid normal elapsed 96.165665 ms   =====20000.000000
pyramid fma elapsed  43.360785 ms   =====20000.000000

--Vladimir
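For anyone who wants to repeat this kind of comparison, a small timing helper along these lines may be useful. This is a sketch under assumptions: elapsed_ms is a hypothetical name, and taking the best of several runs is one common choice, not necessarily what the attached benchmark does. The post does not say how autovectorization was turned off; options that exist for this purpose include /Qvec- (Intel compiler on Windows), -no-vec (Intel compiler on Linux), -fno-tree-vectorize (GCC), and -fno-vectorize (Clang).

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: runs f() `repeats` times and returns the fastest
// wall-clock time of a single run, in milliseconds.
template <class F>
double elapsed_ms(F&& f, int repeats) {
    assert(repeats > 0);
    using clock = std::chrono::steady_clock;   // monotonic, suited to intervals
    double best = -1.0;
    for (int r = 0; r < repeats; ++r) {
        const auto t0 = clock::now();
        f();
        const auto t1 = clock::now();
        const double ms =
            std::chrono::duration<double, std::milli>(t1 - t0).count();
        if (best < 0.0 || ms < best)
            best = ms;                          // keep the fastest run
    }
    return best;
}
```

Build the same source twice, once with vectorization enabled and once with it disabled, and time both kernels in each build so the four numbers can be compared directly.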

SergeyKostrov · ‎09-08-2016

It is not clear which CPU the tests were run on. If they were run on a CPU that supports the AVX ISA, and the binary mixes SSE / SSE2 / SSE4.x instructions with AVX instructions, then SSEx-to-AVX and AVX-to-SSEx transitions occur, and these hurt performance (the code gets slower!).
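A minimal sketch of the usual precaution, assuming the AVX code sits in its own function and may be followed by legacy (non-VEX) SSE code: call _mm256_zeroupper() after the last 256-bit operation. The function sum_avx is hypothetical and only illustrates where the call goes; the AVX path is compiled only when __AVX__ is defined. Compilers often insert the equivalent vzeroupper instruction on their own when building with AVX enabled, so this matters mainly when hand-written intrinsics or assembly are mixed with separately compiled SSE code.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Hypothetical example: sums n floats, using 256-bit AVX when available.
float sum_avx(const float* p, std::size_t n) {
    assert(p != nullptr || n == 0);
    float total = 0.0f;
    std::size_t i = 0;
#if defined(__AVX__)
    __m256 acc = _mm256_setzero_ps();
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(p + i));
    alignas(32) float lanes[8];
    _mm256_store_ps(lanes, acc);       // spill the 8 partial sums
    for (float v : lanes)
        total += v;
    // Clear the upper halves of the YMM registers so that legacy SSE code
    // executed afterwards does not pay an AVX-to-SSE transition penalty.
    _mm256_zeroupper();
#endif
    for (; i < n; ++i)                  // scalar tail
        total += p[i];
    return total;
}
```

The other common remedy is to compile the whole program with AVX enabled, so that the compiler emits VEX-encoded forms of the SSE instructions as well.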