Why is my AVX slower than SSE?

Shaquille_W_1 · ‎06-02-2015

According to the article "IIR Gaussian Blur Filter Implementation using Intel® Advanced Vector Extensions", AVX should be faster than SSE. But my performance measurements are as follows:

The computer supports AVX
number CPU in the system = 4

IIR Gaussian Filter Coefficients are:
a0 = 0.021175, a1 = -0.017807, a2 = 0.021103, a3 = -0.017875, b1 = -1.837578, b2 = 0.844174, cprev = 0.510583, cnext = 0.489409

image width = 1024, height = 1024

Running multi threaded SSE code

Running multi threaded AVX code

SSE and AVX Implementation matches

Performance Measurement:

SSE horizontal Pass min: 4.94052 max: 109.795 avg: 6.97836
SSE vertical Pass min: 3.32723 max: 89.6741 avg: 4.52679

AVX horizontal Pass min: 33.0741 max: 159.732 avg: 43.4993
AVX vertical Pass min: 9.69314 max: 162.726 avg: 14.5814

My OS is Windows 7 64-bit

My CPU is Intel(R) Core(TM) i5-3230M CPU @ 2.6GHz 2.6GHz

My IDE is VS2013, with the OpenMP option enabled

I want to know why my AVX code is so slow.

Can anyone help me understand it?

Thank you very much

TimP · ‎06-07-2015

You would need to furnish more information to define your question. Although a web page with the title you quote was once posted on the Intel site, it is restricted, so apparently very few people have seen it and could give even a partial answer.

With Windows 7 you need the service pack (SP1) installed to support AVX and HyperThreading.

Are you using AVX intrinsics with the Microsoft compiler, when presumably the article was about the Intel compiler? Performance of intrinsics code is likely to depend on data alignment; the Ivy Bridge CPU reduced but didn't eliminate the performance loss associated with unaligned AVX.
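To make the alignment point concrete: _mm256_load_ps requires a 32-byte aligned address, while _mm256_loadu_ps accepts any address. The following is only a minimal sketch for checking what your buffers actually look like. The names is_aligned and Row8 are made up for illustration and are not from the article's code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True when p sits on a `boundary`-byte boundary (32 for 256-bit AVX loads).
inline bool is_aligned(const void* p, std::size_t boundary) {
    assert(boundary != 0);
    return reinterpret_cast<std::uintptr_t>(p) % boundary == 0;
}

// Hypothetical block of 8 floats (one YMM register's worth).
// alignas(32) forces every Row8 object to start on a 32-byte boundary.
struct alignas(32) Row8 {
    float v[8];
};
```

Calling is_aligned(ptr, 32) on the image and temporary buffers before the timed loops would show whether the AVX path is running on unaligned data.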

If you are using the Microsoft compiler, you would likely need /fp:fast, which is roughly equivalent to the Intel compiler setting /fp:source (less aggressive optimization than the article presumably expected). The Microsoft compiler removes most optimization inside OpenMP regions, and I haven't looked into how that affects AVX intrinsics code. In cases I've seen, there is no auto-vectorization with Microsoft compilers in OpenMP parallel regions.

Did you try adjusting number of threads to number of physical cores, or disabling HyperThreading, if you run on a HyperThread platform? If it were to run OK with HyperThreading, it might be with the help of affinity settings (which aren't supported in Microsoft's OpenMP).
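As a rough sketch of the thread-count experiment: the output above reports 4 CPUs, and if those are 2 physical cores with HyperThreading (an assumption here, so check your CPU's specification), you would cap OpenMP at 2 threads. The helper names below are hypothetical; omp_set_num_threads is the standard OpenMP call.

```cpp
#include <algorithm>
#include <cassert>
#ifdef _OPENMP
#include <omp.h>
#endif

// Assumption: the caller knows how many hardware threads share one core
// (2 when HyperThreading is enabled, 1 otherwise).
inline int physical_core_guess(int logical_cpus, int threads_per_core) {
    return std::max(1, logical_cpus / std::max(1, threads_per_core));
}

// Caps the OpenMP team size at the guessed physical core count.
// Without OpenMP enabled at compile time it only returns the value.
inline int limit_openmp_threads(int logical_cpus, int threads_per_core) {
    const int n = physical_core_guess(logical_cpus, threads_per_core);
#ifdef _OPENMP
    omp_set_num_threads(n);
#endif
    return n;
}
```

Calling limit_openmp_threads(4, 2) before the parallel regions and re-running both the SSE and AVX timings would show whether oversubscribing the cores is part of the gap.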

Peter_Cordes · ‎06-07-2015

Based on the limited info in your post, the only thing I can think of that would make AVX so slow is not using VZEROUPPER everywhere it's needed. There's a massive speed penalty for mixing SSE and AVX without VZEROUPPER. Search the web for it, or search in Agner Fog's optimization guides, to find out where you need to use it. You can also use Intel's emulator (the Software Development Emulator, SDE) to detect slow transitions in your code.
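For illustration only, here is a sketch of where VZEROUPPER goes. This is not the article's blur kernel; scale_avx is a made-up function. The intrinsic _mm256_zeroupper() emits VZEROUPPER, and the idea is to issue it once the 256-bit work is done and before any legacy-encoded SSE code can run. The AVX_TARGET macro is an assumption for GCC/Clang builds without -mavx; with MSVC it expands to nothing.

```cpp
#include <cassert>
#include <cstddef>
#include <immintrin.h>

#if defined(__GNUC__) || defined(__clang__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

// Hypothetical kernel: multiplies n floats in place by `factor`.
AVX_TARGET void scale_avx(float* data, std::size_t n, float factor) {
    assert(data != nullptr || n == 0);
    const __m256 f = _mm256_set1_ps(factor);
    std::size_t i = 0;
    // Main loop: 8 floats per iteration, unaligned loads and stores.
    for (; i + 8 <= n; i += 8) {
        const __m256 v = _mm256_loadu_ps(data + i);
        _mm256_storeu_ps(data + i, _mm256_mul_ps(v, f));
    }
    // Clear the upper halves of the YMM registers so that SSE code
    // executed afterwards does not pay the state-transition penalty.
    _mm256_zeroupper();
    // Scalar tail for the remaining n % 8 elements.
    for (; i < n; ++i) {
        data[i] *= factor;
    }
}
```

Compilers that generate AVX code for the whole function often insert VZEROUPPER at function exit themselves, so the explicit call matters most when AVX intrinsics sit in code otherwise compiled as SSE.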

TimP · ‎06-10-2015

The issue about vzeroupper could arise if you are using AVX intrinsics with the Microsoft compiler, but not setting /arch:AVX. It's just one of many guesses which might be made in the absence of adequate information.
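One cheap way to test this guess, sketched under the assumption that the AVX path lives in its own source file: both the Microsoft compiler (with /arch:AVX) and GCC/Clang (with -mavx) define the __AVX__ macro, so the build setting can be checked from the code. The function names are hypothetical.

```cpp
#include <cassert>

// True when the compiler was told to generate AVX code for this file
// (/arch:AVX on MSVC, -mavx on GCC/Clang); both define __AVX__ then.
constexpr bool built_with_avx() {
#if defined(__AVX__)
    return true;
#else
    return false;
#endif
}

// Hypothetical debug-time guard to call before the AVX path runs.
inline void check_avx_build() {
    assert(built_with_avx() && "AVX intrinsics compiled without /arch:AVX or -mavx");
}
```

If built_with_avx() returns false in the file holding the AVX intrinsics, the surrounding code is being emitted as legacy SSE, which is the situation where missing VZEROUPPER hurts.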