According to the article "IIR Gaussian Blur Filter Implementation using Intel® Advanced Vector Extensions",
the AVX version should be faster than the SSE version. However, my performance measurements show the opposite:
The computer supports AVX
number CPU in the system = 4
IIR Gaussian Filter Coefficients are:
a0 = 0.021175, a1 = -0.017807, a2 = 0.021103, a3 = -0.017875, b1 = -1.837578, b2 = 0.844174, cprev = 0.510583, cnext = 0.489409
image width = 1024, height = 1024
Running multi threaded SSE code
Running multi threaded AVX code
SSE and AVX Implementation matches
SSE horizontal Pass min: 4.94052 max: 109.795 avg: 6.97836
SSE vertical Pass min: 3.32723 max: 89.6741 avg: 4.52679
AVX horizontal Pass min: 33.0741 max: 159.732 avg: 43.4993
AVX vertical Pass min: 9.69314 max: 162.726 avg: 14.5814
My OS is Windows 7 64-bit.
My CPU is an Intel(R) Core(TM) i5-3230M CPU @ 2.6GHz.
My IDE is VS2013, with the OpenMP option enabled.
Why is my AVX code so slow?
Can anyone help me understand this?
Thank you very much.
You would need to furnish more information to define your question. A web page with the title you quote was once posted on the Intel site, but access to it is restricted, so the number of people who could give even a partial answer based on having seen it is apparently very small.
On Windows 7 you need the service packs installed to support AVX and HyperThreading.
Are you using AVX intrinsics with the Microsoft compiler, when the article was presumably written for the Intel compiler? Performance of intrinsics code is likely to depend on data alignment; the Ivy Bridge CPU reduced but didn't eliminate the performance loss associated with unaligned AVX accesses.
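One way to rule out alignment as a factor is to check that the image buffers start on a 32-byte boundary, the natural alignment for 256-bit AVX loads and stores. The following C sketch is only an illustration: the helper names `is_aligned` and `alloc_floats_32` are hypothetical and are not taken from the article's code, and the over-allocation trick is just one portable way to get an aligned pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns nonzero if p is aligned to `alignment` bytes.
   `alignment` must be a nonzero power of two. */
int is_aligned(const void *p, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* Hypothetical helper: returns a 32-byte-aligned pointer with room for
   `count` floats. The pointer to pass to free() is stored in *raw_out.
   Returns NULL if the allocation fails. */
float *alloc_floats_32(size_t count, void **raw_out)
{
    /* Over-allocate by 31 bytes so an aligned address always fits. */
    unsigned char *raw = (unsigned char *)malloc(count * sizeof(float) + 31);
    uintptr_t addr;

    if (raw == NULL)
        return NULL;
    *raw_out = raw;
    /* Round the address up to the next multiple of 32. */
    addr = ((uintptr_t)raw + 31) & ~(uintptr_t)31;
    return (float *)addr;
}
```

Calling `is_aligned(buffer, 32)` on each input and output buffer before the timed passes would show whether the AVX code is operating on unaligned data.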
If you are using the Microsoft compiler, you would likely need /fp:fast, which is roughly equivalent to the Intel compiler setting /fp:source (less aggressive optimization than the article presumably expected). The Microsoft compiler removes most optimization inside OpenMP regions, and I haven't looked into how that affects AVX intrinsics code. In the cases I've seen, there is no auto-vectorization with Microsoft compilers in OpenMP parallel regions.
Did you try setting the number of threads to the number of physical cores, or disabling HyperThreading, if you are running on a HyperThreading platform? If it were to run OK with HyperThreading, it might be with the help of affinity settings (which aren't supported in Microsoft's OpenMP).
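As a rough sketch of the thread-count experiment: the OpenMP runtime call `omp_set_num_threads` limits the team size for subsequent parallel regions. The function `halve_rows` below is a hypothetical stand-in for one filter pass, not the article's code, and the value 2 in the usage note is an assumption (the i5-3230M is, as far as I know, a 2-core, 4-thread part, which would match the "number CPU in the system = 4" line in the output). The OpenMP parts are guarded so that the file still builds, single-threaded, when OpenMP is not enabled.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical per-row pass: halves every pixel of a width x height image,
   using at most `threads` OpenMP threads. Returns the thread count that
   was requested after clamping to at least 1. */
int halve_rows(float *img, int width, int height, int threads)
{
    int y;

    if (threads < 1)
        threads = 1;

#ifdef _OPENMP
    /* Limit the team size used by the parallel region below. */
    omp_set_num_threads(threads);
    #pragma omp parallel for
#endif
    for (y = 0; y < height; ++y) {
        int x; /* declared inside the loop body, so private to each thread */
        for (x = 0; x < width; ++x)
            img[y * width + x] *= 0.5f;
    }
    return threads;
}
```

Timing the same pass with `threads` set to 2 and then to 4 would show whether HyperThreading is helping or hurting on this machine.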
Based on the limited info in your post, the only thing I can think of that would make AVX so slow is not using VZEROUPPER everywhere it's needed. There's a massive speed penalty for mixing SSE and AVX without VZEROUPPER. Web search for it, or search in Agner Fog's optimization guides, to find out where you need to use it. You can also use Intel's CPU simulator thing to detect slow transitions in your code.
The vzeroupper issue could arise if you are using AVX intrinsics with the Microsoft compiler but not setting /arch:AVX. It's just one of many guesses that might be made in the absence of adequate information.
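To illustrate the VZEROUPPER point from the replies above, here is a minimal C sketch of where the instruction usually goes: at the end of an AVX routine, before control returns to code that may execute legacy SSE instructions. `scale_avx` is a hypothetical kernel, not the article's filter. The AVX path is compiled only when the compiler defines `__AVX__` (MSVC does so with /arch:AVX, GCC and Clang with -mavx); otherwise the function falls back to a scalar loop. Compilers building in AVX mode often insert VZEROUPPER themselves, so the explicit call may be redundant there, but it should be harmless.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* Hypothetical AVX kernel: multiplies n floats in place by `factor`. */
void scale_avx(float *data, size_t n, float factor)
{
    size_t i = 0;

    assert(data != NULL || n == 0);

#if defined(__AVX__)
    {
        __m256 f = _mm256_set1_ps(factor);
        /* Process 8 floats per iteration with unaligned loads and stores. */
        for (; i + 8 <= n; i += 8) {
            __m256 v = _mm256_loadu_ps(data + i);
            _mm256_storeu_ps(data + i, _mm256_mul_ps(v, f));
        }
        /* Clear the upper halves of the YMM registers so that any
           legacy SSE code that runs next avoids the transition penalty. */
        _mm256_zeroupper();
    }
#endif

    /* Scalar tail (or the whole array when AVX is not enabled). */
    for (; i < n; ++i)
        data[i] *= factor;
}
```

If the SSE and AVX passes of the benchmark are built in the same project, checking that every AVX function ends this way, or that the whole project is compiled with /arch:AVX, would be a quick test of this guess.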