I am comparing the performance of AVX against scalar by doing a multiplication operation on two arrays iteratively.
1) The arrays are evaluated only once and remain the same in each iteration.
The output seems logical in this case: AVX performs better than scalar, even when compiling with -O1.
2) The arrays are re-evaluated in each iteration and do not stay the same.
In this case, the output is as expected when no optimization is applied. With -O1, however, scalar unexpectedly performs better than AVX. With no optimization, AVX is faster than scalar by 11 nanoseconds per iteration, whereas with -O1, scalar is faster than AVX by 2 nanoseconds per iteration.
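To make the two cases concrete, here is a minimal sketch of the kind of comparison described above. It is not the exact code that was measured: the function names (`fill`, `mul_scalar`, `mul_avx`), the use of `float` data, and the way the arrays are "re-evaluated" are all assumptions for illustration. In case 1, `fill` would be called once before the timed loop; in case 2, it would be called inside the loop on every iteration.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define MUL_HAVE_X86 1
#endif

// Hypothetical input generator. Case 1: call once before the loop.
// Case 2: call inside the loop so the inputs change every iteration.
void fill(float* a, float* b, std::size_t n, int iter) {
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = static_cast<float>(i) + static_cast<float>(iter);
        b[i] = 2.0f;
    }
}

// Scalar baseline: one multiplication per element.
void mul_scalar(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a[i] * b[i];
    }
}

#ifdef MUL_HAVE_X86
// AVX version: 8 floats per step. The target attribute (GCC/Clang) lets this
// function compile even when -mavx is not passed on the command line.
__attribute__((target("avx")))
void mul_avx(const float* a, const float* b, float* out, std::size_t n) {
    assert(n % 8 == 0);  // sketch only: no scalar tail loop
    for (std::size_t i = 0; i < n; i += 8) {
        __m256 va = _mm256_load_ps(a + i);  // aligned load: needs 32-byte alignment
        __m256 vb = _mm256_load_ps(b + i);
        _mm256_store_ps(out + i, _mm256_mul_ps(va, vb));
    }
}
#endif
```

The aligned load and store require the arrays to be 32-byte aligned (for example, declared with `alignas(32)`). In case 2 the cost of `fill` is part of every iteration for both versions, so the measured difference between scalar and AVX covers more than the multiplication itself.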
As far as I know, the possible reasons for this discrepancy are the following.
a) Data misalignment - It is taken care of.
b) SSE -> AVX transition - Since this transition is also present in case 1, which performs better, I assume it has been dealt with. I am also using the -mvzeroupper flag to cover this.
c) Slow data loading / unloading - This could be because of L1d or L2 cache misses. I am not aware of any other reasons as such.
d) Multiple copies - Since it is working in case 1, this should not be an issue.
I have used volatile to ensure that the multiplication is not optimized away at -O1. As a result, each load and store is done in two steps (load: vmovps and vinsertf128; store: vmovups and vextractf128), which adds an overhead of 1 nanosecond per iteration. Even if we do not consider this overhead, AVX is slower than scalar in case 2 with -O1.
I could not work out the exact reason for these results. Any guidance would be a great help.
Additional information: g++ compiler, rdtsc used to measure performance, Skylake microarchitecture processor.
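For completeness, a minimal sketch of how a per-iteration figure can be taken with rdtsc. It assumes the `__rdtsc()` intrinsic from `<x86intrin.h>` (GCC/Clang on x86). The helper names `read_ticks` and `ticks_per_iter` are hypothetical, and the `std::chrono` fallback is only there so the sketch builds on other targets. It is not the exact harness used for the numbers above.

```cpp
#include <cstdint>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
// Read the time stamp counter (reference ticks, not nanoseconds).
static inline std::uint64_t read_ticks() { return __rdtsc(); }
#else
#include <chrono>
// Fallback for non-x86 builds: steady clock ticks instead of the TSC.
static inline std::uint64_t read_ticks() {
    return static_cast<std::uint64_t>(
        std::chrono::steady_clock::now().time_since_epoch().count());
}
#endif

// Average number of ticks per call of fn over `iters` calls.
// The loop overhead is included in the result for both scalar and AVX runs.
template <class F>
double ticks_per_iter(F&& fn, std::uint64_t iters) {
    const std::uint64_t start = read_ticks();
    for (std::uint64_t i = 0; i < iters; ++i) {
        fn();
    }
    const std::uint64_t end = read_ticks();
    return static_cast<double>(end - start) / static_cast<double>(iters);
}
```

The value returned is in TSC ticks. Converting it to nanoseconds needs the TSC frequency of the machine, which on Skylake is a fixed reference rate and not the current core clock. rdtsc is also not a serializing instruction, so differences of a few ticks per iteration are within the noise of this kind of measurement.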