Intel® ISA Extensions

AVX - Scalar Comparison

Agrawal__Mohit

I am comparing the performance of AVX code against scalar code by running a multiplication on two arrays over many iterations.

 

1) The arrays are evaluated only once and remain the same in every iteration.

The output seems logical in this case. AVX performs better than scalar, even when compiling with -O1.

2) The arrays are re-evaluated in each iteration and do not stay the same.

In this case, the output is as expected when no optimization is applied. With -O1, however, scalar unexpectedly performs better than AVX. With no optimization, AVX is faster than scalar by 11 nanoseconds per iteration, whereas with -O1, scalar is faster than AVX by 2 nanoseconds per iteration.
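For reference, the two cases could look roughly like the sketch below. This is not my exact code: the array type (float), the unaligned load/store intrinsics, and the `refill` helper that stands in for "re-evaluating" the arrays are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <immintrin.h>

// Scalar baseline: out[i] = a[i] * b[i].
void mul_scalar(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a[i] * b[i];
    }
}

// AVX version: 8 floats per step with unaligned loads and stores.
// The target attribute (GCC/Clang) lets this compile without -mavx.
__attribute__((target("avx")))
void mul_avx(const float* a, const float* b, float* out, std::size_t n) {
    assert(n % 8 == 0);  // sketch only: no scalar tail loop
    for (std::size_t i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_mul_ps(va, vb));
    }
}

// Hypothetical stand-in for "re-evaluating" the arrays in case 2.
void refill(float* a, float* b, std::size_t n, int iter) {
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = static_cast<float>(i) + static_cast<float>(iter);
        b[i] = 0.5f * static_cast<float>(iter + 1);
    }
}
```

In case 1, `refill` would be called once before the timed loop. In case 2, it would be called inside the loop, so its cost is part of every measured iteration for both the scalar and the AVX variant.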

 

As far as I know, the possible reasons for this discrepancy are the following.

a) Data misalignment - It is taken care of.

b) SSE -> AVX transition - Since this transition is also present in case 1, which performs better, we can say that it has been dealt with. I am also using the -mvzeroupper flag to cover this.

c) Slow data loading / unloading - This could be caused by L1d or L2 cache misses. I am not aware of any other causes.

d) Multiple copies - Since it is working in case 1, this should not be an issue.
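To make point a) concrete, here is a minimal sketch of what "taken care of" could mean: the buffers are 32-byte aligned, so the aligned load and store intrinsics are legal. The helper names `is_aligned_32` and `sum_of_products_aligned` are made up for this example and are not from my benchmark.

```cpp
#include <cassert>
#include <cstdint>
#include <immintrin.h>

// True if p sits on a 32-byte boundary, as _mm256_load_ps requires.
inline bool is_aligned_32(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 32 == 0;
}

// Multiplies 8 floats from a and b with aligned loads and returns
// the sum of the 8 products. The target attribute (GCC/Clang) lets
// this compile without -mavx.
__attribute__((target("avx")))
float sum_of_products_aligned(const float* a, const float* b) {
    assert(is_aligned_32(a) && is_aligned_32(b));
    __m256 prod = _mm256_mul_ps(_mm256_load_ps(a), _mm256_load_ps(b));
    alignas(32) float tmp[8];
    _mm256_store_ps(tmp, prod);
    float sum = 0.0f;
    for (int i = 0; i < 8; ++i) {
        sum += tmp[i];
    }
    return sum;
}
```

Declaring the arrays with `alignas(32)` is one way to satisfy the check. A pointer offset by a single float from such an array fails it.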

 

I have used volatile to ensure that the multiplication is not optimized away at -O1. As a result, each load and each store happens in two steps. Load: vmovps and vinsertf128. Store: vmovups and vextractf128. This adds an overhead of 1 nanosecond per iteration. Even if we do not consider this overhead, AVX is slower than scalar in case 2 with -O1.
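The two-step pattern described above can be written out with intrinsics, which makes it easier to see what is being compared. This is only an illustration of the instruction pattern, not the code g++ generated for my benchmark. The function names are hypothetical, and the compiler is free to choose different instructions for either version.

```cpp
#include <cassert>
#include <immintrin.h>

// One-step form: a single 256-bit unaligned load and store.
__attribute__((target("avx")))
void copy_one_step(const float* src, float* dst) {
    _mm256_storeu_ps(dst, _mm256_loadu_ps(src));
}

// Two-step form: two 128-bit halves joined on load and split on store.
__attribute__((target("avx")))
void copy_two_step(const float* src, float* dst) {
    assert(src != nullptr && dst != nullptr);
    __m128 lo = _mm_loadu_ps(src);      // low 4 floats
    __m128 hi = _mm_loadu_ps(src + 4);  // high 4 floats
    // Join the halves into one 256-bit register (vinsertf128).
    __m256 v = _mm256_insertf128_ps(_mm256_castps128_ps256(lo), hi, 1);
    // Store the low half, then extract and store the high half (vextractf128).
    _mm_storeu_ps(dst, _mm256_castps256_ps128(v));
    _mm_storeu_ps(dst + 4, _mm256_extractf128_ps(v, 1));
}
```

Both functions move the same 8 floats. The second one simply needs more instructions per load and store, which is the overhead I am referring to.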

I cannot work out the exact reason for these results. If someone could guide me, it would be a great help.

 

Additional information: g++ compiler, rdtsc used to measure performance, Skylake microarchitecture processor.
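The rdtsc measurement follows roughly this shape. It is a sketch under assumptions: the helper name `ticks_per_iteration` is invented here, and the lfence calls around `__rdtsc()` are one common way to keep the timed work from drifting across the counter reads, not necessarily what my benchmark does. The result is in TSC ticks, so turning it into nanoseconds needs the TSC frequency of the machine.

```cpp
#include <cassert>
#include <cstdint>
#include <x86intrin.h>

// Hypothetical helper: average TSC ticks per call of `work`
// over `iterations` back-to-back calls.
template <typename F>
double ticks_per_iteration(F&& work, int iterations) {
    assert(iterations > 0);
    _mm_lfence();  // let earlier instructions finish before reading the counter
    std::uint64_t start = __rdtsc();
    _mm_lfence();
    for (int i = 0; i < iterations; ++i) {
        work();
    }
    _mm_lfence();  // let the timed work finish before the second read
    std::uint64_t end = __rdtsc();
    return static_cast<double>(end - start) / iterations;
}
```

Passing the scalar loop and the AVX loop as `work` with the same iteration count gives two numbers that can be compared directly.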

 

Thanks!
