I'm testing the assembly code generation of the main loop of an application with icc 14.0.3 and icc 15.0 for the Intel Xeon Phi coprocessor.
I'm generating a large number of prefetch instructions by hand for this main loop using the _mm512_prefetch intrinsic and compiling with -O3 -no-opt-prefetch -mmic.
The first version of this application contains only these _mm512_prefetch intrinsics and a #pragma omp simd on the main loop.
The second version has been vectorized by hand using KNC intrinsics, in addition to the same _mm512_prefetch intrinsics as the first version.
When I look carefully at the assembly code generated for both versions, I see that the assembly corresponding to the loop body is nearly equivalent. Both versions seem to have the same optimizations applied, the SAME aligned/unaligned loads/stores, etc., BUT the order of the instructions is not the same.
Whereas in the auto-vectorized version ALL prefetch instructions have been interleaved with other instructions throughout the WHOLE loop body (this is just a short snippet):
vprefetch0 2720(%r14,%r13,4) #300.15 c93
vaddps %zmm25, %zmm24, %zmm8 #362.48 c97
vprefetch0 1440(%r14,%r13,4) #301.15 c101
vaddps %zmm27, %zmm26, %zmm7 #363.48 c105
vprefetch0 160(%r14,%r13,4) #302.15 c109
vaddps %zmm29, %zmm28, %zmm6 #364.48 c113
vprefetch0 6560(%r14,%r13,4)
in the hand-vectorized version only a few prefetch instructions have been interleaved, and the vast majority of them are placed one after another at the beginning of the loop body:
vmovaps -3824(%rdx,%rsi,4), %zmm0 #478.2861 c1
vprefetch1 528(%rcx) #459.21 c5
vmovaps -1264(%rdx,%rsi,4), %zmm30 #478.2437 c9
vprefetch0 144(%rcx) #460.21 c13
vmovaps -2544(%rdx,%rsi,4), %zmm31 #478.2649 c17
vprefetch1 -752(%rcx) #461.21 c21
vaddps 3856(%rdx,%rsi,4), %zmm0, %zmm1 #478.2831 c25
vprefetch1 -2032(%rcx) #462.21 c29
vaddps 1296(%rdx,%rsi,4), %zmm30, %zmm3 #478.2407 c33
vprefetch1 -3312(%rcx) #463.21 c37
vaddps 2576(%rdx,%rsi,4), %zmm31, %zmm2 #478.2619 c41
vprefetch1 -4592(%rcx) #464.21 c45
vmulps %zmm11, %zmm1, %zmm1 #478.34 c49
vprefetch1 1808(%rcx) #465.21 c53
vfmadd213ps %zmm1, %zmm13, %zmm3 #478.34 c57
vprefetch1 3088(%rcx) #466.21 c61
vprefetch1 4368(%rcx) #467.21 c65
vprefetch1 5648(%rcx) #468.21 c69
vprefetch0 -1136(%rcx) #469.21 c73
vprefetch0 -2416(%rcx) #470.21 c77
vprefetch0 -3696(%rcx) #471.21 c81
vprefetch0 -4976(%rcx) #472.21 c85
vprefetch0 1424(%rcx) #473.21 c89
vprefetch0 2704(%rcx) #474.21 c93
This causes the auto-vectorized version to run significantly faster than the hand-coded version. When I reordered the prefetch instructions by hand in the assembly of the hand-coded version, I reached the same performance as the auto-vectorized version.
Could you help me understand why this is happening? What could I do to get a similar instruction order generated automatically in the hand-vectorized version?
Thank you in advance!
The autovectorizer is able to do a pretty good job of analyzing the loop and instruction scheduling. There is no such flexibility with intrinsics. Avoid intrinsics unless you can prove the compiler is not doing a good job.
When using the prefetch intrinsics, it's best to turn off compiler prefetching (use option -no-qopt-prefetch or #pragma noprefetch). Please refer to the "Prefetching on Intel® MIC Architecture" link on this page: https://software.intel.com/en-us/articles/programming-and-compiling-for-intel-many-integrated-core-architecture
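As a rough illustration of that setup, here is a minimal sketch of a scalar loop with manual prefetches written next to the loads they serve, instead of grouped at the top of the loop body. Everything named here is an assumption for illustration: PREFETCH_L1 is a hypothetical macro that uses the GCC/Clang builtin __builtin_prefetch as a portable stand-in for the _mm512_prefetch intrinsic, PF_DIST is an arbitrary prefetch distance, and add_with_prefetch is a made-up function, not the loop from the question. The compiler is still free to reorder these instructions, so source order is only a hint.

```c
#include <stddef.h>

/* Hypothetical stand-in for the prefetch intrinsic from the question.
   __builtin_prefetch(addr, rw, locality) is a GCC/Clang builtin;
   on other compilers this macro expands to nothing. */
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH_L1(addr) __builtin_prefetch((addr), 0, 3)
#else
#define PREFETCH_L1(addr) ((void)0)
#endif

/* Assumed prefetch distance in elements; tune for the real loop. */
#define PF_DIST ((size_t)64)

/* c[i] = a[i] + b[i], with one prefetch per input stream placed
   beside the load it serves. */
void add_with_prefetch(const float *a, const float *b, float *c, size_t n)
{
    size_t i = 0;
    size_t main_end = (n > PF_DIST) ? n - PF_DIST : 0;

    /* With icc, a "#pragma noprefetch" before this loop (or the
       command-line option) would keep the compiler from adding its
       own prefetches on top of the manual ones. */
    for (; i < main_end; ++i) {
        PREFETCH_L1(&a[i + PF_DIST]);   /* stays inside the array */
        float x = a[i];
        PREFETCH_L1(&b[i + PF_DIST]);
        float y = b[i];
        c[i] = x + y;
    }

    /* Tail: nothing left to prefetch within bounds. */
    for (; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}
```

Splitting off the tail keeps every prefetch address inside the arrays, so no bounds check is needed in the main loop.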