Assembly instructions reordering is not optimal when providing hand-vectorized code

Diego_Caballero · 09-17-2014

I'm testing the assembly code generation of the main loop of an application with icc 14.0.3 and icc 15.0 for the Intel Xeon Phi coprocessor.

I'm generating a large number of prefetch instructions by hand for this main loop using the _mm512_prefetch intrinsic and compiling with -O3 -no-opt-prefetch -mmic.

The first version of this application contains only these _mm512_prefetch intrinsics and a #pragma omp simd on the main loop.
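For readers who want a concrete picture, here is a minimal sketch of what a loop of that shape might look like. It is not the poster's code: the kernel `scaled_sum`, the distance `PF_DIST`, and the macro `PREFETCH_L1` are hypothetical names, and `__builtin_prefetch` (a GCC/Clang builtin) stands in for `_mm512_prefetch`, which is only available when targeting the coprocessor.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical prefetch distance in elements; not taken from the post. */
#define PF_DIST 64

/* Portable stand-in for the KNC-only _mm512_prefetch intrinsic (assumption). */
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH_L1(p) __builtin_prefetch((p), 0, 3)
#else
#define PREFETCH_L1(p) ((void)(p))
#endif

/* Computes out[i] = c * (a[i] + b[i]) with software prefetches issued
 * inside a loop that is left to the compiler to vectorize. */
void scaled_sum(float *out, const float *a, const float *b, float c, size_t n)
{
    assert(out != NULL && a != NULL && b != NULL);
#pragma omp simd
    for (size_t i = 0; i < n; ++i) {
        /* Clamp the look-ahead index so the address stays inside the arrays. */
        size_t j = (i + PF_DIST < n) ? i + PF_DIST : i;
        PREFETCH_L1(&a[j]);
        PREFETCH_L1(&b[j]);
        out[i] = c * (a[i] + b[i]);
    }
}
```

In a sketch like this the compiler owns both vectorization and instruction scheduling, so it is free to place the prefetches wherever its scheduler sees fit. Compilers without OpenMP SIMD support enabled simply ignore the pragma.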

The second version has been vectorized by hand using KNC intrinsics, in addition to the same _mm512_prefetch intrinsics as the first version.

When I look carefully at the assembly code generated for both versions, I see that the assembly for the loop body is nearly equivalent. Both versions seem to have the same optimizations applied, the SAME aligned/unaligned loads/stores, etc., BUT the order of the instructions is not the same.

Whereas in the auto-vectorized version ALL prefetch instructions have been interleaved with other instructions throughout the WHOLE loop body (this is just a short snippet):

        vprefetch0 2720(%r14,%r13,4)                            #300.15 c93
        vaddps    %zmm25, %zmm24, %zmm8                         #362.48 c97
        vprefetch0 1440(%r14,%r13,4)                            #301.15 c101
        vaddps    %zmm27, %zmm26, %zmm7                         #363.48 c105
        vprefetch0 160(%r14,%r13,4)                             #302.15 c109
        vaddps    %zmm29, %zmm28, %zmm6                         #364.48 c113
        vprefetch0 6560(%r14,%r13,4)

in the hand-vectorized version only a few prefetch instructions have been interleaved, and the vast majority of them appear one after another at the beginning of the loop body:

        vmovaps   -3824(%rdx,%rsi,4), %zmm0                     #478.2861 c1
        vprefetch1 528(%rcx)                                    #459.21 c5
        vmovaps   -1264(%rdx,%rsi,4), %zmm30                    #478.2437 c9
        vprefetch0 144(%rcx)                                    #460.21 c13
        vmovaps   -2544(%rdx,%rsi,4), %zmm31                    #478.2649 c17
        vprefetch1 -752(%rcx)                                   #461.21 c21
        vaddps    3856(%rdx,%rsi,4), %zmm0, %zmm1               #478.2831 c25
        vprefetch1 -2032(%rcx)                                  #462.21 c29
        vaddps    1296(%rdx,%rsi,4), %zmm30, %zmm3              #478.2407 c33
        vprefetch1 -3312(%rcx)                                  #463.21 c37
        vaddps    2576(%rdx,%rsi,4), %zmm31, %zmm2              #478.2619 c41
        vprefetch1 -4592(%rcx)                                  #464.21 c45
        vmulps    %zmm11, %zmm1, %zmm1                          #478.34 c49
        vprefetch1 1808(%rcx)                                   #465.21 c53
        vfmadd213ps %zmm1, %zmm13, %zmm3                        #478.34 c57
        vprefetch1 3088(%rcx)                                   #466.21 c61
        vprefetch1 4368(%rcx)                                   #467.21 c65
        vprefetch1 5648(%rcx)                                   #468.21 c69
        vprefetch0 -1136(%rcx)                                  #469.21 c73
        vprefetch0 -2416(%rcx)                                  #470.21 c77
        vprefetch0 -3696(%rcx)                                  #471.21 c81
        vprefetch0 -4976(%rcx)                                  #472.21 c85
        vprefetch0 1424(%rcx)                                   #473.21 c89
        vprefetch0 2704(%rcx)                                   #474.21 c93

This causes the auto-vectorized version to run significantly faster than the hand-coded version. When I shuffled the prefetch instructions by hand in the assembly of the hand-coded version, I reached the same performance as the auto-vectorized version.

Could you help me understand why this is happening? What could I do to get a similar instruction order generated automatically in the hand-vectorized version?

Thank you in advance!

Amanda_S_Intel · ‎09-19-2014

The autovectorizer is able to do a pretty good job of analyzing the loop and scheduling its instructions. There is no such flexibility with intrinsics. Avoid intrinsics unless you can prove the compiler is not doing a good job.

When using the prefetch intrinsics, it's best to turn off compiler prefetching (use option -no-opt-prefetch, spelled -qno-opt-prefetch in icc 15.0 and later, or #pragma noprefetch). Please refer to the "Prefetching on Intel® MIC Architecture" link on this page: https://software.intel.com/en-us/articles/programming-and-compiling-for-intel-many-integrated-core-architecture
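As a rough illustration of the per-loop alternative to the command-line option, the sketch below places #pragma noprefetch in front of a loop that issues its own prefetches. Everything except the pragma itself is an assumption made for the example: the kernel `dot`, the helper `ahead`, the distances `L1_DIST` and `L2_DIST`, and the use of `__builtin_prefetch` as a portable stand-in for `_mm512_prefetch`. The mapping of its locality argument to a specific cache level is only approximate. The pragma is guarded so that other compilers do not see it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical prefetch distances in elements; tune per kernel. */
#define L2_DIST 256
#define L1_DIST 64

/* Portable stand-ins for the KNC prefetch intrinsic (assumption). */
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH_L1(p) __builtin_prefetch((p), 0, 3)
#define PREFETCH_L2(p) __builtin_prefetch((p), 0, 2)
#else
#define PREFETCH_L1(p) ((void)(p))
#define PREFETCH_L2(p) ((void)(p))
#endif

/* Clamp a look-ahead index so the prefetch address stays inside the array. */
static size_t ahead(size_t i, size_t dist, size_t n)
{
    return (i + dist < n) ? i + dist : i;
}

/* Dot product with manual prefetching; compiler-generated prefetches are
 * switched off for this loop only when built with the Intel compiler. */
float dot(const float *x, const float *y, size_t n)
{
    assert(x != NULL && y != NULL);
    float acc = 0.0f;
#if defined(__INTEL_COMPILER)
#pragma noprefetch
#endif
    for (size_t i = 0; i < n; ++i) {
        PREFETCH_L2(&x[ahead(i, L2_DIST, n)]);
        PREFETCH_L2(&y[ahead(i, L2_DIST, n)]);
        PREFETCH_L1(&x[ahead(i, L1_DIST, n)]);
        PREFETCH_L1(&y[ahead(i, L1_DIST, n)]);
        acc += x[i] * y[i];
    }
    return acc;
}
```

Scoping the pragma to one loop keeps compiler prefetching available for the rest of the translation unit, whereas the command-line option disables it everywhere.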