Intel® C++ Compiler
Support and discussions for creating C++ code that runs on platforms based on Intel® processors.

Assembly instructions reordering is not optimal when providing hand-vectorized code

Diego_Caballero

I'm testing the assembly code generation of the main loop of an application with icc 14.0.3 and icc 15.0 for the Intel Xeon Phi coprocessor.

I'm generating a large number of prefetch instructions by hand for this main loop using the _mm512_prefetch intrinsic and compiling with -O3 -no-opt-prefetch -mmic.

The first version of this application contains only these _mm512_prefetch intrinsics and a #pragma omp simd on the main loop.

The second version has been vectorized by hand using KNC intrinsics, in addition to the same _mm512_prefetch intrinsics as in the previous version.
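For readers who want a concrete picture of the first version, the following is a minimal sketch, not the original application code. It assumes a simple element-wise add, uses the GCC/Clang builtin `__builtin_prefetch` as a portable stand-in for the prefetch intrinsic mentioned above, and the function name and prefetch distance are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical prefetch distance in elements; the post does not state one.
constexpr std::size_t kPrefetchDistance = 64;

// out[i] = a[i] + b[i], with software prefetches issued for a later iteration.
void add_with_prefetch(const std::vector<float>& a,
                       const std::vector<float>& b,
                       std::vector<float>& out) {
    assert(a.size() == b.size() && a.size() == out.size());
    const std::size_t n = a.size();
    const float* pa = a.data();
    const float* pb = b.data();
    float* po = out.data();

    // Ask the compiler to vectorize the loop instead of writing vector
    // intrinsics by hand. Compilers without OpenMP SIMD enabled ignore this.
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i) {
        // Only form prefetch addresses that lie inside the arrays.
        if (i + kPrefetchDistance < n) {
            __builtin_prefetch(pa + i + kPrefetchDistance, /*rw=*/0, /*locality=*/3);
            __builtin_prefetch(pb + i + kPrefetchDistance, /*rw=*/0, /*locality=*/3);
        }
        po[i] = pa[i] + pb[i];
    }
}
```

A real kernel would normally prefetch once per cache line rather than once per element; the per-element form is kept here only for brevity.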

 

Looking carefully at the assembly code generated for both versions, I see that the code for the loop body is nearly equivalent. Both versions seem to have the same optimizations applied, the SAME aligned/unaligned loads/stores, etc., BUT the order of the instructions is not the same.

Whereas in the auto-vectorized version ALL prefetch instructions have been interleaved with other instructions throughout the WHOLE loop body (this is just a short snippet):

        vprefetch0 2720(%r14,%r13,4)                            #300.15 c93
        vaddps    %zmm25, %zmm24, %zmm8                         #362.48 c97
        vprefetch0 1440(%r14,%r13,4)                            #301.15 c101
        vaddps    %zmm27, %zmm26, %zmm7                         #363.48 c105
        vprefetch0 160(%r14,%r13,4)                             #302.15 c109
        vaddps    %zmm29, %zmm28, %zmm6                         #364.48 c113
        vprefetch0 6560(%r14,%r13,4)  

in the hand-vectorized version only a few prefetch instructions have been interleaved, and the vast majority of them appear one after another at the beginning of the loop body:

        vmovaps   -3824(%rdx,%rsi,4), %zmm0                     #478.2861 c1
        vprefetch1 528(%rcx)                                    #459.21 c5
        vmovaps   -1264(%rdx,%rsi,4), %zmm30                    #478.2437 c9
        vprefetch0 144(%rcx)                                    #460.21 c13
        vmovaps   -2544(%rdx,%rsi,4), %zmm31                    #478.2649 c17
        vprefetch1 -752(%rcx)                                   #461.21 c21
        vaddps    3856(%rdx,%rsi,4), %zmm0, %zmm1               #478.2831 c25
        vprefetch1 -2032(%rcx)                                  #462.21 c29
        vaddps    1296(%rdx,%rsi,4), %zmm30, %zmm3              #478.2407 c33
        vprefetch1 -3312(%rcx)                                  #463.21 c37
        vaddps    2576(%rdx,%rsi,4), %zmm31, %zmm2              #478.2619 c41
        vprefetch1 -4592(%rcx)                                  #464.21 c45
        vmulps    %zmm11, %zmm1, %zmm1                          #478.34 c49
        vprefetch1 1808(%rcx)                                   #465.21 c53
        vfmadd213ps %zmm1, %zmm13, %zmm3                        #478.34 c57
        vprefetch1 3088(%rcx)                                   #466.21 c61
        vprefetch1 4368(%rcx)                                   #467.21 c65
        vprefetch1 5648(%rcx)                                   #468.21 c69
        vprefetch0 -1136(%rcx)                                  #469.21 c73
        vprefetch0 -2416(%rcx)                                  #470.21 c77
        vprefetch0 -3696(%rcx)                                  #471.21 c81
        vprefetch0 -4976(%rcx)                                  #472.21 c85
        vprefetch0 1424(%rcx)                                   #473.21 c89
        vprefetch0 2704(%rcx)                                   #474.21 c93

 

As a result, the auto-vectorized version runs significantly faster than the hand-coded version. When I reordered the prefetch instructions by hand in the assembly of the hand-coded version, it reached the same performance as the auto-vectorized version.
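The two orderings can be pictured at source level with a small sketch. This is only an illustration under stated assumptions: a 64-byte cache line, a hypothetical prefetch distance, hypothetical function names, and `__builtin_prefetch` (GCC/Clang) standing in for the prefetch intrinsic. The compiler's scheduler remains free to move these instructions, so source order shows the intent and does not guarantee the final assembly order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kLine = 16;          // floats per 64-byte line (assumed)
constexpr std::size_t kAhead = 4 * kLine;  // hypothetical prefetch distance

// Issue a read prefetch only for addresses inside [base, base + n).
inline void prefetch_if_valid(const float* base, std::size_t idx, std::size_t n) {
    if (idx < n) __builtin_prefetch(base + idx, /*rw=*/0, /*locality=*/3);
}

// Clustered: both prefetches are issued back to back, then all the arithmetic.
void add_clustered(const std::vector<float>& a, const std::vector<float>& b,
                   std::vector<float>& out) {
    assert(a.size() == b.size() && a.size() == out.size());
    const std::size_t n = a.size();
    for (std::size_t base = 0; base < n; base += kLine) {
        const std::size_t end = std::min(base + kLine, n);
        prefetch_if_valid(a.data(), base + kAhead, n);
        prefetch_if_valid(b.data(), base + kAhead, n);
        for (std::size_t i = base; i < end; ++i) out[i] = a[i] + b[i];
    }
}

// Interleaved: each prefetch is followed by part of the arithmetic.
void add_interleaved(const std::vector<float>& a, const std::vector<float>& b,
                     std::vector<float>& out) {
    assert(a.size() == b.size() && a.size() == out.size());
    const std::size_t n = a.size();
    for (std::size_t base = 0; base < n; base += kLine) {
        const std::size_t end = std::min(base + kLine, n);
        const std::size_t mid = std::min(base + kLine / 2, n);
        prefetch_if_valid(a.data(), base + kAhead, n);
        for (std::size_t i = base; i < mid; ++i) out[i] = a[i] + b[i];
        prefetch_if_valid(b.data(), base + kAhead, n);
        for (std::size_t i = mid; i < end; ++i) out[i] = a[i] + b[i];
    }
}
```

Both functions compute the same result; they differ only in where the prefetches sit relative to the arithmetic, which is the difference visible in the two assembly listings.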

Could you help me understand why this is happening? What could I do to get a similar instruction order generated automatically in the hand-vectorized version?

Thank you in advance!

Amanda_S_Intel

The autovectorizer is able to do a pretty good job of analyzing the loop and instruction scheduling. There is no such flexibility with intrinsics. Avoid intrinsics unless you can prove the compiler is not doing a good job. 

When using the prefetch intrinsics, it's best to turn off compiler prefetching (use option -no-qopt-prefetch or #pragma noprefetch). Please refer to the "Prefetching on Intel® MIC Architecture" link on this page: https://software.intel.com/en-us/articles/programming-and-compiling-for-intel-many-integrated-core-a...
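As a rough sketch of the per-loop form of this advice, the example below places `#pragma noprefetch` on a loop that issues its own prefetches. The function name and prefetch distance are hypothetical, `__builtin_prefetch` (GCC/Clang) is used as a stand-in for the prefetch intrinsic, and the pragma is guarded so that other compilers simply skip it.

```cpp
#include <cassert>
#include <cstddef>

// Multiply n floats in place, issuing hand-written prefetches for later elements.
void scale_in_place(float* p, std::size_t n, float factor) {
    assert(p != nullptr || n == 0);
    constexpr std::size_t kAhead = 64;  // hypothetical prefetch distance

#if defined(__INTEL_COMPILER)
    // Ask the Intel compiler not to add its own prefetches for this loop,
    // so that only the hand-written ones below remain.
    #pragma noprefetch
#endif
    for (std::size_t i = 0; i < n; ++i) {
        // Prefetch for write (rw = 1), staying inside the array bounds.
        if (i + kAhead < n) __builtin_prefetch(p + i + kAhead, /*rw=*/1, /*locality=*/3);
        p[i] *= factor;
    }
}
```

The pragma affects only the loop it precedes, whereas the command-line option applies to the whole compilation unit.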
