jan_v_
New Contributor I
130 Views

AVX512 suboptimal intrinsics compilation


 

I'm looking at the code the Intel compiler generates from AVX512 intrinsics
(latest Intel trial compiler, downloaded a few weeks ago).

I notice several strange things, to name a few:

 

1) Only ZMM0 to ZMM15 are used. Aren't there 32 registers to play with?

2) Computing _mm512_floor_ps(x) becomes a function call:

call      __svml_floorf16

Why?

3) Weird code like the following shows up all over the place. What on earth is this?

.B157.200::                     ; Preds .B157.200 .B157.250
                                ; Execution count [1.31e+000]
        vmovups   xmm0, XMMWORD PTR [-16+r13+rax]               ;641.19
        vmovups   XMMWORD PTR [3440+r13+rax], xmm0              ;641.19
        vmovups   xmm1, XMMWORD PTR [-32+r13+rax]               ;641.19
        vmovups   XMMWORD PTR [3424+r13+rax], xmm1              ;641.19
        vmovups   xmm2, XMMWORD PTR [-48+r13+rax]               ;641.19
        vmovups   XMMWORD PTR [3408+r13+rax], xmm2              ;641.19
        vmovups   xmm3, XMMWORD PTR [-64+r13+rax]               ;641.19
        vmovups   XMMWORD PTR [3392+r13+rax], xmm3              ;641.19
        sub       rax, 64                                       ;641.19
        jne       .B157.200     ; Prob 66%                      ;641.19

 

Is there any way to get better code out of the compiler?

And yes, this is compiled in release mode with compiler optimization enabled.

 

 

 


7 Replies
McCalpinJohn
Black Belt
130 Views

The code at the bottom looks like fairly ordinary spill code....

The compiler intrinsics for SIMD instructions don't control the compilation as precisely as one might expect.   I have had very good luck with simple loops with only a few intrinsics and no significant register pressure, but I have also seen the compiler generate some really bad code in cases where I have dozens of intrinsics and require the use of all (or almost all) of the registers. 

For more precise control, I usually compile a reference implementation in C to a ".s" file and manually modify the assembly language code there.   Sometimes this is easy, sometimes not....  

Since this is usually for testing rather than for production, I typically add compiler flags and pragmas to try to encourage the compiler to generate only one version of the loop I am interested in.  Examples are -fno-alias and #pragma vector aligned.

If the compiler generates multiple versions of the loop and I am not sure which one(s) get executed at run time, I will often "poison" the assembly code by replacing an instruction with something that will generate an obviously wrong answer.  For example, I will change a VADDPD to a VMULPD, then look to see if the results change.  If not, then I know that the "poisoned" version of the loop was not actually executed for my input parameters.
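
The "poisoning" trick is applied to the generated assembly, but the idea can be mimicked at source level. The following is only a sketch under that assumption: the two hand-written functions stand in for two compiler-generated loop versions, the 64-byte alignment check stands in for whatever runtime test the compiler emits, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the "aligned" loop version.
   POISONED on purpose: the real kernel would add, this one multiplies. */
static void add_aligned_version(double *dst, const double *a,
                                const double *b, size_t n) {
    for (size_t i = 0; i < n; ++i)
        dst[i] = a[i] * b[i];
}

/* Hypothetical stand-in for the generic loop version, left intact. */
static void add_generic_version(double *dst, const double *a,
                                const double *b, size_t n) {
    for (size_t i = 0; i < n; ++i)
        dst[i] = a[i] + b[i];
}

/* Runtime dispatch, similar in spirit to a multi-versioned loop. */
void vec_add(double *dst, const double *a, const double *b, size_t n) {
    assert(dst != NULL && a != NULL && b != NULL);
    int all_aligned =
        (((uintptr_t)dst | (uintptr_t)a | (uintptr_t)b) % 64) == 0;
    if (all_aligned)
        add_aligned_version(dst, a, b, n);
    else
        add_generic_version(dst, a, b, n);
}

/* Returns 1 if any result deviates from a + b, i.e. the poisoned
   version was the one that actually executed; returns 0 otherwise. */
int poisoned_path_was_taken(const double *dst, const double *a,
                            const double *b, size_t n) {
    for (size_t i = 0; i < n; ++i)
        if (dst[i] != a[i] + b[i])
            return 1;
    return 0;
}
```

Running the kernel on the real inputs and then calling poisoned_path_was_taken() tells you which version was selected. The inputs must be chosen so that the sum and the product actually differ.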

andysem
New Contributor III
130 Views

Judging by the fact that the disassembled part is a loop, it looks like this is an inlined and unrolled equivalent of memcpy. Nothing particularly wrong with this code, assuming the compiler has also generated a preamble to achieve target pointer alignment (or has a reason to believe the pointer is aligned already).

I can't comment on the other issues. Also, perhaps showing your code here would be helpful.

 

jan_v_
New Contributor I
130 Views

andysem wrote:

Judging by the fact that the disassembled part is a loop, it looks like this is an inlined and unrolled equivalent of memcpy. Nothing particularly wrong with this code, assuming the compiler has also generated a preamble to achieve target pointer alignment (or has a reason to believe the pointer is aligned already).

I can't comment on the other issues. Also, perhaps showing your code here would be helpful.

 

Indeed, issue no. 3 is a memory copy.

Assuming r13 is a stack pointer, it copies something from a lower part of the stack to a higher part of the stack. Quite unusual IMHO. Maybe it is trying to realign data on the stack...

R13 is initialized as follows, so it is indeed a stack pointer:

        lea       r13, QWORD PTR [127+rsp]                      ;544.1
        and       r13, -64                  

The realignment seems to be done on 64 byte multiples, and that's the size of an AVX512 register.
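
For readers less used to this idiom, the two instructions above can be written out in C. This is only an illustrative model of the address arithmetic (the function name is made up, and the integer argument stands in for the value of rsp), not something the compiler emits:

```c
#include <assert.h>
#include <stdint.h>

/* Models:  lea r13, [127+rsp]
            and r13, -64
   As a 64-bit mask, -64 is ~63, i.e. it clears the low six bits. */
uintptr_t aligned_frame_base(uintptr_t rsp) {
    uintptr_t r13 = (rsp + 127) & ~(uintptr_t)63;
    assert(r13 % 64 == 0);  /* always a multiple of 64 bytes */
    return r13;
}
```

The result is always 64-byte aligned and lies between rsp + 64 and rsp + 127, so r13 points to a 64-byte-aligned location slightly above the current stack pointer.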

I hope this can be fixed by some compiler setting. Does anybody have an idea?

(Compiling the same code using AVX2 intrinsics doesn't have this stack issue.)

 

Can somebody also comment on the use of only 16 of the 32 registers, and on the function call for a floor that should be a single instruction?
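
On the floor question, one possible workaround, independent of compiler settings, is to request the rounding directly with _mm512_roundscale_ps, which is an AVX-512F intrinsic. The sketch below assumes that route. The helper names are hypothetical, and the scalar fallback is only there so the file also builds without AVX-512 enabled. Whether the compiler then emits a single vrndscaleps should be checked in the assembly output.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

/* Scalar fallback; only valid for values that fit in an int. */
static float scalar_floor(float x) {
    float t = (float)(int)x;          /* truncates toward zero */
    return (t > x) ? t - 1.0f : t;    /* step down for negative fractions */
}

/* Floors n floats from src into dst; n must be a multiple of 16. */
void floor_f32(float *dst, const float *src, size_t n) {
    assert(n % 16 == 0);
#if defined(__AVX512F__)
    for (size_t i = 0; i < n; i += 16) {
        __m512 x = _mm512_loadu_ps(src + i);
        /* Round toward negative infinity, suppress precision exceptions. */
        __m512 f = _mm512_roundscale_ps(
            x, _MM_FROUND_TO_NEG_INF | _MM_FROUND_NO_EXC);
        _mm512_storeu_ps(dst + i, f);
    }
#else
    for (size_t i = 0; i < n; ++i)
        dst[i] = scalar_floor(src[i]);
#endif
}
```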

 

 

jan_v_
New Contributor I
130 Views

Here is another interesting code fragment. A memcpy of 3x64 bytes occurs on the stack from source to destination, immediately followed by a copy of the same data from the destination back to the source. Hopefully the free Microsoft compiler will perform better.

 

        mov       eax, 192                                      ;573.21
                                ; LOE rax r15 ebp edi zmm0 zmm1 zmm2 zmm11 zmm12
.B157.169::                     ; Preds .B157.169 .B157.5
                                ; Execution count [1.50e+000]
        vmovups   xmm3, XMMWORD PTR [-16+r13+rax]               ;573.21
        vmovups   XMMWORD PTR [176+r13+rax], xmm3               ;573.21
        vmovups   xmm4, XMMWORD PTR [-32+r13+rax]               ;573.21
        vmovups   XMMWORD PTR [160+r13+rax], xmm4               ;573.21
        vmovups   xmm5, XMMWORD PTR [-48+r13+rax]               ;573.21
        vmovups   XMMWORD PTR [144+r13+rax], xmm5               ;573.21
        vmovups   xmm6, XMMWORD PTR [-64+r13+rax]               ;573.21
        vmovups   XMMWORD PTR [128+r13+rax], xmm6               ;573.21
        sub       rax, 64                                       ;573.21
        jne       .B157.169     ; Prob 66%                      ;573.21
                                ; LOE rax r15 ebp edi zmm0 zmm1 zmm2 zmm11 zmm12
.B157.6::                       ; Preds .B157.169
                                ; Execution count [5.00e-001]
        mov       eax, 192                                      ;573.21
                                ; LOE rax r15 ebp edi zmm0 zmm1 zmm2 zmm11 zmm12
.B157.170::                     ; Preds .B157.170 .B157.6
                                ; Execution count [1.50e+000]
        vmovups   xmm3, XMMWORD PTR [176+r13+rax]               ;573.21
        vmovups   XMMWORD PTR [-16+r13+rax], xmm3               ;573.21
        vmovups   xmm4, XMMWORD PTR [160+r13+rax]               ;573.21
        vmovups   XMMWORD PTR [-32+r13+rax], xmm4               ;573.21
        vmovups   xmm5, XMMWORD PTR [144+r13+rax]               ;573.21
        vmovups   XMMWORD PTR [-48+r13+rax], xmm5               ;573.21
        vmovups   xmm6, XMMWORD PTR [128+r13+rax]               ;573.21
        vmovups   XMMWORD PTR [-64+r13+rax], xmm6               ;573.21
        sub       rax, 64                                       ;573.21
        jne       .B157.170     ; Prob 66%                      ;573.21
                                ; LOE rax r15 ebp edi zmm0 zmm1 zmm2 zmm11 zmm12
.B157.7::                       ; Preds .B157.170
                                ; Execution count [5.00e-001]
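
To make the round trip explicit, here is a small C model of the two loops above. It is a sketch, not the original source: base plays the role of r13, the loop variable mirrors rax, and each memcpy models one 16-byte vmovups load/store pair. The function names are hypothetical.

```c
#include <assert.h>
#include <string.h>

enum { CHUNK = 16, BLOCK = 192 };

/* Models loop .B157.169: copies bytes [0, 192) of base to [192, 384),
   walking downward with rax = 192, 128, 64. */
void copy_up(unsigned char *base) {
    assert(base != NULL);
    for (int rax = BLOCK; rax != 0; rax -= 64)
        for (int off = 16; off <= 64; off += 16)
            memcpy(base + 192 - off + rax, base - off + rax, CHUNK);
}

/* Models loop .B157.170: copies bytes [192, 384) back to [0, 192). */
void copy_back(unsigned char *base) {
    assert(base != NULL);
    for (int rax = BLOCK; rax != 0; rax -= 64)
        for (int off = 16; off <= 64; off += 16)
            memcpy(base - off + rax, base + 192 - off + rax, CHUNK);
}
```

In this model, calling copy_up() followed by copy_back() leaves the first 192 bytes exactly as they were, so the second loop is pure overhead if nothing touches the data in between.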

 

jan_v_
New Contributor I
131 Views

 

In case anyone else runs into these issues and wants to know what to do,
here is the answer I figured out myself.
The code generation options are kind of confusing, but if you set:

Intel Processor-Specific Optimizations       TO

Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) common for Intel(R) Xeon(R) and Intel(R) Xeon Phi(TM) processors (/QxCOMMON-AVX512)

 

Then all issues are solved and the code looks a million times better.

 


andysem
New Contributor III
130 Views

I believe, that way you're limiting the compiler to AVX-512F only.

jan_v_
New Contributor I
130 Views

andysem wrote:

I believe, that way you're limiting the compiler to AVX-512F only.

Also AVX-512CD, but that's sufficient for my purpose, and allows running on KNL too.
