Solved: AVX512 suboptimal intrinsics compilation

jan_v_ · ‎01-03-2017

I'm looking at the code the Intel compiler generates from AVX-512 intrinsics.
(latest Intel trial compiler downloaded a few weeks ago)

I notice several strange things, to name a few:

1) only ZMM0 to ZMM15 are used --> aren't there 32 registers to play with?

2) computing _mm512_floor_ps(x) becomes a function call

call __svml_floorf16 --> why?

3) weird code like the following all over the place --> what the hell is this?

.B157.200::                     ; Preds .B157.200 .B157.250
                                ; Execution count [1.31e+000]
        vmovups   xmm0, XMMWORD PTR [-16+r13+rax]               ;641.19
        vmovups   XMMWORD PTR [3440+r13+rax], xmm0              ;641.19
        vmovups   xmm1, XMMWORD PTR [-32+r13+rax]               ;641.19
        vmovups   XMMWORD PTR [3424+r13+rax], xmm1              ;641.19
        vmovups   xmm2, XMMWORD PTR [-48+r13+rax]               ;641.19
        vmovups   XMMWORD PTR [3408+r13+rax], xmm2              ;641.19
        vmovups   xmm3, XMMWORD PTR [-64+r13+rax]               ;641.19
        vmovups   XMMWORD PTR [3392+r13+rax], xmm3              ;641.19
        sub       rax, 64                                       ;641.19
        jne       .B157.200     ; Prob 66%                      ;641.19

Is there any way to get better code out of the compiler?

And yes, this is compiled in release mode with compiler optimization enabled.

McCalpinJohn · ‎01-04-2017

The code at the bottom looks like fairly ordinary spill code....

The compiler intrinsics for SIMD instructions don't control the compilation as precisely as one might expect. I have had very good luck with simple loops with only a few intrinsics and no significant register pressure, but I have also seen the compiler generate some really bad code in cases where I have dozens of intrinsics and require the use of all (or almost all) of the registers.

For more precise control, I usually compile a reference implementation in C to a ".s" file and manually modify the assembly language code there. Sometimes this is easy, sometimes not....

Since this is usually for testing rather than for production, I typically add compiler flags and pragmas to try to encourage the compiler to generate only one version of the loop I am interested in. Examples are -fno-alias and #pragma vector aligned.

If the compiler generates multiple versions of the loop and I am not sure which one(s) get executed at run time, I will often "poison" the assembly code by replacing an instruction with something that will generate an obviously wrong answer. For example, I will change a VADDPD to a VMULPD, then look to see if the results change. If not, then I know that the "poisoned" version of the loop was not actually executed for my input parameters.

andysem · ‎01-04-2017

Judging by the fact that the disassembled part is a loop, it looks like this is an inlined and unrolled equivalent of memcpy. Nothing particularly wrong with this code, assuming the compiler has also generated a preamble to achieve target pointer alignment (or has a reason to believe the pointer is aligned already).

I can't comment on the other issues. Also, perhaps showing your code here would be helpful.

jan_v_ · ‎01-04-2017

andysem wrote:

Judging by the fact that the disassembled part is a loop, it looks like this is an inlined and unrolled equivalent of memcpy. Nothing particularly wrong with this code, assuming the compiler has also generated a preamble to achieve target pointer alignment (or has a reason to believe the pointer is aligned already).

I can't comment on the other issues. Also, perhaps showing your code here would be helpful.

Indeed, issue 3 is a memory copy.

Assuming r13 is a stack pointer, it copies something from a lower part of the stack to a higher part of the stack. Quite unusual IMHO. Maybe it tries to realign data on the stack...

R13 is initialized as follows, so it is indeed a stack pointer:

lea r13, QWORD PTR [127+rsp] ;544.1
and r13, -64

The realignment seems to be done to a multiple of 64 bytes, which is the size of an AVX-512 register.

I hope this can be fixed by some compiler setting. Does anybody have an idea?

(Compiling the same code using AVX2 intrinsics doesn't have this stack issue.)

Can somebody also comment on the use of only 16 out of 32 registers, and on the function call for a simple single-instruction floor?

jan_v_ · ‎01-05-2017

Here is another interesting code fragment. A memcpy of 3x64 bytes on the stack from source to destination is immediately followed by a copy of the same data from the destination back to the source. Hopefully the free Microsoft compiler will perform better.

        mov       eax, 192                                      ;573.21
                                ; LOE rax r15 ebp edi zmm0 zmm1 zmm2 zmm11 zmm12
.B157.169::                     ; Preds .B157.169 .B157.5
                                ; Execution count [1.50e+000]
        vmovups   xmm3, XMMWORD PTR [-16+r13+rax]               ;573.21
        vmovups   XMMWORD PTR [176+r13+rax], xmm3               ;573.21
        vmovups   xmm4, XMMWORD PTR [-32+r13+rax]               ;573.21
        vmovups   XMMWORD PTR [160+r13+rax], xmm4               ;573.21
        vmovups   xmm5, XMMWORD PTR [-48+r13+rax]               ;573.21
        vmovups   XMMWORD PTR [144+r13+rax], xmm5               ;573.21
        vmovups   xmm6, XMMWORD PTR [-64+r13+rax]               ;573.21
        vmovups   XMMWORD PTR [128+r13+rax], xmm6               ;573.21
        sub       rax, 64                                       ;573.21
        jne       .B157.169     ; Prob 66%                      ;573.21
                                ; LOE rax r15 ebp edi zmm0 zmm1 zmm2 zmm11 zmm12
.B157.6::                       ; Preds .B157.169
                                ; Execution count [5.00e-001]
        mov       eax, 192                                      ;573.21
                                ; LOE rax r15 ebp edi zmm0 zmm1 zmm2 zmm11 zmm12
.B157.170::                     ; Preds .B157.170 .B157.6
                                ; Execution count [1.50e+000]
        vmovups   xmm3, XMMWORD PTR [176+r13+rax]               ;573.21
        vmovups   XMMWORD PTR [-16+r13+rax], xmm3               ;573.21
        vmovups   xmm4, XMMWORD PTR [160+r13+rax]               ;573.21
        vmovups   XMMWORD PTR [-32+r13+rax], xmm4               ;573.21
        vmovups   xmm5, XMMWORD PTR [144+r13+rax]               ;573.21
        vmovups   XMMWORD PTR [-48+r13+rax], xmm5               ;573.21
        vmovups   xmm6, XMMWORD PTR [128+r13+rax]               ;573.21
        vmovups   XMMWORD PTR [-64+r13+rax], xmm6               ;573.21
        sub       rax, 64                                       ;573.21
        jne       .B157.170     ; Prob 66%                      ;573.21
                                ; LOE rax r15 ebp edi zmm0 zmm1 zmm2 zmm11 zmm12
.B157.7::                       ; Preds .B157.170
                                ; Execution count [5.00e-001]

jan_v_ · ‎01-06-2017

In case anyone else runs into these issues and wants to know what to do, here is the answer I figured out myself.
The code generation options are rather confusing, but if you set

Intel Processor-Specific Optimizations

to

Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) common for Intel(R) Xeon(R) and Intel(R) Xeon Phi(TM) processors (/QxCOMMON-AVX512)

then all of these issues are solved and the code looks a million times better.

andysem · ‎01-07-2017

I believe, that way you're limiting the compiler to AVX-512F only.

jan_v_ · ‎01-07-2017

andysem wrote:

I believe, that way you're limiting the compiler to AVX-512F only.

AVX-512CD is included as well. That's sufficient for my purpose, and it allows running on KNL too.