I'm looking at the code the Intel compiler generates from AVX-512 intrinsics
(latest Intel trial compiler, downloaded a few weeks ago).
I notice several strange things, to name a few:
1) Only ZMM0 to ZMM15 are used --> aren't there 32 registers to play with?
2) _mm512_floor_ps(x) is compiled into a function call:
call __svml_floorf16 --> why?
3) Weird code like the following shows up all over the place --> what the hell is this?
.B157.200:: ; Preds .B157.200 .B157.250
; Execution count [1.31e+000]
vmovups xmm0, XMMWORD PTR [-16+r13+rax] ;641.19
vmovups XMMWORD PTR [3440+r13+rax], xmm0 ;641.19
vmovups xmm1, XMMWORD PTR [-32+r13+rax] ;641.19
vmovups XMMWORD PTR [3424+r13+rax], xmm1 ;641.19
vmovups xmm2, XMMWORD PTR [-48+r13+rax] ;641.19
vmovups XMMWORD PTR [3408+r13+rax], xmm2 ;641.19
vmovups xmm3, XMMWORD PTR [-64+r13+rax] ;641.19
vmovups XMMWORD PTR [3392+r13+rax], xmm3 ;641.19
sub rax, 64 ;641.19
jne .B157.200 ; Prob 66% ;641.19
Is there any way to get better code out of the compiler?
And yes, this is compiled in release mode with compiler optimizations enabled.
- Tags:
- Intel® Advanced Vector Extensions (Intel® AVX)
- Intel® Streaming SIMD Extensions
- Parallel Computing
The code at the bottom looks like fairly ordinary spill code....
The compiler intrinsics for SIMD instructions don't control the compilation as precisely as one might expect. I have had very good luck with simple loops with only a few intrinsics and no significant register pressure, but I have also seen the compiler generate some really bad code in cases where I have dozens of intrinsics and require the use of all (or almost all) of the registers.
For more precise control, I usually compile a reference implementation in C to a ".s" file and manually modify the assembly language code there. Sometimes this is easy, sometimes not....
Since this is usually for testing rather than for production, I typically add compiler flags and pragmas to encourage the compiler to generate only one version of the loop I am interested in. Examples are -fno-alias and #pragma vector aligned.
If the compiler generates multiple versions of the loop and I am not sure which one(s) get executed at run time, I will often "poison" the assembly code by replacing an instruction with something that will generate an obviously wrong answer. For example, I will change a VADDPD to a VMULPD, then look to see if the results change. If not, then I know that the "poisoned" version of the loop was not actually executed for my input parameters.
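The "poisoning" idea can be tried at source level before touching any assembly. The following is only a sketch of that analogue, not the workflow described above: the names add_arrays and POISON are hypothetical, and restrict stands in for what -fno-alias would tell the compiler. Building once without and once with -DPOISON and comparing the program output shows whether this loop contributes to the result for a given input.

```c
#include <stddef.h>

/* Hypothetical reference kernel of the kind one might compile to a ".s" file.
   restrict promises the compiler that the three arrays do not overlap. */
void add_arrays(double *restrict out, const double *restrict a,
                const double *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
#ifdef POISON
        /* Deliberately wrong: source-level stand-in for changing
           VADDPD to VMULPD in the generated assembly. */
        out[i] = a[i] * b[i];
#else
        out[i] = a[i] + b[i];
#endif
    }
}
```

If the results are identical with and without POISON, this loop was not executed for that input. Note that this does not distinguish between multiple compiler-generated versions of the same loop; that still requires editing the assembly.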
Judging by the fact that the disassembled part is a loop, it looks like this is an inlined and unrolled equivalent of memcpy. Nothing particularly wrong with this code, assuming the compiler has also generated a preamble to achieve target pointer alignment (or has a reason to believe the pointer is aligned already).
I can't comment on the other issues. Also, perhaps showing your code here would be helpful.
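To make the memcpy reading concrete, here is a rough C equivalent of the loop from the first post. It is a sketch under assumptions: the function name copy_down_16x4 is made up, and dst and src stand in for the two stack regions (in the listing they are 3456 bytes apart, since 3440 = 3456 - 16). Like the assembly, it walks from the highest offset down to zero, 64 bytes per iteration, using four 16-byte moves.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper mirroring the structure of the generated loop:
   "off" plays the role of rax, and each memcpy of 16 bytes corresponds
   to one vmovups load/store pair through an xmm register. */
void copy_down_16x4(unsigned char *dst, const unsigned char *src, size_t n)
{
    /* The assembly tests the counter only after the first pass (sub/jne),
       so it relies on a nonzero multiple of 64. */
    assert(n != 0 && n % 64 == 0);

    size_t off = n;
    do {
        memcpy(dst + off - 16, src + off - 16, 16);
        memcpy(dst + off - 32, src + off - 32, 16);
        memcpy(dst + off - 48, src + off - 48, 16);
        memcpy(dst + off - 64, src + off - 64, 16);
        off -= 64;
    } while (off != 0);
}
```

The net effect is the same as a single memcpy(dst, src, n) for non-overlapping regions.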
andysem wrote:
Judging by the fact that the disassembled part is a loop, it looks like this is an inlined and unrolled equivalent of memcpy. Nothing particularly wrong with this code, assuming the compiler has also generated a preamble to achieve target pointer alignment (or has a reason to believe the pointer is aligned already).
I can't comment on the other issues. Also, perhaps showing your code here would be helpful.
Indeed, issue number 3 is a memory copy.
Assuming r13 is a stack pointer, it copies something from a lower part of the stack to a higher part of the stack. Quite unusual IMHO. Maybe it tries to realign data on the stack...
R13 is initialized as follows, so it is indeed a stack pointer:
lea r13, QWORD PTR [127+rsp] ;544.1
and r13, -64
The realignment seems to be done to a multiple of 64 bytes, which is the size of an AVX-512 register.
I hope this can be fixed by some compiler setting. Does anybody have an idea?
(Compiling the same code using AVX2 intrinsics doesn't have this stack issue.)
Can somebody also comment on using only 16 of the 32 registers, and on using a call for a simple single-instruction floor?
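The arithmetic behind the lea/and pair quoted in this post can be written out in C. This is only an illustration of the address computation; the function name align_like_r13 is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors:  lea r13, [127+rsp]
             and r13, -64
   -64 in 64-bit two's complement has the low six bits clear,
   which is the same mask as ~63. */
uintptr_t align_like_r13(uintptr_t sp)
{
    uintptr_t r = (sp + 127) & ~(uintptr_t)63;
    assert(r % 64 == 0);   /* result is always 64-byte aligned */
    return r;
}
```

Because the offset is 127 rather than the 63 that a plain round-up would need, the result always lies between sp + 64 and sp + 127, so at least 64 bytes are left between rsp and the aligned base.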
Here is another interesting code fragment. A memcpy of 3x64 bytes occurs on the stack from source to destination, immediately followed by a copy of the same data from the destination back to the source. Hopefully the free Microsoft compiler will perform better.
mov eax, 192 ;573.21
; LOE rax r15 ebp edi zmm0 zmm1 zmm2 zmm11 zmm12
.B157.169:: ; Preds .B157.169 .B157.5
; Execution count [1.50e+000]
vmovups xmm3, XMMWORD PTR [-16+r13+rax] ;573.21
vmovups XMMWORD PTR [176+r13+rax], xmm3 ;573.21
vmovups xmm4, XMMWORD PTR [-32+r13+rax] ;573.21
vmovups XMMWORD PTR [160+r13+rax], xmm4 ;573.21
vmovups xmm5, XMMWORD PTR [-48+r13+rax] ;573.21
vmovups XMMWORD PTR [144+r13+rax], xmm5 ;573.21
vmovups xmm6, XMMWORD PTR [-64+r13+rax] ;573.21
vmovups XMMWORD PTR [128+r13+rax], xmm6 ;573.21
sub rax, 64 ;573.21
jne .B157.169 ; Prob 66% ;573.21
; LOE rax r15 ebp edi zmm0 zmm1 zmm2 zmm11 zmm12
.B157.6:: ; Preds .B157.169
; Execution count [5.00e-001]
mov eax, 192 ;573.21
; LOE rax r15 ebp edi zmm0 zmm1 zmm2 zmm11 zmm12
.B157.170:: ; Preds .B157.170 .B157.6
; Execution count [1.50e+000]
vmovups xmm3, XMMWORD PTR [176+r13+rax] ;573.21
vmovups XMMWORD PTR [-16+r13+rax], xmm3 ;573.21
vmovups xmm4, XMMWORD PTR [160+r13+rax] ;573.21
vmovups XMMWORD PTR [-32+r13+rax], xmm4 ;573.21
vmovups xmm5, XMMWORD PTR [144+r13+rax] ;573.21
vmovups XMMWORD PTR [-48+r13+rax], xmm5 ;573.21
vmovups xmm6, XMMWORD PTR [128+r13+rax] ;573.21
vmovups XMMWORD PTR [-64+r13+rax], xmm6 ;573.21
sub rax, 64 ;573.21
jne .B157.170 ; Prob 66% ;573.21
; LOE rax r15 ebp edi zmm0 zmm1 zmm2 zmm11 zmm12
.B157.7:: ; Preds .B157.170
; Execution count [5.00e-001]
In case anyone else runs into these issues and wants to know what to do, here is the answer I figured out myself.
The code generation options are kind of confusing, but if you set
Intel Processor-Specific Optimizations
to
Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) common for Intel(R) Xeon(R) and Intel(R) Xeon Phi(TM) processors (/QxCOMMON-AVX512)
then all issues are solved and the code looks a million times better.
I believe, that way you're limiting the compiler to AVX-512F only.
andysem wrote:
I believe, that way you're limiting the compiler to AVX-512F only.
Also AVX-512CD, but that's sufficient for my purpose, and allows running on KNL too.
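One way to check which AVX-512 subsets a given option actually enables is to look at the compiler's predefined macros. The sketch below assumes the compiler defines the usual __AVX512F__ style macros for the selected target, which GCC, Clang and the Intel compilers do as far as I know; the function name and the bit values are made up for illustration.

```c
#include <stdio.h>

/* Hypothetical bit assignments, one per AVX-512 subset. */
enum { HAS_F = 1, HAS_CD = 2, HAS_BW = 4, HAS_DQ = 8, HAS_VL = 16 };

/* Returns a bit mask of the AVX-512 subsets this translation unit
   was compiled for, based on predefined macros only. */
int avx512_feature_mask(void)
{
    int mask = 0;
#ifdef __AVX512F__
    mask |= HAS_F;
#endif
#ifdef __AVX512CD__
    mask |= HAS_CD;
#endif
#ifdef __AVX512BW__
    mask |= HAS_BW;
#endif
#ifdef __AVX512DQ__
    mask |= HAS_DQ;
#endif
#ifdef __AVX512VL__
    mask |= HAS_VL;
#endif
    return mask;
}

/* Prints the mask so two builds with different options can be compared. */
void print_avx512_features(void)
{
    printf("AVX-512 feature mask: 0x%02x\n", avx512_feature_mask());
}
```

This reports what the compiler was told to target at build time, not what the CPU running the program supports.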
