I'm looking at the compilation results: what the Intel compiler generates from AVX-512 intrinsics.
(latest Intel trial compiler, downloaded a few weeks ago)
There are several strange things I notice; to name a few:
1) Only ZMM0 to ZMM15 are used --> aren't there 32 registers to play with?
2) A call to _mm512_floor_ps(x) becomes call __svml_floorf16 --> why??? (a minimal reproducer is sketched at the end of this post)
3) Weird code all over the place, like the following --> what the hell is this?
.B157.200:: ; Preds .B157.200 .B157.250
; Execution count [1.31e+000]
vmovups xmm0, XMMWORD PTR [-16+r13+rax] ;641.19
vmovups XMMWORD PTR [3440+r13+rax], xmm0 ;641.19
vmovups xmm1, XMMWORD PTR [-32+r13+rax] ;641.19
vmovups XMMWORD PTR [3424+r13+rax], xmm1 ;641.19
vmovups xmm2, XMMWORD PTR [-48+r13+rax] ;641.19
vmovups XMMWORD PTR [3408+r13+rax], xmm2 ;641.19
vmovups xmm3, XMMWORD PTR [-64+r13+rax] ;641.19
vmovups XMMWORD PTR [3392+r13+rax], xmm3 ;641.19
sub rax, 64 ;641.19
jne .B157.200 ; Prob 66% ;641.19
Is there any way to get better code out of the compiler?
And yes, this is compiled in release mode with compiler optimizations enabled.
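For issue 2, here is a minimal sketch (a hypothetical standalone function, not the original source) that makes the behavior easy to inspect in the generated assembly:
#include <immintrin.h>
/* Hypothetical reproducer: without an AVX-512 code generation switch
   the compiler may lower the intrinsic to an SVML library call
   (call __svml_floorf16); with /QxCOMMON-AVX512 it should instead
   emit a single vrndscaleps instruction. */
__m512 floor16(__m512 x)
{
    return _mm512_floor_ps(x);
}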
The code at the bottom looks like fairly ordinary spill code....
The compiler intrinsics for SIMD instructions don't control the compilation as precisely as one might expect. I have had very good luck with simple loops with only a few intrinsics and no significant register pressure, but I have also seen the compiler generate some really bad code in cases where I have dozens of intrinsics and require the use of all (or almost all) of the registers.
For more precise control, I usually compile a reference implementation in C to a ".s" file and manually modify the assembly language code there. Sometimes this is easy, sometimes not....
Since this is usually for testing rather than for production, I typically add compiler flags and pragmas to try to encourage the compiler to generate only one version of the loop I am interested in. Examples are -fno-alias and #pragma vector aligned (a sketch is given at the end of this post).
If the compiler generates multiple versions of the loop and I am not sure which one(s) get executed at run time, I will often "poison" the assembly code by replacing an instruction with something that will generate an obviously wrong answer. For example, I will change a VADDPD to a VMULPD, then look to see if the results change. If not, then I know that the "poisoned" version of the loop was not actually executed for my input parameters.
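A minimal sketch of that setup (illustrative function and loop body, not from the original post, assuming the Intel C compiler):
/* Compile with e.g. icc -O2 -fno-alias, or rely on the restrict
   qualifiers below, so the compiler generates a single loop version. */
void scale(float * restrict dst, const float * restrict src, int n)
{
    /* Promise 64-byte-aligned pointers so no unaligned/peeled
       variants of the loop are emitted. */
    #pragma vector aligned
    for (int i = 0; i < n; i++)
        dst[i] = 2.0f * src[i];
}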
Judging by the fact that the disassembled part is a loop, it looks like this is an inlined and unrolled equivalent of memcpy. Nothing particularly wrong with this code, assuming the compiler has also generated a preamble to achieve target pointer alignment (or has a reason to believe the pointer is aligned already).
I can't comment on the other issues. Also, perhaps showing your code here would be helpful.
andysem wrote:
Judging by the fact that the disassembled part is a loop, it looks like this is an inlined and unrolled equivalent of memcpy. Nothing particularly wrong with this code, assuming the compiler has also generated a preamble to achieve target pointer alignment (or has a reason to believe the pointer is aligned already).
I can't comment on the other issues. Also, perhaps showing your code here would be helpful.
Indeed, issue no. 3 is a memory copy.
Assuming r13 is a stack pointer, it copies something from a lower part of the stack to a higher part of the stack. Quite unusual, IMHO. Maybe it tries to realign data on the stack...
R13 is initialized as follows, so indeed it is a stack pointer:
lea r13, QWORD PTR [127+rsp] ;544.1
and r13, -64
The realignment is done in 64-byte multiples, which is the size of an AVX-512 register.
I hope this can be fixed by some compiler setting; does anybody have an idea? (See the sketch at the end of this post for a guess at the cause.)
(Compiling the same code with AVX2 intrinsics doesn't have this stack issue.)
Can somebody also comment on using only 16 out of 32 registers, and on using a call for a simple single-instruction floor?
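As a hedged guess at what triggers that prologue (illustrative code, not the original source): any 64-byte-aligned local, such as a __m512 temporary spilled to the stack, forces the compiler to over-allocate the frame and mask a pointer down to a 64-byte boundary, exactly like the lea/and pair quoted above. For example:
#include <immintrin.h>
/* Hypothetical example: __m512 locals require 64-byte stack alignment,
   which the compiler typically establishes with a lea/and sequence
   like the one shown above. */
void copy64(float *out, const float *in)
{
    __m512 tmp[4];                          /* 64-byte-aligned locals */
    for (int i = 0; i < 4; i++)
        tmp[i] = _mm512_loadu_ps(in + 16 * i);
    for (int i = 0; i < 4; i++)
        _mm512_storeu_ps(out + 16 * i, tmp[i]);
}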
Here's another interesting code fragment. A memcpy of 3x64 bytes occurs on the stack from source to destination, and immediately afterwards the same data is copied from the destination back to the source. Hopefully the free Microsoft compiler will perform better.
mov eax, 192 ;573.21
; LOE rax r15 ebp edi zmm0 zmm1 zmm2 zmm11 zmm12
.B157.169:: ; Preds .B157.169 .B157.5
; Execution count [1.50e+000]
vmovups xmm3, XMMWORD PTR [-16+r13+rax] ;573.21
vmovups XMMWORD PTR [176+r13+rax], xmm3 ;573.21
vmovups xmm4, XMMWORD PTR [-32+r13+rax] ;573.21
vmovups XMMWORD PTR [160+r13+rax], xmm4 ;573.21
vmovups xmm5, XMMWORD PTR [-48+r13+rax] ;573.21
vmovups XMMWORD PTR [144+r13+rax], xmm5 ;573.21
vmovups xmm6, XMMWORD PTR [-64+r13+rax] ;573.21
vmovups XMMWORD PTR [128+r13+rax], xmm6 ;573.21
sub rax, 64 ;573.21
jne .B157.169 ; Prob 66% ;573.21
; LOE rax r15 ebp edi zmm0 zmm1 zmm2 zmm11 zmm12
.B157.6:: ; Preds .B157.169
; Execution count [5.00e-001]
mov eax, 192 ;573.21
; LOE rax r15 ebp edi zmm0 zmm1 zmm2 zmm11 zmm12
.B157.170:: ; Preds .B157.170 .B157.6
; Execution count [1.50e+000]
vmovups xmm3, XMMWORD PTR [176+r13+rax] ;573.21
vmovups XMMWORD PTR [-16+r13+rax], xmm3 ;573.21
vmovups xmm4, XMMWORD PTR [160+r13+rax] ;573.21
vmovups XMMWORD PTR [-32+r13+rax], xmm4 ;573.21
vmovups xmm5, XMMWORD PTR [144+r13+rax] ;573.21
vmovups XMMWORD PTR [-48+r13+rax], xmm5 ;573.21
vmovups xmm6, XMMWORD PTR [128+r13+rax] ;573.21
vmovups XMMWORD PTR [-64+r13+rax], xmm6 ;573.21
sub rax, 64 ;573.21
jne .B157.170 ; Prob 66% ;573.21
; LOE rax r15 ebp edi zmm0 zmm1 zmm2 zmm11 zmm12
.B157.7:: ; Preds .B157.170
; Execution count [5.00e-001]
In case anyone else is running into these issues and wants to know what to do, here is the answer I figured out myself.
The code generation options are kind of confusing, but if you set
Intel Processor-Specific Optimizations
to
Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) common for Intel(R) Xeon(R) and Intel(R) Xeon Phi(TM) processors (/QxCOMMON-AVX512)
then all issues are solved and the code looks a million times better.
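For reference, the command-line equivalents of that IDE setting (inferred from the switch shown above; check your compiler version's documentation):
icl /O2 /QxCOMMON-AVX512 source.cpp   (Windows)
icc -O2 -xCOMMON-AVX512 source.cpp    (Linux)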
I believe that way you're limiting the compiler to AVX-512F only.
andysem wrote:
I believe that way you're limiting the compiler to AVX-512F only.
Also AVX-512CD, but that's sufficient for my purpose, and allows running on KNL too.