Michael_H_3
Beginner
134 Views

KNL - Compiler clearing dest. register for gather/scatter

Hello,

I discovered that icpc (17.01), when compiling with -xmic-avx512 (KNL), clears the destination register before the corresponding gather instruction by inserting
vpxord %zmm2, %zmm2, %zmm2
although I am using a non-masked gather (the mask is all ones), which implies that the entire register will be written.

In Jeffers and Sodani's book on KNL programming, the authors show a similar line in Fig. 6.32, but unfortunately they give only an unsatisfying comment on it: "Clearing the contents of the zmm1 registers for Gather/Scatter Operation".

Can someone please explain to me the reasoning behind the introduction of this line?

Regards, 
Michael

Appendix:
Full Loop in ASM:

..B1.7: # Preds ..B1.7 ..B1.6
    # Execution count [5.00e+00]

    vpxord      %zmm2, %zmm2, %zmm2                           #18.12 c1
    kxnorw      %k0, %k0, %k1                                 #18.12 c1
    addl        $1, %eax                                      #17.2 c1
    kxnorw      %k0, %k0, %k2                                 #20.3 c3
    vgatherdpd   (%r12,%ymm0,8), %zmm2{%k1}                   #18.12 c3
    vaddpd      %zmm2, %zmm1, %zmm3                           #19.12 c9 stall 2
    vscatterdpd   %zmm3, (%r12,%ymm0,8){%k2}                  #20.3 c15 stall 2
    cmpl        $250, %eax                                    #17.2 c15
    jb          ..B1.7        # Prob 82%                      #17.2 c17


Actual C++ Code:
https://pastebin.com/wGpcFiGm

Compiled with:
icpc -std=c++11 -O3 -xmic-avx512 gather.cpp -o gather.out
icpc -std=c++11 -O3 -xmic-avx512 gather.cpp -o gather.asm -S -fverbose-asm

1 Solution
JJoha8
New Contributor I
134 Views

It's a dependency-breaking idiom inserted by the compiler to help the out-of-order engine detect there is no dependency on the destination register used by the gather instruction.

Note that the instruction will not negatively impact instruction throughput since it is handled in the in-order front end: The renamer detects zero idioms, discards the instruction, and allocates a 'fresh' physical register to the referenced architectural register; thus no dependency for the physical register will be detected in the OOO engine.
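
One way to see why the destination counts as an input at all: AVX-512 gathers merge into the destination under a mask, and the mask is a run-time value that is not known when the instruction is renamed. The scalar model below is only a sketch of that merge behaviour in plain C++; `gather_model`, `Vec8` and `Idx8` are hypothetical names made up for illustration, not intrinsics or anything from the original code.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Vec8 = std::array<double, 8>;
using Idx8 = std::array<std::int32_t, 8>;

// Hypothetical scalar model of an 8-lane vgatherdpd; not a real intrinsic.
// Lanes whose mask bit is 0 keep the OLD value of dst, so dst is an input
// as well as an output of the operation.
Vec8 gather_model(Vec8 dst, std::uint8_t mask, const Idx8& idx, const double* base) {
    assert(base != nullptr);
    for (int lane = 0; lane < 8; ++lane) {
        if (mask & (1u << lane)) {
            dst[lane] = base[idx[lane]];  // lane is overwritten
        }                                 // else: old dst[lane] survives
    }
    return dst;
}
```

With an all-ones mask every lane is overwritten and the old contents never matter, but that depends on the mask value. Zeroing the register first gives the gather a destination that does not depend on any earlier instruction.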


3 Replies
jimdempseyatthecove
Black Belt
134 Views

Maybe someone else can address the presence of vpxord. There are 3 other issues:

1) Why is the vpxord located inside the ..B1.7 loop?
2) Why are two mask registers (k1 k2) used instead of one?
3) Why is/are the one-ing of the mask register(s) located inside the ..B1.7 loop?

By the way, I do not see the index register (ymm0) .OR. base register (r12) advancing in the loop. Are you missing lines of code from your paste?

Jim Dempsey

Michael_H_3
Beginner
134 Views

Thanks for the reply, Jim.

3) Why is/are the one-ing of the mask register(s) located inside the ..B1.7 loop?
The one-ing has to be located inside the ..B1.7 loop because a gather/scatter instruction clears the mask bit of every element it has processed, so the mask is all zeros when the instruction completes. Therefore, the masks have to be set again in every iteration.

2) Why are two mask registers (k1 k2) used instead of one?
As mentioned in 3), gather/scatter instructions leave their mask cleared when they finish. Hence, when using a single mask, it has to be set again after every gather/scatter instruction. One could do that, but it introduces loop-carried dependencies between the gather/scatter instructions and their corresponding kxnor instructions.
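
A minimal sketch of the mask-clearing behaviour described in 2) and 3), again as a scalar model in plain C++. All names (`gather_clearing`, `scatter_clearing`, `iteration`) are hypothetical, an 8-bit mask stands in for the k registers, and `reuse_mask` exists only to show what would happen if the scatter reused the gather's mask without setting it again.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Vec8 = std::array<double, 8>;
using Idx8 = std::array<std::int32_t, 8>;

// Hypothetical scalar models of vgatherdpd / vscatterdpd (8 lanes).
// Each completed lane clears its mask bit, so the mask ends up as 0.
Vec8 gather_clearing(Vec8 dst, std::uint8_t& k, const Idx8& idx, const double* base) {
    assert(base != nullptr);
    for (int lane = 0; lane < 8; ++lane) {
        if (k & (1u << lane)) {
            dst[lane] = base[idx[lane]];
            k = static_cast<std::uint8_t>(k & ~(1u << lane));
        }
    }
    return dst;
}

void scatter_clearing(const Vec8& src, std::uint8_t& k, const Idx8& idx, double* base) {
    assert(base != nullptr);
    for (int lane = 0; lane < 8; ++lane) {
        if (k & (1u << lane)) {
            base[idx[lane]] = src[lane];
            k = static_cast<std::uint8_t>(k & ~(1u << lane));
        }
    }
}

// One iteration of a gather / add / scatter loop body.
void iteration(double* base, const Idx8& idx, const Vec8& addend, bool reuse_mask) {
    std::uint8_t k1 = 0xFF;                           // models kxnorw %k0, %k0, %k1
    std::uint8_t k2 = 0xFF;                           // models kxnorw %k0, %k0, %k2
    Vec8 v = gather_clearing(Vec8{}, k1, idx, base);  // Vec8{} models the zeroed register
    for (int lane = 0; lane < 8; ++lane) {
        v[lane] += addend[lane];                      // models vaddpd
    }
    // k1 is 0 here; a scatter that reuses it without re-setting writes nothing.
    std::uint8_t& ks = reuse_mask ? k1 : k2;
    scatter_clearing(v, ks, idx, base);
}
```

Calling `iteration` with `reuse_mask` set to false updates all eight elements, while `reuse_mask` set to true leaves memory untouched because the gather has already consumed the mask. In this model a second, independently initialised mask does not have to wait for the gather to finish.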


Are you missing lines of code from your paste?
No, I am not. The code does not make "sense" the way I presented it here. I just came up with a minimal working example that still produces assembly suitable for my question. You can see the original C++ code linked in my initial post.

Regards
Michael

 


 
