Hello,
I discovered that icpc (17.01), when targeting -xmic-avx512 (KNL), clears the destination register before the corresponding gather instruction by inserting
vpxord %zmm2, %zmm2, %zmm2
although I am using a non-masked gather, which implies that the entire register will be written.
In Jeffers' and Sodani's book about KNL programming, the authors show a similar line in fig. 6.32, but unfortunately they only give an unsatisfying comment on it: "Clearing the contents of the zmm1 registers for Gather/Scatter Operation".
Can someone please explain the reasoning behind this instruction?
Regards,
Michael
Appendix:
Full Loop in ASM:
..B1.7: # Preds ..B1.7 ..B1.6
# Execution count [5.00e+00]
vpxord %zmm2, %zmm2, %zmm2 #18.12 c1
kxnorw %k0, %k0, %k1 #18.12 c1
addl $1, %eax #17.2 c1
kxnorw %k0, %k0, %k2 #20.3 c3
vgatherdpd (%r12,%ymm0,8), %zmm2{%k1} #18.12 c3
vaddpd %zmm2, %zmm1, %zmm3 #19.12 c9 stall 2
vscatterdpd %zmm3, (%r12,%ymm0,8){%k2} #20.3 c15 stall 2
cmpl $250, %eax #17.2 c15
jb ..B1.7 # Prob 82% #17.2 c17
Actual C++ Code:
https://pastebin.com/wGpcFiGm
Compiled with:
icpc -std=c++11 -O3 -xmic-avx512 gather.cpp -o gather.out
icpc -std=c++11 -O3 -xmic-avx512 gather.cpp -o gather.asm -S -fverbose-asm
It's a dependency-breaking idiom inserted by the compiler to help the out-of-order engine detect there is no dependency on the destination register used by the gather instruction.
Note that the instruction will not negatively impact instruction throughput since it is handled in the in-order front end: The renamer detects zero idioms, discards the instruction, and allocates a 'fresh' physical register to the referenced architectural register; thus no dependency for the physical register will be detected in the OOO engine.
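The dependency exists because a masked gather merges into its destination: lanes whose mask bit is clear keep the old register contents. The mask is a run-time operand, so the core cannot know in advance that it is all ones. A minimal scalar sketch of that behaviour in C++ follows; `gather_model`, `Vec`, `Idx` and `kLanes` are hypothetical names chosen for illustration, not Intel intrinsics, and the model ignores faults and element ordering.

```cpp
#include <array>
#include <cstdint>

constexpr int kLanes = 8;  // eight doubles per zmm register
using Vec = std::array<double, kLanes>;
using Idx = std::array<std::int32_t, kLanes>;

// Hypothetical scalar model of a masked gather such as
//   vgatherdpd (base,idx,8), dst{mask}
// Lanes with a set mask bit are loaded from memory and their bit is cleared;
// lanes with a clear mask bit keep the OLD destination value.
Vec gather_model(const Vec& old_dst, const double* base, const Idx& idx,
                 std::uint8_t& mask) {
    Vec result = old_dst;  // the old destination is an input of the operation
    for (int lane = 0; lane < kLanes; ++lane) {
        const std::uint8_t bit = static_cast<std::uint8_t>(1u << lane);
        if (mask & bit) {
            result[lane] = base[idx[lane]];
            mask = static_cast<std::uint8_t>(mask & ~bit);  // bit cleared once loaded
        }
    }
    return result;
}
```

Because `old_dst` is read, the gather has to wait for whatever instruction last wrote that register. Zeroing the register first gives the gather a destination value that depends on nothing, even though an all-ones mask overwrites every lane anyway.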
Maybe someone else can address the presence of vpxord. There are three other issues:
1) Why is the vpxord located inside the ..B1.7 loop?
2) Why are two mask registers (k1 k2) used instead of one?
3) Why is/are the one-ing of the mask register(s) located inside the ..B1.7 loop?
By the way, I do not see the index register (ymm0) .OR. base register (r12) advancing in the loop. Are you missing lines of code from your paste?
Jim Dempsey
Thanks for the reply, Jim.
3) Why is/are the one-ing of the mask register(s) located inside the ..B1.7 loop?
The one-ing has to be inside the B1.7 loop because a gather/scatter instruction clears the mask bit of every element it has loaded or stored, so the mask is all zeros once the instruction completes. Therefore, the masks have to be set again in every iteration.
2) Why are two mask registers (k1 k2) used instead of one?
As mentioned in 3), clearing the mask is the last thing a gather/scatter instruction does. Hence, a single mask would have to be set again after every gather/scatter instruction. One could do that, but it introduces loop-carried dependencies between the gather/scatter instructions and their corresponding kxnor instructions.
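To illustrate the mask behaviour described in 2) and 3), here is a small scalar sketch of the loop in C++. It is only a model under stated assumptions: `gather_model`, `scatter_model` and `run_loop` are hypothetical helper names, the index vector is assumed to be 0..7, and the increment register (zmm1) is assumed to hold 1.0 in every lane.

```cpp
#include <array>
#include <cstdint>

constexpr int kLanes = 8;
using Vec = std::array<double, kLanes>;
using Idx = std::array<std::int32_t, kLanes>;

// Hypothetical model of vgatherdpd: load enabled lanes, clear their mask bits.
void gather_model(Vec& dst, const double* base, const Idx& idx, std::uint8_t& mask) {
    for (int lane = 0; lane < kLanes; ++lane) {
        const std::uint8_t bit = static_cast<std::uint8_t>(1u << lane);
        if (mask & bit) {
            dst[lane] = base[idx[lane]];
            mask = static_cast<std::uint8_t>(mask & ~bit);
        }
    }
}

// Hypothetical model of vscatterdpd: store enabled lanes, clear their mask bits.
void scatter_model(double* base, const Idx& idx, const Vec& src, std::uint8_t& mask) {
    for (int lane = 0; lane < kLanes; ++lane) {
        const std::uint8_t bit = static_cast<std::uint8_t>(1u << lane);
        if (mask & bit) {
            base[idx[lane]] = src[lane];
            mask = static_cast<std::uint8_t>(mask & ~bit);
        }
    }
}

// Model of the ..B1.7 loop: gather, add, scatter back to the same locations.
Vec run_loop(int iterations, bool reset_masks_each_iteration) {
    Vec data{};                                  // memory at (%r12), starts at 0.0
    const Idx idx = {0, 1, 2, 3, 4, 5, 6, 7};    // assumed contents of ymm0
    Vec increment;                               // assumed contents of zmm1
    increment.fill(1.0);
    Vec loaded{};                                // zmm2
    std::uint8_t k1 = 0xFF, k2 = 0xFF;
    for (int i = 0; i < iterations; ++i) {
        if (reset_masks_each_iteration) {
            k1 = 0xFF;                           // kxnorw %k0, %k0, %k1
            k2 = 0xFF;                           // kxnorw %k0, %k0, %k2
        }
        loaded.fill(0.0);                        // vpxord %zmm2, %zmm2, %zmm2
        gather_model(loaded, data.data(), idx, k1);
        Vec sum;
        for (int lane = 0; lane < kLanes; ++lane) {
            sum[lane] = loaded[lane] + increment[lane];  // vaddpd
        }
        scatter_model(data.data(), idx, sum, k2);
    }
    return data;
}
```

In this model, `run_loop(250, true)` accumulates in every iteration, whereas `run_loop(250, false)` only takes effect in the first iteration, because both masks are zero from then on and the gather and scatter touch no elements.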
Are you missing lines of code from your paste?
No, I am not. The code does not make "sense" the way I presented it here; I just came up with a minimal working example that still produces code relevant to my question. You can see the original C++ code in my initial post.
Regards
Michael