Hi,
I want to check the range of a vector of double-precision values, in order to branch to a slow path for the exceptional out-of-range cases. My code looks like the following:
// if(any(!(x < 4.) || (x < 2.))) { ... }
__mmask8 toobig   = _mm512_cmpnlt_pd_mask(x, _mm512_set1_pd(4.));
__mmask8 toosmall = _mm512_cmplt_pd_mask(x, _mm512_set1_pd(2.));
if(!_mm512_kortestz(toobig, toosmall)) {
    // do something with out-of-range numbers (slow path)
}
// do something with in-range numbers (fast path)
I expect this to map to a three-instruction sequence (two compares and a kortest). However, icc (13.1) seems to generate extra data movement and masking between the comparisons and the test:
### __mmask8 toobig = _mm512_cmpnlt_pd_mask(x, _mm512_set1_pd(4.));
vcmpnltpd k2, zmm0, QWORD PTR .L_2il0floatpacket.5[rip]{1to8} #20.23 c1
### __mmask8 toosmall = _mm512_cmplt_pd_mask(x, _mm512_set1_pd(2.));
vcmpltpd k3, zmm0, QWORD PTR .L_2il0floatpacket.6[rip]{1to8} #21.25 c5
kmov eax, k2 #20.23 c9
mov dl, dl #21.25 c9
kmov edx, k3 #21.25 c13
### if(!_mm512_kortestz(toobig, toosmall)) {
movzx eax, al #22.9 c13
movzx edx, dl #22.9 c17
kmov k0, eax #22.9 c17
kmov k1, edx #22.9 c21
kortest k0, k1 #22.9 c25
je ..B3.3 # Prob 50% #22.9 c25
It seems the compiler generates extra instructions to clear the high-order bits of the masks. As I understand it, vcmppd already clears the upper part of the destination mask, so the zero-extend instructions do not seem to serve any useful purpose. Since this code sits on the critical path leading to the branch, I would rather avoid the overhead.
I am attaching a self-contained repro case, compiled with: icpc -mmic -fsource-asm -masm=intel -S mmask8.cpp
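In case the attachment does not come through, the kernel boils down to something like the following (retyped by hand here, so it may not match the attached file exactly; the function name is only for illustration):
#include <immintrin.h>   // or <zmmintrin.h>, depending on the icc headers

// Returns nonzero if any lane of x falls outside [2., 4.) (NaNs count as out of range).
int any_out_of_range(__m512d x)
{
    __mmask8 toobig   = _mm512_cmpnlt_pd_mask(x, _mm512_set1_pd(4.));  // !(x < 4.)
    __mmask8 toosmall = _mm512_cmplt_pd_mask(x, _mm512_set1_pd(2.));   // x < 2.
    return !_mm512_kortestz(toobig, toosmall);   // nonzero iff the OR of the two masks has any bit set
}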
If I am not using the proper idiom, what is the recommended way to test __mmask8 variables?
Why not use:
if(!(toobig | toosmall)) {
    // fast path
} else {
    // alternate path
}
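Applied to your code, that would look something like this (untested sketch, reusing the variable names from your snippet):
__mmask8 toobig   = _mm512_cmpnlt_pd_mask(x, _mm512_set1_pd(4.));
__mmask8 toosmall = _mm512_cmplt_pd_mask(x, _mm512_set1_pd(2.));
if(!(toobig | toosmall)) {   // plain integer OR on the mask values instead of kortest
    // fast path: every lane is in range
} else {
    // slow path: at least one lane is out of range
}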
Jim Dempsey
Thanks Jim!
It gets slightly better following your suggestion: I still get the same copies to GPRs and zero-extensions, but at least I avoid the trip back to mask registers.
vcmpnltpd k1, zmm0, QWORD PTR .L_2il0floatpacket.16[rip]{1to8} #61.23 c1
vcmpltpd k2, zmm0, QWORD PTR .L_2il0floatpacket.17[rip]{1to8} #62.25 c5
kmov eax, k1 #61.23 c9
mov dl, dl #62.25 c9
kmov edx, k2 #62.25 c13
movzx eax, al #63.8 c13
movzx edx, dl #63.17 c17
or eax, edx #63.17 c21
je ..B6.3 # Prob 50% #63.17 c21
Still much room for improvement... (by the way, I am curious about what mov dl,dl is supposed to do)
The mov dl,dl is there because of a stall: apparently you cannot issue back-to-back kmovs.
Try "if(!((char)toobig | (char)toosmall))"
or a reinterpret_cast.
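e.g. something along these lines (untested; the exact spelling depends on how __mmask8 is declared):
if(!(reinterpret_cast<unsigned char&>(toobig) | reinterpret_cast<unsigned char&>(toosmall))) {
    // fast path
}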
Jim Dempsey
Thanks. Indeed, "mov dl,dl" seems to be there to prevent the second kmov from pairing with the first one, and avoid the dependency on k2. This might reduce the latency by pipelining the second vcmp*pd and the first kmov, so eax is available earlier. (I guess the c* comments at the end of the lines are the expected issue time in cycles within a basic block.)
No luck with attempts to cast to char. The code generated is still the same... Actually, __mmask8 seems to be a typedef for unsigned char (in zmmintrin.h).
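If I am reading the header right, the relevant declarations are just (quoting from memory, so the exact form may differ between header versions):
typedef unsigned char  __mmask8;
typedef unsigned short __mmask16;
so the (char) cast is a no-op as far as the type system is concerned.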
