AVX512-VBMI2: VPSHLDV masks its shift count, preventing use as a blend

Peter_Cordes · 12-09-2017

Is it too late to suggest a change to AVX512_VBMI2 for Ice Lake? (Regardless of that, I'm curious about the design decision.)

VPSHLDV (and the W / Q versions) would potentially have more uses (or save a blend instruction) if they allowed shift counts large enough to take the entire element from SRC2, instead of being limited to keeping at least one bit from the DST vector. The current definition in the ISA extensions programming reference (https://software.intel.com/content/dam/develop/external/us/en/documents-tps/architecture-instruction-set-extensions-programming-reference.pdf) is:

tmp ← concat(DEST.dword, SRC2.dword) << (tsrc3 & 31)

(Or & 15 for VPSHLDVW / VPSHRDVW, and & 63 for VPSHLDVQ / VPSHRDVQ.)
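To make the masking concrete, here is a rough scalar model of one dword lane under the definition quoted above. It is only a sketch: the function name shldv_d_masked is made up for illustration, and it assumes the lane result is the high dword of the shifted 64-bit concatenation (as with scalar SHLD).

```c
#include <stdint.h>

/* Hypothetical scalar model of one 32-bit lane of the D version as quoted
   above. Assumes the lane result is the high dword of the shifted value. */
static uint32_t shldv_d_masked(uint32_t dest, uint32_t src2, uint32_t count)
{
    uint64_t tmp = ((uint64_t)dest << 32) | src2;  /* concat(DEST, SRC2) */
    tmp <<= (count & 31);                          /* count masked to 0..31 */
    return (uint32_t)(tmp >> 32);                  /* keep the high dword */
}
```

In this model a count of 32 wraps to 0 and returns dest unchanged, so no count selects all of src2; the most you can get is 31 bits of it.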

This is inconsistent with regular vector shifts, which don't mask their count. For example, AVX2 / AVX512F VPSLLVD zeroes any element whose shift count is 32 or higher: vpcmpeqd xmm0, xmm0, xmm0 / vpsllvd xmm0, xmm0, xmm0 produces all-zeros. The same goes for MMX/SSE2/AVX/... (V)PSLLD.
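For comparison, a scalar sketch of how one dword lane of VPSLLVD treats its count (the helper name sllv_d is just for illustration):

```c
#include <stdint.h>

/* Scalar model of one 32-bit lane of VPSLLVD: counts of 32 or more
   zero the lane instead of wrapping around. */
static uint32_t sllv_d(uint32_t value, uint32_t count)
{
    /* Guard explicitly: shifting a uint32_t by >= 32 is undefined in C. */
    return (count > 31) ? 0u : (value << count);
}
```

This mirrors the vpcmpeqd / vpsllvd example: an all-ones lane shifted by an all-ones count comes out as zero.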

It is consistent with scalar integer SHLD, but arguably the vector version benefits more from having some elements able to produce SRC2, or even SRC2 left-shifted (but that would require a much wider barrel shifter).

I don't have any particular application in mind; maybe some applications benefit from the implicit masking and would otherwise need a VPANDD. I'm picturing a case where you have a constant vector of shift counts to get different windows for different elements, where for some elements it's useful to have a count of zero, and for others it's useful to have a count of 32. Maybe there aren't any real use cases like that, or few enough that you don't mind forcing them to use an extra blend instruction if it saves transistors implementing this instruction.
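A scalar sketch of that hypothetical use case, under the current masked semantics. The function name window_with_blend and the convention that a requested count of 32 or more means "take all of src2" are assumptions for illustration; the final select stands in for the extra blend instruction.

```c
#include <stddef.h>
#include <stdint.h>

/* Per-lane window selection with the count masked to 0..31.
   A requested count of 32 would wrap to 0, so it needs a separate
   select step, standing in for the extra blend instruction. */
static void window_with_blend(uint32_t *dst, const uint32_t *src2,
                              const uint32_t *count, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint64_t tmp = ((uint64_t)dst[i] << 32) | src2[i];
        uint32_t shifted = (uint32_t)((tmp << (count[i] & 31)) >> 32);
        dst[i] = (count[i] >= 32) ? src2[i] : shifted;  /* the extra blend */
    }
}
```

With counts {0, 8, 32}, the first lane keeps dst, the second takes a mixed window, and the third takes src2 whole, but only because of the explicit select.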

VPSHRDV, which gives you DEST.dword ← concat(SRC2.dword, DEST.dword) >> (tsrc3 & 31), has the same limitation: merge masking or a count of zero keeps the element of DEST, but no count takes the whole element from SRC2. (So for consistency, if VPSHLDV changes, then VPSHRDV should change, too, along with the non-V versions!)

Allowing an extra bit of shift count for W / D / Q would mean counts up to 127 for the Q version; that means a much wider barrel shifter, so it's probably not a viable option. Perhaps saturating the count at 16/32/64 could be implemented efficiently. (Earlier SSE/AVX shifts effectively saturate the count to 16/32/64, leaving the register or element = 0, so this behaviour would be consistent.)
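A scalar sketch of what the saturating idea would mean for one dword lane. This models a suggested behaviour, not anything the hardware does; the name shldv_d_saturating is hypothetical.

```c
#include <stdint.h>

/* Hypothetical variant of the D-version lane with the count saturated
   at 32 instead of masked with & 31. Not the documented behaviour. */
static uint32_t shldv_d_saturating(uint32_t dest, uint32_t src2, uint32_t count)
{
    if (count > 32)
        count = 32;                                /* saturate, don't wrap */
    uint64_t tmp = ((uint64_t)dest << 32) | src2;  /* concat(DEST, SRC2) */
    return (uint32_t)((tmp << count) >> 32);       /* high dword of result */
}
```

Here a count of 0 still returns dest, and any count of 32 or more returns src2 whole, which is the blend-like behaviour the masked definition rules out.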

If we're not so concerned about consistency, perhaps the Q version could saturate to 64, and so could the D and W versions. So the D and W versions could produce SRC2<<n, but the Q version could only produce SRC2<<0. Or maybe that doesn't work for the elements that are using the top part of a 128-bit barrel shifter which normally can't go that far.
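One possible reading of that last idea for a dword lane, again as a scalar sketch of a hypothetical behaviour (the name shldv_d_sat64 is made up): counts from 32 to 63 yield SRC2 shifted left by count - 32, and 64 or more shifts everything out.

```c
#include <stdint.h>

/* Hypothetical D-version lane with the count saturated at 64.
   Counts 32..63 give src2 << (count - 32); 64 and up give 0. */
static uint32_t shldv_d_sat64(uint32_t dest, uint32_t src2, uint32_t count)
{
    if (count >= 64)
        return 0;  /* all bits shifted out; also avoids an undefined C shift */
    uint64_t tmp = ((uint64_t)dest << 32) | src2;  /* concat(DEST, SRC2) */
    return (uint32_t)((tmp << count) >> 32);       /* high dword of result */
}
```

For example, a count of 40 returns src2 << 8 in this model. The Q version has no equivalent here, since it would need a 128-bit intermediate shifted by up to 128.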