Re: Shift with VPSLLQ by 63 or 64 bits results in all zeroes

Jones__Brian · 09-07-2020

I have ymm27 filled with four double-precision floats: 73.0 41.0 26.0 83.0. I want to shift the second qword (41.0) into the first position (where 73.0 is now). I tried both VPSRLQ and VPSLLQ as follows:

mov rax,2                      ; k1 = 0b0010
kmovq k1,rax
vpxor ymm1,ymm1,ymm1           ; explicit three-operand form
VPSRLQ ymm1{k1}{z},ymm27,63

OR

vpxor ymm1,ymm1,ymm1
VPSRLQ ymm1{k1}{z},ymm27,64

or the same two tests with VPSLLQ, so I tried both 63 and 64 bits with each instruction. In all four cases, every qword element of ymm1 is zero.

I set the mask register (k1) to 2 (00000010) and also 64 (01000000) with the second bit set (because I want to move the second qword in ymm27), but both give the same results.

The Intel documentation on this instruction is sparse with no coding examples, and there is very little discussion of this instruction in other online sources.

I expected that shifting by 63 or 64 bits would move a whole 64-bit value into the neighboring position, but that's not what happens.

My misunderstanding may be that this instruction shifts the bits in place -- within the same position in the ymm register -- and that it does not shift the bits to the next position left or right in the ymm register. But that's not clear from the documentation.

How can I use these instructions (or any others) to shift the second position qword in ymm27 (127:64) to the first position (63:0)?

McCalpinJohn · 09-08-2020

Those shift instructions operate *within* the "word size" specified -- not across the full SIMD register.
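To make the "within each element" point concrete, here is a rough scalar C model of what VPSRLQ with zero-masking does to a 256-bit register. It is only a sketch of the semantics, not the instruction itself: `bits_of` and `model_vpsrlq_z` are names made up for illustration, and the model assumes the usual AVX-512 write-mask behavior, where mask bit i controls destination element i.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bit pattern of a double, which is all an integer shift ever sees. */
static uint64_t bits_of(double d) {
    uint64_t u;
    memcpy(&u, &d, sizeof u);
    return u;
}

/* Hypothetical scalar model of: VPSRLQ dst{k}{z}, src, count
   on a ymm register viewed as four independent 64-bit elements. */
static void model_vpsrlq_z(uint64_t dst[4], const uint64_t src[4],
                           unsigned mask, unsigned count) {
    assert(dst != NULL && src != NULL);
    for (int i = 0; i < 4; i++) {
        /* Each element is shifted on its own; bits that fall off the
           end are discarded, they never enter the neighboring element.
           A count above 63 clears the element entirely. */
        uint64_t shifted = (count > 63) ? 0 : (src[i] >> count);
        /* Zero-masking: elements whose mask bit is 0 become zero. */
        dst[i] = ((mask >> i) & 1u) ? shifted : 0;
    }
}
```

With the values from the question (73.0 in element 0, 41.0 in element 1), a mask of 2 and a count of 63 or 64, every element of the result is zero. Elements 0, 2 and 3 are zeroed by the mask, and element 1 is the bit pattern of 41.0 shifted until only the sign bit (0 for a positive number) or nothing at all remains. Nothing moves from one element to another.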

There are a lot of instructions for moving data across "lanes" of the SIMD registers. Some of these are limited to working within 128-bit-aligned "lanes", and others allow motion anywhere in the register. The reason for this difference is that all of the SIMD functional units can move data within 128-bit-aligned lanes, but only the functional unit on "port 5" (or the load/store units) can move data across 128-bit-aligned lanes.

In your specific case, it looks like the VSHUFPD instruction will do what you want. Note that the description says that it is limited to selecting which value is chosen within each 128-bit lane. For more general cross-lane copy operations, there are a ridiculous number of "permute" instructions available -- especially with the AVX-512 instruction set. Some (like AVX2 VPERMPD) use "immediate operands", so the "swizzle" must be known at compile time. Others (like AVX2 VPERMPS) get the indices from a ymm register, so the "swizzle" can be computed at run-time. (It is not uncommon for the compiler to use instructions for "quadwords" or "packed singles" on "packed double" data -- these instructions don't look at the bits, so as long as you remember to copy both halves of each packed double element to the right place, everything works fine.)
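As a rough illustration of the two kinds of instruction mentioned above, here are scalar C models of VSHUFPD (in-lane, with zero-masking) and of the immediate form of VPERMPD (cross-lane). This is a sketch of the selection rules as I understand them from the instruction descriptions, not tested on hardware; `model_vshufpd_z` and `model_vpermpd` are made-up names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar model of: VSHUFPD dst{k}{z}, a, b, imm8
   on four doubles. Each 128-bit lane (elements 0-1 and 2-3) is
   handled separately: the even result element is picked from a,
   the odd one from b, and one imm8 bit per result element chooses
   the low or high double of that same lane. */
static void model_vshufpd_z(double dst[4], const double a[4],
                            const double b[4],
                            unsigned mask, unsigned imm8) {
    assert(dst != NULL && a != NULL && b != NULL);
    for (int lane = 0; lane < 2; lane++) {
        int base = 2 * lane;
        double lo = a[base + ((imm8 >> base) & 1u)];
        double hi = b[base + ((imm8 >> (base + 1)) & 1u)];
        /* Zero-masking: result elements with a 0 mask bit are cleared. */
        dst[base]     = ((mask >> base) & 1u) ? lo : 0.0;
        dst[base + 1] = ((mask >> (base + 1)) & 1u) ? hi : 0.0;
    }
}

/* Hypothetical scalar model of: VPERMPD dst, src, imm8.
   Each 2-bit field of imm8 selects any of the four source doubles,
   so data is free to cross the 128-bit lane boundary. */
static void model_vpermpd(double dst[4], const double src[4],
                          unsigned imm8) {
    assert(dst != NULL && src != NULL);
    double tmp[4];                      /* allows dst and src to alias */
    for (int i = 0; i < 4; i++)
        tmp[i] = src[(imm8 >> (2 * i)) & 3u];
    for (int i = 0; i < 4; i++)
        dst[i] = tmp[i];
}
```

For the register in the question, `model_vshufpd_z(out, v, v, 1, 1)` puts 41.0 in element 0 and zeroes the rest, which should correspond to something like `vshufpd ymm1{k1}{z},ymm27,ymm27,1` with k1 = 1. `model_vpermpd(out, v, 0xE5)` copies element 1 into element 0 and leaves elements 1 to 3 as they were, which should correspond to `vpermpd ymm1,ymm27,0xE5`. Since ymm27 is only reachable with the EVEX encoding, both forms presumably need AVX-512VL; please check the exact operand forms against the manual before relying on them.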