Integer SIMD instructions weirdness -- needs fixing - Page 2

levicki · ‎02-03-2011

I hope that Intel engineers are going to read this, and improve integer SIMD instructions on future CPUs.

1. PSRLB/PSLLB/PSRAB/PSLAB -- they do not exist. How about adding them?

2. All SIMD shifts -- it is not possible to shift each packed byte/word/dword/qword by different amount.
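For item 1, the usual workaround is a word shift followed by a mask. The following is only a minimal sketch in C with SSE2 intrinsics; `srl_epi8` is a hypothetical helper name, not an existing intrinsic, and it assumes a shift count of 0..7.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: logical right shift of 16 packed bytes by n (0..7).
   Emulates the missing PSRLB with PSRLW plus a mask. */
static __m128i srl_epi8(__m128i v, int n)
{
    /* Shift as 8 words. The low n bits of each high byte leak into the
       top n bits of the neighbouring low byte. */
    __m128i shifted = _mm_srl_epi16(v, _mm_cvtsi32_si128(n));
    /* Clear the top n bits of every byte to remove the leaked bits. */
    __m128i mask = _mm_set1_epi8((char)(0xFF >> n));
    return _mm_and_si128(shifted, mask);
}
```

Note that `_mm_srl_epi16` takes its count in an XMM register, which is exactly the PSRLW xmm, xmm form discussed in item 2.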

Regarding #2:

- Why did you use another SIMD register as the count parameter for those instructions if we cannot specify more than one shift value? You could have simply used a GPR for the count parameter!

PSRLW xmm0, eax would be more useful than PSRLW xmm0, xmm2, not to mention that xmm2 could then hold 8 different shift counts, one for each word in the destination.

Because of someone's mistake, we will now never have a proper SIMD shift instruction -- the behavior of PSRLW xmm0, xmm2 can never be changed. If we want a more useful shift we will need another instruction, and the current one will stay forever as dead weight in the x86 instruction set.
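Until such an instruction exists for words, a per-word shift can be approximated with a multiply. The sketch below is an illustration only, in C with SSE2 intrinsics: `srlv_epi16` is a hypothetical helper name, counts are assumed to be in the range 0..15, and the multiplier table is built in scalar code for clarity. (AVX2 later added per-element shifts such as VPSLLVD for dwords and qwords, but not for words.)

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: logical right shift of each of 8 packed words by its
   own count (each count assumed to be 0..15). */
static __m128i srlv_epi16(__m128i v, const uint8_t counts[8])
{
    uint16_t mul[8];
    for (int i = 0; i < 8; i++) {
        /* 2^(16-n); for n == 0 this is 0x10000, which truncates to 0. */
        mul[i] = (uint16_t)(1u << (16 - counts[i]));
    }
    __m128i m = _mm_loadu_si128((const __m128i *)mul);

    /* High half of the unsigned product: (v * 2^(16-n)) >> 16 == v >> n. */
    __m128i shifted = _mm_mulhi_epu16(v, m);

    /* Lanes with count 0 have multiplier 0 and produced 0 above,
       so pass the original value through for those lanes. */
    __m128i is_zero_count = _mm_cmpeq_epi16(m, _mm_setzero_si128());
    return _mm_or_si128(shifted, _mm_and_si128(is_zero_count, v));
}
```

In real code the multipliers would normally be precomputed once rather than rebuilt on every call.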

3. SIMD bit manipulation (PBSETB/W/D/Q, PBCLRB/W/D/Q, PBTSTB/W/D/Q) -- it would be nice to have an instruction which can set, clear, or test a different bit in each packed byte, word, dword, or qword.

Example:

xmm0 = 0 0 0 0 (dword)
xmm1 = 4 3 2 1 (dword)

PBSETD xmm0, xmm1

xmm0 = 0x10, 0x08, 0x04, 0x02 (dword)
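The proposed PBSETD semantics from the example above can be approximated on SSE2. This is a rough sketch in C with intrinsics, not a real instruction or intrinsic: `pbsetd` is a hypothetical helper name, bit indices are assumed to be in the range 0..30, and the `1 << n` per dword is built with the float-exponent trick instead of a true variable shift.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: for each dword lane, set bit idx[i] in dst[i].
   Bit indices are assumed to be 0..30. */
static __m128i pbsetd(__m128i dst, __m128i idx)
{
    /* Build the IEEE-754 single-precision bit pattern of 2^n:
       biased exponent (n + 127) in bits 30..23, zero mantissa. */
    __m128i exp_bits = _mm_slli_epi32(_mm_add_epi32(idx, _mm_set1_epi32(127)), 23);
    /* Convert 2^n from float to integer, giving 1 << n in each lane. */
    __m128i bit = _mm_cvttps_epi32(_mm_castsi128_ps(exp_bits));
    return _mm_or_si128(dst, bit);
}
```

With xmm0 = 0 and xmm1 = 4 3 2 1 as in the example, this yields 0x10, 0x08, 0x04, 0x02. Clearing or testing bits would use PANDN or PAND with the same per-lane bit instead of POR.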

To be continued.

levicki · ‎02-19-2011

Thomas,

Thanks for improving the code further, now I am curious to test its performance.

Also, many thanks for the link, a lot of nice tricks to learn.

jimdempseyatthecove · ‎02-26-2011

>>PSRLW xmm0, eax would be more useful than PSRLW xmm0, xmm2, not to mention that you could use xmm2 to have 8 different shift counts, one for each word in the destination.

I fully agree. Add

PSRLW xmm0, imm8

etc for L/R B/W/D/Q/QQ (QQ would use ymm...)

XMM registers are too scarce a resource to be wasted on an 8-bit shift count.
Different shift values are interesting:

PSRL? xmm0,xmm1

where ? = B/W/D/Q/DQ/QQ (QQ would use ymm...)

and where the ? in the 2nd register contains each of the counts.
I am not sure how frequently the differing counts would appear in code.

Also, I agree with you that there should be orthogonality. We have an instruction to move the sign bits of (b/w/d/q/qq) to a GP register; there should also be one to move them the other way. Some thought should be given as to whether 0x80 or 0xFF gets moved in for bytes. I think 0xFF would align with the masks generated by the compares.

Jim Dempsey

levicki · ‎02-26-2011

Jim, there is already PSRLW xmm, imm8 if I am not mistaken :)

Different shift counts would enable really interesting bit manipulations, especially if a horizontal OR instruction were added. Perhaps even better if we had a packed bit shuffle instruction.

Regarding masks, I prefer 0x80.

Thomas_W_Intel · ‎03-04-2011

Quoting Igor Levicki

Regarding masks, I prefer 0x80.

Can you elaborate a little on why this is the case? My personal choice would be a mask. I can also understand if someone wants to negate or copy the sign, but I cannot see why you would like to set only the sign bit of an integer.

levicki · ‎03-04-2011

Thomas,

Please read the previous post by Jim Dempsey for context regarding 0x80 or 0xFF.

The whole idea of being able to move sign bits from a GPR to a SIMD register is to enable bit packing/unpacking, which is needed, for example, for planar to packed (a.k.a. chunky) conversion.

For example, PMOVMSKB can be used for packed to planar conversion because it can move the sign bits of 8 successive bytes from a SIMD register to a single byte in a GPR, which can then be stored to memory as 8 successive bits of bitplane 7. Then you shift the whole SIMD register left by 1 bit and repeat PMOVMSKB, storing the next result into bitplane 6, etc.
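The packed to planar loop described above can be sketched in C with SSE2 intrinsics. This is a minimal illustration for 8 pixels, not production code: `packed_to_planar8` is a hypothetical name, and the bit order inside each plane byte (pixel 0 in the least significant bit) is simply what PMOVMSKB produces, which may not match what a given image format expects.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: split 8 packed 8-bit pixels into 8 bitplanes.
   Bit i of planes[p] is bit p of pixels[i]. */
static void packed_to_planar8(const uint8_t pixels[8], uint8_t planes[8])
{
    /* Load 8 bytes into the low half; the upper 8 bytes are zeroed. */
    __m128i v = _mm_loadl_epi64((const __m128i *)pixels);

    for (int plane = 7; plane >= 0; plane--) {
        /* PMOVMSKB: gather the current top bit of every byte. */
        planes[plane] = (uint8_t)_mm_movemask_epi8(v);
        /* v + v shifts every byte left by 1 without crossing byte borders. */
        v = _mm_add_epi8(v, v);
    }
}
```

The byte-wise add stands in for the "shift whole register left by 1 bit" step; since only the top bit of each byte is read, bits crossing byte boundaries would not matter either way.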

The reverse would take a byte from plane 0 in memory into a GPR, move it to the sign bits of a SIMD register using an instruction that does the opposite of PMOVMSKB, then shift the whole SIMD register right by 1 bit, and repeat with a byte from plane 1 and so on until you get to plane 7. In the end you would have 8 bytes in the SIMD register which you could write out to your packed image buffer.
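Since no inverse of PMOVMSKB exists, the planar to packed direction has to be emulated. The sketch below, in C with SSE2 intrinsics, uses a different technique from the one described above: each plane byte is broadcast, expanded to per-byte 0xFF/0x00 masks with AND plus compare, and then reduced to the plane's bit and ORed in. `planar_to_packed8` is a hypothetical name, and the bit order is assumed to be pixel 0 in the least significant bit of each plane byte.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: rebuild 8 packed 8-bit pixels from 8 bitplanes.
   Bit i of planes[p] becomes bit p of pixels[i]. */
static void planar_to_packed8(const uint8_t planes[8], uint8_t pixels[8])
{
    /* Byte i selects bit i (the pattern repeats in the unused upper half). */
    const __m128i select = _mm_set_epi8(
        (char)0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01,
        (char)0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01);
    __m128i acc = _mm_setzero_si128();

    for (int plane = 0; plane < 8; plane++) {
        /* Broadcast the plane byte, then test a different bit in each byte:
           the result is 0xFF where the pixel has this plane's bit set. */
        __m128i bits = _mm_set1_epi8((char)planes[plane]);
        __m128i mask = _mm_cmpeq_epi8(_mm_and_si128(bits, select), select);
        /* Keep only this plane's bit position and merge it into the pixels. */
        __m128i plane_bit = _mm_set1_epi8((char)(1u << plane));
        acc = _mm_or_si128(acc, _mm_and_si128(mask, plane_bit));
    }
    /* Write out the low 8 bytes as the packed pixels. */
    _mm_storel_epi64((__m128i *)pixels, acc);
}
```

The intermediate `mask` is the 0xFF form from the discussion above; ANDing it with 0x80 instead of `plane_bit` would give the sign-bit-only form.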

Bitplane conversion (packed to planar and vice versa) is used in some image compression algorithms.

Thomas_W_Intel · ‎03-09-2011

Thanks a lot for the explanation. Now I understand.

levicki · ‎03-14-2011

You are welcome, Thomas.