Integer SIMD instructions weirdness -- needs fixing

levicki · ‎02-03-2011

I hope that Intel engineers are going to read this, and improve integer SIMD instructions on future CPUs.

1. PSRLB/PSLLB/PSRAB/PSLAB -- they do not exist. How about adding them?

2. All SIMD shifts -- it is not possible to shift each packed byte/word/dword/qword by different amount.

Regarding #2:

- Why did you make another SIMD register as a count parameter for those instructions if we cannot specify more than one shift value?!? You could have simply used GPR for the count parameter!

PSRLW xmm0, eax would be more usefull than PSRLW xmm0, xmm2, not to mention that you could use xmm2 to have 8 different shift counts, one for each word in the destination.

Because of someone's mistake, now we will never have proper SIMD shift instruction -- behavior of PSRLW xmm0, xmm2 can never be changed. If we want more usefull shift we will need another instruction, and the current one will stay forever as a dead weight in the x86 instruction set.

3. SIMD bit manipulation (PBSETB/W/D/Q, PBCLRB/W/D/Q, PBTSTB/W/D/Q) -- it would be nice to have an instruction which can set, clear, or test different bit in each packed byte, word, dword, or qword.

Example:

xmm0 = 0 0 0 0 (dword)
xmm1 = 4 3 2 1 (dword)

PBSETD xmm0, xmm1

xmm0 = 0x10, 0x08, 0x04, 0x02 (dword)

To be continued.

Matthias_Kretz · ‎02-03-2011

Fully agreed. SSE obviously was never designed for general purpose use, but to accelerate specific workloads. Then, as an afterthought, more generality was added. And now we have a mess. :(

AVX seemed like a good new start. But so far I don't have the feeling it will help at all - rather make everything even more complicated.

About the SIMD shifts: it took me more than a day, IIRC, to figure out why my shifts were not working. Until I realized that PSRLW xmm0, xmm1 behaves like you said. That was not clear from the documentation and completely against any consistency expectations of the SSE API. (I was hoping for a fast release of LRBni hardware back then. Those dreams were shattered...)

levicki · ‎02-03-2011

Yes, and it is still easy to misunderstand the PSRLW documentation today!

As I said, I will continue to list things I consider drawbacks in this thread:

4. PBLENDVB

This instruction would be much more usefull if you could specify byte positions where you want each byte loaded instead of just load / don't load flag for each byte.

It would be even more usefull if it did not require 16-byte alignment and if it did not touch memory beyond what you specify in the mask register.

Example:

LSB MSB
esi = 55 AA C3 90 15 FF FF FF FF FF FF FF FF FF FF FF (FF means unreadable location)
xmm1 = empty

MSB LSB
xmm0 = 80 80 80 80 80 80 80 80 80 80 80 07 06 05 04 03 (80 means do not read byte from source, 0-15 means put into that byte position in destination)

PBLENDVB xmm1, xmmword ptr [esi]

MSB LSB
xmm1 = 00 00 00 00 00 00 00 00 55 AA C3 90 15 00 00 00

5. PINSRB/PINSRW/PINSRD

Insert and extract instructions accept only immediate parameter -- this should have been a GPR!

Rationale:
If you need to read 5 bytes from memry into SIMD register and you are at the end of valid memory page, the only option is to MOVZX into GPR and then PINSRB five times. If you need variable byte count, you need very complex and slow code for that because PINSRB does not accept GPR as position parameter which would make it possible to at least write a loop instead of using comparisons and branches.

0xr · ‎02-03-2011

Igor,

SIMD shifts and many other integer SIMD issues are solved in the upcoming XOP instruction set. It would be great to have it in Intel chips too... (contrary to enlarging the mess by another incompatible extension)

Cryptographer · ‎02-03-2011

Igor, your assumption

If we want more useful shift we will need another instruction, and the current one will stay forever as a dead weight in the x86 instruction set.

isnot strictly consequential. By clearing some CPUID flags, and defining a reserved bit for a better usage of these op-codes, Intel could correct that without breaking backward compatibility. Even more, for an intermediate processor generationa flag in a status word could switch between thedepreciated features and the corrected ones.

But I am not sure whether a processor manufacturerwould do something which might taste like admitting a mistake.

levicki · ‎02-04-2011

Once the instruction is defined, its behavior must not change or the old code will break.

Old code has no way of knowing about your proposed new flags, and it expects the instruction to work in the same way it did when the code was written.

Thus, the only way to change some instruction behavior is to introduce new instruction encoding -- in other words, a new instruction.

For example, if you had:

PSRLW xmm0, xmm2

Where xmm2 contains a single shift count in the low order word, and you wanted to have:

PSRLW xmm0, xmm2

Where xmm2 contains 8 different shift counts (one in each word) instead, it would be impossible to keep the same instruction name and encoding without breaking existing software.

That is why I said that even if they introduce such new instruction, the old one will remain as a dead weight forever. Just look what happened with LAHF in 64-bit mode, and you will understand what I am talking about.

I am amazed that Intel CPU engineers can afford themselves the luxury of designing instructions without thoroughly thinking through what would be more usefull to developers.

In case of PSRLW, it is blindingly obvious that:

1. PSRLW xmmdst imm8 (where imm8 is a shift count value from 0 to 15)
2. PSRLW xmmdst, reg32 (where reg32 holds a single shift count value in a low word)
3. PSRLW xmmdst, xmmsrc/mem128 (where xmmsrc/mem128 hold 8 different shift count values)

Was the way to go, and while Intel can still introduce PSRLW xmmdst, reg32, they cannot change the behavior of PSRLW xmmdst, xmmsrc/mem128 without breaking existing code.

sandmann · ‎02-04-2011

While we are complaining about missing integer instructions, how about better support for bytes? Specifically, I am missing pmullb, pmullhb, psllb, psrlb. There are already word and dword versions of these, so it doesn't seem a stretch to have byte versions.

The motivation for these instructions is this computation:

(a * b + 0x7f) / 255

which is common when blending pixels with 8 bit channels. The division with 255 can be done like this: (t = (a * b) + 0x80, (t + (t >> 8)) >> 8), but for that to work with current SIMD instruction sets, you have to unpack the inputs to 16 bits which means quite a bit of arithmetic and register pollution.

If there were pmul{lh}b and psrlb instructions, you could do this instead:

[cpp]    // input uint8_t a, b

    uint8_t hi, lo, t;

    hi = ((a * b) & 0xff00) >> 8;        // pmulhb
    lo = ((a * b) & 0x00ff);             // pmullb

    t = lo >> 7;                         // psrlb
hi += t;                             // padd 

t = hi ^ 0x7f // pxor
    
    if ((int8_t)lo > (int8_t)t)         // pcmpgt
        lo = 0xff;
    else
        lo = 0x00;

return hi - lo;    // psub

[/cpp]

This code is only seven instructions for sixteen operations, and, importantly, has only three live variables at any time.

It can easily be verified that it is equivalent to (a * b + 0x7f) / 255 for all inputs.

levicki · ‎02-05-2011

I agree that those instructions would also be usefull too, especially byte shifts.

levicki · ‎02-16-2011

We have PMOVMSKB. It compacts the sign bits of individual bytes to a GPR.

It would be nice if we had opposite operation as well -- move bits from GPR to sign bits of individual bytes in SIMD register. This would enable:

- parallel bit processing/testing
- more flexible blending with PBLENDVB

bronxzv · ‎02-16-2011

itlooks like something you can do easily(for MMX code) with a PAND 0x7f7f7f7f7f7f7f7f + a POR with a256 x 64-bit look up table (or just a MOVQ from the LUT for a lot of useful codes)

levicki · ‎02-16-2011

Yes but PMOVMSKB works with all 16 bytes so there are 16 sign bits that get packed into a GPR.

Therefore, you would need to be able to do the same in reverse (16 bits from GPR to 16 bytes' sign bits in SIMD). It is possible but impractical -- hardware implementation would certainly be faster.

Finally, I don't see how only 8 LUT entries would be enough?

bronxzv · ‎02-16-2011

you are right, I asnwered way too fast sorry, I had in mind a 256-entry LUT (mixed up with the 8-bit index), Ihave kindafixed the post now

now indeed, 16-bit indices will lead to a monster 1MB LUT which is impractical

levicki · ‎02-16-2011

Actually, there is a way to do it with only 2 KB LUT -- you make 256 entries, 8-byte per entry, LUT, and look it up twice (once with high and once with low byte).

Still, I think that even 2KB LUT is large enough to compete for L1 cache with the rest of the algorithm.

Of course, you can go down further and work on four nibbles thus reducing LUT size to 16 entries, 4-byte per entry (64 bytes LUT), but that considerably increases code complexity both in number of arithmetic/logic operations and number of memory accesses required to accomplish something hardware could (and should) do with a single instruction.

Thomas_W_Intel · ‎02-16-2011

Igor,

you can avoid thelook-up table altogetherbyusing a combination of pshufb and pcmpeqb:

__m128i v = _mm_shuffle_epi8(input, _mm_set_epi8(1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0));

const __m128imask = _mm_set_epi8(0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01, 0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01);
v = _mm_and_si128(v, mask);
v = _mm_cmpeq_epi8(v, mask);

First, the 16 bits are broadcasted across the 16 bytes with the shuffle instruction. Then the relevant bit is masked out with "and" for each byte. Finally, all the bits are set with the comparison.

Kind regards
Thomas

levicki · ‎02-16-2011

Thomas,

I want those 16 bits from EAX unpacked to sign bit of corresponding byte in the SIMD register. I only need sign bits to be set/cleared. Any ideas on how to do that?

Thomas_W_Intel · ‎02-17-2011

Igor,

You mean something like this:

[cpp]unsigned short signs;
__m128i values, v;
v = _mm_insert_epi16(v, signs, 0);
v = _mm_shuffle_epi8(input, _mm_set_epi8(1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0));
const __m128i mask = _mm_set_epi8(0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01, 0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01);
v = _mm_and_si128(v, mask);
v = _mm_cmpeq_epi8(v, mask);
v = _mm_and_si128(v, _mm_set1_epi8(128));
values = _mm_and_si128(values, _mm_set1_epi8(255));
values = _mm_or_si128(values, v);
[/cpp]

I'm more fluent in intrinsics than assembler, but I assume that the translation is easy for you. Unfortunately, I don't no any more efficient way than the and-and-or sequence for blending the bits. It goes without saying that constants should be precomputed and kept in memory or registers. If the variable "signs" comes from memory and is part of a loop, you can unroll the loop and directly load into v.By using different shuffles you can then reduce the loads and avoiding the insertions.

By the way, I've learned this trick to use cmpeq to propagate bits accross a byte from Nick, so credit is due to him.

Kind regards
Thomas

levicki · ‎02-17-2011

Thomas,

Something like that, yes. However, in line 9 you need to have 0x7F mask, not 0xFF.

Other option would be to insert bits into the register which is already cleared -- in that case, and 0x7F can be left out.

Still this code requires 4 constants which you must keep in a register which does not leave a lot of space for other things and it is not exactly trivial. Having a hardware instruction that does this would be great.

Thomas_W_Intel · ‎02-18-2011

Igor,

you are right about the mask in line 9. Thanks for catching that.

I was not trying to argue that the proposed code completely solves your problem, but it was the best I could think of.

Kind regards
Thomas

levicki · ‎02-18-2011

Thomas,

No problem, thanks for teaching me a new trick :)

I intend to keep posting here my own ideas for new potentially usefull CPU instructions. If others want to contribute they are welcome.

bronxzv · ‎02-18-2011

If others want to contribute they are welcome

I will welcomea new branch prediction hint : RANDOMLY_TAKEN

It will be useful for branches that are pathologically hard to predict (high branch prediction miss rate) and when running 2 threadson a core (or 4 threads on future x86 targets like MIC)

The idea is that whena branch tagged with such a hint is encountered,its thread stall until the branch is resolved (if other threads are running on the same core), letting the other thread(s) do useful work instead of wasting resources (and power) executing nearly half the time in pure waste the wrong path

Even with vectorized code sometimes hard to predict branches are useful to provide speedups (for example if they allow to skip several instructions including loads), see for example this post of mine:

http://software.intel.com/en-us/forums/showpost.php?p=140549

Thomas_W_Intel · ‎02-18-2011

The 4th constant can be avoided by merging the bits using a ^ ((a^b) & mask) as suggested in this collection of bit twiddling hacks. This results in:

[cpp]unsigned short signs;   
__m128i values, v;   
v = _mm_insert_epi16(v, signs, 0);   
v = _mm_shuffle_epi8(input, _mm_set_epi8(1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0));   
const __m128i mask = _mm_set_epi8(0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01, 0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01);   
v = _mm_and_si128(v, mask);   
v = _mm_cmpeq_epi8(v, mask);   
v = _mm_xor_si128(v, values); 
v = _mm_and_si128(v, _mm_set1_epi8(128)); 
values = _mm_xor_si128(values, v); 
[/cpp]

For sake of completeness, I would also like to mention that there is a psignb instruction that can replace the last 3 instructions if someone wants to negate the values and not just copy the sign bits.