Converting SSE packed integer handling to AVX

Grace_Oliver__Intel_ · ‎08-09-2011

I have used SSE to work with, for example, 6-bit packed integers. SSE required a lot of heavy lifting through masking, shifting, and storing of results into temporary registers before doing the arithmetic or logical operation. I've been studying AVX2 instructions looking for a more efficient set of instructions to do this. Has anyone on this forum already looked at optimizing this type of workload? There are a lot of new instructions for manipulating 32-bit and 64-bit data units, but it is not obvious to me at this point how these can help with this 6-bit packed integer problem.
Thanks,

capens__nicolas · ‎08-09-2011

Which operations do you need to perform on these 6-bit integers exactly? I imagine the AVX2 vector-vector shift instructions, and gather instructions, could come in quite handy.

Grace_Oliver__Intel_ · ‎08-10-2011

Vector-to-vector shifts work on dword and qword element sizes. Perhaps these are useful, but I believe it will still take multiple shifts and masks to get the packed data aligned in order to operate on elements smaller than a dword.

With the larger AVX register size, the number of elements operated on will double, and I'm expecting that this will increase performance.

Thanks,

jimdempseyatthecove · ‎08-12-2011

Grace,

As asked in the second post, what operations do you intend to perform, and how long are your 6-bit vectors?

For example, do you intend only to test 6-bit vectors with compares? Or do you intend to perform addition, subtraction, multiplication, division, rotates, etc.?

Compare for equal could be done relatively easily using pxor (and possibly pand for a partial vector).

For arithmetic, it might be easier to use the GP registers and almost as fast, since you can handle 64 bits (or 60 bits) at a time.
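
A minimal sketch of the GP-register idea, assuming ten 6-bit fields packed into the low 60 bits of a 64-bit word with field 0 in the lowest bits. The names add6x10, equal_fields, and get_field are hypothetical helpers for illustration, and the layout is an assumption rather than something stated in this thread. Addition is done per field modulo 64: the top bit of each field is masked off so that carries cannot cross into the neighboring field, then the top bits are restored with XOR. The equality test is the scalar analogue of pxor followed by pand.

```c
#include <stdint.h>

/* Assumed layout: ten 6-bit fields in bits 0..59, field 0 lowest. */
#define USED_MASK 0x0FFFFFFFFFFFFFFFULL  /* the 60 bits in use */
#define MSB_MASK  0x0820820820820820ULL  /* bit 5 of every 6-bit field */

/* Add all ten fields at once, each modulo 64, with no carry between fields. */
static uint64_t add6x10(uint64_t a, uint64_t b)
{
    /* Sum the low 5 bits of each field; the result fits in 6 bits per field. */
    uint64_t low = (a & ~MSB_MASK) + (b & ~MSB_MASK);
    /* Fold the top bits back in with XOR, which is addition without carry-out. */
    return (low ^ ((a ^ b) & MSB_MASK)) & USED_MASK;
}

/* Compare only the fields selected by 'select' (XOR, then AND, then test). */
static int equal_fields(uint64_t a, uint64_t b, uint64_t select)
{
    return ((a ^ b) & select) == 0;
}

/* Read field 'i' (0..9) as an ordinary integer. */
static unsigned get_field(uint64_t x, unsigned i)
{
    return (unsigned)((x >> (6 * i)) & 0x3F);
}
```

For instance, adding a word holding the fields (63, 1) to one holding (1, 2) gives (0, 3): field 0 wraps around modulo 64 without disturbing field 1.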

Stating what you want to do would certainly help us in providing you with advice.

Jim Dempsey

sirrida · ‎08-16-2011

You can unpack the data with the proposed new BMI2 command PDEP from several 6 bit entities to 8 bit ones, do your calculation and repack them with PEXT. These commands act on 32 or 64 bit general purpose registers (GPR) like EAX or RAX; unfortunately there is no version for MMX/SSE/AVX. The mask to be used for both packing and unpacking will probably be 0x3f3f3f3f (32 bit) or similar.
I'm not sure whether this is what you want; at least you can test it with the AVX emulator.
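
To make the PDEP/PEXT idea concrete, here is a portable C sketch that emulates both operations bit by bit, so it can be tried without BMI2 hardware. The functions pdep64, pext64, unpack6to8, and pack8to6 are hypothetical names for illustration; on a BMI2-capable CPU the intrinsics _pdep_u64 and _pext_u64 from immintrin.h would take the place of the two loops. The 64-bit mask is simply the 32-bit mask above extended to eight bytes, and the first 6-bit field is assumed to sit in the lowest bits.

```c
#include <stdint.h>

/* Software model of PDEP: deposit the low bits of src at the set bits of mask. */
static uint64_t pdep64(uint64_t src, uint64_t mask)
{
    uint64_t result = 0;
    for (uint64_t bit = 1; mask != 0; bit <<= 1) {
        uint64_t lowest = mask & (~mask + 1);  /* isolate lowest set mask bit */
        if (src & bit)
            result |= lowest;
        mask &= mask - 1;                      /* clear that mask bit */
    }
    return result;
}

/* Software model of PEXT: collect the bits of src selected by mask into the low bits. */
static uint64_t pext64(uint64_t src, uint64_t mask)
{
    uint64_t result = 0;
    for (uint64_t bit = 1; mask != 0; bit <<= 1) {
        uint64_t lowest = mask & (~mask + 1);
        if (src & lowest)
            result |= bit;
        mask &= mask - 1;
    }
    return result;
}

#define SIX_IN_EIGHT 0x3F3F3F3F3F3F3F3FULL  /* low 6 bits of every byte */

/* Eight 6-bit fields (48 bits) -> eight bytes, one field per byte. */
static uint64_t unpack6to8(uint64_t packed48)
{
    return pdep64(packed48, SIX_IN_EIGHT);
}

/* Eight bytes -> eight 6-bit fields; the top two bits of each byte are dropped. */
static uint64_t pack8to6(uint64_t bytes)
{
    return pext64(bytes, SIX_IN_EIGHT);
}
```

Once unpacked, each field has two bits of headroom in its byte, so ordinary byte-wise arithmetic can be applied before repacking.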

Alternatively you can do the unpack/pack with some bit magic, see e.g. Hacker's Delight (see the code of compress and the PDF linked on revisions, figure 7-7 on page 43).
On my programming pages you will find similar routines under "bit permutations".
It should not be too difficult to adapt these routines to MMX/SSE/AVX but not necessarily worthwhile. Be aware that the mask is a constant.
If you work with SSE registers it probably makes sense to unpack every 3 bytes to 4 (i.e. 12 to 16 bytes) via PSHUFB before doing the bit shuffling. For the packing afterwards do this in the opposite direction.
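
As a scalar illustration of the 3-to-4 byte step, the sketch below handles one 32-bit lane: three packed bytes are placed in the low 24 bits (the job PSHUFB would do for four lanes at once), and then shifts and masks spread the four 6-bit fields to one per byte. The names spread24, gather24, and unpack_3to4 are hypothetical, and the bit order (first field in the lowest bits, little-endian bytes) is an assumption. The same shift-and-mask pattern would apply per dword lane in an SSE or AVX2 version.

```c
#include <stdint.h>

/* Spread four 6-bit fields held in bits 0..23 to the low 6 bits of four bytes. */
static uint32_t spread24(uint32_t x)
{
    return  (x & 0x00003Fu)          /* field 0 stays at bits 0..5    */
         | ((x & 0x000FC0u) << 2)    /* field 1: bits 6..11  -> 8..13  */
         | ((x & 0x03F000u) << 4)    /* field 2: bits 12..17 -> 16..21 */
         | ((x & 0xFC0000u) << 6);   /* field 3: bits 18..23 -> 24..29 */
}

/* Inverse of spread24: pack the low 6 bits of four bytes into bits 0..23. */
static uint32_t gather24(uint32_t y)
{
    return  (y & 0x00003Fu)
         | ((y >> 2) & 0x000FC0u)
         | ((y >> 4) & 0x03F000u)
         | ((y >> 6) & 0xFC0000u);
}

/* Unpack one 3-byte group into four bytes, each holding one 6-bit value. */
static void unpack_3to4(const uint8_t in[3], uint8_t out[4])
{
    /* Load the three bytes into one lane (what PSHUFB would arrange). */
    uint32_t lane = (uint32_t)in[0] | ((uint32_t)in[1] << 8) | ((uint32_t)in[2] << 16);
    uint32_t wide = spread24(lane);
    out[0] = (uint8_t)(wide);
    out[1] = (uint8_t)(wide >> 8);
    out[2] = (uint8_t)(wide >> 16);
    out[3] = (uint8_t)(wide >> 24);
}
```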

Is this what you want? Do you need explicit code snippets?

capens__nicolas · ‎08-18-2011

Quoting sirrida

You can unpack the data with the proposed new BMI2 command PDEP from several 6 bit entities to 8 bit ones, do your calculation and repack them with PEXT.

Wow, those instructions are fantastic. I never imagined a complex operation like that was even possible in a single pipelined instruction, but after reading up on the 'butterfly' datapath it's actually quite elegant.

I really look forward to CPUs with AVX2 and BMI. Am I correct that Haswell won't support BMI2 yet? PDEP and PEXT are not mentioned in the Haswell New Instructions blog. Hopefully it's scheduled for Broadwell then.

sirrida · ‎08-18-2011

Strangely, PEXT and PDEP are missing in the blog, but all the other BMI2 commands are mentioned: BZHI, MULX, RORX, SARX, SHLX, SHRX.
Nobody has reacted to my comment (2011-07-01 12:11) on this, and I still don't have any clue what these lowest-bit manipulation operations (BMI1 / XOP) are useful for.

Max_L · ‎08-18-2011

PDEP/PEXT are indeed part of BMI2 and are planned to be available in the first CPU supporting BMI2.

Trailing bit manipulation instructions are useful for fast decoding of variable bit length codes (see e.g. gamma codes, http://nlp.stanford.edu/IR-book/html/htmledition/gamma-codes-1.html), where detecting the length of the next bit field is often on a critical path and reducing latency can help significantly. For example, a pair of BLSR and TZCNT instructions can be used together to decode a unary-encoded bit stream.
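
A small sketch of that BLSR/TZCNT pairing in portable C, under an assumed encoding: each value n is stored as n zero bits followed by a one bit, starting from the least significant bit of a 64-bit word. The helpers tzcnt64 and blsr64 are software stand-ins (hypothetical names) for the hardware instructions, which compilers expose as _tzcnt_u64 and _blsr_u64 in immintrin.h; decode_unary is likewise illustrative only.

```c
#include <stddef.h>
#include <stdint.h>

/* Software stand-in for TZCNT: number of trailing zero bits (64 if x == 0). */
static unsigned tzcnt64(uint64_t x)
{
    unsigned n = 0;
    if (x == 0)
        return 64;
    while ((x & 1) == 0) {
        x >>= 1;
        n++;
    }
    return n;
}

/* Software stand-in for BLSR: reset (clear) the lowest set bit. */
static uint64_t blsr64(uint64_t x)
{
    return x & (x - 1);
}

/* Decode unary values from 'word' into out[], returning how many were decoded.
   Assumed encoding: value n = n zero bits, then a terminating one bit. */
static size_t decode_unary(uint64_t word, unsigned out[], size_t max)
{
    size_t count = 0;
    unsigned start = 0;                  /* bit position where the next code begins */
    while (word != 0 && count < max) {
        unsigned pos = tzcnt64(word);    /* position of the next terminator bit */
        out[count++] = pos - start;      /* zeros skipped = decoded value */
        start = pos + 1;
        word = blsr64(word);             /* drop the terminator, move to the next code */
    }
    return count;
}
```

Each loop iteration is one TZCNT plus one BLSR on real hardware, which is why the latency of those two instructions sits on the critical path of the decoder.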

-Max

Grace_Oliver__Intel_ · ‎08-19-2011

Thanks everyone for the references and suggestions.
Grace