Intel® ISA Extensions

Converting SSE packed integer handling to AVX

Grace_Oliver__Intel_
925 Views

I have used SSE to work with, for example, 6-bit packed integers. SSE required a lot of heavy lifting through masking, shifting, and storing results into temporary registers before doing the arithmetic or logical operation. I've been studying AVX2 instructions, looking for a better set of instructions to do this. Has anyone on this forum already looked at optimizing this type of workload? There are a lot of new instructions for manipulating 32-bit and 64-bit data units, but it is not obvious to me at this point how these can help with this 6-bit packed integer problem.
Thanks,

8 Replies
capens__nicolas
New Contributor I
Which operations do you need to perform on these 6-bit integers exactly? I imagine the AVX2 vector-vector shift instructions, and gather instructions, could come in quite handy.
Grace_Oliver__Intel_
Vector-to-vector shifts work on dword and qword element sizes. Perhaps these are useful; but I believe it will still take multiple shifts and masks to get the packed data aligned in order to operate on smaller than dword elements.

With the larger AVX register size, the number of elements operated on will double, which I expect will increase performance.

Thanks,
jimdempseyatthecove
Honored Contributor III
Grace,

As asked in the second post, what operations do you intend to perform, and how long are your 6-bit vectors?

Example:

Do you intend to perform only 6-bit vector compares?

Or do you intend to perform addition, subtraction, multiplication, division, rotates, etc.?

Compare for equal could be done relatively easily using pxor (and possibly pand for partial vector).
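
A minimal sketch of the XOR-and-mask compare, written in portable C on a 64-bit word rather than an SSE register so it runs anywhere. The helper names `field_mask6` and `packed6_equal` are hypothetical, and the layout (ten 6-bit fields in the low 60 bits) is an assumption. On SSE the same two steps would map to pxor and pand over 128 bits.

```c
#include <stdint.h>
#include <stdbool.h>

/* Mask covering the low `count` 6-bit fields of a 64-bit word.
   Assumed layout: field i occupies bits 6*i .. 6*i+5, at most 10 fields. */
static uint64_t field_mask6(unsigned count)
{
    return (count >= 10) ? ((UINT64_C(1) << 60) - 1)
                         : ((UINT64_C(1) << (6 * count)) - 1);
}

/* True if the first `count` packed 6-bit fields of a and b are all equal.
   The XOR leaves a set bit wherever the inputs differ (the pxor step);
   the AND discards fields outside the range of interest (the pand step). */
static bool packed6_equal(uint64_t a, uint64_t b, unsigned count)
{
    return ((a ^ b) & field_mask6(count)) == 0;
}
```

With `count` below 10 the mask implements the "partial vector" case: differences in the fields beyond `count` are ignored.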

For arithmetic, it might be easier to use the GP registers, and almost as fast, since you can handle 64 bits (or 60 bits) at a time.
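
As an illustration of doing arithmetic on 60 bits at a time in a GP register, here is a sketch of a field-wise add on ten packed 6-bit integers. It uses a common "SIMD within a register" trick, which is one possible approach and not necessarily what was intended above: add the low five bits of every field normally, then fix up the top bit of each field with XOR so that no carry leaks into the neighbouring field. The names `repeat6` and `packed6_add` are hypothetical, and wrap-around modulo 64 per field is an assumption.

```c
#include <stdint.h>

/* Build a 60-bit constant holding the 6-bit `pattern` in each of 10 fields. */
static uint64_t repeat6(uint64_t pattern)
{
    uint64_t r = 0;
    for (int i = 0; i < 10; i++)
        r |= (pattern & 0x3F) << (6 * i);
    return r;
}

/* Add ten packed 6-bit fields at once; each field wraps modulo 64. */
static uint64_t packed6_add(uint64_t a, uint64_t b)
{
    const uint64_t high = repeat6(0x20);  /* top bit of every field */
    const uint64_t low  = repeat6(0x1F);  /* lower five bits of every field */

    /* Sum of the low five bits: at most 31 + 31 = 62, so the carry lands in
       the field's own top bit and never crosses into the next field. */
    uint64_t sum = (a & low) + (b & low);

    /* Add the two top bits modulo 2 on top of that carry. */
    return sum ^ ((a ^ b) & high);
}
```

For example, with fields (5, 40) in `a` and (7, 30) in `b`, the result holds (12, 6), since 40 + 30 wraps to 6 modulo 64.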

Stating what you want to do would certainly help us in providing you with advice.

Jim Dempsey
sirrida
Beginner
You can unpack the data with the proposed new BMI2 command PDEP from several 6 bit entities to 8 bit ones, do your calculation and repack them with PEXT. These commands act on 32- or 64-bit general purpose registers (GPRs) like EAX or RAX; unfortunately there is no version for MMX/SSE/AVX. The mask to be used for both packing and unpacking will probably be 0x3f3f3f3f (32 bit) or similar.
I'm not sure whether this is what you want; at least you can test it with the AVX emulator.
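
To make the PDEP/PEXT idea concrete, here is a sketch that models both operations in portable C and uses them with the 0x3f3f3f3f mask. The functions `pdep32_soft`, `pext32_soft`, `unpack6to8` and `pack8to6` are hypothetical names for illustration only. On a CPU with BMI2, the intrinsics `_pdep_u32` and `_pext_u32` from `<immintrin.h>` should do the same work in a single instruction each.

```c
#include <stdint.h>

/* Software model of PDEP: deposit the low bits of src, in order,
   into the positions where mask has a set bit. */
static uint32_t pdep32_soft(uint32_t src, uint32_t mask)
{
    uint32_t result = 0;
    for (uint32_t bit = 1; mask != 0; bit <<= 1) {
        uint32_t lowest = mask & (0u - mask);  /* isolate lowest set mask bit */
        if (src & bit)
            result |= lowest;
        mask &= mask - 1;                      /* clear that mask bit */
    }
    return result;
}

/* Software model of PEXT: gather the bits of src selected by mask
   into the low bits of the result, in order. */
static uint32_t pext32_soft(uint32_t src, uint32_t mask)
{
    uint32_t result = 0;
    for (uint32_t bit = 1; mask != 0; bit <<= 1) {
        uint32_t lowest = mask & (0u - mask);
        if (src & lowest)
            result |= bit;
        mask &= mask - 1;
    }
    return result;
}

/* Four 6-bit values packed in the low 24 bits -> one value per byte. */
static uint32_t unpack6to8(uint32_t packed24)
{
    return pdep32_soft(packed24, 0x3F3F3F3Fu);
}

/* One 6-bit value per byte -> four values packed in the low 24 bits. */
static uint32_t pack8to6(uint32_t bytes)
{
    return pext32_soft(bytes, 0x3F3F3F3Fu);
}
```

Once unpacked, the values sit one per byte, so ordinary byte-wise arithmetic can be applied before repacking with `pack8to6`.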

Alternatively you can do the unpack/pack with some bit magic, see e.g. Hacker's Delight (see the code of compress and the PDF linked on revisions, figure 7-7 on page 43).
On my programming pages you will find similar routines under "bit permutations".
It should not be too difficult to adapt these routines to MMX/SSE/AVX, but it is not necessarily worthwhile. Be aware that the mask is a constant.
If you work with SSE registers, it probably makes sense to unpack every 3 bytes to 4 (i.e. 12 to 16 bytes) via PSHUFB before doing the bit shuffling. For the packing afterwards, do this in the opposite direction.
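
The bit shuffling step after the byte expansion can be sketched with two shift-and-mask rounds. The code below is portable C on a single 32-bit word and assumes the 3 source bytes already sit in the low 24 bits of that word (which is what the PSHUFB step would arrange for each dword); the function names are hypothetical. Because the masks are constants, each line corresponds to a plain AND, shift and OR, which have packed-dword SSE/AVX equivalents.

```c
#include <stdint.h>

/* Spread four 6-bit fields from the low 24 bits into one field per byte. */
static uint32_t spread6_shift(uint32_t x)
{
    /* Move the upper 12 bits (fields 2 and 3) into the upper 16-bit half. */
    x = (x & 0x00000FFFu) | ((x & 0x00FFF000u) << 4);
    /* In each 16-bit half, move the upper 6-bit field into the high byte. */
    x = (x & 0x003F003Fu) | ((x & 0x0FC00FC0u) << 2);
    return x;
}

/* Inverse: gather one 6-bit field per byte back into the low 24 bits. */
static uint32_t gather6_shift(uint32_t x)
{
    /* In each 16-bit half, pull the high byte's field down next to the low one. */
    x = (x & 0x003F003Fu) | ((x >> 2) & 0x0FC00FC0u);
    /* Pull the upper half's 12 bits down next to the lower 12 bits. */
    x = (x & 0x00000FFFu) | ((x >> 4) & 0x00FFF000u);
    return x;
}
```

For example, `spread6_shift(0x00ABCDEF)` gives `0x2A3C372F`, and `gather6_shift` maps that back to `0x00ABCDEF`.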

Is this what you want? Do you need explicit code snippets?
capens__nicolas
New Contributor I
Quoting sirrida
You can unpack the data with the proposed new BMI2 command PDEP from several 6 bit entities to 8 bit ones, do your calculation and repack them with PEXT.

Wow, those instructions are fantastic. I never imagined a complex operation like that was even possible in a single pipelined instruction, but after reading up on the 'butterfly' datapath it's actually quite elegant.

I really look forward to CPUs with AVX2 and BMI. Am I correct that Haswell won't support BMI2 yet? PDEP and PEXT are not mentioned in the Haswell New Instructions blog. Hopefully it's scheduled for Broadwell then.

sirrida
Beginner
Strangely in the blog PEXT and PDEP are missing but all the other BMI2 commands are mentioned: BZHI, MULX, RORX, SARX, SHLX, SHRX.
Nobody has reacted to my comment there (2011-07-01 12:11), and I still don't have any clue what these lowest-bit manipulation operations (BMI1 / XOP) are useful for.
Max_L
Employee
PDEP/PEXT are indeed part of BMI2 and are planned to be available in the first CPU supporting BMI2.

Trailing-bit manipulation instructions are useful for fast decoding of variable-bit-length codes (see e.g. gamma codes, http://nlp.stanford.edu/IR-book/html/htmledition/gamma-codes-1.html), where detecting the length of the next bit field is often on the critical path and reducing latency can help significantly. For example, a pair of BLSR and TZCNT can be used together to decode a unary-encoded bit stream.
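
A rough sketch of that BLSR/TZCNT pairing, in portable C so it does not depend on BMI1 hardware. The encoding convention is an assumption made for the example: each value n is stored as n zero bits followed by a one bit, starting from the least significant bit. The helpers `tzcnt32_soft` and `blsr32_soft` are hypothetical stand-ins for the instructions; with BMI1 available, the intrinsics `_tzcnt_u32` and `_blsr_u32` from `<immintrin.h>` could replace them.

```c
#include <stdint.h>
#include <stddef.h>

/* Stand-in for TZCNT: index of the lowest set bit (x must be nonzero). */
static unsigned tzcnt32_soft(uint32_t x)
{
    unsigned n = 0;
    while ((x & 1u) == 0) {
        x >>= 1;
        n++;
    }
    return n;
}

/* Stand-in for BLSR: clear the lowest set bit. */
static uint32_t blsr32_soft(uint32_t x)
{
    return x & (x - 1);
}

/* Decode unary codes from one 32-bit word (assumed LSB-first, n zeros
   then a one). Writes at most `cap` values and returns the count. */
static size_t decode_unary32(uint32_t word, unsigned *out, size_t cap)
{
    size_t count = 0;
    unsigned start = 0;                  /* bit where the current code begins */
    while (word != 0 && count < cap) {
        unsigned pos = tzcnt32_soft(word);   /* position of terminating one */
        out[count++] = pos - start;          /* zeros before it = the value */
        start = pos + 1;
        word = blsr32_soft(word);            /* drop that one, move on */
    }
    return count;
}
```

Each loop iteration is one TZCNT plus one BLSR, so the length of the next code is found without scanning bit by bit.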

-Max
Grace_Oliver__Intel_
Thanks everyone for the references and suggestions.
Grace