
I have used SSE to work with, for example, 6-bit packed integers. SSE required a lot of heavy lifting through masking, shifting, and storing results into temporary registers before doing the arithmetic or logical operation. I've been studying AVX2 instructions looking for a more optimal set of instructions to do this. Has anyone on this forum already looked at optimizing this type of workload? There are a lot of new instructions for manipulating 32-bit and 64-bit data units, but it is not obvious to me at this point how these can help with this 6-bit packed integer problem.

Thanks,



With the larger AVX register size, the number of elements operated on will double, and I'm expecting that this will increase performance.

Thanks,


As asked in the second post, what operations do you intend to perform, and how long are your 6-bit vectors?

For example, do you intend only to test 6-bit vectors for equality? Or do you intend to perform addition, subtraction, multiplication, division, rotates, etc.?

Compare for equal could be done relatively easily using pxor (and possibly pand for partial vector).
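A rough sketch of that XOR-then-mask compare, written on a plain 64-bit word rather than an SSE register so it stays portable. The function name `packed6_equal` and the field layout (field i in bits 6*i to 6*i+5, at most ten fields per word) are assumptions made for illustration; with SSE the same two steps would map to pxor and pand on 128 bits.

```c
#include <stdint.h>
#include <stdbool.h>

/* Compare the first n_fields 6-bit fields of two packed 64-bit words.
   Assumed layout: field i occupies bits [6*i, 6*i+5], up to 10 fields. */
static bool packed6_equal(uint64_t a, uint64_t b, unsigned n_fields)
{
    /* Mask covering the low 6*n_fields bits: the "pand" step for a
       partial vector. Ten or more fields means all 60 payload bits. */
    uint64_t mask = (n_fields >= 10)
                        ? ((UINT64_C(1) << 60) - 1)
                        : ((UINT64_C(1) << (6 * n_fields)) - 1);

    /* XOR is zero exactly where the inputs agree: the "pxor" step. */
    return ((a ^ b) & mask) == 0;
}
```

Bits outside the selected fields, including the four unused top bits, are ignored by the mask.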

For arithmetic, it might be easier to use the GP registers, and almost as fast, since you can handle 64 bits (or 60 bits) at a time.
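As one illustration of arithmetic in a GP register, here is a minimal sketch of adding ten 6-bit fields at once (60 bits) without letting carries cross field boundaries. The names `PACKED6_HIGH` and `packed6_add`, the field layout, and the choice of wraparound (modulo 64) addition are all assumptions, not something stated in the thread.

```c
#include <stdint.h>

/* Top bit of each of the ten 6-bit fields (bits 5, 11, ..., 59). */
#define PACKED6_HIGH UINT64_C(0x0820820820820820)

/* Add ten packed 6-bit fields lane by lane, each sum taken modulo 64.
   Assumed layout: field i occupies bits [6*i, 6*i+5]; bits 60..63 unused. */
static uint64_t packed6_add(uint64_t a, uint64_t b)
{
    const uint64_t H = PACKED6_HIGH;
    const uint64_t L = ((UINT64_C(1) << 60) - 1) & ~H; /* low 5 bits per field */

    /* Adding only the low 5 bits of each field cannot carry past that
       field's own top bit, so neighbouring fields stay independent. */
    uint64_t low_sum = (a & L) + (b & L);

    /* Fold in the top bits with XOR, which drops the carry out of the field. */
    return low_sum ^ ((a ^ b) & H);
}
```

For example, a field holding 63 plus a field holding 1 yields 0 in that field and leaves the next field untouched.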

Stating what you want to do would certainly help us in providing you with advice.

Jim Dempsey


I'm not sure whether this is what you want; at least you can test it with the AVX emulator.

Alternatively you can do the unpack/pack with some bit magic, see e.g. Hacker's Delight (see the code of compress and the PDF linked on revisions, figure 7-7 on page 43).

On my programming pages you will find similar routines under "bit permutations".

It should not be too difficult to adapt these routines to MMX/SSE/AVX, but it is not necessarily worthwhile. Be aware that the mask is a constant.

If you work with SSE registers it probably makes sense to unpack every 3 bytes to 4 (i.e. 12 to 16 bytes) via PSHUFB before doing the bit shuffling. For the packing afterwards, do this in the opposite direction.
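To make the 3-to-4 grouping concrete, here is a scalar sketch of what the unpack and the repack compute for each group. It does not use PSHUFB itself; in an SSE version the shuffle would only move each 3-byte group into its own 32-bit lane, and the shifts and masks would follow. The function names and the bit order (values packed LSB-first, little-endian bytes) are assumptions.

```c
#include <stdint.h>
#include <stddef.h>

/* Expand n_groups * 3 packed bytes into n_groups * 4 bytes, each holding
   one 6-bit value. Assumed packing: LSB-first within little-endian bytes. */
static void unpack6_3to4(const uint8_t *src, uint8_t *dst, size_t n_groups)
{
    for (size_t g = 0; g < n_groups; ++g) {
        /* Gather one 24-bit group into a single word. */
        uint32_t w = (uint32_t)src[3 * g]
                   | ((uint32_t)src[3 * g + 1] << 8)
                   | ((uint32_t)src[3 * g + 2] << 16);
        for (unsigned i = 0; i < 4; ++i)
            dst[4 * g + i] = (uint8_t)((w >> (6 * i)) & 0x3F);
    }
}

/* Inverse direction: squeeze four 6-bit values back into three bytes. */
static void pack6_4to3(const uint8_t *src, uint8_t *dst, size_t n_groups)
{
    for (size_t g = 0; g < n_groups; ++g) {
        uint32_t w = 0;
        for (unsigned i = 0; i < 4; ++i)
            w |= (uint32_t)(src[4 * g + i] & 0x3F) << (6 * i);
        dst[3 * g]     = (uint8_t)(w & 0xFF);
        dst[3 * g + 1] = (uint8_t)((w >> 8) & 0xFF);
        dst[3 * g + 2] = (uint8_t)((w >> 16) & 0xFF);
    }
}
```

With four groups per call this covers the 12-to-16-byte case mentioned above.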

Is this what you want? Do you need explicit code snippets?


*You can unpack the data with the proposed new BMI2 command PDEP from several 6 bit entities to 8 bit ones, do your calculation and repack them with PEXT.*

Wow, those instructions are fantastic. I never imagined a complex operation like that was even possible in a single pipelined instruction, but after reading up on the 'butterfly' datapath it's actually quite elegant.

I really look forward to CPUs with AVX2 and BMI. Am I correct that Haswell won't support BMI2 yet? PDEP and PEXT are not mentioned in the Haswell New Instructions blog. Hopefully it's scheduled for Broadwell then.
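For readers without BMI2 hardware, the PDEP unpack and PEXT repack quoted above can be tried with software stand-ins. This is only a sketch: `pdep64_soft`, `pext64_soft`, `unpack6_to_bytes` and `pack6_from_bytes` are hypothetical names, the loops are far slower than the real instructions, and the layout (eight 6-bit fields in the low 48 bits, LSB-first) is an assumption. On hardware with BMI2 the two helpers would be replaced by the `_pdep_u64` and `_pext_u64` intrinsics.

```c
#include <stdint.h>

/* Software stand-in for PDEP: deposit the low bits of src, in order,
   at the positions of the set bits of mask. */
static uint64_t pdep64_soft(uint64_t src, uint64_t mask)
{
    uint64_t result = 0;
    for (uint64_t bit = 1; mask != 0; bit <<= 1) {
        uint64_t lowest = mask & (~mask + 1); /* lowest set bit of mask */
        if (src & bit)
            result |= lowest;
        mask &= mask - 1;                     /* clear that bit */
    }
    return result;
}

/* Software stand-in for PEXT: gather the bits of src selected by mask
   into the low bits of the result. */
static uint64_t pext64_soft(uint64_t src, uint64_t mask)
{
    uint64_t result = 0;
    for (uint64_t bit = 1; mask != 0; bit <<= 1) {
        uint64_t lowest = mask & (~mask + 1);
        if (src & lowest)
            result |= bit;
        mask &= mask - 1;
    }
    return result;
}

/* Low 6 bits of every byte: the constant mask for 6-bit to 8-bit expansion. */
#define SIX_IN_EIGHT UINT64_C(0x3F3F3F3F3F3F3F3F)

/* Eight packed 6-bit fields (48 bits) to eight bytes, and back. */
static uint64_t unpack6_to_bytes(uint64_t packed48)
{
    return pdep64_soft(packed48, SIX_IN_EIGHT);
}

static uint64_t pack6_from_bytes(uint64_t bytes)
{
    return pext64_soft(bytes, SIX_IN_EIGHT);
}
```

After unpacking, ordinary byte-wise operations can run on the eight values, and the repack discards the top two bits of each byte.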


**all** the other BMI2 commands are mentioned: BZHI, MULX, RORX, SARX, SHLX, SHRX.

Nobody has reacted to my comment (2011-07-01 12:11) on this, and I still don't have any clue what these lowest-bit manipulation operations (BMI1 / XOP) are useful for.


Trailing-bit manipulation instructions are useful for fast decoding of variable-bit-length codes (see e.g. gamma codes, http://nlp.stanford.edu/IR-book/html/htmledition/gamma-codes-1.html), where detecting the length of the next bit field is often on the critical path and reducing latency can help significantly. For example, a pair of BLSR and TZCNT instructions can be used together to decode a unary-encoded bit stream.
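A small sketch of that BLSR plus TZCNT loop, using portable stand-ins so it runs anywhere. The helper names, the function `decode_unary`, and the encoding convention (value n stored as n zero bits followed by a one bit, read from the least significant bit) are assumptions chosen for illustration.

```c
#include <stdint.h>
#include <stddef.h>

/* Stand-in for TZCNT: number of trailing zero bits, 64 for a zero input. */
static unsigned tzcnt64_soft(uint64_t x)
{
    unsigned n = 0;
    if (x == 0)
        return 64;
    while ((x & 1) == 0) {
        x >>= 1;
        ++n;
    }
    return n;
}

/* Stand-in for BLSR: clear the lowest set bit. */
static uint64_t blsr64_soft(uint64_t x)
{
    return x & (x - 1);
}

/* Decode unary codes from one 64-bit word, LSB first.
   Assumed code: value n = n zero bits followed by a terminating one bit.
   Writes at most max_out values to out and returns how many were written. */
static size_t decode_unary(uint64_t word, unsigned *out, size_t max_out)
{
    size_t count = 0;
    unsigned start = 0;                    /* first bit of the current code */
    while (word != 0 && count < max_out) {
        unsigned pos = tzcnt64_soft(word); /* position of the terminator */
        out[count++] = pos - start;        /* zeros before it = the value */
        start = pos + 1;
        word = blsr64_soft(word);          /* drop the terminator, move on */
    }
    return count;
}
```

Each iteration is one TZCNT and one BLSR on the real hardware, which is why the pair keeps the length-detection step short.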

-Max


Grace
