Parallel instructions for detecting MSB in array of bytes

craptacus · ‎10-15-2009

First time posting, not sure if correct forum, but...

I have a large array of bytes, max up to 1600, but mostly up to 128.

Bytes are typically 7-bit of information, and the MSB is used as a sentinel, so the MSB is set in a small portion of the bytes.

Currently I'm looping through them in a loop, but is there a better way to use SSEx to process 128 bytes in parallel and get back a SSE vector with bits set for each byte?

Any suggestions?

Thank you,
craptacus

Dmitry_Vyukov · ‎10-16-2009

Quoting - craptacus

First time posting, not sure if correct forum, but...

I have a large array of bytes, max up to 1600, but mostly up to 128.

Bytes are typically 7-bit of information, and the MSB is used as a sentinel, so the MSB is set in a small portion of the bytes.

Currently I'm looping through them in a loop, but is there a better way to use SSEx to process 128 bytes in parallel and get back a SSE vector with bits set for each byte?

I do not fully understand what you want but will try to answer anyway.
Yes, SSE instructions do allow you to set a particular bit in each byte. All you have to do is to load mask into SSE register, and then use POR instruction to set particular bits. You can use _mm_or_si128 intrinsic:
__m128i _mm_or_si128 (__m128i a, __m128i b);
(Note that your arrays have to be 16-byte aligned)

Thomas_W_Intel · ‎10-16-2009

Craptacus,

I think that the "AVX and CPU Instructions" forum is more appropriate and I have therefore moved your post there.

Do I understand this correctly? You want to determine in an array of chars, in which bytes the most significant bit is set. There is the SSE2 instruction "pmovmskb" that creates a 16-bit mask from the most significant bits of the 16 chars in an SSE register. TIn contrast to what you were asking for, theresult is created in an normal register, but this shouldn't be a problem for you.

If you are new to SSE programming, the easiest way would be using the intrinsic _mm_movemask_epi8. It wraps the instruction inside a C function so that you don't need to worry about register asignments or address computation. All major compilers support this.

Kind regards
Thomas

bronxzv · ‎10-16-2009

>Currently I'm looping through them in a loop, but is there a better way to use SSEx to process 128 bytes in >parallel and get back a SSE vector with bits set for each byte?

nope, the best you can do is to process 16 bytes in parallel and get back a 16-bit value (as advised by Thomas)
NB: your array must be 16B aligned if you want to avoid an extra MOVUx instruction

your code will look like :

const unsigned char *array; // init. missing !
const int size = 1024;
unsigned short *msbArray; // init. missing !
for (int i=0, j=0; i msbArray = _mm_movemask_epi8(_mm_load_si128((const __m128i *)array+i));

craptacus · ‎10-16-2009

Quoting - bronxzv

>Currently I'm looping through them in a loop, but is there a better way to use SSEx to process 128 bytes in >parallel and get back a SSE vector with bits set for each byte?

nope, the best you can do is to process 16 bytes in parallel and get back a 16-bit value (as advised by Thomas)
NB: your array must be 16B aligned if you want to avoid an extra MOVUx instruction

your code will look like :

const unsigned char *array; // init. missing !
const int size = 1024;
unsigned short *msbArray; // init. missing !
for (int i=0, j=0; i msbArray = _mm_movemask_epi8(_mm_load_si128((const __m128i *)array+i));

A big "Thank You" to both Thomas and bronxzv -- this is excellent, I'll try it this weekend.

jimdempseyatthecove · ‎10-21-2009

Before you jump in and use the SSEx instructions to produce a bitmap of sign bits set look at the remainder of the code to see what it does upon detection. If your are only interested in producing a bit vector then proceed with the SSE conversion. However, if you are looking for sentinal bits then looking at data bytes at and folowing the byte with the sentinal, thenit may be more efficient (on x64) to stay within the __int64 register set. You can check 8 bytes at a time then use rotates or AND list to locate byte offset with sentinal bit. While you will have on average 2x more tests, 3 out of 4 test will come from L1 cache (practicly free). The time saved in computing the byte offset using non-SSE blend of code often makes up the difference. The SSE version will (my guess) only show advantage when no sentinals are in the buffer.

Jim Dempsey

bronxzv · ‎10-22-2009

Quoting - jimdempseyatthecove

Before you jump in and use the SSEx instructions to produce a bitmap of sign bits set look at the remainder of the code to see what it does upon detection. If your are only interested in producing a bit vector then proceed with the SSE conversion. However, if you are looking for sentinal bits then looking at data bytes at and folowing the byte with the sentinal, thenit may be more efficient (on x64) to stay within the __int64 register set. You can check 8 bytes at a time then use rotates or AND list to locate byte offset with sentinal bit. While you will have on average 2x more tests, 3 out of 4 test will come from L1 cache (practicly free). The time saved in computing the byte offset using non-SSE blend of code often makes up the difference. The SSE version will (my guess) only show advantage when no sentinals are in the buffer.

Jim Dempsey

128-bit fetches will allow to maximize L1D bandwidth utilization and L2 to L1D bandwidth, unlike 64-bit fetches using 64-bit GPRs, also the SSE solution will run on 32-bit only enabled platforms (CPU & OS) i.e. 90+ % of the installed base

if the MSB flags are sparse such kind of code will be the fastest solution IMO :

const unsigned char *array; // init. missing !
const int size = 1024;
for (int i=0; i{
const unsigned short mask = _mm_movemask_epi8(_mm_load_si128((const __m128i *)array+i));
if (!mask) continue; // usual case: good branch prediction hit, macrofused with compare
//at least one MSB set: loop through mask bits oruseconversion LUTs (*1)
//...
}

*1: look up tables, one for the nr of elements, the other one with indices to elements
for example mask = 0x24 will return : 2 elements,indices = 2,5

NB: the LUTs solution is generally faster for sparse arrays, less iterations and less branch prediction misses