Intel® ISA Extensions

Indirect Bit Indexing and Set

Alexander_L_1
Beginner

   Hello everyone!

This is my first post, so please be patient :)
I have a very interesting problem. I also have a working solution, but it does not make me happy.
First, I will try to describe the problem as exactly as possible.
1. Let I (I for Image) be a 2D array of bytes.
2. Each byte contains 2 independent indices - say the upper 3 bits are the Y-index and the lower 5 bits are the X-index.
3. This easily defines a translation I->YX.
4. It is also noticeable that YX can be described as a 2D bit image with only 8 rows and 32 columns, which makes exactly 256 binary cells.
5. The full solution requires setting 1 in each cell addressed by I (see 1 and 2).
6. An accepted solution can be reduced to separately calculated "set rows" and "set columns" - that means 8 bits for the rows and 32 bits for the columns.
7. Two 32-bit registers/variables are then sufficient as output for the accepted solution (a scalar sketch of this reduced form is shown below).
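
To make the reduced form in points 6 and 7 concrete, here is a minimal scalar sketch of the expected result (plain C++, no SIMD; the function name is only illustrative):

    #include <stdint.h>
    #include <stddef.h>

    // rowMask: bit y is set if any input byte has Y-index y (upper 3 bits).
    // colMask: bit x is set if any input byte has X-index x (lower 5 bits).
    void scalarRowColMasks( const uint8_t* I, size_t n, uint32_t* rowMask, uint32_t* colMask )
    {
        uint32_t rows = 0, cols = 0;
        for( size_t i = 0; i < n; ++i )
        {
            rows |= 1u << (I[i] >> 5);    // upper 3 bits -> Y row
            cols |= 1u << (I[i] & 0x1F);  // lower 5 bits -> X column
        }
        *rowMask = rows; // only the low 8 bits are meaningful
        *colMask = cols;
    }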

 

I really have no idea how to implement this thing efficiently.

Moreover, I have found no instruction to convert a number to "set bits", and I also have no idea how to do such a thing for a complete XMM register.

The current solution uses an array of 256 integers where each entry is addressed by the byte index. Each entry is also counted (not only set), but this is not required.

Any better ideas?

 

Many thanks in advance!

    #include <smmintrin.h>       // SSE4.1 intrinsics (_mm_extract_epi8 and friends)

    typedef unsigned char uint8; // assumption: uint8 is an 8-bit unsigned type

    void countByteIndex( int width, int height, int dStep,  int iStep,  void*  _D, void* _I, void* _R)
    {
        int x, y;

         const int xStep = dStep;
         const int yStep = iStep;

        uint8* srcX = (uint8*)_D;
        uint8* srcY = (uint8*)_I;

        __int32* dst = (__int32*)_R;

        for( y = 0; y < height; y++) // single line at once
        {

            register const __m128i *srcX0 = (const __m128i *)(srcX);
            register const __m128i *srcY0 = (const __m128i *)(srcY);
            register __m128i sX0, sY0;

            register int r0, r1, r2, r3;

            for( x = 0; x < width; x += 16 ) // 16 bytes at once per line
            {
                sX0 = _mm_load_si128( srcX0++ ); // Loads 128-bit value. Aligned.
                sY0 = _mm_load_si128( srcY0++ ); // Loads 128-bit value. Aligned.

                sX0 = _mm_and_si128( sX0, sY0 ); // Mask destination offset.

                // Index sX0
                r0 = _mm_extract_epi8(sX0, 0);
                r1 = _mm_extract_epi8(sX0, 1);
                r2 = _mm_extract_epi8(sX0, 2);
                r3 = _mm_extract_epi8(sX0, 3);

                (*(dst+r0))++; // sufficient is also (*(dst+r0)) = 1 for all (*(dst+..))++
                (*(dst+r1))++;
                (*(dst+r2))++;
                (*(dst+r3))++;

                r0 = _mm_extract_epi8(sX0, 4);
                r1 = _mm_extract_epi8(sX0, 5);
                r2 = _mm_extract_epi8(sX0, 6);
                r3 = _mm_extract_epi8(sX0, 7);

                (*(dst+r0))++;
                (*(dst+r1))++;
                (*(dst+r2))++;
                (*(dst+r3))++;

                r0 = _mm_extract_epi8(sX0, 8);
                r1 = _mm_extract_epi8(sX0, 9);
                r2 = _mm_extract_epi8(sX0, 10);
                r3 = _mm_extract_epi8(sX0, 11);

                (*(dst+r0))++;
                (*(dst+r1))++;
                (*(dst+r2))++;
                (*(dst+r3))++;

                r0 = _mm_extract_epi8(sX0, 12);
                r1 = _mm_extract_epi8(sX0, 13);
                r2 = _mm_extract_epi8(sX0, 14);
                r3 = _mm_extract_epi8(sX0, 15);

                (*(dst+r0))++;
                (*(dst+r1))++;
                (*(dst+r2))++;
                (*(dst+r3))++;
            }

            srcX += xStep;
            srcY += yStep;
        }
     };
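
The post-pass that collapses the 256-entry table into the two masks is not shown above; a minimal sketch of it could look like the following (the name reduceCounts and the exact layout of _R are assumptions):

    #include <stdint.h>

    // Collapse the 256 per-byte-value counters into the 8-bit row mask
    // and the 32-bit column mask described in points 6 and 7.
    void reduceCounts( const int32_t counts[256], uint32_t* rowMask, uint32_t* colMask )
    {
        uint32_t rows = 0, cols = 0;
        for( int v = 0; v < 256; ++v )
        {
            if( counts[v] != 0 )
            {
                rows |= 1u << (v >> 5);    // upper 3 bits of the byte value -> Y row
                cols |= 1u << (v & 0x1F);  // lower 5 bits of the byte value -> X column
            }
        }
        *rowMask = rows; // only the low 8 bits are meaningful
        *colMask = cols;
    }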

Alexander_L_1
Beginner

bronxzv wrote:

Quote:

Alexander L. wrote:
For me the outcome is to use either your LUT version or Vladimir's AVX2 version.

 

sorry but I updated my post after your answer here

don't miss my comment about PSHUFB for the Y mask; PSHUFB allows a single-clock 4-bit to 8-bit vectorized LUT, i.e. a single-clock 4-bit to 8-bit mapping of 16/32/64 elements with 128/256/512-bit wide SIMD. This is arguably the optimal solution for the Y mask (ignoring the final reduction, which must be outside the core loop).

Yes, PSHUFB/W/... was also an attempt of mine for both X and Y, but I noticed that it is only possible for the Y mask, so I looked into how to do it for X, which has 5 bits, and excluded Y from the discussion so far. For X we need AVX2 if we want to keep it short, or much more code to cover this with SSE only.
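
For reference, the PSHUFB-based LUT for the 3-bit Y index that bronxzv describes could look like the following minimal sketch (SSSE3; the helper name YBits is only illustrative):

    #include <tmmintrin.h> // SSSE3: _mm_shuffle_epi8

    // Per-byte 1 << Y, where Y is the upper 3 bits of each input byte.
    // OR the result bytes together (outside the hot loop) to get the 8-bit row mask.
    static inline __m128i YBits( __m128i bytes )
    {
        const __m128i lutY = _mm_setr_epi8( 1, 2, 4, 8, 16, 32, 64, (char)128,
                                            0, 0, 0, 0, 0, 0, 0, 0 );
        // There is no per-byte shift in SSE: shift 16-bit lanes and mask down to the 3-bit index.
        const __m128i y = _mm_and_si128( _mm_srli_epi16( bytes, 5 ), _mm_set1_epi8( 7 ) );
        return _mm_shuffle_epi8( lutY, y ); // single-clock 4-bit -> 8-bit LUT
    }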

jimdempseyatthecove
Honored Contributor III

Someone who already has the other code implemented might want to try this non-SSE/AVX piece of code.

#include <tchar.h>

const int nX = 2048;
const int nY = 2048;
const int nXY = nX * nY;
unsigned char input[nXY];
static union {
 unsigned __int32 output[8];
 unsigned __int64 output64[4];
};

int _tmain(int argc, _TCHAR* argv[])
{
 register unsigned __int64 r0 = 0;
 register unsigned __int64 r1 = 0;
 register unsigned __int64 r2 = 0;
 register unsigned __int64 r3 = 0;
 register unsigned __int64 in;
 for( int i=0; i < nXY; i += 8) {
  in = *((unsigned __int64*)&input[i]); // next 8 input bytes
  for( int j=0; j < 8; ++j) {
   unsigned char xy = (unsigned char)in;
   in >>= 8;
   if( xy < 64)       r0 |= 1ull << xy;         // produces bit for 0:63
   else if( xy < 128) r1 |= 1ull << (xy - 64);  // produces bit for 64:127
   else if( xy < 192) r2 |= 1ull << (xy - 128); // produces bit for 128:191
   else               r3 |= 1ull << (xy - 192); // produces bit for 192:255
  }
 }
 output64[0] = r0;
 output64[1] = r1;
 output64[2] = r2;
 output64[3] = r3;
 return 0;
}

Note: adjust the above for the desired input size; it requires the input length to be a multiple of 8. The output is the 8x32-bit bitmap; you can reduce it to the row/column intersections if you want, as sketched below. The above should parallelize easily with a reduction on the final output.
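
A minimal sketch of that final reduction, assuming the union layout above (output[y] holds the 32 column bits of row y):

unsigned __int32 rowMask = 0, colMask = 0;
for( int y = 0; y < 8; ++y ) {
 colMask |= output[y];                       // column x was seen in some row
 rowMask |= (output[y] != 0 ? 1u : 0u) << y; // row y was seen at all
}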

Jim Dempsey

bronxzv
New Contributor II

Alexander L. wrote:
Yes, PSHUFB/W/... was also an attempt of mine for both X and Y, but I noticed that it is only possible for the Y mask, so I looked into how to do it for X, which has 5 bits

we can use PSHUFB also with 5 bits, with some extra selection code; the selection can be based on AND/ANDNOT instructions, which have very high throughput (3 such instructions per clock) on modern cores

here is how it can look (assuming the reduction is kept out of the loop, as already discussed):

#include <tmmintrin.h> // SSSE3: _mm_shuffle_epi8

// 16-byte LUTs: lut20 maps a low-4-bit index 0..7 to 1<<index (bytes 8..15 are 0),
// lut31 maps a low-4-bit index 8..15 to 1<<(index-8) (bytes 0..7 are 0).
const __m128i lut20 = _mm_set_epi32(0,0,0x80402010,0x08040201), lut31 = _mm_set_epi32(0x80402010,0x08040201,0,0);
const __m128i v15 = _mm_set1_epi8(15);

_forceinline void SetBits(const __m128i &x, __m128i &r3, __m128i &r2, __m128i &r1, __m128i &r0)
{
  const __m128i bit4Mask = _mm_cmpgt_epi8(x,v15);                                     // bit 4 of the 5-bit X index
  const __m128i res20 = _mm_shuffle_epi8(lut20,x), res31 = _mm_shuffle_epi8(lut31,x); // PSHUFB LUT on the low 4 bits
  r0 = _mm_or_si128(r0,_mm_andnot_si128(bit4Mask,res20)); // x in 0..7
  r1 = _mm_or_si128(r1,_mm_andnot_si128(bit4Mask,res31)); // x in 8..15
  r2 = _mm_or_si128(r2,_mm_and_si128(bit4Mask,res20));    // x in 16..23
  r3 = _mm_or_si128(r3,_mm_and_si128(bit4Mask,res31));    // x in 24..31
}
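
For completeness, a minimal sketch of how the four accumulators could be collapsed into the 32-bit column mask after the loop (the helper names are illustrative), given that r0..r3 hold the per-byte masks for X in 0..7, 8..15, 16..23 and 24..31:

// OR all 16 bytes of a vector together and return the resulting byte.
static inline unsigned int OrAllBytes(__m128i v)
{
  v = _mm_or_si128(v, _mm_srli_si128(v, 8)); // fold 16 bytes -> 8 -> 4 -> 2 -> 1
  v = _mm_or_si128(v, _mm_srli_si128(v, 4));
  v = _mm_or_si128(v, _mm_srli_si128(v, 2));
  v = _mm_or_si128(v, _mm_srli_si128(v, 1));
  return (unsigned int)_mm_cvtsi128_si32(v) & 0xFF;
}

static inline unsigned int ColumnMask(const __m128i &r0, const __m128i &r1, const __m128i &r2, const __m128i &r3)
{
  return OrAllBytes(r0) | (OrAllBytes(r1) << 8) | (OrAllBytes(r2) << 16) | (OrAllBytes(r3) << 24);
}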

test loop compiled for SSSE3:

.B39.7::                        ; Preds .B39.7 .B39.6
        movdqu    xmm10, XMMWORD PTR [176+rbp+rdx]              ;53.47
        movdqa    xmm8, xmm5                                    ;53.5
        movdqa    xmm7, xmm10                                   ;53.5
        pshufb    xmm8, xmm10                                   ;53.5
        movdqa    xmm9, xmm6                                    ;53.5
        pcmpgtb   xmm7, xmm0                                    ;53.5
        pshufb    xmm9, xmm10                                   ;53.5
        movdqa    xmm10, xmm7                                   ;53.5
        add       eax, 16                                       ;51.28
        pandn     xmm10, xmm8                                   ;53.5
        pand      xmm8, xmm7                                    ;53.5
        por       xmm1, xmm10                                   ;53.5
        movdqa    xmm10, xmm7                                   ;53.5
        pandn     xmm10, xmm9                                   ;53.5
        pand      xmm7, xmm9                                    ;53.5
        mov       edx, eax                                      ;51.28
        por       xmm2, xmm10                                   ;53.5
        por       xmm3, xmm8                                    ;53.5
        por       xmm4, xmm7                                    ;53.5
        cmp       eax, 1024                                     ;51.22
        jb        .B39.7        ; Prob 98%                      ;51.22

only low-latency / high-throughput instructions, plus register moves that can be eliminated on modern cores

this is potentially faster than the string cmp version (TBC)

AVX compilation:

.B39.7::                        ; Preds .B39.7 .B39.6
        vmovdqu   xmm10, XMMWORD PTR [176+rbp+rdx]              ;53.47
        vpshufb   xmm8, xmm3, xmm10                             ;53.5
        vpshufb   xmm7, xmm2, xmm10                             ;53.5
        vpcmpgtb  xmm9, xmm10, xmm0                             ;53.5
        add       eax, 16                                       ;51.28
        vpandn    xmm10, xmm9, xmm8                             ;53.5
        vpand     xmm8, xmm9, xmm8                              ;53.5
        vpor      xmm1, xmm1, xmm10                             ;53.5
        vpandn    xmm10, xmm9, xmm7                             ;53.5
        vpand     xmm7, xmm9, xmm7                              ;53.5
        vpor      xmm4, xmm4, xmm10                             ;53.5
        mov       edx, eax                                      ;51.28
        vpor      xmm5, xmm5, xmm8                              ;53.5
        vpor      xmm6, xmm6, xmm7                              ;53.5
        cmp       eax, 1024                                     ;51.22
        jb        .B39.7        ; Prob 98%                      ;51.22

more compact code, but it is probably only marginally faster than the SSSE3 path above

 
