Hello everyone!
This is my first post, so please be patient :)
I have a very interesting problem. I already have a working solution, but that solution does not make me happy.
First I will try to describe the problem as exactly as possible.
1. Let I (I for Image) be a 2D array of bytes.
2. Each byte contains 2 independent indices: the upper 3 bits are the Y-index, the lower 5 bits are the X-index.
3. This easily defines a translation I -> YX.
4. Note that YX can be described as a 2D bit image with only 8 rows and 32 columns, i.e. exactly 256 binary cells.
5. The full solution requires setting each cell addressed by I (see 1 and 2) to 1.
6. An acceptable solution can be reduced to separately calculated "set rows" and "set columns", i.e. 8 bits for the rows and 32 bits for the columns.
7. Two 32-bit registers/variables are therefore sufficient as output for the reduced solution (see the scalar sketch after this list).
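To make the required output concrete, here is a minimal scalar reference sketch of the reduced solution (the function and variable names are made up for illustration; this is the naive version, not the fast one I am asking about):

#include <stdint.h>
#include <stddef.h>

// Naive reference: for every image byte b, row (b >> 5) and column (b & 31)
// are marked as "seen". rowMask uses its low 8 bits, colMask all 32 bits.
void markRowsAndCols( const uint8_t* img, size_t n, uint32_t* rowMask, uint32_t* colMask)
{
    uint32_t rows = 0, cols = 0;
    for( size_t i = 0; i < n; ++i)
    {
        const uint8_t b = img[i];
        rows |= 1u << (b >> 5);  // upper 3 bits -> one of 8 rows
        cols |= 1u << (b & 31);  // lower 5 bits -> one of 32 columns
    }
    *rowMask = rows;
    *colMask = cols;
}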
I really have no idea how to implement this thing efficiently.
Moreover, I have found no instruction that converts a number into a "set bit", and I also have no idea how to do such a thing for a complete XMM register.
The current solution uses an array of 256 integers, where each entry is addressed by the byte index. Each entry is also counted (not only set), but this is not required.
Any better ideas?
Many thanks in advance!
#include <smmintrin.h> // SSE4.1 intrinsics (_mm_extract_epi8)

typedef unsigned char uint8;

void countByteIndex( int width, int height, int dStep, int iStep, void* _D, void* _I, void* _R)
{
int x, y;
const int xStep = dStep;
const int yStep = iStep;
uint8* srcX = (uint8*)_D;
uint8* srcY = (uint8*)_I;
__int32* dst = (__int32*)_R;
for( y = 0; y < height; y++) // single line at once
{
register const __m128i *srcX0 = (const __m128i *)(srcX);
register const __m128i *srcY0 = (const __m128i *)(srcY);
register __m128i sX0, sY0;
register int r0, r1, r2, r3;
for( x = 0; x < width; x += 16 ) // 16 bytes at once per line
{
sX0 = _mm_load_si128( srcX0++ ); // Loads 128-bit value. Aligned.
sY0 = _mm_load_si128( srcY0++ ); // Loads 128-bit value. Aligned.
sX0 = _mm_and_si128( sX0, sY0 ); // Mask destination offset.
// Use each byte of sX0 as an index into the 256-entry table
r0 = _mm_extract_epi8(sX0, 0);
r1 = _mm_extract_epi8(sX0, 1);
r2 = _mm_extract_epi8(sX0, 2);
r3 = _mm_extract_epi8(sX0, 3);
(*(dst+r0))++; // sufficient is also (*(dst+r0)) = 1 for all (*(dst+..))++
(*(dst+r1))++;
(*(dst+r2))++;
(*(dst+r3))++;
r0 = _mm_extract_epi8(sX0, 4);
r1 = _mm_extract_epi8(sX0, 5);
r2 = _mm_extract_epi8(sX0, 6);
r3 = _mm_extract_epi8(sX0, 7);
(*(dst+r0))++;
(*(dst+r1))++;
(*(dst+r2))++;
(*(dst+r3))++;
r0 = _mm_extract_epi8(sX0, 8);
r1 = _mm_extract_epi8(sX0, 9);
r2 = _mm_extract_epi8(sX0, 10);
r3 = _mm_extract_epi8(sX0, 11);
(*(dst+r0))++;
(*(dst+r1))++;
(*(dst+r2))++;
(*(dst+r3))++;
r0 = _mm_extract_epi8(sX0, 12);
r1 = _mm_extract_epi8(sX0, 13);
r2 = _mm_extract_epi8(sX0, 14);
r3 = _mm_extract_epi8(sX0, 15);
(*(dst+r0))++;
(*(dst+r1))++;
(*(dst+r2))++;
(*(dst+r3))++;
}
srcX += xStep;
srcY += yStep;
}
}
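Just to relate this back to points 6 and 7: the two masks can be derived from the filled table with a trivial post-pass, sketched here (the helper name is made up; it assumes dst[i] is nonzero exactly when byte value i occurred):

// Collapse the 256-entry table into the reduced output: an 8-bit row mask
// (upper 3 bits) and a 32-bit column mask (lower 5 bits).
void reduceTable( const __int32* dst, unsigned int* rowMask, unsigned int* colMask)
{
    unsigned int rows = 0, cols = 0;
    for( int i = 0; i < 256; ++i)
    {
        if( dst[i])
        {
            rows |= 1u << (i >> 5);  // Y-index: which of the 8 rows
            cols |= 1u << (i & 31);  // X-index: which of the 32 columns
        }
    }
    *rowMask = rows;
    *colMask = cols;
}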
Tags: Intel® Advanced Vector Extensions (Intel® AVX), Intel® Streaming SIMD Extensions, Parallel Computing
bronxzv wrote:
Quote:
Alexander L. wrote: For me the outcome is to use either your LUT version or Vladimir's AVX2 version.
sorry, but I updated my post after your answer here
don't miss my comment about PSHUFB for the Y mask: PSHUFB allows a single-clock 4-bit x 8-bit vectorized LUT, i.e. a single-clock 4-bit to 8-bit mapping of 16/32/64 elements with 128/256/512-bit wide SIMD. This is arguably the optimal solution for the Y mask (ignoring the final reduction, which must be outside the core loop).
Yes, PSHUFB/W/... was also an attempt of mine for both X and Y, but I noticed that it only works for the Y mask, so I looked for a way to handle X, which has 5 bits, and excluded Y from the discussion so far. For X we need AVX2 if we want to keep it short, or much more code to cover it with SSE only.
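For illustration, a minimal sketch of the PSHUFB-as-LUT idea for the Y mask alone (upper 3 bits), assuming SSSE3; the helper name and the 16-bit-shift-plus-mask trick for the missing per-byte shift are just one possible way to do it:

#include <tmmintrin.h> // SSSE3 (_mm_shuffle_epi8)

// Per 16 input bytes: extract Y = byte >> 5 and OR (1 << Y) into the per-lane
// accumulator acc. A final horizontal OR of the 16 accumulator bytes (outside
// the hot loop) yields the 8-bit row mask.
static inline __m128i yMaskStep( __m128i bytes, __m128i acc)
{
    const __m128i lut = _mm_set_epi8( 0, 0, 0, 0, 0, 0, 0, 0,
                                      (char)0x80, 0x40, 0x20, 0x10,
                                      0x08, 0x04, 0x02, 0x01); // lut[i] = 1 << i
    // byte >> 5, emulated with a 16-bit shift plus a low-3-bit mask
    const __m128i y = _mm_and_si128( _mm_srli_epi16( bytes, 5), _mm_set1_epi8( 0x07));
    return _mm_or_si128( acc, _mm_shuffle_epi8( lut, y));
}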
Someone who has the other code implemented might try this non-SSE/AVX piece of code.
#include <tchar.h>   // _tmain / _TCHAR

const int nX = 2048;
const int nY = 2048;
const int nXY = nX * nY;
unsigned char input[nXY];                  // one index byte per pixel

static union
{
    unsigned __int32 output[8];            // the 8x32-bit bitmap
    unsigned __int64 output64[4];
};

int _tmain(int argc, _TCHAR* argv[])
{
    register unsigned __int64 r0 = 0;      // bits 0:63
    register unsigned __int64 r1 = 0;      // bits 64:127
    register unsigned __int64 r2 = 0;      // bits 128:191
    register unsigned __int64 r3 = 0;      // bits 192:255
    register unsigned __int64 in;
    for( int i=0; i < nXY; i += 8)         // 8 index bytes per iteration
    {
        in = *((unsigned __int64*)&input[i]);
        for( int j=0; j < 8; ++j)
        {
            unsigned char xy = (unsigned char)in;
            in >>= 8;
            unsigned __int64 bit = 1ull << (xy & 63);  // shift count masked to 0..63
            if( xy < 64)       r0 |= bit;  // produces bit for 0:63
            else if( xy < 128) r1 |= bit;  // produces bit for 64:127
            else if( xy < 192) r2 |= bit;  // produces bit for 128:191
            else               r3 |= bit;  // produces bit for 192:255
        }
    }
    output64[0] = r0;
    output64[1] = r1;
    output64[2] = r2;
    output64[3] = r3;
    return 0;
}
Note: adjust the above for the desired input size; it requires the input length to be a multiple of 8. The output is the 8x32-bit bitmap; you can reduce it to the intersections if you want. The above should parallelize easily with a reduction on the final output.
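For illustration, one way that reduction could look with OpenMP (requires OpenMP 4.0+ for the bitwise-OR reduction clause; the function name and the branchy bit selection are just an example, not tested code):

#include <stdint.h>

// Each thread accumulates its own four 64-bit masks; OpenMP ORs them together
// at the end of the parallel loop.
void markBitsParallel( const unsigned char* input, int nXY, uint64_t out[4])
{
    uint64_t r0 = 0, r1 = 0, r2 = 0, r3 = 0;
    #pragma omp parallel for reduction(|:r0,r1,r2,r3)
    for( int i = 0; i < nXY; ++i)
    {
        const unsigned char xy = input[i];
        const uint64_t bit = 1ull << (xy & 63);
        if( xy < 64)       r0 |= bit;  // bits 0:63
        else if( xy < 128) r1 |= bit;  // bits 64:127
        else if( xy < 192) r2 |= bit;  // bits 128:191
        else               r3 |= bit;  // bits 192:255
    }
    out[0] = r0; out[1] = r1; out[2] = r2; out[3] = r3;
}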
Jim Dempsey
Alexander L. wrote:
Yes, PSHUFB/W/... was also an attempt of mine for both X and Y, but I noticed that it only works for the Y mask, so I looked for a way to handle X, which has 5 bits
we can use PSHUFB also with 5 bits, with some extra selection code; the selection can be based on and/andnot instructions, which have very high throughput (3 such instructions per clock) on modern cores
here is how it can look (assuming the reduction is kept out of the loop, as already discussed):
const __m128i lut20 = _mm_set_epi32(0,0,0x80402010,0x08040201),
              lut31 = _mm_set_epi32(0x80402010,0x08040201,0,0);
const __m128i v15 = _mm_set1_epi8(15);

_forceinline void SetBits(const __m128i &x, __m128i &r3, __m128i &r2, __m128i &r1, __m128i &r0)
{
  const __m128i bit4Mask = _mm_cmpgt_epi8(x,v15);
  const __m128i res20 = _mm_shuffle_epi8(lut20,x),
                res31 = _mm_shuffle_epi8(lut31,x);
  r0 = _mm_or_si128(r0,_mm_andnot_si128(bit4Mask,res20));
  r1 = _mm_or_si128(r1,_mm_andnot_si128(bit4Mask,res31));
  r2 = _mm_or_si128(r2,_mm_and_si128(bit4Mask,res20));
  r3 = _mm_or_si128(r3,_mm_and_si128(bit4Mask,res31));
}
test loop compiled for SSSE3:
.B39.7::                              ; Preds .B39.7 .B39.6
        movdqu    xmm10, XMMWORD PTR [176+rbp+rdx]   ;53.47
        movdqa    xmm8, xmm5                         ;53.5
        movdqa    xmm7, xmm10                        ;53.5
        pshufb    xmm8, xmm10                        ;53.5
        movdqa    xmm9, xmm6                         ;53.5
        pcmpgtb   xmm7, xmm0                         ;53.5
        pshufb    xmm9, xmm10                        ;53.5
        movdqa    xmm10, xmm7                        ;53.5
        add       eax, 16                            ;51.28
        pandn     xmm10, xmm8                        ;53.5
        pand      xmm8, xmm7                         ;53.5
        por       xmm1, xmm10                        ;53.5
        movdqa    xmm10, xmm7                        ;53.5
        pandn     xmm10, xmm9                        ;53.5
        pand      xmm7, xmm9                         ;53.5
        mov       edx, eax                           ;51.28
        por       xmm2, xmm10                        ;53.5
        por       xmm3, xmm8                         ;53.5
        por       xmm4, xmm7                         ;53.5
        cmp       eax, 1024                          ;51.22
        jb        .B39.7        ; Prob 98%           ;51.22
only low-latency / high-throughput instructions, plus register moves that can be eliminated on modern cores
this is potentially faster than the string cmp version (TBC)
AVX compilation:
.B39.7::                              ; Preds .B39.7 .B39.6
        vmovdqu   xmm10, XMMWORD PTR [176+rbp+rdx]   ;53.47
        vpshufb   xmm8, xmm3, xmm10                  ;53.5
        vpshufb   xmm7, xmm2, xmm10                  ;53.5
        vpcmpgtb  xmm9, xmm10, xmm0                  ;53.5
        add       eax, 16                            ;51.28
        vpandn    xmm10, xmm9, xmm8                  ;53.5
        vpand     xmm8, xmm9, xmm8                   ;53.5
        vpor      xmm1, xmm1, xmm10                  ;53.5
        vpandn    xmm10, xmm9, xmm7                  ;53.5
        vpand     xmm7, xmm9, xmm7                   ;53.5
        vpor      xmm4, xmm4, xmm10                  ;53.5
        mov       edx, eax                           ;51.28
        vpor      xmm5, xmm5, xmm8                   ;53.5
        vpor      xmm6, xmm6, xmm7                   ;53.5
        cmp       eax, 1024                          ;51.22
        jb        .B39.7        ; Prob 98%           ;51.22
more compact code but it is probably only marginally faster than the SSSE3 path above
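For completeness, one possible final reduction out of the hot loop, assuming r0..r3 accumulate the bit slices for X values 0-7, 8-15, 16-23 and 24-31 respectively (as SetBits above produces them); the helper names are made up:

#include <emmintrin.h> // SSE2
#include <stdint.h>

// OR the 16 byte lanes of one accumulator down to a single 8-bit slice.
static inline uint32_t horizontalOr8( __m128i r)
{
    r = _mm_or_si128( r, _mm_srli_si128( r, 8)); // 16 bytes -> 8
    r = _mm_or_si128( r, _mm_srli_si128( r, 4)); // 8 -> 4
    r = _mm_or_si128( r, _mm_srli_si128( r, 2)); // 4 -> 2
    r = _mm_or_si128( r, _mm_srli_si128( r, 1)); // 2 -> 1
    return (uint32_t)( _mm_cvtsi128_si32( r) & 0xFF);
}

// Assemble the 32-bit column mask: bit n is set when X value n occurred.
uint32_t reduceColumns( __m128i r0, __m128i r1, __m128i r2, __m128i r3)
{
    return  horizontalOr8( r0)
         | (horizontalOr8( r1) << 8)
         | (horizontalOr8( r2) << 16)
         | (horizontalOr8( r3) << 24);
}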
