Intel® ISA Extensions

Indirect Bit Indexing and Set

Alexander_L_1

Hello everyone!

This is my first post, so please be patient :)
I have a very interesting problem. I also have a working solution, but it doesn't make me happy.
First, I will try to describe the problem as exactly as possible.
1. Let I (I for Image) be a 2D array of bytes.
2. Each byte contains two independent indices: the upper 3 bits are the Y index, the lower 5 bits are the X index.
3. This defines a simple translation I -> YX.
4. Note that YX can be described as a 2D bit image with only 8 rows and 32 columns, i.e. exactly 256 binary cells.
5. The full solution would require setting to 1 each cell addressed by I (see 1 and 2).
6. An acceptable solution can be reduced to separately calculated "set rows" and "set columns", i.e. 8 bits for the rows and 32 bits for the columns.
7. Two 32-bit registers/variables are sufficient as output for the acceptable solution.

 

I really have no idea how to implement this efficiently.

Moreover, I've found no instruction to convert a number to "set bits", and I also have no idea how to do such a thing for a complete XMM register.

The current solution uses an array of 256 integers, where each entry is addressed by the byte index. Each entry is also counted (not just set), but this is not required.
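
For illustration, a minimal scalar sketch of that 256-entry table baseline (not the original project code; the parameter names are placeholders, and the SSE version of the real routine follows below):

    void countByteIndexScalar( const unsigned char* I, int n, int counts[256] )
    {
        // counts[b] > 0  <=>  bit cell (row = b >> 5, column = b & 31) of the 8x32 image is set
        for ( int i = 0; i < n; i++ )
            counts[I[i]]++;
    }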

Any better ideas?

 

Many thanks in advance!

    typedef unsigned char uint8; // assumed project-wide typedef (not shown in the original post)

    void countByteIndex( int width, int height, int dStep, int iStep, void* _D, void* _I, void* _R)
    {
        int x, y;

        const int xStep = dStep;
        const int yStep = iStep;

        uint8* srcX = (uint8*)_D;
        uint8* srcY = (uint8*)_I;

        __int32* dst = (__int32*)_R;

        for( y = 0; y < height; y++) // single line at once
        {

            register const __m128i *srcX0 = (const __m128i *)(srcX);
            register const __m128i *srcY0 = (const __m128i *)(srcY);
            register __m128i sX0, sY0;

            register int r0, r1, r2, r3;

            for( x = 0; x < width; x += 16 ) // 16 bytes at once per line

            {
                 sX0 = _mm_load_si128( srcX0++ ); // Loads 128-bit value. Aligned.
                 sY0 = _mm_load_si128( srcY0++ ); // Loads 128-bit value. Aligned.

                 sX0 = _mm_and_si128( sX0, sY0 ); // Mask destination offset.

                // Index sX0
                r0 = _mm_extract_epi8(sX0, 0);
                r1 = _mm_extract_epi8(sX0, 1);
                r2 = _mm_extract_epi8(sX0, 2);
                r3 = _mm_extract_epi8(sX0, 3);

                (*(dst+r0))++; // setting (*(dst+r0)) = 1 would also be sufficient, for this and all increments below
                (*(dst+r1))++;
                (*(dst+r2))++;
                (*(dst+r3))++;

                r0 = _mm_extract_epi8(sX0, 4);
                r1 = _mm_extract_epi8(sX0, 5);
                r2 = _mm_extract_epi8(sX0, 6);
                r3 = _mm_extract_epi8(sX0, 7);

                (*(dst+r0))++;
                (*(dst+r1))++;
                (*(dst+r2))++;
                (*(dst+r3))++;

                r0 = _mm_extract_epi8(sX0, 8);
                r1 = _mm_extract_epi8(sX0, 9);
                r2 = _mm_extract_epi8(sX0, 10);
                r3 = _mm_extract_epi8(sX0, 11);

                (*(dst+r0))++;
                (*(dst+r1))++;
                (*(dst+r2))++;
                (*(dst+r3))++;

                r0 = _mm_extract_epi8(sX0, 12);
                r1 = _mm_extract_epi8(sX0, 13);
                r2 = _mm_extract_epi8(sX0, 14);
                r3 = _mm_extract_epi8(sX0, 15);

                (*(dst+r0))++;
                (*(dst+r1))++;
                (*(dst+r2))++;
                (*(dst+r3))++;
            }

            srcX += xStep;
            srcY += yStep;
        }
     }

bronxzv

Alexander L. wrote:

I think, with AVX2 it will perform much better, because two _mm_cmpistrm can be done with one and _mm_or_si128 can be omitted.

there is no VEX.256 variant of VPCMPISTRM; you can see it in the Intrinsics Guide here: https://software.intel.com/sites/landingpage/IntrinsicsGuide/ (select the String Compare checkbox on the left)

this is a long-latency instruction (11 clocks on HSW as per the intrinsics guide); a 256-bit variant would probably be around 20 clocks with a similar implementation

merging the two 16-bit subresults is comparatively almost free; btw, instead of doing a SHIFT + OR as you do, I'd advise doing it with a single unpack:

__forceinline unsigned int get32Bits(const __m128i &a, const __m128i &bh, const __m128i &bl)
{
  // mode2 is the _mm_cmpistrm mode constant defined earlier in the thread (not shown here)
  return _mm_cvtsi128_si32(_mm_unpacklo_epi16(_mm_cmpistrm(a, bl, mode2), _mm_cmpistrm(a, bh, mode2)));
}

 

going forward, optimizing it further is highly dependent on the size of your typical workload. I suppose you plan to process more than 16 elements for a given mask; in that case it will probably be beneficial to introduce an early out when the XY mask is full (all 1s), or, more subtly, to compute X only once Y == 0xff, or to compute Y only once X == 0xffffffff
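
For illustration, a rough sketch of such an early out; computeYMask16/computeXMask16 are hypothetical stand-ins for whatever 16-element kernels are used (e.g. the string-compare or PSHUFB variants discussed in this thread), and the input pointer is assumed 16 B aligned:

    #include <emmintrin.h>

    unsigned computeYMask16(__m128i v);   // hypothetical: returns the 8-bit "set rows" mask
    unsigned computeXMask16(__m128i v);   // hypothetical: returns the 32-bit "set columns" mask

    void buildMasks(const unsigned char *input, int n, unsigned *xOut, unsigned *yOut)
    {
        unsigned xMask = 0, yMask = 0;
        for (int i = 0; i + 16 <= n; i += 16)
        {
            const __m128i v = _mm_load_si128((const __m128i *)(input + i));
            if (yMask != 0xFFu)       yMask |= computeYMask16(v);
            if (xMask != 0xFFFFFFFFu) xMask |= computeXMask16(v);
            if (yMask == 0xFFu && xMask == 0xFFFFFFFFu)
                break;   // both projections are full, nothing more can change
        }
        *xOut = xMask;
        *yOut = yMask;
    }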

jimdempseyatthecove

Alexander,

>>This is my first post, so please be patient :)
...
>>4. Note that YX can be described as a 2D bit image with only 8 rows and 32 columns, i.e. exactly 256 binary cells
...
>>7. Two 32-bit registers/variables are sufficient as output for the acceptable solution.

Can you explain how 64 bits can represent 256 bits?

From my understanding of your problem statement in post #1, the following simple C code represents the problem:

char resultAsChar[256]; // temporary result
__int64 resultAsBits[256/64]; // desired result (256 bits)
...
for(int I=0; I < 256; ++I) resultAsChar[I] = 0; // wipe
int nXY = nX * nY; // size of input 2D array pointed to by unsigned char* input
for(int I=0; I < nXY; ++I) resultAsChar[input[I]] = 1; // set byte to 1 when referenced by input
// reduce resultAsChar to bits in resultAsBits
__int64 mask = 0;
for(int I=0; I < 64; ++I) mask = mask + mask + resultAsChar[I];
resultAsBits[0] = mask;
mask = 0;
for(int I=64; I < 128; ++I) mask = mask + mask + resultAsChar[I];
resultAsBits[1] = mask;
mask = 0;
for(int I=128; I < 192; ++I) mask = mask + mask + resultAsChar[I];
resultAsBits[2] = mask;
mask = 0;
for(int I=192; I < 256; ++I) mask = mask + mask + resultAsChar[I];
resultAsBits[3] = mask;

Now then, does the above properly represent your problem statement #1, at least for setting of the bit image?

Jim Dempsey

bronxzv

jimdempseyatthecove wrote:
Can you explain how 64 bits can represent 256 bits?

it's an area vs. perimeter thing

as per the specs:

6. An acceptable solution can be reduced to separately calculated "set rows" and "set columns", i.e. 8 bits for the rows and 32 bits for the columns

so, for some reason, instead of a 32x8 2D bitmap it's enough for Alexander to have two 1D projections on the X/Y axes (32+8 bits)

jimdempseyatthecove

>>so, for some reason, instead of a 32x8 2D bitmap it's enough for Alexander to have 2 1D projections on the X/Y axes (32+8 bits)

__int64 result = 0; // hhhhhhhh llllllllllllllllllllllllllllllll (8 bits high, 32 bits low)
for(int I=0; I < nXY; ++I)
  result |= ((__int64)1 << (input[I] & 0x1F)) + ((__int64)1 << ((input[I] >> 5) + 32));

Wouldn't the above satisfy the 2 1D projections?

Jim Dempsey

bronxzv

jimdempseyatthecove wrote:

>>so, for some reason, instead of a 32x8 2D bitmap it's enough for Alexander to have 2 1D projections on the X/Y axes (32+8 bits)

__int64 result = 0; // hhhhhhhh llllllllllllllllllllllllllllllll (8 bits high, 32 bits low)
for(int I=0; I < nXY; ++I)
  result |= ((__int64)1 << (input[I] & 0x1F)) + ((__int64)1 << ((input[I] >> 5) + 32));

Wouldn't the above satisfy the 2 1D projections?

Jim Dempsey

this looks alright and pretty similar to my example here https://software.intel.com/en-us/forums/topic/537275#comment-1808546

where I use a lookup table to do the equivalent of

result |= __int64(1 << (input[i] & 0x1F)) << 8 | (1 << (input[i] >> 5));
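
For illustration, a minimal sketch of such a precomputed table mirroring the formula above (a 64-bit accumulator with the X bit in bits 8..39 and the Y bit in bits 0..7 is assumed; this is not the exact code behind the linked post):

    static unsigned __int64 lut[256];

    void initLut()
    {
        for (int v = 0; v < 256; v++)
            lut[v] = ((unsigned __int64)1 << ((v & 0x1F) + 8)) | ((unsigned __int64)1 << (v >> 5));
    }

    unsigned __int64 accumulate(const unsigned char *input, int n)
    {
        unsigned __int64 result = 0;
        for (int i = 0; i < n; i++)
            result |= lut[input[i]];   // one table lookup and one OR per element
        return result;
    }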

 

you can see other examples in this thread for faster solutions

Vladimir_Sedach

Hi bronxzv,

Would you please run this on your test bench?
It is the maximum I could squeeze out of AVX2.

I ran it as 64-bit, with /O2 /QxCORE-AVX2 and 32-byte aligned input.
 

typedef unsigned char    __u8;
typedef unsigned int    __u32;

//*******************************
__inline __u8 or_bytes(__m256i v)
//*******************************
{
    __m128i    r;
    __u64    x;

    v = _mm256_or_si256(v, _mm256_unpackhi_epi64(v, v));

    r = _mm256_extracti128_si256(v, 1);
    r = _mm_or_si128(r, _mm256_castsi256_si128(v));

    x = _mm_cvtsi128_si64(r);
    x |= x >> 32;
    x |= x >> 16;
    return x | (x >> 8);
}

//*********************************
__inline __u32 or_dwords(__m256i v)
//*********************************
{
    __m256i    r;
    __m128i    r1;

    r = _mm256_unpackhi_epi64(v, v);
    v = _mm256_or_si256(v, r);

    r = _mm256_shuffle_epi32(v, 1);
    v = _mm256_or_si256(v, r);

    r1 = _mm256_extracti128_si256(v, 1);
    r1 = _mm_or_si128(r1, _mm256_castsi256_si128(v));

    return _mm_cvtsi128_si32(r1);
}

//***************************************************
__inline __u32 set_bits(const void *yx, __u32 &y_ret)
//***************************************************
{
    __m256i    bit = _mm256_set_epi32(0, 0, 0x80402010, 0x08040201, 0, 0, 0x80402010, 0x08040201);
    __m256i    one = _mm256_set1_epi32(1);
    __m256i    mask7 = _mm256_set1_epi8(0x07);
    __m256i    mask1F = _mm256_set1_epi8(0x1F);
    __m256i    v, x, y;
    __m256i    x0, x1, x2, x3;
    __m128i    xh;

    v = _mm256_loadu_si256((__m256i *)yx);      // 32 packed YX bytes

    y = _mm256_srli_epi16(v, 5);                // bring each byte's upper 3 bits down
    y = _mm256_and_si256(y, mask7);             // keep the 3-bit Y index per byte
    y = _mm256_shuffle_epi8(bit, y);            // per-byte LUT via PSHUFB: index -> 1 << index
    y_ret = or_bytes(y);                        // OR-reduce the 32 bytes to the 8-bit Y mask

    x = _mm256_and_si256(v, mask1F);            // 5-bit X index per byte
    x0 = _mm256_cvtepu8_epi32(_mm256_castsi256_si128(x));                          // widen bytes 0..7
    x1 = _mm256_cvtepu8_epi32(_mm256_castsi256_si128(_mm256_unpackhi_epi64(x, x))); // widen bytes 8..15
    x0 = _mm256_sllv_epi32(one, x0);            // 1 << x per 32-bit lane
    x1 = _mm256_sllv_epi32(one, x1);

    xh = _mm256_extracti128_si256(x, 1);        // same for the upper lane (bytes 16..31)
    x2 = _mm256_cvtepu8_epi32(xh);
    x3 = _mm256_cvtepu8_epi32(_mm_unpackhi_epi64(xh, xh));
    x2 = _mm256_sllv_epi32(one, x2);
    x3 = _mm256_sllv_epi32(one, x3);
    x2 = _mm256_or_si256(x2, x3);               // merge the four partial vectors
    x0 = _mm256_or_si256(x0, x1);
    x = _mm256_or_si256(x0, x2);
    return or_dwords(x);                        // OR-reduce to the 32-bit X mask
}

bronxzv

Vladimir Sedach wrote:

Hi bronxzv,

Would you please run this on your test bench?
It is maximum I could squeeze from AVX2.

it compiles OK after adding this line:

typedef unsigned __int64 __u64;

I have slightly modified the test bench to adapt it to the 32-element granularity (and 32 B alignment), and the timings of your new solution look very good

I measure 274-277 ms with this AVX2 version, this is more or less the same as the preceding winner using the string-compare instructions (the SSE4.2 version)

Vladimir_Sedach

bronxzv wrote:

I measure 274-277 ms with this AVX2 version, this is more or less the same as the preceding winner using the string-compare instructions (the SSE4.2 version)



Thanks.
It unexpectedly depends quite a lot on 32-byte data alignment on my machine (about 10%).
Can you confirm that?

bronxzv

Vladimir Sedach wrote:
It unexpectedly depends quite a lot on 32-byte data alignment on my machine (about 10%).
Can you confirm that?

I get around a 6% slowdown without proper alignment:

32 B alignment:  274 - 276 ms

16 B alignment: 291 - 292 ms

no alignment (pointers with LSB = 1) : 291 - 292 ms

 

Vladimir_Sedach

Thank you.

bronxzv

@Vladimir

after noticing a lot of scalar code in your or_bytes function, and out of curiosity, I checked what the Intel vectorizer generates

the result is more vectorized but slightly slower than your code (296 - 297 ms), though still way faster than what a dumb compiler would do

source code and ASM dump (for the non-inlined version) below

__inline __u8 or_bytes(const __m256i &v)
{
  const __u8 *vBytes = (const __u8 *)&v;
  __u8 res = 0;
  for (int i=0; i<32; i++) res |= vBytes[i];
  return res;
}
        vmovdqu   ymm0, YMMWORD PTR [rcx]                       ;213.28
$LN2:
        vextracti128 xmm1, ymm0, 1                              ;212.12
$LN3:
        vpor      xmm2, xmm0, xmm1                              ;212.12
$LN4:
        vpshufd   xmm3, xmm2, 14                                ;212.12
$LN5:
        vpor      xmm4, xmm2, xmm3                              ;212.12
$LN6:
        vpshufd   xmm5, xmm4, 57                                ;212.12
$LN7:
        vpor      xmm0, xmm4, xmm5                              ;212.12
$LN8:
        vpsrldq   xmm1, xmm0, 2                                 ;212.12
$LN9:
        vpor      xmm2, xmm0, xmm1                              ;212.12
$LN10:
        vpsrldq   xmm3, xmm2, 1                                 ;212.12
$LN11:
        vpor      xmm4, xmm2, xmm3                              ;212.12
$LN12:
        vmovd     eax, xmm4                                     ;212.12
$LN13:

 

Vladimir_Sedach

bronxzv wrote:

a lot of scalar code in your or_bytes function


or_bytes() and or_dwords() together take 33% of all time (!).
My first try with or_bytes() was approximately the same as the compiler's.
Using general-purpose registers gave a few percent of speedup overall, but if you consider or_bytes() alone, it is much more than just a few percent.
Unfortunately, this trick did almost nothing for or_dwords().


Isn't it rather "strange" that the compiler loads ymm0 from memory with vmovdqu ymm0, YMMWORD PTR [rcx]?
 

 

bronxzv

Vladimir Sedach wrote:
or_bytes() and or_dwords() together take 33% of all time (!).

this is typical of reduction code; with bigger working sets the OR reduce will be done only once at the end, and your version will easily beat the string-compare version

speaking of the string-compare instructions, I'd say they are at a dead end going forward to wider SIMD; much like horizontal operations, they aren't truly parallelizable in the hardware implementation, so we see only some anecdotal speedups for narrow SIMD. I wouldn't use them personally just because they are faster for some legacy paths >= SSE4.2 && < AVX2

Vladimir Sedach wrote:
Unfortunately, this trick did almost nothing with or_dwords().

speaking of 32-bit OR reduce, I tried to compile an example using the dedicated AVX-512 intrinsic; here is how it looks when compiled:

;;;   return _mm512_reduce_or_epi32(a);

        vmovups   zmm0, ZMMWORD PTR [rcx]                       ;112.10
        vshuff32x4 zmm1, zmm0, zmm0, 238                        ;112.10
        vpord     zmm3, zmm1, ZMMWORD PTR [rcx]                 ;112.10
        vshuff32x4 zmm2, zmm3, zmm3, 85                         ;112.10
        vpord     zmm4, zmm2, zmm3                              ;112.10
        vpshufd   zmm5, zmm4, 78                                ;112.10
        vpord     zmm16, zmm4, zmm5                             ;112.10
        vpshufd   zmm17, zmm16, 177                             ;112.10
        vpord     zmm18, zmm16, zmm17                           ;112.10
        vmovd     eax, xmm18                                    ;112.10
        ret                                                     ;112.10

Vladimir Sedach wrote:
Isn't it very "strange" that compiler copies ymm0 from the memory by vmovdqu   ymm0, YMMWORD PTR [rcx]?

the version shown in the ASM dump is not inlined, unlike the one used for the full test

Alexander_L_1

bronxzv wrote:

Quote:

Vladimir Sedach wrote:
or_bytes() and or_dwords() together take 33% of all time (!).

 

this is typical for reduction code, with bigger worksets OR reduce will be done only once at the end

That's absolutely correct!

bronxzv wrote:

and your version will easily beat the string-compare version.

That doesn't seem to be so easy, because we have not yet seen the SSE path (as a reminder, I can't use AVX2 at all when updating the software on old systems in the field in our absence, nor currently on newer systems due to an MS bug).

bronxzv wrote:

speaking of the string-compare instructions, I'd say they are at a dead end going forward to wider SIMD; much like horizontal operations, they aren't truly parallelizable in the hardware implementation, so we see only some anecdotal speedups for narrow SIMD,



Can you please explain what "anecdotal speedups for narrow SIMD" means?

AVX2 can simply beat the SSE string compare, but what is the solution without AVX2?

bronxzv wrote:

I wouldn't use them personally just because they are faster for some legacy paths >= SSE4.2 && < AVX2

I fully agree, if AVX2 is available and usable. Moreover, it will be easily extendable to AVX-512, and we can assume string compare will not be part of AVX-512. But it's unclear whether there will be much speedup at all, because memory will be the bottleneck.

bronxzv wrote:

Quote:

Vladimir Sedach wrote:
Unfortunately, this trick did almost nothing with or_dwords().

 

speaking of 32-bit OR reduce, I tried to compile an example using the AVX-512 dedicated intrinsic, here is how it looks when compiled:

;;;   return _mm512_reduce_or_epi32(a);

        vmovups   zmm0, ZMMWORD PTR [rcx]                       ;112.10
        vshuff32x4 zmm1, zmm0, zmm0, 238                        ;112.10
        vpord     zmm3, zmm1, ZMMWORD PTR [rcx]                 ;112.10
        vshuff32x4 zmm2, zmm3, zmm3, 85                         ;112.10
        vpord     zmm4, zmm2, zmm3                              ;112.10
        vpshufd   zmm5, zmm4, 78                                ;112.10
        vpord     zmm16, zmm4, zmm5                             ;112.10
        vpshufd   zmm17, zmm16, 177                             ;112.10
        vpord     zmm18, zmm16, zmm17                           ;112.10
        vmovd     eax, xmm18                                    ;112.10
        ret                                                     ;112.10

Quote:

Vladimir Sedach wrote:
Isn't it very "strange" that compiler copies ymm0 from the memory by vmovdqu   ymm0, YMMWORD PTR [rcx]?

 

the version shown in the ASM dump is not inlined, unlike the one used for the full test

It would be interesting to see and compare an SSE-only version.

There may be more to do/optimize here.
The number of registers used can be important.
Preferring to use only XMM0-XMM7 (YMM0-YMM7) can give a speedup (this was true on some older cores).
Last but not least, if we have free registers, we can use bulk reads (4 loads at a time) and instruction reordering to achieve pairing and parallel execution.

What I also wonder about is that you saw a 6% speedup with AVX2 when memory is aligned to 32 bytes. That is completely new and surprising to me. Is that only important for AVX2, or for older AVX/SSE too? Does it depend on the specific processor or memory manufacturer? That's important to me, because I've already had to port some older SSE methods to AVX2 and the speedup was only 1-2%.

bronxzv

Alexander L. wrote:

Quote:

bronxzv wrote:

 

Quote:

Vladimir Sedach wrote:
or_bytes() and or_dwords() together take 33% of all time (!).

 

this is typical for reduction code, with bigger worksets OR reduce will be done only once at the end

 

That's absolutely correct!

and it's for this very reason that it doesn't make much sense to compare timings of the different solutions without knowing the size of your typical workload, i.e. what is the average/median size of your input array I for a given pair of X/Y result masks?

Alexander L. wrote:
That seems not to be so easy, because we have not seen the SSE path

I was speaking of Vladimir's AVX2 version; it's already slightly faster for 32 elements than your SSE 4.2 version, and it will be a lot faster for 64 elements or more. For example, his critical path for the Y mask (without reduction and with loop invariants in registers) is only 3 clocks of latency for 32 elements, including the clever 1-clock-latency 3-bit x 8-bit LUT using VPSHUFB; that's more than 3x faster than a single string-compare instruction (usable for 16 elements only). In other words, ignoring the final reduction and the load latency (shared with the computations for X), his AVX2 solution is more than 6x faster per element for the Y case than your SSE 4.2 version. So at least for Y you should consider using his superior solution (a 128-bit variant of it, sketched below) instead of the slow and cumbersome string-compare instructions (with their awkward code to avoid 0 values)
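
For illustration, a minimal sketch of such a 128-bit (SSSE3) variant for the Y mask, written from the description above rather than taken from Vladimir's post; the OR reduction is shown inline here, but in practice it would be hoisted out of the core loop:

    #include <tmmintrin.h>   // SSSE3: _mm_shuffle_epi8

    static __inline unsigned int yMask16(__m128i v)   // v: 16 packed YX bytes
    {
        const __m128i bit   = _mm_set_epi32(0, 0, 0x80402010, 0x08040201);  // LUT: i -> 1 << i
        const __m128i mask7 = _mm_set1_epi8(0x07);

        __m128i y = _mm_and_si128(_mm_srli_epi16(v, 5), mask7);  // 3-bit Y index per byte
        y = _mm_shuffle_epi8(bit, y);                             // per-byte 1 << Y
        y = _mm_or_si128(y, _mm_srli_si128(y, 8));                // OR-reduce the 16 bytes...
        y = _mm_or_si128(y, _mm_srli_si128(y, 4));
        y = _mm_or_si128(y, _mm_srli_si128(y, 2));
        y = _mm_or_si128(y, _mm_srli_si128(y, 1));
        return _mm_cvtsi128_si32(y) & 0xFF;                       // ...down to the 8-bit mask
    }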

Alexander L. wrote:
Can you please explain what "anecdotal speedups for narrow SIMD" means?

in my mind it is a speedup for 128-bit SIMD that can't be extended to wider SIMD such as 256-bit, which is today's mainstream, or 512-bit, which is already in some development pipelines (people can use the SDE for validation)

Alexander L. wrote:
AVX2 can simply beat the SSE string compare, but what is the solution without AVX2?

it's a bit of a stretch to call an SSE4.2-only solution an SSE path

if I had to decide on a dissemination strategy based on what I have seen so far, I'd go with my LUT solution for all legacy paths < AVX2 and with Vladimir's version for modern paths >= AVX2

Alexander L. wrote:
But it's unclear whether there will be much speedup at all, because memory will be the bottleneck.

if *memory* is the bottleneck, all the optimization exercise in this thread is basically worthless; I was assuming in my benchmark that the client code is L1D-cache friendly. Your very small 32x8 bitmap makes me think of a cache-blocked algorithm with tiling

1) as asked above, what is the typical size of your input array ?

2) is your input array the result of recent computations ? i.e. is it at least still in the L2/LLC caches ?

3) did you profile your code for L1D/L2 cache-miss rates ? 

Alexander L. wrote:
Is that only important for AVX2 or for older AVX/SSE too?

a simple rule of thumb: the older the core, the more important it is to align your data to natural boundaries

Alexander_L_1

bronxzv wrote:

1) as asked above, what is the typical size of your input array ?

2) is your input array the result of recent computations ? i.e. is it at least still in the L2/LLC caches ?

3) did you profile your code for L1D/L2 cache-miss rates ? 

a simple rule of thumb: the older the core, the more it's important to align your data to natural boundaries

The optimization makes a lot of sense.

The size of the input array varies and it typically fits in the L2/LLC cache. If it fits in the cache, that means the inspection works on smaller chunks at a constant overall machine speed, which in turn means there is less time per chunk for the inspection, and especially for this last step.

I've compared the execution speed on several processors; on newer i7s the speedup, thanks to lower cache-miss rates, is great, so the algorithm itself seems sound.

Input size range: 1200..4800 in the X direction and 150..6500 in the Y direction. Typical median: 1200..3000 x 200..4500, so with a 3 MB LLC it will mostly fit in the cache.

The output was defined either as an 8x32 array or as an 8+32 byte vector.

bronxzv wrote:

I was speaking of Vladimir's AVX2 version; it's already slightly faster for 32 elements than your SSE 4.2 version, and it will be a lot faster for 64 elements or more. For example, his critical path for the Y mask (without reduction and with loop invariants in registers) is only 3 clocks of latency for 32 elements, including the clever 1-clock-latency 3-bit x 8-bit LUT using VPSHUFB; that's more than 3x faster than a single string-compare instruction

Do you mean AVX-512 for 64 elements? Is it certain that all these instructions will be extended to AVX-512?

bronxzv wrote:

a simple rule of thumb: the older the core, the more it's important to align your data to natural boundaries

What are the natural boundaries currently? For years it was 16 bytes for MOVDQA and co. Is it now 32 bytes, and does it also matter for older SSE/AVX algorithms? You and Vladimir both saw a 6..10% speedup when memory was aligned to 32 bytes. What was the test bench?

bronxzv

Alexander L. wrote:
Input size range: 1200..4800 in the X direction and 150..6500 in the Y direction. Typical median: 1200..3000 x 200..4500, so with a 3 MB LLC it will mostly fit in the cache.

that's more than 1 M elements in the median case! I can't see how that maps to 5 bits for X and 3 bits for Y with your single-byte-per-element input array; do you have concrete examples of input values?

Alexander L. wrote:
Do you mean AVX-512 for 64 elements? Is it certain that all these instructions will be extended to AVX-512?

yes, with 64-element granularity; these are all in AVX-512BW AFAIK, you can check the Intrinsics Guide to be sure

Alexander L. wrote:
What are natural boundaries currently?

16 B for 128-bit code, 32 B for 256-bit and 64 B for 512-bit
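
For example (a hypothetical allocation, not code from this thread; width and height are placeholder names), a buffer intended for aligned 256-bit loads could be obtained like this:

    #include <xmmintrin.h>   // _mm_malloc / _mm_free

    unsigned char *buf = (unsigned char *)_mm_malloc((size_t)width * height, 32);  // 32 B aligned for _mm256_load_si256
    // ... fill and process buf ...
    _mm_free(buf);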

Alexander_L_1

bronxzv wrote:

Quote:

Alexander L. wrote:
Input size range: 1200..4800 in the X direction and 150..6500 in the Y direction. Typical median: 1200..3000 x 200..4500, so with a 3 MB LLC it will mostly fit in the cache.

that's more than 1 M elements in the median case !

To be more precise, the mathematical mean is not the same as our typical user value (the median) :)

bronxzv wrote:

I can't see how that maps to 5 bits for X and 3 bits for Y with your single-byte-per-element input array; do you have concrete examples of input values?

There are up to 32 virtual lanes and up to 8 virtual rows. At the end of processing we must know whether some result byte belongs to a given virtual row and column. So the output size of this algorithm has nothing to do with the input size at all, just as described in the problem.

Quote:

Alexander L. wrote:
Do you mean AVX-512 for 64 elements? Is it sure, that all this instruction will be expanded to AVX-512?

 

yes, for 64 element granularity, these are all in AVX512BW AFAIK, you can check the Intrinsics Guide to be sure.

bronxzv wrote:

Quote:

Alexander L. wrote:
What are natural boundaries currently?

 

16 B for 128-bit code, 32 B for 256-bit and 64 B for 512-bit

Thanks, I had totally forgotten that (but it's already done in the software)!

bronxzv

Alexander L. wrote:
There are up to 32 virtual lanes and up to 8 virtual rows. At the end of processing we must know whether some result byte belongs to a given virtual row and column. So the output size of this algorithm has nothing to do with the input size at all, just as described in the problem.

then I don't see the point of packing the hashed X (5-bit) and hashed Y (3-bit) into a single byte as per point 2 of your specs (unless you want, for some reason, to store them and process them later in a multipass algorithm?). I'd simply process them independently on the fly to avoid useless cache pollution, updating the SIMD masks without reduction, reducing only once at the end, and skipping the update of an X mask that's already full (the Y mask is so simple that there is no point in skipping its update)

without the YX packing (which is useless if I understand the real problem correctly), computing the Y mask becomes a single-clock-latency instruction (no need to shift and clear the MSBs), i.e. around 10x faster with Vladimir's PSHUFB solution than with your string-compare solution; in fact the code for avoiding 0 values before using the slowish string-compare instruction is already as slow as the full PSHUFB solution!
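
For illustration, a minimal sketch of that point (assuming the Y indices are stored unpacked, one byte 0..7 per element): compared with the packed-byte variant sketched earlier, the shift and the mask disappear, and the per-16-element Y update is a single PSHUFB plus an OR into an accumulator that is reduced only once at the end:

    #include <tmmintrin.h>   // SSSE3

    static __inline __m128i accumYMask16(__m128i acc, __m128i yIdx)   // yIdx: 16 bytes, each 0..7
    {
        const __m128i bit = _mm_set_epi32(0, 0, 0x80402010, 0x08040201);  // LUT: i -> 1 << i
        return _mm_or_si128(acc, _mm_shuffle_epi8(bit, yIdx));            // no shift, no AND
    }
    // reduce acc to the 8-bit Y mask only once, outside the core loop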

now, I suppose it's impossible to help you further without more context, i.e. without seeing the real-world input data, which isn't as per point 2 of your specs after all

Alexander_L_1

bronxzv wrote:

then I don't see the point of packing the hashed X (5-bit) and hashed Y (3-bit) into a single byte as per point 2 of your specs (unless you want to store them and process them later in a multipass algorithm?), I'd simply process them independently on the fly (updating SIMD masks without reduction, reducing only once at the end) to avoid useless cache pollution; anyway, I suppose it's impossible to help you further without more context

Oh, it's a much more complex process, with multipass processing and label (X/Y) assignment. I've already attempted to carry the mask through the algorithms, but this was terrible overkill and after two of them I gave it up. What's more, the mask must be applied in the middle of the processing tree, because we also do some visualization - but that is a completely different (WPF) story.

So at this point I can't see further improvement either. Many thanks to all!

For me the outcome is to use either your LUT version or Vladimir's AVX2 version.

By the way, I have some parts of a CCL (connected-component labeling) algorithm I would like to optimize. Has somebody done such things before?
As I noticed, something that could be called "remapping" is terribly slow. Example below. Maybe it is a better idea to start a new topic? What I'd also very much like to know is whether the Intel compiler can do this much better and produce vectorized code.

 

    int *Map, *reMap;

    // count objects and prepare remapping array
    objectsCount = 0;
    for ( int i = 1; i <= labelsCount; i++ )
    {
        if ( Map[i] == i )
        {
            // increase objects count
            reMap[i] = ++objectsCount;
        }
    }

    // second pass to complete remapping
    for ( int i = 1; i <= labelsCount; i++ )
    {
        if ( Map[i] != i )
        {
            reMap[i] = reMap[Map[i]];
        }
    }

    // repair object labels
    for ( int i = 0, n = mImageWidth * mImageHeight; i < n; i++ )
    {
        objectLabels[i] = reMap[objectLabels[i]];
    }
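
Not from the thread, but as a rough illustration of what a hand-vectorized version of the last ("repair object labels") loop could look like with AVX2, assuming 32-bit labels; the function and parameter names are placeholders, and whether a gather actually beats the scalar loop depends on the target core:

    #include <immintrin.h>

    void remapLabels(int *objectLabels, const int *reMap, int n)
    {
        int i = 0;
        for ( ; i + 8 <= n; i += 8 )
        {
            __m256i idx = _mm256_loadu_si256((const __m256i *)(objectLabels + i));
            __m256i v   = _mm256_i32gather_epi32(reMap, idx, 4);   // v = reMap[idx], 4-byte scale
            _mm256_storeu_si256((__m256i *)(objectLabels + i), v);
        }
        for ( ; i < n; i++ )   // scalar tail
            objectLabels[i] = reMap[objectLabels[i]];
    }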

bronxzv

Alexander L. wrote:
For me the outcome is to use either your LUT version or Vladimir's AVX2 version.

sorry, but I updated my post after your answer here

don't miss my comment about PSHUFB for the Y mask: PSHUFB allows for a single-clock 4-bit x 8-bit vectorized LUT, i.e. a single-clock 4-bit to 8-bit mapping of 16/32/64 elements with 128/256/512-bit SIMD; this is arguably the optimal solution for the Y mask (ignoring the final reduction, which must stay out of the core loop) for all code paths >= SSSE3
