Intel® ISA Extensions

Indirect Bit Indexing and Set

Alexander_L_1
Beginner
1,962 Views

Hello everyone!

This is my first post, so please be patient :)
I have a very interesting problem. I also have a working solution, but that solution does not make me happy.
First I will try to describe the problem as exactly as possible.
1. Let I (I for Image) be a 2D array of bytes.
2. Each byte contains 2 independent indices - say the upper 3 bits are the Y-index, the lower 5 bits are the X-index (see the small example below).
3. This easily defines a translation I->YX.
4. It's also noticeable that YX can be described as a 2D bit image with only 8 rows and 32 columns, which makes exactly 256 binary cells.
5. The full solution requires setting 1 in each cell addressed by I (see 1 and 2).
6. An accepted solution can be reduced to separately calculated "set rows" and "set columns" - meaning 8 bits for rows and 32 bits for columns.
7. Two 32-bit registers/variables are sufficient output for an accepted solution.
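For illustration, one example byte decoded (the value is chosen arbitrarily):

    unsigned char b = 0xA3;   // binary 101 00011
    int y = b >> 5;           // upper 3 bits -> 5 (row index)
    int x = b & 0x1F;         // lower 5 bits -> 3 (column index)
    // so the source byte 0xA3 sets cell (row 5, column 3) of the 8 x 32 bitmap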

 

I really have no idea how to implement this thing efficiently.

Moreover, I've found no instruction to convert a number to "set bits", and I also have no idea how to do such a thing for a complete XMM register.

The current solution uses an array of 256 integers where each entry is addressed by the byte index. Each entry is also counted (not only set), but this is not required.

Any better ideas?

 

Many thanks in advance!

    #include <smmintrin.h> // SSE2/SSE4.1 intrinsics (_mm_load_si128, _mm_and_si128, _mm_extract_epi8)

    typedef unsigned char uint8;

    void countByteIndex( int width, int height, int dStep, int iStep, void* _D, void* _I, void* _R)
    {
        int x, y;

        const int xStep = dStep;
        const int yStep = iStep;

        uint8* srcX = (uint8*)_D;
        uint8* srcY = (uint8*)_I;

        __int32* dst = (__int32*)_R;

        for( y = 0; y < height; y++) // single line at once
        {

            register const __m128i *srcX0 = (const __m128i *)(srcX);
            register const __m128i *srcY0 = (const __m128i *)(srcY);
            register __m128i sX0, sY0;

            register int r0, r1, r2, r3;

            for( x = 0; x < width; x += 16 ) // 16 bytes at once per line

            {
                sX0 = _mm_load_si128( srcX0++ ); // Loads 128-bit value. Aligned.
                sY0 = _mm_load_si128( srcY0++ ); // Loads 128-bit value. Aligned.

                sX0 = _mm_and_si128( sX0, sY0 ); // Mask destination offset.

                // Index sX0
                r0 = _mm_extract_epi8(sX0, 0);
                r1 = _mm_extract_epi8(sX0, 1);
                r2 = _mm_extract_epi8(sX0, 2);
                r3 = _mm_extract_epi8(sX0, 3);

                (*(dst+r0))++; // sufficient is also (*(dst+r0)) = 1 for all (*(dst+..))++
                (*(dst+r1))++;
                (*(dst+r2))++;
                (*(dst+r3))++;

                r0 = _mm_extract_epi8(sX0, 4);
                r1 = _mm_extract_epi8(sX0, 5);
                r2 = _mm_extract_epi8(sX0, 6);
                r3 = _mm_extract_epi8(sX0, 7);

                (*(dst+r0))++;
                (*(dst+r1))++;
                (*(dst+r2))++;
                (*(dst+r3))++;

                r0 = _mm_extract_epi8(sX0, 8);
                r1 = _mm_extract_epi8(sX0, 9);
                r2 = _mm_extract_epi8(sX0, 10);
                r3 = _mm_extract_epi8(sX0, 11);

                (*(dst+r0))++;
                (*(dst+r1))++;
                (*(dst+r2))++;
                (*(dst+r3))++;

                r0 = _mm_extract_epi8(sX0, 12);
                r1 = _mm_extract_epi8(sX0, 13);
                r2 = _mm_extract_epi8(sX0, 14);
                r3 = _mm_extract_epi8(sX0, 15);

                (*(dst+r0))++;
                (*(dst+r1))++;
                (*(dst+r2))++;
                (*(dst+r3))++;
            }

            srcX += xStep;
            srcY += yStep;
        }
     };

0 Kudos
44 Replies
bronxzv
New Contributor II
1,493 Views

Alexander L. wrote:
I have a very interesting problem. I also have a working solution, but that solution does not make me happy.

Your solution doesn't match your description (the description is to set flags in a bitmap, the "solution" counts things). Btw, something I'd advise is to always start from high-level source code before toying with optimizations; it will also help other people better understand what you want to achieve.

Alexander L. wrote:

First I will try to describe the problem as exactly as possible.
1. Let I (I for Image) be a 2D array of bytes.
2. Each byte contains 2 independent indices - say the upper 3 bits are the Y-index, the lower 5 bits are the X-index.
3. This easily defines a translation I->YX.
4. It's also noticeable that YX can be described as a 2D bit image with only 8 rows and 32 columns, which makes exactly 256 binary cells.
5. The full solution requires setting 1 in each cell addressed by I (see 1 and 2).
6. An accepted solution can be reduced to separately calculated "set rows" and "set columns" - meaning 8 bits for rows and 32 bits for columns.
7. Two 32-bit registers/variables are sufficient output for an accepted solution.

If I got it right this can be coded in a few lines of C++ where the core loop body will be:

    YX[srcElt>>5] |= 1 << (srcElt & 0x1f);

with srcElt a source byte in I and YX an 8 x 32-bit array (i.e. with 1 bit per "cell") holding the result, all YX elements initially set to 0

at 1st sight it looks challenging to vectorize since the YX R/W access requires gather/scatter, but since there are only 8 entries you should be able to put the whole 256-bit bitmap in a register using AVX2 (btw, something missing from your specs is your target ISA)
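for reference, a minimal compilable sketch of this full-bitmap baseline (names and signature are just an illustration):

    // baseline sketch: build the full 8 x 32 bitmap from n source bytes
    void buildBitmap(const unsigned char *I, int n, unsigned int YX[8])
    {
        for (int row = 0; row < 8; row++)
            YX[row] = 0;
        for (int i = 0; i < n; i++)
        {
            const unsigned char srcElt = I[i];
            YX[srcElt >> 5] |= 1u << (srcElt & 0x1f); // row = upper 3 bits, column = lower 5 bits
        }
    }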

 

0 Kudos
Alexander_L_1
Beginner
1,493 Views

Dear bronxzv

bronxzv wrote:

Your solution doesn't match your description (the description is to set flags in a bitmap, the "solution" counts things). Btw, something I'd advise is to always start from high-level source code before toying with optimizations; it will also help other people better understand what you want to achieve.

You are fully correct, my solution does more than requested by the problem description. Initially I counted entries by index, but this is not actually required, so I changed the problem description a little. Moreover, the solution will be thread-parallelized, so counting can't work correctly at all. I noted this in the provided (unchanged) code in the comment
 (*(dst+r0))++; // sufficient is also (*(dst+r0)) = 1 for all (*(dst+..))++,
surely it would have been better to point this out more visibly.

I've not written other, high-level source code. This is because I learned assembler programming first, over 25 years ago. The presented intrinsic code is very readable for me, much more so than all the high-level SHIFT, AND, etc. ;) I've also learned that it's most helpful to describe problems with words and not code, because code hides some assumptions and can contain errors - this was my motivation.

bronxzv wrote:

If I got it right this can be coded in a few lines of C++ where the core loop body will be:

    YX[srcElt>>5] |= 1 << (srcElt & 0x1f);

with srcElt a source byte in I and YX an 8 x 32-bit array (i.e. with 1 bit per "cell") holding the result, all YX elements initially set to 0

OK, a high-level procedural description, using some notation, would be:

YX[ (srcElt>>5), (srcElt & 0x1f) ] |= 1;

bronxzv wrote:

at 1st sight it looks challenging to vectorize since the YX R/W access requires gather/scatter, but since there are only 8 entries you should be able to put the whole 256-bit bitmap in a register using AVX2 (btw, something missing from your specs is your target ISA)

This is the key of the problem.

As described above, a sufficient solution is to get two independent vectors Y and X as follows:
The core loop body will be:

   Y[srcElt>>5] |= 1;
   X[srcElt & 0x1f] |= 1;

As we can see, because it is sufficient to work with 0 or 1 only, the vectors can be coded bit-wise, so it's enough to maintain an 8-bit Y vector and a 32-bit X vector. This is covered by the last two points of the problem description:

6. An accepted solution can be reduced to separately calculated "set rows" and "set columns" - meaning 8 bits for rows and 32 bits for columns.
7. Two 32-bit registers/variables are sufficient output for an accepted solution.

Next, this should work on an SSE-only compatible processor, i.e. without AVX2.

Moreover, we use the MS compiler, and after some time I got a really bogus problem (sometimes it works, sometimes not - as if it depends on whether the sun is shining) with some SSE intrinsics ("Instruction not supported on processor" during code execution) if AVX2 is enabled - this made me really crazy, but that is another story. In short - AVX2 is currently unusable for me.

Alright, I hope the problem description is now clarified, because after 4 days of thinking about it (starting last week), today I got the key part of a new, short, vectorized solution.

Because I think this may be of interest to other people, I will describe a fully vectorized, very compact, only-a-few-lines-of-code solution in my next reply. But I hope some experienced developer can beat my new solution. If not, a ready solution would make our market competitors happy :)

0 Kudos
bronxzv
New Contributor II
1,493 Views

Alexander L. wrote:

I've not written other, high-level source code. This is because I learned assembler programming first, over 25 years ago.

I'm quite sure spending just a few hours with a good book on C basics will help you write cleaner/simpler code, even if it's based on intrinsics

for ex. a classical dst[r0]++ is equivalent to (and more readable than) your  (*(dst+r0))++

 

anyway, the point of my code snippet was to lead to a compilable solution (a full C++ test program) that can be validated, not some pseudo-code notation like the one you use for 2D array access

having such a program available is also nice as a baseline performance point to compare your hand-optimized code against; poorly optimized code with intrinsics may well be slower than what the compiler spits out in a second or two, even after several days of hard work

btw, for the simplified solution, all that you have to do is:

X |=  1 << (srcElt & 0x1f);

Y |=  1 << (srcElt >> 5);

with X a 32-bit integer and Y a byte, i.e. no array access, exactly as per 6. in your specs
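spelled out as a compilable sketch (just an illustration wrapping the two lines above; essentially the same loop reappears later in this thread as set_bitsC):

    void simplifiedXY(const unsigned char *src, int n, unsigned int &X, unsigned char &Y)
    {
        X = 0; Y = 0;
        for (int i = 0; i < n; i++)
        {
            const unsigned char srcElt = src[i];
            X |= 1u << (srcElt & 0x1f);  // column seen
            Y |= 1u << (srcElt >> 5);    // row seen
        }
    }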

0 Kudos
Alexander_L_1
Beginner
1,493 Views

bronxzv wrote:

I'm quite sure spending just a few hours with a good book on C basics will help you write cleaner/simpler code, even if it's based on intrinsics
for ex. a classical dst[r0]++ is equivalent to (and more readable than) your  (*(dst+r0))++

Yes, this is more readable :) The code was written in assembly language years ago for some other problem - this was a quick adaptation. To be precise, as many books say: ++dst[r0] should be preferred.

But all this is not the key of the question. The question was how to vectorize and optimize the whole thing.

We can extract each byte to a general-purpose register (big latency), split it into Y and X parts, for each part move the value to the "CL" register, then shift 1 by "CL" (a very slow special operation with huge latency that has stayed unoptimized for years on Intel processors) and combine by OR. That will take a lot of code and a lot of cycles.

Tomorrow I will post the core idea of how to do this in a simple vectorized way. Today it is way too late ;)

0 Kudos
bronxzv
New Contributor II
1,493 Views

Alexander L. wrote:
shift 1 by the "CL" (very slow special operation with huge latency that stay unoptimized since years on Intel procesors)

not that slow for several core generations now

for ex. SHR/SHL reg,cl is 1.5 clock rcp throughput on modern Intel cores (Sandy Bridge and later), isn't it? Btw, they were even faster on previous cores such as Westmere/Nehalem - maybe you have the Pentium 4 in mind?

moreover, VPSLLVD/VPSRAVD (8 parallel 32-bit shifts with fully independent variable shift counts) and the like are 2 clock rcp throughput on Haswell, that's 4 variable reg,reg 32-bit shifts per clock, really not bad if you ask me
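a minimal sketch of what these variable-shift instructions do, assuming an AVX2 target (illustration only, not part of the proposed solution):

    #include <immintrin.h>

    // VPSLLVD: each 32-bit lane of 'ones' is shifted left by the count in the matching lane of 'counts'
    __m256i shift_ones_by_counts(__m256i counts)
    {
        const __m256i ones = _mm256_set1_epi32(1);
        return _mm256_sllv_epi32(ones, counts); // eight independent "1 << count" in one instruction
    }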

Alexander L. wrote:
Tomorrow I will post the core idea of how to do this in a simple vectorized way. Today it is way too late ;)

OK, I look forward to it

0 Kudos
Vladimir_Sedach
New Contributor I
1,493 Views

Hello Alexander,

That's my solution. It's ~2.7x as fast as a simple:
    for (int i = 0; i < 16; i++)
    {
        x_res |= 1 << (yx[i] & 0x1F);
        y_res |= 1 << (yx[i] >> 5);
    }
Hope yours is considerably faster )
===

// OR all 16 bytes of v together and return the single resulting byte.
__inline unsigned char or_bytes(__m128i v)
{
    __m128i    r;

    r = _mm_unpackhi_epi64(v, v);   // fold the upper 8 bytes onto the lower 8
    v = _mm_or_si128(v, r);

    r = _mm_shuffle_epi32(v, 1);    // fold bytes 4..7 onto bytes 0..3
    v = _mm_or_si128(v, r);

    r = _mm_shufflelo_epi16(v, 1);  // fold bytes 2..3 onto bytes 0..1
    v = _mm_or_si128(v, r);

    r = _mm_srli_epi16(v, 8);       // fold byte 1 onto byte 0
    v = _mm_or_si128(v, r);

    return _mm_cvtsi128_si32(v);    // the low byte now holds the OR of all 16 bytes
}

__inline unsigned int set_bits(unsigned char *yx, unsigned int &y_ret)
{
    __m128i    bit = _mm_set_epi32(0, 0, 0x80402010, 0x08040201);
    __m128i    byte0 = _mm_set_epi32(0, 0, 0, 0x000000FF);
    __m128i    byte1 = _mm_set_epi32(0, 0, 0, 0x0000FF00);
    __m128i    byte2 = _mm_set_epi32(0, 0, 0, 0x00FF0000);
    __m128i    byte3 = _mm_set_epi32(0, 0, 0, 0xFF000000);
    __m128i    bytei, mask;
    __m128i    mask3 = _mm_set1_epi8(0x03);
    __m128i    mask7 = _mm_set1_epi8(0x07);
    __m128i    v, y;
    __m128i    xbits, ybits;
    __m128i    xbits0, xbits1, xbits2, xbits3;
    unsigned int    x_res, y_res;

    ybits = _mm_setzero_si128();
    xbits0 = xbits1 = xbits2 = xbits3 = _mm_setzero_si128();

    v = _mm_loadu_si128((__m128i *)yx);
    y = _mm_srli_epi16(v, 5);
    y = _mm_and_si128(y, mask7);
    y = _mm_shuffle_epi8(bit, y);
    ybits = _mm_or_si128(ybits, y);

    xbits = _mm_and_si128(v, mask7);
    xbits = _mm_shuffle_epi8(bit, xbits);

    bytei = _mm_srli_epi16(v, 3);
    bytei = _mm_and_si128(bytei, mask3);

    mask = _mm_shuffle_epi8(byte0, bytei);
    xbits0 = _mm_or_si128(xbits0, _mm_and_si128(xbits, mask));

    mask = _mm_shuffle_epi8(byte1, bytei);
    xbits1 = _mm_or_si128(xbits1, _mm_and_si128(xbits, mask));

    mask = _mm_shuffle_epi8(byte2, bytei);
    xbits2 = _mm_or_si128(xbits2, _mm_and_si128(xbits, mask));

    mask = _mm_shuffle_epi8(byte3, bytei);
    xbits3 = _mm_or_si128(xbits3, _mm_and_si128(xbits, mask));

    y_res = or_bytes(ybits);
    x_res =
        (or_bytes(xbits0) << 0) |
        (or_bytes(xbits1) << 8) |
        (or_bytes(xbits2) << 16) |
        (or_bytes(xbits3) << 24);

    y_ret = y_res;
    return x_res;
}
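A possible driver sketch for calling set_bits over a longer buffer (illustration only; assumes the length is a multiple of 16 and that per-block results just need to be ORed together):

    // OR the per-16-byte results of set_bits over a whole buffer
    void scan_buffer(unsigned char *buf, int len, unsigned int &x_all, unsigned int &y_all)
    {
        x_all = y_all = 0;
        for (int i = 0; i + 16 <= len; i += 16)
        {
            unsigned int y;
            x_all |= set_bits(buf + i, y);
            y_all |= y;
        }
    }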

 

0 Kudos
bronxzv
New Contributor II
1,493 Views

Vladimir Sedach wrote:

Hello Alexander,

That's my solution. It's ~2.7x as fast as a simple:
    for (int i = 0; i < 16; i++)
    {
        x_res |= 1 << (yx[i] & 0x1F);
        y_res |= 1 << (yx[i] >> 5);
    }

it's faster for legacy SSE2 targets but the simplistic C version above will be vectorizable for AVX2 targets (maybe after a bit of refactoring) and may well end up faster for modern cores (assuming more than 16 elements, the original specs don't mention such a low nr of elements, 16 is the best case for SSE2, I'll choose 32 for AVX2 and 64 for AVX-512)

0 Kudos
Vladimir_Sedach
New Contributor I
1,493 Views

bronxzv wrote:


it's faster for legacy SSE2 targets but the simplistic C version above will be vectorizable for AVX2 targets (maybe after a bit of refactoring) and may well end up faster for modern cores (assuming more than 16 elements, the original specs don't mention such a low nr of elements, 16 is the best case for SSE2, I'll choose 32 for AVX2 and 64 for AVX-512)



You're absolutely right (except for SSE2 -- it is actually SSSE3).
Though the boss (Alexander) doesn't want AVX for some mysterious reason.
Let's wait for his SSEx version that he is so proud of.

BTW, Alexander, VC isn't a good choice for SSE/AVX projects. It is (was) buggy and produces slow code.

Forgot to say: I'm using MinGW 4.8.2 on 64-bit Windows on a Haswell machine.
The "simple" version is 46% faster with Intel C than with MinGW, while the Intel SSE version is 11% slower.
All with just the O2 option set.

 

0 Kudos
bronxzv
New Contributor II
1,493 Views

Vladimir Sedach wrote:

Let's wait for his SSEx version that he is so proud of.

makes me think that now that the problem is clearly defined and quite simple it looks like a good candidate for some coding contest, I'll try to find some time to propose my fav. solution(s)

0 Kudos
bronxzv
New Contributor II
1,493 Views

Vladimir Sedach wrote:

You're absolutely right

just to be sure about AVX2 vectorization, I tested the code as is (full func. below)

unsigned int set_bitsC(const unsigned char *yx, unsigned int &y_ret)
{
  unsigned int x_res = 0, y_res = 0;
  for (int i=0; i<16; i++)
  {
    x_res |= 1 << (yx[i] & 0x1F);
    y_res |= 1 << (yx[i] >> 5);
  }
  y_ret = y_res;
  return x_res;
}

and the Intel compiler vectorizes it well (fully unrolled, as in your solution), see the ASM dump below

PUBLIC ?set_bitsC@@YAIPEBEAEAI@Z
?set_bitsC@@YAIPEBEAEAI@Z PROC 
; parameter 1(yx): rcx
; parameter 2(y_ret): rdx
.B1.1::                         ; Preds .B1.0

;;; {

$LN0:
$LN1:

;;;   unsigned int x_res = 0, y_res = 0;
;;;   for (int i=0; i<16; i++)
;;;   {
;;;     x_res |= 1 << (yx[i] & 0x1F);
;;;     y_res |= 1 << (yx[i] >> 5);

        vpmovzxbw ymm5, XMMWORD PTR [rcx]                       ;18.5
$LN2:
        vpsraw    ymm1, ymm5, 5                                 ;18.5
$LN3:
        vmovdqu   ymm4, YMMWORD PTR [_2il0floatpacket.2]        ;17.5
$LN4:
        vmovdqu   ymm3, YMMWORD PTR [_2il0floatpacket.3]        ;17.5
$LN5:
        vextracti128 xmm5, ymm1, 1                              ;18.5
$LN6:
        vpmovsxwd ymm0, xmm1                                    ;18.5
$LN7:
        vpmovsxwd ymm1, xmm5                                    ;18.5
$LN8:
        vpsllvd   ymm2, ymm4, ymm0                              ;18.5
$LN9:
        vpsllvd   ymm0, ymm4, ymm1                              ;18.5
$LN10:
        vpor      ymm2, ymm2, ymm0                              ;14.33
$LN11:
        vextracti128 xmm1, ymm2, 1                              ;14.33
$LN12:
        vpor      xmm0, xmm2, xmm1                              ;14.33
$LN13:
        vpshufd   xmm5, xmm0, 14                                ;14.33
$LN14:
        vpmovzxbd ymm2, QWORD PTR [rcx]                         ;17.5
$LN15:
        vpor      xmm1, xmm0, xmm5                              ;14.33
$LN16:
        vpand     ymm5, ymm2, ymm3                              ;17.5
$LN17:
        vpshufd   xmm0, xmm1, 57                                ;14.33
$LN18:
        vpor      xmm1, xmm1, xmm0                              ;14.33
$LN19:
        vpmovzxbd ymm2, QWORD PTR [8+rcx]                       ;17.5
$LN20:
        vpand     ymm3, ymm2, ymm3                              ;17.5
$LN21:

;;;   }
;;;   y_ret = y_res;

        vmovd     DWORD PTR [rdx], xmm1                         ;20.3
$LN22:
        vpsllvd   ymm0, ymm4, ymm5                              ;17.5
$LN23:
        vpsllvd   ymm4, ymm4, ymm3                              ;17.5
$LN24:
        vpor      ymm0, ymm0, ymm4                              ;14.22
$LN25:
        vextracti128 xmm2, ymm0, 1                              ;14.22
$LN26:
        vpor      xmm3, xmm0, xmm2                              ;14.22
$LN27:
        vpshufd   xmm4, xmm3, 14                                ;14.22
$LN28:
        vpor      xmm5, xmm3, xmm4                              ;14.22
$LN29:
        vpshufd   xmm0, xmm5, 57                                ;14.22
$LN30:
        vpor      xmm2, xmm5, xmm0                              ;14.22
$LN31:
        vmovd     eax, xmm2                                     ;14.22
$LN32:

;;;   return x_res;

        vzeroupper                                              ;21.3
$LN33:
        ret                                                     ;21.3

 

0 Kudos
Vladimir_Sedach
New Contributor I
1,493 Views

bronxzv wrote:


makes me think that now that the problem is clearly defined and quite simple it looks like a good candidate for some coding contest, I'll try to find some time to propose my fav. solution(s)



Well,
let's get ready to rumble )

Though, there's a high risk of being defeated by Intel's AVX2 code that is ~9 times faster than the one w/o SIMD ))

0 Kudos
jimdempseyatthecove
Honored Contributor III
1,493 Views

Not having a full description of your whole problem, and with your 25 years of programming experience, you should be able to recognize that an application-wide optimal solution may be quite different from optimizing a core routine. This said, let me offer an alternate solution that should be easy enough to try.

Premises:

1) Your byte-encoded YX 256-bit bit index can be constructed to reference a structure enclosed within a single cache line
2) Two of these 256-bit structures can be contained within a single cache line
3) L1 cache hit latency is on the order of 4 clock cycles

Suggestion:

Use the byte-encoded YX as an index into a table of 256-bit bit masks (one bit set per mask)

Setting would be an OR, testing would be an AND.
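A minimal sketch of this mask-table idea, assuming an SSE2 target and a prebuilt table (the names and the 2 x 128-bit split are illustrative, not Jim's code):

    #include <emmintrin.h>

    struct Mask256 { __m128i lo, hi; };   // one 256-bit bitmap held in two XMM values
    static Mask256 maskTable[256];        // maskTable[b] has exactly bit b set (built once at startup)

    void setCells(const unsigned char *I, int n, Mask256 &bitmap)
    {
        bitmap.lo = bitmap.hi = _mm_setzero_si128();
        for (int i = 0; i < n; i++)
        {
            bitmap.lo = _mm_or_si128(bitmap.lo, maskTable[I[i]].lo);  // setting a cell is an OR
            bitmap.hi = _mm_or_si128(bitmap.hi, maskTable[I[i]].hi);
        }
    }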

You may want to have SSE, AVX and AVX2 versions.

Also note that the newer AVX... instructions have a compare that sets a byte/word/dword/qword mask in ymm (zmm) and a second instruction that packs the MSBs of the bit fields into a GP register.

Jim Dempsey

 

0 Kudos
bronxzv
New Contributor II
1,493 Views

Vladimir Sedach wrote:
Though, there's a high risk of being defeated by Intel's AVX2 code that is ~9 times faster than the one w/o SIMD ))

I just measured the simplistic C++ AVX2 compiled version 3x faster than the SSE2 compiled version, and 1.7x faster than your hand optimized SSSE3 version

btw I validated your solution over > 1e9 random examples and I confirm it's all OK

my measurements are as follows:

100 000 000 runs over 4 KB of random data (includes computation of control checksums)
Core i7 4770K @ 3.5 GHz (turbo enabled)
Intel C++ compiler v. 14.0.4.237

Vladimir's hand optimized w/ intrinsics (64-bit SSSE3 target)  834 - 837 ms
simplistic C++ (64-bit SSE2 target) 1468 - 1470 ms
simplistic C++ (64-bit AVX2 target) 490 - 493 ms

0 Kudos
bronxzv
New Contributor II
1,493 Views

jimdempseyatthecove wrote:
Use the byte-encoded YX as an index into a table of 256-bit bit masks (one bit set per mask)

I understand that your proposal is for the full solution but I have tested the simplified case with a LUT (code below) to see how the timings compare

the speed is roughly the same as the best score so far, with 496 - 498 ms *when compiled for an SSE2 target* (probably the same speed for a generic x86 target). It requires a single code path and is faster than Vladimir's proposal with intrinsics; moreover, it is directly usable for non-multiple-of-16 element counts (with no wasted computation for padded elements), so this is my favorite solution so far

__int64 XYLUT[256];

_forceinline unsigned int set_bitsCv2(const unsigned char *yx, unsigned int &y_ret)
{
  unsigned __int64 xy_res = 0;
  for (int i=0; i<16; i++)
    xy_res |= XYLUT[yx[i]];
  y_ret = xy_res & 0xFF;
  return xy_res >> 8;
}

void init()
{
  for (int i=0; i<256; i++)
  {
    const unsigned int x = 1 << (i & 0x1F),
                       y = 1 << (i >> 5);
    XYLUT[i] = __int64(x) << 8 | y;
  }
}

 

it's interesting to note that the Intel compiler avoids using gather instructions when targeting AVX2, for good reasons: when forcing the usage of gather with the example below

_forceinline int set_bitsCv3(const unsigned char *yx, unsigned int &y_ret)
{
  __m256i vxy256 = _mm256_setzero_si256();
  for (int i=0; i<16; i+=4)
  { 
    const __m128i vindex = _mm_cvtepu8_epi32((__m128i &)yx[i]); 
    vxy256 = _mm256_or_si256(vxy256,_mm256_i32gather_epi64(XYLUT,vindex,8));
  }
  const __m128i vxy128 = _mm_or_si128(_mm256_extractf128_si256(vxy256,0),_mm256_extractf128_si256(vxy256,1));
  const unsigned __int64 xy_res = _mm_cvtsi128_si64(_mm_or_si128(_mm_unpackhi_epi64(vxy128,vxy128),vxy128));
  y_ret = xy_res & 0xFF;
  return xy_res >> 8; 
}

I measured very poor timings, around 1270 ms which is more than 2.5x worse than the scalar LUT version

Broadwell should provide better scores with gather (TBC)

0 Kudos
Alexander_L_1
Beginner
1,493 Views

  Hello there,

it's very nice to get interesting info and so much help!

First, to clarify why AVX2 is currently not an option.

We have many systems in the field with Intel i5 processors without AVX2.
The next problem is a Visual Studio bug - the project is much larger than just C/C++, let's say very large.
The bug is really crazy - if AVX2 is used, sometimes old SSE-coded methods produce an invalid instruction exception.

Just as bronxzv mentioned, the first choice was to use VPSLLVD/VPSRAVD, but this option fails due to the compiler bug. Moreover, we would need to widen bytes to words and words to dwords to use these instructions.

The next idea was a simple extract and shift (then OR), but we would need a shift instruction with a variable count, and these instructions are too slow.

As I can see, BTS (bit test and set) would also be a perfect alternative - with the same slowness.

So, my next try was to simply search for "intel-instruction+search+mask", and see: http://www.strchr.com/strcmp_and_strlen_using_sse_4.2

All these instructions were new to me, so the next step was to find a well-documented explanation:
"Intel® Advanced Vector Extensions Programming Reference" - nothing, they are just mentioned as instructions.
"Intel® Architecture Instruction Set Extensions Programming Reference" - does not help either.
"Intel® 64 and IA-32 Architectures Optimization Reference Manual" - just showed me the way :)

So I tried to modify the question and voila, the same question as mine: http://stackoverflow.com/questions/10068541/efficient-way-to-create-a-bit-mask-from-multiple-numbers-possibly-using-sse-sse2

So the solution was terribly simple, after 2 hours of trial and error (because it just did not work as expected) and after finding the last puzzle piece in a comment on MSDN for another string-search instruction (http://msdn.microsoft.com/en-us/library/bb513993.aspx): "One if b does not contain the null character and the resulting mask is equal to zero. Otherwise, zero."

This comment means the search vector must not contain a 0 (zero value); after that I could write a core function (for overall optimization we just OR both high/low results at the end of the complete loop). Here is only the core function for test purposes, so only the first two lines are of interest. Surely we must split the input YX value into two separate Y and X values (as described above), but this is trivial.

    const __m128i bl = _mm_set_epi8(16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1);
    const __m128i bh = _mm_set_epi8(32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17);
    const int mode2 = _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY;
__int32 getBitsForX(const __m128i& a)
{
    __m128i fullResultl = _mm_cmpistrm(a, bl, mode2); // set bit with position 1--16
    __m128i fullResulth = _mm_cmpistrm(a, bh, mode2); // set bit with position 17--32

    fullResulth = _mm_slli_si128(fullResulth, 2); // shift 2 bytes!
    fullResultl = _mm_or_si128(fullResultl, fullResulth);
    __int32 res = _mm_extract_epi32(fullResultl, 0);
    return res;

};

So, I ended up with a string-search instruction for a bit-set operation. That's amazing. Maybe it would be really helpful to mention that in the documentation for all other users. Without a search engine this solution would not have been possible, so I should not be honored for it ;)

Once again - many thanks to all! It's very interesting to see alternative solutions and to see what different compilers do.
And, if I could use AVX2, I think only one instruction would be sufficient for all 32 bits.

Just to say, X can't have a value of 0 by the problem description.
But Y can, so to use the same method for Y, 1 must first be added to all bytes.

@jimdempseyatthecove: sad to say, but for many years my terrain has been boring C#, WPF and such things - so I'm very behind on current processor technologies.

So, maybe it can be done better?

 

0 Kudos
Alexander_L_1
Beginner
1,493 Views

@Jim Dempsey - can you please explain your solution, possibly both with and without AVX2?

You wrote:

Also note that the newer AVX... instructions have a compare that sets a byte/word/dword/qword mask in ymm (zmm) and a second instruction that packs the MSBs of the bit fields into a GP register.

Would this be the same solution? It looks very close to the solution I ended up with.

0 Kudos
bronxzv
New Contributor II
1,493 Views

Alexander L. wrote:

So I tried to modify the question and voila, the same question as mine: http://stackoverflow.com/questions/10068541/efficient-way-to-create-a-bi.

So the solution was terribly simple, after 2 hours of trial and error (because it just did not work as expected) and after finding the last puzzle piece in a comment on MSDN for another string-search instruction (http://msdn.microsoft.com/en-us/library/bb513993.aspx): "One if b does not contain the null character and the resulting mask is equal to zero. Otherwise, zero."

This comment means the search vector must not contain a 0 (zero value); after that I could write a core function (for overall optimization we just OR both high/low results at the end of the complete loop). Here is only the core function for test purposes, so only the first two lines are of interest. Surely we must split the input YX value into two separate Y and X values (as described above), but this is trivial.

    const __m128i bl = _mm_set_epi8(16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1);
    const __m128i bh = _mm_set_epi8(32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17);
    const int mode2 = _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY;
__int32 getBitsForX(const __m128i& a)
{
    __m128i fullResultl = _mm_cmpistrm(a, bl, mode2); // set bit with position 1--16
    __m128i fullResulth = _mm_cmpistrm(a, bh, mode2); // set bit with position 17--32

    fullResulth = _mm_slli_si128(fullResulth, 2); // shift 2 bytes!
    fullResultl = _mm_or_si128(fullResultl, fullResulth);
    __int32 res = _mm_extract_epi32(fullResultl, 0);
    return res;

};

hey, it looks like you have a winner here!

I got 284-286 ms with the code below, directly adapted from your example; this is 1.7x faster than the best preceding result

const __m128i blx = _mm_set_epi8(16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1),
              bhx = _mm_set_epi8(32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17),
              bly = _mm_set_epi8(0, 0, 0, 0, 0, 0, 0, 0, (7<<5)|1, (6<<5)|1, (5<<5)|1, (4<<5)|1, (3<<5)|1, (2<<5)|1, (1<<5)|1, (0<<5)|1);

const int mode2 = _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY;

const __m128i clearLSB = _mm_set1_epi8(0xE0), clearMSB = _mm_set1_epi8(0x1F), v1 = _mm_set1_epi8(1);

_forceinline __int32 get16Bits(const __m128i &a, const __m128i &bl)
{
  const __m128i tl = _mm_cmpistrm(a,bl,mode2); // set bit with position 1--16
  return _mm_cvtsi128_si32(tl);
};

_forceinline __int32 get32Bits(const __m128i &a, const __m128i &bh, const __m128i &bl)
{
  const __m128i tl = _mm_cmpistrm(a,bl,mode2), // set bit with position 1--16
                th = _mm_cmpistrm(a,bh,mode2); // set bit with position 17--32
  return _mm_cvtsi128_si32(_mm_or_si128(tl,_mm_slli_si128(th,2)));
};

_forceinline unsigned int set_bitsCv4(const unsigned char *yx, unsigned int &y_ret)
{
  const __m128i vxy = _mm_load_si128((__m128i *)yx);
  const __m128i vx = _mm_add_epi8(_mm_and_si128(vxy,clearMSB),v1), vy = _mm_or_si128(_mm_and_si128(vxy,clearLSB),v1); 
  y_ret = get16Bits(vy,bly);
  return get32Bits(vx,bhx,blx);
}

 

0 Kudos
Alexander_L_1
Beginner
1,307 Views

bronxzv wrote:

I got 284-286 ms with the code below, directly adapted from your example; this is 1.7x faster than the best preceding result

 

Many thanks for evaluating! I think with AVX2 it will perform much better, because the two _mm_cmpistrm could be done with one and the _mm_or_si128 could be omitted. Can you please compare both versions? Maybe this would be an argument I can use to push for buying and using another compiler without such MS bugs.

It is very strange for me to have searched the manuals up and down for many days without results, and then to have found the right direction for the solution on the internet.

It's also very interesting for me to see how well some compilers can optimize code.

0 Kudos