<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Quote:jimdempseyatthecove in Intel® ISA Extensions</title>
    <link>https://community.intel.com/t5/Intel-ISA-Extensions/Indirect-Bit-Indexing-and-Set/m-p/1025841#M5063</link>
    <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;jimdempseyatthecove wrote:&lt;BR /&gt;Use the byte encode YX as an index into a table of 256-bit bit masks (one bit set per mask)&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;I understand that your proposal is for the full solution but I have tested the simplified case with a LUT (code below) to see how the timings compare&lt;/P&gt;

&lt;P&gt;the speed is roughly the same as the best score so far, with 496 - 498 ms *when compiled for an SSE2 target* (probably the same speed for a generic x86 target); it requires a single code path and is faster than Vladimir's&amp;nbsp;proposal with intrinsics; moreover, it is directly&amp;nbsp;usable for element counts that are not a multiple of 16 (with no wasted computation for padded elements), so this is my favorite solution so far&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;__int64 XYLUT[256]; // entry i: X mask in bits 8..39, Y mask in bits 0..7

_forceinline unsigned int set_bitsCv2(const unsigned char *yx, unsigned int &amp;amp;y_ret)
{
&amp;nbsp; unsigned __int64 xy_res = 0;
&amp;nbsp; for (int i=0; i&amp;lt;16; i++)
&amp;nbsp;&amp;nbsp;&amp;nbsp; xy_res |= XYLUT[yx[i]]; // one load ORs in both the X and the Y mask
&amp;nbsp; y_ret = xy_res &amp;amp; 0xFF;
&amp;nbsp; return xy_res &amp;gt;&amp;gt; 8;
}

void init()
{
&amp;nbsp; for (int i=0; i&amp;lt;256; i++)
&amp;nbsp; {
&amp;nbsp;&amp;nbsp;&amp;nbsp; const unsigned int x = 1u &amp;lt;&amp;lt; (i &amp;amp; 0x1F),
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; y = 1u &amp;lt;&amp;lt; (i &amp;gt;&amp;gt; 5);
&amp;nbsp;&amp;nbsp;&amp;nbsp; XYLUT[i] = __int64(x) &amp;lt;&amp;lt; 8 | y;
&amp;nbsp; }
}
&lt;/PRE&gt;
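
&lt;P&gt;For readers without the MSVC/Intel-specific types, the same table idea can be sketched portably. This is only a sketch under assumptions: the names g_xy_lut, init_xy_lut and set_bits_lut are made up for illustration, the element count is passed in instead of being fixed at 16, and the code has not been timed against the figures quoted in this thread.&lt;/P&gt;

<![CDATA[
```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical names; portable stand-ins for the MSVC-specific types.
// Entry i holds the 32-bit X mask in bits 8..39 and the 8-bit Y mask in bits 0..7.
static std::uint64_t g_xy_lut[256];

// Fill the table once before the first lookup.
void init_xy_lut()
{
    for (unsigned i = 0; i < 256; ++i)
    {
        const std::uint64_t x = std::uint64_t(1) << (i & 0x1F); // column bit (lower 5 bits)
        const std::uint64_t y = std::uint64_t(1) << (i >> 5);   // row bit (upper 3 bits)
        g_xy_lut[i] = (x << 8) | y;
    }
}

// OR together the table entries for n source bytes.
// Returns the X mask; the Y mask is written to y_ret.
std::uint32_t set_bits_lut(const unsigned char *yx, std::size_t n, std::uint32_t &y_ret)
{
    assert(n == 0 || yx != nullptr);
    std::uint64_t xy = 0;
    for (std::size_t i = 0; i < n; ++i)
        xy |= g_xy_lut[yx[i]];
    y_ret = static_cast<std::uint32_t>(xy & 0xFF);
    return static_cast<std::uint32_t>(xy >> 8);
}
```
]]>

&lt;P&gt;Because the count is a parameter, a trailing block shorter than 16 bytes needs no padding, which is the property mentioned above for element counts that are not a multiple of 16.&lt;/P&gt;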

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;it's interesting to note that the Intel compiler avoids using gather instructions when targeting AVX2, and for good reason: when forcing the use of gather with&amp;nbsp;the example below&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;_forceinline int set_bitsCv3(const unsigned char *yx, unsigned int &amp;amp;y_ret)
{
&amp;nbsp; __m256i vxy256 = _mm256_setzero_si256();
&amp;nbsp; for (int i=0; i&amp;lt;16; i+=4)
&amp;nbsp; { 
&amp;nbsp;&amp;nbsp;&amp;nbsp; const __m128i vindex = _mm_cvtepu8_epi32((__m128i &amp;amp;)yx[i]); 
&amp;nbsp;&amp;nbsp;&amp;nbsp; vxy256 = _mm256_or_si256(vxy256,_mm256_i32gather_epi64(XYLUT,vindex,8));
&amp;nbsp; }
&amp;nbsp; const __m128i vxy128 = _mm_or_si128(_mm256_extractf128_si256(vxy256,0),_mm256_extractf128_si256(vxy256,1));
&amp;nbsp; const unsigned __int64 xy_res = _mm_cvtsi128_si64(_mm_or_si128(_mm_unpackhi_epi64(vxy128,vxy128),vxy128));
&amp;nbsp; y_ret = xy_res &amp;amp; 0xFF;
&amp;nbsp; return xy_res &amp;gt;&amp;gt; 8; 
}&lt;/PRE&gt;

&lt;P&gt;I measured very poor timings, around 1270 ms which is more than&amp;nbsp;2.5x worse than the scalar LUT version&lt;/P&gt;

&lt;P&gt;Broadwell should provide better scores with gather (TBC)&lt;/P&gt;</description>
    <pubDate>Wed, 17 Dec 2014 20:57:00 GMT</pubDate>
    <dc:creator>bronxzv</dc:creator>
    <dc:date>2014-12-17T20:57:00Z</dc:date>
    <item>
      <title>Indirect Bit Indexing and Set</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Indirect-Bit-Indexing-and-Set/m-p/1025827#M5049</link>
      <description>&lt;P&gt;Hello everyone!&lt;/P&gt;

&lt;P&gt;This is my first post, so please be patient :)&lt;BR /&gt;
	I've a very interesting problem. I also have a ready working solution, but this solution does not make me happy.&lt;BR /&gt;
	First, I will try to describe the problem as exactly as possible.&lt;BR /&gt;
	1. Let I (I for Image) be a 2D array of bytes.&lt;BR /&gt;
	2. Each byte contains 2 independent indices - say the upper 3 bits are the Y-index and the lower 5 bits are the X-index.&lt;BR /&gt;
	3. This easily defines a translation I-&amp;gt;YX.&lt;BR /&gt;
	4. Note also that YX can be described as a 2D bit image with only 8 rows and 32 columns, which makes exactly 256 binary cells.&lt;BR /&gt;
	5. The full solution requires setting to 1 each cell&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;&amp;nbsp;addressed by I (see 1 and 2).&lt;BR /&gt;
	6. An accepted solution can be reduced to separately calculated "set rows" and "set columns" - meaning 8 bits for the rows and 32 bits for the columns.&lt;BR /&gt;
	7. Two 32-bit registers/variables are then sufficient as output for the accepted solution.&lt;/SPAN&gt;&lt;BR /&gt;
	&amp;nbsp;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;I really have no idea how to implement this efficiently.&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;Moreover, I have found no instruction that converts a number to "set bits", and I also have no idea how to do such a thing for a complete XMM register.&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;The current solution uses an array of 256 integers where each entry is addressed by the byte index. Each entry is counted, not just set, but this is not required.&lt;/P&gt;

&lt;P&gt;Any better ideas?&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;Many thanks in advance!&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;void countByteIndex( int width, int height, int dStep, &amp;nbsp;int iStep, &amp;nbsp;void* &amp;nbsp;_D, void* _I, void* _R)&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;{&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;int x, y;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; const int xStep = dStep;&lt;/SPAN&gt;&lt;BR /&gt;
	&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;&amp;nbsp;const int yStep = iStep;&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;uint8* srcX = (uint8*)_D;&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;uint8* srcY = (uint8*)_I;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;__int32* dst = (__int32*)_R;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;for( y = 0; y &amp;lt; height; y++) // single line at once&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;{&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;register const __m128i *srcX0 = (const __m128i *)(srcX);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;register const __m128i *srcY0 = (const __m128i *)(srcY);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;register __m128i sX0, sY0;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;register int r0, r1, r2, r3;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; for( x = 0; x &amp;lt; width; x += 16 ) // 16 bytes at once per line&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;{&lt;BR /&gt;
	&amp;nbsp;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; sX0 = _mm_load_si128( srcX0++ ); // Loads 128-bit value. Aligned.&lt;/SPAN&gt;&lt;BR /&gt;
	&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;&amp;nbsp;sY0 = _mm_load_si128( srcY0++ ); // Loads 128-bit value. Aligned.&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;sX0 = _mm_and_si128( sX0, sY0 ); // Mask destination offset.&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;// Index sX0&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;r0 = _mm_extract_epi8(sX0, 0);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;r1 = _mm_extract_epi8(sX0, 1);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;r2 = _mm_extract_epi8(sX0, 2);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;r3 = _mm_extract_epi8(sX0, 3);&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;(*(dst+r0))++; // sufficient is also&amp;nbsp;&lt;SPAN style="line-height: 19.5120010375977px;"&gt;(*(dst+r0)) = 1 for all&amp;nbsp;(*(dst+..))++&lt;/SPAN&gt;&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;(*(dst+r1))++;&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;(*(dst+r2))++;&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;(*(dst+r3))++;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;r0 = _mm_extract_epi8(sX0, 4);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;r1 = _mm_extract_epi8(sX0, 5);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;r2 = _mm_extract_epi8(sX0, 6);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;r3 = _mm_extract_epi8(sX0, 7);&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;(*(dst+r0))++;&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;(*(dst+r1))++;&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;(*(dst+r2))++;&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;(*(dst+r3))++;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;r0 = _mm_extract_epi8(sX0, 8);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;r1 = _mm_extract_epi8(sX0, 9);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;r2 = _mm_extract_epi8(sX0, 10);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;r3 = _mm_extract_epi8(sX0, 11);&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;(*(dst+r0))++;&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;(*(dst+r1))++;&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;(*(dst+r2))++;&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;(*(dst+r3))++;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;r0 = _mm_extract_epi8(sX0, 12);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;r1 = _mm_extract_epi8(sX0, 13);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;r2 = _mm_extract_epi8(sX0, 14);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;r3 = _mm_extract_epi8(sX0, 15);&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;(*(dst+r0))++;&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;(*(dst+r1))++;&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;(*(dst+r2))++;&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;(*(dst+r3))++;&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;}&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;srcX += xStep;&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;srcY += yStep;&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;}&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp; };&lt;/P&gt;</description>
      <pubDate>Mon, 15 Dec 2014 15:50:36 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Indirect-Bit-Indexing-and-Set/m-p/1025827#M5049</guid>
      <dc:creator>Alexander_L_1</dc:creator>
      <dc:date>2014-12-15T15:50:36Z</dc:date>
    </item>
    <item>
      <title>Quote:Alexander L. wrote:I've</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Indirect-Bit-Indexing-and-Set/m-p/1025828#M5050</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;Alexander L. wrote:&lt;BR /&gt;I've a very interesting problem. I also have a ready working solution, but this solution does not make me happy.&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;Your solution doesn't match your description (the description is to set&amp;nbsp;flags in a bitmap, the "solution" counts things), btw something I'd&amp;nbsp;advise is to always start from high-level source code before toying with optimizations, it will also help other people better understand what you want to achieve&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;Alexander L. wrote:&lt;BR /&gt;&lt;BR /&gt;
	First, I will try to describe the problem as exactly as possible.&lt;BR /&gt;
	1. Let I (I for Image) be a 2D array of bytes.&lt;BR /&gt;
	2. Each byte contains 2 independent indices - say the upper 3 bits are the Y-index and the lower 5 bits are the X-index.&lt;BR /&gt;
	3. This easily defines a translation I-&amp;gt;YX.&lt;BR /&gt;
	4. Note also that YX can be described as a 2D bit image with only 8 rows and 32 columns, which makes exactly 256 binary cells.&lt;BR /&gt;
	5. The full solution requires setting to 1 each cell&amp;nbsp;addressed by I (see 1 and 2).&lt;BR /&gt;
	6. An accepted solution can be reduced to separately calculated "set rows" and "set columns" - meaning 8 bits for the rows and 32 bits for the columns.&lt;BR /&gt;
	7. Two 32-bit registers/variables are then sufficient as output for the accepted solution.&lt;P&gt;&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;If I got it right this can be coded in a few lines of C++ where the core loop body will be:&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; YX[srcElt&amp;gt;&amp;gt;5] |= 1 &amp;lt;&amp;lt; (srcElt &amp;amp; 0x1f);&lt;/PRE&gt;

&lt;P&gt;with &lt;EM&gt;srcElt &lt;/EM&gt;a source byte in &lt;EM&gt;I &lt;/EM&gt;and &lt;EM&gt;YX &lt;/EM&gt;an 8 x 32-bit array (i.e. with 1 bit per "cell")&amp;nbsp;holding the result, all &lt;EM&gt;YX &lt;/EM&gt;elements initially set&amp;nbsp;to 0&lt;/P&gt;
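
&lt;P&gt;Wrapped into a compilable function, that loop body could look like the sketch below. The function name set_cells and its parameters are hypothetical, chosen only for illustration; the caller is assumed to zero the bitmap beforehand, and an unsigned constant is shifted so that column 31 is well defined.&lt;/P&gt;

<![CDATA[
```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Set one bit per source byte in an 8 x 32-bit bitmap (1 bit per "cell").
// yx_bitmap must be zeroed by the caller before the first call.
void set_cells(const unsigned char *src, std::size_t n, std::uint32_t yx_bitmap[8])
{
    assert(n == 0 || src != nullptr);
    for (std::size_t i = 0; i < n; ++i)
    {
        const unsigned row = src[i] >> 5;   // upper 3 bits: Y index, 0..7
        const unsigned col = src[i] & 0x1F; // lower 5 bits: X index, 0..31
        yx_bitmap[row] |= std::uint32_t(1) << col;
    }
}
```
]]>

&lt;P&gt;A plain scalar version like this is mainly useful as a reference to validate any vectorized variant against.&lt;/P&gt;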

&lt;P&gt;at 1st sight it looks challenging to vectorize since the YX R/W access requires gather/scatter, but since there are only 8 entries you should be able to put the whole 256-bit bitmap in a register using AVX2 (btw something missing from your specs is your target ISA)&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 16 Dec 2014 17:58:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Indirect-Bit-Indexing-and-Set/m-p/1025828#M5050</guid>
      <dc:creator>bronxzv</dc:creator>
      <dc:date>2014-12-16T17:58:00Z</dc:date>
    </item>
    <item>
      <title>Dear bronxzv</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Indirect-Bit-Indexing-and-Set/m-p/1025829#M5051</link>
      <description>&lt;P&gt;Dear&amp;nbsp;&lt;SPAN style="line-height: 19.5120010375977px;"&gt;bronxzv&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;bronxzv wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;Your solution doesn't match your description (the description is to set&amp;nbsp;flags in a bitmap, the "solution" counts things), btw something I'd&amp;nbsp;advise is to always start from high-level source code before toying with optimizations, it will also help other people better understand what you want to achieve&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="line-height: 19.5120010375977px;"&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="line-height: 19.5120010375977px;"&gt;You are entirely correct: my solution does more than the problem description requires. Initially I counted entries by index, but this is obviously not required, so I changed the problem description a little. Moreover, the solution will be thread-parallelized, so counting can't work correctly anyway. I noted this in a comment in the provided (unchanged) code as &lt;/SPAN&gt;&lt;BR /&gt;
	&lt;EM&gt;&lt;SPAN style="font-size: 12px; line-height: 18px;"&gt;&amp;nbsp;(*(dst+r0))++; // sufficient is also&amp;nbsp;(*(dst+r0)) = 1 for all&amp;nbsp;(*(dst+..))++&lt;/SPAN&gt;&lt;/EM&gt;&lt;SPAN style="line-height: 19.5120010375977px;"&gt;&lt;EM&gt;,&lt;/EM&gt;&lt;BR /&gt;
	though it would surely have been better to make this more visible.&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="line-height: 19.5120010375977px;"&gt;I haven't written any other, high-level source code. This is because I learned assembler programming first, over 25 years ago. The presented intrinsic code is very readable to me, much more so than all the high-level SHIFT, AND, etc. ;) I have also learned that it is most helpful to describe problems in words rather than in code, because code omits some assumptions and can contain errors - this was my motivation.&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="line-height: 19.5120010375977px;"&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;bronxzv wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;If I got it right this can be coded in a few lines of C++ where the core loop body will be:&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; YX[srcElt&amp;gt;&amp;gt;5] |= 1 &amp;lt;&amp;lt; (srcElt &amp;amp; 0x1f);&lt;/PRE&gt;

&lt;P&gt;with &lt;EM&gt;srcElt &lt;/EM&gt;a source byte in &lt;EM&gt;I &lt;/EM&gt;and &lt;EM&gt;YX &lt;/EM&gt;an 8 x 32-bit array (i.e. with 1 bit per "cell")&amp;nbsp;holding the result, all &lt;EM&gt;YX &lt;/EM&gt;elements initially set&amp;nbsp;to 0&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="line-height: 19.5120010375977px;"&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;OK, a high-level procedural description, using some notation, would be:&lt;/P&gt;

&lt;PRE class="brush:cpp;" style="font-size: 13px; line-height: 19.5120010375977px;"&gt;YX[ (srcElt&amp;gt;&amp;gt;5), (srcElt &amp;amp; 0x1f) ] |= 1;&lt;/PRE&gt;

&lt;P&gt;&lt;SPAN style="line-height: 19.5120010375977px;"&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;bronxzv wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;at 1st sight it looks challenging to vectorize since the YX R/W access requires gather/scatter, but since there are only 8 entries you should be able to put the whole 256-bit bitmap in a register using AVX2 (btw something missing from your specs is your target ISA)&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;This is the key to the problem.&lt;/P&gt;

&lt;P&gt;As described above, a sufficient solution to the problem is to compute two independent vectors Y and X, as follows.&lt;BR /&gt;
	&lt;SPAN style="line-height: 19.5120010375977px; font-size: 1em;"&gt;The core loop body will be:&lt;/SPAN&gt;&lt;/P&gt;

&lt;PRE class="brush:cpp;" style="font-size: 13px; line-height: 19.5120010375977px;"&gt;   Y[srcElt&amp;gt;&amp;gt;5] |= 1;
&lt;SPAN style="line-height: 19.5120010375977px;"&gt;   X[srcElt &amp;amp; 0x1f] |=1;&lt;/SPAN&gt;&lt;/PRE&gt;

&lt;P&gt;&lt;SPAN style="line-height: 19.5120010375977px;"&gt;As we can see, because it is sufficient to work with 0 or 1 only, the vectors can be coded bit-wise, so it is enough to maintain an 8-bit Y vector and a 32-bit X vector. This is what the last two points of the problem description state:&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="color: rgb(96, 96, 96); line-height: 19.5120010375977px; background-color: rgb(238, 238, 238);"&gt;6. An accepted solution can be reduced to separately calculated "set rows" and "set columns" - meaning 8 bits for the rows and 32 bits for the columns.&lt;/SPAN&gt;&lt;BR style="color: rgb(96, 96, 96); line-height: 19.5120010375977px;" /&gt;
	&lt;SPAN style="color: rgb(96, 96, 96); line-height: 19.5120010375977px; background-color: rgb(238, 238, 238);"&gt;7. Two 32-bit registers/variables are then sufficient as output for the accepted solution.&lt;/SPAN&gt;&lt;/P&gt;
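
&lt;P&gt;The reduced form described in points 6 and 7 can be sketched in plain scalar code as follows. The struct RowColMasks and the function set_rows_cols are hypothetical names introduced only for this illustration; this is a reference sketch, not the vectorized solution announced in this post.&lt;/P&gt;

<![CDATA[
```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical result type: one bit per used column (x) and per used row (y).
struct RowColMasks
{
    std::uint32_t x; // bit c set if some byte had X-index c (lower 5 bits)
    std::uint8_t  y; // bit r set if some byte had Y-index r (upper 3 bits)
};

// Accumulate the "set columns" and "set rows" masks over n source bytes.
RowColMasks set_rows_cols(const unsigned char *src, std::size_t n)
{
    assert(n == 0 || src != nullptr);
    RowColMasks m = {0, 0};
    for (std::size_t i = 0; i < n; ++i)
    {
        m.x |= std::uint32_t(1) << (src[i] & 0x1F);
        m.y = static_cast<std::uint8_t>(m.y | (1u << (src[i] >> 5)));
    }
    return m;
}
```
]]>

&lt;P&gt;Both masks together fit easily into two 32-bit variables, as point 7 says; the rows and columns are tracked independently, so the information about which exact cells were hit is lost.&lt;/P&gt;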

&lt;P&gt;&lt;SPAN style="line-height: 19.5120010375977px;"&gt;Next, this should work on an SSE-only processor, i.e. without AVX2.&lt;BR /&gt;
	&lt;BR /&gt;
	Furthermore, we use the MS compiler, and after some time I ran into a really bogus problem (sometimes it works, sometimes not, depending on whether the sun is shining) with some SSE intrinsics (instruction not supported on processor during code execution) when AVX2 is enabled. This drove me really crazy, but that is another story. In short, AVX2 is currently unusable for me.&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="line-height: 19.5120010375977px;"&gt;All right, I hope the problem description is now clear, because after 4 days of thinking about it (I started last week), today I found a key part of a new, short vectorized solution.&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="line-height: 19.5120010375977px;"&gt;Because I think this may be of interest to other people, I will describe a fully vectorized, very compact solution of only a few lines of code separately in my next reply. But I hope some experienced developer can beat my new solution. If not, getting a ready-made solution will make our market competitors happy :)&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 16 Dec 2014 21:49:53 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Indirect-Bit-Indexing-and-Set/m-p/1025829#M5051</guid>
      <dc:creator>Alexander_L_1</dc:creator>
      <dc:date>2014-12-16T21:49:53Z</dc:date>
    </item>
    <item>
      <title>I fail to see how the way you</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Indirect-Bit-Indexing-and-Set/m-p/1025830#M5052</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;Alexander L. wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;I haven't written any other, high-level source code. This is because I learned assembler programming first, over 25 years ago.&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;I'm quite sure spending just a few hours&amp;nbsp;with a good book on C basics will help you write cleaner/simpler code, even if based on intrinsics&lt;/P&gt;

&lt;P&gt;for ex. a classical &lt;EM&gt;dst[r0]++ &lt;/EM&gt;is equivalent to (and more readable&amp;nbsp;than) your&amp;nbsp;&amp;nbsp;&lt;EM&gt;(*(dst+r0))++&lt;/EM&gt;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;anyway, the&amp;nbsp;point of my code snippet was to lead to a compilable solution (a full&amp;nbsp;C++ test program) that can be validated, not some pseudo-code notation like the one you use for 2D array access&lt;/P&gt;

&lt;P&gt;having such a program available is also nice as a baseline performance point to compare your hand-optimized code against: poorly optimized code with intrinsics may well be slower than what the compiler spits out in a second or two, even after several days of hard work&lt;/P&gt;

&lt;P&gt;btw, for the simplified solution, all that you have to do is:&lt;/P&gt;

&lt;P&gt;X |= &amp;nbsp;1 &amp;lt;&amp;lt; (srcElt &amp;amp; 0x1f);&lt;/P&gt;

&lt;P&gt;Y&amp;nbsp;|= &amp;nbsp;1 &amp;lt;&amp;lt; (srcElt &amp;gt;&amp;gt; 5);&lt;/P&gt;

&lt;P&gt;with X a 32-bit integer and Y a byte, i.e. no array access, exactly as per 6. in your specs&lt;/P&gt;</description>
      <pubDate>Tue, 16 Dec 2014 22:35:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Indirect-Bit-Indexing-and-Set/m-p/1025830#M5052</guid>
      <dc:creator>bronxzv</dc:creator>
      <dc:date>2014-12-16T22:35:00Z</dc:date>
    </item>
    <item>
      <title>Quote:bronxzv wrote:</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Indirect-Bit-Indexing-and-Set/m-p/1025831#M5053</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;bronxzv wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;I'm quite sure spending just a few hours&amp;nbsp;with a good book on C basics will help you write cleaner/simpler code, even if based on intrinsics&lt;/SPAN&gt;&lt;BR /&gt;
	&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;for ex. a classical &lt;/SPAN&gt;&lt;EM style="font-size: 1em; line-height: 1.5;"&gt;dst[r0]++ &lt;/EM&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;is equivalent to (and more readable&amp;nbsp;than) your&amp;nbsp;&amp;nbsp;&lt;/SPAN&gt;&lt;EM style="font-size: 1em; line-height: 1.5;"&gt;(*(dst+r0))++&lt;/EM&gt;&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;Yes, this will be more readable :) The code was written in assembly language years ago for another problem - this was a quick adaptation. To be precise, as many books say, ++&lt;EM style="font-size: 1em; line-height: 1.5;"&gt;dst[r0]&amp;nbsp;&lt;/EM&gt;should be preferred.&lt;/P&gt;

&lt;P&gt;But all this is not the key point of the question. The question was how to vectorize and optimize the whole thing.&lt;/P&gt;

&lt;P&gt;We can extract each byte to a general-purpose register (big latency), split it into Y and X parts, move each part to the "CL" register, then shift 1 by "CL" (a very slow special operation with huge latency that has stayed unoptimized for years on Intel processors) and combine the results with OR. That will take a lot of code and a lot of cycles.&lt;/P&gt;

&lt;P&gt;Tomorrow I will post the core idea of how to do this in a simple vectorized way. Today it is way too late ;) &amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 17 Dec 2014 00:37:40 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Indirect-Bit-Indexing-and-Set/m-p/1025831#M5053</guid>
      <dc:creator>Alexander_L_1</dc:creator>
      <dc:date>2014-12-17T00:37:40Z</dc:date>
    </item>
    <item>
      <title>Quote:Alexander L. wrote</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Indirect-Bit-Indexing-and-Set/m-p/1025832#M5054</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;Alexander L. wrote:&lt;BR /&gt;shift 1 by "CL" (a very slow special operation with huge latency that has stayed unoptimized for years on Intel processors) &lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;not that slow since several cores back&lt;/P&gt;

&lt;P&gt;for ex. SHR/SHL reg,cl is 1.5 clock rcp throughput&amp;nbsp;in modern Intel cores (Sandy Bridge and later), isn't it? btw they were&amp;nbsp;even faster on previous cores such as Westmere/Nehalem, maybe you have Pentium 4 in mind?&lt;/P&gt;

&lt;P&gt;moreover,&amp;nbsp;VPSLLVD/VPSRAVD (8 parallel 32-bit shifts with fully independent variable shift counts) and the like are 2 clock rcp throughput on Haswell, that's 4 variable reg,reg 32-bit shifts per clock, really not bad if you ask me&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;Alexander L. wrote:&lt;BR /&gt;Tomorrow I will post the core idea of how to do this in a simple vectorized way. Today it is way too late ;)&amp;nbsp;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;OK, I look forward to it&lt;/P&gt;</description>
      <pubDate>Wed, 17 Dec 2014 01:13:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Indirect-Bit-Indexing-and-Set/m-p/1025832#M5054</guid>
      <dc:creator>bronxzv</dc:creator>
      <dc:date>2014-12-17T01:13:00Z</dc:date>
    </item>
    <item>
      <title>Hello Alexander,</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Indirect-Bit-Indexing-and-Set/m-p/1025833#M5055</link>
      <description>&lt;P&gt;Hello Alexander,&lt;BR /&gt;
	&lt;BR /&gt;
	Here is my solution. It's ~2.7x as fast as a simple:&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;for (int i = 0; i &amp;lt; 16; i++)&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;{&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;x_res |= 1 &amp;lt;&amp;lt; (yx[i] &amp;amp; 0x1F);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;y_res |= 1 &amp;lt;&amp;lt; (yx[i] &amp;gt;&amp;gt; 5);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;}&lt;BR /&gt;
	Hope yours is considerably faster :)&lt;BR /&gt;
	&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;===&lt;BR /&gt;
	&lt;BR /&gt;
	__inline unsigned char or_bytes(__m128i v)&lt;/SPAN&gt;&lt;BR /&gt;
	&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;{&lt;/SPAN&gt;&lt;BR /&gt;
	&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;&amp;nbsp; &amp;nbsp; unsigned char ret = 0;&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;__m128i&amp;nbsp;&amp;nbsp; &amp;nbsp;r;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;r = _mm_unpackhi_epi64(v, v);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;v = _mm_or_si128(v, r);&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;r = _mm_shuffle_epi32(v, 1);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;v = _mm_or_si128(v, r);&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;r = _mm_shufflelo_epi16(v, 1);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;v = _mm_or_si128(v, r);&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;r = _mm_srli_epi16(v, 8);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;v = _mm_or_si128(v, r);&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;return _mm_cvtsi128_si32(v);&lt;BR /&gt;
	}&lt;/P&gt;

&lt;P&gt;__inline &lt;SPAN style="font-size: 12.8000020980835px; line-height: 15.6096038818359px;"&gt;unsigned int&lt;/SPAN&gt; set_bits(&lt;SPAN style="font-size: 12.8000020980835px; line-height: 15.6096038818359px;"&gt;unsigned char&lt;/SPAN&gt;&amp;nbsp;*yx, &lt;SPAN style="font-size: 12.8000020980835px; line-height: 15.6096038818359px;"&gt;unsigned int&lt;/SPAN&gt; &amp;amp;y_ret)&lt;BR /&gt;
	{&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;__m128i&amp;nbsp;&amp;nbsp; &amp;nbsp;bit = _mm_set_epi32(0, 0, 0x80402010, 0x08040201);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;__m128i&amp;nbsp;&amp;nbsp; &amp;nbsp;byte0 = _mm_set_epi32(0, 0, 0, 0x000000FF);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;__m128i&amp;nbsp;&amp;nbsp; &amp;nbsp;byte1 = _mm_set_epi32(0, 0, 0, 0x0000FF00);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;__m128i&amp;nbsp;&amp;nbsp; &amp;nbsp;byte2 = _mm_set_epi32(0, 0, 0, 0x00FF0000);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;__m128i&amp;nbsp;&amp;nbsp; &amp;nbsp;byte3 = _mm_set_epi32(0, 0, 0, 0xFF000000);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;__m128i&amp;nbsp;&amp;nbsp; &amp;nbsp;bytei, mask;&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;__m128i&amp;nbsp;&amp;nbsp; &amp;nbsp;mask3 = _mm_set1_epi8(0x03);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;__m128i&amp;nbsp;&amp;nbsp; &amp;nbsp;mask7 = _mm_set1_epi8(0x07);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;__m128i&amp;nbsp;&amp;nbsp; &amp;nbsp;v, y;&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;__m128i&amp;nbsp;&amp;nbsp; &amp;nbsp;xbits, ybits;&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;__m128i&amp;nbsp;&amp;nbsp; &amp;nbsp;xbits0, xbits1, xbits2, xbits3;&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;unsigned int&amp;nbsp;&amp;nbsp; &amp;nbsp;x_res, y_res;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;ybits = _mm_setzero_si128();&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;xbits0 = xbits1 = xbits2 = xbits3 = _mm_setzero_si128();&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;v = _mm_loadu_si128((__m128i *)yx);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;y = _mm_srli_epi16(v, 5);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;y = _mm_and_si128(y, mask7);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;y = _mm_shuffle_epi8(bit, y);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;ybits = _mm_or_si128(ybits, y);&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;xbits = _mm_and_si128(v, mask7);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;xbits = _mm_shuffle_epi8(bit, xbits);&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;bytei = _mm_srli_epi16(v, 3);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;bytei = _mm_and_si128(bytei, mask3);&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;mask = _mm_shuffle_epi8(byte0, bytei);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;xbits0 = _mm_or_si128(xbits0, _mm_and_si128(xbits, mask));&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;mask = _mm_shuffle_epi8(byte1, bytei);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;xbits1 = _mm_or_si128(xbits1, _mm_and_si128(xbits, mask));&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;mask = _mm_shuffle_epi8(byte2, bytei);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;xbits2 = _mm_or_si128(xbits2, _mm_and_si128(xbits, mask));&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;mask = _mm_shuffle_epi8(byte3, bytei);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;xbits3 = _mm_or_si128(xbits3, _mm_and_si128(xbits, mask));&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;y_res = or_bytes(ybits);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;x_res =&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;(or_bytes(xbits0) &amp;lt;&amp;lt; 0) |&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;(or_bytes(xbits1) &amp;lt;&amp;lt; 8) |&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;(or_bytes(xbits2) &amp;lt;&amp;lt; 16) |&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;(or_bytes(xbits3) &amp;lt;&amp;lt; 24);&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;y_ret = y_res;&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;return x_res;&lt;BR /&gt;
	}&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 17 Dec 2014 11:04:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Indirect-Bit-Indexing-and-Set/m-p/1025833#M5055</guid>
      <dc:creator>Vladimir_Sedach</dc:creator>
      <dc:date>2014-12-17T11:04:00Z</dc:date>
    </item>
    <item>
      <title>Quote:Vladimir Sedach wrote:</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Indirect-Bit-Indexing-and-Set/m-p/1025834#M5056</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;Vladimir Sedach wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;Hello Alexander,&lt;/P&gt;

&lt;P&gt;That's my solution. It's ~2.7 times as fast as a simple loop:&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;for (int i = 0; i &amp;lt; 16; i++)&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;{&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;x_res |= 1 &amp;lt;&amp;lt; (yx[i] &amp;amp; 0x1F);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;y_res |= 1 &amp;lt;&amp;lt; (yx[i] &amp;gt;&amp;gt; 5);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;}&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;It's faster for legacy SSE2 targets, but the simplistic C version above will be vectorizable for AVX2 targets (maybe after a bit of refactoring) and may well end up faster on modern cores. That assumes more than 16 elements: the original specs don't mention such a low number of elements, and 16 is the best case for SSE2; I'd choose 32 for AVX2 and 64 for AVX-512.&lt;/P&gt;</description>
      <pubDate>Wed, 17 Dec 2014 12:13:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Indirect-Bit-Indexing-and-Set/m-p/1025834#M5056</guid>
      <dc:creator>bronxzv</dc:creator>
      <dc:date>2014-12-17T12:13:00Z</dc:date>
    </item>
    <item>
      <title>Quote:bronxzv wrote:</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Indirect-Bit-Indexing-and-Set/m-p/1025835#M5057</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;bronxzv wrote:&lt;BR /&gt;&lt;BR /&gt;
	&lt;BR /&gt;
	&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;it's faster for legacy SSE2 targets but the simplistic C version above&amp;nbsp;will be vectorizable&amp;nbsp;for AVX2 targets (maybe after a bit of refactoring) and may well end up faster for modern cores (assuming more than 16 elements, the original specs don't mention such a low nr of elements, 16 is the best case for SSE2, I'll choose 32 for AVX2 and 64 for AVX-512)&lt;/SPAN&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;BR /&gt;
	&lt;BR /&gt;
	You're absolutely right (except for SSE2 -- it is actually SSSE3).&lt;BR /&gt;
	Though the boss (Alexander) doesn't want AVX for some mysterious reason.&lt;BR /&gt;
	Let's wait for the SSEx version he is so proud of.&lt;BR /&gt;
	&lt;BR /&gt;
	BTW, Alexander, VC isn't a good choice for SSE/AVX projects. It is (or was) buggy and produces slow code.&lt;BR /&gt;
	&lt;BR /&gt;
	Forgot to say: I'm using MinGW 4.8.2 on &lt;SPAN style="font-size: 12.8000020980835px; line-height: 15.6096038818359px;"&gt;64-bit&amp;nbsp;&lt;/SPAN&gt;Windows on a Haswell machine.&lt;BR /&gt;
	The "simple" version is 46% faster with Intel C than &lt;SPAN style="font-size: 12.8000020980835px; line-height: 15.6096038818359px;"&gt;with MinGW, while the SSE version is 11% slower with Intel C.&lt;BR /&gt;
	All with just the O2 option set.&lt;/SPAN&gt;&lt;BR /&gt;
	&amp;nbsp;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 17 Dec 2014 14:30:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Indirect-Bit-Indexing-and-Set/m-p/1025835#M5057</guid>
      <dc:creator>Vladimir_Sedach</dc:creator>
      <dc:date>2014-12-17T14:30:00Z</dc:date>
    </item>
    <item>
      <title>Quote:Vladimir Sedach wrote:</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Indirect-Bit-Indexing-and-Set/m-p/1025836#M5058</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;Vladimir Sedach wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;Let's wait for the SSEx version he is so proud of.&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;That makes me think: now that the problem is clearly defined and quite simple, it looks like a good candidate for a coding contest. I'll try to find some time to propose my favorite solution(s).&lt;/P&gt;</description>
      <pubDate>Wed, 17 Dec 2014 14:43:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Indirect-Bit-Indexing-and-Set/m-p/1025836#M5058</guid>
      <dc:creator>bronxzv</dc:creator>
      <dc:date>2014-12-17T14:43:00Z</dc:date>
    </item>
    <item>
      <title>Quote:Vladimir Sedach wrote:</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Indirect-Bit-Indexing-and-Set/m-p/1025837#M5059</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;Vladimir Sedach wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;You're absolutely right&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;just to be sure about AVX2 vectorization, I tested&amp;nbsp;the code as is (full func. below)&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;unsigned int set_bitsC(const unsigned char *yx, unsigned int &amp;amp;y_ret)
{
&amp;nbsp; unsigned int x_res = 0, y_res = 0;
&amp;nbsp; for (int i=0; i&amp;lt;16; i++)
&amp;nbsp; {
&amp;nbsp;&amp;nbsp;&amp;nbsp; x_res |= 1 &amp;lt;&amp;lt; (yx[i] &amp;amp; 0x1F);
&amp;nbsp;&amp;nbsp;&amp;nbsp; y_res |= 1 &amp;lt;&amp;lt; (yx[i] &amp;gt;&amp;gt; 5);
&amp;nbsp; }
&amp;nbsp; y_ret = y_res;
&amp;nbsp; return x_res;
}
&lt;/PRE&gt;

&lt;P&gt;and the Intel compiler vectorizes it well (fully unrolled, like your solution); see the ASM dump below&lt;/P&gt;

&lt;PRE class="brush:plain;"&gt;PUBLIC ?set_bitsC@@YAIPEBEAEAI@Z
?set_bitsC@@YAIPEBEAEAI@Z&amp;nbsp;PROC 
; parameter 1(yx): rcx
; parameter 2(y_ret): rdx
.B1.1::&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ; Preds .B1.0

;;; {

$LN0:
$LN1:

;;;&amp;nbsp;&amp;nbsp; unsigned int x_res = 0, y_res = 0;
;;;&amp;nbsp;&amp;nbsp; for (int i=0; i&amp;lt;16; i++)
;;;&amp;nbsp;&amp;nbsp; {
;;;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; x_res |= 1 &amp;lt;&amp;lt; (yx&lt;I&gt; &amp;amp; 0x1F);
;;;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; y_res |= 1 &amp;lt;&amp;lt; (yx&lt;I&gt; &amp;gt;&amp;gt; 5);

&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; vpmovzxbw ymm5, XMMWORD PTR [rcx]&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ;18.5
$LN2:
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; vpsraw&amp;nbsp;&amp;nbsp;&amp;nbsp; ymm1, ymm5, 5&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ;18.5
$LN3:
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; vmovdqu&amp;nbsp;&amp;nbsp; ymm4, YMMWORD PTR [_2il0floatpacket.2]&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ;17.5
$LN4:
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; vmovdqu&amp;nbsp;&amp;nbsp; ymm3, YMMWORD PTR [_2il0floatpacket.3]&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ;17.5
$LN5:
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; vextracti128 xmm5, ymm1, 1&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ;18.5
$LN6:
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; vpmovsxwd ymm0, xmm1&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ;18.5
$LN7:
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; vpmovsxwd ymm1, xmm5&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ;18.5
$LN8:
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; vpsllvd&amp;nbsp;&amp;nbsp; ymm2, ymm4, ymm0&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ;18.5
$LN9:
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; vpsllvd&amp;nbsp;&amp;nbsp; ymm0, ymm4, ymm1&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ;18.5
$LN10:
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; vpor&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ymm2, ymm2, ymm0&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ;14.33
$LN11:
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; vextracti128 xmm1, ymm2, 1&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ;14.33
$LN12:
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; vpor&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; xmm0, xmm2, xmm1&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ;14.33
$LN13:
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; vpshufd&amp;nbsp;&amp;nbsp; xmm5, xmm0, 14&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ;14.33
$LN14:
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; vpmovzxbd ymm2, QWORD PTR [rcx]&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ;17.5
$LN15:
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; vpor&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; xmm1, xmm0, xmm5&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ;14.33
$LN16:
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; vpand&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ymm5, ymm2, ymm3&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ;17.5
$LN17:
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; vpshufd&amp;nbsp;&amp;nbsp; xmm0, xmm1, 57&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ;14.33
$LN18:
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; vpor&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; xmm1, xmm1, xmm0&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ;14.33
$LN19:
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; vpmovzxbd ymm2, QWORD PTR [8+rcx]&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ;17.5
$LN20:
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; vpand&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ymm3, ymm2, ymm3&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ;17.5
$LN21:

;;;&amp;nbsp;&amp;nbsp; }
;;;&amp;nbsp;&amp;nbsp; y_ret = y_res;

&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; vmovd&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; DWORD PTR [rdx], xmm1&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ;20.3
$LN22:
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; vpsllvd&amp;nbsp;&amp;nbsp; ymm0, ymm4, ymm5&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ;17.5
$LN23:
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; vpsllvd&amp;nbsp;&amp;nbsp; ymm4, ymm4, ymm3&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ;17.5
$LN24:
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; vpor&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ymm0, ymm0, ymm4&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ;14.22
$LN25:
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; vextracti128 xmm2, ymm0, 1&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ;14.22
$LN26:
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; vpor&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; xmm3, xmm0, xmm2&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ;14.22
$LN27:
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; vpshufd&amp;nbsp;&amp;nbsp; xmm4, xmm3, 14&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ;14.22
$LN28:
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; vpor&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; xmm5, xmm3, xmm4&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ;14.22
$LN29:
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; vpshufd&amp;nbsp;&amp;nbsp; xmm0, xmm5, 57&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ;14.22
$LN30:
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; vpor&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; xmm2, xmm5, xmm0&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ;14.22
$LN31:
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; vmovd&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; eax, xmm2&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ;14.22
$LN32:

;;;&amp;nbsp;&amp;nbsp; return x_res;

&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; vzeroupper&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ;21.3
$LN33:
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ret&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ;21.3&lt;/I&gt;&lt;/I&gt;&lt;/PRE&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 17 Dec 2014 15:24:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Indirect-Bit-Indexing-and-Set/m-p/1025837#M5059</guid>
      <dc:creator>bronxzv</dc:creator>
      <dc:date>2014-12-17T15:24:00Z</dc:date>
    </item>
    <item>
      <title>Quote:bronxzv wrote:</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Indirect-Bit-Indexing-and-Set/m-p/1025838#M5060</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;bronxzv wrote:&lt;BR /&gt;&lt;BR /&gt;
	&lt;BR /&gt;
	&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;makes me think that now that the problem is clearly defined&amp;nbsp;and quite&amp;nbsp;simple it looks like a good candidate for some coding contest, I'll try to find some time to propose&amp;nbsp;my fav.&amp;nbsp;solution(s)&lt;/SPAN&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;BR /&gt;
	&lt;BR /&gt;
	Well,&lt;BR /&gt;
	let's get ready to rumble )&lt;BR /&gt;
	&lt;BR /&gt;
	Though, there's a high risk to be defeated by Intel's AVX2 code that is ~9 times faster than the one w/o SIMD ))&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 17 Dec 2014 15:25:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Indirect-Bit-Indexing-and-Set/m-p/1025838#M5060</guid>
      <dc:creator>Vladimir_Sedach</dc:creator>
      <dc:date>2014-12-17T15:25:00Z</dc:date>
    </item>
    <item>
      <title>Not having a full description</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Indirect-Bit-Indexing-and-Set/m-p/1025839#M5061</link>
      <description>&lt;P&gt;I don't have a full description of your whole problem, but with your 25 years of programming experience you should be able to recognize that an application-wide optimal solution may be quite different from an optimized core routine. That said, let me offer an alternative solution that should be easy enough to try.&lt;/P&gt;

&lt;P&gt;Premises:&lt;/P&gt;

&lt;P&gt;1) Your byte-encoded YX value is a bit index into a 256-bit structure, which can be laid out to fit within a single cache line&lt;BR /&gt;
	2) Two of these 256-bit structures can be contained within a single cache line&lt;BR /&gt;
	3) L1 cache hit latency is on the order of 4 clock cycles&lt;/P&gt;

&lt;P&gt;Suggestion:&lt;/P&gt;

&lt;P&gt;Use the byte-encoded YX as an index into a table of 256-bit bit masks (one bit set per mask)&lt;/P&gt;

&lt;P&gt;Setting would be an OR, testing would be an AND.&lt;/P&gt;

&lt;P&gt;You may want to have SSE, AVX and AVX2 versions.&lt;/P&gt;

&lt;P&gt;Also note that the newer AVX... instructions have a compare that sets a byte/word/dword/qword mask in ymm (zmm), and a second instruction that packs the MSB of each field into a GP register.&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 17 Dec 2014 16:22:46 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Indirect-Bit-Indexing-and-Set/m-p/1025839#M5061</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2014-12-17T16:22:46Z</dc:date>
    </item>
    <item>
      <title>Quote:Vladimir Sedach wrote</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Indirect-Bit-Indexing-and-Set/m-p/1025840#M5062</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;Vladimir Sedach wrote:&lt;BR /&gt;Though, there's a high risk to&amp;nbsp;be&amp;nbsp;defeated&amp;nbsp;by&amp;nbsp;Intel's AVX2 code that is ~9 times faster than the one w/o SIMD ))&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;I just measured the simplistic C++ version compiled for AVX2 as 3x faster than the same code compiled for SSE2, and 1.7x faster than your hand-optimized SSSE3 version&lt;/P&gt;

&lt;P&gt;btw I validated your solution over&amp;nbsp;&amp;gt; 1e9 random examples and I confirm it's all OK&lt;/P&gt;

&lt;P&gt;my&amp;nbsp;measurements are as follows:&lt;/P&gt;

&lt;P&gt;100 000 000 runs over 4 KB of random data (includes computation of control checksums)&lt;BR /&gt;
	Core i7 4770K @ 3.5 GHz (turbo enabled)&lt;BR /&gt;
	Intel C++ compiler v. 14.0.4.237&lt;/P&gt;

&lt;P&gt;Vladimir's hand optimized w/ intrinsics (64-bit SSSE3 target)&amp;nbsp; 834 - 837 ms&lt;BR /&gt;
	simplistic C++ (64-bit SSE2 target) 1468 - 1470 ms&lt;BR /&gt;
	simplistic C++ (64-bit AVX2 target) 490 - 493 ms&lt;/P&gt;</description>
      <pubDate>Wed, 17 Dec 2014 17:41:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Indirect-Bit-Indexing-and-Set/m-p/1025840#M5062</guid>
      <dc:creator>bronxzv</dc:creator>
      <dc:date>2014-12-17T17:41:00Z</dc:date>
    </item>
    <item>
      <title>Quote:jimdempseyatthecove</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Indirect-Bit-Indexing-and-Set/m-p/1025841#M5063</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;jimdempseyatthecove wrote:&lt;BR /&gt;Use the byte-encoded YX as an index into a table of 256-bit bit masks (one bit set per mask)&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;I understand that your proposal is for the full solution but I have tested the simplified case with a LUT (code below) to see how the timings compare&lt;/P&gt;

&lt;P&gt;the speed is roughly the same as the best score so far, with 496 - 498 ms *when compiled for an SSE2 target* (probably the same speed for a generic x86 target). It requires a single code path and is faster than Vladimir's&amp;nbsp;proposal with intrinsics. Moreover, it is directly&amp;nbsp;usable for element counts that are not a multiple of 16 (with no wasted computation for padded elements), so this is my favorite solution so far&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;__int64 XYLUT[256]; // one entry per possible YX byte, filled by init()

_forceinline unsigned int set_bitsCv2(const unsigned char *yx, unsigned int &amp;amp;y_ret)
{
&amp;nbsp; unsigned __int64 xy_res = 0;
&amp;nbsp; for (int i=0; i&amp;lt;16; i++)
&amp;nbsp;&amp;nbsp;&amp;nbsp; xy_res |= XYLUT[yx[i]];
&amp;nbsp; y_ret = xy_res &amp;amp; 0xFF; // low 8 bits: y mask
&amp;nbsp; return xy_res &amp;gt;&amp;gt; 8; // next 32 bits: x mask
}

void init()
{
&amp;nbsp; for (int i=0; i&amp;lt;256; i++)
&amp;nbsp; {
&amp;nbsp;&amp;nbsp;&amp;nbsp; // 1u keeps the shift by 31 well defined
&amp;nbsp;&amp;nbsp;&amp;nbsp; const unsigned int x = 1u &amp;lt;&amp;lt; (i &amp;amp; 0x1F),
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; y = 1u &amp;lt;&amp;lt; (i &amp;gt;&amp;gt; 5);
&amp;nbsp;&amp;nbsp;&amp;nbsp; XYLUT[i] = __int64(x) &amp;lt;&amp;lt; 8 | y;
&amp;nbsp; }
}

&lt;/PRE&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;
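&lt;P&gt;For readers without the Intel or Microsoft compiler, here is a minimal portable sketch of the same LUT idea. The names xy_lut, init_xy_lut and set_bits_lut and the length parameter n are assumptions made for illustration (they are not part of the code measured above); n shows how a count that is not a multiple of 16 can be handled without padding.&lt;/P&gt;

<![CDATA[
```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical portable variant of the LUT: entry i holds the y bit
// (index i >> 5) in bits 0..7 and the x bit (index i & 0x1F) in bits 8..39.
static std::uint64_t xy_lut[256];

void init_xy_lut()
{
  for (int i = 0; i < 256; i++)
  {
    const std::uint64_t x = std::uint64_t(1) << (i & 0x1F);
    const std::uint64_t y = std::uint64_t(1) << (i >> 5);
    xy_lut[i] = (x << 8) | y;
  }
}

// n is a hypothetical element count; it need not be a multiple of 16.
std::uint32_t set_bits_lut(const unsigned char *yx, std::size_t n, std::uint32_t &y_ret)
{
  assert(yx != nullptr || n == 0);
  std::uint64_t xy_res = 0;
  for (std::size_t i = 0; i < n; i++)
    xy_res |= xy_lut[yx[i]];          // one table load and one OR per element
  y_ret = std::uint32_t(xy_res & 0xFF);
  return std::uint32_t(xy_res >> 8);
}
```
]]>

&lt;P&gt;No timing is claimed for this variant; it only mirrors the table layout used above.&lt;/P&gt;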

&lt;P&gt;it's interesting to note that the Intel compiler avoids using gather instructions when targeting AVX2, for good reason: when forcing the use of gather with&amp;nbsp;the example below&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;_forceinline int set_bitsCv3(const unsigned char *yx, unsigned int &amp;amp;y_ret)
{
&amp;nbsp; __m256i vxy256 = _mm256_setzero_si256();
&amp;nbsp; for (int i=0; i&amp;lt;16; i+=4)
&amp;nbsp; {
&amp;nbsp;&amp;nbsp;&amp;nbsp; // zero-extend 4 YX bytes to 32-bit indices, then gather 4 LUT entries (scale 8 = sizeof(__int64))
&amp;nbsp;&amp;nbsp;&amp;nbsp; const __m128i vindex = _mm_cvtepu8_epi32((__m128i &amp;amp;)yx[i]);
&amp;nbsp;&amp;nbsp;&amp;nbsp; vxy256 = _mm256_or_si256(vxy256,_mm256_i32gather_epi64(XYLUT,vindex,8));
&amp;nbsp; }
&amp;nbsp; // OR the four 64-bit lanes together
&amp;nbsp; const __m128i vxy128 = _mm_or_si128(_mm256_extractf128_si256(vxy256,0),_mm256_extractf128_si256(vxy256,1));
&amp;nbsp; const unsigned __int64 xy_res = _mm_cvtsi128_si64(_mm_or_si128(_mm_unpackhi_epi64(vxy128,vxy128),vxy128));
&amp;nbsp; y_ret = xy_res &amp;amp; 0xFF;
&amp;nbsp; return xy_res &amp;gt;&amp;gt; 8;
}&lt;/PRE&gt;

&lt;P&gt;I measured very poor timings, around 1270 ms which is more than&amp;nbsp;2.5x worse than the scalar LUT version&lt;/P&gt;

&lt;P&gt;Broadwell should provide better scores with gather (TBC)&lt;/P&gt;</description>
      <pubDate>Wed, 17 Dec 2014 20:57:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Indirect-Bit-Indexing-and-Set/m-p/1025841#M5063</guid>
      <dc:creator>bronxzv</dc:creator>
      <dc:date>2014-12-17T20:57:00Z</dc:date>
    </item>
    <item>
      <title>  Hello there,</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Indirect-Bit-Indexing-and-Set/m-p/1025842#M5064</link>
      <description>&lt;P&gt;&amp;nbsp; Hello there,&lt;/P&gt;

&lt;P&gt;it's very nice to get interesting info and so much help!&lt;/P&gt;

&lt;P&gt;First to clarify why AVX2 is currently not an option.&lt;/P&gt;

&lt;P&gt;We have many systems in the field with Intel i5 CPUs that lack AVX2.&lt;BR /&gt;
	The next problem is a Visual Studio bug - the project is much more than just C/C++, and it is very large.&lt;BR /&gt;
	&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;The bug is really crazy - if AVX2 is used, old SSE-coded methods sometimes produce an invalid instruction exception.&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;Just as&amp;nbsp;bronxzv mentioned, the first choice was to use&amp;nbsp;&lt;SPAN style="font-size: 12px; line-height: 18px;"&gt;VPSLLVD/VPSRAVD, but this option fails because of a compiler bug. Moreover, we would need to expand bytes to words and words to dwords to use these instructions.&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 12px; line-height: 18px;"&gt;The next idea was a simple extract and shift (then OR), but that needs a shift instruction with a variable count, and these instructions are too slow.&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 12px; line-height: 18px;"&gt;As far as I can see, BTS (bit test and set) would also be a perfect alternative - with the same slowness.&lt;/SPAN&gt;&lt;/P&gt;
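
&lt;P&gt;For reference, the extract-and-shift idea can be sketched in scalar form as below. This is only a baseline under assumptions: the name set_bits_shift and the length parameter n are hypothetical, and the byte layout (low 5 bits = X index, high 3 bits = Y index, both 0-based) is the one used in bronxzv's LUT code earlier in this thread, not something fixed by this post.&lt;/P&gt;

<![CDATA[
```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar "extract, shift, then OR" baseline.
// Assumed layout of each byte: bits 0..4 = x index, bits 5..7 = y index.
std::uint32_t set_bits_shift(const unsigned char *yx, std::size_t n, std::uint32_t &y_ret)
{
  assert(yx != nullptr || n == 0);
  std::uint32_t x_bits = 0, y_bits = 0;
  for (std::size_t i = 0; i < n; i++)
  {
    x_bits |= std::uint32_t(1) << (yx[i] & 0x1F); // variable-count shift per element
    y_bits |= std::uint32_t(1) << (yx[i] >> 5);
  }
  y_ret = y_bits;
  return x_bits;
}
```
]]>

&lt;P&gt;Each element costs two variable-count shifts here, which is the cost the other approaches in this thread try to avoid.&lt;/P&gt;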

&lt;P&gt;&lt;SPAN style="font-size: 12px; line-height: 18px;"&gt;So my next try was simply to search for "&lt;/SPAN&gt;intel-instruction+search+mask&lt;SPAN style="font-size: 12px; line-height: 18px;"&gt;", which led me to:&amp;nbsp;&lt;/SPAN&gt;&lt;A href="http://www.strchr.com/strcmp_and_strlen_using_sse_4.2" target="_blank"&gt;http://www.strchr.com/strcmp_and_strlen_using_sse_4.2&lt;/A&gt;&lt;/P&gt;

&lt;P&gt;All these instructions were new to me, so the next step was to find a well-documented explanation:&lt;BR /&gt;
	"Intel® Advanced Vector Extensions Programming Reference" - nothing, they are just mentioned as instructions.&lt;BR /&gt;
	"Intel® Architecture Instruction Set Extensions Programming Reference" - does not help either.&lt;BR /&gt;
	"Intel® 64 and IA-32 Architectures&amp;nbsp;Optimization Reference Manual" - just showed me the way :)&lt;/P&gt;

&lt;P&gt;So I tried modifying the question and voila, the same question as mine:&amp;nbsp;http://stackoverflow.com/questions/10068541/efficient-way-to-create-a-bit-mask-from-multiple-numbers-possibly-using-sse-sse2&lt;/P&gt;

&lt;P&gt;So the solution was terribly simple, after 2 hours of trial and error (because it just did not work as expected) and after finding the last piece of the puzzle in a comment on MSDN for another string-search instruction (http://msdn.microsoft.com/en-us/library/bb513993.aspx) : "&lt;SPAN style="color: rgb(42, 42, 42); font-family: 'Segoe UI', 'Lucida Grande', Verdana, Arial, Helvetica, sans-serif; line-height: 18px;"&gt;One if&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN class="parameter" style="font-style: italic; color: rgb(42, 42, 42); font-family: 'Segoe UI', 'Lucida Grande', Verdana, Arial, Helvetica, sans-serif; line-height: 18px;"&gt;b&lt;/SPAN&gt;&lt;SPAN style="color: rgb(42, 42, 42); font-family: 'Segoe UI', 'Lucida Grande', Verdana, Arial, Helvetica, sans-serif; line-height: 18px;"&gt;&amp;nbsp;is does not contain the null character and the resulting mask is equal to zero. Otherwise, zero."&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="color: rgb(42, 42, 42); font-family: 'Segoe UI', 'Lucida Grande', Verdana, Arial, Helvetica, sans-serif; line-height: 18px;"&gt;This comment means that the search vector must not contain a 0 (zero value). After that I could write the core function (for overall optimization we just OR both high/low results at the end of the complete loop). Here is only the core function, for test purposes, so only the first two (bold) lines are of interest. Of course we must split the input YX value into separate Y and X values (as described above), but this is trivial.&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;const __m128i bl = _mm_set_epi8(16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;const __m128i bh = _mm_set_epi8(32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;const int mode2 = _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY;&lt;BR /&gt;
	__int32 getBitsForX(const __m128i&amp;amp; a)&lt;BR /&gt;
	{&lt;BR /&gt;
	&lt;STRONG&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;__m128i fullResultl = _mm_cmpistrm(a, bl, mode2); // set bit with position 1--16&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;__m128i fullResulth = _mm_cmpistrm(a, bh, mode2); // set bit with position 17--32&lt;/STRONG&gt;&lt;BR /&gt;
	&lt;EM&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;fullResulth = _mm_slli_si128(fullResulth, 2); // shift 2 bytes!&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;fullResultl = _mm_or_si128(fullResultl, fullResulth);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;__int32 res = _mm_extract_epi32(fullResultl, 0);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;return res;&lt;/EM&gt;&lt;BR /&gt;
	};&lt;/P&gt;

&lt;P&gt;So I ended up with a string-search instruction for a bit-set operation. That's amazing. Maybe it would be really helpful to mention that in the documentation for all other users. Without a search engine this solution would not have been possible, so I should not take the credit for it ;)&lt;/P&gt;

&lt;P&gt;Once again - many thanks to all! It's very interesting to see alternative solutions and what different compilers do.&lt;BR /&gt;
	And if I could use AVX2, I think a single instruction would be sufficient for all 32 bits.&lt;/P&gt;

&lt;P&gt;Just to note, X can't have a value of 0 by the problem description.&lt;BR /&gt;
	But Y can, so to use the same method with Y, 1 must be added to all bytes first.&lt;/P&gt;
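
&lt;P&gt;To make the role of the zero byte concrete, here is a scalar model of the comparison. It is an illustration only, not the real instruction: equal_any_mask_model and bits_for_indices are hypothetical names, and the model assumes the behaviour relied on in this thread (both operands are implicit-length strings that end at their first zero byte, and bit j of the result is set when byte j of the second operand occurs in the first).&lt;/P&gt;

<![CDATA[
```cpp
#include <cassert>
#include <cstdint>

// Scalar model of _mm_cmpistrm(a, b, _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY).
// Each operand is cut at its first 0 byte; bit j is set when b[j] occurs in a.
std::uint16_t equal_any_mask_model(const unsigned char a[16], const unsigned char b[16])
{
  int len_a = 0, len_b = 0;
  while (len_a < 16 && a[len_a] != 0) len_a++;
  while (len_b < 16 && b[len_b] != 0) len_b++;

  std::uint16_t mask = 0;
  for (int j = 0; j < len_b; j++)
    for (int i = 0; i < len_a; i++)
      if (a[i] == b[j])
      {
        mask = static_cast<std::uint16_t>(mask | (1u << j));
        break;
      }
  return mask;
}

// Hypothetical helper for 0-based indices 0..15: add 1 to every byte first,
// so that index 0 no longer terminates the string.
std::uint16_t bits_for_indices(const unsigned char idx[16])
{
  static const unsigned char positions[16] =
    { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 };
  unsigned char biased[16];
  for (int i = 0; i < 16; i++)
  {
    assert(idx[i] < 16);
    biased[i] = static_cast<unsigned char>(idx[i] + 1);
  }
  return equal_any_mask_model(biased, positions);
}
```
]]>

&lt;P&gt;With the indices 3, 0, 5 the unbiased model stops at the 0 byte and only sees the first element, while the biased helper reports bits 0, 3 and 5.&lt;/P&gt;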

&lt;P&gt;@jimdempseyatthecove: sad to say, but for many years my territory has been boring C#, WPF and such things - so I'm far behind on current processor technologies.&lt;/P&gt;

&lt;P&gt;So, maybe it can be done better?&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 18 Dec 2014 00:42:23 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Indirect-Bit-Indexing-and-Set/m-p/1025842#M5064</guid>
      <dc:creator>Alexander_L_1</dc:creator>
      <dc:date>2014-12-18T00:42:23Z</dc:date>
    </item>
    <item>
      <title>@Jim Dempsey - can you,</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Indirect-Bit-Indexing-and-Set/m-p/1025843#M5065</link>
      <description>&lt;P&gt;@&lt;SPAN style="font-size: 12px; line-height: 18px;"&gt;Jim Dempsey - can you, please, explain your solution, possible both with and without AVX2?&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 12px; line-height: 18px;"&gt;You wrtote:&lt;/SPAN&gt;&lt;/P&gt;

&lt;BLOCKQUOTE&gt;
	&lt;P&gt;&lt;SPAN style="font-size: 12px; line-height: 18px;"&gt;Also note that the newer AVX... instructions have a compare that set a byte/word/dword/qword mask in ymm (zmm) and a second instruction that packs the msb of the bit fields into a GP register.&lt;/SPAN&gt;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;

&lt;P&gt;Would this be the same solution? It looks very close to the solution I ended up with.&lt;/P&gt;</description>
      <pubDate>Thu, 18 Dec 2014 00:55:39 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Indirect-Bit-Indexing-and-Set/m-p/1025843#M5065</guid>
      <dc:creator>Alexander_L_1</dc:creator>
      <dc:date>2014-12-18T00:55:39Z</dc:date>
    </item>
    <item>
      <title>Quote:Alexander L. wrote:</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Indirect-Bit-Indexing-and-Set/m-p/1025844#M5066</link>
      <description>&lt;P&gt;void&lt;/P&gt;</description>
      <pubDate>Thu, 18 Dec 2014 02:36:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Indirect-Bit-Indexing-and-Set/m-p/1025844#M5066</guid>
      <dc:creator>bronxzv</dc:creator>
      <dc:date>2014-12-18T02:36:00Z</dc:date>
    </item>
    <item>
      <title>Quote:Alexander L. wrote:</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Indirect-Bit-Indexing-and-Set/m-p/1025845#M5067</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;Alexander L. wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;So I tried modifying the question and voila, the same question as mine:&amp;nbsp;&lt;A href="http://stackoverflow.com/questions/10068541/efficient-way-to-create-a-bit-mask-from-multiple-numbers-possibly-using-sse-sse2" target="_blank"&gt;http://stackoverflow.com/questions/10068541/efficient-way-to-create-a-bit-mask-from-multiple-numbers-possibly-using-sse-sse2&lt;/A&gt;&lt;/P&gt;

&lt;P&gt;So the solution was terribly simple, after 2 hours of trial and error (because it just did not work as expected) and after finding the last piece of the puzzle in a comment on MSDN for another string-search instruction (&lt;A href="http://msdn.microsoft.com/en-us/library/bb513993.aspx" target="_blank"&gt;http://msdn.microsoft.com/en-us/library/bb513993.aspx&lt;/A&gt;) : "One if&amp;nbsp;b&amp;nbsp;is does not contain the null character and the resulting mask is equal to zero. Otherwise, zero."&lt;/P&gt;

&lt;P&gt;This comment means that the search vector must not contain a 0 (zero value). After that I could write the core function (for overall optimization we just OR both high/low results at the end of the complete loop). Here is only the core function, for test purposes, so only the first two (bold) lines are of interest. Of course we must split the input YX value into separate Y and X values (as described above), but this is trivial.&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;const __m128i bl = _mm_set_epi8(16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;const __m128i bh = _mm_set_epi8(32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;const int mode2 = _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY;&lt;BR /&gt;
	__int32 getBitsForX(const __m128i&amp;amp; a)&lt;BR /&gt;
	{&lt;BR /&gt;
	&lt;STRONG&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;__m128i fullResultl = _mm_cmpistrm(a, bl, mode2); // set bit with position 1--16&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;__m128i fullResulth = _mm_cmpistrm(a, bh, mode2); // set bit with position 17--32&lt;/STRONG&gt;&lt;BR /&gt;
	&lt;EM&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;fullResulth = _mm_slli_si128(fullResulth, 2); // shift 2 bytes!&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;fullResultl = _mm_or_si128(fullResultl, fullResulth);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;__int32 res = _mm_extract_epi32(fullResultl, 0);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;return res;&lt;/EM&gt;&lt;BR /&gt;
	};&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;hey, it looks like you have a winner here!&lt;/P&gt;

&lt;P&gt;I got 284-286 ms with&amp;nbsp;the code below, directly adapted&amp;nbsp;from your example; this is 1.7x faster than the best preceding result&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;const __m128i blx = _mm_set_epi8(16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1),
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; bhx = _mm_set_epi8(32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17),
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; bly = _mm_set_epi8(0, 0, 0, 0, 0, 0, 0, 0, (7&amp;lt;&amp;lt;5)|1, (6&amp;lt;&amp;lt;5)|1, (5&amp;lt;&amp;lt;5)|1, (4&amp;lt;&amp;lt;5)|1, (3&amp;lt;&amp;lt;5)|1, (2&amp;lt;&amp;lt;5)|1, (1&amp;lt;&amp;lt;5)|1, (0&amp;lt;&amp;lt;5)|1);

const int mode2 = _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY;

const __m128i clearLSB = _mm_set1_epi8(0xE0), clearMSB = _mm_set1_epi8(0x1F), v1 = _mm_set1_epi8(1);

_forceinline __int32 get16Bits(const __m128i &amp;amp;a, const __m128i &amp;amp;bl)
{
&amp;nbsp; const __m128i tl = _mm_cmpistrm(a,bl,mode2); // set bit with position 1--16
&amp;nbsp; return _mm_cvtsi128_si32(tl);
};

_forceinline __int32 get32Bits(const __m128i &amp;amp;a, const __m128i &amp;amp;bh, const __m128i &amp;amp;bl)
{
&amp;nbsp; const __m128i tl = _mm_cmpistrm(a,bl,mode2), // set bit with position 1--16
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; th = _mm_cmpistrm(a,bh,mode2); // set bit with position 17--32
&amp;nbsp; return _mm_cvtsi128_si32(_mm_or_si128(tl,_mm_slli_si128(th,2)));
};

_forceinline unsigned int set_bitsCv4(const unsigned char *yx, unsigned int &amp;amp;y_ret)
{
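&amp;nbsp; const __m128i vxy = _mm_load_si128((__m128i *)yx); // aligned load: yx must be 16-byte aligned
&amp;nbsp; // bias both fields so that no byte is 0 (a 0 byte would end the implicit-length string):
&amp;nbsp; // x = (yx &amp;amp; 0x1F) + 1 gives 1..32, y = (yx &amp;amp; 0xE0) | 1 matches the entries of bly
&amp;nbsp; const __m128i vx = _mm_add_epi8(_mm_and_si128(vxy,clearMSB),v1), vy = _mm_or_si128(_mm_and_si128(vxy,clearLSB),v1);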
&amp;nbsp; y_ret = get16Bits(vy,bly);
&amp;nbsp; return get32Bits(vx,bhx,blx);
}&lt;/PRE&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 18 Dec 2014 02:36:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Indirect-Bit-Indexing-and-Set/m-p/1025845#M5067</guid>
      <dc:creator>bronxzv</dc:creator>
      <dc:date>2014-12-18T02:36:00Z</dc:date>
    </item>
    <item>
      <title>Quote:Alexander L. wrote:</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Indirect-Bit-Indexing-and-Set/m-p/1025846#M5068</link>
      <description>&lt;P&gt;void&lt;/P&gt;</description>
      <pubDate>Thu, 18 Dec 2014 02:59:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Indirect-Bit-Indexing-and-Set/m-p/1025846#M5068</guid>
      <dc:creator>bronxzv</dc:creator>
      <dc:date>2014-12-18T02:59:00Z</dc:date>
    </item>
  </channel>
</rss>

