<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>AVX - Vector shifts (Intel® ISA Extensions)</title>
    <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Vector-shifts/m-p/962579#M4440</link>
    <description>AVX - Vector shifts</description>
    <pubDate>Tue, 04 Feb 2014 23:03:00 GMT</pubDate>
    <dc:creator>Vladimir_Sedach</dc:creator>
    <dc:date>2014-02-04T23:03:00Z</dc:date>
    <item>
      <title>AVX - Vector shifts</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Vector-shifts/m-p/962570#M4431</link>
      <description>&lt;P&gt;I'm comparing two programs, one written with SSE and the other with AVX. My aim is to show that the AVX version runs twice as fast, but I'm losing something like 20% to some shift operations.&lt;/P&gt;
&lt;P&gt;I often need to rotate an AVX vector left by 1 byte. It seems that all the instructions I need will only be available with AVX2.&lt;/P&gt;
&lt;P&gt;At the moment I split the source __m256i vector into two __m128i halves, but this costs performance. Is there any other way to perform this operation? Why were shift operations not included in the AVX instruction set?&lt;/P&gt;
&lt;P&gt;Thanks in advance for your help. Here is the current version of my code:&lt;/P&gt;
&lt;P&gt;[cpp]&lt;/P&gt;
&lt;P&gt;&amp;nbsp; a1 = _mm256_castsi256_si128( _source ); // low 128 bits&lt;BR /&gt;&amp;nbsp; a2 = _mm256_extractf128_si256 ( _source,1 ); // high 128 bits&lt;BR /&gt;&amp;nbsp; &lt;BR /&gt;&amp;nbsp; b1 = _mm_slli_si128( a1,1);&lt;BR /&gt;&amp;nbsp; b2 = _mm_slli_si128( a2,1);&lt;BR /&gt;&amp;nbsp; a1 = _mm_srli_si128( a1,15); // byte carried out of the low half&lt;BR /&gt;&amp;nbsp; a2 = _mm_srli_si128( a2,15); // byte carried out of the high half&lt;BR /&gt;&amp;nbsp; &lt;BR /&gt;&amp;nbsp; _dest = _mm256_castsi128_si256 ( _mm_or_si128(b1,a2) );&lt;BR /&gt;&amp;nbsp; _dest = _mm256_insertf128_si256 ( _dest, _mm_or_si128(b2,a1), 1 );&lt;/P&gt;
&lt;P&gt;[/cpp]&lt;/P&gt;</description>
      <pubDate>Sun, 02 Dec 2012 04:25:12 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Vector-shifts/m-p/962570#M4431</guid>
      <dc:creator>ale3</dc:creator>
      <dc:date>2012-12-02T04:25:12Z</dc:date>
    </item>
    <item>
      <title>Yes, Intel AVX supports only</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Vector-shifts/m-p/962571#M4432</link>
      <description>Yes, Intel AVX supports only floating-point instructions at 256 bits. If you need other instructions, like the shift you describe, it is better to use the 128-bit instructions. You might save some instructions thanks to the non-destructive source operand that comes with AVX, but that's about it. For integer instructions on 256-bit registers, you will have to wait for AVX2.</description>
      <pubDate>Wed, 05 Dec 2012 22:12:01 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Vector-shifts/m-p/962571#M4432</guid>
      <dc:creator>Thomas_W_Intel</dc:creator>
      <dc:date>2012-12-05T22:12:01Z</dc:date>
    </item>
    <item>
      <title>So if I want to left shift an</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Vector-shifts/m-p/962572#M4433</link>
      <description>&lt;P&gt;So if I want to shift an __m256 floating-point vector left, is there no way to do it using AVX instructions?&lt;/P&gt;

&lt;P&gt;I was wondering whether a combination of _mm256_shuffle_ps and _mm256_permute_ps would do it. Is that possible? If so, I could not understand the meaning of the _MM_SHUFFLE macro in the context of _mm256_permute_ps. How should I use it?&lt;/P&gt;

&lt;P&gt;Or, if you still think 128-bit is the better choice, which 128-bit intrinsics should I use for that?&lt;/P&gt;</description>
      <pubDate>Fri, 31 Jan 2014 00:38:23 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Vector-shifts/m-p/962572#M4433</guid>
      <dc:creator>Silva__Rafael</dc:creator>
      <dc:date>2014-01-31T00:38:23Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;...I need to perform quite</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Vector-shifts/m-p/962573#M4434</link>
      <description>&amp;gt;&amp;gt;...I need to perform quite often a shift operation to rotate an Avx Vector 1 byte on the left...

Is it a cyclic operation or not? Does it mean that for a vector of &lt;STRONG&gt;N&lt;/STRONG&gt; elements, element[0] should be moved to element[N-1]?

And, of course, all the remaining vector elements are shifted to the left.</description>
      <pubDate>Fri, 31 Jan 2014 21:54:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Vector-shifts/m-p/962573#M4434</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2014-01-31T21:54:00Z</dc:date>
    </item>
    <item>
      <title>Quote:Sergey Kostrov wrote:</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Vector-shifts/m-p/962574#M4435</link>
      <description>&lt;BLOCKQUOTE&gt;Sergey Kostrov wrote:&lt;BR /&gt;

&lt;P&gt;Is it a cyclic operation or not? Does it mean that for a vector of &lt;STRONG&gt;N&lt;/STRONG&gt; elements, element[0] should be moved to element[N-1]?&lt;/P&gt;

&lt;/BLOCKQUOTE&gt;

&lt;P&gt;Not in my case; it is not cyclic. In my case element[0] is no longer needed.&lt;/P&gt;</description>
      <pubDate>Sat, 01 Feb 2014 17:19:05 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Vector-shifts/m-p/962574#M4435</guid>
      <dc:creator>Silva__Rafael</dc:creator>
      <dc:date>2014-02-01T17:19:05Z</dc:date>
    </item>
    <item>
      <title>rmendes.silva::</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Vector-shifts/m-p/962575#M4436</link>
      <description>&lt;P&gt;&lt;A href="http://software.intel.com/en-us/user/512404"&gt;rmendes.silva&lt;/A&gt;::&lt;BR /&gt;
	&lt;BR /&gt;
	Shifts left by 4 bytes:&lt;BR /&gt;
	__m256i&amp;nbsp;&amp;nbsp; &amp;nbsp;r = _mm256_setr_epi32(1, 2, 3, 4, 5, 6, 7, 8), r1, r2;&lt;BR /&gt;
	r1 = _mm256_slli_si256(r, 4);&lt;BR /&gt;
	r2 = _mm256_srli_si256(r, 12);&lt;BR /&gt;
	r2 = _mm256_permute2f128_si256(r2, r2, 0x08);&lt;BR /&gt;
	r = _mm256_or_si256(r1, r2);&lt;BR /&gt;
	&lt;BR /&gt;
	With AVX2:&lt;BR /&gt;
	&lt;BR /&gt;
	r1 = _mm256_permute2f128_si256(r, r, 0x08);&lt;BR /&gt;
	r = _mm256_alignr_epi8(r, r1, 12);&lt;BR /&gt;
	&amp;nbsp;&lt;/P&gt;
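
&lt;P&gt;A scalar sketch of what the first sequence computes may help when testing it. It is only a model under stated assumptions: each 128-bit lane is treated as four 32-bit elements, and the helper names (lane_shl1, lane_shr3, low_to_high, shl1_model) are made up for illustration and are not intrinsics.&lt;/P&gt;

```c
#include "assert.h"
#include "stdint.h"

/* Per-lane shift left by one 32-bit element (4 bytes), zero-filled.
   Each 128-bit lane of four elements shifts on its own. */
static void lane_shl1(const uint32_t in[8], uint32_t out[8])
{
    int lane, i;
    for (lane = 0; lane != 2; ++lane) {
        out[lane * 4] = 0;
        for (i = 1; i != 4; ++i)
            out[lane * 4 + i] = in[lane * 4 + i - 1];
    }
}

/* Per-lane shift right by three elements (12 bytes): the top element
   of each lane lands in position 0, the rest becomes zero. */
static void lane_shr3(const uint32_t in[8], uint32_t out[8])
{
    int lane, i;
    for (lane = 0; lane != 2; ++lane) {
        out[lane * 4] = in[lane * 4 + 3];
        for (i = 1; i != 4; ++i)
            out[lane * 4 + i] = 0;
    }
}

/* Effect of the 0x08 permute: low lane zeroed, high lane = old low lane. */
static void low_to_high(const uint32_t in[8], uint32_t out[8])
{
    int i;
    for (i = 0; i != 4; ++i) {
        out[i] = 0;
        out[i + 4] = in[i];
    }
}

/* Whole sequence: shift all eight elements left by one position. */
void shl1_model(const uint32_t r[8], uint32_t out[8])
{
    uint32_t r1[8], r2[8], carry[8];
    int i;

    assert(r != 0);
    assert(out != 0);

    lane_shl1(r, r1);        /* models the per-lane shift left by 4 bytes  */
    lane_shr3(r, r2);        /* models the per-lane shift right by 12 bytes */
    low_to_high(r2, carry);  /* moves the low lane's carry into the high lane */
    for (i = 0; i != 8; ++i)
        out[i] = r1[i] | carry[i];
}
```

&lt;P&gt;With the input 1, 2, ..., 8 used above, the model yields 0, 1, 2, 3, 4, 5, 6, 7. Without the low_to_high step, element 4 would come out as 0, which is why the cross-lane permute is needed.&lt;/P&gt;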

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 03 Feb 2014 09:14:43 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Vector-shifts/m-p/962575#M4436</guid>
      <dc:creator>Vladimir_Sedach</dc:creator>
      <dc:date>2014-02-03T09:14:43Z</dc:date>
    </item>
    <item>
      <title>Ok Vladimir, but this is for</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Vector-shifts/m-p/962576#M4437</link>
      <description>&lt;P&gt;OK, Vladimir, but this is for integer values. What about floating point? I didn't find anything that does the same for floats.&lt;/P&gt;</description>
      <pubDate>Tue, 04 Feb 2014 15:24:33 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Vector-shifts/m-p/962576#M4437</guid>
      <dc:creator>Silva__Rafael</dc:creator>
      <dc:date>2014-02-04T15:24:33Z</dc:date>
    </item>
    <item>
      <title>rmendes.silva::</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Vector-shifts/m-p/962577#M4438</link>
      <description>&lt;P&gt;rmendes.silva::&lt;/P&gt;

&lt;P&gt;It's almost the same with floats:&lt;BR /&gt;
	__m256 &amp;nbsp; &amp;nbsp;r, r1, r2;&lt;/P&gt;

&lt;P&gt;AVX:&lt;BR /&gt;
	r1 = _mm256_castsi256_ps(_mm256_slli_si256(_mm256_castps_si256(r), 4));&lt;BR /&gt;
	r2 = _mm256_castsi256_ps(_mm256_srli_si256(_mm256_castps_si256(r), 12));&lt;BR /&gt;
	r2 = _mm256_permute2f128_ps(r2, r2, 0x08);&lt;BR /&gt;
	r = _mm256_or_ps(r1, r2);&lt;/P&gt;

&lt;P&gt;AVX2:&lt;BR /&gt;
	r1 = _mm256_permute2f128_ps(r, r, 0x08);&lt;BR /&gt;
	r = _mm256_castsi256_ps(_mm256_alignr_epi8(_mm256_castps_si256(r), _mm256_castps_si256(r1), 12));&lt;/P&gt;

&lt;P&gt;If you need doubles, replace 4, 12 and 12 with 8, 8 and 8.&lt;BR /&gt;
	&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 04 Feb 2014 19:54:32 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Vector-shifts/m-p/962577#M4438</guid>
      <dc:creator>Vladimir_Sedach</dc:creator>
      <dc:date>2014-02-04T19:54:32Z</dc:date>
    </item>
    <item>
      <title>In Intel documentation _mm256</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Vector-shifts/m-p/962578#M4439</link>
      <description>&lt;P&gt;In Intel's documentation, _mm256_slli_si256 appears only under AVX2, so I assume it is not available with AVX. Is that right?&lt;/P&gt;</description>
      <pubDate>Tue, 04 Feb 2014 20:25:31 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Vector-shifts/m-p/962578#M4439</guid>
      <dc:creator>Silva__Rafael</dc:creator>
      <dc:date>2014-02-04T20:25:31Z</dc:date>
    </item>
    <item>
      <title>You're right. Perhaps, it's not</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Vector-shifts/m-p/962579#M4440</link>
      <description>&lt;P&gt;You're right. Perhaps it's not a good idea to do it with AVX at all.&lt;BR /&gt;
	I'm afraid this version won't be very fast.&lt;BR /&gt;
	&lt;BR /&gt;
	__m256&amp;nbsp;&amp;nbsp; &amp;nbsp;r;&lt;BR /&gt;
	__m128&amp;nbsp;&amp;nbsp; &amp;nbsp;r0, r1;&lt;/P&gt;

&lt;P&gt;r0 = _mm256_castps256_ps128(r);&lt;BR /&gt;
	r1 = _mm256_extractf128_ps(r, 1);&lt;BR /&gt;
	r1 = _mm_insert_ps(r1, r0, 0xF0); //r1[3] = r0[3]&lt;BR /&gt;
	r0 = _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(r0), 4));&lt;BR /&gt;
	r1 = _mm_shuffle_ps(r1, r1, _MM_SHUFFLE(2, 1, 0, 3)); //rotate&lt;BR /&gt;
	r = _mm256_insertf128_ps(_mm256_castps128_ps256(r0), r1, 1);&lt;/P&gt;

&lt;P&gt;If you don't care about r[0]:&lt;BR /&gt;
	r = _mm256_shuffle_ps(r, r, _MM_SHUFFLE(2, 1, 0, 3));&lt;BR /&gt;
	r0 = _mm256_castps256_ps128(r);&lt;BR /&gt;
	r1 = _mm256_extractf128_ps(r, 1);&lt;BR /&gt;
	r1 = _mm_move_ss(r1, r0);&lt;BR /&gt;
	r = _mm256_insertf128_ps(_mm256_castps128_ps256(r0), r1, 1);&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 04 Feb 2014 23:03:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Vector-shifts/m-p/962579#M4440</guid>
      <dc:creator>Vladimir_Sedach</dc:creator>
      <dc:date>2014-02-04T23:03:00Z</dc:date>
    </item>
    <item>
      <title>Yes, you're right about AVX.</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Vector-shifts/m-p/962580#M4441</link>
      <description>&lt;P&gt;Yes, you're right about AVX. It doesn't seem to be a good idea to do it with AVX if we want performance. I will try it, but I'm also afraid it won't be very fast. Thank you.&lt;/P&gt;</description>
      <pubDate>Wed, 05 Feb 2014 10:46:20 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Vector-shifts/m-p/962580#M4441</guid>
      <dc:creator>Silva__Rafael</dc:creator>
      <dc:date>2014-02-05T10:46:20Z</dc:date>
    </item>
    <item>
      <title>Maybe you have the chance to</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Vector-shifts/m-p/962581#M4442</link>
      <description>&lt;P&gt;Maybe you can arrange the data in memory in a layout that suits AVX before processing?&lt;/P&gt;

&lt;P&gt;You mentioned that the shift is not cyclic, so this might be an option.&lt;/P&gt;</description>
      <pubDate>Thu, 27 Feb 2014 11:59:16 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Vector-shifts/m-p/962581#M4442</guid>
      <dc:creator>Christian_M_2</dc:creator>
      <dc:date>2014-02-27T11:59:16Z</dc:date>
    </item>
    <item>
      <title>I've considered this</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Vector-shifts/m-p/962582#M4443</link>
      <description>&lt;P&gt;I've considered this, Christian, but I can't see how to do that without moving data around, which is exactly what I want to avoid.&lt;/P&gt;</description>
      <pubDate>Mon, 03 Mar 2014 18:42:27 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Vector-shifts/m-p/962582#M4443</guid>
      <dc:creator>Silva__Rafael</dc:creator>
      <dc:date>2014-03-03T18:42:27Z</dc:date>
    </item>
    <item>
      <title>rmendes.silva,</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Vector-shifts/m-p/962583#M4444</link>
      <description>&lt;P&gt;&lt;A href="http://software.intel.com/en-us/user/512404"&gt;rmendes.silva&lt;/A&gt;,&lt;BR /&gt;
	&lt;BR /&gt;
	SLL_256()&amp;nbsp;shifts left by an arbitrary number of elements.&lt;BR /&gt;
	All the "if" checks are removed by the optimizer, since "offs" is a constant.&lt;BR /&gt;
	It can be used as is with&amp;nbsp;__m256; for other vector types, just replace the casts.&lt;BR /&gt;
	&lt;BR /&gt;
	Please let me know whether it's fast or slow in your case.&lt;BR /&gt;
	It would also be nice to see a snippet of your code that uses the shift.&lt;BR /&gt;
	Perhaps a faster approach could be found.&lt;/P&gt;

&lt;P&gt;// r: result&lt;BR /&gt;
	// a: src vector&lt;BR /&gt;
	// offs: number of elements to shift&lt;BR /&gt;
	// elem_n: number of&amp;nbsp;&amp;nbsp;elements in vector (8 for float)&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;#define SLL_256(r, a, offs, elem_n) \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;{ \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;__m128&amp;nbsp;&amp;nbsp; &amp;nbsp;r0, r1; \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;const int&amp;nbsp;&amp;nbsp; &amp;nbsp;size = sizeof(a) / elem_n; \&lt;BR /&gt;
	\&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;if (!offs) \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;r = a; \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;else if (offs == elem_n / 2) \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;r = _mm256_permute2f128_ps(a, a, 0x08); \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;else if (offs &amp;gt;= elem_n) \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;r = _mm256_setzero_ps(); \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;else if (offs &amp;lt; elem_n / 2) \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;{ \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;r0 = _mm256_castps256_ps128(a); \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;r1 = _mm256_extractf128_ps(a, 1); \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;r1 = _mm_castsi128_ps(_mm_alignr_epi8(_mm_castps_si128(r1), _mm_castps_si128(r0), (elem_n / 2 - offs) * size)); \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;r0 = _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(r0), offs * size)); \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;r = _mm256_insertf128_ps(_mm256_castps128_ps256(r0), r1, 1); \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;} \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;else \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;{ \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;r0 = _mm256_castps256_ps128(a); \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;r0 = _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(r0), (offs - elem_n / 2) * size)); \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;r = _mm256_permute2f128_ps(_mm256_castps128_ps256(r0), _mm256_castps128_ps256(r0), 0x08); \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;} \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;}&lt;/P&gt;
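
&lt;P&gt;For checking the macro's results, a plain C reference of the same operation can help. This is only a sketch: sll_ref is a hypothetical helper name, it is fixed to 8 floats, and it assumes offs lies in the range 0..8.&lt;/P&gt;

```c
#include "assert.h"
#include "string.h"

/* Reference for a left shift by offs elements on 8 floats:
   r[i] = a[i - offs], with zeros shifted in at the low end.
   Assumption: offs is in the range 0..8 (not checked here). */
void sll_ref(const float a[8], float r[8], int offs)
{
    float pad[16] = {0};   /* pad[0..7] stay zero and supply the fill */

    assert(a != 0);
    assert(r != 0);

    /* Place the source after eight zero elements. */
    memcpy(pad + 8, a, 8 * sizeof(float));

    /* Reading eight floats starting offs elements earlier moves the data
       towards higher indices and pulls zeros in at the bottom. */
    memcpy(r, pad + 8 - offs, 8 * sizeof(float));
}
```

&lt;P&gt;Comparing SLL_256(r, a, offs, 8) against sll_ref for each constant offs from 0 to 8 should cover every branch of the macro.&lt;/P&gt;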

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 04 Mar 2014 12:39:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Vector-shifts/m-p/962583#M4444</guid>
      <dc:creator>Vladimir_Sedach</dc:creator>
      <dc:date>2014-03-04T12:39:00Z</dc:date>
    </item>
    <item>
      <title>Vladimir,</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Vector-shifts/m-p/962584#M4445</link>
      <description>&lt;P&gt;Vladimir,&lt;/P&gt;

&lt;P&gt;Great post. This should be a starting point for the Intel compiler intrinsic developers to offer an official _mm256_... intrinsic function, so that whenever the AVX design is improved, we mere users need&amp;nbsp;not revisit our code.&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Tue, 04 Mar 2014 13:27:57 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Vector-shifts/m-p/962584#M4445</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2014-03-04T13:27:57Z</dc:date>
    </item>
    <item>
      <title>Jim,</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Vector-shifts/m-p/962585#M4446</link>
      <description>&lt;P&gt;Jim,&lt;BR /&gt;
	&lt;BR /&gt;
	I really appreciate your words, thanks.&lt;BR /&gt;
	I've added a similar&amp;nbsp;SRL_256() to shift right.&lt;BR /&gt;
	This time they accept 256-bit vectors of any type.&lt;BR /&gt;
	I'm just a bit anxious that this version could be slower with a not very smart compiler.&lt;BR /&gt;
	&lt;BR /&gt;
	// r: result&lt;BR /&gt;
	// a: src vector&lt;BR /&gt;
	// offs: number of elements to shift (must be a const)&lt;BR /&gt;
	// elem_n: number of elements in vector (8 for "float")&lt;BR /&gt;
	#define SLL_256(r, a, offs, elem_n) \&lt;BR /&gt;
	{ \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;__m256i&amp;nbsp;&amp;nbsp; &amp;nbsp;*pr = (__m256i *)&amp;amp;r; \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;__m256i&amp;nbsp;&amp;nbsp; &amp;nbsp;*pa = (__m256i *)&amp;amp;a; \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;__m128i&amp;nbsp;&amp;nbsp; &amp;nbsp;r0, r1; \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;const int&amp;nbsp;&amp;nbsp; &amp;nbsp;size = sizeof(a) / elem_n; \&lt;BR /&gt;
	\&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;if (!offs) \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;*pr = *pa; \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;else if (offs == elem_n / 2) \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;*pr = _mm256_permute2f128_si256(*pa, *pa, 0x08); \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;else if (offs &amp;gt;= elem_n) \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;*pr = _mm256_setzero_si256(); \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;else if (offs &amp;lt; elem_n / 2) \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;{ \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;r0 = _mm256_castsi256_si128(*pa); \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;r1 = _mm256_extractf128_si256(*pa, 1); \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;r1 = _mm_alignr_epi8(r1, r0, (elem_n / 2 - offs) * size); \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;r0 = _mm_slli_si128(r0, offs * size); \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;*pr = _mm256_insertf128_si256(_mm256_castsi128_si256(r0), r1, 1); \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;} \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;else \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;{ \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;r0 = _mm256_castsi256_si128(*pa); \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;r0 = _mm_slli_si128(r0, (offs - elem_n / 2) * size); \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;*pr = _mm256_permute2f128_si256(_mm256_castsi128_si256(r0), _mm256_castsi128_si256(r0), 0x08); \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;} \&lt;BR /&gt;
	}&lt;/P&gt;

&lt;P&gt;#define SRL_256(r, a, offs, elem_n) \&lt;BR /&gt;
	{ \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;__m256i&amp;nbsp;&amp;nbsp; &amp;nbsp;*pr = (__m256i *)&amp;amp;r; \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;__m256i&amp;nbsp;&amp;nbsp; &amp;nbsp;*pa = (__m256i *)&amp;amp;a; \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;__m128i&amp;nbsp;&amp;nbsp; &amp;nbsp;r0, r1; \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;const int&amp;nbsp;&amp;nbsp; &amp;nbsp;size = sizeof(a) / elem_n; \&lt;BR /&gt;
	\&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;if (!offs) \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;*pr = *pa; \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;else if (offs == elem_n / 2) \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;*pr = _mm256_permute2f128_si256(*pa, *pa, 0x81); \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;else if (offs &amp;gt;= elem_n) \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;*pr = _mm256_setzero_si256(); \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;else if (offs &amp;lt; elem_n / 2) \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;{ \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;r0 = _mm256_castsi256_si128(*pa); \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;r1 = _mm256_extractf128_si256(*pa, 1); \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;r0 = _mm_alignr_epi8(r1, r0, offs * size); \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;r1 = _mm_srli_si128(r1, offs * size); \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;*pr = _mm256_insertf128_si256(_mm256_castsi128_si256(r0), r1, 1); \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;} \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;else \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;{ \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;r1 = _mm256_extractf128_si256(*pa, 1); \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;r1 = _mm_srli_si128(r1, (offs - elem_n / 2) * size); \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;*pr = _mm256_permute2f128_si256(_mm256_castsi128_si256(r1), _mm256_castsi128_si256(r1), 0x80); \&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;} \&lt;BR /&gt;
	}&lt;/P&gt;</description>
      <pubDate>Tue, 04 Mar 2014 15:40:38 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Vector-shifts/m-p/962585#M4446</guid>
      <dc:creator>Vladimir_Sedach</dc:creator>
      <dc:date>2014-03-04T15:40:38Z</dc:date>
    </item>
    <item>
      <title>Hi Vladimir,</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Vector-shifts/m-p/962586#M4447</link>
      <description>&lt;P&gt;Hi Vladimir,&lt;/P&gt;

&lt;P&gt;I will try this approach and will post here whether it gets faster or not. Thanks.&lt;/P&gt;</description>
      <pubDate>Wed, 05 Mar 2014 20:23:45 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Vector-shifts/m-p/962586#M4447</guid>
      <dc:creator>Silva__Rafael</dc:creator>
      <dc:date>2014-03-05T20:23:45Z</dc:date>
    </item>
  </channel>
</rss>

