<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Looking for smartest way to insert a DWORD into AVX register in Intel® ISA Extensions</title>
    <link>https://community.intel.com/t5/Intel-ISA-Extensions/Looking-for-smartest-way-to-insert-a-DWORD-into-AVX-register/m-p/927960#M3123</link>
    <description>&lt;P&gt;Hi all,&lt;/P&gt;
&lt;P&gt;I'm looking for the smartest (= fastest) way to insert a DWORD into an AVX register.&lt;/P&gt;
&lt;P&gt;Here is what I found so far:&lt;/P&gt;
&lt;P&gt;AVX vinsertps doesn't work because it clears the upper 128 bits, and its immediate can't address the upper 128 bits anyway.&lt;/P&gt;
&lt;P&gt;AVX vpinsrd doesn't work for the same reason and, sadly (unless the docs are wrong), hasn't been promoted to 256 bits in AVX2, even though the immediate has room to encode the insert position in a 256-bit vector.&lt;/P&gt;
&lt;P&gt;I can think of lots of multi-instruction workarounds, but I hoped the Intel engineers had a smart trick for this basic operation that I overlooked.&lt;/P&gt;
&lt;P&gt;Thanks,&lt;/P&gt;
&lt;P&gt;Elmar&lt;/P&gt;</description>
    <pubDate>Thu, 20 Jun 2013 12:39:18 GMT</pubDate>
    <dc:creator>Elmar</dc:creator>
    <dc:date>2013-06-20T12:39:18Z</dc:date>
    <item>
      <title>Looking for smartest way to insert a DWORD into AVX register</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Looking-for-smartest-way-to-insert-a-DWORD-into-AVX-register/m-p/927960#M3123</link>
      <description>&lt;P&gt;Hi all,&lt;/P&gt;
&lt;P&gt;I'm looking for the smartest (= fastest) way to insert a DWORD into an AVX register.&lt;/P&gt;
&lt;P&gt;Here is what I found so far:&lt;/P&gt;
&lt;P&gt;AVX vinsertps doesn't work because it clears the upper 128 bits, and its immediate can't address the upper 128 bits anyway.&lt;/P&gt;
&lt;P&gt;AVX vpinsrd doesn't work for the same reason and, sadly (unless the docs are wrong), hasn't been promoted to 256 bits in AVX2, even though the immediate has room to encode the insert position in a 256-bit vector.&lt;/P&gt;
&lt;P&gt;I can think of lots of multi-instruction workarounds, but I hoped the Intel engineers had a smart trick for this basic operation that I overlooked.&lt;/P&gt;
&lt;P&gt;Thanks,&lt;/P&gt;
&lt;P&gt;Elmar&lt;/P&gt;</description>
      <pubDate>Thu, 20 Jun 2013 12:39:18 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Looking-for-smartest-way-to-insert-a-DWORD-into-AVX-register/m-p/927960#M3123</guid>
      <dc:creator>Elmar</dc:creator>
      <dc:date>2013-06-20T12:39:18Z</dc:date>
    </item>
    <item>
      <title>Did you consider</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Looking-for-smartest-way-to-insert-a-DWORD-into-AVX-register/m-p/927961#M3124</link>
      <description>Did you consider this intrinsic function?

/*
 * Scalar to 128/256-bit vector broadcast operations.
 */
extern __m256i __ICL_INTRINCC &lt;STRONG&gt;_mm256_broadcastd_epi32&lt;/STRONG&gt;( __m128i );</description>
      <pubDate>Fri, 21 Jun 2013 03:01:17 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Looking-for-smartest-way-to-insert-a-DWORD-into-AVX-register/m-p/927961#M3124</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-06-21T03:01:17Z</dc:date>
    </item>
    <item>
      <title>Hi Sergey,</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Looking-for-smartest-way-to-insert-a-DWORD-into-AVX-register/m-p/927962#M3125</link>
      <description>&lt;P&gt;Hi Sergey,&lt;/P&gt;
&lt;P&gt;Thanks, but vpbroadcastd fills the entire vector. I want to insert a single dword at a given location (like vpinsrd), and I want to do it fast, without consuming an extra temporary register. Combining vpbroadcastd with vpblendd, for example, works, but that workaround needs an extra register.&lt;/P&gt;
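&lt;P&gt;For reference, a minimal sketch of that vpbroadcastd + vpblendd workaround with intrinsics, assuming AVX2 and GCC or Clang (the target attribute is compiler-specific). The helper name insert_dword5_avx2 and the fixed index 5 are made up for illustration; the blend mask is an immediate, so the index has to be a compile-time constant.&lt;/P&gt;

```c
#include "immintrin.h"  /* quoted form; falls back to the system header */
#include "stdint.h"

/* Hypothetical helper: copy 8 dwords from src to dst, replacing dword 5 with x. */
__attribute__((target("avx2")))
void insert_dword5_avx2(int32_t dst[8], const int32_t src[8], int32_t x)
{
    __m256i v   = _mm256_loadu_si256((const __m256i *)src);
    __m256i tmp = _mm256_set1_epi32(x);             /* broadcast x to all 8 dwords (the temporary) */
    __m256i r   = _mm256_blend_epi32(v, tmp, 0x20); /* vpblendd: mask bit 5 set, so dword 5 comes from tmp */
    _mm256_storeu_si256((__m256i *)dst, r);
}
```

&lt;P&gt;The broadcast result is the extra temporary register; the blend itself only needs the 8-bit immediate.&lt;/P&gt;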
&lt;P&gt;CU,&lt;/P&gt;
&lt;P&gt;Elmar&lt;/P&gt;</description>
      <pubDate>Fri, 21 Jun 2013 06:31:43 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Looking-for-smartest-way-to-insert-a-DWORD-into-AVX-register/m-p/927962#M3125</guid>
      <dc:creator>Elmar</dc:creator>
      <dc:date>2013-06-21T06:31:43Z</dc:date>
    </item>
    <item>
      <title>What about these two</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Looking-for-smartest-way-to-insert-a-DWORD-into-AVX-register/m-p/927963#M3126</link>
      <description>What about these two intrinsic functions:
...
extern __m256i __ICL_INTRINCC &lt;STRONG&gt;_mm256_set_epi32&lt;/STRONG&gt;( int, int, int, int, int, int, int, int );
...
and
...
extern __m256i __ICL_INTRINCC &lt;STRONG&gt;_mm256_setr_epi32&lt;/STRONG&gt;( int, int, int, int, int, int, int, int );
...
Example uses of &lt;STRONG&gt;_mm256_set_epi32&lt;/STRONG&gt; could look like:
...
__m256i v1 = &lt;STRONG&gt;_mm256_set_epi32&lt;/STRONG&gt;( 0, 77, 0, 0, 0, 0, 0, 0 );
or
__m256i v2 = &lt;STRONG&gt;_mm256_set_epi32&lt;/STRONG&gt;( 0, 0, 0, 0, 0, 0, 77, 0 );
...</description>
      <pubDate>Fri, 21 Jun 2013 12:59:16 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Looking-for-smartest-way-to-insert-a-DWORD-into-AVX-register/m-p/927963#M3126</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-06-21T12:59:16Z</dc:date>
    </item>
    <item>
      <title>@Sergey Kostrov: These are</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Looking-for-smartest-way-to-insert-a-DWORD-into-AVX-register/m-p/927964#M3127</link>
      <description>&lt;P&gt;@Sergey Kostrov: These are multi-instruction constructs, which basically come down to broadcasts or moves plus shuffles. And the OP seems to want to insert a single dword into an existing register that is already filled with data.&lt;/P&gt;
&lt;P&gt;I don't think there is a single instruction for this. But depending on the surrounding code and actual requirements you could use vshufps on the two registers (the one with the old contents and the other with the loaded dword). The downside is that the lower and upper halves of ymm are shuffled the same way, so you'll have to insert two values this way. This can be mitigated by copying the half to be preserved from the original data register to the dword register first (see vperm2f128). But you save a mask register that would be needed in case of blend.&lt;/P&gt;
</description>
      <pubDate>Mon, 24 Jun 2013 07:12:59 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Looking-for-smartest-way-to-insert-a-DWORD-into-AVX-register/m-p/927964#M3127</guid>
      <dc:creator>andysem</dc:creator>
      <dc:date>2013-06-24T07:12:59Z</dc:date>
    </item>
    <item>
      <title>Quote:Elmar wrote:AVX</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Looking-for-smartest-way-to-insert-a-DWORD-into-AVX-register/m-p/927965#M3128</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;Elmar wrote:&lt;BR /&gt;AVX vinsertps doesn't work because it clears the upper 128 bits, and its immediate can't address the upper 128 bits anyway.&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Simply use &lt;EM&gt;vinsertps&lt;/EM&gt; followed by &lt;EM&gt;vinsertf128&lt;/EM&gt;; this is the fastest available option AFAIK. I use it, for example, in my legacy AVX generic gather path, detailed here: &lt;A href="http://software.intel.com/en-us/comment/reply/285867/1740679"&gt;http://software.intel.com/en-us/comment/reply/285867/1740679&lt;/A&gt;&lt;/P&gt;
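&lt;P&gt;A rough sketch of how this might look with intrinsics when the target element sits in the upper half of an existing register. It assumes AVX and GCC or Clang; the helper name insert_float6_avx and the fixed index 6 are hypothetical, and the leading vextractf128 is an added assumption that is only needed because the upper half already holds data to preserve.&lt;/P&gt;

```c
#include "immintrin.h"  /* quoted form; falls back to the system header */

/* Hypothetical helper: copy 8 floats from src to dst, replacing element 6 with x. */
__attribute__((target("avx")))
void insert_float6_avx(float dst[8], const float src[8], float x)
{
    __m256 v  = _mm256_loadu_ps(src);
    __m128 hi = _mm256_extractf128_ps(v, 1);      /* vextractf128: elements 4..7 */
    hi = _mm_insert_ps(hi, _mm_set_ss(x), 0x20);  /* vinsertps: source element 0 to destination element 2 */
    v  = _mm256_insertf128_ps(v, hi, 1);          /* vinsertf128: write the modified half back */
    _mm256_storeu_ps(dst, v);
}
```

&lt;P&gt;In the vinsertps immediate, bits 7:6 select the source element, bits 5:4 the destination element, and bits 3:0 are a zero mask, so 0x20 writes source element 0 to destination element 2 without zeroing anything.&lt;/P&gt;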
</description>
      <pubDate>Mon, 24 Jun 2013 09:17:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Looking-for-smartest-way-to-insert-a-DWORD-into-AVX-register/m-p/927965#M3128</guid>
      <dc:creator>bronxzv</dc:creator>
      <dc:date>2013-06-24T09:17:00Z</dc:date>
    </item>
    <item>
      <title>Quote:andysem wrote:</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Looking-for-smartest-way-to-insert-a-DWORD-into-AVX-register/m-p/927966#M3129</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;andysem wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;I don't think there is a single instruction for this. But depending on the surrounding code and actual requirements you could use vshufps on the two registers (the one with the old contents and the other with the loaded dword). The downside is that the lower and upper halves of ymm are shuffled the same way, so you'll have to insert two values this way. This can be mitigated by copying the half to be preserved from the original data register to the dword register first (see vperm2f128). But you save a mask register that would be needed in case of blend.&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;But vshufps can only insert a QWORD, since two adjacent DWORDs must come from the same operand, no?&lt;/P&gt;
&lt;P&gt;For inserting a DWORD, I currently use vinsertps or vpermilps to place the DWORD at the right spot in an unused register, and then vblendps to move the DWORD into the target register (note that vblendps takes an immediate blend factor, not a mask register). If the DWORD has to cross a lane, I need a third instruction for the cross-lane shuffle.&lt;/P&gt;
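&lt;P&gt;A minimal sketch of the in-lane variant (vpermilps to position the value, vblendps with an immediate to merge it), assuming AVX and GCC or Clang. The helper name insert_float2_avx and the fixed index 2 are hypothetical, chosen only for illustration.&lt;/P&gt;

```c
#include "immintrin.h"  /* quoted form; falls back to the system header */

/* Hypothetical helper: copy 8 floats from src to dst, replacing element 2 with x. */
__attribute__((target("avx")))
void insert_float2_avx(float dst[8], const float src[8], float x)
{
    __m256 v   = _mm256_loadu_ps(src);
    __m256 tmp = _mm256_castps128_ps256(_mm_set_ss(x)); /* x in element 0; upper lane is unused here */
    tmp = _mm256_permute_ps(tmp, 0x00);                 /* vpermilps: element 0 copied to every slot of its lane */
    v   = _mm256_blend_ps(v, tmp, 0x04);                /* vblendps: immediate bit 2 set, element 2 comes from tmp */
    _mm256_storeu_ps(dst, v);
}
```

&lt;P&gt;For elements 4 to 7 the value additionally has to be moved into the upper lane of the temporary before the blend, which is the third instruction.&lt;/P&gt;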
&lt;P&gt;I had hoped that Intel engineers would immediately fire the optimal solution at me (in terms of false dependencies, latency etc.), but it seems that they are busy (hopefully cleaning up the AVX2 manual #319433-014, because that's full of bugs ;-))... &lt;/P&gt;
&lt;P&gt;Thanks,&lt;/P&gt;
&lt;P&gt;Elmar&lt;/P&gt;</description>
      <pubDate>Mon, 24 Jun 2013 22:54:33 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Looking-for-smartest-way-to-insert-a-DWORD-into-AVX-register/m-p/927966#M3129</guid>
      <dc:creator>Elmar</dc:creator>
      <dc:date>2013-06-24T22:54:33Z</dc:date>
    </item>
    <item>
      <title>Elmar, I did a verification</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Looking-for-smartest-way-to-insert-a-DWORD-into-AVX-register/m-p/927967#M3130</link>
      <description>Elmar, I ran a check, and with these intrinsics:

&amp;gt;&amp;gt;...
&amp;gt;&amp;gt;__m256i v1 = _mm256_set_epi32( 0, 77, 0, 0, 0, 0, 0, 0 );
&amp;gt;&amp;gt;or
&amp;gt;&amp;gt;__m256i v2 = _mm256_setr_epi32( 0, 0, 0, 0, 0, 0, 77, 0 );
&amp;gt;&amp;gt;...

a performance impact is possible, and implementing similar functionality with native instructions could be faster. Please evaluate the performance if you decide to use these two intrinsic functions.</description>
      <pubDate>Mon, 24 Jun 2013 23:16:21 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Looking-for-smartest-way-to-insert-a-DWORD-into-AVX-register/m-p/927967#M3130</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-06-24T23:16:21Z</dc:date>
    </item>
    <item>
      <title>&gt; But vshufps can only insert</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Looking-for-smartest-way-to-insert-a-DWORD-into-AVX-register/m-p/927968#M3131</link>
      <description>&lt;P&gt;&amp;gt; But vshufps can only insert a QWORD, since two adjacent DWORDs must come from the same operand, no?&lt;/P&gt;
&lt;P&gt;You're right, sorry for the confusion. It seems inserts and blends are the way to go.&lt;/P&gt;</description>
      <pubDate>Tue, 25 Jun 2013 07:30:14 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Looking-for-smartest-way-to-insert-a-DWORD-into-AVX-register/m-p/927968#M3131</guid>
      <dc:creator>andysem</dc:creator>
      <dc:date>2013-06-25T07:30:14Z</dc:date>
    </item>
  </channel>
</rss>

