<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Intraregister sum in Intel® ISA Extensions</title>
    <link>https://community.intel.com/t5/Intel-ISA-Extensions/Intraregister-sum/m-p/808873#M839</link>
    <description>If the intention is to add the 8 elements of a YMM register [x0, x1, ..., x7], I don't think you will get any performance gain from AVX; it will be the same as SSE2.&lt;BR /&gt;AVX has a lane concept: 4 elements are in the upper lane (x4-x7) and 4 elements are in the lower lane (x0-x3). First you need to bring the upper 4 elements down.&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;__m256 uLane = _mm256_permute2f128_ps(ymm0, ymm0, 0x01); // swap the two 128-bit lanes&lt;BR /&gt;&lt;BR /&gt;// Depending on the order in which you add the elements, the result may differ, as Tim pointed out earlier.&lt;BR /&gt;// The efficient way is to add the two lanes now:&lt;BR /&gt;ymm0 = _mm256_add_ps(ymm0, uLane);&lt;BR /&gt;&lt;BR /&gt;Then follow the SSE2 code (the lower lane of ymm0 now holds 4 partial sums).&lt;BR /&gt;....</description>
    <pubDate>Thu, 22 Jul 2010 21:48:17 GMT</pubDate>
    <dc:creator>Brijender_B_Intel</dc:creator>
    <dc:date>2010-07-22T21:48:17Z</dc:date>
    <item>
      <title>Intraregister sum</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Intraregister-sum/m-p/808868#M834</link>
      <description>Dear Intel users,&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;I need to perform an intraregister sum many times using intrinsics. For example:&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;x += a[0] + a[1] + a[2] + a[3]&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;where a is of type __m128.&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;How can I do that? Which way is fastest?&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;Thanks in advance!&lt;/DIV&gt;</description>
      <pubDate>Sat, 29 May 2010 09:08:06 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Intraregister-sum/m-p/808868#M834</guid>
      <dc:creator>unrue</dc:creator>
      <dc:date>2010-05-29T09:08:06Z</dc:date>
    </item>
    <item>
      <title>Intraregister sum</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Intraregister-sum/m-p/808869#M835</link>
      <description>There is little consensus on this, except that the way you have written it may be one of the slower options. Yet straightforward attempts to speed it up, or to cut down on numeric variations, such as&lt;BR /&gt;x += (a[0] + a[1]) + (a[2] + a[3]);&lt;BR /&gt;are likely to be ignored by icc with its default fast floating-point model, or even by gcc -ffast-math.&lt;BR /&gt;If you are using SSE3, you can write it with horizontal add, which will not be the fastest on all CPU types, although it should produce the minimum number of instructions.</description>
      <pubDate>Mon, 31 May 2010 20:34:47 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Intraregister-sum/m-p/808869#M835</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2010-05-31T20:34:47Z</dc:date>
    </item>
    <item>
      <title>Intraregister sum</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Intraregister-sum/m-p/808870#M836</link>
      <description>Dear &lt;B&gt;&lt;A rel="/en-us/services/profile/quick_profile.php?is_paid=&amp;amp;user_id=480380" class="basic" href="https://community.intel.com/en-us/profile/480380/"&gt;cikikamakuro&lt;/A&gt;,&lt;/B&gt;&lt;DIV&gt;&lt;B&gt;&lt;BR /&gt;&lt;/B&gt;&lt;/DIV&gt;&lt;DIV&gt;the definition of hadd for two vectors a and b is:&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;result = b2+b3 | b1+b0 | a2+a3 | a1+a0&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;This is not what I want, which is:&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;a0+a1+a2+a3&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;I can do it using some shift or other, but not in a single assembly instruction.&lt;/DIV&gt;</description>
      <pubDate>Mon, 07 Jun 2010 19:33:08 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Intraregister-sum/m-p/808870#M836</guid>
      <dc:creator>unrue</dc:creator>
      <dc:date>2010-06-07T19:33:08Z</dc:date>
    </item>
    <item>
      <title>Intraregister sum</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Intraregister-sum/m-p/808871#M837</link>
      <description>&lt;DIV id="tiny_quote"&gt;
                &lt;DIV style="margin-left: 2px; margin-right: 2px;"&gt;Quoting &lt;A rel="/en-us/services/profile/quick_profile.php?is_paid=&amp;amp;user_id=475713" class="basic" href="https://community.intel.com/en-us/profile/475713/"&gt;unrue&lt;/A&gt;&lt;/DIV&gt;
                &lt;DIV style="background-color: #e5e5e5; padding: 5px; border: 1px inset; margin-left: 2px; margin-right: 2px;"&gt;a0+a1+a2+a3&lt;I&gt;&lt;BR /&gt;&lt;DIV&gt;I can do using some shift or other, but not in only one assembly operation.&lt;/DIV&gt;&lt;DIV&gt;&lt;B&gt;&lt;BR /&gt;&lt;/B&gt;&lt;/DIV&gt;&lt;/I&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P&gt;&lt;/P&gt;Yes, haddps has to be used twice to produce the sum of the 4 operands. On Intel CPUs, other methods are likely to be slightly faster. If you're concerned about such detail, you may also wish to consider whether you want (a0+a1)+(a2+a3) or (a0+a2)+(a1+a3). The difference in numerical results usually is more noticeable than the difference in timing.&lt;BR /&gt;</description>
      <pubDate>Tue, 08 Jun 2010 15:39:28 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Intraregister-sum/m-p/808871#M837</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2010-06-08T15:39:28Z</dc:date>
    </item>
    <item>
      <title>Intraregister sum</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Intraregister-sum/m-p/808872#M838</link>
      <description>&lt;P&gt;// Accumulating the four floats held in xmm0 (x0+x1+x2+x3) into element 0 of xmm0&lt;BR /&gt;//SSE3:&lt;/P&gt;&lt;P&gt;haddps xmm0, xmm0 // Pairwise sums: x0+x1 and x2+x3&lt;/P&gt;&lt;P&gt;haddps xmm0, xmm0 // Total sum is now in every dword&lt;/P&gt;&lt;P&gt;//SSE2:&lt;/P&gt;&lt;P&gt;movhlps xmm1, xmm0 // Get bits 64-127 from xmm0&lt;/P&gt;&lt;P&gt;addps xmm0, xmm1 // Sums are in the 2 low dwords&lt;/P&gt;&lt;P&gt;pshufd xmm1, xmm0, 1 // Get bits 32-63 from xmm0&lt;/P&gt;&lt;P&gt;addss xmm0, xmm1 // Sum is in the low dword&lt;/P&gt;&lt;P&gt;//SSE:&lt;/P&gt;&lt;P&gt;movaps xmm1, xmm0&lt;/P&gt;&lt;P&gt;shufps xmm1, xmm1, (2+4*3+16*0+64*1) // xmm1 = x2, x3, x0, x1&lt;/P&gt;&lt;P&gt;addps xmm0, xmm1 // Low dwords hold x0+x2 and x1+x3&lt;/P&gt;&lt;P&gt;movaps xmm1, xmm0&lt;/P&gt;&lt;P&gt;shufps xmm1, xmm0, (1+4*1+16*3+64*3) // Bring x1+x3 into the low dword&lt;/P&gt;&lt;P&gt;addss xmm0, xmm1 // Sum is in the low dword&lt;BR /&gt;&lt;BR /&gt;------------------------------&lt;BR /&gt;&lt;BR /&gt;I have not yet found how to do the same thing on AVX __m256 (ymm[0]+ymm[1]+...+ymm[7]).&lt;BR /&gt;If anyone has done it, please let me know here.&lt;/P&gt;</description>
      <pubDate>Thu, 22 Jul 2010 19:08:03 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Intraregister-sum/m-p/808872#M838</guid>
      <dc:creator>xavierasm</dc:creator>
      <dc:date>2010-07-22T19:08:03Z</dc:date>
    </item>
    <item>
      <title>Intraregister sum</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Intraregister-sum/m-p/808873#M839</link>
      <description>If the intention is to add the 8 elements of a YMM register [x0, x1, ..., x7], I don't think you will get any performance gain from AVX; it will be the same as SSE2.&lt;BR /&gt;AVX has a lane concept: 4 elements are in the upper lane (x4-x7) and 4 elements are in the lower lane (x0-x3). First you need to bring the upper 4 elements down.&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;__m256 uLane = _mm256_permute2f128_ps(ymm0, ymm0, 0x01); // swap the two 128-bit lanes&lt;BR /&gt;&lt;BR /&gt;// Depending on the order in which you add the elements, the result may differ, as Tim pointed out earlier.&lt;BR /&gt;// The efficient way is to add the two lanes now:&lt;BR /&gt;ymm0 = _mm256_add_ps(ymm0, uLane);&lt;BR /&gt;&lt;BR /&gt;Then follow the SSE2 code (the lower lane of ymm0 now holds 4 partial sums).&lt;BR /&gt;....</description>
      <pubDate>Thu, 22 Jul 2010 21:48:17 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Intraregister-sum/m-p/808873#M839</guid>
      <dc:creator>Brijender_B_Intel</dc:creator>
      <dc:date>2010-07-22T21:48:17Z</dc:date>
    </item>
  </channel>
</rss>

