<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Hi Zheng in Intel® ISA Extensions</title>
    <link>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924321#M3063</link>
    <description>&lt;P&gt;Hi Zheng&lt;/P&gt;
&lt;P&gt;Do you have a transition penalty between SSE and AVX-256 code? Maybe during the execution of your code SSE and AVX-256 instructions got intermixed in the YMM registers?&lt;/P&gt;</description>
    <pubDate>Tue, 17 Sep 2013 06:12:00 GMT</pubDate>
    <dc:creator>Bernard</dc:creator>
    <dc:date>2013-09-17T06:12:00Z</dc:date>
    <item>
      <title>do _mm256_load_ps slower than _mm_load_ps?</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924319#M3061</link>
      <description>&lt;P&gt;I tried to improve the performance of some simple code using SSE and AVX, but I found that the AVX code needs more time than the SSE code:&lt;/P&gt;
&lt;PRE&gt;void testfun()
{
    int dataLen = 4800;
    int N = 10000000;
    float *buf1 = reinterpret_cast&amp;lt;float*&amp;gt;(_aligned_malloc(sizeof(float)*dataLen, 32));
    float *buf2 = reinterpret_cast&amp;lt;float*&amp;gt;(_aligned_malloc(sizeof(float)*dataLen, 32));
    float *buf3 = reinterpret_cast&amp;lt;float*&amp;gt;(_aligned_malloc(sizeof(float)*dataLen, 32));
    for(int i=0; i&amp;lt;dataLen; i++)
    {
        buf1[i] = 1;  buf2[i] = 1;  buf3[i] = 0;
    }
    int timePassed;
    int t;
    //========================= SSE CODE =========================
    t = clock();
    __m128 *p1, *p2, *p3;
    for(int j=0; j&amp;lt;N; j++)
    {
        p1 = (__m128 *)buf1;
        p2 = (__m128 *)buf2;
        p3 = (__m128 *)buf3;
        for(int i=0; i&amp;lt;dataLen/4; i++)
        {
            *p3 = _mm_add_ps(_mm_mul_ps(*p1, *p2), *p3);
            p1++;  p2++;  p3++;
        }
    }
    timePassed = clock() - t;
    std::cout&amp;lt;&amp;lt;"SSE time used: "&amp;lt;&amp;lt;timePassed&amp;lt;&amp;lt;"ms"&amp;lt;&amp;lt;std::endl;
    for(int i=0; i&amp;lt;dataLen; i++)  { buf3[i] = 0; }
    t = clock();
    //========================= AVX CODE =========================
    __m256 *pp1, *pp2, *pp3;
    for(int j=0; j&amp;lt;N; j++)
    {
        pp1 = (__m256*) buf1;
        pp2 = (__m256*) buf2;
        pp3 = (__m256*) buf3;
        for(int i=0; i&amp;lt;dataLen/8; i++)
        {
            *pp3 = _mm256_add_ps(_mm256_mul_ps(*pp1, *pp2), *pp3);
            pp1++;  pp2++;  pp3++;
        }
    }
    timePassed = clock() - t;
    std::cout&amp;lt;&amp;lt;"AVX time used: "&amp;lt;&amp;lt;timePassed&amp;lt;&amp;lt;"ms"&amp;lt;&amp;lt;std::endl;
    _aligned_free(buf1);
    _aligned_free(buf2);
    _aligned_free(buf3);
}&lt;/PRE&gt;
&lt;P&gt;I changed the "dataLen" and get different efficiency:&lt;/P&gt;
&lt;PRE&gt;dataLen =  400   SSE time:   758 ms   AVX time:   483 ms   SSE &amp;gt; AVX
dataLen = 2400   SSE time:  4212 ms   AVX time:  2636 ms   SSE &amp;gt; AVX
dataLen = 2864   SSE time:  6115 ms   AVX time:  6146 ms   SSE ~= AVX
dataLen = 3200   SSE time:  8049 ms   AVX time:  9297 ms   SSE &amp;lt; AVX
dataLen = 4000   SSE time: 10170 ms   AVX time: 11690 ms   SSE &amp;lt; AVX&lt;/PRE&gt;
&lt;P&gt;My L1 cache is 32 KB and my L2 cache is 1 MB. It seems that sometimes a 256-bit load is slower than a 128-bit load. Why? I get the same result if I rewrite the code with explicit SIMD load intrinsics such as "_mm256_load_ps", "_mm_load_ps", "_mm_add_ps", and so on.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 10 Sep 2013 03:40:12 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924319#M3061</guid>
      <dc:creator>zhang_h_</dc:creator>
      <dc:date>2013-09-10T03:40:12Z</dc:date>
    </item>
    <item>
      <title>You haven't indicated your</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924320#M3062</link>
      <description>&lt;P&gt;You haven't indicated your processor. My guess is you are on Sandy Bridge.&lt;/P&gt;
&lt;P&gt;The next-gen Haswell will correct for this in your example with a 2x wider data bus between the CPU and the L1/L2/L3.&lt;/P&gt;
&lt;P&gt;See &lt;A href="http://www.realworldtech.com/haswell-cpu/5/"&gt;http://www.realworldtech.com/haswell-cpu/5/&lt;/A&gt; for some insight.&lt;/P&gt;
&lt;P&gt;Also, Haswell can perform FMA (Fused Multiply-Add) in one instruction (... = (B*C) + A).&lt;BR /&gt;And depending on the Haswell CPU, you can have up to 4 memory banks.&lt;/P&gt;
&lt;P&gt;Use Sandy Bridge for learning how to use _mm256 (and converting applications).&lt;BR /&gt;Use Haswell for production code.&lt;/P&gt;
&lt;P&gt;Sandy Bridge gave you a year to adapt your code for Haswell (and later).&lt;/P&gt;
&lt;P&gt;Jim Dempsey&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sun, 15 Sep 2013 17:16:08 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924320#M3062</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2013-09-15T17:16:08Z</dc:date>
    </item>
    <item>
      <title>Hi Zheng</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924321#M3063</link>
      <description>&lt;P&gt;Hi Zheng&lt;/P&gt;
&lt;P&gt;Do you have a transition penalty between SSE and AVX-256 code? Maybe during the execution of your code SSE and AVX-256 instructions got intermixed in the YMM registers?&lt;/P&gt;</description>
      <pubDate>Tue, 17 Sep 2013 06:12:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924321#M3063</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2013-09-17T06:12:00Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;float *buf1 = reinterpret</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924322#M3064</link>
      <description>&amp;gt;&amp;gt;float *buf1 = reinterpret_cast&amp;lt;float*&amp;gt;(_aligned_malloc(sizeof(float)*dataLen, 32));
&amp;gt;&amp;gt;
&amp;gt;&amp;gt;float *buf2 = reinterpret_cast&amp;lt;float*&amp;gt;(_aligned_malloc(sizeof(float)*dataLen, 32));
&amp;gt;&amp;gt;
&amp;gt;&amp;gt;float *buf3 = reinterpret_cast&amp;lt;float*&amp;gt;(_aligned_malloc(sizeof(float)*dataLen, 32));

Did you verify that these three pointers are really aligned on a 32-byte boundary? Also, you've overcomplicated the memory allocation; why do you need the reinterpret_cast C++ operator?</description>
      <pubDate>Mon, 23 Sep 2013 05:27:48 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924322#M3064</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-09-23T05:27:48Z</dc:date>
    </item>
    <item>
      <title>Quote:jimdempseyatthecove</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924323#M3065</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;jimdempseyatthecove wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;You haven't indicated your processor. My guess is you are Sandybridge.&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;Use Haswell for production code.&lt;/P&gt;
&lt;P&gt;Sandybridge gave you a year to adopt your code for Haswell (and later).&lt;/P&gt;
&lt;P&gt;Jim Dempsey&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Hi Jim, thanks for your reply!&lt;/P&gt;
&lt;P&gt;Yes, my processor is a Sandy Bridge (Xeon E3-1225 v2). Others can reproduce my results on Sandy Bridge processors.&lt;/P&gt;
&lt;P&gt;For extra-large arrays and a simple calculation like&lt;/P&gt;
&lt;PRE&gt;for(int i=0; i&amp;lt;1000000000; i++)
{
    A[i] += B[i]*C[i];
}&lt;/PRE&gt;
&lt;P&gt;this loop is memory-bandwidth limited, so it cannot fully show the FMA's advantage.&lt;/P&gt;
&lt;P&gt;I've tested my memory read and write speed; it is about 20 GB/s (DDR3, dual channel). I think this speed is fast, but still not enough.&lt;/P&gt;
&lt;P&gt;Can I get faster?&lt;/P&gt;</description>
      <pubDate>Mon, 21 Oct 2013 08:39:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924323#M3065</guid>
      <dc:creator>zhang_h_</dc:creator>
      <dc:date>2013-10-21T08:39:00Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;&gt;This loop it is memory</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924324#M3066</link>
      <description>&lt;P&gt;&amp;gt;&amp;gt;&amp;gt;This loop it is memory band width limited,so it can not &amp;nbsp;fully display the FMA's advantage.&amp;gt;&amp;gt;&amp;gt;&lt;/P&gt;
&lt;P&gt;Can you run VTune analysis on that code to see where the pipeline stalls are?Here I mean front-end pipeline stalls.&lt;/P&gt;</description>
      <pubDate>Mon, 21 Oct 2013 08:50:43 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924324#M3066</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2013-10-21T08:50:43Z</dc:date>
    </item>
    <item>
      <title>Quote:iliyapolak wrote:</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924325#M3067</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;iliyapolak wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Hi Zheng&lt;/P&gt;
&lt;P&gt;Do you have a transition penalty between SSE and AVX-256 code? Maybe during the execution of your code SSE and AVX-256 instructions got intermixed in the YMM registers?&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;I have looked at the assembly code, and it uses the XMM registers in the SSE code and the YMM registers in the AVX code!&lt;/P&gt;
&lt;P&gt;Also, I can comment out either the SSE code or the AVX code, and the result is the same!&lt;/P&gt;</description>
      <pubDate>Mon, 21 Oct 2013 09:12:55 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924325#M3067</guid>
      <dc:creator>zhang_h_</dc:creator>
      <dc:date>2013-10-21T09:12:55Z</dc:date>
    </item>
    <item>
      <title>Quote:iliyapolak wrote:</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924326#M3068</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;iliyapolak wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;&amp;gt;&amp;gt;&amp;gt;This loop is memory-bandwidth limited, so it cannot fully show the FMA's advantage.&amp;gt;&amp;gt;&amp;gt;&lt;/P&gt;
&lt;P&gt;Can you run a VTune analysis on that code to see where the pipeline stalls are? Here I mean front-end pipeline stalls.&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Hi, I don't have this software...&lt;/P&gt;
&lt;P&gt;I reached the conclusion just from some tests. For example, I reduce the size of the array and increase the iterations of the loop&lt;/P&gt;
&lt;P&gt;(keeping the total amount of calculation the same), and then I get different performance.&lt;/P&gt;
&lt;P&gt;When the whole array can be stored in the cache, the performance is the best!&lt;/P&gt;</description>
      <pubDate>Mon, 21 Oct 2013 10:12:42 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924326#M3068</guid>
      <dc:creator>zhang_h_</dc:creator>
      <dc:date>2013-10-21T10:12:42Z</dc:date>
    </item>
    <item>
      <title>Quote:Sergey Kostrov wrote:</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924327#M3069</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;Sergey Kostrov wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;&amp;gt;&amp;gt;float *buf1 = reinterpret_cast&amp;lt;float*&amp;gt;(_aligned_malloc(sizeof(float)*dataLen, 32));&lt;BR /&gt; &amp;gt;&amp;gt;&lt;BR /&gt; &amp;gt;&amp;gt;float *buf2 = reinterpret_cast&amp;lt;float*&amp;gt;(_aligned_malloc(sizeof(float)*dataLen, 32));&lt;BR /&gt; &amp;gt;&amp;gt;&lt;BR /&gt; &amp;gt;&amp;gt;float *buf3 = reinterpret_cast&amp;lt;float*&amp;gt;(_aligned_malloc(sizeof(float)*dataLen, 32));&lt;/P&gt;
&lt;P&gt;Did you verify that these three pointers are really aligned on a 32-byte boundary? Also, you've overcomplicated the memory allocation; why do you need the reinterpret_cast C++ operator?&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Hi, I have checked the allocated memory; it really is aligned on a 32-byte boundary.&lt;/P&gt;
&lt;P&gt;If I request 16-byte alignment instead, the allocated memory is 16-byte aligned but not 32-byte aligned.&lt;/P&gt;
&lt;P&gt;"reinterpret_cast&amp;lt;float*&amp;gt;" just converts the address to a float pointer. I am not so familiar with it...&lt;/P&gt;
&lt;P&gt;Could you please give an example of creating 32-byte aligned memory? Thanks!&lt;/P&gt;</description>
      <pubDate>Mon, 21 Oct 2013 10:51:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924327#M3069</guid>
      <dc:creator>zhang_h_</dc:creator>
      <dc:date>2013-10-21T10:51:00Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;&gt;Hi,I don't have this</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924328#M3070</link>
      <description>&lt;P&gt;&amp;gt;&amp;gt;&amp;gt;Hi, I don't have this software...&lt;/P&gt;
&lt;P&gt;I reached the conclusion just from some tests. For example, reduce the size of the array and increase the iterations of the loop&amp;gt;&amp;gt;&amp;gt;&lt;/P&gt;
&lt;P&gt;You can download a trial version of Parallel Studio. Without collecting CPU performance-counter data, it is hard to say what exactly the limiting factor is in your case.&lt;/P&gt;</description>
      <pubDate>Mon, 21 Oct 2013 11:32:22 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924328#M3070</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2013-10-21T11:32:22Z</dc:date>
    </item>
    <item>
      <title>On Sandy Bridge the internal</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924329#M3071</link>
      <description>&lt;P&gt;On Sandy Bridge the internal data path is still 128 bits. See: &lt;A href="http://www.realworldtech.com/sandy-bridge/6/"&gt;http://www.realworldtech.com/sandy-bridge/6/&lt;/A&gt;&amp;nbsp;&lt;BR /&gt;Similar issue with Ivy Bridge.&lt;/P&gt;
&lt;P&gt;Haswell has expanded the internal data path to 256 bits. See: &lt;A href="http://www.hardwaresecrets.com/printpage/Inside-the-Intel-Haswell-Microarchitecture/1777"&gt;http://www.hardwaresecrets.com/printpage/Inside-the-Intel-Haswell-Microarchitecture/1777&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Mon, 21 Oct 2013 12:42:22 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924329#M3071</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2013-10-21T12:42:22Z</dc:date>
    </item>
    <item>
      <title>Intel has not made definitive</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924330#M3072</link>
      <description>&lt;P&gt;Intel has not made definitive statements on this, but I can think of a few reasons why the 128-bit loads might be more efficient than the 256-bit loads on Sandy Bridge processors (and presumably Ivy Bridge processors as well).&lt;/P&gt;
&lt;P&gt;(1) Two 128-bit loads can be issued to the two load ports in a single cycle, while a 256-bit load is issued to one port, which is then occupied for two cycles. It is certainly plausible that the former case allows better low-level scheduling.&lt;/P&gt;
&lt;P&gt;(2) When the data is bigger than the L1 Dcache: There is evidence that the L1 Dcache can either provide 32 bytes/cycle to the core *or* receive 32 bytes/cycle to reload a line from the L2, but not both at the same time. Again, having independent 128-bit loads that can execute in a single cycle might allow better scheduling with respect to the timing of the L1 Dcache refills from the L2 than having 256-bit loads that occupy the L1 for two cycles.&lt;/P&gt;
&lt;P&gt;(3) Intel has not disclosed enough details about the L1 Data Cache banking to really understand what is going on there.&amp;nbsp; It is possible that 256-bit loads hit bank conflicts more often than 128-bit loads, or that the impact of these delays is larger (because the 256-bit loads occupy a port for two cycles instead of one cycle).&lt;/P&gt;
&lt;P&gt;(4) For data bigger than L1:&amp;nbsp; The L1 hardware prefetcher is activated by "streams" of load addresses.&amp;nbsp; Using 128-bit loads gets you a "stream" of loads faster than using 256-bit loads. Since the hardware prefetchers have to start all over again for every 4 KiB page (64 cache lines), being able to start prefetching from the L2 even a few cycles earlier might make a noticeable difference.&lt;/P&gt;
&lt;P&gt;(5) It is important to check the assembly code carefully when using intrinsics!&amp;nbsp; Although these "look like" inline assembly, they are not, and the compiler may perform high-level optimizations that you don't expect.&amp;nbsp; I think that the differences seen here are real, but some of the details may depend on exactly what code the compiler decides to generate.&lt;/P&gt;
</description>
      <pubDate>Mon, 21 Oct 2013 19:34:17 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924330#M3072</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2013-10-21T19:34:17Z</dc:date>
    </item>
    <item>
      <title>The core LS performance isn't</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924331#M3073</link>
      <description>&lt;P&gt;The core load/store performance isn't impacted by cache size or the hardware prefetcher in terms of answering this question. If you measured the load latency of the given instructions, you'd find that 256-bit loads within a cache line have a latency of 9 clocks on Sandy Bridge/Ivy Bridge, 2 clocks more than non-256-bit loads. If your 256-bit load spans a cache-line boundary, you pay a penalty of 21 clocks on Sandy Bridge/Ivy Bridge (it's 13 clocks on Haswell). That's why this is not prudent on those parts. If you are using 128-bit vectors and you're not 16-byte aligned, a load crosses a cache-line boundary 1/4 of the time, but with 256-bit loads it happens 1/2 the time. If you're 16-byte aligned, you won't pay this line-spanning penalty in SSE/AVX-128, but you will in AVX-256, since you're not 32-byte aligned. To align, just do it in C code: add some padding to your malloc, round the address up to a 32-byte boundary (add 31 and AND with ~31), and feed your code from that pointer.&lt;/P&gt;
&lt;P&gt;perfwise&lt;/P&gt;</description>
      <pubDate>Thu, 24 Oct 2013 13:13:25 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924331#M3073</guid>
      <dc:creator>perfwise</dc:creator>
      <dc:date>2013-10-24T13:13:25Z</dc:date>
    </item>
    <item>
      <title>I suppose that in case of 256</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924332#M3074</link>
      <description>&lt;P&gt;I suppose that in the case of a 256-bit load, two additional cycles are needed to physically transfer the additional 16 bytes of data.&lt;/P&gt;</description>
      <pubDate>Thu, 24 Oct 2013 14:03:29 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924332#M3074</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2013-10-24T14:03:29Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;...</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924333#M3075</link>
      <description>&amp;gt;&amp;gt;...
&amp;gt;&amp;gt;could you please give an example of creating 32-byte aligned memory?..

In a Release configuration, try the &lt;STRONG&gt;_mm_malloc&lt;/STRONG&gt; and &lt;STRONG&gt;_mm_free&lt;/STRONG&gt; intrinsic functions. Here is an example:

...
int *piMemoryBlock1 = NULL;
...
piMemoryBlock1 = ( int * )_mm_malloc( 777 * sizeof( int ), 32 );
...
// Some Processing
...
if( piMemoryBlock1 != NULL )
    _mm_free( piMemoryBlock1 );
...

PS: I use these two intrinsic functions in ~95% of all memory allocation / de-allocation cases.

PS2: In the Debug configuration I use Microsoft's &lt;STRONG&gt;_malloc_dbg&lt;/STRONG&gt; and &lt;STRONG&gt;_free_dbg&lt;/STRONG&gt; functions, since they can detect memory leaks and buffer overflows.</description>
      <pubDate>Fri, 01 Nov 2013 21:09:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/do-mm256-load-ps-slower-than-mm-load-ps/m-p/924333#M3075</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-11-01T21:09:00Z</dc:date>
    </item>
  </channel>
</rss>

