<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Your English is quite good. in Intel® Moderncode for Parallel Architectures</title>
    <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Memory-to-CPU-mov-bandwidth-limitations/m-p/1061337#M6880</link>
    <description>&lt;P&gt;Your English is quite good.&lt;/P&gt;

&lt;P&gt;The CPU performs the SIMD arithmetic on data that is sourced from one or a combination of:&lt;/P&gt;

&lt;P&gt;RAM&lt;BR /&gt;
	L3 Cache&lt;BR /&gt;
	L2 Cache&lt;BR /&gt;
	L1 Cache&lt;BR /&gt;
	register&lt;/P&gt;

&lt;P&gt;Of these, RAM is the slowest and registers are the fastest.&lt;/P&gt;

&lt;P&gt;It is the goal of the programmer to construct the algorithm such that it favors reuse of data brought into the faster end of the hierarchy.&lt;/P&gt;

&lt;P&gt;Some computational problems only touch the source data once per iteration (some problems have one iteration, others many).&lt;/P&gt;

&lt;P&gt;Most computational problems touch the source data several times per iteration. For example, a finite element analysis may analyze a point and its 26 nearest neighbors, or its 124 nearest neighbors. For these types of problems, you can often structure the algorithm so that repeated references to the same data are satisfied as close to the fast end of the hierarchy as possible, in the order register, L1, L2, L3, then RAM.&lt;/P&gt;

&lt;P&gt;Some good educational references:&lt;/P&gt;

&lt;P&gt;&lt;A href="https://software.intel.com/sites/products/papers/tpt_ieee.pdf"&gt;https://software.intel.com/sites/products/papers/tpt_ieee.pdf&lt;/A&gt;&lt;BR /&gt;
	&lt;A href="https://software.intel.com/en-us/articles/3d-finite-differences-on-multi-core-processors"&gt;https://software.intel.com/en-us/articles/3d-finite-differences-on-multi-core-processors&lt;/A&gt;&lt;BR /&gt;
	&lt;A href="https://software.intel.com/en-us/articles/using-simd-technologies-on-intel-architecture-to-speed-up-game-code"&gt;https://software.intel.com/en-us/articles/using-simd-technologies-on-intel-architecture-to-speed-up-game-code&lt;/A&gt;&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
    <pubDate>Sun, 01 Feb 2015 15:15:34 GMT</pubDate>
    <dc:creator>jimdempseyatthecove</dc:creator>
    <dc:date>2015-02-01T15:15:34Z</dc:date>
    <item>
      <title>Memory to CPU (mov) bandwidth limitations</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Memory-to-CPU-mov-bandwidth-limitations/m-p/1061336#M6879</link>
      <description>(Sorry for my weak English, I am not a native speaker. I am not sure this is the right forum; it is my first time here. This is a general question about some hardware limits whose technical reason I do not understand and would very much like to know.)

We now have parallel SIMD arithmetic (for example, 8 float multiplications or divisions in one step), so the theoretical (and nearly practical) arithmetic throughput per core is something like 4 GHz * 8 floats = about 30 GFLOPS.

But AFAIK we still have quite low RAM-to-CPU bandwidth, on the order of 1 or 2 ints or floats read or written per nanosecond. When I test it, I measure only about 2 billion loads or stores per second per core, or something like that.

(Both values are rough, but the difference seems to be a physical fact, at least in my experience.)
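
To make the rough figures above concrete, here is a small sketch of the arithmetic in Python. The 4 GHz clock, the 8-float vector width, and the rate of about 2 floats per nanosecond are the approximate values quoted above; the 4-byte float size, the one-SIMD-operation-per-cycle model, and the function names are assumptions added for illustration, not measurements.

```python
def peak_gflops(clock_ghz, simd_lanes):
    """Peak arithmetic rate, assuming one SIMD operation issues per cycle."""
    return clock_ghz * simd_lanes


def memory_gbytes_per_s(elements_per_ns, bytes_per_element=4):
    """Memory traffic implied by a given element rate (4-byte floats assumed)."""
    return elements_per_ns * bytes_per_element


compute = peak_gflops(4.0, 8)   # 4 GHz * 8 floats per step
memory_rate = 2.0               # about 2 floats per nanosecond from RAM
bandwidth = memory_gbytes_per_s(memory_rate)

# Arithmetic operations the core could perform for every float fetched from RAM.
flops_per_loaded_float = compute / memory_rate

print(compute, "GFLOP/s peak")
print(bandwidth, "GB/s from RAM")
print(flops_per_loaded_float, "operations per float loaded")
```

With these rough numbers, a loop would need to perform around 16 arithmetic operations on each float it loads from RAM before arithmetic, rather than memory, becomes the limit.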

I mean that arithmetic can be parallelised (for example, 8-wide vectorised) but load/store movs are not, so SIMD parallelisation delivers only a fraction of its potential power.

Increasing this memory bandwidth is extremely crucial (much more important than increasing arithmetic throughput), but for some technical reason I do not know, it has not been improved.

The question is: what is the real reason for this? Why can't SIMD movs be parallelised, and why are they not?</description>
      <pubDate>Sun, 01 Feb 2015 13:20:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Memory-to-CPU-mov-bandwidth-limitations/m-p/1061336#M6879</guid>
      <dc:creator>albus_d_</dc:creator>
      <dc:date>2015-02-01T13:20:00Z</dc:date>
    </item>
    <item>
      <title>Your English is quite good.</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Memory-to-CPU-mov-bandwidth-limitations/m-p/1061337#M6880</link>
      <description>&lt;P&gt;Your English is quite good.&lt;/P&gt;

&lt;P&gt;The CPU performs the SIMD arithmetic on data that is sourced from one or a combination of:&lt;/P&gt;

&lt;P&gt;RAM&lt;BR /&gt;
	L3 Cache&lt;BR /&gt;
	L2 Cache&lt;BR /&gt;
	L1 Cache&lt;BR /&gt;
	register&lt;/P&gt;

&lt;P&gt;Of these, RAM is the slowest and registers are the fastest.&lt;/P&gt;

&lt;P&gt;It is the goal of the programmer to construct the algorithm such that it favors reuse of data brought into the faster end of the hierarchy.&lt;/P&gt;

&lt;P&gt;Some computational problems only touch the source data once per iteration (some problems have one iteration, others many).&lt;/P&gt;

&lt;P&gt;Most computational problems touch the source data several times per iteration. For example, a finite element analysis may analyze a point and its 26 nearest neighbors, or its 124 nearest neighbors. For these types of problems, you can often structure the algorithm so that repeated references to the same data are satisfied as close to the fast end of the hierarchy as possible, in the order register, L1, L2, L3, then RAM.&lt;/P&gt;
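
&lt;P&gt;The neighbor counts above are consistent with cubes of 3 and 5 grid points per side with the center removed (3^3 - 1 = 26 and 5^3 - 1 = 124). A minimal Python sketch, assuming a regular 3D grid and a cube-shaped neighborhood; the function name and the radius parameter are hypothetical and used only for illustration:&lt;/P&gt;

```python
from itertools import product


def neighbor_offsets(radius):
    """Offsets (dx, dy, dz) of every grid point within `radius` steps
    of a center point along each axis, excluding the center itself."""
    steps = range(-radius, radius + 1)
    return [o for o in product(steps, repeat=3) if o != (0, 0, 0)]


for radius in (1, 2):
    offsets = neighbor_offsets(radius)
    # In one sweep over the grid, an interior value is read once as a center
    # and once for each point that counts it as a neighbor.
    reads_per_value = len(offsets) + 1
    print(radius, len(offsets), reads_per_value)
```

&lt;P&gt;Under this assumption each interior value is read 27 or 125 times per sweep, which is why keeping a block of the grid in registers or cache between those reads pays off.&lt;/P&gt;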

&lt;P&gt;Some good educational references:&lt;/P&gt;

&lt;P&gt;&lt;A href="https://software.intel.com/sites/products/papers/tpt_ieee.pdf"&gt;https://software.intel.com/sites/products/papers/tpt_ieee.pdf&lt;/A&gt;&lt;BR /&gt;
	&lt;A href="https://software.intel.com/en-us/articles/3d-finite-differences-on-multi-core-processors"&gt;https://software.intel.com/en-us/articles/3d-finite-differences-on-multi-core-processors&lt;/A&gt;&lt;BR /&gt;
	&lt;A href="https://software.intel.com/en-us/articles/using-simd-technologies-on-intel-architecture-to-speed-up-game-code"&gt;https://software.intel.com/en-us/articles/using-simd-technologies-on-intel-architecture-to-speed-up-game-code&lt;/A&gt;&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Sun, 01 Feb 2015 15:15:34 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Memory-to-CPU-mov-bandwidth-limitations/m-p/1061337#M6880</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2015-02-01T15:15:34Z</dc:date>
    </item>
    <item>
      <title>Data misalignment is a</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Memory-to-CPU-mov-bandwidth-limitations/m-p/1061338#M6881</link>
      <description>&lt;P&gt;Data misalignment is a typical reason why SIMD parallel moves are ineffective. Most vectorizing compilers look for opportunities to adjust alignment, assuming a long enough stream. Details vary with the CPU. For example, misaligned 128-bit moves are OK on Sandy Bridge, whereas misaligned 256-bit moves are not.&lt;/P&gt;</description>
      <pubDate>Mon, 02 Feb 2015 14:04:43 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Memory-to-CPU-mov-bandwidth-limitations/m-p/1061338#M6881</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2015-02-02T14:04:43Z</dc:date>
    </item>
    <item>
      <title>Intel processors have</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Memory-to-CPU-mov-bandwidth-limitations/m-p/1061339#M6882</link>
      <description>&lt;P&gt;Intel processors have supported aligned SIMD loads since they supported SIMD.&amp;nbsp; These can be either MOV instructions (e.g., MOVAPD) or memory arguments to SIMD arithmetic instructions.&lt;/P&gt;

&lt;P&gt;Different types of SIMD memory operations have different alignment restrictions, and different performance penalties for SIMD access to data that is not SIMD-aligned.&amp;nbsp;&amp;nbsp; AVX is much easier to work with than SSE2/3/4, so compilers tend to use the SIMD memory access instructions much more often than they used to.&lt;/P&gt;

&lt;P&gt;SIMD memory operations don't necessarily provide any performance benefit.&amp;nbsp; For data that is outside of the L1 cache, all data motion takes place in 64-Byte cache lines.&amp;nbsp;&amp;nbsp; In some cases, Sandy Bridge processors actually get slightly better memory performance with scalar SSE or scalar AVX loads than with SIMD loads.&amp;nbsp;&amp;nbsp; For Haswell processors the AVX instructions give better bandwidth in all the cases I have tested, but there may still be counter-examples.&lt;/P&gt;
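
&lt;P&gt;As a rough illustration of the cache-line point above, the following Python sketch counts how many 64-byte lines have to be moved to stream through an array. The 4-byte element size, the line-aligned starting address, and the function name are assumptions for illustration; the model ignores prefetching and everything else the hardware does.&lt;/P&gt;

```python
CACHE_LINE_BYTES = 64  # line size quoted in the post


def lines_touched(start_addr, total_bytes, line_bytes=CACHE_LINE_BYTES):
    """Number of distinct cache lines covered by a contiguous byte range."""
    first = start_addr // line_bytes
    last = (start_addr + total_bytes - 1) // line_bytes
    return last - first + 1


n_floats = 1024
scalar_loads = n_floats        # one 4-byte load per element
avx_loads = n_floats // 8      # one 32-byte load per 8 elements
lines = lines_touched(0, n_floats * 4)

# The load instruction count drops 8x, but the lines moved stay the same.
print(scalar_loads, avx_loads, lines)

# A 32-byte load that starts 48 bytes into a line straddles two lines.
print(lines_touched(48, 32))
```

&lt;P&gt;In this simple model the wider loads reduce the number of load instructions but not the number of cache lines that must travel through the memory hierarchy.&lt;/P&gt;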

&lt;P&gt;As the vector width of the SIMD units increases, it does become increasingly difficult to deal with the cases where data rearrangement is required. For packed doubles in SSE you only needed to be able to load the low part, the high part, or swap the two parts. For packed doubles in AVX there are many more permutations of rearrangements that sometimes need to be dealt with. By the time you get to 8-element vectors of doubles in AVX-512, it can be extremely challenging to figure out how to rearrange data in the registers without losing the speedup that you are trying to get from the wide SIMD architecture.&lt;/P&gt;</description>
      <pubDate>Tue, 03 Feb 2015 00:40:04 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Memory-to-CPU-mov-bandwidth-limitations/m-p/1061339#M6882</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2015-02-03T00:40:04Z</dc:date>
    </item>
  </channel>
</rss>

