<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Intercore communication and Cache Sharing in Intel® Moderncode for Parallel Architectures</title>
    <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Intercore-communication-and-Cache-Sharing/m-p/1148671#M7803</link>
    <description>Intercore communication and Cache Sharing -- a thread from the Intel Moderncode for Parallel Architectures forum.</description>
    <pubDate>Mon, 17 Jul 2017 19:14:59 GMT</pubDate>
    <dc:creator>McCalpinJohn</dc:creator>
    <dc:date>2017-07-17T19:14:59Z</dc:date>
    <item>
      <title>Intercore communication and Cache Sharing</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Intercore-communication-and-Cache-Sharing/m-p/1148670#M7802</link>
      <description>&lt;P&gt;What would be the fastest (read: lowest-latency) method for moving several KiB of data from one core to another on the same physical package?&lt;/P&gt;

&lt;P&gt;Suppose core 0 writes some data to the L3 cache. My understanding from the SDM Vol. 3, Section 11.4 is that core 0 gains ownership of the cache lines (if it did not have it already), even if they were previously shared. For core 1 to read the newly written data, the cache lines would normally have to be flushed (written back to RAM) and then fetched by the other core through a cache miss. Is it possible to transfer ownership of the L3 lines from one core to the other, so that core 1 can access the data without waiting for the store to and load from RAM?&lt;/P&gt;

&lt;P&gt;In particular, I'm targeting Xeon E7 processors in the v4 series.&lt;/P&gt;

</description>
      <pubDate>Mon, 17 Jul 2017 14:06:28 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Intercore-communication-and-Cache-Sharing/m-p/1148670#M7802</guid>
      <dc:creator>Tim_G_2</dc:creator>
      <dc:date>2017-07-17T14:06:28Z</dc:date>
    </item>
    <item>
      <title>There are a number of</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Intercore-communication-and-Cache-Sharing/m-p/1148671#M7803</link>
      <description>&lt;P&gt;There are a number of different ways to implement these transactions, and it is difficult to be sure that you understand a particular implementation well enough to correctly predict what it is going to do. Section 11.4 of Volume 3 of the SDM is a very high-level description, and does not contain all of the cache states and transactions that are implemented in specific processors.&lt;/P&gt;

&lt;P&gt;In your case, the trickiest part is to ensure that the "consumer" thread does not start reading the "data" before the "producer" thread has completed writing it. Once that is correctly implemented, the "consumer" does not need to "move" anything -- it can simply access the data by loading from the appropriate addresses -- the L3 cache will handle the transfers efficiently. If the consumer needs to copy data to a different address range, a simple read/write loop or memcpy() call should provide very close to the best possible performance.&lt;/P&gt;
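
&lt;P&gt;As a concrete illustration (mine, not from the original post -- the names and the 2 KiB size are assumptions), here is a minimal sketch of that handoff in C11, with a release store publishing the "flag" and an acquire load on the consumer side:&lt;/P&gt;

&lt;PRE&gt;
#include &lt;stdatomic.h&gt;
#include &lt;string.h&gt;

#define DATA_BYTES 2048

static char shared_data[DATA_BYTES];
static _Alignas(64) atomic_int data_ready;   /* the "flag", on its own cache line */

/* Producer: write the payload, then publish the flag.  The release
 * store guarantees that the data stores are visible to any thread
 * that observes data_ready == 1 via an acquire load. */
void producer(const char *src)
{
    memcpy(shared_data, src, DATA_BYTES);
    atomic_store_explicit(&amp;data_ready, 1, memory_order_release);
}

/* Consumer: spin until the flag is set, then read (or copy) the data.
 * The acquire load pairs with the producer's release store, so the
 * payload reads cannot be reordered ahead of the flag check. */
void consumer(char *dst)
{
    while (atomic_load_explicit(&amp;data_ready, memory_order_acquire) == 0)
        ;                                    /* spin; the L3 handles the coherence traffic */
    memcpy(dst, shared_data, DATA_BYTES);
}
&lt;/PRE&gt;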

&lt;P&gt;The L3 cache can deliver data at a sustained rate of about 14 Bytes/cycle on the Xeon E7 v4 processors, so the total time for the consumer to read all the data will be composed of three parts:&lt;/P&gt;

&lt;OL&gt;
	&lt;LI&gt;A synchronization overhead -- the time required for the producer to notify the consumer that the data is ready to be read.
		&lt;OL&gt;
			&lt;LI&gt;This involves a fair number of cache transactions, with timing that depends on a great many factors, including (at least): core frequency, uncore frequency, number of cores on the die, locations of the producer core, the consumer core, and the L3 slice containing the "flag" variable used for synchronization, etc.&lt;/LI&gt;
			&lt;LI&gt;I don't have results on a Xeon E7 v4, but I measured a range of 200-300 cycles on a 12-core Xeon E5 v3 (depending on the relative placement of the cores and the L3 slice handling the cache line used for the handoff).&lt;/LI&gt;
		&lt;/OL&gt;
	&lt;/LI&gt;
	&lt;LI&gt;A pipeline startup -- the time required for the consumer to receive the first data elements from the producer after the synchronization tells the consumer that it can start reading the data.
		&lt;OL&gt;
			&lt;LI&gt;The synchronization above comprises roughly 2.5 intervention latencies, so a single intervention -- which is what this startup amounts to -- should take 200/2.5 to 300/2.5 cycles, i.e., 80-120 cycles (the arithmetic is collected after this list).&lt;/LI&gt;
		&lt;/OL&gt;
	&lt;/LI&gt;
	&lt;LI&gt;The bulk transfer.
		&lt;OL&gt;
			&lt;LI&gt;This should average about 14 bytes per cycle for loads of L3-resident data.&lt;/LI&gt;
			&lt;LI&gt;E.g., for 2 KiB, I would expect about 150 cycles.&lt;/LI&gt;
		&lt;/OL&gt;
	&lt;/LI&gt;
&lt;/OL&gt;
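
&lt;P&gt;Spelling out the arithmetic behind those estimates (all inputs taken from the list above):&lt;/P&gt;

&lt;PRE&gt;
synchronization : 200 .. 300 cycles                       (measured on Xeon E5 v3)
pipeline startup: 200/2.5 .. 300/2.5 = 80 .. 120 cycles   (one intervention latency)
bulk transfer   : 2048 Bytes / 14 Bytes/cycle = ~146, call it ~150 cycles
total           : 430 .. 570 cycles  -&gt;  2048/570 .. 2048/430 = 3.6 .. 4.8 Bytes/cycle
&lt;/PRE&gt;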

&lt;P&gt;Adding these three parts together gives an estimate of 430 to 570 cycles for 2048 Bytes, or about 3.6 Bytes/cycle to 4.8 Bytes/cycle.&lt;/P&gt;
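
&lt;P&gt;A rough sketch (again mine, with hypothetical names) of how one might check these numbers end to end: pin the producer and consumer threads to two cores on the same package and timestamp with __rdtsc(), whose count is synchronized across cores on one package. Note that with an invariant TSC the count is in reference cycles rather than core cycles, so the comparison with the estimates above is only approximate:&lt;/P&gt;

&lt;PRE&gt;
#include &lt;x86intrin.h&gt;                   /* __rdtsc() */
#include &lt;stdatomic.h&gt;
#include &lt;string.h&gt;

#define DATA_BYTES 2048

static char shared_data[DATA_BYTES];
static _Alignas(64) atomic_int data_ready;
static unsigned long long t_publish;     /* producer's timestamp */
static volatile unsigned char sink;      /* defeats dead-code elimination */

void producer_timed(const char *src)
{
    memcpy(shared_data, src, DATA_BYTES);
    t_publish = __rdtsc();               /* ordered before the flag by the release store */
    atomic_store_explicit(&amp;data_ready, 1, memory_order_release);
}

/* Returns TSC ticks from the producer's release store to the point
 * where the consumer has touched every byte of the payload -- i.e.,
 * synchronization + startup + bulk transfer combined. */
unsigned long long consumer_timed(void)
{
    while (atomic_load_explicit(&amp;data_ready, memory_order_acquire) == 0)
        ;                                /* spin on the flag line */
    unsigned char sum = 0;
    for (int i = 0; i != DATA_BYTES; ++i)
        sum += (unsigned char)shared_data[i];
    sink = sum;
    return __rdtsc() - t_publish;
}
&lt;/PRE&gt;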

&lt;P&gt;Because the total time is dominated by the overhead here, the efficiency should improve for larger transfer sizes (up to the size of the L1 cache, beyond which things get more complex).&lt;/P&gt;

</description>
      <pubDate>Mon, 17 Jul 2017 19:14:59 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Intercore-communication-and-Cache-Sharing/m-p/1148671#M7803</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2017-07-17T19:14:59Z</dc:date>
    </item>
  </channel>
</rss>

