<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Run each (presumably in Intel® Moderncode for Parallel Architectures</title>
    <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/reading-two-cache-lines-issue/m-p/1079502#M7054</link>
    <description>&lt;P&gt;Run each (presumably optimized) configuration under VTune, then look at the disassembly code. This will tell you the instruction sequence difference.&lt;/P&gt;

&lt;P&gt;Note, it is possible that the/a while loop that has longer instruction latencies could take fewer iterations to exit (and thus fewer cpu_pause()'s)&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
    <pubDate>Tue, 02 Aug 2016 15:05:30 GMT</pubDate>
    <dc:creator>jimdempseyatthecove</dc:creator>
    <dc:date>2016-08-02T15:05:30Z</dc:date>
    <item>
      <title>reading two cache lines issue</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/reading-two-cache-lines-issue/m-p/1079501#M7053</link>
      <description>&lt;P&gt;First, I did 2 tests:&amp;nbsp;&lt;/P&gt;

&lt;P&gt;1) I prepared a randomized list: each item's successor sits at a random address within L1D that is a multiple of the cache line size, so the prefetcher cannot help the reader. Every item is one cache line in size (64 bytes), and the total number of items is 32K (L1D size) / 64 (cache line size). Then, from the other core, I walk the list, measure the time, and divide it by the number of elements. The result is how long it takes to load one cache line from a different core. It is consistent with the Intel documentation: about 50 cycles.&lt;/P&gt;
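
&lt;P&gt;A minimal sketch of how such a list could be built, assuming 64-byte lines and a 32 KiB L1D as in test 1. The names node, link_random_cycle and chase are hypothetical (the post does not show its list code), the small random generator is only there for illustration, and the timing and thread pinning are left out:&lt;/P&gt;

&lt;PRE&gt;
```c
/* Assumed sizes, taken from the post: 64-byte lines, 32 KiB L1D. */
enum { LINE_SIZE = 64, L1D_SIZE = 32 * 1024, NODES = L1D_SIZE / LINE_SIZE };

/* One list item per cache line; the pad fills the rest of the line. */
struct node {
    _Alignas(LINE_SIZE) struct node *next;
    char pad[LINE_SIZE - sizeof(struct node *)];
};

/* Link nodes[0..NODES-1] into a single cycle that is visited in shuffled
   order, so consecutive loads land on unrelated lines. */
static void link_random_cycle(struct node *nodes, unsigned seed)
{
    unsigned order[NODES];
    unsigned state = seed;

    for (unsigned i = 0; i != NODES; ++i)
        order[i] = i;

    /* Fisher-Yates shuffle driven by a small linear congruential generator. */
    for (unsigned i = NODES - 1; i != 0; --i) {
        state = state * 1664525u + 1013904223u;
        unsigned j = (state >> 16) % (i + 1);
        unsigned tmp = order[i];
        order[i] = order[j];
        order[j] = tmp;
    }

    /* Each node points at the next one in the shuffled order; the last
       one points back at the first, closing the cycle. */
    for (unsigned i = 0; i != NODES; ++i)
        nodes[order[i]].next = nodes + order[(i + 1) % NODES];
}

/* Follow the chain: every load depends on the result of the previous one. */
static struct node *chase(struct node *p, unsigned long steps)
{
    for (unsigned long i = 0; i != steps; ++i)
        p = p->next;
    return p;
}
```
&lt;/PRE&gt;

&lt;P&gt;Timing chase(nodes, NODES) on the second core and dividing by NODES would give the per-line figure. Because each load depends on the previous one, the loads of a single list cannot overlap.&lt;/P&gt;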

&lt;P&gt;2) Of course, if I prepare 2 such lists and walk both of them from the other core, this number does not change much, because the CPU can issue 2 loads per cycle and the 2 lists are completely independent of each other: there is no data dependency, so the CPU can issue the loads of the next items of both lists simultaneously. This is consistent with the documentation as well.&lt;/P&gt;

&lt;P&gt;What I do now: I have 4 variables (8 bytes each), each aligned on a cache line boundary. Two of them mimic one pseudo queue and the other two mimic a second pseudo queue. The first thread writes the first variable ('data'), then the second variable ('counter'), then reads the third and fourth variables; the second thread does the opposite. So the first thread writes to the first pseudo queue and reads from the second, while the second thread reads from the first and writes to the second.&lt;/P&gt;

&lt;P&gt;The code looks like this:&lt;/P&gt;

&lt;P&gt;First thread:&lt;/P&gt;

&lt;P&gt;for (size_t i = 0; i &amp;lt; count; ++i)&lt;BR /&gt;
{&lt;BR /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;data0 = i;&lt;BR /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;// barrier();&lt;BR /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;value1 = i;&lt;BR /&gt;
&lt;BR /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;unsigned long long vtmp0;&lt;BR /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;unsigned long long tmp0 = value2;&lt;BR /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;// barrier();&lt;BR /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;vtmp0 = data1;&lt;BR /&gt;
&lt;BR /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;while (tmp0 != i)&lt;BR /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;{&lt;BR /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;cpu_pause();&lt;BR /&gt;
&lt;BR /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;tmp0 = value2;&lt;BR /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;// barrier();&lt;BR /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;vtmp0 = data1;&lt;BR /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;}&lt;BR /&gt;
&lt;BR /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;v += tmp0 + vtmp0; // just calculating something&lt;BR /&gt;
}&lt;/P&gt;

&lt;P&gt;Second thread:&lt;/P&gt;

&lt;P&gt;for (size_t i = 0; i &amp;lt; count; ++i)&lt;BR /&gt;
{&lt;BR /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;unsigned long long vtmp0;&lt;BR /&gt;
&lt;BR /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;unsigned long long tmp0 = value1;&lt;BR /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;// barrier();&lt;BR /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;vtmp0 = data0;&lt;BR /&gt;
&lt;BR /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;while (tmp0 != i)&lt;BR /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;{&lt;BR /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;cpu_pause();&lt;BR /&gt;
&lt;BR /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;tmp0 = value1;&lt;BR /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;// barrier();&lt;BR /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;vtmp0 = data0;&lt;BR /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;}&lt;BR /&gt;
&lt;BR /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;v += tmp0 + vtmp0; // just calculating something&lt;BR /&gt;
&lt;BR /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;data1 = i;&lt;BR /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;// barrier();&lt;BR /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;value2 = i;&lt;BR /&gt;
}&lt;/P&gt;

&lt;P&gt;Of course, I need those barrier() calls to force the compiler to preserve the order of the reads and writes.&lt;/P&gt;
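
&lt;P&gt;For reference, a sketch of what barrier() and cpu_pause() could look like and how a writer/reader pair would use them. The definitions are assumptions (GCC/Clang inline asm; the post does not show its own), publish and consume are hypothetical helper names, and the plain non-atomic variables rely on compiler behaviour and x86 ordering rather than on anything the C standard guarantees:&lt;/P&gt;

&lt;PRE&gt;
```c
/* Assumed definitions for GCC/Clang; the post does not show its own. */
#define barrier() __asm__ __volatile__("" ::: "memory")
#if defined(__x86_64__) || defined(__i386__)
#define cpu_pause() __asm__ __volatile__("pause" ::: "memory")
#else
#define cpu_pause() barrier()
#endif

/* Each variable on its own 64-byte line, as described in the post. */
static _Alignas(64) unsigned long long data0;
static _Alignas(64) unsigned long long value1;

/* Writer side: store the payload, then the counter that announces it. */
static void publish(unsigned long long i)
{
    data0 = i;
    barrier();          /* the compiler may not sink the data0 store below this */
    value1 = i;
}

/* Reader side: spin until the counter matches, then read the payload. */
static unsigned long long consume(unsigned long long i)
{
    while (value1 != i)
        cpu_pause();    /* the "memory" clobber forces value1 to be reloaded */
    barrier();          /* the compiler may not hoist the data0 load above this */
    return data0;
}
```
&lt;/PRE&gt;

&lt;P&gt;Here barrier() only constrains the compiler and emits no instruction, so any timing difference it causes would have to come from the code the compiler generates around it.&lt;/P&gt;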

&lt;P&gt;If the barrier()s are not commented out, then the time is about 210.&lt;/P&gt;

&lt;P&gt;If the barrier()s are commented out, then the time is about 175.&lt;/P&gt;

&lt;P&gt;I don't understand why this difference exists at all.&lt;/P&gt;

&lt;P&gt;Any ideas?&lt;/P&gt;</description>
      <pubDate>Tue, 02 Aug 2016 07:48:58 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/reading-two-cache-lines-issue/m-p/1079501#M7053</guid>
      <dc:creator>Aleksandr_A_1</dc:creator>
      <dc:date>2016-08-02T07:48:58Z</dc:date>
    </item>
    <item>
      <title>Run each (presumably</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/reading-two-cache-lines-issue/m-p/1079502#M7054</link>
      <description>&lt;P&gt;Run each (presumably optimized) configuration under VTune, then look at the disassembly code. This will tell you the instruction sequence difference.&lt;/P&gt;

&lt;P&gt;Note, it is possible that the/a while loop that has longer instruction latencies could take fewer iterations to exit (and thus fewer cpu_pause()'s)&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Tue, 02 Aug 2016 15:05:30 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/reading-two-cache-lines-issue/m-p/1079502#M7054</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2016-08-02T15:05:30Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;...It's consistent to what</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/reading-two-cache-lines-issue/m-p/1079503#M7055</link>
      <description>&amp;gt;&amp;gt;...It's consistent to what is said in the Intel documentation, about &lt;STRONG&gt;50 cycles&lt;/STRONG&gt;...

Where did you get that number? Source, please.</description>
      <pubDate>Fri, 05 Aug 2016 21:14:58 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/reading-two-cache-lines-issue/m-p/1079503#M7055</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2016-08-05T21:14:58Z</dc:date>
    </item>
    <item>
      <title>Quote:Sergey Kostrov wrote:</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/reading-two-cache-lines-issue/m-p/1079504#M7056</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;Sergey Kostrov wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;&amp;gt;&amp;gt;...It's consistent to what is said in the Intel documentation, about &lt;STRONG&gt;50 cycles&lt;/STRONG&gt;...&lt;/P&gt;

&lt;P&gt;Where did you get that number? Source, please.&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;64-ia-32-architectures-optimization-manual.pdf&lt;/P&gt;

&lt;P&gt;2.3.5.1 Load and Store Operation Overview&lt;/P&gt;

&lt;P&gt;Lookup order and lookup latency&lt;/P&gt;

&lt;P&gt;L2 and L1 DCache in another core: 43 cycles on a clean hit, 60 on a dirty hit&lt;/P&gt;

</description>
      <pubDate>Mon, 08 Aug 2016 08:04:56 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/reading-two-cache-lines-issue/m-p/1079504#M7056</guid>
      <dc:creator>Aleksandr_A_1</dc:creator>
      <dc:date>2016-08-08T08:04:56Z</dc:date>
    </item>
    <item>
      <title>Quote:jimdempseyatthecove</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/reading-two-cache-lines-issue/m-p/1079505#M7057</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;jimdempseyatthecove wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;Run each (presumably optimized) configuration under VTune, then look at the disassembly code. This will tell you the instruction sequence difference.&lt;/P&gt;

&lt;P&gt;Note, it is possible that the/a while loop that has longer instruction latencies could take fewer iterations to exit (and thus fewer cpu_pause()'s)&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;Let's assume that the instructions are reordered. What happens is that one thread modifies two cache lines and another thread reads them. Of course the reader impedes the writer with its reads, but there is a pause instruction in the loop, which takes around 50 cycles, so the writer should be able to complete its writes before the reader interrupts it again.&lt;/P&gt;</description>
      <pubDate>Mon, 08 Aug 2016 08:08:18 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/reading-two-cache-lines-issue/m-p/1079505#M7057</guid>
      <dc:creator>Aleksandr_A_1</dc:creator>
      <dc:date>2016-08-08T08:08:18Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;let's assume that</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/reading-two-cache-lines-issue/m-p/1079506#M7058</link>
      <description>&lt;P&gt;&amp;gt;&amp;gt;let's assume that instructions are reordered.&lt;/P&gt;

&lt;P&gt;It is the programmer's responsibility to work with the compiler to ensure that the instruction sequence is the one required (and to verify this by inspection).&lt;/P&gt;

&lt;P&gt;That said, IA32 and Intel64 (not necessarily IA64/Itanium) are strongly ordered systems that preserve write ordering and cache coherency.&lt;/P&gt;

&lt;P&gt;Can you post your test program for others to examine?&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Mon, 08 Aug 2016 12:38:42 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/reading-two-cache-lines-issue/m-p/1079506#M7058</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2016-08-08T12:38:42Z</dc:date>
    </item>
  </channel>
</rss>

