<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Thanks John, for your in Software Tuning, Performance Optimization &amp; Platform Monitoring</title>
    <link>https://community.intel.com/t5/Software-Tuning-Performance/Significant-increase-in-cache-to-cache-data-transfer/m-p/1139782#M6617</link>
    <description>&lt;P&gt;Thanks John, for your response. When I run just the basic command ./mlc --c2c_latency, I get 13.5 ns, but I doubt this number, because as soon as I add the -r option (./mlc -r --c2c_latency) the latency becomes 66 ns again. So it is still confusing why there is this much difference.&lt;/P&gt;

&lt;P&gt;Just to let you know about the problem I am working on: we have a set of heavy calculations, so I am trying to create a 3-stage pipeline on the CPU by dividing the calculation into multiple stages. In this pipeline, only 128 bytes are sent to the next CPU for the next set of calculations, so it is like a single producer/consumer.&lt;/P&gt;

&lt;P&gt;When I measured the time, on the older CPU the data transfer was taking 16% of the overall calculation, but on the new CPU this has become 25%, which is causing all the problems. Any ideas?&lt;/P&gt;</description>
    <pubDate>Tue, 17 Oct 2017 13:01:49 GMT</pubDate>
    <dc:creator>Vinay_Y_</dc:creator>
    <dc:date>2017-10-17T13:01:49Z</dc:date>
    <item>
      <title>Significant increase in cache to cache data transfer</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Significant-increase-in-cache-to-cache-data-transfer/m-p/1139780#M6615</link>
      <description>&lt;DIV class="field field-name-comment-body field-type-text-long field-label-hidden"&gt;
	&lt;DIV class="field-items"&gt;
		&lt;DIV class="field-item even"&gt;
			&lt;P&gt;Hi All,&lt;/P&gt;

			&lt;P&gt;I am working on low-latency software where I need to transfer data between cores very fast. I was exploring these two machines with the Intel MLC tool.&lt;/P&gt;

			&lt;P&gt;I ran exactly this command on both machines:&lt;BR /&gt;
				&lt;STRONG&gt;sudo ./mlc -e -r -c3 -i2 -l128 --c2c_latency&lt;/STRONG&gt;&lt;/P&gt;

			&lt;P&gt;and the following are the results for the two CPUs:&lt;/P&gt;

			&lt;P&gt;&lt;STRONG&gt;[CPU 1]&lt;/STRONG&gt;&lt;BR /&gt;
				&amp;nbsp;Intel(R) Xeon(R) Gold 6144 CPU @ 3.50GHz&lt;BR /&gt;
				&amp;nbsp;Number of NUMA nodes = 1&lt;BR /&gt;
				&amp;nbsp;&lt;STRONG&gt;uname -a&lt;/STRONG&gt;&lt;BR /&gt;
				&amp;nbsp;&amp;nbsp; &lt;EM&gt;Linux cresco31 4.4.87-18.29-default #1 SMP Wed Sep 13 07:07:43 UTC 2017 (3e35b20) x86_64 x86_64 x86_64 GNU/Linux&lt;/EM&gt;&lt;BR /&gt;
				&amp;nbsp;&lt;STRONG&gt;sudo ./mlc -e -r -c3 -i2 -l128 --c2c_latency&lt;/STRONG&gt;&lt;BR /&gt;
				&amp;nbsp;&amp;nbsp; &lt;EM&gt;Latency = 231.1 core clocks (66.0 ns)&lt;/EM&gt;&lt;/P&gt;

			&lt;P&gt;&lt;STRONG&gt;[CPU 2]&lt;/STRONG&gt;&lt;BR /&gt;
				&amp;nbsp;Intel(R) Xeon(R) CPU E5-2643 v2 @ 3.50GHz&lt;BR /&gt;
				&amp;nbsp;Number of NUMA nodes = 1&lt;BR /&gt;
				&amp;nbsp;&lt;STRONG&gt;uname -a&lt;/STRONG&gt;&lt;BR /&gt;
				&amp;nbsp;&amp;nbsp; &lt;EM&gt;Linux cresco29 4.4.73-18.17-default #1 SMP Fri Jun 23 20:25:06 UTC 2017 (f462a66) x86_64 x86_64 x86_64 GNU/Linux&lt;/EM&gt;&lt;BR /&gt;
				&amp;nbsp;&lt;STRONG&gt;sudo ./mlc -e -r -c3 -i2 -l128 --c2c_latency&lt;/STRONG&gt;&lt;BR /&gt;
				&amp;nbsp;&amp;nbsp; &lt;EM&gt;Latency = 157.6 core clocks (45.0 ns)&lt;/EM&gt;&lt;/P&gt;

			&lt;P&gt;We can see that cache-to-cache latency has increased significantly on the newer CPU. Does this mean that new CPUs are slower in this regard?&lt;/P&gt;
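
			&lt;P&gt;As a quick sanity check on the figures above, the core-clock counts can be converted back to nanoseconds. This is only a sketch: it assumes the clocks are counted at the nominal 3.50 GHz shown in both CPU names, and the function name clocks_to_ns is made up for illustration.&lt;/P&gt;

```python
# Convert the reported core clocks to nanoseconds.
# Assumption: clocks are counted at the nominal 3.50 GHz of both CPUs.

def clocks_to_ns(core_clocks, freq_ghz):
    """One clock at f GHz lasts 1/f ns, so ns = clocks / f."""
    return core_clocks / freq_ghz

gold_6144_ns = clocks_to_ns(231.1, 3.5)    # Xeon Gold 6144
e5_2643v2_ns = clocks_to_ns(157.6, 3.5)    # Xeon E5-2643 v2
increase_pct = (gold_6144_ns / e5_2643v2_ns - 1.0) * 100.0

print(f"Gold 6144:  {gold_6144_ns:.1f} ns")
print(f"E5-2643 v2: {e5_2643v2_ns:.1f} ns")
print(f"Increase:   {increase_pct:.1f} %")
```

			&lt;P&gt;Under that assumption both conversions reproduce the 66.0 ns and 45.0 ns values printed by the tool, which puts the newer CPU roughly 47% higher on this test.&lt;/P&gt;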

			&lt;P&gt;Thanks&lt;/P&gt;

			&lt;P&gt;Vinay&lt;/P&gt;
		&lt;/DIV&gt;
	&lt;/DIV&gt;
&lt;/DIV&gt;</description>
      <pubDate>Thu, 12 Oct 2017 11:26:39 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Significant-increase-in-cache-to-cache-data-transfer/m-p/1139780#M6615</guid>
      <dc:creator>Vinay_Y_</dc:creator>
      <dc:date>2017-10-12T11:26:39Z</dc:date>
    </item>
    <item>
      <title>As core counts increase,</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Significant-increase-in-cache-to-cache-data-transfer/m-p/1139781#M6616</link>
      <description>&lt;P&gt;As core counts increase, designers often have to make changes that allow increased throughput but at the cost of increased latency.&amp;nbsp; The Xeon E5-2643 v2 is a 6-core part built on a single ring, while the Xeon Gold 6144 is built on a two-dimensional mesh, so it is not surprising that there is an increase in cache-to-cache transfer latency.&lt;/P&gt;

&lt;P&gt;The specific numbers you show are a bit odd, and when I try this test I also get numbers that don't make any sense -- they are slow and they don't change when I change the values for the "-i" and "-c" options.&amp;nbsp; (I am testing on a two-socket Xeon Platinum 8160 node -- 24 cores, 2.1 GHz nominal, 3.7 GHz max Turbo.)&amp;nbsp;&amp;nbsp; There may be something funny with the core bindings on the Xeon Scalable processors.&lt;/P&gt;

&lt;P&gt;The default version of the command "sudo ./mlc --c2c_latency" gives more reasonable results:&lt;/P&gt;

&lt;BLOCKQUOTE&gt;
	&lt;P&gt;# ./mlc&amp;nbsp; --c2c_latency&lt;BR /&gt;
		Intel(R) Memory Latency Checker - v3.4&lt;BR /&gt;
		Command line parameters: --c2c_latency&lt;/P&gt;

	&lt;P&gt;Measuring cache-to-cache transfer latency (in ns)...&lt;BR /&gt;
		Local Socket L2-&amp;gt;L2 HIT&amp;nbsp; latency&amp;nbsp;&amp;nbsp; &amp;nbsp;48.3&lt;BR /&gt;
		Local Socket L2-&amp;gt;L2 HITM latency&amp;nbsp;&amp;nbsp; &amp;nbsp;48.3&lt;BR /&gt;
		Remote Socket L2-&amp;gt;L2 HITM latency (data address homed in writer socket)&lt;BR /&gt;
		&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;Reader Numa Node&lt;BR /&gt;
		Writer Numa Node&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 1&amp;nbsp;&amp;nbsp; &amp;nbsp;&lt;BR /&gt;
		&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; -&amp;nbsp;&amp;nbsp; &amp;nbsp; 112.2&amp;nbsp;&amp;nbsp; &amp;nbsp;&lt;BR /&gt;
		&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 1&amp;nbsp;&amp;nbsp; &amp;nbsp; 113.1&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; -&amp;nbsp;&amp;nbsp; &amp;nbsp;&lt;BR /&gt;
		Remote Socket L2-&amp;gt;L2 HITM latency (data address homed in reader socket)&lt;BR /&gt;
		&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;Reader Numa Node&lt;BR /&gt;
		Writer Numa Node&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 1&amp;nbsp;&amp;nbsp; &amp;nbsp;&lt;BR /&gt;
		&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; -&amp;nbsp;&amp;nbsp; &amp;nbsp; 177.9&amp;nbsp;&amp;nbsp; &amp;nbsp;&lt;BR /&gt;
		&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 1&amp;nbsp;&amp;nbsp; &amp;nbsp; 181.2&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; -&amp;nbsp;&amp;nbsp; &amp;nbsp;&lt;/P&gt;

&lt;/BLOCKQUOTE&gt;</description>
      <pubDate>Mon, 16 Oct 2017 15:18:20 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Significant-increase-in-cache-to-cache-data-transfer/m-p/1139781#M6616</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2017-10-16T15:18:20Z</dc:date>
    </item>
    <item>
      <title>Thanks John, for your</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Significant-increase-in-cache-to-cache-data-transfer/m-p/1139782#M6617</link>
      <description>&lt;P&gt;Thanks John, for your response. When I run just the basic command ./mlc --c2c_latency, I get 13.5 ns, but I doubt this number, because as soon as I add the -r option (./mlc -r --c2c_latency) the latency becomes 66 ns again. So it is still confusing why there is this much difference.&lt;/P&gt;

&lt;P&gt;Just to let you know about the problem I am working on: we have a set of heavy calculations, so I am trying to create a 3-stage pipeline on the CPU by dividing the calculation into multiple stages. In this pipeline, only 128 bytes are sent to the next CPU for the next set of calculations, so it is like a single producer/consumer.&lt;/P&gt;
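
&lt;P&gt;To put the 128 bytes in context: data moves between cores in whole cache lines, so the number of lines the message covers matters more than its byte count. A small sketch, assuming 64-byte cache lines; the helper name lines_touched is hypothetical and not part of any tool mentioned here.&lt;/P&gt;

```python
# Count how many cache lines a message touches.
# Assumption: 64-byte cache lines, as on the Xeon parts in this thread.
CACHE_LINE_BYTES = 64

def lines_touched(offset, size, line=CACHE_LINE_BYTES):
    """Number of cache lines covered by `size` bytes starting at byte `offset`."""
    first = offset // line
    last = (offset + size - 1) // line
    return last - first + 1

print(lines_touched(0, 128))    # buffer aligned to a 64-byte boundary
print(lines_touched(32, 128))   # buffer that starts in the middle of a line
```

&lt;P&gt;With an aligned buffer the 128 bytes cover two lines; if the buffer starts mid-line they cover three, so each hand-off would pull one extra line across.&lt;/P&gt;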

&lt;P&gt;When I measured the time, on the older CPU the data transfer was taking 16% of the overall calculation, but on the new CPU this has become 25%, which is causing all the problems. Any ideas?&lt;/P&gt;</description>
      <pubDate>Tue, 17 Oct 2017 13:01:49 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Significant-increase-in-cache-to-cache-data-transfer/m-p/1139782#M6617</guid>
      <dc:creator>Vinay_Y_</dc:creator>
      <dc:date>2017-10-17T13:01:49Z</dc:date>
    </item>
  </channel>
</rss>

