<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Glad to hear it! in Software Tuning, Performance Optimization &amp; Platform Monitoring</title>
    <link>https://community.intel.com/t5/Software-Tuning-Performance/Sandy-Bridge-i5-2500k-uncore-ARB-events/m-p/1103411#M5913</link>
    <description>&lt;P&gt;Glad to hear it!&lt;/P&gt;</description>
    <pubDate>Mon, 21 Mar 2016 16:27:30 GMT</pubDate>
    <dc:creator>A_T_Intel</dc:creator>
    <dc:date>2016-03-21T16:27:30Z</dc:date>
    <item>
      <title>Sandy Bridge i5-2500k uncore ARB events</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Sandy-Bridge-i5-2500k-uncore-ARB-events/m-p/1103408#M5910</link>
      <description>&lt;P&gt;Hi All,&lt;/P&gt;

&lt;P&gt;I read the Intel SDM and found two uncore events: UNC_ARB_TRK_REQUESTS.ALL and UNC_ARB_TRK_OCCUPANCY.ALL. From my understanding, ARB occupancy refers to memory requests waiting to be serviced by the iMC. So the average memory request latency can be calculated as UNC_ARB_TRK_OCCUPANCY.ALL / UNC_ARB_TRK_REQUESTS.ALL. Is that correct?&lt;/P&gt;

&lt;P&gt;I tried to monitor the memory request latency by setting up MSRs in the following way:&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;/* enable uncore monitor */
wrmsr(MSR_UNC_PERF_GLOBAL_CTRL, 0x20000000);

int i;

/* need to lock bus to initialize occupancy */

asm volatile ("xchgl %0, %%eax"::"m" (i):"eax");

wrmsr(MSR_UNC_ARB_PERFEVTSEL0, 0x80 | (0x01 &amp;lt;&amp;lt; 8) | (0x01 &amp;lt;&amp;lt; 22));

wrmsr(MSR_UNC_ARB_PERFEVTSEL1, 0x81 | (0x01 &amp;lt;&amp;lt; 8) | (0x01 &amp;lt;&amp;lt; 22));&lt;/PRE&gt;


&lt;P&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;Periodically I checked the latency value by doing:&lt;/SPAN&gt;&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;arb_occupancy = rdmsr(MSR_UNC_ARB_PER_CTR0);

arb_request = rdmsr(MSR_UNC_ARB_PER_CTR1);

latency = arb_occupancy / arb_request;&lt;/PRE&gt;
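
&lt;P&gt;Since both counters keep accumulating, a periodic check is probably more meaningful on the difference between two consecutive readings than on the raw totals. A minimal sketch of that calculation, assuming the counters do not wrap between samples; arb_interval_latency and its parameter names are hypothetical and not part of the setup above:&lt;/P&gt;

```c
/* Average ARB latency (cycles per request) over one sampling interval.
   Hypothetical helper: it only does the arithmetic, reading the MSRs is
   left to the caller. Assumes no counter wrap between the two samples. */
double arb_interval_latency(unsigned long long occ_prev,
                            unsigned long long occ_now,
                            unsigned long long req_prev,
                            unsigned long long req_now)
{
    /* occupancy accumulated during the interval */
    unsigned long long d_occ = occ_now - occ_prev;
    /* requests counted during the interval */
    unsigned long long d_req = req_now - req_prev;

    /* no requests in the interval: latency is undefined, report 0 */
    if (d_req == 0)
        return 0.0;

    return (double)d_occ / (double)d_req;
}
```

&lt;P&gt;Feeding it the previous and current readings of the occupancy and request counters gives the average for just that interval, so a change in workload is not hidden behind the earlier history.&lt;/P&gt;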


&lt;P&gt;I ran two benchmarks, A and B. A performs sequential memory accesses (int array[...], size much larger than the L3 cache, built as a 32-bit program):&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;while (1) {
  for (i = 0; i &amp;lt; SIZE; i ++) {
    tmp += i;
    array[i] += i;
    tmp = (tmp + 1) % 1030;
  }
}&lt;/PRE&gt;

&lt;P&gt;B performs pseudo-random memory accesses:&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;while (1) {
  for (k = 0; k &amp;lt; 1030; k++) {
    j = 0;
    for (i = k; i &amp;lt; SIZE; i += 1030) {
      index = i + j;
      array[index] += index;
      j = (j + 1) % 1030;
    }
  }
}&lt;/PRE&gt;


&lt;P&gt;I ran A and B separately on an otherwise idle system and monitored their latencies. I expected the sequential benchmark to show a smaller latency than the random one, but the two were the same.&lt;/P&gt;

&lt;P&gt;So did I misunderstand those ARB events? Did I set the MSRs up the wrong way? Or is there an issue with my benchmarks? Any thoughts would be much appreciated. Thanks!&lt;/P&gt;</description>
      <pubDate>Tue, 08 Mar 2016 16:58:46 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Sandy-Bridge-i5-2500k-uncore-ARB-events/m-p/1103408#M5910</guid>
      <dc:creator>Ying_Y_</dc:creator>
      <dc:date>2016-03-08T16:58:46Z</dc:date>
    </item>
    <item>
      <title>Hi Ying Y.</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Sandy-Bridge-i5-2500k-uncore-ARB-events/m-p/1103409#M5911</link>
      <description>&lt;P&gt;Hi Ying Y.&lt;/P&gt;

&lt;P&gt;The formula you have for the average memory request latency looks correct to me (UNC_ARB_TRK_OCCUPANCY.ALL / UNC_ARB_TRK_REQUESTS.ALL).&amp;nbsp; This will give you the latency in terms of core clocks.&amp;nbsp;&lt;/P&gt;

&lt;P&gt;Are you getting reasonable values for UNC_ARB_TRK_REQUESTS.ALL for both cases?&amp;nbsp;&lt;/P&gt;

&lt;P&gt;If so,&amp;nbsp;my first thought is that the hardware prefetchers are adequately prefetching both cases. I would first try to disable the hardware prefetchers.&amp;nbsp;&lt;/P&gt;

&lt;P&gt;If not, then I'd look at the assembly and make sure the compiler is doing what you expect with your code.&lt;/P&gt;
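
&lt;P&gt;For reference, a minimal sketch of the prefetcher change, assuming the layout Intel has published for MSR 0x1A4 on Sandy Bridge-class cores (bits 0 to 3 each disable one prefetcher when set). The macro and function names are made up for illustration, and the result would still have to be written back on every core with whatever wrmsr helper you already use:&lt;/P&gt;

```c
/* Assumed layout of MSR 0x1A4: setting a bit disables that prefetcher. */
#define MSR_PREFETCH_CONTROL   0x1A4    /* hypothetical macro name */
#define L2_STREAM_PF_DISABLE   0x1ULL   /* bit 0: L2 hardware prefetcher */
#define L2_ADJACENT_PF_DISABLE 0x2ULL   /* bit 1: L2 adjacent cache line prefetcher */
#define DCU_STREAM_PF_DISABLE  0x4ULL   /* bit 2: L1D (DCU) prefetcher */
#define DCU_IP_PF_DISABLE      0x8ULL   /* bit 3: L1D (DCU) IP prefetcher */

/* Return the MSR value with all four prefetchers disabled,
   leaving every other bit of the old value untouched. */
unsigned long long prefetchers_all_off(unsigned long long old_value)
{
    return old_value | L2_STREAM_PF_DISABLE | L2_ADJACENT_PF_DISABLE
                     | DCU_STREAM_PF_DISABLE | DCU_IP_PF_DISABLE;
}
```

&lt;P&gt;Usage would be along the lines of wrmsr(MSR_PREFETCH_CONTROL, prefetchers_all_off(rdmsr(MSR_PREFETCH_CONTROL))), keeping the old value around so it can be restored after the measurement.&lt;/P&gt;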

</description>
      <pubDate>Wed, 16 Mar 2016 00:27:26 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Sandy-Bridge-i5-2500k-uncore-ARB-events/m-p/1103409#M5911</guid>
      <dc:creator>A_T_Intel</dc:creator>
      <dc:date>2016-03-16T00:27:26Z</dc:date>
    </item>
    <item>
      <title>Quote:Perry Taylor (Intel)</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Sandy-Bridge-i5-2500k-uncore-ARB-events/m-p/1103410#M5912</link>
      <description>&lt;BLOCKQUOTE&gt;Perry Taylor (Intel) wrote:&lt;BR /&gt;

&lt;P&gt;Hi Ying Y.&lt;/P&gt;

&lt;P&gt;The formula you have for the average memory request latency looks correct to me (UNC_ARB_TRK_OCCUPANCY.ALL / UNC_ARB_TRK_REQUESTS.ALL).&amp;nbsp; This will give you the latency in terms of core clocks.&amp;nbsp;&lt;/P&gt;

&lt;P&gt;Are you getting reasonable values for UNC_ARB_TRK_REQUESTS.ALL for both cases?&amp;nbsp;&lt;/P&gt;

&lt;P&gt;If so,&amp;nbsp;my first thought is that the hardware prefetchers are adequately prefetching both cases. I would first try to disable the hardware prefetchers.&amp;nbsp;&lt;/P&gt;

&lt;P&gt;If not, then I'd look at the assembly and make sure the compiler is doing what you expect with your code.&lt;/P&gt;


&lt;/BLOCKQUOTE&gt;

&lt;P&gt;&lt;SPAN style="font-size: 16.26px; line-height: 24.39px;"&gt;Thanks Perry! I think I figured out! The problem is, for the pseudo random access, adjacent accesses go to different memory banks, leading to better bank level parallelism.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 18 Mar 2016 04:05:07 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Sandy-Bridge-i5-2500k-uncore-ARB-events/m-p/1103410#M5912</guid>
      <dc:creator>Ying_Y_</dc:creator>
      <dc:date>2016-03-18T04:05:07Z</dc:date>
    </item>
    <item>
      <title>Glad to hear it!</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Sandy-Bridge-i5-2500k-uncore-ARB-events/m-p/1103411#M5913</link>
      <description>&lt;P&gt;Glad to hear it!&lt;/P&gt;</description>
      <pubDate>Mon, 21 Mar 2016 16:27:30 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Sandy-Bridge-i5-2500k-uncore-ARB-events/m-p/1103411#M5913</guid>
      <dc:creator>A_T_Intel</dc:creator>
      <dc:date>2016-03-21T16:27:30Z</dc:date>
    </item>
  </channel>
</rss>

