<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic bug in Haswell-E Offcore Response counters? in Software Tuning, Performance Optimization &amp; Platform Monitoring</title>
    <link>https://community.intel.com/t5/Software-Tuning-Performance/bug-in-Haswell-E-Offcore-Response-counters/m-p/1180097#M7413</link>
    <description>&lt;P&gt;On a Haswell-E processor (Xeon E7-4830 v3, family_signature=06_3f), the offcore response counters seem to work only for a response type of ANY. Otherwise, they return 0. Details are below.&lt;/P&gt;

&lt;P&gt;I'm testing a cache ping-pong program with two threads on two sockets. If I set the request type to DMND_DATA_RD (bit 0) and the response type to ANY (bit 16), I get the expected results:&lt;/P&gt;

&lt;PRE class="brush:bash;"&gt;  % perf stat -e cpu/event=0xb7,umask=0x1,offcore_rsp=0x10001/ taskset -c 0,12 ./a.out 

 Performance counter stats for 'taskset -c 0,12 ./a.out':

        20,063,594      cpu/event=0xb7,umask=0x1,offcore_rsp=0x10001/
&lt;/PRE&gt;

&lt;P&gt;But for any other response setting, I get zero. For example, with L3_HITM (bit 18):&lt;/P&gt;

&lt;PRE class="brush:bash;"&gt;    % perf stat -e cpu/event=0xb7,umask=0x1,offcore_rsp=0x40001/ taskset -c 0,12 ./a.out

 Performance counter stats for 'taskset -c 0,12 ./a.out':

                 0      cpu/event=0xb7,umask=0x1,offcore_rsp=0x40001/
&lt;/PRE&gt;

&lt;P&gt;Is this known behavior? Am I doing something wrong? For reference, this is the ping-pong program I tested:&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;#include &amp;lt;pthread.h&amp;gt;

volatile unsigned int x;

void* run_t1(void* r)
{
    int i;
    for (i = 0; i &amp;lt; 10000000; i++) {
       while (x != i) continue;
       x = ~0;
    }
    return NULL;
}

void* run_t2(void* r)
{
    int i;
    for (i = 0; i &amp;lt; 10000000; i++) {
        while (x != ~0) continue;
        x = i + 1;
    }
    return NULL;
}

int main (int argc, char** argv)
{
    pthread_t threads[2];
    void* status;
    int i;

    pthread_create(&amp;amp;threads[0], NULL, run_t1, NULL);
    pthread_create(&amp;amp;threads[1], NULL, run_t2, NULL);

    for (i = 0; i &amp;lt; 2; i++)
        pthread_join(threads[i], &amp;amp;status);

    return 0;
}
&lt;/PRE&gt;

&lt;P&gt;Thanks!&lt;/P&gt;

</description>
    <pubDate>Fri, 08 Sep 2017 14:54:54 GMT</pubDate>
    <dc:creator>TPtac</dc:creator>
    <dc:date>2017-09-08T14:54:54Z</dc:date>
    <item>
      <title>bug in Haswell-E Offcore Response counters?</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/bug-in-Haswell-E-Offcore-Response-counters/m-p/1180097#M7413</link>
      <description>&lt;P&gt;On a Haswell-E processor (Xeon E7-4830 v3, family_signature=06_3f), the offcore response counters seem to work only for a response type of ANY. Otherwise, they return 0. Details are below.&lt;/P&gt;

&lt;P&gt;I'm testing a cache ping-pong program with two threads on two sockets. If I set the request type to DMND_DATA_RD (bit 0) and the response type to ANY (bit 16), I get the expected results:&lt;/P&gt;

&lt;PRE class="brush:bash;"&gt;  % perf stat -e cpu/event=0xb7,umask=0x1,offcore_rsp=0x10001/ taskset -c 0,12 ./a.out 

 Performance counter stats for 'taskset -c 0,12 ./a.out':

        20,063,594      cpu/event=0xb7,umask=0x1,offcore_rsp=0x10001/
&lt;/PRE&gt;

&lt;P&gt;But for any other response setting, I get zero. For example, with L3_HITM (bit 18):&lt;/P&gt;

&lt;PRE class="brush:bash;"&gt;    % perf stat -e cpu/event=0xb7,umask=0x1,offcore_rsp=0x40001/ taskset -c 0,12 ./a.out

 Performance counter stats for 'taskset -c 0,12 ./a.out':

                 0      cpu/event=0xb7,umask=0x1,offcore_rsp=0x40001/
&lt;/PRE&gt;
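
&lt;P&gt;As a rough illustration of how the two offcore_rsp values above are put together, here is a minimal Python sketch. The bit positions are the ones quoted in this post (DMND_DATA_RD = bit 0, ANY = bit 16, and L3_HITM = bit 18, inferred from 0x40001). The helper names offcore_rsp and perf_event are made up for the example and are not part of perf.&lt;/P&gt;

&lt;PRE class="brush:python;"&gt;
```python
# Bit positions as quoted in this thread (an assumption of this sketch,
# not copied from a reference table).
REQUEST_BITS = {"DMND_DATA_RD": 0}
RESPONSE_BITS = {"ANY": 16, "L3_HITM": 18}


def offcore_rsp(request, response):
    """Build a value with exactly one request bit and one response bit set."""
    return 2 ** REQUEST_BITS[request] | 2 ** RESPONSE_BITS[response]


def perf_event(value):
    """Format the event string in the same shape as the perf commands above."""
    return f"cpu/event=0xb7,umask=0x1,offcore_rsp={value:#x}/"


if __name__ == "__main__":
    # Print one event string per response type known to this sketch.
    for response in RESPONSE_BITS:
        print(perf_event(offcore_rsp("DMND_DATA_RD", response)))
```
&lt;/PRE&gt;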

&lt;P&gt;Is this known behavior? Am I doing something wrong? For reference, this is the ping-pong program I tested:&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;#include &amp;lt;pthread.h&amp;gt;

volatile unsigned int x;

void* run_t1(void* r)
{
    int i;
    for (i = 0; i &amp;lt; 10000000; i++) {
       while (x != i) continue;
       x = ~0;
    }
    return NULL;
}

void* run_t2(void* r)
{
    int i;
    for (i = 0; i &amp;lt; 10000000; i++) {
        while (x != ~0) continue;
        x = i + 1;
    }
    return NULL;
}

int main (int argc, char** argv)
{
    pthread_t threads[2];
    void* status;
    int i;

    pthread_create(&amp;amp;threads[0], NULL, run_t1, NULL);
    pthread_create(&amp;amp;threads[1], NULL, run_t2, NULL);

    for (i = 0; i &amp;lt; 2; i++)
        pthread_join(threads[i], &amp;amp;status);

    return 0;
}
&lt;/PRE&gt;

&lt;P&gt;Thanks!&lt;/P&gt;

</description>
      <pubDate>Fri, 08 Sep 2017 14:54:54 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/bug-in-Haswell-E-Offcore-Response-counters/m-p/1180097#M7413</guid>
      <dc:creator>TPtac</dc:creator>
      <dc:date>2017-09-08T14:54:54Z</dc:date>
    </item>
    <item>
      <title>I don't have any Xeon E7 v3</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/bug-in-Haswell-E-Offcore-Response-counters/m-p/1180098#M7414</link>
      <description>&lt;P&gt;I don't have any Xeon E7 v3 systems, but I have run into cases on Xeon E5 v3 where the transactions were not what I expected. The two examples that come to mind immediately are:&lt;/P&gt;

&lt;OL&gt;
	&lt;LI&gt;Several cross-chip interactions use different transaction types in different snooping modes.&amp;nbsp;&lt;/LI&gt;
	&lt;LI&gt;Hardware prefetches can occur in cases where they are not expected.&lt;/LI&gt;
&lt;/OL&gt;

&lt;P&gt;The programming of these events can be more subtle than a first reading (or second reading, or third reading) of the documentation might suggest. There are examples of how the offcore response counters can be used at &lt;A href="https://download.01.org/perfmon/HSX/haswellx_offcore_v19.tsv" target="_blank"&gt;https://download.01.org/perfmon/HSX/haswellx_offcore_v19.tsv&lt;/A&gt;. The "MSRValue" fields there set a lot more bits than you appear to be setting. For example, OFFCORE_RESPONSE.DEMAND_DATA_RD.LLC_MISS.ANY_RESPONSE shows an MSRValue of 0x3fbfc00001. This MSR value includes:&lt;/P&gt;

&lt;UL&gt;
	&lt;LI&gt;Setting all of bits 37:31, which are the "Snoop Response" bits described in Table 18-38 (referenced in Section 18.11.4 of Volume 3 of the SWDM, document 325384-062).&lt;/LI&gt;
	&lt;LI&gt;Setting 8 of the 15 bits in the "Supplier" field (bits 30:16), namely bits 29:22, described in Table 18-50 (Section 18.11.4.1).&lt;/LI&gt;
	&lt;LI&gt;Setting only bit 0 of the "Request Type" field (bits 15:0), described in Table 18-47.&amp;nbsp; This matches your configuration.&lt;/LI&gt;
&lt;/UL&gt;
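
&lt;P&gt;To make the field breakdown above easier to check, here is a minimal Python sketch that splits an MSR value into the three fields using the bit ranges given in the list (15:0, 30:16, 37:31). The function names are hypothetical, and the sketch only does bit arithmetic; it does not read or write any MSR.&lt;/P&gt;

&lt;PRE class="brush:python;"&gt;
```python
def field(value, high, low):
    """Return bits high..low (inclusive) of value as an integer."""
    return value // 2 ** low % 2 ** (high - low + 1)


def set_bits(value):
    """Return the positions of all set bits, lowest first."""
    return [n for n in range(value.bit_length()) if value // 2 ** n % 2]


def decode(msr_value):
    """Split an MSR value into the fields named in the list above.

    Bit positions are reported relative to the full MSR value.
    """
    return {
        "request (15:0)": set_bits(field(msr_value, 15, 0)),
        "supplier (30:16)": [n + 16 for n in set_bits(field(msr_value, 30, 16))],
        "snoop (37:31)": [n + 31 for n in set_bits(field(msr_value, 37, 31))],
    }


if __name__ == "__main__":
    # MSRValue quoted above for
    # OFFCORE_RESPONSE.DEMAND_DATA_RD.LLC_MISS.ANY_RESPONSE.
    for name, bits in decode(0x3FBFC00001).items():
        print(f"{name}: bits {bits}")
```
&lt;/PRE&gt;

&lt;P&gt;Run as a script, this lists the set bit positions per field, which can then be compared against the tables cited above.&lt;/P&gt;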

&lt;P&gt;Of course, I have also seen plenty of bugs in these counters...&lt;/P&gt;</description>
      <pubDate>Mon, 11 Sep 2017 15:17:30 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/bug-in-Haswell-E-Offcore-Response-counters/m-p/1180098#M7414</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2017-09-11T15:17:30Z</dc:date>
    </item>
    <item>
      <title>Thanks for the information,</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/bug-in-Haswell-E-Offcore-Response-counters/m-p/1180099#M7415</link>
      <description>&lt;P&gt;Thanks for the information, John! It helped me make some progress. It looks like my mistake was not setting a snoop response information bit when I set a non-ANY supplier information bit. But the results still don't make sense. The counts for the individual suppliers don't add up to the count with ANY:&lt;/P&gt;

&lt;PRE class="brush:bash;"&gt;% perf stat -e cpu/event=0xb7,umask=0x1,offcore_rsp=0x3f80010001/ taskset -c 0,12 ./a.out
        20,044,681      cpu/event=0xb7,umask=0x1,offcore_rsp=0x3f80010001/
% perf stat -e cpu/event=0xb7,umask=0x1,offcore_rsp=0x3f80020001/ taskset -c 0,12 ./a.out
                 0      cpu/event=0xb7,umask=0x1,offcore_rsp=0x3f80020001/
% perf stat -e cpu/event=0xb7,umask=0x1,offcore_rsp=0x3f80040001/ taskset -c 0,12 ./a.out
             7,212      cpu/event=0xb7,umask=0x1,offcore_rsp=0x3f80040001/
% perf stat -e cpu/event=0xb7,umask=0x1,offcore_rsp=0x3f80080001/ taskset -c 0,12 ./a.out
               684      cpu/event=0xb7,umask=0x1,offcore_rsp=0x3f80080001/
% perf stat -e cpu/event=0xb7,umask=0x1,offcore_rsp=0x3f80100001/ taskset -c 0,12 ./a.out
            14,659      cpu/event=0xb7,umask=0x1,offcore_rsp=0x3f80100001/
% perf stat -e cpu/event=0xb7,umask=0x1,offcore_rsp=0x3f80200001/ taskset -c 0,12 ./a.out
             1,456      cpu/event=0xb7,umask=0x1,offcore_rsp=0x3f80200001/
% perf stat -e cpu/event=0xb7,umask=0x1,offcore_rsp=0x3f80400001/ taskset -c 0,12 ./a.out
               736      cpu/event=0xb7,umask=0x1,offcore_rsp=0x3f80400001/

&lt;/PRE&gt;
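
&lt;P&gt;To put a number on the mismatch, here is a small Python sketch that sums the per-supplier counts from the output above and compares them with the ANY count. The counts are copied from the perf output; the names BASE, COUNTS, rsp_value and summarize are made up for the example. Each count comes from a separate run, so some run-to-run variation is to be expected.&lt;/P&gt;

&lt;PRE class="brush:python;"&gt;
```python
# Snoop response bits 37:31 plus request bit 0 (DMND_DATA_RD), i.e. the part
# of offcore_rsp that is the same in every command above.
BASE = 0x3F80000001

# Counts from the perf output above, keyed by the supplier bit that was set.
# Bit 16 is ANY; bits 17 to 22 are the individual supplier bits that were tried.
COUNTS = {
    16: 20_044_681,
    17: 0,
    18: 7_212,
    19: 684,
    20: 14_659,
    21: 1_456,
    22: 736,
}


def rsp_value(bit):
    """Rebuild the offcore_rsp value used for a given supplier bit."""
    return BASE | 2 ** bit


def summarize(counts):
    """Return (ANY count, sum of individual counts, their ratio)."""
    any_count = counts[16]
    individual = sum(c for bit, c in counts.items() if bit != 16)
    return any_count, individual, individual / any_count


if __name__ == "__main__":
    any_count, individual, fraction = summarize(COUNTS)
    print(f"ANY:                 {any_count:,}")
    print(f"sum of bits 17..22:  {individual:,}")
    print(f"fraction of ANY:     {fraction:.4%}")
```
&lt;/PRE&gt;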

&lt;P&gt;Interestingly, perf reports the value from haswellx_offcore_v19.tsv as not supported on my processor:&lt;/P&gt;

&lt;PRE class="brush:bash;"&gt;% perf stat -e cpu/event=0xb7,umask=0x1,offcore_rsp=0x3fbfc00001/ taskset -c 0,12 ./a.out 
 Performance counter stats for 'taskset -c 0,12 ./a.out':

   &amp;lt;not supported&amp;gt;      cpu/event=0xb7,umask=0x1,offcore_rsp=0x3fbfc00001/
&lt;/PRE&gt;

&lt;P&gt;But the processor is definitely a Haswell-E:&lt;/P&gt;

&lt;PRE class="brush:plain;"&gt;cpu family      : 6
model           : 63
model name      : Intel(R) Xeon(R) CPU E7-4830 v3 @ 2.10GHz&lt;/PRE&gt;

</description>
      <pubDate>Tue, 12 Sep 2017 08:44:53 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/bug-in-Haswell-E-Offcore-Response-counters/m-p/1180099#M7415</guid>
      <dc:creator>TPtac</dc:creator>
      <dc:date>2017-09-12T08:44:53Z</dc:date>
    </item>
    <item>
      <title>The "not supported" message</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/bug-in-Haswell-E-Offcore-Response-counters/m-p/1180100#M7416</link>
      <description>&lt;P&gt;The "not supported" message is likely a software limitation -- I have never seen the HW preventing one from setting any bit fields in the performance counters before.&amp;nbsp;&lt;/P&gt;

&lt;P&gt;Some MSRs do have protected bit fields. You might try writing the "&amp;lt;not supported&amp;gt;" bit pattern to MSR 0x1A6 using the "wrmsr.c" program from msr-tools-1.3 to see if the hardware is preventing writes to some of the bits. (Table 18-50 says that bits 26:23 are reserved, but they are set in the bit field above.)&lt;/P&gt;

&lt;P&gt;On my Xeon E5 v3 systems, there is no problem writing the value &lt;CODE class="plain"&gt;0x3fbfc00001&lt;/CODE&gt; to MSR 0x1A6, so it is probably overzealous software noticing that you are writing to what are documented to be reserved bits. Yet another reason why I write (almost) all my own performance monitoring code...&lt;/P&gt;

&lt;P&gt;For this example, I manually set up PMC0 to 0x004301b7 and set MSR 0x1a6 to 0x3fbfc00001 (both on core 0). Then I disabled the HW prefetchers and ran the STREAM benchmark pinned to core 0. For the STREAM parameters (N=80M, NTIMES=100), I expected about 384 billion cache line reads, and this counter incremented by 386.6 billion during the run. So it looks like the "&amp;lt;not supported&amp;gt;" bit pattern does count demand LLC misses reasonably accurately for at least one test case. Re-enabling the HW prefetchers reduced the count to 0.97 billion, indicating that the HW prefetchers are able to keep ahead of the demand loads in the single-threaded test case. This is not surprising, since the sustained BW for the STREAM kernels was between 19 GB/s and 20 GB/s, which is less than 30% of the peak BW of the four DDR4/2133 DRAM channels on socket 0.&lt;/P&gt;</description>
      <pubDate>Tue, 12 Sep 2017 13:35:55 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/bug-in-Haswell-E-Offcore-Response-counters/m-p/1180100#M7416</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2017-09-12T13:35:55Z</dc:date>
    </item>
  </channel>
</rss>

