<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic seemingly bogus mem-store hardware performance counter data in Software Tuning, Performance Optimization &amp; Platform Monitoring</title>
    <link>https://community.intel.com/t5/Software-Tuning-Performance/seemingly-bogus-mem-store-hardware-performance-counter-data/m-p/1168968#M7213</link>
    <description>I'm experimenting with perf on a 4-socket Ivy Bridge system running CentOS 7 and am having a hard time understanding some of the output.

A typical case is the following really simple C program

main() {
    while (1) {
    }
}

The following is the assembly code


       main():
       push   %rbp
       mov    %rsp,%rbp
       jmp    4

I name the executable sf and run perf for 10 seconds with the following performance counters

perf stat -e cpu-clock,major-faults,minor-faults,context-switches,cpu-migrations,instructions,cycles,stalled-cycles-frontend,mem-stores sf

and get the following results

          9064.316086      cpu-clock (msec)          #    1.000 CPUs utilized
                    0      major-faults              #    0.000 K/sec
                  109      minor-faults              #    0.012 K/sec
                    5      context-switches          #    0.001 K/sec
                    1      cpu-migrations            #    0.000 K/sec
       25,983,671,864      instructions              #    1.00  insn per cycle
                                                     #    0.50  stalled cycles per insn
       26,018,167,454      cycles                    #    2.870 GHz
       13,024,243,875      stalled-cycles-frontend   #    50.06% frontend cycles idle
            6,171,784      mem-stores                #    0.681 M/sec

       9.065339662 seconds time elapsed


How can there be so many mem-stores, and why does the count increase the longer I run?  How can the IPC be only 1.00, and why are we stalling in the frontend when all it's doing is some register activity and, in the worst case, maybe going to the L1?

I would expect the IPC to be at least 2 and closer to 3, and I wouldn't expect any memory stores. By the way, I had the whole machine to myself when I ran this.

My conclusion is that perf can't be trusted. Is this correct, and if so, are there alternatives to perf for getting correct hardware performance data?</description>
    <pubDate>Thu, 26 Sep 2019 04:57:33 GMT</pubDate>
    <dc:creator>gostanian__richard</dc:creator>
    <dc:date>2019-09-26T04:57:33Z</dc:date>
    <item>
      <title>seemingly bogus mem-store hardware performance counter data</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/seemingly-bogus-mem-store-hardware-performance-counter-data/m-p/1168968#M7213</link>
      <description>I'm experimenting with perf on a 4-socket Ivy Bridge system running CentOS 7 and am having a hard time understanding some of the output.

A typical case is the following really simple C program

main() {
    while (1) {
    }
}

The following is the assembly code


       main():
       push   %rbp
       mov    %rsp,%rbp
       jmp    4

I name the executable sf and run perf for 10 seconds with the following performance counters

perf stat -e cpu-clock,major-faults,minor-faults,context-switches,cpu-migrations,instructions,cycles,stalled-cycles-frontend,mem-stores sf

and get the following results

          9064.316086      cpu-clock (msec)          #    1.000 CPUs utilized
                    0      major-faults              #    0.000 K/sec
                  109      minor-faults              #    0.012 K/sec
                    5      context-switches          #    0.001 K/sec
                    1      cpu-migrations            #    0.000 K/sec
       25,983,671,864      instructions              #    1.00  insn per cycle
                                                     #    0.50  stalled cycles per insn
       26,018,167,454      cycles                    #    2.870 GHz
       13,024,243,875      stalled-cycles-frontend   #    50.06% frontend cycles idle
            6,171,784      mem-stores                #    0.681 M/sec

       9.065339662 seconds time elapsed


How can there be so many mem-stores, and why does the count increase the longer I run?  How can the IPC be only 1.00, and why are we stalling in the frontend when all it's doing is some register activity and, in the worst case, maybe going to the L1?

I would expect the IPC to be at least 2 and closer to 3, and I wouldn't expect any memory stores. By the way, I had the whole machine to myself when I ran this.

My conclusion is that perf can't be trusted. Is this correct, and if so, are there alternatives to perf for getting correct hardware performance data?</description>
      <pubDate>Thu, 26 Sep 2019 04:57:33 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/seemingly-bogus-mem-store-hardware-performance-counter-data/m-p/1168968#M7213</guid>
      <dc:creator>gostanian__richard</dc:creator>
      <dc:date>2019-09-26T04:57:33Z</dc:date>
    </item>
    <item>
      <title>According to https://www</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/seemingly-bogus-mem-store-hardware-performance-counter-data/m-p/1168969#M7214</link>
      <description>&lt;P&gt;According to &lt;A href="https://www.agner.org/optimize/instruction_tables.pdf" target="_blank"&gt;https://www.agner.org/optimize/instruction_tables.pdf&lt;/A&gt; (page 193), the "push" instruction on an Ivy Bridge processor has a latency of three cycles. Page 195 reports that the "jmp" instruction has a maximum throughput of one instruction every two cycles on Ivy Bridge. From those values, an IPC of 1.00 seems reasonable enough.&lt;/P&gt;&lt;P&gt;For test cases like this -- single thread, 10-second runtime -- "perf stat" usually manages to report reasonable counter values.&lt;/P&gt;&lt;P&gt;Estimating the "expected" IPC can be tricky, especially when there are multiple uops in an instruction or multiple instructions that fuse into a single uop. Agner's Instruction Tables provide good information about these cases. Agner's Microarchitecture document is very helpful for understanding the historical progression of the implementations: &lt;A href="https://www.agner.org/optimize/microarchitecture.pdf" target="_blank"&gt;https://www.agner.org/optimize/microarchitecture.pdf&lt;/A&gt;.&lt;/P&gt;</description>
      <pubDate>Tue, 01 Oct 2019 23:12:48 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/seemingly-bogus-mem-store-hardware-performance-counter-data/m-p/1168969#M7214</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2019-10-01T23:12:48Z</dc:date>
    </item>
  </channel>
</rss>

