<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic In the case of frequent rdpmc in Software Tuning, Performance Optimization &amp; Platform Monitoring</title>
    <link>https://community.intel.com/t5/Software-Tuning-Performance/Timely-interaction-of-performance-counters/m-p/955587#M2287</link>
    <description>&lt;P&gt;With frequent rdpmc calls, inserting cpuid or a similar serializing instruction may well render the measurement invalid.&lt;/P&gt;

&lt;P&gt;Suppose, for example, that some data is being prefetched. The delay introduced by cpuid gives the prefetch time to complete, so you won't observe a miss even though one would occur in the unmeasured code.&lt;/P&gt;</description>
    <pubDate>Fri, 21 Mar 2014 14:39:15 GMT</pubDate>
    <dc:creator>Olaf_Krzikalla</dc:creator>
    <dc:date>2014-03-21T14:39:15Z</dc:date>
    <item>
      <title>Timely interaction of performance counters</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Timely-interaction-of-performance-counters/m-p/955585#M2285</link>
      <description>&lt;P&gt;Hi @all,&lt;/P&gt;

&lt;P&gt;is there an in-depth explanation of the temporal interaction of performance counters (especially the cache miss counters) with the rest of the code? Perhaps a specific section in Appendix B of the Optimization Reference Manual that I have missed so far?&lt;/P&gt;

&lt;P&gt;An example:&lt;/P&gt;

&lt;PRE class="brush:bash;"&gt;
// pmc configured for counting L1D cache misses
mov ecx, 0          // select counter 0
rdpmc               // first reading in edx:eax
// (store edx:eax)
movaps xmm0, [esi]  // read from [esi]
movaps xmm1, [edi]  // read from [edi]
rdpmc               // second reading
&lt;/PRE&gt;

&lt;P&gt;Now assume that esi and edi both point to the same location, which initially is not in L1. What difference between the two L1 miss counter readings will be observed?&lt;BR /&gt;
	And why? IMHO there are a lot of things (pipelining, out-of-order execution, stalling) that can influence the result. Is this documented?&lt;/P&gt;

&lt;P&gt;Thanks for your help&lt;BR /&gt;
	Olaf&lt;/P&gt;</description>
      <pubDate>Fri, 21 Mar 2014 13:59:47 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Timely-interaction-of-performance-counters/m-p/955585#M2285</guid>
      <dc:creator>Olaf_Krzikalla</dc:creator>
      <dc:date>2014-03-21T13:59:47Z</dc:date>
    </item>
    <item>
      <title>Shouldn't you use</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Timely-interaction-of-performance-counters/m-p/955586#M2286</link>
      <description>&lt;P&gt;Shouldn't you use a serializing instruction like&amp;nbsp;cpuid before using rdpmc?&lt;/P&gt;</description>
      <pubDate>Fri, 21 Mar 2014 14:31:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Timely-interaction-of-performance-counters/m-p/955586#M2286</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2014-03-21T14:31:00Z</dc:date>
    </item>
    <item>
      <title>In the case of frequent rdpmc</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Timely-interaction-of-performance-counters/m-p/955587#M2287</link>
      <description>&lt;P&gt;With frequent rdpmc calls, inserting cpuid or a similar serializing instruction may well render the measurement invalid.&lt;/P&gt;

&lt;P&gt;Suppose, for example, that some data is being prefetched. The delay introduced by cpuid gives the prefetch time to complete, so you won't observe a miss even though one would occur in the unmeasured code.&lt;/P&gt;</description>
      <pubDate>Fri, 21 Mar 2014 14:39:15 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Timely-interaction-of-performance-counters/m-p/955587#M2287</guid>
      <dc:creator>Olaf_Krzikalla</dc:creator>
      <dc:date>2014-03-21T14:39:15Z</dc:date>
    </item>
    <item>
      <title>By looking at posted assembly</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Timely-interaction-of-performance-counters/m-p/955588#M2288</link>
      <description>&lt;P&gt;Looking at the posted assembly snippet, the pointers are not incremented, and the reciprocal throughput of rdpmc is ~39 cycles (per Agner Fog's instruction tables),&lt;/P&gt;

&lt;P&gt;so I suppose the movaps xmm0, [esi] load will not be "noticed" by the rdpmc instruction, because the load of the xmm register will execute concurrently with the rdpmc.&lt;/P&gt;</description>
      <pubDate>Fri, 21 Mar 2014 14:41:23 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Timely-interaction-of-performance-counters/m-p/955588#M2288</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2014-03-21T14:41:23Z</dc:date>
    </item>
    <item>
      <title>IIRC because of out-of-order</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Timely-interaction-of-performance-counters/m-p/955589#M2289</link>
      <description>&lt;P&gt;IIRC, because of out-of-order execution the second rdpmc instruction could be executed early, before the loads it is meant to bracket have completed.&lt;/P&gt;</description>
      <pubDate>Fri, 21 Mar 2014 14:51:34 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Timely-interaction-of-performance-counters/m-p/955589#M2289</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2014-03-21T14:51:34Z</dc:date>
    </item>
    <item>
      <title>Intel(r) VTune(tm) has</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Timely-interaction-of-performance-counters/m-p/955590#M2290</link>
      <description>&lt;P&gt;Intel(r) VTune(tm) has relatively low limits and defaults on the sampling rate, so it seems the overhead of counter use can't be ignored unless such limits are observed.&amp;nbsp; The strategy of reserving a core for VTune seems more important on Intel(r) Xeon Phi(tm) than on the host (at least after tinkering with the graphics options to reduce those interruptions).&lt;/P&gt;

&lt;P&gt;In my experience, adding serialization instructions adds more overhead than simply sampling at intervals large enough that pipelining and out-of-order variations can be neglected, but I don't put a lot of credence in simple statements on this.&lt;/P&gt;</description>
      <pubDate>Fri, 21 Mar 2014 15:33:43 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Timely-interaction-of-performance-counters/m-p/955590#M2290</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2014-03-21T15:33:43Z</dc:date>
    </item>
    <item>
      <title>As noted elsewhere, the RDPMC</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Timely-interaction-of-performance-counters/m-p/955591#M2291</link>
      <description>&lt;P&gt;As noted elsewhere, the RDPMC instructions are not ordered with respect to other instructions, so they might be executed at unexpected times.&amp;nbsp; Although Intel processors "try" to execute instructions in program order, they will go out of order whenever an instruction has a delay.&amp;nbsp; For performance counters, the problem typically shows up when an RDPMC instruction follows a long-latency instruction (in program order).&amp;nbsp; The hardware will generally issue the instructions in order, but the RDPMC instruction may start execution at the same time as the preceding long-latency instruction, so it will not catch the full latency of that preceding instruction.&lt;/P&gt;

&lt;P&gt;There is a "trick" that might work to provide partial ordering on RDPMC.&amp;nbsp; Since RDPMC has an input argument (the counter number), it is possible to build a dependence between the result of the instruction that you want to test and the input argument to the RDPMC instruction.&amp;nbsp; Historically, people have used instructions like XOR to take the output of one instruction and create a false dependency into the input argument of the RDPMC instruction.&amp;nbsp; However, recent Intel processors actually recognize idioms like XOR %eax,%eax as clearing a register (and therefore breaking any potential dependency between prior and future uses of %eax).&amp;nbsp; Agner Fog's microarchitecture documentation discusses which instruction sequences are recognized in this fashion.&amp;nbsp; From a quick look at his documentation, it looks like SBB (Subtract with Borrow) is not subject to this idiom recognition, so it could be used to establish a fairly low-latency false dependency between instructions to enforce ordering.&lt;/P&gt;

&lt;P&gt;In general, you would need to create this false dependency on both sides of the instruction sequence under test.&amp;nbsp; I.e., the output of the initial RDPMC would need to be a false input to the first instruction under test and the last instruction under test needs to be a false input to the final RDPMC.&amp;nbsp; Unfortunately even this is not enough if the sequence of instructions under test is not serialized, and it is nearly impossible to set up a case for which the initial RDPMC is a false input to *all* of the instructions under test and the final RDPMC has a false input dependency on *all* the instructions under test.&lt;/P&gt;
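
&lt;P&gt;A hypothetical sketch of this two-sided construction for a single load (counter 0 and the particular sbb/and pair are illustrative assumptions; sbb and and are used only because they are not recognized dependency-breaking idioms):&lt;/P&gt;

&lt;PRE class="brush:bash;"&gt;
// hypothetical sketch -- false dependencies on both sides of one load
mov ecx, 0              // counter number
rdpmc                   // first reading in edx:eax
mov ebx, eax            // save low half of first reading
sbb eax, eax            // depends on eax/CF; not a recognized zeroing idiom
and eax, 0              // force eax to 0 while keeping the dependence chain
movaps xmm0, [esi+eax]  // load address depends on the first rdpmc
movd eax, xmm0          // bring the loaded value into an integer register
and eax, 0              // squash the value, keep the dependence
mov ecx, eax            // ecx = 0 = counter number, now depends on the load
rdpmc                   // second reading cannot issue before the load completes
&lt;/PRE&gt;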

&lt;P&gt;One half of the problem can be solved with the RDTSCP instruction, which will not execute until all prior instructions (in program order) have executed.&amp;nbsp;&amp;nbsp; The output of the RDTSCP can then be run through an SBB instruction to create a false input dependency for some subsequent instruction.&lt;/P&gt;
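
&lt;P&gt;For example, a sketch of that half of the solution (register choices are illustrative; rdtscp also writes TSC_AUX into ecx, which is immediately overwritten here):&lt;/P&gt;

&lt;PRE class="brush:bash;"&gt;
// hypothetical sketch: rdtscp as a one-sided fence before a counter read
// ... code under test ...
rdtscp               // does not execute until all prior instructions
                     // (in program order) have executed; result in edx:eax
sbb eax, eax         // carry a dependence out of the rdtscp result
and eax, 0           // force eax to 0, keep the chain
mov ecx, eax         // ecx = 0 = counter number, depends on rdtscp
rdpmc                // counter read ordered after the rdtscp
&lt;/PRE&gt;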

&lt;P&gt;I don't know how the RDTSCP definition of "execute" works with respect to instructions that can be rejected and retried.&amp;nbsp; This occurs frequently with floating-point instructions -- they are issued to the execution units after the instructions that define their inputs have been issued, but if those instructions include memory accesses that miss the cache the floating-point instructions may try to execute and find that their arguments are not actually present.&amp;nbsp; They are then rejected and retried some time later.&amp;nbsp;&amp;nbsp; One would prefer that the RDTSCP instruction not execute until all prior instructions have *completed execution*, but I can't tell whether support for such semantics exists in the hardware.&lt;/P&gt;

&lt;P&gt;All of this leads to the oft-repeated advice -- don't expect the performance counters to provide "in-order" counts for very short code sections.&amp;nbsp;&amp;nbsp; Measuring sections that take a minimum of many hundreds of cycles is usually necessary to make the uncertainty in the exact time of execution of the RDPMC instructions irrelevant.&lt;/P&gt;
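
&lt;P&gt;In practice the amortized approach might be sketched as follows (iteration count and registers are illustrative only):&lt;/P&gt;

&lt;PRE class="brush:bash;"&gt;
// hypothetical sketch: amortize rdpmc jitter over many iterations
mov ecx, 0
rdpmc
mov ebx, eax          // start count (low 32 bits)
mov ebp, 100000       // enough iterations that a few tens of cycles of
                      // rdpmc skew are negligible
loop_top:
// ... code under test ...
sub ebp, 1
jnz loop_top
mov ecx, 0
rdpmc
sub eax, ebx          // events across all iterations; divide by the count
&lt;/PRE&gt;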

&lt;P&gt;An exception to the above is Xeon Phi.&amp;nbsp; The RDTSC instruction takes only 5-6 cycles and the core executes in order, so RDTSC can be used on a very fine granularity -- for example to time the latency of individual load instructions that miss the L1 and L2 caches.&lt;/P&gt;</description>
      <pubDate>Fri, 21 Mar 2014 17:34:06 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Timely-interaction-of-performance-counters/m-p/955591#M2291</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2014-03-21T17:34:06Z</dc:date>
    </item>
    <item>
      <title>Hi again,</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Timely-interaction-of-performance-counters/m-p/955592#M2292</link>
      <description>&lt;P&gt;Hi again,&lt;/P&gt;

&lt;P&gt;thank you all for your helpful answers. Some background: I am trying to trace an application and understand its cache behavior more precisely by recording the pmc at the start of each basic block. Now I know that at least three things (time overhead, space overhead, and out-of-order execution) will certainly tamper with the measurement. The interesting question is: by how much? And I think this question is best answered by knowing what is going on behind the scenes.&lt;/P&gt;

&lt;P&gt;Partially working around the out-of-order problem by introducing false dependencies might add other errors due to the increased time overhead (besides the slowdown). That's why I am not solely interested in making the measurement as precise as possible, but also in knowing and understanding the errors introduced by a fast yet imprecise measurement.&lt;/P&gt;

&lt;P&gt;Best Olaf&lt;/P&gt;</description>
      <pubDate>Mon, 24 Mar 2014 15:53:02 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Timely-interaction-of-performance-counters/m-p/955592#M2292</guid>
      <dc:creator>Olaf_Krzikalla</dc:creator>
      <dc:date>2014-03-24T15:53:02Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;&gt;In my experience, adding</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Timely-interaction-of-performance-counters/m-p/955593#M2293</link>
      <description>&lt;P&gt;&lt;SPAN style="font-family: Arial, Helvetica, sans-serif; font-size: 12px; line-height: 14.399999618530273px;"&gt;&amp;gt;&amp;gt;&amp;gt;In my experience, adding serialization instructions adds more overhead than simply sampling at large enough intervals&amp;gt;&amp;gt;&amp;gt;&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;That is understandable because of the large impact the cpuid instruction has on performance.&lt;/P&gt;</description>
      <pubDate>Mon, 24 Mar 2014 17:41:43 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Timely-interaction-of-performance-counters/m-p/955593#M2293</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2014-03-24T17:41:43Z</dc:date>
    </item>
  </channel>
</rss>

