<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Query for Performance counter usage on Sandy Bridge Architecture. Number of stalls &gt; Number of clock cycles in Software Tuning, Performance Optimization &amp; Platform Monitoring</title>
    <link>https://community.intel.com/t5/Software-Tuning-Performance/Query-for-Performance-counter-usage-on-Sandy-Bridge-Architecture/m-p/944373#M2032</link>
    <description>&lt;P&gt;Hello All,&lt;/P&gt;
&lt;P&gt;I am using PAPI 5.1.0 to do some performance counter analysis on a Sandy Bridge machine with 2 processors, each with 6 cores.&lt;BR /&gt;I am using the following events:&lt;/P&gt;
&lt;P&gt;CPU_CLK_UNHALTED,&lt;/P&gt;
&lt;P&gt;RESOURCE_STALLS:ANY,&lt;/P&gt;
&lt;P&gt;UOPS_DISPATCHED:STALL_CYCLES,&lt;/P&gt;
&lt;P&gt;UOPS_ISSUED:STALL_CYCLES.&lt;/P&gt;
&lt;P&gt;I run 6 copies of a test program (only one copy invokes PAPI) on 6 physically different cores (sharing main memory) using taskset, and I get the following output. As can be seen, the dispatch stall cycles and issue stall cycles are greater than the CPU_CLK_UNHALTED cycles. Is this kind of data possible, or am I doing something wrong?&lt;/P&gt;
&lt;P&gt;CPU_CLK_UNHALTED = 27494626969&lt;/P&gt;
&lt;P&gt;RESOURCE_STALLS:ANY = 23483871000&lt;/P&gt;
&lt;P&gt;UOPS_DISPATCHED:STALL_CYCLES = 28114602519&lt;/P&gt;
&lt;P&gt;UOPS_ISSUED:STALL_CYCLES = 31941881082&lt;/P&gt;
&lt;P&gt;Many regards,&lt;BR /&gt;Rakhi&lt;/P&gt;</description>
    <pubDate>Tue, 09 Jul 2013 05:24:13 GMT</pubDate>
    <dc:creator>Rakhi_H_</dc:creator>
    <dc:date>2013-07-09T05:24:13Z</dc:date>
    <item>
      <title>Query for Performance counter usage on Sandy Bridge Architecture. Number of stalls &gt; Number of clock cycles</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Query-for-Performance-counter-usage-on-Sandy-Bridge-Architecture/m-p/944373#M2032</link>
      <description>&lt;P&gt;Hello All,&lt;/P&gt;
&lt;P&gt;I am using PAPI 5.1.0 to do some performance counter analysis on a Sandy Bridge machine with 2 processors, each with 6 cores.&lt;BR /&gt;I am using the following events:&lt;/P&gt;
&lt;P&gt;CPU_CLK_UNHALTED,&lt;/P&gt;
&lt;P&gt;RESOURCE_STALLS:ANY,&lt;/P&gt;
&lt;P&gt;UOPS_DISPATCHED:STALL_CYCLES,&lt;/P&gt;
&lt;P&gt;UOPS_ISSUED:STALL_CYCLES.&lt;/P&gt;
&lt;P&gt;I run 6 copies of a test program (only one copy invokes PAPI) on 6 physically different cores (sharing main memory) using taskset, and I get the following output. As can be seen, the dispatch stall cycles and issue stall cycles are greater than the CPU_CLK_UNHALTED cycles. Is this kind of data possible, or am I doing something wrong?&lt;/P&gt;
&lt;P&gt;CPU_CLK_UNHALTED = 27494626969&lt;/P&gt;
&lt;P&gt;RESOURCE_STALLS:ANY = 23483871000&lt;/P&gt;
&lt;P&gt;UOPS_DISPATCHED:STALL_CYCLES = 28114602519&lt;/P&gt;
&lt;P&gt;UOPS_ISSUED:STALL_CYCLES = 31941881082&lt;/P&gt;
&lt;P&gt;Many regards,&lt;BR /&gt;Rakhi&lt;/P&gt;</description>
      <pubDate>Tue, 09 Jul 2013 05:24:13 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Query-for-Performance-counter-usage-on-Sandy-Bridge-Architecture/m-p/944373#M2032</guid>
      <dc:creator>Rakhi_H_</dc:creator>
      <dc:date>2013-07-09T05:24:13Z</dc:date>
    </item>
    <item>
      <title>Please check this link : http</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Query-for-Performance-counter-usage-on-Sandy-Bridge-Architecture/m-p/944374#M2033</link>
      <description>&lt;P&gt;Please check this link : &lt;A href="http://software.intel.com/en-us/articles/intel-vtune-amplifier-xe-2011-documentation"&gt;http://software.intel.com/en-us/articles/intel-vtune-amplifier-xe-2011-documentation&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;IIRC, up to 4 uops can be retired per clock cycle. It seems that your CPU spent a lot of time waiting; the cause could be memory stalls, data dependencies, long-latency instructions, or even branch mispredictions.&lt;/P&gt;
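&lt;P&gt;To see how far the stall counters in the question sit from the cycle count, the comparison can be sketched in a few lines of Python. This is only an illustration: the function name compare_to_cycles and the dictionary layout are made up here and are not part of PAPI, and the values are simply the ones posted in the question.&lt;/P&gt;
&lt;PRE&gt;
```python
# Counter values copied from the question in this thread.
CYCLES = 27494626969  # CPU_CLK_UNHALTED

STALL_COUNTERS = {
    "RESOURCE_STALLS:ANY": 23483871000,
    "UOPS_DISPATCHED:STALL_CYCLES": 28114602519,
    "UOPS_ISSUED:STALL_CYCLES": 31941881082,
}


def compare_to_cycles(counters, cycles):
    """Return {event: (ratio to cycles, signed excess over cycles)}.

    Hypothetical helper for illustration only; not a PAPI function.
    """
    return {
        name: (value / cycles, value - cycles)
        for name, value in counters.items()
    }


if __name__ == "__main__":
    for name, (ratio, excess) in compare_to_cycles(STALL_COUNTERS, CYCLES).items():
        # A positive excess means the event counted more than the cycle count.
        print(f"{name:30s} ratio={ratio:.3f} excess={excess:+d}")
```
&lt;/PRE&gt;
&lt;P&gt;Run as a script, it prints one line per event. The two UOPS stall events come out above the cycle count, while RESOURCE_STALLS:ANY stays below it.&lt;/P&gt;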
</description>
      <pubDate>Tue, 09 Jul 2013 06:04:30 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Query-for-Performance-counter-usage-on-Sandy-Bridge-Architecture/m-p/944374#M2033</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2013-07-09T06:04:30Z</dc:date>
    </item>
    <item>
      <title>Yes! The test program has a</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Query-for-Performance-counter-usage-on-Sandy-Bridge-Architecture/m-p/944375#M2034</link>
      <description>&lt;P&gt;Yes! The test program has a lot of memory stalls.&lt;/P&gt;
&lt;P&gt;I was referring to Figures 2 and 3 in "Performance Analysis Guide for Intel® Core™ i7 Processor and Intel® Xeon™ 5500 processors" by Dr. David Levinthal, PhD, Version 1.0.&lt;/P&gt;
&lt;P&gt;I now realize that the model is a simple, serialized one, whereas the processor is complicated. Since many instructions (~4) can be issued and dispatched in a single clock cycle, are the reported stalls a sum of the stalls on all paths? That would explain why the numbers are greater than expected.&lt;/P&gt;
&lt;P&gt;Am I correct?&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Thanks again,&lt;/P&gt;
&lt;P&gt;Rakhi&lt;/P&gt;</description>
      <pubDate>Tue, 09 Jul 2013 09:35:09 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Query-for-Performance-counter-usage-on-Sandy-Bridge-Architecture/m-p/944375#M2034</guid>
      <dc:creator>Rakhi_H_</dc:creator>
      <dc:date>2013-07-09T09:35:09Z</dc:date>
    </item>
    <item>
      <title>As front end is serialized</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Query-for-Performance-counter-usage-on-Sandy-Bridge-Architecture/m-p/944376#M2035</link>
      <description>&lt;P&gt;The front end is serialized: it reads a binary-encoded bitstream (in reality probably coupled with additive noise). The decoding stage breaks machine code instructions, coupled with their data, down into simpler, more primitive instructions (micro-ops) and tries to exploit instruction-level parallelism. Moreover, the CPU tries to keep its pipelines busy by executing out of order, for example during memory stalls, or even by prefetching data that will be needed ahead of time.&lt;/P&gt;</description>
      <pubDate>Tue, 09 Jul 2013 12:46:56 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Query-for-Performance-counter-usage-on-Sandy-Bridge-Architecture/m-p/944376#M2035</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2013-07-09T12:46:56Z</dc:date>
    </item>
    <item>
      <title>Ok. I think I understand now.</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Query-for-Performance-counter-usage-on-Sandy-Bridge-Architecture/m-p/944377#M2036</link>
      <description>&lt;P&gt;Ok. I think I understand now. Thanks a ton&lt;/P&gt;
&lt;P&gt;Rakhi&lt;/P&gt;</description>
      <pubDate>Tue, 09 Jul 2013 13:16:21 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Query-for-Performance-counter-usage-on-Sandy-Bridge-Architecture/m-p/944377#M2036</guid>
      <dc:creator>Rakhi_H_</dc:creator>
      <dc:date>2013-07-09T13:16:21Z</dc:date>
    </item>
    <item>
      <title>You are welcome:)</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Query-for-Performance-counter-usage-on-Sandy-Bridge-Architecture/m-p/944378#M2037</link>
      <description>&lt;P&gt;You are welcome:)&lt;/P&gt;
</description>
      <pubDate>Tue, 09 Jul 2013 14:13:39 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Query-for-Performance-counter-usage-on-Sandy-Bridge-Architecture/m-p/944378#M2037</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2013-07-09T14:13:39Z</dc:date>
    </item>
    <item>
      <title>Btw can you post branching</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Query-for-Performance-counter-usage-on-Sandy-Bridge-Architecture/m-p/944379#M2038</link>
      <description>&lt;P&gt;By the way, can you post the output of the branch-related counters?&lt;/P&gt;</description>
      <pubDate>Tue, 09 Jul 2013 14:14:45 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Query-for-Performance-counter-usage-on-Sandy-Bridge-Architecture/m-p/944379#M2038</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2013-07-09T14:14:45Z</dc:date>
    </item>
  </channel>
</rss>

