<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic The &amp;quot;normal&amp;quot; FLOP counts for in Software Tuning, Performance Optimization &amp; Platform Monitoring</title>
    <link>https://community.intel.com/t5/Software-Tuning-Performance/what-MEM-UOPS-RETIRED-ALL-LOADS-represent-on-Broadwell/m-p/1153439#M6891</link>
    <description>&lt;P&gt;The "normal" FLOP counts for these events are given in Chapter 19 of Volume 3 of the Intel Architectures SW Developer's Manual or in the performance counter event listings at &lt;A href="https://download.01.org/perfmon/SKX/skylakex_core_v1.06.json" target="_blank"&gt;https://download.01.org/perfmon/SKX/skylakex_core_v1.06.json&lt;/A&gt;&lt;/P&gt;

&lt;P&gt;The counters are set up so that an FMA instruction will increment the counter twice, so multiplying by the width will give the expected FLOPS value.&amp;nbsp; The downside of this convention is that it makes it a bit harder to determine arithmetic intensity in terms of instruction counts -- you will need to re-compile with the "-no-fma" flag and look at the difference in counts between the original and no-fma cases to determine how many FMA instructions were used.&lt;/P&gt;</description>
    <pubDate>Fri, 10 Nov 2017 21:28:33 GMT</pubDate>
    <dc:creator>McCalpinJohn</dc:creator>
    <dc:date>2017-11-10T21:28:33Z</dc:date>
    <item>
      <title>what MEM_UOPS_RETIRED:ALL_LOADS represent on Broadwell</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/what-MEM-UOPS-RETIRED-ALL-LOADS-represent-on-Broadwell/m-p/1153434#M6886</link>
      <description>&lt;P&gt;I am trying to figure out how much data is read from and written to memory using MEM_UOPS_RETIRED:ALL_LOADS, but I am not sure what MEM_UOPS_RETIRED:ALL_LOADS represents exactly on Broadwell.&lt;/P&gt;

&lt;P&gt;Assuming MEM_UOPS_RETIRED:ALL_LOADS = 5,543,619,579, how much data is actually transferred between cache and memory?&lt;/P&gt;

&lt;P&gt;Is it 5,543,619,579 bytes, 5,543,619,579 x 64 bytes, or something else?&lt;/P&gt;

&lt;P&gt;Thank you very much in advance! Jin&lt;/P&gt;</description>
      <pubDate>Wed, 08 Nov 2017 07:10:31 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/what-MEM-UOPS-RETIRED-ALL-LOADS-represent-on-Broadwell/m-p/1153434#M6886</guid>
      <dc:creator>Jin__Chao</dc:creator>
      <dc:date>2017-11-08T07:10:31Z</dc:date>
    </item>
    <item>
      <title>There is no way to get the</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/what-MEM-UOPS-RETIRED-ALL-LOADS-represent-on-Broadwell/m-p/1153435#M6887</link>
      <description>&lt;P&gt;There is no way to get the amount of data traffic from the MEM_UOPS_RETIRED events because they only count the number of accesses and not the size of each access.&lt;/P&gt;

&lt;P&gt;Even if you knew that all loads were the same size, the specific event MEM_UOPS_RETIRED.ALL_LOADS would only tell you the amount of data loaded from the L1 Data Cache to the core.&amp;nbsp;&amp;nbsp; If you want the amount of data transferred from the DRAM memory to the caches, the most reliable measurements will come from the memory controller counters in the uncore.&amp;nbsp; These can be significantly less convenient to use, depending on your hardware and software environment.&amp;nbsp;&lt;/P&gt;

&lt;P&gt;You can get an approximation of the amount of data loaded from memory to the caches using the OFFCORE_RESPONSE event.&amp;nbsp; This is a core hardware performance counter event, but it requires programming an additional register to specify exactly what you want to count.&amp;nbsp; The programming of this extra register requires software support from your OS, and understanding which bit fields need to be set is quite a challenge.&amp;nbsp; The best way to figure out how to use these events is to start with the examples provided for your processor at &lt;A href="https://download.01.org/perfmon/" target="_blank"&gt;https://download.01.org/perfmon/&lt;/A&gt; or in the tables for your processor in Chapter 19 of Volume 3 of the Intel Architectures Software Developer's Manual (Intel document 325384).&amp;nbsp;&amp;nbsp; The description of the bits in the auxiliary off-core response register is in the sections of Chapter 18 (in the same document) that have "off-core response" in the title.&amp;nbsp;&amp;nbsp; Understanding how to use these events typically requires both the explanation in Chapter 18 and the examples in Chapter 19 (or at &lt;A href="https://download.01.org/perfmon/" target="_blank"&gt;https://download.01.org/perfmon/&lt;/A&gt;).&lt;/P&gt;</description>
      <pubDate>Wed, 08 Nov 2017 16:45:04 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/what-MEM-UOPS-RETIRED-ALL-LOADS-represent-on-Broadwell/m-p/1153435#M6887</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2017-11-08T16:45:04Z</dc:date>
    </item>
    <item>
      <title>Thank you for your quick</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/what-MEM-UOPS-RETIRED-ALL-LOADS-represent-on-Broadwell/m-p/1153436#M6888</link>
      <description>&lt;P&gt;Thank you for your quick advice!&lt;/P&gt;

&lt;P&gt;I have a further question on OFFCORE_RESPONSE events.&lt;/P&gt;

&lt;P&gt;I have seen people calculate memory bandwidth utilization for Ivy Bridge using&lt;/P&gt;

&lt;P&gt;64 * (OFFCORE_RESPONSE_0:L3_MISS_LOCAL + OFFCORE_RESPONSE_0:L3_MISS_REMOTE) / time&lt;/P&gt;
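
&lt;P&gt;As a minimal sketch of the arithmetic in that formula, assuming each counted offcore response corresponds to one 64-byte cache line: the function name, argument names, and counts below are hypothetical and for illustration only. It shows only the unit conversion, not how to program or read the counters.&lt;/P&gt;

```python
CACHE_LINE_BYTES = 64  # assumption: one counted response moves one 64-byte line


def dram_read_bandwidth(local_lines, remote_lines, seconds):
    """Approximate read bandwidth in bytes per second from cache-line counts."""
    if not seconds > 0:
        raise ValueError("seconds must be positive")
    return CACHE_LINE_BYTES * (local_lines + remote_lines) / seconds


# Hypothetical counter values, not measurements from any real run.
bw = dram_read_bandwidth(local_lines=1_500_000_000,
                         remote_lines=500_000_000,
                         seconds=10.0)
print(f"{bw / 1e9:.2f} GB/s")  # prints "12.80 GB/s"
```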

&lt;P&gt;Does this formula still work on Broadwell?&lt;/P&gt;</description>
      <pubDate>Wed, 08 Nov 2017 23:58:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/what-MEM-UOPS-RETIRED-ALL-LOADS-represent-on-Broadwell/m-p/1153436#M6888</guid>
      <dc:creator>Jin__Chao</dc:creator>
      <dc:date>2017-11-08T23:58:00Z</dc:date>
    </item>
    <item>
      <title>I don't see any place where</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/what-MEM-UOPS-RETIRED-ALL-LOADS-represent-on-Broadwell/m-p/1153437#M6889</link>
      <description>&lt;P&gt;I don't see any place where events with those exact names are defined....&lt;/P&gt;

&lt;P&gt;The OFFCORE_RESPONSE event should provide a good estimate of the DRAM read traffic by:&lt;/P&gt;

&lt;OL&gt;
	&lt;LI&gt;setting the "request type" bits for demand data reads, demand data RFOs, demand Ifetch, prefetch data read, prefetch RFO, prefetch L3 data read, prefetch L3 RFO,&lt;/LI&gt;
	&lt;LI&gt;setting the "supplier information" bits for local DRAM, L3 miss to remote DRAM, and "No Supplier Info available",&lt;/LI&gt;
	&lt;LI&gt;setting the "snoop response" bits for "snoop none", "snoop not needed", "snoop miss", and "snoop no forward".&lt;/LI&gt;
&lt;/OL&gt;

&lt;P&gt;These are all described in Chapter 18 of Volume 3 of the Intel Architectures SW Developer's Manual, in the section on Haswell processors.&amp;nbsp; Be sure to note the difference in the supplier information bits for the Haswell client and Haswell Xeon E5 processors.&amp;nbsp;&amp;nbsp; I did not see anything suggesting that the Broadwell offcore response events differ from those on Haswell.&lt;/P&gt;</description>
      <pubDate>Thu, 09 Nov 2017 15:22:07 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/what-MEM-UOPS-RETIRED-ALL-LOADS-represent-on-Broadwell/m-p/1153437#M6889</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2017-11-09T15:22:07Z</dc:date>
    </item>
    <item>
      <title>Many thanks, John!</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/what-MEM-UOPS-RETIRED-ALL-LOADS-represent-on-Broadwell/m-p/1153438#M6890</link>
      <description>&lt;P&gt;Many thanks, John!&lt;/P&gt;

&lt;P&gt;I am going through the developer manual now.&lt;/P&gt;

&lt;P&gt;My final question in this thread is how to calculate how many FLOPs are executed.&lt;/P&gt;

&lt;P&gt;I assume the answer is to add the following counters according to their width?&lt;/P&gt;

&lt;P&gt;FP_ARITH_INST_RETIRED.SCALAR_DOUBLE&lt;BR /&gt;
	FP_ARITH_INST_RETIRED.SCALAR_SINGLE&lt;BR /&gt;
	FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE&lt;BR /&gt;
	FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE&lt;BR /&gt;
	FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE&lt;BR /&gt;
	FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE&lt;BR /&gt;
	FP_ARITH_INST_RETIRED.SCALAR&lt;BR /&gt;
	FP_ARITH_INST_RETIRED.PACKED&lt;BR /&gt;
	FP_ARITH_INST_RETIRED.SINGLE&lt;BR /&gt;
	FP_ARITH_INST_RETIRED.DOUBLE&lt;/P&gt;

&lt;P&gt;I am actually trying to calculate Operational Intensity (FLOPs/Byte) for applications.&lt;/P&gt;</description>
      <pubDate>Fri, 10 Nov 2017 06:29:44 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/what-MEM-UOPS-RETIRED-ALL-LOADS-represent-on-Broadwell/m-p/1153438#M6890</guid>
      <dc:creator>Jin__Chao</dc:creator>
      <dc:date>2017-11-10T06:29:44Z</dc:date>
    </item>
    <item>
      <title>The "normal" FLOP counts for</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/what-MEM-UOPS-RETIRED-ALL-LOADS-represent-on-Broadwell/m-p/1153439#M6891</link>
      <description>&lt;P&gt;The "normal" FLOP counts for these events are given in Chapter 19 of Volume 3 of the Intel Architectures SW Developer's Manual or in the performance counter event listings at &lt;A href="https://download.01.org/perfmon/SKX/skylakex_core_v1.06.json" target="_blank"&gt;https://download.01.org/perfmon/SKX/skylakex_core_v1.06.json&lt;/A&gt;&lt;/P&gt;
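
&lt;P&gt;A rough sketch of the width-weighted sum discussed in this thread. The weights are simply the number of elements per operand (64-bit doubles or 32-bit singles in a scalar, 128-bit, or 256-bit register). It assumes that the SCALAR, PACKED, SINGLE, and DOUBLE entries are aggregates that overlap the six specific events, so they are left out to avoid double counting; check that against the event listings for your processor. The function names and the sample counts are hypothetical.&lt;/P&gt;

```python
# FLOPs credited per counter increment = elements per operand.
FLOPS_PER_INCREMENT = {
    "FP_ARITH_INST_RETIRED.SCALAR_DOUBLE": 1,
    "FP_ARITH_INST_RETIRED.SCALAR_SINGLE": 1,
    "FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE": 2,   # 128 bits / 64 bits
    "FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE": 4,   # 128 bits / 32 bits
    "FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE": 4,   # 256 bits / 64 bits
    "FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE": 8,   # 256 bits / 32 bits
}


def total_flops(counts):
    """Sum counter values weighted by width.

    Events not in FLOPS_PER_INCREMENT (for example the aggregate
    SCALAR/PACKED/SINGLE/DOUBLE events, assumed here to overlap the
    specific ones) are ignored.
    """
    return sum(FLOPS_PER_INCREMENT[name] * value
               for name, value in counts.items()
               if name in FLOPS_PER_INCREMENT)


def operational_intensity(flops, bytes_moved):
    """FLOPs per byte; bytes_moved must come from a separate traffic measurement."""
    if not bytes_moved > 0:
        raise ValueError("bytes_moved must be positive")
    return flops / bytes_moved


# Hypothetical counter values, not measurements from any real run.
counts = {
    "FP_ARITH_INST_RETIRED.SCALAR_DOUBLE": 1000,
    "FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE": 500,
    "FP_ARITH_INST_RETIRED.PACKED": 500,  # aggregate, ignored
}
flops = total_flops(counts)
print(flops)                               # prints 3000
print(operational_intensity(flops, 6000))  # prints 0.5
```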

&lt;P&gt;The counters are set up so that an FMA instruction will increment the counter twice, so multiplying by the width will give the expected FLOPS value.&amp;nbsp; The downside of this convention is that it makes it a bit harder to determine arithmetic intensity in terms of instruction counts -- you will need to re-compile with the "-no-fma" flag and look at the difference in counts between the original and no-fma cases to determine how many FMA instructions were used.&lt;/P&gt;</description>
      <pubDate>Fri, 10 Nov 2017 21:28:33 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/what-MEM-UOPS-RETIRED-ALL-LOADS-represent-on-Broadwell/m-p/1153439#M6891</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2017-11-10T21:28:33Z</dc:date>
    </item>
    <item>
      <title>The other option is the Intel</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/what-MEM-UOPS-RETIRED-ALL-LOADS-represent-on-Broadwell/m-p/1153440#M6892</link>
      <description>&lt;P&gt;Another option is Intel PCM (https://github.com/opcm/pcm), which directly reads the performance counters at the memory controller. I have tested PCM on Broadwell, and the numbers are accurate.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 13 Nov 2017 08:38:39 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/what-MEM-UOPS-RETIRED-ALL-LOADS-represent-on-Broadwell/m-p/1153440#M6892</guid>
      <dc:creator>ZWang45</dc:creator>
      <dc:date>2017-11-13T08:38:39Z</dc:date>
    </item>
  </channel>
</rss>

