<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Why flops more than 100% in Software Tuning, Performance Optimization &amp; Platform Monitoring</title>
    <link>https://community.intel.com/t5/Software-Tuning-Performance/Why-flops-more-than-100/m-p/1181440#M7447</link>
    <description>&lt;P&gt;I ran my program on Ivy Bridge and collected the following events. Sometimes the computed FLOPS come out at more than 100% of the maximum.&lt;/P&gt;

&lt;P&gt;&amp;nbsp; FP_COMP_OPS_EXE.X87&lt;BR /&gt;
	&amp;nbsp; FP_COMP_OPS_EXE.SSE_FP_PACKED_DOUBLE&lt;BR /&gt;
	&amp;nbsp; FP_COMP_OPS_EXE.SSE_FP_SCALAR_SINGLE&lt;BR /&gt;
	&amp;nbsp; FP_COMP_OPS_EXE.SSE_FP_PACKED_SINGLE&lt;BR /&gt;
	&amp;nbsp; FP_COMP_OPS_EXE.SSE_FP_SCALAR_DOUBLE&lt;BR /&gt;
	&amp;nbsp; SIMD_FP_256.PACKED_SINGLE&lt;BR /&gt;
	&amp;nbsp; SIMD_FP_256.PACKED_DOUBLE&lt;/P&gt;

&lt;PRE class="brush:plain;"&gt;
X87	PackedD	ScalarS	PackedS	ScalarD	256PackedS	256PackedD	Max	Time1	Time2	Percent

14588.000000 	21455680.000000 	24.000000 	0.000000 	66765430.000000 	1247.000000 	59014271960.000000 	422400000000.000000 	0.500438 	0.500555  	111.723512 

&lt;/PRE&gt;

&lt;P&gt;I use the following formula.&lt;/P&gt;

&lt;P&gt;100 * ( (x87+8*256PackedS+4*256PackedD)/Time1 + (2*PackedD+ScalarS+2*PackedS+ScalarD)/Time2 ) / Max&lt;/P&gt;</description>
    <pubDate>Wed, 13 Sep 2017 09:55:18 GMT</pubDate>
    <dc:creator>GHui</dc:creator>
    <dc:date>2017-09-13T09:55:18Z</dc:date>
    <item>
      <title>Why flops more than 100%</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Why-flops-more-than-100/m-p/1181440#M7447</link>
      <description>&lt;P&gt;I ran my program on Ivy Bridge and collected the following events. Sometimes the computed FLOPS come out at more than 100% of the maximum.&lt;/P&gt;

&lt;P&gt;&amp;nbsp; FP_COMP_OPS_EXE.X87&lt;BR /&gt;
	&amp;nbsp; FP_COMP_OPS_EXE.SSE_FP_PACKED_DOUBLE&lt;BR /&gt;
	&amp;nbsp; FP_COMP_OPS_EXE.SSE_FP_SCALAR_SINGLE&lt;BR /&gt;
	&amp;nbsp; FP_COMP_OPS_EXE.SSE_FP_PACKED_SINGLE&lt;BR /&gt;
	&amp;nbsp; FP_COMP_OPS_EXE.SSE_FP_SCALAR_DOUBLE&lt;BR /&gt;
	&amp;nbsp; SIMD_FP_256.PACKED_SINGLE&lt;BR /&gt;
	&amp;nbsp; SIMD_FP_256.PACKED_DOUBLE&lt;/P&gt;

&lt;PRE class="brush:plain;"&gt;
X87	PackedD	ScalarS	PackedS	ScalarD	256PackedS	256PackedD	Max	Time1	Time2	Percent

14588.000000 	21455680.000000 	24.000000 	0.000000 	66765430.000000 	1247.000000 	59014271960.000000 	422400000000.000000 	0.500438 	0.500555  	111.723512 

&lt;/PRE&gt;

&lt;P&gt;I use the following formula.&lt;/P&gt;

&lt;P&gt;100 * ( (x87+8*256PackedS+4*256PackedD)/Time1 + (2*PackedD+ScalarS+2*PackedS+ScalarD)/Time2 ) / Max&lt;/P&gt;</description>
      <pubDate>Wed, 13 Sep 2017 09:55:18 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Why-flops-more-than-100/m-p/1181440#M7447</guid>
      <dc:creator>GHui</dc:creator>
      <dc:date>2017-09-13T09:55:18Z</dc:date>
    </item>
    <item>
      <title>The SIMD floating-point</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Why-flops-more-than-100/m-p/1181441#M7448</link>
      <description>&lt;P&gt;The SIMD floating-point counters on Sandy Bridge and Ivy Bridge are known to overcount. The amount of overcounting depends on how long the FP arithmetic instructions have to wait for their input arguments to be ready.&lt;/P&gt;

&lt;UL&gt;
	&lt;LI&gt;For data in L1 cache, the overcounting is very small (~3% on DGEMM).&lt;/LI&gt;
	&lt;LI&gt;For data in L2 cache, the overcounting is somewhat larger -- I seem to recall values in the 10% range.&lt;/LI&gt;
	&lt;LI&gt;For data in memory, the overcounting can be very large. With the STREAM benchmark using all cores, I have seen overcounting ratios of 6x to 10x.&lt;/LI&gt;
&lt;/UL&gt;
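
&lt;P&gt;As a rough cross-check, the sketch below recomputes the percentage from the sample row in the question and shows how a single overcount factor would scale it. It is only an illustration under stated assumptions: the function name and dictionary keys are made up for this example, the "Max" column is assumed to be the peak FLOP/s, the 256-bit weights are taken as 8 for packed single and 4 for packed double (which reproduces the 111.72% figure in the question), and the overcount is modelled as one uniform factor applied to every counter, which is a simplification.&lt;/P&gt;

```python
# Values copied from the sample row in the question.
COUNTS = {
    "x87": 14588.0,
    "sse_packed_double": 21455680.0,
    "sse_scalar_single": 24.0,
    "sse_packed_single": 0.0,
    "sse_scalar_double": 66765430.0,
    "avx_packed_single": 1247.0,
    "avx_packed_double": 59014271960.0,
}
MAX_FLOPS = 422400000000.0  # "Max" column, assumed to be peak FLOP/s
TIME1 = 0.500438            # seconds
TIME2 = 0.500555            # seconds


def percent_of_max(c, overcount=1.0):
    """Percent of Max, with all counts divided by an assumed overcount factor."""
    # 256-bit packed instructions: 8 singles or 4 doubles each.
    group1 = c["x87"] + 8 * c["avx_packed_single"] + 4 * c["avx_packed_double"]
    # 128-bit and scalar weights kept exactly as written in the question.
    group2 = (2 * c["sse_packed_double"] + c["sse_scalar_single"]
              + 2 * c["sse_packed_single"] + c["sse_scalar_double"])
    raw = 100.0 * (group1 / TIME1 + group2 / TIME2) / MAX_FLOPS
    # Hypothetical model: one uniform overcount factor for every counter.
    return raw / overcount


raw = percent_of_max(COUNTS)
print(f"raw percent of Max: {raw:.2f}")
print(f"uniform overcount factor that would bring it to 100: {raw / 100.0:.3f}")
```

&lt;P&gt;Under these assumptions, a uniform overcount of roughly 12% would already be enough to push a run that is exactly at the maximum up to the reported value, so a reading above 100% does not by itself mean the formula is wrong.&lt;/P&gt;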

&lt;P&gt;The floating-point counters on Broadwell and Skylake are a new implementation and don't appear to have this problem.&lt;/P&gt;</description>
      <pubDate>Wed, 13 Sep 2017 13:45:42 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Why-flops-more-than-100/m-p/1181441#M7448</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2017-09-13T13:45:42Z</dc:date>
    </item>
  </channel>
</rss>

