<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic &gt;&gt;&gt;As far as I understand in Analyzers</title>
    <link>https://community.intel.com/t5/Analyzers/Interpreting-the-AVX-counter-results/m-p/807322#M1711</link>
    <description>&lt;P&gt;&amp;gt;&amp;gt;&amp;gt;As far as I understand, during execution of packed AVX instructions the vector can be only partly filled. Is there a way to determine whether a vector was completely filled or not?&amp;gt;&amp;gt;&amp;gt;&lt;/P&gt;

&lt;P&gt;I presume that you are referring to the XMMx/YMMx registers. In this case you can check in a debugger whether a specific register is filled with 4 or 8 scalars.&lt;/P&gt;</description>
    <pubDate>Fri, 06 Mar 2015 10:15:24 GMT</pubDate>
    <dc:creator>Bernard</dc:creator>
    <dc:date>2015-03-06T10:15:24Z</dc:date>
    <item>
      <title>Interpreting the AVX counter results</title>
      <link>https://community.intel.com/t5/Analyzers/Interpreting-the-AVX-counter-results/m-p/807314#M1703</link>
      <description>As far as I understand, during execution of packed AVX instructions the vector can be only partly filled. Is there a way to determine whether a vector was completely filled or not?&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;Or is it correct to assume that the compiler does its job well and that partially filled vectors occur only rarely (e.g. when we run out of data at the end of a loop)?&lt;/DIV&gt;</description>
      <pubDate>Thu, 07 Jun 2012 16:40:51 GMT</pubDate>
      <guid>https://community.intel.com/t5/Analyzers/Interpreting-the-AVX-counter-results/m-p/807314#M1703</guid>
      <dc:creator>Pavel_Mezentsev</dc:creator>
      <dc:date>2012-06-07T16:40:51Z</dc:date>
    </item>
    <item>
      <title>Interpreting the AVX counter results</title>
      <link>https://community.intel.com/t5/Analyzers/Interpreting-the-AVX-counter-results/m-p/807315#M1704</link>
      <description>Can you use VTune Amplifier XE 2011 to do event-based sampling with the PMU events named &lt;B&gt;SIMD_FP_256&lt;/B&gt;?&lt;BR /&gt;&lt;BR /&gt;Review the event counts to check whether the results match your expectations.&lt;BR /&gt;&lt;BR /&gt;&lt;TABLE width="95%" style="width: 95%;"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TH width="20%" style="width: 20%;"&gt;&lt;H3 class="TableHead"&gt;Event Name Extension&lt;/H3&gt;&lt;/TH&gt;&lt;TH width="10%" style="width: 10%;"&gt;&lt;H3 class="TableHead"&gt;Umask&lt;/H3&gt;&lt;/TH&gt;&lt;TH width="40%" style="width: 40%;"&gt;&lt;H3 class="TableHead"&gt;Description&lt;/H3&gt;&lt;/TH&gt;&lt;TH width="15%" style="width: 15%;"&gt;&lt;H3 class="TableHead"&gt;Counters&lt;/H3&gt;&lt;/TH&gt;&lt;TH width="15%" style="width: 15%;"&gt;&lt;H3 class="TableHead"&gt;Counters (HT off)&lt;/H3&gt;&lt;/TH&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;&lt;P&gt;PACKED_SINGLE&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;0x01&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;This event counts the number of AVX-256 computational FP single-precision uops issued during the cycle. Note: Packed AVX-256 can be counted as one, and will count for SIMD FP 128.&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;0,1,2,3&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;0,1,2,3,4,5,6,7&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;&lt;P&gt;PACKED_DOUBLE&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;0x02&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;This event counts the number of AVX-256 computational FP double-precision uops issued during the cycle. Note: Packed AVX-256 can be counted as one, and will count for SIMD FP 128.&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;0,1,2,3&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;0,1,2,3,4,5,6,7&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;BR /&gt;</description>
      <pubDate>Fri, 08 Jun 2012 08:53:22 GMT</pubDate>
      <guid>https://community.intel.com/t5/Analyzers/Interpreting-the-AVX-counter-results/m-p/807315#M1704</guid>
      <dc:creator>Peter_W_Intel</dc:creator>
      <dc:date>2012-06-08T08:53:22Z</dc:date>
    </item>
    <item>
      <title>Interpreting the AVX counter results</title>
      <link>https://community.intel.com/t5/Analyzers/Interpreting-the-AVX-counter-results/m-p/807316#M1705</link>
      <description>Yes, I've done the profiling using VTune.&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;The thing is that I'm analyzing the performance of a huge application. Specifically, I'm trying to understand whether the code performs many FP operations and whether it has been vectorized successfully.&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;For one of the runs I got the following results:&lt;/DIV&gt;&lt;DIV&gt;&lt;DIV id="_mcePaste"&gt;CPU_CLK_UNHALTED.REF_TSC	4,983,560,000,000&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;CPU_CLK_UNHALTED.THREAD	5,670,360,000,000&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE	358,000,000,000&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE	1,164,920,000,000&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;FP_COMP_OPS_EXE.SSE_SCALAR_SINGLE	21,200,000,000&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;FP_COMP_OPS_EXE.X87	223,200,000,000&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;INST_RETIRED.ANY	7,926,840,000,000&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;The SIMD_FP_256 counters are all zero.&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;I've also measured the HPL code and got the following results:&lt;/DIV&gt;&lt;DIV&gt;&lt;DIV&gt;CPU_CLK_UNHALTED.REF_TSC&lt;SPAN style="white-space: pre;"&gt;	&lt;/SPAN&gt;2,675,264,000,000&lt;/DIV&gt;&lt;DIV&gt;CPU_CLK_UNHALTED.THREAD&lt;SPAN style="white-space: pre;"&gt;	&lt;/SPAN&gt;2,922,426,000,000&lt;/DIV&gt;&lt;DIV&gt;FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE&lt;SPAN style="white-space: pre;"&gt;	&lt;/SPAN&gt;2,816,000,000&lt;/DIV&gt;&lt;DIV&gt;FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE&lt;SPAN style="white-space: pre;"&gt;	&lt;/SPAN&gt;18,080,000,000&lt;/DIV&gt;&lt;DIV&gt;FP_COMP_OPS_EXE.X87&lt;SPAN style="white-space: pre;"&gt;	&lt;/SPAN&gt;460,000,000&lt;/DIV&gt;&lt;DIV&gt;INST_RETIRED.ANY&lt;SPAN style="white-space: pre;"&gt;	&lt;/SPAN&gt;7,581,522,000,000&lt;/DIV&gt;&lt;DIV&gt;SIMD_FP_256.PACKED_DOUBLE&lt;SPAN style="white-space: pre;"&gt;	&lt;/SPAN&gt;4,582,812,000,000&lt;/DIV&gt;&lt;/DIV&gt;
&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;What I don't understand is how to interpret the results. What is the difference between FP_COMP_OPS_EXE and SIMD_FP_256? Is it justified to say that each increment of the counter means that 4 flops were actually executed (for DP)? And can 2 increments occur during one processor cycle (one for the add and one for the multiply)?&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;Any clarification on the subject would be appreciated!&lt;/DIV&gt;</description>
      <pubDate>Fri, 08 Jun 2012 09:32:34 GMT</pubDate>
      <guid>https://community.intel.com/t5/Analyzers/Interpreting-the-AVX-counter-results/m-p/807316#M1705</guid>
      <dc:creator>Pavel_Mezentsev</dc:creator>
      <dc:date>2012-06-08T09:32:34Z</dc:date>
    </item>
    <item>
      <title>Interpreting the AVX counter results</title>
      <link>https://community.intel.com/t5/Analyzers/Interpreting-the-AVX-counter-results/m-p/807317#M1706</link>
      <description>&lt;DIV&gt;&lt;DIV id="_mcePaste"&gt;&lt;DIV&gt;SIMD_FP_256.PACKED_DOUBLE 4,582,812,000,000, which counts SSE, AVX-128 FP and AVX-256 FP computational double-precision uops issued&lt;BR /&gt;&lt;DIV id="_mcePaste"&gt;FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE 358,000,000,000, which counts only SSE and AVX-128 FP computational double-precision uops issued&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;</description>
      <pubDate>Sat, 09 Jun 2012 08:55:38 GMT</pubDate>
      <guid>https://community.intel.com/t5/Analyzers/Interpreting-the-AVX-counter-results/m-p/807317#M1706</guid>
      <dc:creator>Peter_W_Intel</dc:creator>
      <dc:date>2012-06-09T08:55:38Z</dc:date>
    </item>
    <item>
      <title>Interpreting the AVX counter results</title>
      <link>https://community.intel.com/t5/Analyzers/Interpreting-the-AVX-counter-results/m-p/807318#M1707</link>
      <description>Is it correct that the operations counted by FP_COMP_OPS_EXE are a subset of those counted by SIMD_FP_256? And that by subtracting the former from the latter I get the number of 256-bit operations only?</description>
      <pubDate>Sat, 09 Jun 2012 09:47:07 GMT</pubDate>
      <guid>https://community.intel.com/t5/Analyzers/Interpreting-the-AVX-counter-results/m-p/807318#M1707</guid>
      <dc:creator>Pavel_Mezentsev</dc:creator>
      <dc:date>2012-06-09T09:47:07Z</dc:date>
    </item>
    <item>
      <title>Interpreting the AVX counter results</title>
      <link>https://community.intel.com/t5/Analyzers/Interpreting-the-AVX-counter-results/m-p/807319#M1708</link>
      <description>I think the answer is "Yes": the result is for AVX-256 only. :-)</description>
      <pubDate>Sat, 09 Jun 2012 11:01:06 GMT</pubDate>
      <guid>https://community.intel.com/t5/Analyzers/Interpreting-the-AVX-counter-results/m-p/807319#M1708</guid>
      <dc:creator>Peter_W_Intel</dc:creator>
      <dc:date>2012-06-09T11:01:06Z</dc:date>
    </item>
    <item>
      <title>What is equivalent of  </title>
      <link>https://community.intel.com/t5/Analyzers/Interpreting-the-AVX-counter-results/m-p/807320#M1709</link>
      <description>&lt;P&gt;What is the equivalent of SIMD_FP_256.PACKED_DOUBLE on Haswell?&lt;/P&gt;</description>
      <pubDate>Fri, 27 Feb 2015 15:46:41 GMT</pubDate>
      <guid>https://community.intel.com/t5/Analyzers/Interpreting-the-AVX-counter-results/m-p/807320#M1709</guid>
      <dc:creator>mrabet_ahmed_amine</dc:creator>
      <dc:date>2015-02-27T15:46:41Z</dc:date>
    </item>
    <item>
      <title>It appears that all of the</title>
      <link>https://community.intel.com/t5/Analyzers/Interpreting-the-AVX-counter-results/m-p/807321#M1710</link>
      <description>&lt;P&gt;It appears that all of the floating-point performance counters (with the exception of Event 0xCA, "Floating Point Assists") have been removed from the Haswell-based products.&lt;/P&gt;

&lt;P&gt;These counters are known to systematically overcount in Sandy Bridge and Ivy Bridge processors whenever the input registers are not ready (e.g., due to cache misses).&amp;nbsp;&amp;nbsp; I have seen overcounting by anywhere from ~3% to 10x, depending on the average latency for loads feeding into the FP instructions.&lt;/P&gt;

&lt;P&gt;We still use these counters on our 6400-node Sandy Bridge system to monitor whether codes are using SSE or AVX, how well the codes vectorize, and whether they are running with 32-bit or 64-bit floating-point arithmetic.&amp;nbsp; The accuracy is good enough for this classification process, and if we deploy a large Haswell-based system we will have to employ a different approach to get this information.&lt;/P&gt;
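
&lt;P&gt;What follows is a minimal Python sketch of this kind of classification. It is not the procedure used on the system described above. It assumes the weighting convention commonly applied to these Sandy Bridge/Ivy Bridge events: 1 FLOP per scalar uop; 2 (DP) or 4 (SP) FLOPs per 128-bit packed uop; 4 (DP) or 8 (SP) FLOPs per 256-bit packed uop; and 1 FLOP per x87 uop. By default it also assumes that SIMD_FP_256.* counts only 256-bit uops. An earlier reply in this thread suggests that SSE/AVX-128 uops are included as well, so the hypothetical fp256_includes_128 flag subtracts FP_COMP_OPS_EXE.SSE_PACKED_* first. The SnbFpCounts class and the function names are illustrative only. Because of the overcounting described above, the results are rough upper bounds suited to classification, not exact FLOP counts.&lt;/P&gt;

&lt;PRE&gt;
```python
from dataclasses import dataclass

# FLOPs per counted uop (assumed convention, not taken from Intel documentation):
# scalar = 1, 128-bit packed = 2 DP / 4 SP, 256-bit packed = 4 DP / 8 SP.
DP_PER_128, SP_PER_128 = 2, 4
DP_PER_256, SP_PER_256 = 4, 8


@dataclass
class SnbFpCounts:
    """Hypothetical container for Sandy Bridge / Ivy Bridge FP event totals."""
    sse_scalar_double: int = 0          # FP_COMP_OPS_EXE.SSE_SCALAR_DOUBLE
    sse_packed_double: int = 0          # FP_COMP_OPS_EXE.SSE_PACKED_DOUBLE
    simd_fp_256_packed_double: int = 0  # SIMD_FP_256.PACKED_DOUBLE
    sse_scalar_single: int = 0          # FP_COMP_OPS_EXE.SSE_SCALAR_SINGLE
    sse_packed_single: int = 0          # FP_COMP_OPS_EXE.SSE_PACKED_SINGLE
    simd_fp_256_packed_single: int = 0  # SIMD_FP_256.PACKED_SINGLE
    x87: int = 0                        # FP_COMP_OPS_EXE.X87 (assumed 1 FLOP per uop)


def avx256_uops(c: SnbFpCounts, fp256_includes_128: bool = False) -> tuple:
    """Return the (DP, SP) 256-bit uop counts.

    By default SIMD_FP_256.* is assumed to count 256-bit uops only. With
    fp256_includes_128=True the 128-bit packed counts are subtracted first.
    """
    dp, sp = c.simd_fp_256_packed_double, c.simd_fp_256_packed_single
    if fp256_includes_128:
        dp = max(0, dp - c.sse_packed_double)
        sp = max(0, sp - c.sse_packed_single)
    return dp, sp


def classify(c: SnbFpCounts, fp256_includes_128: bool = False) -> dict:
    """Estimate total FLOPs and the vector-width / precision mix of a run."""
    dp256, sp256 = avx256_uops(c, fp256_includes_128)
    scalar = c.sse_scalar_double + c.sse_scalar_single + c.x87
    packed128 = DP_PER_128 * c.sse_packed_double + SP_PER_128 * c.sse_packed_single
    packed256 = DP_PER_256 * dp256 + SP_PER_256 * sp256
    dp_flops = c.sse_scalar_double + DP_PER_128 * c.sse_packed_double + DP_PER_256 * dp256
    total = scalar + packed128 + packed256
    if total == 0:
        return {"total_flops": 0}
    return {
        "total_flops": total,
        "dp_flops": dp_flops,
        "dp_fraction": dp_flops / total,  # x87 FLOPs are not assigned a precision here
        "scalar_fraction": scalar / total,
        "sse_avx128_fraction": packed128 / total,
        "avx256_fraction": packed256 / total,
    }


# Counter totals posted earlier in this thread (HPL run and application run).
hpl = SnbFpCounts(sse_scalar_double=18_080_000_000,
                  sse_packed_double=2_816_000_000,
                  simd_fp_256_packed_double=4_582_812_000_000,
                  x87=460_000_000)
app = SnbFpCounts(sse_scalar_double=1_164_920_000_000,
                  sse_packed_double=358_000_000_000,
                  sse_scalar_single=21_200_000_000,
                  x87=223_200_000_000)
print(classify(hpl))
print(classify(app))
```
&lt;/PRE&gt;

&lt;P&gt;Applied to the two runs posted earlier in this thread, the HPL estimate comes almost entirely from 256-bit packed uops. The application run, where SIMD_FP_256 was zero, comes out mostly scalar. That split is the kind of answer the classification needs, even though the absolute FLOP totals are inflated by the overcounting.&lt;/P&gt;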

&lt;P&gt;Intel is certainly aware of the accuracy issues with these counters and is likely to fix the existing problems in some future products. Section 19.2 of Volume 3 of the Intel Software Developer's Manual (document 324384-053, January 2015) shows that Broadwell gets a few FP events back:&lt;/P&gt;

&lt;UL&gt;
	&lt;LI&gt;Event 0x14, Umask 0x01: ARITH.FPU_DIV_ACTIVE -- cycles that the divide unit is active&lt;/LI&gt;
	&lt;LI&gt;Event 0xC0, Umask 0x02: INST_RETIRED.X87 -- x87 Floating-Point operations that are retired without generating exceptions.&lt;/LI&gt;
&lt;/UL&gt;

&lt;P&gt;I have not heard any definitive statements on when improved support for floating-point counts will make it into shipping products.&lt;/P&gt;</description>
      <pubDate>Fri, 27 Feb 2015 19:21:44 GMT</pubDate>
      <guid>https://community.intel.com/t5/Analyzers/Interpreting-the-AVX-counter-results/m-p/807321#M1710</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2015-02-27T19:21:44Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;&gt;As far as I understand</title>
      <link>https://community.intel.com/t5/Analyzers/Interpreting-the-AVX-counter-results/m-p/807322#M1711</link>
      <description>&lt;P&gt;&amp;gt;&amp;gt;&amp;gt;As far as I understand, during execution of packed AVX instructions the vector can be only partly filled. Is there a way to determine whether a vector was completely filled or not?&amp;gt;&amp;gt;&amp;gt;&lt;/P&gt;

&lt;P&gt;I presume that you are referring to the XMMx/YMMx registers. In this case you can check in a debugger whether a specific register is filled with 4 or 8 scalars.&lt;/P&gt;</description>
      <pubDate>Fri, 06 Mar 2015 10:15:24 GMT</pubDate>
      <guid>https://community.intel.com/t5/Analyzers/Interpreting-the-AVX-counter-results/m-p/807322#M1711</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2015-03-06T10:15:24Z</dc:date>
    </item>
    <item>
      <title>Thank you for your answer</title>
      <link>https://community.intel.com/t5/Analyzers/Interpreting-the-AVX-counter-results/m-p/807323#M1712</link>
      <description>&lt;DIV class="forum-post-author"&gt;Thank you for your answer.&lt;/DIV&gt;

&lt;P&gt;&amp;gt;&amp;gt;&amp;gt;and if we deploy a large Haswell-based system we will have to employ a different approach to get this information.&lt;/P&gt;

&lt;P&gt;Do you have any idea how to measure FLOPs on the Haswell architecture?&lt;/P&gt;</description>
      <pubDate>Fri, 06 Mar 2015 13:27:31 GMT</pubDate>
      <guid>https://community.intel.com/t5/Analyzers/Interpreting-the-AVX-counter-results/m-p/807323#M1712</guid>
      <dc:creator>mrabet_ahmed_amine</dc:creator>
      <dc:date>2015-03-06T13:27:31Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;&gt;Do you have any idea to</title>
      <link>https://community.intel.com/t5/Analyzers/Interpreting-the-AVX-counter-results/m-p/807324#M1713</link>
      <description>&lt;P&gt;&amp;gt;&amp;gt;&amp;gt;Do you have any idea how to measure FLOPs on the Haswell architecture?&amp;gt;&amp;gt;&amp;gt;&lt;/P&gt;

&lt;P&gt;Do you mean to count how many GFLOPS were executed?&lt;/P&gt;</description>
      <pubDate>Sun, 08 Mar 2015 07:24:11 GMT</pubDate>
      <guid>https://community.intel.com/t5/Analyzers/Interpreting-the-AVX-counter-results/m-p/807324#M1713</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2015-03-08T07:24:11Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;Do you mean to count how</title>
      <link>https://community.intel.com/t5/Analyzers/Interpreting-the-AVX-counter-results/m-p/807325#M1714</link>
      <description>&lt;P&gt;&amp;gt;&amp;gt;Do you mean to count how many GFLOPS were executed?&lt;/P&gt;

&lt;P&gt;Yes: to count the GFLOPS of the application, and the number of single-precision and double-precision FLOPs that were executed.&lt;/P&gt;</description>
      <pubDate>Thu, 12 Mar 2015 15:52:35 GMT</pubDate>
      <guid>https://community.intel.com/t5/Analyzers/Interpreting-the-AVX-counter-results/m-p/807325#M1714</guid>
      <dc:creator>mrabet_ahmed_amine</dc:creator>
      <dc:date>2015-03-12T15:52:35Z</dc:date>
    </item>
    <item>
      <title>I think that John answered</title>
      <link>https://community.intel.com/t5/Analyzers/Interpreting-the-AVX-counter-results/m-p/807326#M1715</link>
      <description>&lt;P&gt;I think that John answered your question.&lt;/P&gt;</description>
      <pubDate>Thu, 12 Mar 2015 18:20:52 GMT</pubDate>
      <guid>https://community.intel.com/t5/Analyzers/Interpreting-the-AVX-counter-results/m-p/807326#M1715</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2015-03-12T18:20:52Z</dc:date>
    </item>
  </channel>
</rss>

