<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Does Multiply-Add operations count twice ? in Software Tuning, Performance Optimization &amp; Platform Monitoring</title>
    <link>https://community.intel.com/t5/Software-Tuning-Performance/Does-Multiply-Add-operations-count-twice/m-p/1065301#M5210</link>
    <description>&lt;P&gt;I compute a[i] += b[i]*c[i] over arrays of 1000000 elements, repeated 10000 times:&lt;/P&gt;

&lt;PRE class="brush:;"&gt;#define N 10000
#define LINE 1000000

for(j=0;j&amp;lt;N;j++)
{
  for(i=0;i&amp;lt;LINE;i++)
  {
    a[i]+=b[i]*c[i];
  }
}&lt;/PRE&gt;

&lt;P&gt;10000*1000000 iterations is almost 20 GFloat, and I expect 28 GFLOP/s on the E5-2699 v4.&lt;/P&gt;

&lt;P&gt;But the PMU counters show only 14 GFLOP/s.&lt;/P&gt;</description>
    <pubDate>Tue, 06 Sep 2016 04:09:29 GMT</pubDate>
    <dc:creator>GHui</dc:creator>
    <dc:date>2016-09-06T04:09:29Z</dc:date>
    <item>
      <title>Does Multiply-Add operations count twice ?</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Does-Multiply-Add-operations-count-twice/m-p/1065301#M5210</link>
      <description>&lt;P&gt;I compute a[i] += b[i]*c[i] over arrays of 1000000 elements, repeated 10000 times:&lt;/P&gt;

&lt;PRE class="brush:;"&gt;#define N 10000
#define LINE 1000000

for(j=0;j&amp;lt;N;j++)
{
  for(i=0;i&amp;lt;LINE;i++)
  {
    a[i]+=b[i]*c[i];
  }
}&lt;/PRE&gt;

&lt;P&gt;10000*1000000 iterations is almost 20 GFloat, and I expect 28 GFLOP/s on the E5-2699 v4.&lt;/P&gt;

&lt;P&gt;But the PMU counters show only 14 GFLOP/s.&lt;/P&gt;</description>
      <pubDate>Tue, 06 Sep 2016 04:09:29 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Does-Multiply-Add-operations-count-twice/m-p/1065301#M5210</guid>
      <dc:creator>GHui</dc:creator>
      <dc:date>2016-09-06T04:09:29Z</dc:date>
    </item>
    <item>
      <title>The short answer: No. The</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Does-Multiply-Add-operations-count-twice/m-p/1065302#M5211</link>
      <description>&lt;P&gt;The short answer: No. The hardware events count FP instructions, not the arithmetic operations performed and not FLOP/s. When you compile the code with FMAs, the two FP operations required per iteration are done by a single instruction, so the counts should be 10000*1000000 / 2. I checked on an E5-2680 v4 with your code, compiled with icc -xHost -O3, using LIKWID (code instrumented with the MarkerAPI):&lt;/P&gt;

&lt;PRE class="brush:bash;"&gt;+------------------------------------------+---------+-------------+
|                   Event                  | Counter |    Core 0   |
+------------------------------------------+---------+-------------+
|             INSTR_RETIRED_ANY            |  FIXC0  | 11936210000 |
|           CPU_CLK_UNHALTED_CORE          |  FIXC1  | 55115580000 |
|           CPU_CLK_UNHALTED_REF           |  FIXC2  | 40096900000 |
| FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE |   PMC0  |      0      |
|    FP_ARITH_INST_RETIRED_SCALAR_DOUBLE   |   PMC1  |    160000   |
| FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE |   PMC2  |  5000000000 |
+------------------------------------------+---------+-------------+&lt;/PRE&gt;

&lt;P&gt;The event FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE shows the proper counts.&lt;/P&gt;

&lt;P&gt;Just as a remark: this is the first time I have seen the unit GFloat. I'm not sure whether this unit exists.&lt;/P&gt;</description>
      <pubDate>Tue, 06 Sep 2016 08:50:03 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Does-Multiply-Add-operations-count-twice/m-p/1065302#M5211</guid>
      <dc:creator>Thomas_G_4</dc:creator>
      <dc:date>2016-09-06T08:50:03Z</dc:date>
    </item>
    <item>
      <title>It's about 5000000000*4</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Does-Multiply-Add-operations-count-twice/m-p/1065303#M5212</link>
      <description>&lt;P&gt;That is about 5000000000*4 computations.&lt;/P&gt;

&lt;P&gt;10000*1000000 iterations is about 10000*1000000*2 computations.&lt;/P&gt;
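
&lt;P&gt;Spelled out as a rough check, assuming each increment of the 256-bit packed double event stands for one operation on each of the 4 doubles in the vector (an assumption, not something the counter table itself states):&lt;/P&gt;

&lt;PRE class="brush:;"&gt;counter value (PMC2)     = 5000000000
doubles per 256-bit op   = 4
5000000000 * 4           = 20000000000

loop iterations N*LINE   = 10000 * 1000000 = 10000000000
multiply + add each      = 10000000000 * 2 = 20000000000&lt;/PRE&gt;

&lt;P&gt;Both come to 20 G, which is what one would see if each FMA instruction incremented the counter twice.&lt;/P&gt;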

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 06 Sep 2016 10:09:43 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Does-Multiply-Add-operations-count-twice/m-p/1065303#M5212</guid>
      <dc:creator>GHui</dc:creator>
      <dc:date>2016-09-06T10:09:43Z</dc:date>
    </item>
    <item>
      <title>The documentation at https:/</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Does-Multiply-Add-operations-count-twice/m-p/1065304#M5213</link>
      <description>&lt;P&gt;The documentation at &lt;A href="https://download.01.org/perfmon/BDW/Broadwell_FP_ARITH_INST_V16.json" target="_blank"&gt;https://download.01.org/perfmon/BDW/Broadwell_FP_ARITH_INST_V16.json&lt;/A&gt; says&lt;/P&gt;

&lt;BLOCKQUOTE&gt;
	&lt;PRE&gt;FM(N)ADD/SUB instructions count twice as they perform multiple calculations per element.&lt;/PRE&gt;
&lt;/BLOCKQUOTE&gt;

&lt;P&gt;I assume that this means that the FMA instructions increment the counter twice, but the wording could be clearer....&lt;/P&gt;</description>
      <pubDate>Tue, 06 Sep 2016 17:57:38 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Does-Multiply-Add-operations-count-twice/m-p/1065304#M5213</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2016-09-06T17:57:38Z</dc:date>
    </item>
    <item>
      <title>I'm sorry for that I describe</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Does-Multiply-Add-operations-count-twice/m-p/1065305#M5214</link>
      <description>&lt;P&gt;I'm sorry that I did not describe this clearly.&lt;/P&gt;

&lt;P&gt;To get FLOP/s I compute (fval / time). What I don't understand is why fval=10G rather than fval=20G.&lt;/P&gt;

&lt;P&gt;If FMA instructions increment the counter twice, I should get N*LINE*2 (10000*1000000*2 = 20G).&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 07 Sep 2016 04:13:46 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Does-Multiply-Add-operations-count-twice/m-p/1065305#M5214</guid>
      <dc:creator>GHui</dc:creator>
      <dc:date>2016-09-07T04:13:46Z</dc:date>
    </item>
    <item>
      <title>Sorry for my mistake. I</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Does-Multiply-Add-operations-count-twice/m-p/1065306#M5215</link>
      <description>&lt;P&gt;Sorry for my mistake. I should have read the documentation beforehand.&lt;/P&gt;

&lt;P&gt;I agree with Mr. McCalpin, the wording could be clearer. If an event is named as counting instructions, it should count instructions and not (in some cases) operations. In my opinion the events should always count single operations, and hence increment by 8 for a single unmasked DP AVX2 FMA. Moreover, they should take into account whether vector elements are masked for the instruction and reduce the increment according to the mask.&lt;/P&gt;
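
&lt;P&gt;For comparison, here is a minimal sketch of how the existing counters could be converted to operations. It assumes that each event increments once per retired instruction and twice per FMA, and that scalar, 128-bit and 256-bit double instructions work on 1, 2 and 4 elements. The function name is hypothetical and not part of LIKWID or any Intel tool.&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;/* Hypothetical helper: estimate double-precision operations from the
 * three FP_ARITH_INST_RETIRED counter values.
 * Assumes each count already includes the double increment for FMAs,
 * so only the vector width has to be applied here. */
unsigned long long dp_flops_from_counters(unsigned long long scalar,
                                          unsigned long long packed128,
                                          unsigned long long packed256)
{
    /* 1 element per scalar op, 2 per 128-bit op, 4 per 256-bit op */
    return scalar + 2ULL * packed128 + 4ULL * packed256;
}&lt;/PRE&gt;

&lt;P&gt;With the counter values from the LIKWID table earlier in this thread (scalar 160000, 128-bit 0, 256-bit 5000000000) this returns 20000160000.&lt;/P&gt;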

&lt;P&gt;Are you 100% sure that your compiled code contains FMAs? If it uses normal AVX2 operations but no FMAs, you have 2 AVX instructions per loop iteration, taking 2 CPU cycles. FMAs execute the multiplication and addition in one cycle, hence you get twice the FLOP rate. This could explain why you only see 14 GFLOP/s while expecting 28 GFLOP/s.&lt;/P&gt;</description>
      <pubDate>Wed, 07 Sep 2016 12:11:31 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Does-Multiply-Add-operations-count-twice/m-p/1065306#M5215</guid>
      <dc:creator>Thomas_G_4</dc:creator>
      <dc:date>2016-09-07T12:11:31Z</dc:date>
    </item>
    <item>
      <title>There is nothing different</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Does-Multiply-Add-operations-count-twice/m-p/1065307#M5216</link>
      <description>&lt;P&gt;There is no difference between the two builds (-fma and -no-fma).&lt;/P&gt;

&lt;PRE class="brush:bash;"&gt;[root@bdw-E5-2699 ~]# icc a.c -no-fma
[root@bdw-E5-2699 ~]# ./a.out
Start Calc
COUNT: 6.009688 GFlops dt=1.663980
COUNT: 6.819478 GFlops dt=1.466388
COUNT: 6.999130 GFlops dt=1.428749
COUNT: 6.984308 GFlops dt=1.431781
COUNT: 6.948044 GFlops dt=1.439254
^C
[root@bdw-E5-2699 ~]# icc a.c -fma
[root@bdw-E5-2699 ~]# ./a.out
Start Calc
COUNT: 6.115572 GFlops dt=1.635170
COUNT: 6.993232 GFlops dt=1.429954
COUNT: 7.098542 GFlops dt=1.408740
COUNT: 6.993760 GFlops dt=1.429846&lt;/PRE&gt;</description>
      <pubDate>Thu, 08 Sep 2016 03:32:10 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Does-Multiply-Add-operations-count-twice/m-p/1065307#M5216</guid>
      <dc:creator>GHui</dc:creator>
      <dc:date>2016-09-08T03:32:10Z</dc:date>
    </item>
    <item>
      <title>[root@bdw-E5-2699 ~]# icc a.c</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Does-Multiply-Add-operations-count-twice/m-p/1065308#M5217</link>
      <description>&lt;PRE class="brush:bash;"&gt;[root@bdw-E5-2699 ~]# icc a.c -xCORE-AVX2
[root@bdw-E5-2699 ~]# ./a.out
Start Calc
COUNT: 10.649412 GFlops dt=0.939019
COUNT: 13.533448 GFlops dt=0.738910
COUNT: 14.235038 GFlops dt=0.702492
COUNT: 14.222081 GFlops dt=0.703132
^C
[root@bdw-E5-2699 ~]# icc a.c -xAVX
[root@bdw-E5-2699 ~]# ./a.out
Start Calc
COUNT: 10.445119 GFlops dt=0.957385
COUNT: 13.668466 GFlops dt=0.731611
COUNT: 13.991278 GFlops dt=0.714731
COUNT: 13.565263 GFlops dt=0.737177
^C
[root@bdw-E5-2699 ~]# icc a.c -xCORE-AVX-I
[root@bdw-E5-2699 ~]# ./a.out
Start Calc
COUNT: 10.847340 GFlops dt=0.921885
COUNT: 13.702910 GFlops dt=0.729772
COUNT: 13.990299 GFlops dt=0.714781
COUNT: 13.914650 GFlops dt=0.718667
COUNT: 13.876689 GFlops dt=0.720633
^C
[root@bdw-E5-2699 ~]# icc a.c -xCORE-AVX512
[root@bdw-E5-2699 ~]# ./a.out

Please verify that both the operating system and the processor support Intel(R) AVX512DQ, AVX512F, AVX512CD, AVX512BW and AVX512VL instructions.

[root@bdw-E5-2699 ~]# icc a.c -xMIC-AVX512
[root@bdw-E5-2699 ~]# ./a.out

Please verify that both the operating system and the processor support Intel(R) AVX512F, AVX512ER, AVX512PF and AVX512CD instructions.&lt;/PRE&gt;</description>
      <pubDate>Thu, 08 Sep 2016 04:16:51 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Does-Multiply-Add-operations-count-twice/m-p/1065308#M5217</guid>
      <dc:creator>GHui</dc:creator>
      <dc:date>2016-09-08T04:16:51Z</dc:date>
    </item>
    <item>
      <title>[root@bdw-E5-2699 ~]# icc -v</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Does-Multiply-Add-operations-count-twice/m-p/1065309#M5218</link>
      <description>&lt;PRE class="brush:bash;"&gt;[root@bdw-E5-2699 ~]# icc -v
icc version 16.0.3 (gcc version 4.8.5 compatibility)&lt;/PRE&gt;</description>
      <pubDate>Thu, 08 Sep 2016 04:28:23 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Does-Multiply-Add-operations-count-twice/m-p/1065309#M5218</guid>
      <dc:creator>GHui</dc:creator>
      <dc:date>2016-09-08T04:28:23Z</dc:date>
    </item>
    <item>
      <title>It is strange to me.</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Does-Multiply-Add-operations-count-twice/m-p/1065310#M5219</link>
      <description>&lt;P&gt;This looks strange to me: all the FP counters read zero.&lt;/P&gt;

&lt;PRE class="brush:;"&gt;+------------------------------------------+---------+--------+
|                   Event                  | Counter | Core 0 |
+------------------------------------------+---------+--------+
|             INSTR_RETIRED_ANY            |  FIXC0  |   93   |
|           CPU_CLK_UNHALTED_CORE          |  FIXC1  |   474  |
|           CPU_CLK_UNHALTED_REF           |  FIXC2  |   858  |
| FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE |   PMC0  |    0   |
|    FP_ARITH_INST_RETIRED_SCALAR_DOUBLE   |   PMC1  |    0   |
| FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE |   PMC2  |    0   |
+------------------------------------------+---------+--------+
&lt;/PRE&gt;

&lt;PRE class="brush:;"&gt;CPU name:  Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
CPU type:  Intel Xeon Broadwell EN/EP/EX processor
CPU clock: 2.19 GHz&lt;/PRE&gt;</description>
      <pubDate>Thu, 08 Sep 2016 05:15:53 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Does-Multiply-Add-operations-count-twice/m-p/1065310#M5219</guid>
      <dc:creator>GHui</dc:creator>
      <dc:date>2016-09-08T05:15:53Z</dc:date>
    </item>
    <item>
      <title>The only way to be sure of</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Does-Multiply-Add-operations-count-twice/m-p/1065311#M5220</link>
      <description>&lt;P&gt;The only way to be sure of the generated code is to look at it....&amp;nbsp; You can do this with either the "-S" option on the compile line or with the "objdump -d" command on the executable.&amp;nbsp;&amp;nbsp; I prefer the "-S" option because the comments in the assembler file map back to the line numbers in the source file.&lt;/P&gt;

&lt;P&gt;Sometimes the compiler will generate multiple versions of the code depending on array sizes and alignments, and it can be hard to figure out which code is actually being run.&amp;nbsp; This can be minimized by making the arrays static with dimensions and loop bounds that are compile-time constants, and by compiling with optimization.&amp;nbsp; Sometimes it is helpful to add flags like "-fno-alias" and sometimes it is helpful to add "restrict" keywords to the array declarations, but these are not usually necessary for simple codes.&amp;nbsp;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;The bigger problem with simple codes is ensuring that:&lt;/P&gt;

&lt;OL&gt;
	&lt;LI&gt;All of the arrays are instantiated before use.&amp;nbsp;&amp;nbsp; The easiest way to do this is to fill them with zeros before they are used for any other purpose.&lt;/LI&gt;
	&lt;LI&gt;The results of all of the calculations feed into the output, so the compiler cannot eliminate the computations.&amp;nbsp; I typically sum up the results of the computation and print this sum, or sometimes I will just print one or a few elements of the output array at the end.&lt;/LI&gt;
&lt;/OL&gt;

&lt;P&gt;In the code above, the a[] array is overwritten with the same values for every iteration of the j loop, so the j loop does not actually need to be executed N times. Some compilers can prove that the output of the final iteration is independent of all the prior iterations, so they will execute the j loop only once.&lt;/P&gt;

&lt;P&gt;In other cases (and this may be the key here), the compiler may not be able to make such a large change, but the same effect occurs at a smaller scale. For example, if the compiler unrolls the outer loop once, the optimizer may recognize that the output of the first half is overwritten by the second half, so it eliminates the first half of each unrolled loop iteration. This eliminates 1/2 of the arithmetic operations and would account for the factor of 2 discrepancy seen here.&lt;/P&gt;
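
&lt;P&gt;As a rough illustration of that factor of 2, using the N and LINE values from the original post and assuming (hypothetically) that exactly half of the inner-loop work is eliminated:&lt;/P&gt;

&lt;PRE class="brush:;"&gt;operations as written  = 2 * N * LINE = 2 * 10000 * 1000000 = 20000000000
if half are eliminated = 20000000000 / 2                    = 10000000000&lt;/PRE&gt;

&lt;P&gt;The second figure matches the fval=10G reported earlier in the thread, but the generated assembly would still need to be inspected to confirm that this is what actually happens.&lt;/P&gt;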

&lt;P&gt;To avoid this problem you need to set up a true data dependency across the iterations of j.&amp;nbsp; I would recommend something like:&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;#define N 10000
#define LINE 1000000

double sum, scalar;

sum = 0.0;
scalar = 1.0;
for(j=0;j&amp;lt;N;j++)
{
  for(i=0;i&amp;lt;LINE;i++)
  {
    a[i] += scalar*c[i];
  }
  sum += a[j];    /* feed the results into the output */
  scalar = a[j];  /* true data dependency across iterations of j */
}
printf("dummy sum %g\n",sum);&lt;/PRE&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 09 Sep 2016 18:23:29 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Does-Multiply-Add-operations-count-twice/m-p/1065311#M5220</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2016-09-09T18:23:29Z</dc:date>
    </item>
  </channel>
</rss>

