Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Does Multiply-Add operations count twice ?

GHui
Novice
745 Views

I do a+=b*c. It is 10000*1000000 size;

#define N 10000
#define LINE 1000000

for(j=0;j<N;j++)
{
  for(i=0;i<LINE;i++)
  {
    a+=b*c;
  }
}

10000*1000000 is almost 20GFloat, and 28Gflops on E5-2699 v4. 

But the PMU counters are only 14Gflops.

 

 

0 Kudos
1 Solution
McCalpinJohn
Honored Contributor III
746 Views

The only way to be sure of the generated code is to look at it....  You can do this with either the "-S" option on the compile line or with the "objdump -d" command on the executable.   I prefer the "-S" option because the comments in the assembler file map back to the line numbers in the source file.

Sometimes the compiler will generate multiple versions of the code depending on array sizes and alignments, and it can be hard to figure out which code is actually being run.  This can be minimized by making the arrays static with dimensions and loop bounds that are compile-time constants, and by compiling with optimization.  Sometimes it is helpful to add flags like "-fno-alias" and sometimes it is helpful to add "restrict" keywords to the array declarations, but these are not usually necessary for simple codes.  

The bigger problem with simple codes is ensuring that:

  1. All of the arrays are instantiated before use.   The easiest way to do this is to fill them with zeros before they are used for any other purpose.
  2. The results of all of the calculations feed into the output, so the compiler cannot eliminate the computations.  I typically sum up the results of the computation and print this sum, or sometimes I will just print one or a few elements of the output array at the end.

In the code above, the a[] array is overwritten with the same values for every iteration of the j loop, so the j loop does not actually need to be executed N times.  Some compilers can prove that the output of the final iteration is independent of all the prior iterations, so it will only execute the j loop once.  

In other cases (and this may be the key here), the compiler may not be able to make such a large change, but they same effect occurs at a smaller scale.   For example, if the compiler unrolls the outer loop once, the optimizer may recognize that the output of the first half is overwritten by the second half, so it eliminates the first half of each unrolled loop iteration.  This eliminates 1/2 of the arithmetic operations and would account for the factor of 2 discrepancy seen here.

To avoid this problem you need to set up a true data dependency across the iterations of j.  I would recommend something like:

#define N 10000
#define LINE 1000000

double sum, double scalar;

sum = 0.0;
scalar = 1.0;
for(j=0;j<N;j++)
{
  for(i=0;i<LINE;i++)
  {
    a += scalar*c;
  }
  sum += a;
  scalar = a;
}
printf("dummy sum %g\n",sum);

 

View solution in original post

0 Kudos
10 Replies
Thomas_G_4
New Contributor II
745 Views

The short answer: No. The hardware events count FP instructions and not performed arithmetic operation and also not FLOP/s. When you compile the code with FMAs the two required FP operations per iteration are done by a single instruction. So the counts should be 10000*1000000 / 2. Check on E5-2680 v4 with your code and compiled with icc -xHost -O3 using LIKWID (Code instrumented with MarkerAPI):

+------------------------------------------+---------+-------------+
|                   Event                  | Counter |    Core 0   |
+------------------------------------------+---------+-------------+
|             INSTR_RETIRED_ANY            |  FIXC0  | 11936210000 |
|           CPU_CLK_UNHALTED_CORE          |  FIXC1  | 55115580000 |
|           CPU_CLK_UNHALTED_REF           |  FIXC2  | 40096900000 |
| FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE |   PMC0  |      0      |
|    FP_ARITH_INST_RETIRED_SCALAR_DOUBLE   |   PMC1  |    160000   |
| FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE |   PMC2  |  5000000000 |
+------------------------------------------+---------+-------------+

The event FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE shows the proper counts.

Just as remark: It's the first time I read the unit GFloat. I'm not sure whether this unit exists.

 

0 Kudos
GHui
Novice
746 Views
It's about 5000000000*4 computations.

10000*1000000 is almost 10000*1000000*2 computations.

 

0 Kudos
McCalpinJohn
Honored Contributor III
746 Views

The documentation at https://download.01.org/perfmon/BDW/Broadwell_FP_ARITH_INST_V16.json says

FM(N)ADD/SUB instructions count twice as they perform multiple calculations per element.

I assume that this means that the FMA instructions increment the counter twice, but the wording could be clearer....

0 Kudos
GHui
Novice
746 Views

I'm sorry for that I describe not clearly.

If I count it flops, I use ( fval / time). I doubt that why fval=10G, while not fval=20G.

If FMA instructions increment the counter twice, I should N*LINE*2 (10000*1000000*2 = 20G). 

 

0 Kudos
Thomas_G_4
New Contributor II
746 Views

Sorry for my mistake. I should have read the documentation beforehand.

I agree with Mr. McCalpin, the wording could be clearer. If an event is named to count instructions, it should count the instructions and not the operations (in some cases). In my opinion the events should always count single operations, hence increment at a single unmasked DP AVX2 FMA by 8. Moreover, it should take into account whether vector elements are masked for the instruction and reduce the increment according to the mask.

Are you 100% sure that your compiled code contains FMAs? If it uses normal AVX2 operations but no FMAs, you have 2 AVX instructions per loop iteration taking 2 CPU cycles. FMAs execute the multiplication and addition in one cycle, hence you get twice the FLOP rate. This could be the explanation why you only see 14 GFLOP/s while expecting 28 GFLOP/s.

0 Kudos
GHui
Novice
746 Views

There is nothing different with the two ways(fma and no-fma).

[root@bdw-E5-2699 ~]# icc a.c -no-fma
[root@bdw-E5-2699 ~]# ./a.out
Start Calc
COUNT: 6.009688 GFlops dt=1.663980
COUNT: 6.819478 GFlops dt=1.466388
COUNT: 6.999130 GFlops dt=1.428749
COUNT: 6.984308 GFlops dt=1.431781
COUNT: 6.948044 GFlops dt=1.439254
^C
[root@bdw-E5-2699 ~]# icc a.c -fma
[root@bdw-E5-2699 ~]# ./a.out
Start Calc
COUNT: 6.115572 GFlops dt=1.635170
COUNT: 6.993232 GFlops dt=1.429954
COUNT: 7.098542 GFlops dt=1.408740
COUNT: 6.993760 GFlops dt=1.429846

0 Kudos
GHui
Novice
746 Views


[root@bdw-E5-2699 ~]# icc a.c -xCORE-AVX2
[root@bdw-E5-2699 ~]# ./a.out
Start Calc
COUNT: 10.649412 GFlops dt=0.939019
COUNT: 13.533448 GFlops dt=0.738910
COUNT: 14.235038 GFlops dt=0.702492
COUNT: 14.222081 GFlops dt=0.703132
^C
[root@bdw-E5-2699 ~]# icc a.c -xAVX
[root@bdw-E5-2699 ~]# ./a.out
Start Calc
COUNT: 10.445119 GFlops dt=0.957385
COUNT: 13.668466 GFlops dt=0.731611
COUNT: 13.991278 GFlops dt=0.714731
COUNT: 13.565263 GFlops dt=0.737177
^C
[root@bdw-E5-2699 ~]# icc a.c -xCORE-AVX-I
[root@bdw-E5-2699 ~]# ./a.out
Start Calc
COUNT: 10.847340 GFlops dt=0.921885
COUNT: 13.702910 GFlops dt=0.729772
COUNT: 13.990299 GFlops dt=0.714781
COUNT: 13.914650 GFlops dt=0.718667
COUNT: 13.876689 GFlops dt=0.720633
^C
[root@bdw-E5-2699 ~]# icc a.c -xCORE-AVX512
[root@bdw-E5-2699 ~]# ./a.out

Please verify that both the operating system and the processor support Inte                                                                                                                                                                     l(R) AVX512DQ, AVX512F, AVX512CD, AVX512BW and AVX512VL instructions.

[root@bdw-E5-2699 ~]# icc a.c -xMIC-AVX512
[root@bdw-E5-2699 ~]# ./a.out

Please verify that both the operating system and the processor support Inte                                                                                                                                                                     l(R) AVX512F, AVX512ER, AVX512PF and AVX512CD instructions.

0 Kudos
GHui
Novice
746 Views

[root@bdw-E5-2699 ~]# icc -v
icc version 16.0.3 (gcc version 4.8.5 compatibility)

 

0 Kudos
GHui
Novice
746 Views

It is strange to me.

+------------------------------------------+---------+--------+
|                   Event                  | Counter | Core 0 |
+------------------------------------------+---------+--------+
|             INSTR_RETIRED_ANY            |  FIXC0  |   93   |
|           CPU_CLK_UNHALTED_CORE          |  FIXC1  |   474  |
|           CPU_CLK_UNHALTED_REF           |  FIXC2  |   858  |
| FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE |   PMC0  |    0   |
|    FP_ARITH_INST_RETIRED_SCALAR_DOUBLE   |   PMC1  |    0   |
| FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE |   PMC2  |    0   |
+------------------------------------------+---------+--------+

 

CPU name:       Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
CPU type:       Intel Xeon Broadwell EN/EP/EX processor
CPU clock:      2.19 GHz

 

 

0 Kudos
McCalpinJohn
Honored Contributor III
747 Views

The only way to be sure of the generated code is to look at it....  You can do this with either the "-S" option on the compile line or with the "objdump -d" command on the executable.   I prefer the "-S" option because the comments in the assembler file map back to the line numbers in the source file.

Sometimes the compiler will generate multiple versions of the code depending on array sizes and alignments, and it can be hard to figure out which code is actually being run.  This can be minimized by making the arrays static with dimensions and loop bounds that are compile-time constants, and by compiling with optimization.  Sometimes it is helpful to add flags like "-fno-alias" and sometimes it is helpful to add "restrict" keywords to the array declarations, but these are not usually necessary for simple codes.  

The bigger problem with simple codes is ensuring that:

  1. All of the arrays are instantiated before use.   The easiest way to do this is to fill them with zeros before they are used for any other purpose.
  2. The results of all of the calculations feed into the output, so the compiler cannot eliminate the computations.  I typically sum up the results of the computation and print this sum, or sometimes I will just print one or a few elements of the output array at the end.

In the code above, the a[] array is overwritten with the same values for every iteration of the j loop, so the j loop does not actually need to be executed N times.  Some compilers can prove that the output of the final iteration is independent of all the prior iterations, so it will only execute the j loop once.  

In other cases (and this may be the key here), the compiler may not be able to make such a large change, but they same effect occurs at a smaller scale.   For example, if the compiler unrolls the outer loop once, the optimizer may recognize that the output of the first half is overwritten by the second half, so it eliminates the first half of each unrolled loop iteration.  This eliminates 1/2 of the arithmetic operations and would account for the factor of 2 discrepancy seen here.

To avoid this problem you need to set up a true data dependency across the iterations of j.  I would recommend something like:

#define N 10000
#define LINE 1000000

double sum, double scalar;

sum = 0.0;
scalar = 1.0;
for(j=0;j<N;j++)
{
  for(i=0;i<LINE;i++)
  {
    a += scalar*c;
  }
  sum += a;
  scalar = a;
}
printf("dummy sum %g\n",sum);

 

0 Kudos
Reply