Hi:
I wrote a simple application to test its FLOP count using SDE; the code is as follows:
#include <stdio.h>
#include <stdlib.h>

float addSelf(float a, float b)
{
    return a + b;
}

int main()
{
    float a = 10.0;
    float b = 7.0;
    int i = 0;
    float c = 0.0;
    for (i = 0; i < 999; i++)
    {
        c = addSelf(a, b);
    }
    printf("c = %f\n", c);
    return 0;
}
The processor is an i7-7500U, the OS is Windows 10, and the IDE is Code::Blocks. I downloaded SDE ("sde-external-8.16.0-2018-01-30-win") and ran it with the command: sde -mix -- application.exe. In the output file "sde-mix-out.txt" I searched for "elements_fp" and found nothing; I searched for "FMA" and found nothing either. Does this mean there are no floating-point calculations in this application? That is obviously impossible.
What is the problem?
- Tags:
- Intel® Advanced Vector Extensions (Intel® AVX)
- Intel® Streaming SIMD Extensions
- Parallel Computing
Precise FLOPS computation is non-trivial on x86, in particular because of AVX-512 mask register usage.
That's why a dedicated "FLOPS profiler" was developed as part of Intel Advisor (which is included in Parallel Studio); see https://software.intel.com/en-us/articles/intel-advisor-flops and also https://software.intel.com/en-us/advisor/features/vectorization
This feature in Intel Advisor provides precise, AVX-512-aware per-loop and per-function FLOP, FLOP/S, and FLOP/Byte (arithmetic intensity, AI) measurement capabilities.
For the particular test program you chose there are several problems:
1) A compiler with exceedingly good optimization would see the for loop with a fixed iteration count, determine the content of addSelf, determine the initial values, and generate the result at compile time.
2) A compiler with good optimization would notice that a and b are loop invariant, so c need only be set once, and would eliminate the loop. (It would do this before doing the work of 1) above.)
3) In support of Zakhar's post, the code you showed is scalar, meaning one operation at a time; it could not make use of the SSE, AVX, AVX2, or AVX-512 small-vector instructions, which support 4, 8, 8, and 16 floats respectively.
4) Attained FLOPS is application dependent. Is the data vectorizable? To what extent? Does the data fit completely within registers? Within L1 cache? L2 cache? Last-level cache? Within the number of pages that can be mapped by the TLB? What kind of RAM? What mix of cache levels, and what competition for the memory subsystem? With Turbo? Is the program multi-threaded? And this is an incomplete set of the variables that determine FLOPS.
It might be helpful for you to explain how you intend to use or interpret FLOPS, as you cannot take the maximum FLOPS of a specific benchmark program and apply it to an application that bears no similarity to that benchmark.
Jim Dempsey
Taking loop-invariant code elimination into account, there are two ways to do this:
Calculate by hand the number of floating-point operations that the source code indicates would be required. Run the program, and divide that estimated (manual) operation count by the run time.
Or, build the program without vectorization and, if you have a profiler that can count instructions by type, run it to obtain the floating-point instruction counts. Then recompile the program with full optimizations, including vectorization, run it, and divide the un-optimized floating-point instruction count by the optimized run time. *** Note: this can be a misleading number, because optimization may remove loop-invariant code.
Jim Dempsey
Zakhar Matveev (Intel) wrote:
Precise FLOPS computation is non-trivial on x86, in particular because of AVX-512 mask register usage.
That's why a dedicated "FLOPS profiler" was developed as part of Intel Advisor (which is included in Parallel Studio); see https://software.intel.com/en-us/articles/intel-advisor-flops and also https://software.intel.com/en-us/advisor/features/vectorization
This feature in Intel Advisor provides precise, AVX-512-aware per-loop and per-function FLOP, FLOP/S, and FLOP/Byte (arithmetic intensity, AI) measurement capabilities.
Do you mean that Intel Advisor is better than SDE at counting FLOPS?
jimdempseyatthecove wrote:
For the particular test program you chose there are several problems:
1) A compiler with exceedingly good optimization would see the for loop with a fixed iteration count, determine the content of addSelf, determine the initial values, and generate the result at compile time.
2) A compiler with good optimization would notice that a and b are loop invariant, so c need only be set once, and would eliminate the loop. (It would do this before doing the work of 1) above.)
3) In support of Zakhar's post, the code you showed is scalar, meaning one operation at a time; it could not make use of the SSE, AVX, AVX2, or AVX-512 small-vector instructions, which support 4, 8, 8, and 16 floats respectively.
4) Attained FLOPS is application dependent. Is the data vectorizable? To what extent? Does the data fit completely within registers? Within L1 cache? L2 cache? Last-level cache? Within the number of pages that can be mapped by the TLB? What kind of RAM? What mix of cache levels, and what competition for the memory subsystem? With Turbo? Is the program multi-threaded? And this is an incomplete set of the variables that determine FLOPS.
It might be helpful for you to explain how you intend to use or interpret FLOPS, as you cannot take the maximum FLOPS of a specific benchmark program and apply it to an application that bears no similarity to that benchmark.
Jim Dempsey
Do you mean that if I want to count FLOPS, I must use instructions like SSE, AVX, AVX2, or AVX-512? Otherwise, SDE can't count the FLOPS?
Jim means that with an optimizing compiler your test code does not contain any floating-point operations at all, and that therefore the performance tools are telling you the truth.
Observe https://godbolt.org/g/LmXdoS, which takes the fundamentals of your code
float addSelf(float a, float b) { return a + b; }

float foo()
{
    float a = 10.0;
    float b = 7.0;
    int i = 0;
    float c = 0.0;
    for (i = 0; i < 999; i++)
    {
        c = addSelf(a, b);
    }
    return c;
}

and shows you the generated assembly code:
addSelf(float, float):
        addss   xmm0, xmm1
        ret
foo():
        movss   xmm0, DWORD PTR .LC0[rip]
        ret
.LC0:
        .long   1099431936
Notes about James' code:
The function addSelf was compiled but never called by the sample code. It is present in the event that an external procedure calls it.
In the body of foo, the result could be computed at compile time: the call to addSelf was eliminated, the loop in foo was eliminated, and the pre-computed result is returned (by convention in xmm0). The "by hand" number of floating-point operations (+) would be 999 (plus call overhead and the store into c). However, the actual count in this case is 0 operations.
While the result is correct, the method differs; thus counting floating-point instructions (in this example) produces a nonsense result.
In the case where AVX vector floating-point operations are used in the loop, FLOP estimation can also be problematic: a single AVX instruction can perform anywhere from 1 to 8 single-precision floating-point operations, depending on how many lanes carry useful data.
float foo()
{
    __declspec(align(32)) float a[1000];
    __declspec(align(32)) float b[1000];
    // ... a and b initialized
    float c = 0.0;
    for (int i = 0; i < 999; i++)
    {
        c = c + a[i] + b[i];
    }
    return c;
}
The above loop "might" perform 62 16-wide float adds, plus one 8-wide add, plus a few adds to sum horizontally (minimally 4).
IOW, counting instructions does not necessarily result in a straightforwardly definable FLOP count.
Jim Dempsey
Cownie, James H (Intel) wrote:
Jim means that with an optimizing compiler your test code does not contain any floating-point operations at all, and that therefore the performance tools are telling you the truth.
Observe https://godbolt.org/g/LmXdoS, which takes the fundamentals of your code
float addSelf(float a, float b) { return a + b; }

float foo()
{
    float a = 10.0;
    float b = 7.0;
    int i = 0;
    float c = 0.0;
    for (i = 0; i < 999; i++)
    {
        c = addSelf(a, b);
    }
    return c;
}

and shows you the generated assembly code:
addSelf(float, float):
        addss   xmm0, xmm1
        ret
foo():
        movss   xmm0, DWORD PTR .LC0[rip]
        ret
.LC0:
        .long   1099431936
So I changed the code as follows; there is no function other than main:
#include <stdio.h>
#include <stdlib.h>

int main()
{
    int num = 10000;
    float *arrayA = (float*)malloc(sizeof(float) * num);
    float *arrayB = (float*)malloc(sizeof(float) * num);
    float C = 0.0;
    int i = 0;
    if (arrayA == NULL || arrayB == NULL)
    {
        if (arrayA != NULL)
        {
            free(arrayA);
            arrayA = NULL;
        }
        if (arrayB != NULL)
        {
            free(arrayB);
            arrayB = NULL;
        }
        return 1; /* bail out on allocation failure */
    }
    for (i = 0; i < num; i++)
    {
        arrayA[i] = i * 0.1;
        arrayB[i] = i * 0.2;
    }
    for (i = 0; i < num; i++)
    {
        C += arrayA[i] * arrayB[i];
    }
    printf("C = %f\n", C);
    free(arrayA);
    free(arrayB);
    arrayA = NULL;
    arrayB = NULL;
    return 0;
}
After I run SDE, I still can't find "elements_fp" or "FMA" in the output file, which implies the FLOP count is 0, yet the program performs many floating-point calculations.
The output file is attached.
I have not used SDE, but a cursory look shows:
...
F:\wanwan\bin\Release\wanwan.exe 000000400000 000000408fff
...
BLOCK: 1 PC: 00401c50 ICOUNT: 100000 EXECUTIONS: 10000 #BYTES: 34 %: 2.67 cumltv%: 6.42 FN: unnamedImageEntryPoint IMG: F:\wanwan\bin\Release\wanwan.exe OFFSET: 1c50
XDIS 00401c50: BASE 89442418       mov dword ptr [esp+0x18], eax
XDIS 00401c54: X87  DB442418       fild st, dword ptr [esp+0x18]
XDIS 00401c58: X87  D9C0           fld st, st(0)
XDIS 00401c5a: X87  D8CA           fmul st, st(2)
XDIS 00401c5c: X87  D91C83         fstp dword ptr [ebx+eax*4], st
XDIS 00401c5f: X87  DC0D38304000   fmul st, qword ptr [0x403038]
XDIS 00401c65: X87  D91C86         fstp dword ptr [esi+eax*4], st
XDIS 00401c68: BASE 83C001         add eax, 0x1
XDIS 00401c6b: BASE 3D10270000     cmp eax, 0x2710
XDIS 00401c70: BASE 75DE           jnz 0x401c50
...
BLOCK: 4 PC: 00401c80 ICOUNT: 60000 EXECUTIONS: 10000 #BYTES: 18 %: 1.6 cumltv%: 11.7 FN: unnamedImageEntryPoint IMG: F:\wanwan\bin\Release\wanwan.exe OFFSET: 1c80
XDIS 00401c80: X87  D90483         fld st, dword ptr [ebx+eax*4]
XDIS 00401c83: X87  D80C86         fmul st, dword ptr [esi+eax*4]
XDIS 00401c86: BASE 83C001         add eax, 0x1
XDIS 00401c89: BASE 3D10270000     cmp eax, 0x2710
XDIS 00401c8e: X87  DEC1           faddp st(1), st
XDIS 00401c90: BASE 75EE           jnz 0x401c80
...
The above instructions are x87 (old FPU) instructions, not SSE or AVX2 (with or without FMA).
You apparently compiled this with an IA-32 option that forces use of the x87 FPU instruction set.
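If that is the case, one likely fix (a sketch; this assumes the MinGW gcc toolchain behind Code::Blocks, and app.c / app.exe are placeholder names, not the actual project files) is to request SSE scalar floating point instead of x87, which SDE's mix report classifies under elements_fp:

```shell
# 32-bit MinGW gcc defaults to x87 floating point;
# -mfpmath=sse together with -msse2 switches scalar FP to SSE instructions
gcc -O2 -msse2 -mfpmath=sse -o app.exe app.c

# alternatively, a 64-bit build uses SSE2 for floating point by default
gcc -m64 -O2 -o app.exe app.c

# then re-run SDE and search sde-mix-out.txt for elements_fp
sde -mix -- app.exe
```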
Jim Dempsey