Hi:
I wrote a simple application to test its FLOP count using SDE; the code is as follows:
#include <stdio.h>
#include <stdlib.h>

float addSelf(float a, float b)
{
    return a + b;
}

int main()
{
    float a = 10.0;
    float b = 7.0;
    int i = 0;
    float c = 0.0;
    for (i = 0; i < 999; i++)
    {
        c = addSelf(a, b);
    }
    printf("c = %f\n", c);
    return 0;
}
The processor is an i7-7500U, the OS is Windows 10, and the IDE is Code::Blocks. I downloaded SDE ("sde-external-8.16.0-2018-01-30-win") and ran it with the command: sde -mix -- application.exe. In the output file "sde-mix-out.txt" I searched for "elements_fp" and found nothing; I searched for "FMA" and found nothing either. Does this mean there are no floating-point calculations in this application? That is obviously impossible.
What is the problem?
- Tags:
- Intel® Advanced Vector Extensions (Intel® AVX)
- Intel® Streaming SIMD Extensions
- Parallel Computing
Precise FLOPS computation is non-trivial on x86, in particular because of AVX-512 mask register usage.
That's why a dedicated "FLOPS profiler" was developed as part of Intel Advisor (which is included in Parallel Studio); see https://software.intel.com/en-us/articles/intel-advisor-flops and also https://software.intel.com/en-us/advisor/features/vectorization
This feature in Intel Advisor provides precise, AVX-512-aware per-loop and per-function FLOP, FLOP/S, and FLOP/Byte (arithmetic intensity, AI) measurement capabilities.
For the particular test program you chose there are several problems:
1) A compiler with exceedingly good optimization would see the for loop with a fixed iteration count, determine the content of addSelf, determine the initial values, and generate the result at compile time.
2) A compiler with good optimization would notice that a and b are loop invariant, so c need only be set once, and would eliminate the loop. (It would do this before doing the work of 1) above.)
3) In support of Zakhar's post, the code you showed is scalar, meaning one operation at a time; it could not make use of the SSE, AVX, AVX2, or AVX-512 small-vector instructions, which support 4, 8, 8, and 16 floats respectively.
4) Attained FLOPS is application dependent. Is the data vectorizable? To what extent? Does the data fit completely within registers? Within L1 cache? L2 cache? Last-level cache? Within the number of pages that can be mapped by the TLB? What kind of RAM? What mix of cache levels, and what competition for the memory subsystem? With Turbo? Is the program multi-threaded? And this is an incomplete set of the variables that determine FLOPS.
It might be helpful for you to explain how you intend to use or interpret FLOPS, as you cannot take the maximum FLOPS of a specific benchmark program and apply it to an application that bears no similarity to that benchmark.
Jim Dempsey
Taking loop-invariant code elimination into account, there are two ways to do this:
Calculate by hand the number of floating-point operations that the source code indicates would be required. Run the program, and divide that estimated (manual) operation count by the run time.
Or, build the program without vectorization and, if you have a profiler that can count instructions by type, run it to obtain the floating-point instruction counts. Then recompile the program with full optimizations, including vectorization, run it, and divide the un-optimized floating-point instruction count by the optimized run time. *** Note: this can be a misleading number, because optimization may remove loop-invariant code.
Jim Dempsey
Zakhar Matveev (Intel) wrote:
Precise FLOPS computation is non-trivial on x86, in particular because of AVX-512 mask register usage.
That's why a dedicated "FLOPS profiler" was developed as part of Intel Advisor (which is included in Parallel Studio); see https://software.intel.com/en-us/articles/intel-advisor-flops and also https://software.intel.com/en-us/advisor/features/vectorization
This feature in Intel Advisor provides precise, AVX-512-aware per-loop and per-function FLOP, FLOP/S, and FLOP/Byte (arithmetic intensity, AI) measurement capabilities.
Do you mean that Intel Advisor is better than SDE at counting FLOPS?
jimdempseyatthecove wrote:
For the particular test program you chose there are several problems:
1) A compiler with exceedingly good optimization would see the for loop with a fixed iteration count, determine the content of addSelf, determine the initial values, and generate the result at compile time.
2) A compiler with good optimization would notice that a and b are loop invariant, so c need only be set once, and would eliminate the loop. (It would do this before doing the work of 1) above.)
3) In support of Zakhar's post, the code you showed is scalar, meaning one operation at a time; it could not make use of the SSE, AVX, AVX2, or AVX-512 small-vector instructions, which support 4, 8, 8, and 16 floats respectively.
4) Attained FLOPS is application dependent. Is the data vectorizable? To what extent? Does the data fit completely within registers? Within L1 cache? L2 cache? Last-level cache? Within the number of pages that can be mapped by the TLB? What kind of RAM? What mix of cache levels, and what competition for the memory subsystem? With Turbo? Is the program multi-threaded? And this is an incomplete set of the variables that determine FLOPS.
It might be helpful for you to explain how you intend to use or interpret FLOPS, as you cannot take the maximum FLOPS of a specific benchmark program and apply it to an application that bears no similarity to that benchmark.
Jim Dempsey
Do you mean that if I want to count FLOPS, I must use instructions like SSE, AVX, AVX2, or AVX-512? Otherwise, SDE can't count the FLOPS?
Jim means that with an optimizing compiler your test code does not contain any floating-point operations at all, and that therefore the performance tools are telling you the truth.
Observe https://godbolt.org/g/LmXdoS, which takes the fundamentals of your code
float addSelf(float a, float b) { return a + b; }

float foo()
{
    float a = 10.0;
    float b = 7.0;
    int i = 0;
    float c = 0.0;
    for (i = 0; i < 999; i++)
    {
        c = addSelf(a, b);
    }
    return c;
}

and shows you the generated assembly code:
addSelf(float, float):
        addss   xmm0, xmm1
        ret
foo():
        movss   xmm0, DWORD PTR .LC0[rip]
        ret
.LC0:
        .long   1099431936
Notes about James' code:
The function addSelf was compiled but never called by the sample code. It is present in the event that an external procedure calls it.
In the body of foo, the result could be computed at compile time: the call to addSelf was eliminated, the loop in foo was eliminated, and the pre-computed result is returned (by convention in xmm0). The "by hand" number of floating-point operations (+) would be 999 (plus call overhead and the store into c). However, the actual count in this case is 0 operations.
While the result is correct, the method differs; thus counting floating-point instructions (in this example) produces a nonsense result.
In the case where AVX vector floating-point operations are used in the loop, FLOP estimation can also be problematic: a single AVX instruction can perform anywhere from 1 to 8 single-precision floating-point operations, depending on how many lanes carry useful data.
float foo()
{
    __declspec(align(32)) float a[1000];
    __declspec(align(32)) float b[1000];
    // ... a and b initialized
    float c = 0.0;
    for (int i = 0; i < 999; i++)
    {
        c = c + a[i] + b[i];
    }
    return c;
}
The above loop "might" perform 62 16-wide float adds, plus one 8-wide add, plus a few adds to sum horizontally (minimally 4).
IOW, counting instructions does not necessarily result in a straightforwardly definable FLOP count.
Jim Dempsey
Cownie, James H (Intel) wrote:
Jim means that with an optimizing compiler your test code does not contain any floating-point operations at all, and that therefore the performance tools are telling you the truth.
Observe https://godbolt.org/g/LmXdoS, which takes the fundamentals of your code
float addSelf(float a, float b) { return a + b; }

float foo()
{
    float a = 10.0;
    float b = 7.0;
    int i = 0;
    float c = 0.0;
    for (i = 0; i < 999; i++)
    {
        c = addSelf(a, b);
    }
    return c;
}

and shows you the generated assembly code:
addSelf(float, float):
        addss   xmm0, xmm1
        ret
foo():
        movss   xmm0, DWORD PTR .LC0[rip]
        ret
.LC0:
        .long   1099431936
So I changed the code as follows; there is no function other than main:
#include <stdio.h>
#include <stdlib.h>

int main()
{
    int num = 10000;
    float *arrayA = (float*)malloc(sizeof(float) * num);
    float *arrayB = (float*)malloc(sizeof(float) * num);
    float C = 0.0;
    int i = 0;
    if (arrayA == NULL || arrayB == NULL)
    {
        if (arrayA != NULL)
        {
            free(arrayA);
            arrayA = NULL;
        }
        if (arrayB != NULL)
        {
            free(arrayB);
            arrayB = NULL;
        }
        return 1; /* bail out on allocation failure */
    }
    for (i = 0; i < num; i++)
    {
        arrayA[i] = i * 0.1;
        arrayB[i] = i * 0.2;
    }
    for (i = 0; i < num; i++)
    {
        C += arrayA[i] * arrayB[i];
    }
    printf("C = %f\n", C);
    free(arrayA);
    free(arrayB);
    arrayA = NULL;
    arrayB = NULL;
    return 0;
}
After I run SDE, I still can't find "elements_fp" or "FMA" in the output file, which implies the FLOP count is 0, yet the program performs many floating-point calculations.
The output file is attached.
I have not used SDE, but a cursory look shows:
...
F:\wanwan\bin\Release\wanwan.exe 000000400000 000000408fff
...
BLOCK: 1 PC: 00401c50 ICOUNT: 100000 EXECUTIONS: 10000 #BYTES: 34 %: 2.67 cumltv%: 6.42 FN: unnamedImageEntryPoint IMG: F:\wanwan\bin\Release\wanwan.exe OFFSET: 1c50
XDIS 00401c50: BASE 89442418       mov dword ptr [esp+0x18], eax
XDIS 00401c54: X87  DB442418       fild st, dword ptr [esp+0x18]
XDIS 00401c58: X87  D9C0           fld st, st(0)
XDIS 00401c5a: X87  D8CA           fmul st, st(2)
XDIS 00401c5c: X87  D91C83         fstp dword ptr [ebx+eax*4], st
XDIS 00401c5f: X87  DC0D38304000   fmul st, qword ptr [0x403038]
XDIS 00401c65: X87  D91C86         fstp dword ptr [esi+eax*4], st
XDIS 00401c68: BASE 83C001         add eax, 0x1
XDIS 00401c6b: BASE 3D10270000     cmp eax, 0x2710
XDIS 00401c70: BASE 75DE           jnz 0x401c50
...
BLOCK: 4 PC: 00401c80 ICOUNT: 60000 EXECUTIONS: 10000 #BYTES: 18 %: 1.6 cumltv%: 11.7 FN: unnamedImageEntryPoint IMG: F:\wanwan\bin\Release\wanwan.exe OFFSET: 1c80
XDIS 00401c80: X87  D90483         fld st, dword ptr [ebx+eax*4]
XDIS 00401c83: X87  D80C86         fmul st, dword ptr [esi+eax*4]
XDIS 00401c86: BASE 83C001         add eax, 0x1
XDIS 00401c89: BASE 3D10270000     cmp eax, 0x2710
XDIS 00401c8e: X87  DEC1           faddp st(1), st
XDIS 00401c90: BASE 75EE           jnz 0x401c80
...
The above instructions are x87 (old FPU) instructions, not SSE or AVX2 (with or without FMA).
You apparently compiled this with an IA-32 option that forces use of the x87 FPU instruction set.
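If that is the case, one likely fix (a sketch; this assumes the MinGW gcc toolchain behind Code::Blocks, and app.c / app.exe are placeholder names, not the actual project files) is to request SSE scalar floating point instead of x87, which SDE's mix report classifies under elements_fp:

```shell
# 32-bit MinGW gcc defaults to x87 floating point;
# -mfpmath=sse together with -msse2 switches scalar FP to SSE instructions
gcc -O2 -msse2 -mfpmath=sse -o app.exe app.c

# alternatively, a 64-bit build uses SSE2 for floating point by default
gcc -m64 -O2 -o app.exe app.c

# then re-run SDE and search sde-mix-out.txt for elements_fp
sde -mix -- app.exe
```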
Jim Dempsey