topic The code of flop in Software Tuning, Performance Optimization & Platform Monitoring

The code of flop

GHui — Wed, 29 Feb 2012 16:03:29 GMT

I want to understand FLOPs better. SNB uses execution ports, not a separate FPU, to do floating-point operations. What program code could test FLOPs on SNB? Or how can I calculate FLOPs in code?

Patrick_F_Intel1 — Wed, 29 Feb 2012 16:32:44 GMT

Hello GHui,
This is a complicated question.
FLOP is just a floating point operation.
There is an article on measuring the FLOPs on SNB (and other processor families) at http://software.intel.com/en-us/articles/estimating-flops-using-event-based-sampling-ebs/
Hopefully this helps,
Pat

GHui — Thu, 01 Mar 2012 05:33:03 GMT

Thanks, Pat. The material is very useful.
I ran an "a+=b*c" calculation. It reaches about 529M per second on Nehalem, but another tool that monitors via the PMU displays about 1864M per second. How should I understand this difference?

GHui — Thu, 01 Mar 2012 09:00:18 GMT

The code is like this.

[cpp]
/* The header names were lost in the post; these two are the ones this code needs. */
#include <stdio.h>
#include <sys/time.h>

#define NUM 20000
#define TIMEi 10000

int main(int argc,char **argv)
{
    float a[NUM],b[NUM],c[NUM];
    long i,j,k;
    double result;
    struct timeval tv1,tv2;

    printf("init data\n");
    for(i=0;i<NUM;i++)
    {
        a[i]=b[i]=c[i]=0.2;
    }
    printf("start FP op\n");
    long iMax=TIMEi;
    while(1)
    {
        gettimeofday(&tv1,NULL);
        for(i=0;i<NUM;i++)
        {
            for(j=0;j<iMax;j++)
            {
                a[i]+=b[i]*c[i];
            }
        }
        gettimeofday(&tv2,NULL);
        float dt=(float)(tv2.tv_sec*1000000+tv2.tv_usec-tv1.tv_sec*1000000-tv1.tv_usec)/1000000;
        //float dt=tv2.tv_usec-tv1.tv_usec;
        result=(double)2*iMax*NUM/dt/1000000;
        printf("MFlops:%lf %lf\n",result,dt);
    }
    return 0;
}
[/cpp]

SergeyKostrov — Thu, 01 Mar 2012 14:35:44 GMT

Did you have a chance to look at the Linpack 100x100 benchmark in C/C++ for PCs?

GHui — Fri, 02 Mar 2012 07:30:45 GMT

I have posted a reply. Can it be approved?

Guanghui — Fri, 02 Mar 2012 13:32:03 GMT

I compiled the code downloaded from http://www.netlib.org/benchmark/linpack-pc.c and ran it. My monitoring tool shows 34M, 435M, 1800M, 2045M, and so on.

Patrick_F_Intel1 — Fri, 02 Mar 2012 13:55:57 GMT

Hello GHui,
I'll try running your program this weekend.
Pat

Patrick_F_Intel1 — Sun, 04 Mar 2012 05:39:23 GMT

Hello GHui,
When I compile the program with 'gcc -O0 -g ghui_flops.c -o ghui_flops' on my Sandy Bridge "Intel Core i7-2820QM CPU @ 2.30GHz" processor and run it, I get:

snb-d2:/home/pfay/flops # ./ghui_flops
init data
start FP op
MFlops:726.550135 0.550547
MFlops:734.875382 0.544310

The assembly code shows that the inner loop loads each b[] and c[] value, does the multiply, then adds the value to a[] and stores it. The compiler generates SSE instructions but uses only 1 of the 4 available single-precision values in the xmm* registers.

I modified your program to print the ops (in this case floating-point ops) and to do only 9 outer loops.
And I ran it under 'perf stat' using the FP_COMP_OPS_EXE.SSE_FP_SCALAR_SINGLE event (event =0x10, umask= 0x20).
See the SDM vol 3 section 19.3 for sandy bridge events.
To specify a 'raw' event with the 'perf' utility, you have to say ' -e rXXYY' where XX is the mask and YY is the event number.
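
As a minimal sketch of the 'rXXYY' encoding described above, the helper below formats the unit mask and event number as two hex digits each. The name perf_raw_event is hypothetical (it is not part of perf); it only builds the string you would pass to 'perf stat -e'.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: build the raw-event string for 'perf stat -e'.
   Layout is 'r', then the unit mask (XX), then the event number (YY),
   each written as two hex digits. Returns the number of characters written. */
static int perf_raw_event(unsigned umask, unsigned event, char *buf, size_t len)
{
    assert(umask <= 0xff && event <= 0xff);
    return snprintf(buf, len, "r%02x%02x", umask, event);
}
```

For FP_COMP_OPS_EXE.SSE_FP_SCALAR_SINGLE (event 0x10, umask 0x20), calling perf_raw_event(0x20, 0x10, buf, sizeof buf) fills buf with the r2010 string used in the command below.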

snb-d2:/home/pfay/flops # perf stat -e r2010 ./ghui_flops
init data
start FP op
MFlops:734.230559 Mops= 400.000000 0.544788
MFlops:734.408536 Mops= 400.000000 0.544656
MFlops:734.424691 Mops= 400.000000 0.544644
MFlops:733.564013 Mops= 400.000000 0.545283
MFlops:734.520429 Mops= 400.000000 0.544573
MFlops:734.523162 Mops= 400.000000 0.544571
MFlops:733.920691 Mops= 400.000000 0.545018
MFlops:734.270968 Mops= 400.000000 0.544758
MFlops:733.686477 Mops= 400.000000 0.545192
tot_Mops= 3600.000000, tot_time= 4.903483, overall Mops/sec= 734.172011

Performance counter stats for './ghui_flops':
3635739254 raw 0x2010
4.904172008 seconds time elapsed

Note that the FP_COMP_OPS_EXE.SSE_FP_SCALAR_SINGLE*1.0e-6 count is almost the same as the tot_Mops (3635.7 count versus tot_Mops= 3600).
The Mops calculation expects 2 SSE_FP_SCALAR_SINGLE operations per iteration. The 2 ops are the multiply and the add.
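
The comparison above can be written out as a small sketch. The function names are hypothetical; the only assumption is the one stated in the post, namely 2 operations (one multiply, one add) per inner-loop iteration.

```c
#include <assert.h>

/* Expected floating-point operations for one timed pass of the
   a[i] += b[i]*c[i] kernel: one multiply plus one add per iteration. */
static double expected_flops(long outer, long inner)
{
    assert(outer >= 0 && inner >= 0);
    return 2.0 * (double)outer * (double)inner;
}

/* Ratio of a measured event count to the expected operation count. */
static double count_ratio(double counted, double expected)
{
    assert(expected > 0.0);
    return counted / expected;
}
```

With NUM = 20000 and TIMEi = 10000, expected_flops gives 400 million per pass, so 9 passes give 3600 million; the measured 3635739254 events are about 1% above that.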

If I now compile with -O to get some optimizations:
snb-d2:/home/pfay/flops # gcc ghui_flops.c -o ghui_flops -g -O

And run it:
snb-d2:/home/pfay/flops # perf stat -e r2010 ./ghui_flops
init data
start FP op
MFlops:2172.460865 Mops= 400.000000 0.184123
MFlops:2223.580994 Mops= 400.000000 0.179890
MFlops:2222.938540 Mops= 400.000000 0.179942
MFlops:2221.691522 Mops= 400.000000 0.180043
MFlops:2222.963208 Mops= 400.000000 0.179940
MFlops:2223.358514 Mops= 400.000000 0.179908
MFlops:2223.420391 Mops= 400.000000 0.179903
MFlops:2222.827359 Mops= 400.000000 0.179951
MFlops:2223.148600 Mops= 400.000000 0.179925
tot_Mops= 3600.000000, tot_time= 1.623625, overall Mops/sec= 2217.260765

Performance counter stats for './ghui_flops':
1800206977 raw 0x2010
1.625062683 seconds time elapsed

Now we are getting near 1 Flop per clocktick... what's going on?
Generate the assembly: gcc ghui_flops.c -o ghui_flops.s -g -O -S -c
Inspecting ghui_flops.s shows the load/multiply/add/store loop has been replaced with (more or less)

for(i=0;i<NUM;i++)
{
    float x = b[i]*c[i];
    for(j=0;j<iMax;j++)
    {
        a[i]+=x;
    }
}

So the multiply has been moved out of the inner loop. This explains why 'perf stat' reports half the expected number of flops.
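
To make the halving concrete, here is a sketch that counts multiplies and adds in software for both loop forms. The op_count struct and the two kernel names are hypothetical, added only for illustration; the counters are bookkeeping and are not what the PMU measures.

```c
#include <assert.h>

struct op_count { long mul; long add; };

/* Original form: a multiply and an add on every inner iteration. */
static struct op_count kernel_naive(float *a, const float *b, const float *c,
                                    long n, long reps)
{
    struct op_count ops = {0, 0};
    assert(n >= 0 && reps >= 0);
    for (long i = 0; i < n; i++) {
        for (long j = 0; j < reps; j++) {
            a[i] += b[i] * c[i];
            ops.mul++;
            ops.add++;
        }
    }
    return ops;
}

/* Hoisted form: the loop-invariant product is computed once per outer
   iteration, so only the add remains in the inner loop. */
static struct op_count kernel_hoisted(float *a, const float *b, const float *c,
                                      long n, long reps)
{
    struct op_count ops = {0, 0};
    assert(n >= 0 && reps >= 0);
    for (long i = 0; i < n; i++) {
        float x = b[i] * c[i];
        ops.mul++;
        for (long j = 0; j < reps; j++) {
            a[i] += x;
            ops.add++;
        }
    }
    return ops;
}
```

For n outer and reps inner iterations, the first form does n*reps multiplies and n*reps adds, while the second does only n multiplies and n*reps adds, which is roughly half the total when reps is large.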

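Why are we only using 1 of the 4 single precision values in the xmm registers?
With 'i' as the outer loop, the inner loop keeps updating the same a[i], so each add depends on the previous one and the compiler finds nothing to vectorize in the inner loop.
If we swap the 2 'for()' loops to:

[cpp]
for(j=0;j<iMax;j++)
{
    for(i=0;i<NUM;i++)
    {
        a[i]+=b[i]*c[i];
    }
}
[/cpp]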

Then the compiler can auto-vectorize and we can compile it with:
gcc ghui_flops.c -o ghui_flops -g -O3 -ftree-vectorizer-verbose=3
The -ftree... option tells you which loops it can vectorize.

Then when we run we see
[plain]
snb-d2:/home/pfay/flops # perf stat -e cycles -e r4010 ./ghui_flops
init data
start FP op
MFlops:7673.419523 Mops= 400.000000 0.052128
MFlops:7719.772351 Mops= 400.000000 0.051815
MFlops:7696.007600 Mops= 400.000000 0.051975
MFlops:7732.756103 Mops= 400.000000 0.051728
MFlops:7683.442094 Mops= 400.000000 0.052060
MFlops:7714.263820 Mops= 400.000000 0.051852
MFlops:7741.286464 Mops= 400.000000 0.051671
MFlops:7722.156901 Mops= 400.000000 0.051799
MFlops:7723.647785 Mops= 400.000000 0.051789
tot_Mops= 3600.000000, tot_time= 0.466817, overall Mops/sec= 7711.801490

Performance counter stats for './ghui_flops':
1557784235 cycles
1012898876 raw 0x4010
0.468101879 seconds time elapsed
[/plain]

Now we are running 3.5x faster. We are getting about 2.6 Flops/cycle (4 x 1012.9M packed events / 1557.8M cycles).
Note that I changed the event to 'r4010', which is event FP_COMP_OPS_EXE.SSE_PACKED_SINGLE (event # 0x10, umask=0x40). And I added the '-e cycles' option to get clockticks.

The perf event count is 1,012.9 million.
Each SSE_PACKED_SINGLE instruction does 4 operations, so we have to multiply it by 4 to get ops.
We would expect at least 3600 Mops, and perf counted about 4050 Mops.
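
The arithmetic above, as a short sketch. The function names are hypothetical, and the factor of 4 assumes every counted packed-single event operates on all 4 single-precision lanes of an xmm register.

```c
#include <assert.h>

/* Convert a count of packed-single SSE events to floating-point operations,
   assuming 4 single-precision lanes per event. */
static double packed_single_flops(double events)
{
    assert(events >= 0.0);
    return events * 4.0;
}

/* Floating-point operations per clocktick. */
static double flops_per_cycle(double flops, double cycles)
{
    assert(cycles > 0.0);
    return flops / cycles;
}
```

Feeding in the counts from the run above (1012898876 events and 1557784235 cycles) gives roughly 4050 Mops and about 2.6 Flops/cycle.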

I'll attach my updated source code file in another post (this post is already too long).
Does this make sense?
Pat

Patrick_F_Intel1 — Sun, 04 Mar 2012 05:48:45 GMT

Here is my modified version of your source code.

[cpp]
/* The header names were lost in the post; these two are the ones this code needs. */
#include <stdio.h>
#include <sys/time.h>

#define NUM 20000
#define TIMEi 10000

int main(int argc,char **argv)
{
    float a[NUM],b[NUM],c[NUM];
    int i,j,k, m;
    double result, ops, tot_ops, tot_time;
    struct timeval tv1,tv2;

    printf("init data\n");
    for(i=0;i<NUM;i++)
    {
        a[i]=b[i]=c[i]=0.2;
    }
    printf("start FP op\n");
    long iMax=TIMEi;
    m = 0;
    tot_ops = 0.0;
    tot_time = 0.0;
    while(++m < 10)
    {
        gettimeofday(&tv1,NULL);
        for(i=0;i<NUM;i++)
        {
            for(j=0;j<iMax;j++)
            {
                a[i]+=b[i]*c[i];
            }
        }
        gettimeofday(&tv2,NULL);
        float dt=(float)(tv2.tv_sec*1000000+tv2.tv_usec-tv1.tv_sec*1000000-tv1.tv_usec)/1000000;
        //float dt=tv2.tv_usec-tv1.tv_usec;
        ops = (double)2*iMax*NUM;
        result=ops/dt/1000000.0;
        printf("MFlops:%lf Mops= %f %lf\n",result,1.0e-6*ops, dt);
        tot_ops += ops;
        tot_time+= dt;
    }
    if(argc > 10) // just put this in so compiler doesn't optimize everything away.
    {
        float d=0;
        for(j=0;j<NUM;j++)
        {
            d += a[j];
        }
        printf("d= %f\n", d);
    }
    printf("tot_Mops= %f, tot_time= %f, overall Mops/sec= %f\n", 1.0e-6*tot_ops, tot_time, 1.0e-6*tot_ops/tot_time);
    return 0;
}
[/cpp]

SergeyKostrov — Sun, 04 Mar 2012 23:13:02 GMT

Hi Patrick,

Thank you! I'll try to use the code to verify the performance (in Flops) of my test computers.

Best regards,
Sergey

Guanghui — Mon, 05 Mar 2012 12:20:49 GMT

Hello Pat.
Thank you very much.
GHui.

Patrick_F_Intel1 — Mon, 05 Mar 2012 18:05:48 GMT

One of the loops in my code has an error. The loop:

[cpp]for(j=0;j; } [/cpp]

should be:

[cpp]
for(j=0;j<NUM;j++)
{
    d += a[j];
}
[/cpp]

The loop will only get executed if someone ever passes 10 or more args to the program.
I put the loop in because, for one intermediate version of my code, the compiler optimized away the loops.
Pat

SergeyKostrov — Tue, 06 Mar 2012 01:34:06 GMT

Thanks for the update, Patrick! I haven't had time yet to run the test, but I hope to spend some time on it soon.

Best regards,
Sergey

Guanghui — Tue, 06 Mar 2012 16:40:35 GMT

Hi Pat,

For the past several days I have not understood the following piece of code. Can it change what the compiler generates? How can this piece of code affect the compiler?

[cpp]
if(argc > 10) // just put this in so compiler doesn't optimize everything away.
{
    float d=0;
    for(j=0;j<NUM;j++)
    {
        d += a[j];
    }
    printf("d= %f\n", d);
}
[/cpp]

Patrick_F_Intel1 — Tue, 06 Mar 2012 17:10:03 GMT

Hello Guanghui,
Sometimes compilers will optimize away sections of code.
For instance, if the compiler sees that the result array a[] isn't used anywhere and nothing depends on the result, why not just delete the whole 'a += b * c;' loop?
This is what happened when I compiled before I added the 'if(argc > 10)' logic.

The 'if(argc > 10)' logic makes it so the compiler can't tell (at compile time) whether the results in the a[] array are used or not. So the compiler leaves in the 'a += b * c;' loop.
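
A minimal sketch of the same idea, pulled out into two functions. The names checksum and maybe_report are hypothetical; the point is only that the results in a[] feed a value that might be printed, depending on something known only at run time.

```c
#include <assert.h>
#include <stdio.h>

/* Sum the result array so that every element of a[] contributes to
   a value the program may print. */
static float checksum(const float *a, long n)
{
    float d = 0.0f;
    assert(n >= 0);
    for (long j = 0; j < n; j++) {
        d += a[j];
    }
    return d;
}

/* Print the checksum only when argc > 10. Since argc is not known at
   compile time, the compiler has to keep the work that produced a[].
   Returns 1 if the checksum was printed, 0 otherwise. */
static int maybe_report(int argc, const float *a, long n)
{
    if (argc > 10) {
        printf("d= %f\n", checksum(a, n));
        return 1;
    }
    return 0;
}
```

In a benchmark you would call maybe_report(argc, a, NUM) after the timed loops; in a normal run nothing is printed, so the timing is not disturbed.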
Does this make sense?
Pat

Guanghui — Wed, 07 Mar 2012 11:02:41 GMT

Yes. Thank you very much, Pat.

SergeyKostrov — Wed, 07 Mar 2012 15:24:11 GMT

Quoting Patrick Fay (Intel)

...
Sometimes compilers will optimize away sections of code.
For instance, if the compiler sees that the result array a[] isn't used anywhere and nothing depends on the result, why not just delete the whole 'a += b * c;' loop?
...

I'm always concerned in such cases.