I want to understand FLOPs better. SNB uses execution ports, not a separate FPU, to do floating-point operations. What program code could test SNB FLOPs? Or how can FLOPs be calculated in code?
This is a complicated question.
FLOP is just a floating point operation.
There is an article on measuring the FLOPs on SNB (and other processor families) at http://software.intel.com/en-us/articles/estimating-flops-using-event-based-sampling-ebs/
Hopefully this helps,
Pat
I ran an "a+=b*c" calculation. It achieves about 529M per second on Nehalem, but other tools that monitor via the PMU report about 1864M per second. How should I understand this discrepancy?
[cpp]#include
I'll try running your program this weekend.
Pat
When I compile the program with 'gcc -O0 -g ghui_flops.c -o ghui_flops' on my Sandy Bridge "Intel Core i7-2820QM CPU @ 2.30GHz" processor and run it, I get:
snb-d2:/home/pfay/flops # ./ghui_flops
init data
start FP op
MFlops:726.550135 0.550547
MFlops:734.875382 0.544310
The assembly code shows that the generated code for the inner loop loads each b[] and c[] value, does the multiply, then adds the value to a[] and stores it. The compiler generates scalar SSE instructions, which use only 1 of the 4 available single-precision values in the xmm* registers.
I modified your program to print the Ops (in this case floating-point ops) and to do only 9 outer loops.
And I ran it under 'perf stat' using the FP_COMP_OPS_EXE.SSE_FP_SCALAR_SINGLE event (event=0x10, umask=0x20).
See the SDM vol 3 section 19.3 for Sandy Bridge events.
To specify a 'raw' event with the 'perf' utility, you have to say '-e rXXYY' where XX is the umask and YY is the event number.
snb-d2:/home/pfay/flops # perf stat -e r2010 ./ghui_flops
init data
start FP op
MFlops:734.230559 Mops= 400.000000 0.544788
MFlops:734.408536 Mops= 400.000000 0.544656
MFlops:734.424691 Mops= 400.000000 0.544644
MFlops:733.564013 Mops= 400.000000 0.545283
MFlops:734.520429 Mops= 400.000000 0.544573
MFlops:734.523162 Mops= 400.000000 0.544571
MFlops:733.920691 Mops= 400.000000 0.545018
MFlops:734.270968 Mops= 400.000000 0.544758
MFlops:733.686477 Mops= 400.000000 0.545192
tot_Mops= 3600.000000, tot_time= 4.903483, overall Mops/sec= 734.172011
Performance counter stats for './ghui_flops':
3635739254 raw 0x2010
4.904172008 seconds time elapsed
Note that the FP_COMP_OPS_EXE.SSE_FP_SCALAR_SINGLE*1.0e-6 count is almost the same as the tot_Mops (3635.7 count versus tot_Mops= 3600).
The Mops calculation expects 2 SSE_FP_SCALAR_SINGLE operations per iteration. The 2 ops are the multiply and the add.
If I now compile with -O to get some optimizations:
snb-d2:/home/pfay/flops # gcc ghui_flops.c -o ghui_flops -g -O
And run it:
snb-d2:/home/pfay/flops # perf stat -e r2010 ./ghui_flops
init data
start FP op
MFlops:2172.460865 Mops= 400.000000 0.184123
MFlops:2223.580994 Mops= 400.000000 0.179890
MFlops:2222.938540 Mops= 400.000000 0.179942
MFlops:2221.691522 Mops= 400.000000 0.180043
MFlops:2222.963208 Mops= 400.000000 0.179940
MFlops:2223.358514 Mops= 400.000000 0.179908
MFlops:2223.420391 Mops= 400.000000 0.179903
MFlops:2222.827359 Mops= 400.000000 0.179951
MFlops:2223.148600 Mops= 400.000000 0.179925
tot_Mops= 3600.000000, tot_time= 1.623625, overall Mops/sec= 2217.260765
Performance counter stats for './ghui_flops':
1800206977 raw 0x2010
1.625062683 seconds time elapsed
Now we are getting near 1 Flop per clocktick... what's going on?
Generate the assembly: gcc ghui_flops.c -o ghui_flops.s -g -O -S -c
Inspecting ghui_flops.s shows the load/multiply/add/store loop has been replaced with (more or less)
for(i=0;i<...;i++){
float x = b[i]*c[i];
for(j=0;j<...;j++){
a[i]+=x;
}
}
So the multiply has been moved out of the inner loop. This explains why 'perf stat' reports half the expected number of flops.
Why are we only using 1 of the 4 single precision values in the xmm registers?
If we swap the 2 'for()' loops to:
Then the compiler can auto-vectorize and we can compile it with:
gcc ghui_flops.c -o ghui_flops -g -O3 -ftree-vectorizer-verbose=3
The -ftree... option tells you which loops it can vectorize.
Then when we run we see
Now we are running 3.5x faster. We are getting about 2.6 Flops/cycle.
Note that I changed the event to 'r4010', which is event FP_COMP_OPS_EXE.SSE_PACKED_SINGLE (event=0x10, umask=0x40). And I added the '-e cycles' option to get clockticks.
The perf event count is 1,013.7 million packed instructions.
Each SSE_PACKED_SINGLE instruction does 4 operations, so we have to multiply the count by 4 to get ops.
We would expect at least 3600 Mops, and perf counted about 4000 Mops.
I'll attach my updated source code file in another post (this post is already too long).
Does this make sense?
Pat
[cpp]#include
Thank you! I'll try to use the code to verify the performance (in Flops) of my test computers.
Best regards,
Sergey
Thank you very much.
GHui.
One of the loops in my code has an error. The loop:
[cpp]for(j=0;j
should be:
[cpp]for(j=0;j
The loop will only get executed if someone ever enters 10 args to the program.
I put the loop in because, for one intermediate version of my code, the compiler optimized away the loops.
Pat
Best regards,
Sergey
For several days I have been unable to understand the following piece of code. Can it change what the compiler generates? How can this piece of code affect the compiler?
[cpp] if(argc > 10) // just put this in so compiler doesn't optimize everything away. { float d=0; for(j=0;j
Sometimes compilers will optimize away sections of code.
For instance, if the compiler sees that the result array a[] isn't used anywhere and nothing depends on the result, why not just delete the whole 'a += b * c;' loop?
This is what happened when I compiled before I added the 'if(argc > 10)' logic.
The 'if(argc > 10)' logic makes it so the compiler can't say (at compile time) whether the results in the a[] array are used or not. So the compiler leaves in the 'a += b * c;' loop.
Does this make sense?
Pat
Sometimes compilers will optimize away sections of code.
For instance, if the compiler sees that the result array a[] isn't used anywhere and nothing depends on the result, why not just delete the whole 'a += b * c;' loop?
...
I'm always concerned in such cases.