Software Tuning, Performance Optimization & Platform Monitoring
Discussion around monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform monitoring

Why is my application not able to reach Core i7 920 peak FP performance?

Dan_Leakin
Beginner

I have a question about the peak FP performance of my Core i7 920.

I have an application that does a lot of MAC operations (basically a convolution), and even with multi-threading and SSE instructions I fall short of the CPU's peak FP performance by a factor of ~8x. While trying to find the reason, I ended up with a simplified code snippet, running on a single thread and without SSE instructions, which performs equally badly:

for(i=0; i<49335264; i++)
{
    data[i] += other_data[i] * other_data2[i];
}

If I'm correct (the data and other_data arrays are all FP), this piece of code requires:

49335264 * 2 = 98670528 FLOPs

It executes in ~150 ms (I'm confident this timing is correct, since C timers and the Intel VTune Profiler give me the same result).

This means the performance of this code snippet is:

98670528 / (150 x 10^-3) / 10^9 ≈ 0.66 GFLOP/s

Where the peak performance of this CPU should be 2 x 3.2 = 6.4 GFLOP/s (2 FP units, 3.2 GHz processor), right?
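For reference, assuming the usual Nehalem arrangement of one 128-bit SSE add unit and one 128-bit SSE multiply unit per core, the scalar and packed peaks per core would work out as:

[plain]
scalar single precision: 2 FP units x 1 op/unit  x 3.2 GHz =  6.4 GFLOP/s
packed single precision: 2 FP units x 4 ops/unit x 3.2 GHz = 25.6 GFLOP/s
[/plain]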

Is there any explanation for this huge gap? I first thought the application was memory limited, but that would mean:

The peak STREAM bandwidth of my CPU is ~16.4 GB/s, right? Let's say every iteration requires 3 FP reads and 1 FP write, i.e. 16 bytes of bandwidth. That would be 49335264 x 16 = 789,364,224 bytes of traffic to main memory over the entire run (assuming nothing is cached), which completes in ~150 ms. This means I use 789,364,224 / (150 x 10^-3) / 10^9 ≈ 5.26 GB/s, so I would say I don't hit the bandwidth ceiling?

I also tried changing the operation within the loop to "data[i] += 2.0 * 5.0" to test whether this would improve performance, but it yields exactly the same performance.

Thanks a lot in advance; I could really use your help!

Patrick_F_Intel1
Employee
Hello Dan,
Let's look at your simplest case.

for(i=0; i<49335264; i++) { data[i] += 2.0 * 5.0; }

You could get the slow performance for the above loop if the data[] array is not initialized to valid values.
If data[] is not initialized, you could be incurring floating-point exceptions.
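If you want to rule out denormal operands quickly (uninitialized memory often holds bit patterns that decode to denormals, which are handled by a very slow assist path), you can force them to be treated as zero via the SSE control register. A minimal sketch, assuming a compiler that provides xmmintrin.h and pmmintrin.h:

[cpp]
#include <stdio.h>
#include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE (SSE)
#include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE (SSE3)

#define N 49335264
float data[N];           // stands in for the application's (possibly uninitialized) buffer

int main(void)
{
    // Treat denormal inputs and results as zero so they cannot
    // trigger the slow microcode-assist path in the FP units.
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);

    for (long i = 0; i < N; i++)
        data[i] += 2.0f * 5.0f;

    printf("data[0]= %f\n", data[0]);
    return 0;
}
[/cpp]

If the loop suddenly runs much faster with these modes enabled, denormals were the problem.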

The i7 920 processor has 3 memory channels. Do you have only 1 memory module (DIMM) in the system?
If so, the theoretical memory bandwidth is 1/3 x 25.6 GB/s ≈ 8.5 GB/s.

Assuming the above loop uses single-precision floating point, you are only getting about 1.31 GB/s (49335264 iterations x 4 bytes / 0.150 s).

Have you tried running the STREAM memory bandwidth benchmark? I'm curious what STREAM reports for your system.
STREAM source: http://www.cs.virginia.edu/stream/FTP/Code/ and website: http://www.cs.virginia.edu/stream/
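A typical way to build and run it, assuming the stream.c file from that directory and gcc:

[bash]
wget http://www.cs.virginia.edu/stream/FTP/Code/stream.c
gcc -O3 stream.c -o stream
./stream
[/bash]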

So... more questions than answers...
Pat

Dan_Leakin
Beginner
Thanks for your answer.
I'm sure the data array is initialized properly, and all data consists of single-precision floats.
I will run the benchmark tomorrow; thanks for the heads-up.

Does anybody have any other ideas? Thanks!
Dan_Leakin
Beginner
Dear Pat,

I ran the benchmark and here is the result:

[plain]

-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 2000000, Offset = 0
Total memory required = 45.8 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 2642 microseconds.
(= 2642 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function Rate (MB/s) Avg time Min time Max time
Copy: 9060.1950 0.0036 0.0035 0.0037
Scale: 8824.8884 0.0036 0.0036 0.0037
Add: 9179.5820 0.0053 0.0052 0.0053
Triad: 8966.9781 0.0054 0.0054 0.0054
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------
[/plain]


Patrick_F_Intel1
Employee

Hello Dan,
Thanks for your patience.
This analysis closely follows the posting http://software.intel.com/en-us/forums/showpost.php?p=177199.

I have a small program which reproduces most of the things about which you have questions.

[cpp]
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define NUM   49335264
#define TIMEi 1000

// return epoch time in seconds (with usec accuracy)
double dclock(void)
{
    double xxx;
    struct timeval tv2;
    gettimeofday(&tv2, NULL);
    xxx = (double)(tv2.tv_sec) + 1.0e-6*(double)tv2.tv_usec;
    return xxx;
}

#ifndef MEM_TRAFFIC
#define MEM_TRAFFIC 2
#endif
#if (MEM_TRAFFIC != 2) && (MEM_TRAFFIC != 4)
#error "MEM_TRAFFIC must be 2 or 4"
#endif

float a[NUM], b[NUM], c[NUM];

int main(int argc, char **argv)
{
    int i, m;
    long iMax = TIMEi;   // unused in this version
    double dt, result, ops, tot_ops, tot_time, tm_beg, tm_end;

    printf("init data\n");
    for(i=0; i<NUM; i++) { a[i] = b[i] = c[i] = 0.2; }

    printf("start FP op\n");
    tot_ops  = 0.0;
    tot_time = 0.0;
    for(m=0; m < 10; m++)
    {
        tm_beg = dclock();
        for(i=0; i<NUM; i++)
        {
#if (MEM_TRAFFIC == 4)
            a[i] += b[i] * c[i];   // 2 flops, 2 loads + 1 load/store of a[i] per iteration
#endif
#if (MEM_TRAFFIC == 2)
            a[i] += 2.0 * 5.0;     // 1 flop (the multiply is folded), 1 load + 1 store
#endif
        }
        tm_end = dclock();
        dt = tm_end - tm_beg;
#if (MEM_TRAFFIC == 4)
        ops = (double)2*NUM;
#endif
#if (MEM_TRAFFIC == 2)
        ops = (double)1*NUM;
#endif
        result = ops/dt/1000000.0;
        printf("m= %d, NUM= %d, MFlops= %f, Mop= %f, time= %f, MB/s= %f\n",
               m, NUM, result, 1.0e-6*ops, dt,
               1.0e-6*(double)(MEM_TRAFFIC*sizeof(float)*NUM)/dt);
        tot_ops  += ops;
        tot_time += dt;
    }
    if(argc > 10)  // just put this in so the compiler doesn't optimize everything away
    {
        float d = 0;
        for(i=0; i<NUM; i++) { d += a[i]; }
        printf("d= %f\n", d);
    }
    printf("tot_Mop= %f, tot_time= %f, overall Mops/sec= %f\n",
           1.0e-6*tot_ops, tot_time, 1.0e-6*tot_ops/tot_time);
    return 0;
}
[/cpp]


Let's compile with the command below:
snb-d2:/home/flops # gcc dan_fmac.c -o dan_fmac -g -DMEM_TRAFFIC=2

The '-DMEM_TRAFFIC=2' option makes the code just do 'a[i] += 2.0 * 5.0;'.
If we look at the assembly code (use the '-S -c -o dan_fmac.s' option to generate an assembly file), we see that the compiler precomputes the '2.0 * 5.0', so there is only 1 floating-point operation (the add) per iteration of the loop.
There is 1 load and 1 store per iteration.
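The per-iteration work then looks roughly like this (an illustrative scalar SSE loop, not verbatim gcc output):

[plain]
.L3:
        movss   a(%rax), %xmm0      # load a[i]
        addss   %xmm1, %xmm0        # add the precomputed 10.0f
        movss   %xmm0, a(%rax)      # store a[i]
        addq    $4, %rax            # advance sizeof(float) bytes
        cmpq    $197341056, %rax    # 49335264 * 4 bytes
        jne     .L3
[/plain]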
I'll run the program under 'perf' with the command below.
'perf stat -e cycles -e r2010 ./dan_fmac' says to collect the 'cycles' (clockticks) event and the raw event 'r2010'.
The 'r2010' means collect the event number 0x10 with umask 0x20. On Sandybridge, this is the FP_COMP_OPS_EXE.SSE_FP_SCALAR_SINGLE event.
See the SDM vol 3 section 19.3 for Sandy Bridge events and their encodings.
See http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html for the SDM.
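In other words, the raw code is formed as (umask << 8) | event_number:

[plain]
r2010  ->  umask 0x20, event 0x10  :  FP_COMP_OPS_EXE.SSE_FP_SCALAR_SINGLE
r4010  ->  umask 0x40, event 0x10  :  FP_COMP_OPS_EXE.SSE_PACKED_SINGLE
[/plain]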

Below is the command to run it and the output:

[bash]
snb-d2:/home/flops # perf stat -e cycles -e r2010 ./dan_fmac
init data
start FP op
m= 0, NUM= 49335264, MFlops= 384.267156, Mop= 49.335264, time= 0.128388, MB/s= 3074.137250
m= 1, NUM= 49335264, MFlops= 384.311404, Mop= 49.335264, time= 0.128373, MB/s= 3074.491232
m= 2, NUM= 49335264, MFlops= 384.272865, Mop= 49.335264, time= 0.128386, MB/s= 3074.182921
m= 3, NUM= 49335264, MFlops= 383.902145, Mop= 49.335264, time= 0.128510, MB/s= 3071.217159
m= 4, NUM= 49335264, MFlops= 384.387077, Mop= 49.335264, time= 0.128348, MB/s= 3075.096616
m= 5, NUM= 49335264, MFlops= 384.299984, Mop= 49.335264, time= 0.128377, MB/s= 3074.399874
m= 6, NUM= 49335264, MFlops= 384.425639, Mop= 49.335264, time= 0.128335, MB/s= 3075.405110
m= 7, NUM= 49335264, MFlops= 384.036805, Mop= 49.335264, time= 0.128465, MB/s= 3072.294437
m= 8, NUM= 49335264, MFlops= 384.359945, Mop= 49.335264, time= 0.128357, MB/s= 3074.879564
m= 9, NUM= 49335264, MFlops= 384.347809, Mop= 49.335264, time= 0.128361, MB/s= 3074.782472
tot_Mop= 493.352640, tot_time= 1.283900, overall Mops/sec= 384.261020

 Performance counter stats for './dan_fmac':

      5293118691 cycles
       528655357 raw 0x2010

     1.589042239 seconds time elapsed
[/bash]

So we see that I'm only getting 384 MFlops and about 3074 MB/sec of memory bandwidth.
Not good.
The 'perf' output shows that we are only getting about 1 add for every 10 cycles (528655357 adds vs. 5293118691 cycles).
The gcc compiler, if you don't specify an optimization level, apparently doesn't optimize much.

If I compile with -O3, things are a lot better.
Using the command: gcc dan_fmac.c -o dan_fmac -g -DMEM_TRAFFIC=2 -O3
I get this output:

[plain]
snb-d2:/home/flops # perf stat -e cycles -e r4010 ./dan_fmac
init data
start FP op
m= 0, NUM= 49335264, MFlops= 2169.911444, Mop= 49.335264, time= 0.022736, MB/s= 17359.291553
m= 1, NUM= 49335264, MFlops= 2171.619372, Mop= 49.335264, time= 0.022718, MB/s= 17372.954979
m= 2, NUM= 49335264, MFlops= 2169.638425, Mop= 49.335264, time= 0.022739, MB/s= 17357.107399
m= 3, NUM= 49335264, MFlops= 2166.390225, Mop= 49.335264, time= 0.022773, MB/s= 17331.121801
m= 4, NUM= 49335264, MFlops= 2170.025222, Mop= 49.335264, time= 0.022735, MB/s= 17360.201780
m= 5, NUM= 49335264, MFlops= 2166.390225, Mop= 49.335264, time= 0.022773, MB/s= 17331.121801
m= 6, NUM= 49335264, MFlops= 2169.456450, Mop= 49.335264, time= 0.022741, MB/s= 17355.651602
m= 7, NUM= 49335264, MFlops= 2167.048165, Mop= 49.335264, time= 0.022766, MB/s= 17336.385316
m= 8, NUM= 49335264, MFlops= 2169.638425, Mop= 49.335264, time= 0.022739, MB/s= 17357.107399
m= 9, NUM= 49335264, MFlops= 2167.048165, Mop= 49.335264, time= 0.022766, MB/s= 17336.385316
tot_Mop= 493.352640, tot_time= 0.227486, overall Mops/sec= 2168.715219

 Performance counter stats for './dan_fmac':

      1219427840 cycles
       222453725 raw 0x4010

     0.369353868 seconds time elapsed
[/plain]

Now the MFlops are about 5.6x higher, and the memory bandwidth is about 5.6x higher too.
For the above run I used the raw event 'r4010', which is the FP_COMP_OPS_EXE.SSE_PACKED_SINGLE event.
FP_COMP_OPS_EXE.SSE_PACKED_SINGLE counts SSE packed single-precision instructions.
Each packed SP instruction does 4 operations.
The compiler optimizations (at -O3) vectorize the code so that it uses packed SSE instructions.
So now, by the perf data, we are getting 0.72 floating-point operations per cycle.
That is, flops/cycle = 4 * 222453725 / 1219427840 ~= 0.72.
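Conceptually, the -O3 loop is doing something like the following sketch with SSE intrinsics (assuming 16-byte-aligned data and a trip count divisible by 4):

[cpp]
#include <xmmintrin.h>

#define NUM 49335264
float a[NUM];

void fmac_vectorized(void)
{
    // One packed add updates 4 floats per instruction;
    // 2.0 * 5.0 is folded to the constant 10.0f at compile time.
    const __m128 k = _mm_set1_ps(10.0f);
    for (int i = 0; i < NUM; i += 4)
    {
        __m128 v = _mm_load_ps(&a[i]);   // load 4 floats (aligned)
        v = _mm_add_ps(v, k);            // 4 single-precision adds
        _mm_store_ps(&a[i], v);          // store 4 floats (aligned)
    }
}
[/cpp]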

If I compile with -DMEM_TRAFFIC=4, the inner loop becomes 'a[i] += b[i] * c[i];'.

Without optimizations I get:

[plain]
snb-d2:/home/flops # gcc dan_fmac.c -o dan_fmac -g -DMEM_TRAFFIC=4
snb-d2:/home/flops # perf stat -e cycles -e r2010 ./dan_fmac
init data
start FP op
m= 0, NUM= 49335264, MFlops= 673.620407, Mop= 98.670528, time= 0.146478, MB/s= 5388.963256
m= 1, NUM= 49335264, MFlops= 674.167973, Mop= 98.670528, time= 0.146359, MB/s= 5393.343784
m= 2, NUM= 49335264, MFlops= 674.176759, Mop= 98.670528, time= 0.146357, MB/s= 5393.414071
m= 3, NUM= 49335264, MFlops= 673.786009, Mop= 98.670528, time= 0.146442, MB/s= 5390.288075
m= 4, NUM= 49335264, MFlops= 674.176759, Mop= 98.670528, time= 0.146357, MB/s= 5393.414071
m= 5, NUM= 49335264, MFlops= 674.176759, Mop= 98.670528, time= 0.146357, MB/s= 5393.414071
m= 6, NUM= 49335264, MFlops= 673.514069, Mop= 98.670528, time= 0.146501, MB/s= 5388.112556
m= 7, NUM= 49335264, MFlops= 673.799173, Mop= 98.670528, time= 0.146439, MB/s= 5390.393387
m= 8, NUM= 49335264, MFlops= 674.039506, Mop= 98.670528, time= 0.146387, MB/s= 5392.316047
m= 9, NUM= 49335264, MFlops= 673.859515, Mop= 98.670528, time= 0.146426, MB/s= 5390.876118
tot_Mop= 986.705280, tot_time= 1.464103, overall Mops/sec= 673.931609

 Performance counter stats for './dan_fmac':

      5900500805 cycles
      1017177303 raw 0x2010

     1.769067328 seconds time elapsed
[/plain]

This is similar to your results: about 674 MFlops, 5390 MB/s, and 5.8 cycles per flop (5900500805 cycles / 1017177303 scalar flops). Not too good.
Note that here I again use the raw event 'r2010' (the FP_COMP_OPS_EXE.SSE_FP_SCALAR_SINGLE event).

If I compile with optimizations (-O3) and run then I get:

[plain]
snb-d2:/home/pfay/flops # gcc dan_fmac.c -o dan_fmac -g -DMEM_TRAFFIC=4 -O3
snb-d2:/home/pfay/flops # perf stat -e cycles -e r4010 ./dan_fmac
init data
start FP op
m= 0, NUM= 49335264, MFlops= 2247.973614, Mop= 98.670528, time= 0.043893, MB/s= 17983.788910
m= 1, NUM= 49335264, MFlops= 2260.016330, Mop= 98.670528, time= 0.043659, MB/s= 18080.130637
m= 2, NUM= 49335264, MFlops= 2263.811601, Mop= 98.670528, time= 0.043586, MB/s= 18110.492811
m= 3, NUM= 49335264, MFlops= 2258.154265, Mop= 98.670528, time= 0.043695, MB/s= 18065.234119
m= 4, NUM= 49335264, MFlops= 2262.104007, Mop= 98.670528, time= 0.043619, MB/s= 18096.832060
m= 5, NUM= 49335264, MFlops= 2260.954690, Mop= 98.670528, time= 0.043641, MB/s= 18087.637520
m= 6, NUM= 49335264, MFlops= 2261.832020, Mop= 98.670528, time= 0.043624, MB/s= 18094.656163
m= 7, NUM= 49335264, MFlops= 2259.251402, Mop= 98.670528, time= 0.043674, MB/s= 18074.011214
m= 8, NUM= 49335264, MFlops= 2261.572457, Mop= 98.670528, time= 0.043629, MB/s= 18092.579659
m= 9, NUM= 49335264, MFlops= 2259.769522, Mop= 98.670528, time= 0.043664, MB/s= 18078.156177
tot_Mop= 986.705280, tot_time= 0.436685, overall Mops/sec= 2259.536339

 Performance counter stats for './dan_fmac':

      1918680883 cycles
       364863566 raw 0x4010

     0.577368584 seconds time elapsed
[/plain]

Now we see a 3.35x increase in Flops, BW and flops/cycle.

So presumably, when you compiled your program, optimizations were not enabled or the compiler was not able to vectorize the code.
You can compile with '-O3 -ftree-vectorizer-verbose=3' to see why the compiler is or isn't able to vectorize each loop.
Does this help?
Pat

