Software Tuning, Performance Optimization & Platform Monitoring
Discussion around monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform monitoring

Why is my application not able to reach Core i7 920 peak FP performance?

Dan_Leakin
Beginner

I have a question about the peak FP performance of my Core i7 920.

I have an application that does a lot of MAC operations (basically a convolution), and even with multi-threading and SSE instructions I fall short of the CPU's peak FP performance by a factor of ~8x. While trying to find the reason, I ended up with a simplified code snippet, running on a single thread and without SSE instructions, which performs equally badly:

for(i=0; i<49335264; i++)
{
    data[i] += other_data[i] * other_data2[i];
}

If I'm correct (the data and other_data arrays are all FP), this piece of code requires:

49335264 * 2 = 98670528 FLOPs

It executes in ~150 ms (I'm confident this timing is correct, since C timers and the Intel VTune Profiler give me the same result).

This means the performance of this code snippet is:

98670528 / (150 x 10^-3) / 10^9 ≈ 0.66 GFLOP/s

Where the peak performance of this CPU should be 2 x 3.2 = 6.4 GFLOP/s (2 FP units, 3.2 GHz processor), right?
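For reference, assuming the usual Nehalem arrangement of one 128-bit SSE add unit and one 128-bit SSE multiply unit per core, the scalar and packed peaks per core would work out as:

[plain]
scalar single precision: 2 FP units x 1 op/unit  x 3.2 GHz =  6.4 GFLOP/s
packed single precision: 2 FP units x 4 ops/unit x 3.2 GHz = 25.6 GFLOP/s
[/plain]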

Is there any explanation for this huge gap? I first thought the application was memory limited, but that would mean:

The peak STREAM bandwidth of my CPU is ~16.4 GB/s, right? Let's say every iteration requires 3 FP reads and 1 FP write, i.e. 16 bytes of bandwidth. That would be 49335264 x 16 = 789,364,224 bytes of traffic to main memory over the entire run (assuming nothing is cached), which completes in ~150 ms. This means I use 789,364,224 / (150 x 10^-3) / 10^9 ≈ 5.26 GB/s, so I would say I don't hit the bandwidth ceiling?

I also tried changing the operation within the loop to "data[i] += 2.0 * 5.0" to test whether this would improve performance, but it yields exactly the same performance.

Thanks a lot in advance; I could really use your help!

Patrick_F_Intel1
Employee
Hello Dan,
Let's look at your simplest case.

for(i=0; i<49335264; i++) { data[i] += 2.0 * 5.0; }

You could get the slow performance for the above loop if the data[] array is not initialized to valid values.
If data[] is not initialized, you could be incurring floating-point exceptions.
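If you want to rule out denormal operands quickly (uninitialized memory often holds bit patterns that decode to denormals, which are handled by a very slow assist path), you can force them to be treated as zero via the SSE control register. A minimal sketch, assuming a compiler that provides xmmintrin.h and pmmintrin.h:

[cpp]
#include <stdio.h>
#include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE (SSE)
#include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE (SSE3)

#define N 49335264
float data[N];           // stands in for the application's (possibly uninitialized) buffer

int main(void)
{
    // Treat denormal inputs and results as zero so they cannot
    // trigger the slow microcode-assist path in the FP units.
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);

    for (long i = 0; i < N; i++)
        data[i] += 2.0f * 5.0f;

    printf("data[0]= %f\n", data[0]);
    return 0;
}
[/cpp]

If the loop suddenly runs much faster with these modes enabled, denormals were the problem.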

The i7 920 processor has 3 memory channels. Do you have only 1 memory module (DIMM) in the system?
If so, the theoretical memory bandwidth is 1/3 x 25.6 GB/s ≈ 8.5 GB/s.

Assuming the above loop uses single-precision floating point, you are only getting about 1.31 GB/s (49335264 iterations x 4 bytes / 0.150 s).

Have you tried running the STREAM memory bandwidth benchmark? I'm curious what STREAM reports for your system.
STREAM source: http://www.cs.virginia.edu/stream/FTP/Code/ and website: http://www.cs.virginia.edu/stream/
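A typical way to build and run it, assuming the stream.c file from that directory and gcc:

[bash]
wget http://www.cs.virginia.edu/stream/FTP/Code/stream.c
gcc -O3 stream.c -o stream
./stream
[/bash]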

So... more questions than answers...
Pat

Dan_Leakin
Beginner
Thanks for your answer.
I'm sure the data array is initialized properly, and all data consists of single-precision floats.
I will run the benchmark tomorrow; thanks for the heads-up.

Does anybody have any other ideas? Thanks!
Dan_Leakin
Beginner
Dear Pat,

I ran the benchmark and here is the result:

[plain]

-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 2000000, Offset = 0
Total memory required = 45.8 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 2642 microseconds.
(= 2642 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function Rate (MB/s) Avg time Min time Max time
Copy: 9060.1950 0.0036 0.0035 0.0037
Scale: 8824.8884 0.0036 0.0036 0.0037
Add: 9179.5820 0.0053 0.0052 0.0053
Triad: 8966.9781 0.0054 0.0054 0.0054
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------
[/plain]


Patrick_F_Intel1
Employee

Hello Dan,
Thanks for your patience.
This analysis closely follows the posting http://software.intel.com/en-us/forums/showpost.php?p=177199.

I have a small program which reproduces most of the things about which you have questions.

[cpp]
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define NUM   49335264
#define TIMEi 1000

// return epoch time in seconds (with usec accuracy)
double dclock(void)
{
    double xxx;
    struct timeval tv2;
    gettimeofday(&tv2, NULL);
    xxx = (double)(tv2.tv_sec) + 1.0e-6*(double)tv2.tv_usec;
    return xxx;
}

#ifndef MEM_TRAFFIC
#define MEM_TRAFFIC 2
#endif
#if (MEM_TRAFFIC != 2) && (MEM_TRAFFIC != 4)
#error "MEM_TRAFFIC must be 2 or 4"
#endif

float a[NUM], b[NUM], c[NUM];

int main(int argc, char **argv)
{
    int i, m;
    long iMax = TIMEi;   // unused in this version
    double dt, result, ops, tot_ops, tot_time, tm_beg, tm_end;

    printf("init data\n");
    for(i=0; i<NUM; i++) { a[i] = b[i] = c[i] = 0.2; }

    printf("start FP op\n");
    tot_ops  = 0.0;
    tot_time = 0.0;
    for(m=0; m < 10; m++)
    {
        tm_beg = dclock();
        for(i=0; i<NUM; i++)
        {
#if (MEM_TRAFFIC == 4)
            a[i] += b[i] * c[i];   // 2 flops, 2 loads + 1 load/store of a[i] per iteration
#endif
#if (MEM_TRAFFIC == 2)
            a[i] += 2.0 * 5.0;     // 1 flop (the multiply is folded), 1 load + 1 store
#endif
        }
        tm_end = dclock();
        dt = tm_end - tm_beg;
#if (MEM_TRAFFIC == 4)
        ops = (double)2*NUM;
#endif
#if (MEM_TRAFFIC == 2)
        ops = (double)1*NUM;
#endif
        result = ops/dt/1000000.0;
        printf("m= %d, NUM= %d, MFlops= %f, Mop= %f, time= %f, MB/s= %f\n",
               m, NUM, result, 1.0e-6*ops, dt,
               1.0e-6*(double)(MEM_TRAFFIC*sizeof(float)*NUM)/dt);
        tot_ops  += ops;
        tot_time += dt;
    }
    if(argc > 10)  // just put this in so the compiler doesn't optimize everything away
    {
        float d = 0;
        for(i=0; i<NUM; i++) { d += a[i]; }
        printf("d= %f\n", d);
    }
    printf("tot_Mop= %f, tot_time= %f, overall Mops/sec= %f\n",
           1.0e-6*tot_ops, tot_time, 1.0e-6*tot_ops/tot_time);
    return 0;
}
[/cpp]


Let's compile with the command below:
snb-d2:/home/flops # gcc dan_fmac.c -o dan_fmac -g -DMEM_TRAFFIC=2

The '-DMEM_TRAFFIC=2' option makes the code just do 'a[i] += 2.0 * 5.0;'.
If we look at the assembly code (use the '-S -c -o dan_fmac.s' option to generate an assembly file), we see that the compiler precomputes the '2.0 * 5.0', so there is only 1 floating-point operation (the add) per iteration of the loop.
There is 1 load and 1 store per iteration.
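The per-iteration work then looks roughly like this (an illustrative scalar SSE loop, not verbatim gcc output):

[plain]
.L3:
        movss   a(%rax), %xmm0      # load a[i]
        addss   %xmm1, %xmm0        # add the precomputed 10.0f
        movss   %xmm0, a(%rax)      # store a[i]
        addq    $4, %rax            # advance sizeof(float) bytes
        cmpq    $197341056, %rax    # 49335264 * 4 bytes
        jne     .L3
[/plain]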
I'll run the program under 'perf' with the command below.
'perf stat -e cycles -e r2010 ./dan_fmac' says to collect the 'cycles' (clockticks) event and the raw event 'r2010'.
The 'r2010' means collect the event number 0x10 with umask 0x20. On Sandybridge, this is the FP_COMP_OPS_EXE.SSE_FP_SCALAR_SINGLE event.
See the SDM vol 3 section 19.3 for Sandy Bridge events and their encodings.
See http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html for the SDM.
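In other words, the raw code is formed as (umask << 8) | event_number:

[plain]
r2010  ->  umask 0x20, event 0x10  :  FP_COMP_OPS_EXE.SSE_FP_SCALAR_SINGLE
r4010  ->  umask 0x40, event 0x10  :  FP_COMP_OPS_EXE.SSE_PACKED_SINGLE
[/plain]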

Below is the command to run it and the output:

[bash]
snb-d2:/home/flops # perf stat -e cycles -e r2010 ./dan_fmac
init data
start FP op
m= 0, NUM= 49335264, MFlops= 384.267156, Mop= 49.335264, time= 0.128388, MB/s= 3074.137250
m= 1, NUM= 49335264, MFlops= 384.311404, Mop= 49.335264, time= 0.128373, MB/s= 3074.491232
m= 2, NUM= 49335264, MFlops= 384.272865, Mop= 49.335264, time= 0.128386, MB/s= 3074.182921
m= 3, NUM= 49335264, MFlops= 383.902145, Mop= 49.335264, time= 0.128510, MB/s= 3071.217159
m= 4, NUM= 49335264, MFlops= 384.387077, Mop= 49.335264, time= 0.128348, MB/s= 3075.096616
m= 5, NUM= 49335264, MFlops= 384.299984, Mop= 49.335264, time= 0.128377, MB/s= 3074.399874
m= 6, NUM= 49335264, MFlops= 384.425639, Mop= 49.335264, time= 0.128335, MB/s= 3075.405110
m= 7, NUM= 49335264, MFlops= 384.036805, Mop= 49.335264, time= 0.128465, MB/s= 3072.294437
m= 8, NUM= 49335264, MFlops= 384.359945, Mop= 49.335264, time= 0.128357, MB/s= 3074.879564
m= 9, NUM= 49335264, MFlops= 384.347809, Mop= 49.335264, time= 0.128361, MB/s= 3074.782472
tot_Mop= 493.352640, tot_time= 1.283900, overall Mops/sec= 384.261020

 Performance counter stats for './dan_fmac':

      5293118691 cycles
       528655357 raw 0x2010

     1.589042239 seconds time elapsed
[/bash]

So we see that I'm only getting 384 MFlops and about 3074 MB/sec of memory bandwidth.
Not good.
The 'perf' output shows that we are only getting about 1 add for every 10 cycles (528655357 adds vs. 5293118691 cycles).
The gcc compiler, if you don't specify an optimization level, apparently doesn't optimize much.

If I compile with -O3, things are a lot better.
Using the command: gcc dan_fmac.c -o dan_fmac -g -DMEM_TRAFFIC=2 -O3
I get this output:

[plain]
snb-d2:/home/flops # perf stat -e cycles -e r4010 ./dan_fmac
init data
start FP op
m= 0, NUM= 49335264, MFlops= 2169.911444, Mop= 49.335264, time= 0.022736, MB/s= 17359.291553
m= 1, NUM= 49335264, MFlops= 2171.619372, Mop= 49.335264, time= 0.022718, MB/s= 17372.954979
m= 2, NUM= 49335264, MFlops= 2169.638425, Mop= 49.335264, time= 0.022739, MB/s= 17357.107399
m= 3, NUM= 49335264, MFlops= 2166.390225, Mop= 49.335264, time= 0.022773, MB/s= 17331.121801
m= 4, NUM= 49335264, MFlops= 2170.025222, Mop= 49.335264, time= 0.022735, MB/s= 17360.201780
m= 5, NUM= 49335264, MFlops= 2166.390225, Mop= 49.335264, time= 0.022773, MB/s= 17331.121801
m= 6, NUM= 49335264, MFlops= 2169.456450, Mop= 49.335264, time= 0.022741, MB/s= 17355.651602
m= 7, NUM= 49335264, MFlops= 2167.048165, Mop= 49.335264, time= 0.022766, MB/s= 17336.385316
m= 8, NUM= 49335264, MFlops= 2169.638425, Mop= 49.335264, time= 0.022739, MB/s= 17357.107399
m= 9, NUM= 49335264, MFlops= 2167.048165, Mop= 49.335264, time= 0.022766, MB/s= 17336.385316
tot_Mop= 493.352640, tot_time= 0.227486, overall Mops/sec= 2168.715219

 Performance counter stats for './dan_fmac':

      1219427840 cycles
       222453725 raw 0x4010

     0.369353868 seconds time elapsed
[/plain]

Now the MFlops are about 5.6x higher, and the memory bandwidth is about 5.6x higher too.
For the above run I used the raw event 'r4010', which is the FP_COMP_OPS_EXE.SSE_PACKED_SINGLE event.
FP_COMP_OPS_EXE.SSE_PACKED_SINGLE counts SSE packed single-precision instructions.
Each packed SP instruction does 4 operations.
The compiler optimizations (at -O3) vectorize the code so that it uses packed SSE instructions.
So now, by the perf data, we are getting 0.72 floating-point operations per cycle.
That is, flops/cycle = 4 * 222453725 / 1219427840 ~= 0.72.
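Conceptually, the -O3 loop is doing something like the following sketch with SSE intrinsics (assuming 16-byte-aligned data and a trip count divisible by 4):

[cpp]
#include <xmmintrin.h>

#define NUM 49335264
float a[NUM];

void fmac_vectorized(void)
{
    // One packed add updates 4 floats per instruction;
    // 2.0 * 5.0 is folded to the constant 10.0f at compile time.
    const __m128 k = _mm_set1_ps(10.0f);
    for (int i = 0; i < NUM; i += 4)
    {
        __m128 v = _mm_load_ps(&a[i]);   // load 4 floats (aligned)
        v = _mm_add_ps(v, k);            // 4 single-precision adds
        _mm_store_ps(&a[i], v);          // store 4 floats (aligned)
    }
}
[/cpp]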

If I compile with -DMEM_TRAFFIC=4, the inner loop becomes 'a[i] += b[i] * c[i];'.

Without optimizations I get:

[plain]
snb-d2:/home/flops # gcc dan_fmac.c -o dan_fmac -g -DMEM_TRAFFIC=4
snb-d2:/home/flops # perf stat -e cycles -e r2010 ./dan_fmac
init data
start FP op
m= 0, NUM= 49335264, MFlops= 673.620407, Mop= 98.670528, time= 0.146478, MB/s= 5388.963256
m= 1, NUM= 49335264, MFlops= 674.167973, Mop= 98.670528, time= 0.146359, MB/s= 5393.343784
m= 2, NUM= 49335264, MFlops= 674.176759, Mop= 98.670528, time= 0.146357, MB/s= 5393.414071
m= 3, NUM= 49335264, MFlops= 673.786009, Mop= 98.670528, time= 0.146442, MB/s= 5390.288075
m= 4, NUM= 49335264, MFlops= 674.176759, Mop= 98.670528, time= 0.146357, MB/s= 5393.414071
m= 5, NUM= 49335264, MFlops= 674.176759, Mop= 98.670528, time= 0.146357, MB/s= 5393.414071
m= 6, NUM= 49335264, MFlops= 673.514069, Mop= 98.670528, time= 0.146501, MB/s= 5388.112556
m= 7, NUM= 49335264, MFlops= 673.799173, Mop= 98.670528, time= 0.146439, MB/s= 5390.393387
m= 8, NUM= 49335264, MFlops= 674.039506, Mop= 98.670528, time= 0.146387, MB/s= 5392.316047
m= 9, NUM= 49335264, MFlops= 673.859515, Mop= 98.670528, time= 0.146426, MB/s= 5390.876118
tot_Mop= 986.705280, tot_time= 1.464103, overall Mops/sec= 673.931609

 Performance counter stats for './dan_fmac':

      5900500805 cycles
      1017177303 raw 0x2010

     1.769067328 seconds time elapsed
[/plain]

This is similar to your results: about 674 MFlops, 5390 MB/s, and 5.8 cycles per flop (5900500805 cycles / 1017177303 scalar flops). Not too good.
Note that here I again use the raw event 'r2010' (the FP_COMP_OPS_EXE.SSE_FP_SCALAR_SINGLE event).

If I compile with optimizations (-O3) and run then I get:

[plain]
snb-d2:/home/pfay/flops # gcc dan_fmac.c -o dan_fmac -g -DMEM_TRAFFIC=4 -O3
snb-d2:/home/pfay/flops # perf stat -e cycles -e r4010 ./dan_fmac
init data
start FP op
m= 0, NUM= 49335264, MFlops= 2247.973614, Mop= 98.670528, time= 0.043893, MB/s= 17983.788910
m= 1, NUM= 49335264, MFlops= 2260.016330, Mop= 98.670528, time= 0.043659, MB/s= 18080.130637
m= 2, NUM= 49335264, MFlops= 2263.811601, Mop= 98.670528, time= 0.043586, MB/s= 18110.492811
m= 3, NUM= 49335264, MFlops= 2258.154265, Mop= 98.670528, time= 0.043695, MB/s= 18065.234119
m= 4, NUM= 49335264, MFlops= 2262.104007, Mop= 98.670528, time= 0.043619, MB/s= 18096.832060
m= 5, NUM= 49335264, MFlops= 2260.954690, Mop= 98.670528, time= 0.043641, MB/s= 18087.637520
m= 6, NUM= 49335264, MFlops= 2261.832020, Mop= 98.670528, time= 0.043624, MB/s= 18094.656163
m= 7, NUM= 49335264, MFlops= 2259.251402, Mop= 98.670528, time= 0.043674, MB/s= 18074.011214
m= 8, NUM= 49335264, MFlops= 2261.572457, Mop= 98.670528, time= 0.043629, MB/s= 18092.579659
m= 9, NUM= 49335264, MFlops= 2259.769522, Mop= 98.670528, time= 0.043664, MB/s= 18078.156177
tot_Mop= 986.705280, tot_time= 0.436685, overall Mops/sec= 2259.536339

 Performance counter stats for './dan_fmac':

      1918680883 cycles
       364863566 raw 0x4010

     0.577368584 seconds time elapsed
[/plain]

Now we see a 3.35x increase in Flops, BW and flops/cycle.

So presumably, when you compiled your program, optimizations were not enabled or the compiler was not able to vectorize the code.
You can compile with '-O3 -ftree-vectorizer-verbose=3' to see why the compiler is or isn't able to vectorize each loop.
Does this help?
Pat

