I have a question about the peak FP performance of my Core i7 920.
I have an application that does a lot of MAC operations (basically a convolution), and even with multi-threading and SSE instructions I fall short of the CPU's peak FP performance by a factor of ~8x. While trying to find the reason, I ended up with a simplified code snippet that runs on a single thread, does not use SSE instructions, and performs equally badly:
for (i = 0; i < 49335264; i++)
{
    data[i] += other_data[i] * other_data2[i];
}
If I'm correct (data, other_data and other_data2 are all FP arrays), this piece of code requires:
49335264 * 2 = 98670528 FLOPs
It executes in ~150 ms (I'm quite sure this timing is correct, since C timers and the Intel VTune profiler give me the same result).
This means the performance of this code snippet is:
98670528 / (150 * 10^-3) / 10^9 = 0.66 GFLOP/s
The peak performance of this CPU should be 2 * 3.2 = 6.4 GFLOP/s (2 FP units, 3.2 GHz processor), right?
Is there any explanation for this huge gap? I first thought the application might be memory-limited, but that would mean:
The peak STREAM bandwidth of my CPU is ~16.4 GB/s, right? Let's say every iteration requires 3 FP reads and 1 FP write, i.e. 16 bytes of traffic (4 bytes per value). Over the whole loop that is 49335264 * 16 = 789,364,224 bytes of traffic to main memory (assuming nothing is cached), in ~150 ms. That means I use 789,364,224 / (150 * 10^-3) / 10^9 = 5.26 GB/s. So I would say I don't hit the bandwidth ceiling?
I also tried changing the operation within the loop to "data += 2.0 * 5.0" to test whether that would improve the performance, but the run time is exactly the same.
Thanks a lot in advance, I could really use your help!