OpenCL* for CPU
Ask questions and share information on Intel® SDK for OpenCL™ Applications and OpenCL™ implementations for Intel® CPU.
Announcements
This forum covers OpenCL* for CPU only. OpenCL* for GPU questions can be asked in the GPU Compute Software forum. Intel® FPGA SDK for OpenCL™ questions can be asked in the FPGA Intel® High Level Design forum.

i7-5775c Iris Pro vs CPU performance

logik_l_
Beginner

Hi,

I'm benchmarking the i7-5775C's 4 CPU cores against its Iris Pro 6200 for simple OpenCL kernels. I guess you could say I want to know when it makes sense to offload computation onto the IGP. One experiment has each work-item execute many FMA operations on a single input element to measure computational speed. I'm surprised to see the IGP outperform the CPU by nearly 9x, and by 18x with hyper-threading disabled:

OpenCL kernel:
kernel void fmaKernel(global float *out){

  // FMA_LOOP is a compile-time define (e.g. -DFMA_LOOP=1664)
  float sum = out[get_global_id(0)];

  for(int i = 0; i < FMA_LOOP; i++){
     sum = fma(1.02345f, (float)FMA_LOOP, sum);
  }

  out[get_global_id(0)] = sum;

}
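
FMA_LOOP isn't declared in the kernel, so it comes in as a build-time define; the host-side build step is roughly like this (a sketch, the exact options string in simpleOpencl may differ):

// Sketch of the build step: FMA_LOOP supplied via -D so the kernel's
// loop bound and multiplicand are compile-time constants.
clBuildProgram(program, 1, &device, "-DFMA_LOOP=1664", NULL, NULL);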

Results:
512x512 matrix, FMA_LOOP=1664

IGP ~ 1.06 ms
CPU w/ hyper threading ~9.66 ms
CPU w/o hyper threading ~19 ms

Question:
Why is the CPU so much slower than the IGP? From what I know about the hardware, I would expect the GPU to outperform the CPU by only about 2x. I base that on the Iris Pro 6200 being rated at 883 GFLOPS, while I estimate my 4 CPU cores at roughly 460 GFLOPS (I can't find a spec sheet that explicitly states the CPU GFLOP rating). I recognize that both peak figures count an FMA as two operations per cycle, which is why I divide by 2 below. A quick (very) theoretical calculation of computation time gives me:

CPU: 512 * 512 * 1664 / 10^9 / (460 GFLOPS / 2) * 1000 ms / s = 1.9 ms
IGP: 512 * 512 * 1664 / 10^9 / (883 GFLOPS / 2) * 1000 ms / s = 0.99 ms

So I feel like the IGP is performing as it should, while the CPU isn't even close. I can't tell if this is because I don't understand how the hardware works or because there is something up with the OpenCL implementation.

Config:
Ubuntu 14.04 LTS
Intel OpenCL 1.2-5.0.0.43 (CPU-x64)
Intel OpenCL 1.2-1.0 (Graphics Driver for HD Graphics, Iris, Iris Pro)

Run benchmark:
 ./simpleOpencl <platform> 24 10 1 512

For my system platform=0=IGP and platform=1=CPU.
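
To double-check which index is which on a given machine, a quick listing of the platforms works; this is just a sketch, not code from simpleOpencl:

#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    // List OpenCL platforms so the <platform> argument can be mapped
    // to the IGP vs. CPU implementation on this machine.
    cl_platform_id platforms[8];
    cl_uint count = 0;
    clGetPlatformIDs(8, platforms, &count);
    for (cl_uint i = 0; i < count; i++) {
        char name[256];
        clGetPlatformInfo(platforms[i], CL_PLATFORM_NAME, sizeof(name), name, NULL);
        printf("platform %u: %s\n", i, name);
    }
    return 0;
}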

4 Replies
Robert_I_Intel
Employee

Hi,

By my calculations, the peak GFlops of your part on the CPU side is 473.6 GFlops = 8 threads * 8 floats * 2 FMAs * 3.7 GHz. However, this site reports 105.49 GFlops for your part for Whetstone single-float, which is closer to what you see: http://techgage.com/article/intels-skylake-core-i7-6700k-a-performance-look/. So how do you reconcile what you see on the CPU side of things?

If you look at the optimization manual, p. 11-46 (http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf), the latency of VFMADD213PS is 5 cycles. Since one of your sources (sum) and your destination are the same, you are exposing this latency, so in your case the theoretical GFlops are reduced to 473.6 / 5 = 94.7, giving the GPU a 9.3x advantage. You are measuring a 9.1x advantage, which is pretty close.

It looks like the GPU, which has two physical SIMD4 FPUs per EU, can pipeline and hide instruction latency better in this case and actually achieve its theoretical GFlops. Typically, code like yours is compiled on the GPU as SIMD32 (8-cycle latency) or SIMD16 (4-cycle latency), but the FPUs are fully pipelined across the 7 threads of each EU, so instructions complete every cycle.

Georg_K_
Beginner

I believe peak flops should be calculated using the core count, not the thread count, as hyper-threading does not increase the number of execution units.

With 4 cores * 8 floats (256-bit wide FPU) * 2 (mul + add) * 3.7 GHz, I get 237 GFlops. And as Robert pointed out, the operation will be latency-limited on the CPU instead of throughput-limited, which is precisely why hyper-threading can double the performance.
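
Putting rough numbers on it: a single dependent FMA chain per core gives 4 cores * 8 floats * 2 * 3.7 GHz / 5 cycles ≈ 47 GFlops, and a second independent chain per core via hyper-threading roughly doubles that to the ~95 GFlops Robert derived, which lines up with the ~2x gap between the 19 ms and 9.66 ms measurements.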

logik_l_
Beginner

Excellent answers, that makes a lot of sense to me.

Regarding the GFLOPS calculation: per the architecture guide, page 38, section 2.2.2, there are 2 FMA units per core, giving 32 SP FLOPs/clock:
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf

So I agree it should be 4 cores, but there's an additional 2x for the 2 FMA ports. In light of that I can see how running 8 hardware threads instead of 4 could translate into a 2x difference. What I don't get is that my application should be creating many software threads (hundreds, in concept), so naively I'd expect it to still be able to utilize both FMA units.

My understanding is that under the hood the CPU OpenCL platform groups work-groups into some number of actual software threads (since creating hundreds of OS threads would be very time consuming), and I'm speculating that it creates one software thread per core. But I have no way of knowing for sure. Any thoughts?
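
For what it's worth, I suppose I could at least query how many compute units the runtime sees for the CPU device, something like this (given a cl_device_id for the CPU platform):

// Assumed check: how many compute units the CPU device reports,
// which should bound the number of worker threads the runtime uses.
cl_uint cu = 0;
clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(cu), &cu, NULL);
printf("CL_DEVICE_MAX_COMPUTE_UNITS = %u\n", cu);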

Regardless, again, both great answers. Thanks!

Robert_I_Intel
Employee

Georg,

My bad, the GFlops should be calculated based on the number of FMA units (not threads), so 4 cores * 2 FMA units * 8 floats (256-bit wide FPU) * 2 (mul + add) * 3.7 GHz = 473.6 GFlops.

On the GPU, we do have many software threads, each containing 32, 16, or 8 work-items, depending on the kernel, that are dispatched to hardware threads.

On the CPU, under the hood, 8 software threads are running persistently in your case and receive tasks dispatched to them.

BTW, one way to shoot for maximum GFlops is to modify your kernel so that each loop iteration computes 4 or 5 independent partial FMAs (to hide the latency) and then adds the partial results together at the end.
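
Something along these lines (a rough sketch, not tested, assuming FMA_LOOP is divisible by 4):

kernel void fmaKernel4(global float *out){

  float in = out[get_global_id(0)];

  // Four independent accumulators: the chains have no dependence on
  // each other, so the FMAs can be pipelined instead of each one
  // waiting out the 5-cycle latency of the previous result.
  float s0 = in, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;

  for(int i = 0; i < FMA_LOOP; i += 4){
     s0 = fma(1.02345f, (float)FMA_LOOP, s0);
     s1 = fma(1.02345f, (float)FMA_LOOP, s1);
     s2 = fma(1.02345f, (float)FMA_LOOP, s2);
     s3 = fma(1.02345f, (float)FMA_LOOP, s3);
  }

  // Add the partial results together at the end.
  out[get_global_id(0)] = s0 + s1 + s2 + s3;

}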
