Hello
I am testing the kernel launch latencies of the CPU and GPU by timing EnqueueNDRangeKernel on a blank kernel (but with arguments). I found that the CPU consistently takes about 150 us, while the GPU takes much longer. Further tests revealed that the GPU's latency scales with the size of the buffer passed as a kernel argument: for my 3072x3072 buffer, the GPU's launch latency is ~1600 us; for a smaller 1024x1024 buffer, it is ~450 us. Note that this is while keeping the global work size constant.
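For reference, my timing loop looks roughly like this (a minimal sketch using the C API and Windows timers; setup of `context`, `queue`, and the blank `kernel` is omitted, and the names are illustrative):

```c
#include <CL/cl.h>
#include <windows.h>
#include <stdio.h>

/* Sketch: measure host-side launch cost of a blank kernel. The queue is
 * drained first so only the launch itself is inside the timed region. */
static void time_launch(cl_command_queue queue, cl_kernel kernel)
{
    size_t global[2] = { 1024, 1024 };     /* kept constant across runs */
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);

    clFinish(queue);                        /* ensure the queue is idle */
    QueryPerformanceCounter(&t0);
    clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, NULL,
                           0, NULL, NULL);
    clFinish(queue);                        /* blank kernel: total ~= launch cost */
    QueryPerformanceCounter(&t1);

    printf("launch latency: %.1f us\n",
           (double)(t1.QuadPart - t0.QuadPart) * 1e6 / (double)freq.QuadPart);
}
```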
Can someone shed some light on:
1) Why is the GPU's launch overhead higher than the CPU's, and is there any way to mitigate it?
2) Why does the GPU's launch overhead scale with the buffer size? Since no PCIe transfer is involved, shouldn't any data movement happen during execution rather than at kernel launch?
I am using the latest OpenCL runtimes for Windows, running on an Intel 4200U / HD 4400.
Thanks
Hi Supradeep,
1) GPU kernel launch overhead is expected to be higher, since the runtime has to set up the kernel on the GPU. GPU execution pays off for (i) processing and accelerating large data sets, and the best acceleration is achieved with (ii) pipelined solution designs that keep the device busy.
Also consider warming up: executing a kernel on the GPU a couple of times before using/measuring it will give completely different (better) results.
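For example (a minimal sketch; `queue`, `kernel`, and the global size come from your existing setup):

```c
#include <CL/cl.h>

/* Sketch: warm-up pass before measuring, so one-time driver and
 * kernel-setup costs are paid outside the timed region. */
static void warm_up(cl_command_queue queue, cl_kernel kernel)
{
    size_t global[2] = { 1024, 1024 };
    for (int i = 0; i < 5; ++i)
        clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, NULL,
                               0, NULL, NULL);
    clFinish(queue);   /* time only the launches issued after this point */
}
```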
2) The driver actually handles the kernel setup at enqueue time, and enqueuing a kernel whose arguments are buffers can involve data transfers. Profiling events can show where that time goes; see the sketch after the notes below.
- In your experiment, were you using the CPU as an OpenCL device?
- Even though the GPU is embedded, it still uses its own video memory; it is not shared with the CPU.
- If your dataset can fit into the CPU cache at once, the CPU alone will most probably achieve the fastest results.
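Regarding the profiling mentioned in 2), here is a minimal sketch of reading the event timestamps (it assumes the queue was created with CL_QUEUE_PROFILING_ENABLE; the names are illustrative):

```c
#include <CL/cl.h>
#include <stdio.h>

/* Sketch: OpenCL profiling events break the launch down into phases. */
static void profile_launch(cl_command_queue queue, cl_kernel kernel)
{
    size_t global[2] = { 1024, 1024 };
    cl_event evt;
    cl_ulong queued, submitted, started, ended;

    clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, NULL,
                           0, NULL, &evt);
    clWaitForEvents(1, &evt);

    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_QUEUED,
                            sizeof queued, &queued, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_SUBMIT,
                            sizeof submitted, &submitted, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                            sizeof started, &started, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                            sizeof ended, &ended, NULL);

    /* queued->submit: host/driver time; submit->start: device setup;
     * start->end: execution (near zero for a blank kernel). Times in ns. */
    printf("queue->submit %llu ns, submit->start %llu ns, start->end %llu ns\n",
           (unsigned long long)(submitted - queued),
           (unsigned long long)(started - submitted),
           (unsigned long long)(ended - started));
    clReleaseEvent(evt);
}
```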
Best regards,
Tamer
Thanks for the helpful answer. Can you clarify one part of it?
- Even though the GPU is embedded, it still uses its own video memory; it is not shared with the CPU.
I'm creating my buffers as zero-copy using CL_MEM_USE_HOST_PTR in a shared CPU-GPU context, as described in https://software.intel.com/en-us/articles/getting-the-most-from-opencl-12-how-to-increase-performance-by-minimizing-buffer-copies-on-intel-processor-graphics.
I know that the GPU has some dedicated memory reserved in the DRAM. So are you saying that whenever a kernel is launched, all buffers associated with it are first copied to this video memory? But doesn't this violate the 'zero' in 'zero-copy'?
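For reference, this is roughly how I create them (a sketch; if I'm reading the article right, the host pointer must be 4096-byte aligned and the size a multiple of 64 bytes; error handling omitted, and the element type is just for illustration):

```c
#include <CL/cl.h>
#include <malloc.h>   /* _aligned_malloc on Windows */

/* Sketch: zero-copy buffer backed by a page-aligned host allocation. */
static cl_mem make_zero_copy_buffer(cl_context context,
                                    size_t width, size_t height)
{
    cl_int err;
    size_t size = width * height * sizeof(float);   /* e.g. 3072 x 3072 */
    void *host = _aligned_malloc(size, 4096);       /* 4 KB alignment */
    return clCreateBuffer(context,
                          CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                          size, host, &err);
}
```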
Thanks
Hi Supradeep,
That is true; buffers can be created in shared memory. The referenced article demonstrates the shared-memory approach, using the system's shared memory, not the GPU's:
-----------------------
Figure 1. Relationship of the CPU, Intel® processor graphics, and main memory. Notice a single pool of memory is shared by the CPU and GPU, unlike discrete GPUs that have their own dedicated memory that must be managed by the driver.
-----------------------
You might want to measure the performance of your specific solution using both approaches, taking the CL memory types and optimizations into account; a sketch of the two allocation strategies follows.
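Something along these lines (a minimal sketch, error handling omitted; `ctx`, `size`, and `user_ptr` are illustrative):

```c
#include <CL/cl.h>

/* (a) Driver-allocated shared memory; access it from the host via
 * clEnqueueMapBuffer / clEnqueueUnmapMemObject. */
static cl_mem alloc_driver_side(cl_context ctx, size_t size)
{
    cl_int err;
    return clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                          size, NULL, &err);
}

/* (b) Wrap an existing, properly aligned user allocation (zero-copy). */
static cl_mem alloc_user_side(cl_context ctx, void *user_ptr, size_t size)
{
    cl_int err;
    return clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                          size, user_ptr, &err);
}
```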
Best regards,
Tamer
