
Difference in Kernel launch latencies of CPU and GPU devices

Supradeep_A_
Beginner

Hello

I am testing the kernel launch latencies of the CPU and GPU devices by timing EnqueueNDRangeKernel on a blank kernel (but with arguments). I found that the CPU consistently takes about 150 µs, while the GPU takes much longer. Further tests revealed that the GPU's latency scales with the size of the buffer provided as a kernel argument. For my 3072x3072 buffer, the GPU's launch latency is ~1600 µs; for a smaller 1024x1024 buffer, it is ~450 µs. Note that this is while keeping the global work size constant.
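For reference, here is roughly what my measurement looks like (a simplified sketch; queue, kernel, and buffer creation are omitted, and the function name is just for illustration):

#include <CL/cl.h>
#include <chrono>
#include <cstdio>

// Time only the enqueue call for an empty kernel that takes one buffer argument.
void measure_launch_latency(cl_command_queue queue, cl_kernel kernel, cl_mem buf)
{
    size_t global[2] = {3072, 3072};  // kept constant while the buffer size varies
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);

    auto t0 = std::chrono::high_resolution_clock::now();
    clEnqueueNDRangeKernel(queue, kernel, 2, nullptr, global, nullptr, 0, nullptr, nullptr);
    auto t1 = std::chrono::high_resolution_clock::now();

    clFinish(queue);  // let the launch complete before the next measurement

    std::printf("enqueue latency: %lld us\n",
                (long long)std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count());
}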

Can someone shed some light on:

1) Why is the GPU's launch overhead higher than the CPU's, and is there any way to mitigate it?

2) Why does the GPU's launch overhead scale with the buffer size? Since no PCIe transfer is involved, shouldn't all data movement happen during execution rather than at kernel launch?

 

I am using the latest OpenCL runtimes for Windows, running on an Intel 4200U / HD Graphics 4400.

Thanks

Tamer_Assad
Innovator

Hi Supradeep,

 

1) GPU kernel launch overhead is expected to be higher, since the driver has to set up the kernel on the GPU. GPU execution is worthwhile for (i) processing and accelerating large data sets, and the best practice for GPU acceleration is (ii) a pipelined solution design.

Warming up: executing the kernel on the GPU a couple of times before using/measuring it will give completely different (better) results. There is a short sketch toward the end of this reply.

2) The driver actually handles the kernel setup, and enqueuing buffers involves data transfers.

 

- In your experiment, were you using the CPU as an OpenCL device?

- Even though the GPU is integrated, it still uses its own video memory; it is not shared with the CPU.

- If your data set can fit into the CPU cache at once, the CPU alone will most probably achieve the fastest results.
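Regarding the warm-up point above, a rough sketch (names are illustrative only; it assumes the same queue and kernel used in the measurement):

#include <CL/cl.h>

// Launch the kernel a few times and wait for completion before measuring,
// so the first-launch driver/JIT setup cost is excluded from the timed run.
void warm_up(cl_command_queue queue, cl_kernel kernel, size_t global[2])
{
    for (int i = 0; i < 3; ++i)
        clEnqueueNDRangeKernel(queue, kernel, 2, nullptr, global, nullptr, 0, nullptr, nullptr);
    clFinish(queue);  // ensure the warm-up launches have fully completed
    // ... then run the timed measurement ...
}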

 

Best regards,

Tamer

Supradeep_A_
Beginner

Thanks for the helpful answer. Can you clarify one part of it?

- Even though the GPU is integrated, it still uses its own video memory; it is not shared with the CPU.

I'm creating my buffers to be zero-copy using CL_MEM_USE_HOST_PTR in the shared CPU-GPU context, as described in https://software.intel.com/en-us/articles/getting-the-most-from-opencl-12-how-to-increase-performance-by-minimizing-buffer-copies-on-intel-processor-graphics.
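Roughly like this (simplified sketch, error handling omitted; as I understand the article, the host pointer should be page-aligned and the size a multiple of a cache line for the zero-copy path to be taken):

#include <CL/cl.h>
#include <malloc.h>  // _aligned_malloc on Windows

cl_mem make_zero_copy_buffer(cl_context ctx, size_t width, size_t height)
{
    size_t bytes = width * height * sizeof(float);
    void *host = _aligned_malloc(bytes, 4096);  // 4096-byte (page) aligned allocation
    cl_int err = CL_SUCCESS;
    return clCreateBuffer(ctx, CL_MEM_USE_HOST_PTR | CL_MEM_READ_WRITE,
                          bytes, host, &err);   // intended to be zero-copy on shared memory
}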

I know that the GPU has some dedicated memory reserved in DRAM. So are you saying that whenever a kernel is launched, all buffers associated with it are first copied to this video memory? But doesn't that violate the 'zero' in 'zero-copy'?

Thanks

Tamer_Assad
Innovator

Hi Supradeep,

 

It is true that buffers can be created in shared memory. The referenced article demonstrates the shared-memory approach, i.e. the system's shared memory, not the GPU's:

-----------------------


Figure 1. Relationship of the CPU, Intel® processor graphics, and main memory. Notice a single pool of memory is shared by the CPU and GPU, unlike discrete GPUs that have their own dedicated memory that must be managed by the driver.

-----------------------

You might want to measure the performance of your specific solution using both approaches, considering the CL memory types and optimizations.
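For example, one way to see where the time goes is to attach an event to the launch and read the OpenCL profiling timestamps, which separate queuing/submission overhead from actual execution (sketch only; the command queue must be created with CL_QUEUE_PROFILING_ENABLE):

#include <CL/cl.h>
#include <cstdio>

void profile_launch(cl_command_queue queue, cl_kernel kernel, size_t global[2])
{
    cl_event evt;
    clEnqueueNDRangeKernel(queue, kernel, 2, nullptr, global, nullptr, 0, nullptr, &evt);
    clWaitForEvents(1, &evt);

    cl_ulong queued, submitted, started, ended;
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_QUEUED, sizeof(queued), &queued, nullptr);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_SUBMIT, sizeof(submitted), &submitted, nullptr);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(started), &started, nullptr);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END, sizeof(ended), &ended, nullptr);

    std::printf("queue: %llu ns, submit: %llu ns, execute: %llu ns\n",
                (unsigned long long)(submitted - queued),
                (unsigned long long)(started - submitted),
                (unsigned long long)(ended - started));
    clReleaseEvent(evt);
}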

 

Best regards,

Tamer
