I have managed to run my kernel on the iGPU under both Linux and Windows.
Officially, Linux does not support running OpenCL kernels on the iGPU, but the open-source OpenCL project "Beignet" makes it possible.
Below are the performance results for my kernel (the deblocking filter in HEVC). The times (in seconds) were not obtained by attaching events to the kernel launch in OpenCL, since event timing also depends on the OpenCL runtime implementation under Windows and Linux. Instead, they were measured with host-side CPU profiling utilities.
OS        H2D    Kernel  D2H
Linux     1.95   3.89    1.56
Windows   6.74   0.85    1.44
I am not sure whether the Beignet development team uses the same compiler as the Windows OpenCL implementation, but kernel performance differs greatly between the two systems. The host-to-device copy also takes much longer on Windows, and I cannot figure out why.
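For anyone who wants to reproduce this kind of host-side measurement, here is a minimal sketch in C. It is not the profiling utility actually used for the numbers above; `wall_seconds`, `time_stage`, `stage_fn` and `example_stage` are hypothetical names chosen for illustration, and the timer is the C11 `timespec_get` wall clock rather than a platform-specific profiler.

```c
#include <time.h>

/* Wall-clock time in seconds (C11 timespec_get; this is not a monotonic clock). */
static double wall_seconds(void) {
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Hypothetical callback type: one stage of the run (H2D copy, kernel, or D2H copy).
 * The stage must block until the device work is done, for example by calling
 * clFinish(queue) after clEnqueueNDRangeKernel, because a kernel enqueue
 * returns before the kernel has finished. */
typedef void (*stage_fn)(void *user_data);

/* Run one stage and return the elapsed host-side time in seconds. */
static double time_stage(stage_fn stage, void *user_data) {
    double start = wall_seconds();
    stage(user_data);
    return wall_seconds() - start;
}

/* Stand-in stage for illustration only: counts how often it was called. */
static void example_stage(void *user_data) {
    int *calls = (int *)user_data;
    *calls += 1;
}
```

Calling `time_stage(example_stage, &counter)` returns the elapsed seconds for that one call. In a real run each stage would wrap the corresponding OpenCL calls plus a blocking wait.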
Please submit Beignet bugs here: https://bugs.freedesktop.org/enter_bug.cgi?product=Beignet or direct questions about Beignet performance to the mailing list: http://lists.freedesktop.org/mailman/listinfo/beignet. We do not support Beignet in this forum.
As for the D2H and H2D times, you can avoid these copies by properly aligning the memory of your buffers, using the CL_MEM_USE_HOST_PTR flag when creating the buffers, and making the size of your buffers a multiple of 64 bytes. See this excellent article by Adam Lake: https://software.intel.com/en-us/articles/getting-the-most-from-opencl-12-how-to-increase-performanc... on how to do it properly.
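A minimal sketch of the host-side allocation this implies, in C. The 64-byte size multiple comes from the advice above; the 4096-byte base-address alignment is an assumption on my part (check the linked article for the exact requirement on your hardware), and `round_up`, `alloc_zero_copy`, `free_zero_copy` and `is_aligned` are hypothetical helper names. The OpenCL calls appear only in a comment, so the block compiles without the OpenCL headers.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* for posix_memalign on POSIX systems */
#endif
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#ifdef _WIN32
#include <malloc.h>              /* _aligned_malloc, _aligned_free */
#endif

#define BASE_ALIGNMENT   4096u   /* assumed alignment of the host pointer */
#define SIZE_GRANULARITY 64u     /* buffer size must be a multiple of this */

/* Round n up to the next multiple of `multiple`. */
static size_t round_up(size_t n, size_t multiple) {
    return (n + multiple - 1) / multiple * multiple;
}

/* True if p is aligned to `alignment` bytes. */
static int is_aligned(const void *p, size_t alignment) {
    return ((uintptr_t)p % alignment) == 0;
}

/* Allocate aligned host memory; *padded_size receives the rounded-up size.
 * Returns NULL on failure. */
static void *alloc_zero_copy(size_t requested, size_t *padded_size) {
    void *ptr = NULL;
    *padded_size = round_up(requested, SIZE_GRANULARITY);
#ifdef _WIN32
    ptr = _aligned_malloc(*padded_size, BASE_ALIGNMENT);
#else
    if (posix_memalign(&ptr, BASE_ALIGNMENT, *padded_size) != 0) {
        ptr = NULL;
    }
#endif
    return ptr;
}

/* Release memory obtained from alloc_zero_copy. */
static void free_zero_copy(void *ptr) {
#ifdef _WIN32
    _aligned_free(ptr);
#else
    free(ptr);
#endif
}

/* Intended use with OpenCL (not compiled here):
 *   cl_mem buf = clCreateBuffer(context,
 *                               CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
 *                               padded_size, ptr, &err);
 * Then access the data with clEnqueueMapBuffer / clEnqueueUnmapMemObject
 * instead of clEnqueueWriteBuffer / clEnqueueReadBuffer. */
```

Release the buffer with clReleaseMemObject before calling `free_zero_copy`, since the runtime may still be using the host pointer while the buffer exists.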