(The following is based on some recent experiments on a GEN8 IGP.)
FYI -- one gotcha to watch out for when porting from CUDA to an Intel IGP is that the OpenCL barrier()/work_group_barrier() operation doesn't tolerate work items or subgroups exiting early: every work item in the work group has to reach the barrier.
For example, if a subgroup returns early and the remaining work items synchronize in a barrier(), then your kernel is going to hang on the IGP.
Early exit of some threads (work items) at the end of a grid is a pretty common use case in CUDA.
Fortunately, OpenCL 2.0 has a feature that doesn't exist in CUDA and that might help you work around this issue: Non-Uniform Work Groups.
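
To make the difference concrete, here is a rough sketch in plain C (no OpenCL headers needed) of the launch arithmetic for one dimension. The struct and function names are made up for illustration and are not part of any API. It assumes n > 0 and local_size > 0. The first function is the usual CUDA-style launch, where the global size is rounded up and the extra work items have to bail out early. The second is the OpenCL 2.0 non-uniform case, where the global size is exactly n and the last work group is simply smaller, so nobody has to return before a barrier().

```c
#include <stddef.h>

/* Hypothetical helper type: launch geometry for one ND-range dimension. */
typedef struct {
    size_t global_size;     /* work items actually launched             */
    size_t num_groups;      /* work groups in the range                 */
    size_t last_group_size; /* work items in the final work group       */
    size_t idle_items;      /* launched items with nothing to process   */
} launch_geometry;

/* CUDA-style launch: round the global size up to a multiple of
 * local_size. The idle items are the ones that would "return early". */
launch_geometry padded_launch(size_t n, size_t local_size)
{
    launch_geometry g;
    g.num_groups      = (n + local_size - 1) / local_size;
    g.global_size     = g.num_groups * local_size;
    g.last_group_size = local_size;
    g.idle_items      = g.global_size - n;
    return g;
}

/* OpenCL 2.0 non-uniform launch: the global size is exactly n and the
 * final work group holds the remainder, so no work item is idle. */
launch_geometry non_uniform_launch(size_t n, size_t local_size)
{
    launch_geometry g;
    size_t rem = n % local_size;
    g.global_size     = n;
    g.num_groups      = (n + local_size - 1) / local_size;
    g.last_group_size = rem ? rem : local_size;
    g.idle_items      = 0;
    return g;
}
```

For n = 1000 and a local size of 64, the padded launch starts 1024 work items (24 of them idle), while the non-uniform launch starts exactly 1000, with a final work group of 40. On the OpenCL side this means building the program with -cl-std=CL2.0 and passing the unpadded global size to clEnqueueNDRangeKernel. Inside the kernel, get_local_size() reports the actual (possibly smaller) size of the current work group, and get_enqueued_local_size() reports the size you asked for at enqueue time.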