Hi OpenCL experts:
I saw the sentence "Thread dispatch serialization becomes a gating factor when a kernel has insufficient work per a work-item." on page 6 of the paper named <Intel® VTune™ Amplifier XE: Getting started with OpenCL™ performance HD Graphics OpenCL™ analysis on Intel HD Graphics>. I don't get the point.
Today I wrote a kernel to convert a 3-channel image to grayscale. The 3 channels are stored in 3 separate memory buffers, so every work-item has to read 3 times to use the data. When I use SIMD4 (vload4()) instructions, the EU array is idle 17% of the time, versus 82% with SIMD16 (vload16()). Which factor caused this? Did I miss something?
These statistics were collected with a 512x512 image. The number of work-items is 128x512 and 32x512 respectively, and the local size is set to NULL. Please help me, I don't have any idea.
The main idea behind the sentence about thread dispatch is that it can be inefficient to write kernels where each work item does not do much. Increasing the amount of work per work item (as you've done) can increase efficiency. The drop in efficiency as you move to 16 pixels per work item isn't expected, and in the tests I've run here I have not seen a drop like this.
Is there any way you can send a reproducer? Alternately, with a little more time I can send some examples of the RGB-gray kernels I've been looking at.
A couple follow-on questions:
- How does the performance of the vec16 version compare to the vec4 version? Is the vec16 version faster than the vec4 version even if the EU array is "more idle"?
- Which device are you running on?
- Do you happen to know if your kernel is being compiled SIMD8, SIMD16, or SIMD32? You can determine this by querying clGetKernelWorkGroupInfo() with CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE.
- What happens if you use a much bigger image, say 2048x2048 or 4096x4096?
My thinking is that if your kernel is compiled SIMD32, and you're launching 32x512 work items, that's not a lot of EU threads, therefore the EU array may not spend much time fully occupied and active.
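To put rough numbers on that, here is a small sketch in plain C (ordinary arithmetic, not OpenCL host code). It assumes each EU hardware thread covers one SIMD width of work-items and that the image width divides evenly by the pixels handled per work-item. The helper names work_items and eu_threads are made up for illustration.

```c
#include <stddef.h>

/* Hypothetical helper: number of work-items launched when each work-item
   handles pixels_per_item pixels along a row.
   Assumes width is an exact multiple of pixels_per_item. */
static size_t work_items(size_t width, size_t height, size_t pixels_per_item)
{
    return (width / pixels_per_item) * height;
}

/* Hypothetical helper: number of EU hardware threads needed for a launch,
   assuming one thread executes simd_width work-items (rounded up). */
static size_t eu_threads(size_t items, size_t simd_width)
{
    return (items + simd_width - 1) / simd_width;
}
```

Under those assumptions, work_items(512, 512, 16) is 16384 work-items, which eu_threads(16384, 32) turns into only 512 hardware threads if the kernel is compiled SIMD32. The vec4 launch, work_items(512, 512, 4), is 65536 work-items, or 2048 threads at SIMD32. A 4096x4096 image with 16 pixels per work-item would give 1048576 work-items, or 32768 threads, which is why the larger image is worth trying.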