Hi OpenCL experts:
I came across the sentence "Thread dispatch serialization becomes a gating factor when a kernel has insufficient work per a work-item." on page 6 of the paper "Intel® VTune™ Amplifier XE: Getting started with OpenCL™ performance analysis on Intel HD Graphics". I don't understand what it means.
Today I wrote a kernel that converts a 3-channel image to grayscale. The three channels are stored in three separate buffers, so every work-item has to do three reads to get its data. When I use SIMD4 (vload4()), the EU array is idle 17% of the time, but with SIMD16 (vload16()) it is idle 82% of the time. What causes this difference? Did I miss something?
These statistics were collected with a 512x512 image. The number of work-items is 128x512 for the vload4() version and 32x512 for the vload16() version, and the local size is set to NULL in both cases. Please help, I have no idea what is going on.
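To make the setup concrete, here is a rough host-side C model of what one work-item does in each variant. It is not the actual kernel, only a sketch: the names `work_items_x` and `gray_work_item` are made up for illustration, and the integer grayscale weights (77, 150, 29, an approximation of the BT.601 coefficients) are an assumption, since the real weights are not given above.

```c
#include <assert.h>
#include <stddef.h>

/* Work-items along x when each one handles `pixels_per_item` pixels of a
   row: 4 models the vload4() variant, 16 models the vload16() variant. */
static size_t work_items_x(size_t width, size_t pixels_per_item)
{
    assert(pixels_per_item > 0 && width % pixels_per_item == 0);
    return width / pixels_per_item;
}

/* Plain-C model of a single work-item at global id (gx, gy).
   Each pixel needs three reads (one per channel buffer) and one write.
   The weights are assumed, not taken from the real kernel. */
static void gray_work_item(const unsigned char *r, const unsigned char *g,
                           const unsigned char *b, unsigned char *gray,
                           size_t width, size_t pixels_per_item,
                           size_t gx, size_t gy)
{
    size_t base = gy * width + gx * pixels_per_item;
    for (size_t i = 0; i < pixels_per_item; ++i) {
        size_t p = base + i;
        /* 77 + 150 + 29 = 256, so the shift by 8 keeps the 0..255 range. */
        gray[p] = (unsigned char)((77u * r[p] + 150u * g[p] + 29u * b[p]) >> 8);
    }
}
```

For a 512-pixel-wide row, `work_items_x(512, 4)` gives 128 and `work_items_x(512, 16)` gives 32, which matches the 128x512 and 32x512 global sizes above. So the vload16() version launches a quarter as many work-items, each doing four times as much work.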