OpenCL* for CPU

Strange behavior when float16 is used, and the meaning of thread idle

Xinyan_S_
Beginner

Hi OpenCL experts,
    I saw the sentence "Thread dispatch serialization becomes a gating factor when a kernel has insufficient work per a work-item." on page 6 of the paper named <Intel® VTune™ Amplifier XE: Getting started with OpenCL™ performance HD Graphics OpenCL™ analysis on Intel HD Graphics>. I don't understand what it means.
    Today I wrote a kernel that converts a 3-channel image to grayscale. The three channels are stored in three separate memory buffers, so every work-item has to read three times to use the data. When I use SIMD4 (vload4()), the EU array is idle 17% of the time, but with SIMD16 (vload16()) it is idle 82% of the time. What causes this? Did I miss something?
    These statistics were collected with a 512x512 image. The number of work-items is 128x512 and 32x512 respectively, and the local size is set to NULL. Please help, I am out of ideas.
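For readers following along, here is a minimal host-side sketch in plain C of the decomposition described above. It is not the poster's kernel: the function names are hypothetical, the pixel type is assumed to be float, and the luma weights (BT.601) are an assumption, since the post does not say which weights were used. It only illustrates how the pixels-per-work-item choice yields the 128x512 and 32x512 global sizes.

```c
#include <assert.h>
#include <stddef.h>

/* Work-items along x when each work-item converts `pixels_per_item`
   consecutive pixels of a row (4 for vload4, 16 for vload16). */
size_t work_items_x(size_t width, size_t pixels_per_item)
{
    assert(pixels_per_item > 0 && width % pixels_per_item == 0);
    return width / pixels_per_item;
}

/* Host-side model of the decomposition: the two outer loops stand in for
   the 2D global range, the inner loop for one work-item's vector of pixels.
   The luma weights are an assumption (BT.601), not taken from the post. */
void gray_by_work_items(const float *r, const float *g, const float *b,
                        float *gray, size_t width, size_t height,
                        size_t pixels_per_item)
{
    size_t items_x = work_items_x(width, pixels_per_item);
    for (size_t gy = 0; gy < height; ++gy) {
        for (size_t gx = 0; gx < items_x; ++gx) {
            size_t base = gy * width + gx * pixels_per_item;
            for (size_t k = 0; k < pixels_per_item; ++k) {
                size_t i = base + k;
                /* three reads per pixel, one from each plane */
                gray[i] = 0.299f * r[i] + 0.587f * g[i] + 0.114f * b[i];
            }
        }
    }
}
```

For a 512-pixel-wide image, work_items_x(512, 4) is 128 and work_items_x(512, 16) is 32, which matches the two launch sizes quoted in the post. Both decompositions produce the same output; only the number of work-items changes.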

Jeffrey_M_Intel1
Employee

The main idea behind the sentence about thread dispatch is that it can be inefficient to write kernels where each work item does not do much. Increasing the amount of work per work item (as you've done) can increase efficiency. The drop in efficiency as you move to 16 pixels per work item isn't expected. In the tests I've run here, I have not seen a drop like this.

Is there any way you can send a reproducer? Alternatively, with a little more time I can send some examples of the RGB-gray kernels I've been looking at.

Ben_A_Intel
Employee

A couple follow-on questions:

- How does the performance of the vec16 version compare to the vec4 version?  Is the vec16 version faster than the vec4 version even if the EU array is "more idle"?

- Which device are you running on?

- Do you happen to know if your kernel is being compiled SIMD8, SIMD16, or SIMD32? You can determine this by querying clGetKernelWorkGroupInfo with CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE.

- What happens if you use a much bigger image, say 2048x2048 or 4096x4096?

My thinking is that if your kernel is compiled SIMD32 and you're launching 32x512 work items, that's not a lot of EU threads, so the EU array may not spend much time fully occupied and active.
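To put rough numbers on that reasoning, here is a small sketch in plain C. It assumes, as a simplification, that one EU thread covers exactly `simd_width` work-items and ignores how the runtime splits the range into work-groups when the local size is NULL. The function name `eu_threads` is hypothetical and not part of any OpenCL API.

```c
#include <assert.h>
#include <stddef.h>

/* Estimated number of EU threads for a 2D launch, assuming each thread
   executes `simd_width` work-items (a simplifying assumption). Rounds up
   when the work-item count is not a multiple of the SIMD width. */
size_t eu_threads(size_t global_x, size_t global_y, size_t simd_width)
{
    assert(simd_width == 8 || simd_width == 16 || simd_width == 32);
    size_t work_items = global_x * global_y;
    return (work_items + simd_width - 1) / simd_width;
}
```

With the launch sizes from this thread, eu_threads(32, 512, 32) gives 512 threads for the vec16 version, while eu_threads(128, 512, 32) gives 2048 for the vec4 version. Whether 512 threads is enough to keep the EU array busy depends on the device, which has not been stated here.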
