Basic question about relation between work-item and scheduling unit
In OpenCL's data-parallel mode, is a single work-item the scheduling unit, executed by one CPU thread? Or does the scheduler take a group of work-items, merge them into one task, and treat that as one scheduling unit?
In addition, if I set the work-group size, is one group executed on one core?
The OpenCL spec defines that a work group is executed on a single compute unit. The current version of the Intel OpenCL SDK defines a CPU (hardware) thread to be a compute unit, so a work group will be executed on a single thread. The opposite doesn't necessarily hold - multiple work groups may be processed on the same thread, even if the number of work groups is smaller than the total number of compute units. Work groups exist irrespective of the local work size passed to clEnqueueNDRangeKernel calls - if you choose not to define it yourself, the implementation will choose a suitable value based on the kernel you wish to execute.
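To make the mapping above concrete, here is a minimal Python model (not OpenCL host code) of how a 1-D NDRange splits into work groups. The names `split_ndrange`, `locate`, `HW_THREADS` and the round-robin placement are assumptions made for illustration only; the SDK's actual placement policy is not described in this thread. The sketch also assumes the OpenCL 1.x rule that the global size is a multiple of the local size.

```python
def split_ndrange(global_size, local_size):
    """Split a 1-D NDRange into work groups, each a range of global IDs.

    Assumes the OpenCL 1.x rule that the global size must be a
    multiple of the local size.
    """
    if local_size <= 0 or global_size <= 0 or global_size % local_size != 0:
        raise ValueError("global_size must be a positive multiple of local_size")
    return [range(start, start + local_size)
            for start in range(0, global_size, local_size)]


def locate(global_id, local_size):
    """Return (group_id, local_id) for a work-item, mirroring what
    get_group_id(0) and get_local_id(0) report when no global offset is used."""
    return divmod(global_id, local_size)


# Hypothetical example: 1024 work-items in groups of 64.
groups = split_ndrange(1024, 64)

HW_THREADS = 4  # assumed number of compute units (hardware threads)

# Illustrative round-robin placement only. Each work group lands on exactly
# one thread, while a thread may end up hosting several work groups.
placement = {group_id: group_id % HW_THREADS for group_id in range(len(groups))}

print(len(groups))      # 16
print(locate(130, 64))  # (2, 2)
print(placement[5])     # 1
```

Whatever the real policy is, the invariant from the answer above is the same: one work group never spans two threads, but one thread can process many work groups.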
Is there any documentation available that describes how the OpenCL SDK is implemented? For example, how work-groups are scheduled on the CPU cores, or how the Intel OpenCL SDK maps onto Intel Core CPUs?
To add to Evgeny's answer, could you tell us what you'd like to know about the scheduling policy and why? We try to address the potential developer needs via the OOO queue, device fission and immediate execution implementations, as well as document various approaches to well-performing code in the optimization guide linked to above. If we've missed something, we'd like to know so we can extend our support for developers in future versions.
Does it mean that if I have only one work-group and a lot of work-items within it (local size = global size), only one CPU thread will be used? Why not distribute the work-items across all available CPU threads?
This scenario (i.e. only one work-group) might happen, for instance, if I need to synchronize over *all* work-items, so I could use "barriers"--the OpenCL specification doesn't allow synchronization across work-items from different work-groups.
In the scenario of a single work-group with multiple work items, you will only be able to benefit from SIMD-level parallelism. Your understanding is correct that you will not get any thread-level parallelism. The OpenCL spec defines a "work group" as a collection of work items that executes on a single compute unit. To have thread-level parallelism within a work-group would mean that we define the entire CPU to mean "compute unit", which in turn would take away control from developers, as they won't be able to control utilization of the threads with extensions such as Device Fission (that operates on the compute unit level).
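As a rough illustration of the point above, the following Python sketch computes an upper bound on the number of CPU threads that can work on one NDRange, under the assumption stated in this thread that each work group runs on exactly one hardware thread. The function name `max_threads_used` and the thread count of 8 are hypothetical. The real number can be lower, since several work groups may share a thread.

```python
def max_threads_used(global_size, local_size, hw_threads):
    """Upper bound on threads that can work on one 1-D NDRange, assuming
    each work group is executed by exactly one hardware thread."""
    if local_size <= 0 or global_size <= 0 or global_size % local_size != 0:
        raise ValueError("global_size must be a positive multiple of local_size")
    num_groups = global_size // local_size
    return min(num_groups, hw_threads)


# Single work group (local size == global size): no thread-level parallelism.
print(max_threads_used(4096, 4096, 8))  # 1

# Many small work groups: every thread can be kept busy.
print(max_threads_used(4096, 64, 8))    # 8

# Fewer work groups than threads: some threads stay idle.
print(max_threads_used(4096, 1024, 8))  # 4
```

In this model, choosing a local size smaller than the global size is what exposes thread-level parallelism. With a single work group, only SIMD-level parallelism within the one thread remains.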
It sounds like your program requires some synchronization between work items that could be in different work groups. If you can tell us a bit more about it, maybe we can find a solution for it within the existing OpenCL API. You're correct that an OpenCL C barrier will not be a good match.
Hi Doron Singer,
Thank you for your prompt reply. That scenario was a hypothetical one. I was just wondering whether it would be possible for a given OpenCL implementation to be "smart" and distribute work-items across CPU threads when only a single work-group is scheduled.
Besides the mentioned Device Fission issue, is there another reason why a CPU isn't considered as a single compute unit by OpenCL? In this case each CPU core would be treated as a processing element.
The current version of the OpenCL spec doesn't allow for such flexibility in scheduling. Another reason to consider is that if we work at that granularity, with each core treated as a processing element, it's difficult to express SIMD-level parallelism.