As far as I know, each work-group is passed sequentially to the kernel and all of its work-items are run in parallel. Example: a 3D array of 8*8*8 would have 512 points; with a work-group size of 4*4*4 there are 8 such groups, each having 64 points. The NDRange performs an operation on all 64 points of a work-group in parallel, after which it does the same for the next work-group. How do I run the work-groups in parallel as well?
I think the article below will help you understand. Please go through it.
Work-items in the same work-group won't run in parallel; they will be pipelined. You will need to use the SIMD attribute (num_simd_work_items) to achieve work-item-level parallelism:
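A minimal sketch of what that could look like, assuming the Intel FPGA SDK for OpenCL. The kernel name `scale`, the buffers `in` and `out`, the SIMD width of 4 and the flattened work-group size of 64 are all made up for illustration; only the two attributes come from the SDK. As far as I know, num_simd_work_items has to be combined with reqd_work_group_size, and the work-group size must be evenly divisible by the SIMD width.

```c
// OpenCL C device code for the Intel FPGA SDK for OpenCL (not host C).
// Hypothetical kernel: names, sizes and the arithmetic are illustrative only.

// Fix the work-group size at compile time: 64 work-items, flattened to 1D.
__attribute__((reqd_work_group_size(64, 1, 1)))
// Ask the compiler to vectorize the datapath so that 4 work-items
// advance through the pipeline together (64 is divisible by 4).
__attribute__((num_simd_work_items(4)))
__kernel void scale(__global const float *restrict in,
                    __global float *restrict out)
{
    // One output element per work-item.
    size_t gid = get_global_id(0);
    out[gid] = 2.0f * in[gid];
}
```

With the 8*8*8 example from the question, the host would enqueue 512 work-items in total with a local size of 64, giving 8 work-groups.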
Multiple work-groups are also automatically pipelined one after the other inside the same compute unit, and the compiler will replicate local memory buffers inside your kernel to accommodate this. If you want to have work-group-level parallelism, then you need to use the num_compute_units() attribute:
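Again only as a sketch under the same assumptions (hypothetical kernel `scale` and buffers `in`/`out`, and an arbitrarily chosen count of 2 compute units), the attribute is placed in front of the kernel like this:

```c
// OpenCL C device code for the Intel FPGA SDK for OpenCL (not host C).
// Hypothetical kernel: names, sizes and the arithmetic are illustrative only.

// Fix the work-group size at compile time: 64 work-items, flattened to 1D.
__attribute__((reqd_work_group_size(64, 1, 1)))
// Replicate the whole kernel pipeline twice; different work-groups
// can then be dispatched to the two compute units at the same time.
__attribute__((num_compute_units(2)))
__kernel void scale(__global const float *restrict in,
                    __global float *restrict out)
{
    // One output element per work-item.
    size_t gid = get_global_id(0);
    out[gid] = 2.0f * in[gid];
}
```

Keep in mind that each extra compute unit duplicates the kernel hardware, so FPGA area usage grows with the count, and all compute units share the same global memory bandwidth.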