OpenCL* for CPU
Ask questions and share information on Intel® SDK for OpenCL™ Applications and OpenCL™ implementations for Intel® CPU.

CPU OpenCL work-items

Juan_G_1
Beginner

What is the meaning of having a certain number of OpenCL work-items on a CPU?

I'm trying to understand why I can have more work-items on a CPU than on a GPU in one dimension.

 

== CPU == 

    DEVICE_VENDOR: Intel
    DEVICE NAME: Intel(R) Core(TM) i5-5257U CPU @ 2.70GHz
    MAXIMUM NUMBER OF PARALLEL COMPUTE UNITS: 4
    MAXIMUM DIMENSIONS FOR THE GLOBAL/LOCAL WORK ITEM IDs: 3
    MAXIMUM NUMBER OF WORK-ITEMS IN EACH DIMENSION: (1024 1 1  )
    MAXIMUM NUMBER OF WORK-ITEMS IN A WORK-GROUP: 1024

== GPU == 


    DEVICE_VENDOR: Intel Inc.
    DEVICE NAME: Intel(R) Iris(TM) Graphics 6100
    MAXIMUM NUMBER OF PARALLEL COMPUTE UNITS: 48
    MAXIMUM DIMENSIONS FOR THE GLOBAL/LOCAL WORK ITEM IDs: 3
    MAXIMUM NUMBER OF WORK-ITEMS IN EACH DIMENSION: (256 256 256  )
    MAXIMUM NUMBER OF WORK-ITEMS IN A WORK-GROUP: 256

The above is the output of my test code, which prints the information about the actual hardware that the OpenCL framework can use.

I really do not understand the value of 1024 for the maximum number of work-items in the CPU section. What is the real meaning of that number? What does it mean to have 1024 work-items on a 4-core CPU?

 

Michael_C_Intel1
Moderator

Hi JuanG,

Thanks for the question and the interest in heterogeneous programming.

The number of OpenCL work-items available per work-group does not necessarily have a linear relationship to the number of cores of an Intel processor, nor to the number of compute units reported by an OpenCL runtime. The reason is twofold.

1) The microarchitecture differs from device to device.

2) The built OpenCL program may exhibit different residency and transfer characteristics that affect performance. Runtimes have some flexibility to schedule work on the target as they see fit, and this is transparent to the program calling them.

The OpenCL runtimes you listed each employ their own methods of staging work to an OpenCL device, and those methods vary by platform and by cl_kernel. Programmers are therefore recommended two heuristics:

1) Start by scheduling as much work as possible to the target device. Size the number of work-items dynamically, using API feedback to assist with sizing (see the notes below), and let the Intel runtime handle compilation, scheduling, and affinitization.

2) Minimize offload transfers. Combine buffers and transfer them in one shot if possible, and use zero-copy buffers where available.

Intel employees put effort into tuning these platforms, so feedback that shows a gap between expectation and observed behavior is appreciated; it allows the product to be improved. For Intel Graphics, such feedback is especially useful when provided directly through the GitHub portal for Windows or Linux. For the Intel CPU runtime, this forum is a good place to report issues.

Note two closely related API provisions:
1) clGetKernelWorkGroupInfo(...) with CL_KERNEL_WORK_GROUP_SIZE
This gives the maximum work-group size for the kernel object on that device. Logically, this should be less than or equal to the value clGetDeviceInfo(...) provides for CL_DEVICE_MAX_WORK_GROUP_SIZE.

2) clGetKernelWorkGroupInfo(...) with CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE
There is little reason not to use this value, unless the data set leaves a 'tail' of work-items that does not divide evenly into the preferred multiple.

Note that in the best case, a compiler and runtime scheduler will saturate the compute available in every vector lane of the device. The compiler analyzes the source to choose the best vector operation it knows for a particular source line. For the Intel® Core™ i5-5257U, the CPU runtime can leverage AVX2 vector instructions. Newer Intel CPUs have wider vectors, and in some cases every lane of an AVX-512 instruction can be filled with useful compute; AVX-512 support is one of the key features of the most recently released CPU runtime.

In this case, note that the CPU in use has 2 cores and 4 hardware threads (assuming Intel® Hyper-Threading Technology is enabled). Your example program shows the CPU runtime enumerating 4 compute units. I don't have a machine to check immediately, but I suspect that with HT turned off only two compute units would be enumerated.

If you're interested in getting a better feel for the capabilities of Intel Graphics devices with respect to programming, I recommend the training video by Adam Herr, one of Intel's leaders in heterogeneous compute. The video should still be useful for developers targeting only the CPU runtime, as Herr discusses general principles that extend beyond graphics devices.


-MichaelC
