Hi Sq Chen,
Thanks for the comment. I think it’s useful to expand on it a bit:
Determining a device's capabilities is an important consideration when moving to new hardware for better performance opportunities and features. Initial bring-up on a new platform can seem like a burden, but the interrogation infrastructure here actually provides fairly useful feedback for scheduling and for making sure work fits on the device.
Compared to a host-only program, some things are similar… for example, understanding the maximum amount of memory you can allocate. Some things are different, like understanding the maximum number of work-items, or the memory available under the different memory qualifiers: __global, __local, __private. Other things are just slightly different, like understanding which OpenCL-C revision the device offers, typically 1.2 or 2.0. The spirit here isn't much different from writing to different C revisions and managing deprecated usages of previous C standards.
Fortunately, many of the related parameters are offered through API interrogation calls. Let's scan through the Khronos reference to look at the 2.1 API facilities… again, keep in mind the different revisions of the linked resources.
3. Device Info parameters
The Intel implementations will pick a work-group size automatically if you don't specify one, but you may wish to choose another size to better partition your work with respect to __local memory access within a work-group. Make sure to also see the description of CL_KERNEL_WORK_GROUP_SIZE in the other article linked below.
The dimensions must be considered for partitioning the workload.
The number of work items per kernel execution dictates how to partition the work.
Dictates which kernel (not API) features you can use (e.g. 1.2 or 2.0).
Consider scaling the workload by the number of samplers available. In general, for Intel devices, Intel subgroups provide better performance where applicable.
Helpful to understand how image data structures should be partitioned.
Don’t allocate more than you have available!
Useful for knowing work-group-only memory access limits.
As a general rule, anything with MAX_SIZE or PREFERRED in its name can be interesting to interrogate in order to partition work. Not all will be interesting for all devices, as there are some vendor and device implementation differences… see CL_DEVICE_MAX_COMPUTE_UNITS as one such example.
There are various other niche parameters that are useful. It's highly recommended to read through the document, as it can alert developers to useful features.
4. Kernel Work Group Info parameters
Use in combination with Device information parameters to tune per kernel and per device.
How large can a work group be for a given device?
Use this parameter to make sure the local memory used does not exceed the maximum expected on the device.
Using this value can be a performance advantage.
Are the kernels using too much private memory for the target device? If so, this query provides a facility to account for the private (default) memory a kernel consumes before launching it.
Exceeding some of these limits may fail somewhat opaquely… and be hard to debug at runtime if an application does not check beforehand. Make sure to check and print all API error codes… a simple macro to print the errors can be a life saver for new developers!
There are some equivalent built-in calls in OpenCL-C kernels to get similar information device-side… so it may not be necessary to send everything over from the host. Pages 6 and 9 of the 2.1 reference card are relevant here:
In this reference guide, please take care to observe the color coding to apply the appropriate revision for your targets.
For many developers with a future focus, development is becoming more heterogeneous, spanning different types of devices and architectures.
Thanks again for the comment.
Thank you for the very important information, MichaelC.
This work does not seem easy.
So, I want to know: are there any high-level libraries that can automatically divide a large task and dispatch it to several GPUs?
I know it is difficult, but I am still hopeful.