OpenCL* for CPU
Ask questions and share information on Intel® SDK for OpenCL™ Applications and OpenCL™ implementations for Intel® CPU.
This forum covers OpenCL* for CPU only. OpenCL* for GPU questions can be asked in the GPU Compute Software forum. Intel® FPGA SDK for OpenCL™ questions can be asked in the FPGA Intel® High Level Design forum.
1720 Discussions

Is there any opencl tools to deal with multiple different GPUs?


Otherwise I have to judge the capabilities of every GPU, and send a different amount of work to each of them.

That seems like an awful task.

Is there any OpenCL tool to deal with this situation?

Or any suggestions?

Thanks in advance.

2 Replies

Hi Sq Chen,

Thanks for the comment. I think it’s useful to expand on it a bit:

1. Summary

  • Interrogation parameters let users dynamically partition their workload between the host side and the target devices.
  • Generally speaking, the more that can run in parallel the better, but this isn't always the case; interrogation parameters can reveal where scaling cuts off and let you count memory usage against device constraints.
  • Many developers use open-source public tools like ‘clinfo’, which demonstrates system interrogation of OpenCL™ devices.
  • For tuning beyond what interrogation suggests, look toward the Intel® VTune™ Amplifier tool’s ‘GPU’ capabilities.


2. Goal

Determining device capabilities so your program executes well is an important consideration when moving to new hardware for better performance opportunities and features. Initial bring-up on a new platform can appear to be a burden, but the interrogation infrastructure actually provides fairly useful feedback for scheduling and for making sure work fits on the device.

Compared to a host-only program, some things are similar, for example understanding the maximum amount of memory you can allocate. Some things are different, like understanding the maximum number of work-items, or the memory available under the different memory qualifiers: __global, __local, __private. Other things are just slightly different, like understanding which OpenCL-C revision the device offers, typically 1.2 or 2.0. The spirit here isn’t much different from writing to different C revisions and managing deprecated usages of previous C standards.
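To make those qualifiers concrete, here is a minimal, hypothetical OpenCL-C kernel held as a host-side source string (the kind of string you would pass to clCreateProgramWithSource); the kernel name `scale` is just for illustration:

```c
#include <assert.h>
#include <string.h>

/* A minimal, hypothetical OpenCL-C kernel illustrating the three memory
 * qualifiers. On the host side it is just a string handed to the OpenCL
 * compiler at runtime. */
static const char *kernel_src =
    "__kernel void scale(__global float *data,      /* device-wide memory    */\n"
    "                    __local  float *scratch) { /* per-work-group memory */\n"
    "    float x = data[get_global_id(0)];          /* __private by default  */\n"
    "    scratch[get_local_id(0)] = x;\n"
    "    barrier(CLK_LOCAL_MEM_FENCE);              /* sync the work-group   */\n"
    "    data[get_global_id(0)] = x * 2.0f;\n"
    "}\n";
```

Each address space has different size limits, which is exactly what the interrogation parameters below report.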

Fortunately, many related parameters are available through API interrogation calls. Let’s scan through the Khronos reference to look at the 2.1 API facilities; again, keep in mind the different revisions of the linked resources.


3. Device info parameters (clGetDeviceInfo)

CL_DEVICE_MAX_WORK_GROUP_SIZE — The Intel implementations will pick a work-group size automatically if you don’t, but you may wish to choose another size to better partition your work with respect to __local memory access within a work-group. Make sure to also see the description of CL_KERNEL_WORK_GROUP_SIZE in section 4 below.

CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS — The dimensions must be considered when partitioning the workload.

CL_DEVICE_MAX_WORK_ITEM_SIZES — The number of work-items per kernel execution dictates how to partition the work.

CL_DEVICE_OPENCL_C_VERSION — Dictates which kernel (not API) features you can use (e.g. 1.2 or 2.0).

CL_DEVICE_MAX_SAMPLERS — Consider scaling the workload by the number of samplers available. In general for Intel devices, Intel subgroups provide better performance where applicable.

CL_DEVICE_IMAGE2D_MAX_WIDTH, CL_DEVICE_IMAGE2D_MAX_HEIGHT, and the related image limits — Helpful for understanding how image data structures should be partitioned.

CL_DEVICE_GLOBAL_MEM_SIZE and CL_DEVICE_MAX_MEM_ALLOC_SIZE — Don’t allocate more than you have available!

CL_DEVICE_LOCAL_MEM_SIZE — Useful for knowing the work-group-only memory access limits.

As a general rule, anything with MAX, SIZE, or PREFERRED in its name can be interesting to interrogate in order to partition work. Not all of them will be interesting for all devices, as there are some vendor and device implementation differences; see CL_DEVICE_MAX_COMPUTE_UNITS as one such example.

There are various other niche parameters that are useful. It’s highly recommended to read through the document, as it can alert developers to useful features.
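To sketch how these values feed back into partitioning (the original question of sending different amounts of work to different GPUs), here is a plain-C helper that splits a global work size proportionally to each device’s compute-unit count. The counts would normally come from clGetDeviceInfo with CL_DEVICE_MAX_COMPUTE_UNITS; they are passed in as plain integers here so the sketch stands alone, and the function name `split_by_compute_units` is just for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Split `total` work-items across `n` devices proportionally to each
 * device's compute-unit count (as reported by CL_DEVICE_MAX_COMPUTE_UNITS).
 * Integer division leaves a remainder, which is handed to the last device
 * so the shares always sum to `total`. */
static void split_by_compute_units(size_t total,
                                   const unsigned *compute_units,
                                   size_t n, size_t *share)
{
    unsigned total_cu = 0;
    size_t assigned = 0, i;

    for (i = 0; i < n; ++i)
        total_cu += compute_units[i];

    for (i = 0; i < n; ++i) {
        share[i] = total * compute_units[i] / total_cu;
        assigned += share[i];
    }
    share[n - 1] += total - assigned;  /* remainder goes to the last device */
}
```

For example, splitting 1024 work-items between a 24-CU device and a 72-CU device gives shares of 256 and 768. Raw compute-unit counts are only a first approximation; real tuning would also weigh clock frequency and memory bandwidth.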


4. Kernel Work Group Info parameters (clGetKernelWorkGroupInfo)

Use these in combination with the device info parameters to tune per kernel and per device.

CL_KERNEL_WORK_GROUP_SIZE — How large can a work-group be for this kernel on a given device?

CL_KERNEL_LOCAL_MEM_SIZE — Use this value to make sure the local memory used does not exceed the maximum expected on the device.

CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE — Using this value can be a performance advantage.

CL_KERNEL_PRIVATE_MEM_SIZE — Are the kernels using too much private memory for the target device? If so, this query provides a facility to count private (default) memory back out of execution.
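A sketch of how two of these queries combine when choosing a launch size: clamp the requested work-group size to the kernel’s maximum, then round down to the preferred multiple. The limit values are passed in as plain integers standing in for what clGetKernelWorkGroupInfo would return, and `pick_work_group_size` is a hypothetical helper name:

```c
#include <assert.h>
#include <stddef.h>

/* Choose a work-group size for a kernel launch: never exceed the kernel's
 * maximum (CL_KERNEL_WORK_GROUP_SIZE), and when possible round down to a
 * multiple of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, which tends to
 * map better onto the hardware. */
static size_t pick_work_group_size(size_t requested,
                                   size_t kernel_max,
                                   size_t preferred_multiple)
{
    size_t wg = requested < kernel_max ? requested : kernel_max;

    if (preferred_multiple > 0 && wg >= preferred_multiple)
        wg -= wg % preferred_multiple;  /* round down to the preferred multiple */
    return wg;
}
```

For example, with a kernel maximum of 256 and a preferred multiple of 32, a request for 300 is clamped to 256, and a request for 100 rounds down to 96; a request smaller than the multiple is left as-is.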


5. Extra

Exceeding some of these limits may fail somewhat opaquely and be hard to debug at runtime if an application does not check beforehand. Make sure to check and print all API error codes; a simple macro that prints the errors can be a lifesaver for new developers!
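A minimal version of such a macro might look like the following. The `cl_int` typedef and `CL_SUCCESS` define are stand-ins so the snippet is self-contained; real code would get them from CL/cl.h, and `CHECK_CL` is just an illustrative name:

```c
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

/* Stand-ins for CL/cl.h definitions so this sketch compiles on its own. */
typedef int cl_int;
#define CL_SUCCESS 0

/* Wrap any call that returns a cl_int status. On failure, print the error
 * code and source location, then exit, instead of silently carrying on. */
#define CHECK_CL(err)                                             \
    do {                                                          \
        cl_int _e = (err);                                        \
        if (_e != CL_SUCCESS) {                                   \
            fprintf(stderr, "OpenCL error %d at %s:%d\n",         \
                    _e, __FILE__, __LINE__);                      \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)
```

In real host code you would wrap each API call, e.g. `CHECK_CL(clGetPlatformIDs(1, &platform, NULL));`, so the first failing call names itself instead of a later call failing mysteriously.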

There are some equivalent built-in calls in OpenCL-C that let kernels get similar information device side, so it may not be necessary to send everything over from the host. Pages 6 and 9 of the 2.1 reference card are relevant here:

In this reference guide, please take care to observe the color coding to apply the appropriate revision for your targets.

For many developers with a future focus, development is becoming more heterogeneous, spanning different types of devices and architectures.

  • Managing how work is sent to different devices will, in many instances, become more important. Just reading the bit-flags in the API manual docs above should help a developer’s understanding.
  • Just like with vector compute units and vector instructions on CPUs, laying out work for alternative devices is a challenge when extracting performance from the device.
  • That being said, much effort goes into the Intel implementations to both compile device kernels and schedule kernel execution so they map well to a particular architecture when no developer parameters are supplied.
  • In reality, ensuring that memory access is unimpeded is one of the biggest bottleneck alleviators. Intel® VTune™ Amplifier can provide fairly quick suggestions about better memory access patterns to eliminate extraneous overhead.


Thanks again for the comment.



Thank you for the very important information, MichaelC.

This work does not seem easy.

So, I want to know: is there any high-level library that can automatically divide a large task and dispatch it to several GPUs?

I know it is difficult, but I am still hopeful.
