Otherwise I have to judge the capabilities of every GPU and send a different amount of work to each of them.
That seems like an awful task.
Are there any OpenCL tools to deal with this situation?
Or any suggestions?
Thanks in advance.
Hi Sq Chen,
Thanks for the comment. I think it’s useful to expand on it a bit:
1. Summary
- Interrogation parameters let users dynamically partition their workload on both the host side and the target device side.
- Generally speaking, the more that can run in parallel the better… but this isn’t always the case… interrogation parameters can be used to cap scaling or to count memory usage against device constraints.
- Many developers use open-source public tools like ‘clinfo’, which contains examples of system interrogation for OpenCL™ devices.
- For tuning beyond what interrogation suggests, look to the ‘GPU’ analysis capabilities of the Intel® VTune™ Amplifier tool.
2. Goal
Determining device capabilities to best execute your program is an important consideration when moving to new hardware for better performance opportunities and features. Initial bring-up on a new platform can appear to be a burden, but the interrogation infrastructure here actually provides fairly useful feedback for scheduling and for making sure work fits on the device.
Compared to a host-only program, some things are similar… for example: understanding the maximum amount of memory you can allocate. Some things are different, like understanding the maximum number of work-items or the memory available under the different memory qualifiers: __global, __local, __private. Other things are just slightly different, like understanding which OpenCL-C revision is offered by the device, typically 1.2 or 2.0. The spirit here isn’t that much different from writing to different C revisions and managing deprecated usages of previous C standards.
Fortunately, many related parameters are offered through API interrogation calls. Let’s scan through the Khronos reference to look at the 2.1 API facilities… again, keep in mind the different revisions of the linked resources.
3. Device info parameters
https://www.khronos.org/registry/OpenCL/sdk/2.1/docs/man/xhtml/clGetDeviceInfo.html
CL_DEVICE_MAX_WORK_GROUP_SIZE
The Intel implementations will pick a work group size automatically if you don’t pick, but you may wish to choose another size to partition your work better with respect to __local memory access within a workgroup. Make sure to also see the description of CL_KERNEL_WORK_GROUP_SIZE in the other article linked below.
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS
The dimensions must be considered for partitioning the workload.
CL_DEVICE_MAX_WORK_ITEM_SIZES
The number of work items per kernel execution dictates how to partition the work.
CL_DEVICE_OPENCL_C_VERSION
Dictates which kernel (not API) features you can use (e.g. 1.2 or 2.0).
CL_DEVICE_MAX_SAMPLERS
Consider scaling the workload by the number of samplers available. In general for Intel devices, Intel subgroups provide better performance where applicable.
CL_DEVICE_IMAGE*D_MAX_[HEIGHT|WIDTH]
Helpful to understand how image data structures should be partitioned.
CL_DEVICE_IMAGE_MAX_BUFFER_SIZE
Don’t allocate more than you have available!
CL_DEVICE_LOCAL_MEM_SIZE
Useful for knowing the limits on workgroup-only (__local) memory.
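Pulling several of these together, here is a minimal sketch in C of interrogating the partitioning-related limits above for a single device `dev` (the function name is illustrative, and error checking is omitted for brevity; see section 5 below):

```c
#include <stdio.h>
#include <CL/cl.h>

/* Sketch: print the partitioning-related limits discussed above. */
static void print_partition_limits(cl_device_id dev)
{
    size_t   max_wg_size;      /* CL_DEVICE_MAX_WORK_GROUP_SIZE */
    size_t   item_sizes[3];    /* CL_DEVICE_MAX_WORK_ITEM_SIZES */
    cl_ulong local_mem_bytes;  /* CL_DEVICE_LOCAL_MEM_SIZE      */
    char     c_version[64];    /* CL_DEVICE_OPENCL_C_VERSION    */

    clGetDeviceInfo(dev, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(max_wg_size), &max_wg_size, NULL);
    /* One size_t per dimension; full-profile devices report at least 3. */
    clGetDeviceInfo(dev, CL_DEVICE_MAX_WORK_ITEM_SIZES,
                    sizeof(item_sizes), item_sizes, NULL);
    clGetDeviceInfo(dev, CL_DEVICE_LOCAL_MEM_SIZE,
                    sizeof(local_mem_bytes), &local_mem_bytes, NULL);
    clGetDeviceInfo(dev, CL_DEVICE_OPENCL_C_VERSION,
                    sizeof(c_version), c_version, NULL);

    printf("max work-group size : %zu\n", max_wg_size);
    printf("max work-item sizes : %zu x %zu x %zu\n",
           item_sizes[0], item_sizes[1], item_sizes[2]);
    printf("__local memory      : %llu bytes\n",
           (unsigned long long)local_mem_bytes);
    printf("kernel language     : %s\n", c_version);
}
```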
As a general rule, anything with MAX, SIZE, or PREFERRED in its name can be interesting to interrogate in order to partition work. Not every parameter will be interesting for every device, as there are some vendor and device implementation differences… see CL_DEVICE_MAX_COMPUTE_UNITS as one such example.
There are various other niche parameters that are useful. It’s highly recommended to read through the document, as it can alert developers to useful features.
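To connect this back to the original multi-GPU question: one rough heuristic (and only a heuristic, since compute units are not directly comparable across vendors or architectures) is to weight each device’s share of the work by CL_DEVICE_MAX_COMPUTE_UNITS. A minimal sketch, with an illustrative function name:

```c
#include <CL/cl.h>

/* Sketch: fill weights[i] with device i's fraction of the total work,
 * proportional to its reported compute units. Assumes ndev <= 16. */
static void weight_by_compute_units(const cl_device_id *devs, cl_uint ndev,
                                    float *weights)
{
    cl_uint units[16];
    cl_uint total = 0;

    for (cl_uint i = 0; i < ndev; ++i) {
        /* The spec guarantees every device reports at least 1 unit. */
        clGetDeviceInfo(devs[i], CL_DEVICE_MAX_COMPUTE_UNITS,
                        sizeof(units[i]), &units[i], NULL);
        total += units[i];
    }
    for (cl_uint i = 0; i < ndev; ++i)
        weights[i] = (float)units[i] / (float)total;
}
```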
4. Kernel Work Group Info parameters
Use these in combination with the device info parameters to tune per kernel and per device.
https://www.khronos.org/registry/OpenCL/sdk/2.1/docs/man/xhtml/clGetKernelWorkGroupInfo.html
CL_KERNEL_WORK_GROUP_SIZE
How large can a work group be for a given device?
CL_KERNEL_LOCAL_MEM_SIZE
Use this parameter to make sure the local memory used does not exceed the maximum expected on the device.
CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE
Sizing work-groups as a multiple of this value can be a performance advantage.
CL_KERNEL_PRIVATE_MEM_SIZE
Are the kernels using too much private memory for the target device? If so, this parameter provides a way to account for private (the default qualifier) memory usage before execution.
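As a sketch of how these combine, assuming `kernel` and `dev` have already been created (the function name is illustrative): pick a 1-D local size by rounding the kernel’s maximum work-group size down to a multiple of the preferred multiple.

```c
#include <CL/cl.h>

/* Sketch: choose a 1-D local size for this kernel on this device. */
static size_t pick_local_size(cl_kernel kernel, cl_device_id dev)
{
    size_t   max_wg = 0, preferred = 0;
    cl_ulong local_bytes = 0;

    clGetKernelWorkGroupInfo(kernel, dev, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(max_wg), &max_wg, NULL);
    clGetKernelWorkGroupInfo(kernel, dev,
                             CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                             sizeof(preferred), &preferred, NULL);
    clGetKernelWorkGroupInfo(kernel, dev, CL_KERNEL_LOCAL_MEM_SIZE,
                             sizeof(local_bytes), &local_bytes, NULL);

    /* Compare local_bytes against CL_DEVICE_LOCAL_MEM_SIZE here if the
     * kernel also uses dynamically sized __local buffers. */
    if (preferred == 0)
        return max_wg;
    size_t local = (max_wg / preferred) * preferred;
    return local ? local : max_wg;
}
```

Alternatively, passing NULL as the local size to clEnqueueNDRangeKernel lets the implementation choose, as noted for CL_DEVICE_MAX_WORK_GROUP_SIZE above.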
5. Extra
Exceeding some of these limits may fail somewhat opaquely… and be hard to debug at runtime if an application does not check beforehand. Make sure to check and print all API error codes… a simple macro to print the errors can be a life saver for new developers!
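One possible shape for such a macro (a sketch; the macro name is illustrative, and you may want different failure behavior than exiting):

```c
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

/* Wrap any status-returning OpenCL call and report failures with
 * file/line context and the raw error code. */
#define CL_CHECK(call)                                              \
    do {                                                            \
        cl_int _err = (call);                                       \
        if (_err != CL_SUCCESS) {                                   \
            fprintf(stderr, "%s:%d: %s failed with error %d\n",     \
                    __FILE__, __LINE__, #call, _err);               \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

/* Usage: CL_CHECK(clGetDeviceInfo(dev, CL_DEVICE_LOCAL_MEM_SIZE,
 *                                 sizeof(sz), &sz, NULL)); */
```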
There are some equivalent built-in calls for OpenCL-C kernels to get similar information device-side… so it may not be necessary to send everything over. Pages 6 and 9 of the 2.1 reference card are relevant here:
https://www.khronos.org/files/opencl21-reference-guide.pdf
In this reference guide, please take care to observe the color coding to apply the appropriate revision for your targets.
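For example, a kernel can discover its own execution geometry with standard OpenCL-C built-ins rather than receiving it as arguments (a sketch; the kernel name and body are illustrative only):

```c
__kernel void geometry_aware(__global float *out)
{
    size_t gid   = get_global_id(0);    /* this work-item's global index */
    size_t lsz   = get_local_size(0);   /* work-group size in dim 0      */
    uint   ndims = get_work_dim();      /* dimensions in this NDRange    */

    /* Illustrative only: record per-item what the device reports. */
    out[gid] = (float)lsz + (float)ndims;
}
```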
For many developers with a future focus, development is becoming more heterogeneous, spanning different types of devices and architectures.
- Managing how work is sent to different devices will in many instances become more important. Just reading the bitflags in the API manual docs above should help a developer’s understanding.
- Just like vector compute units and vector instructions on CPUs, laying out work for alternative devices presents a challenge when trying to get performance from the device.
- That being said, much effort goes into Intel implementations to both compile device kernels and schedule kernel execution to map well to a particular architecture… when no developer parameters are supplied.
- In reality, ensuring that memory access is unimpeded is one of the biggest bottleneck alleviators. Intel® VTune™ Amplifier can provide fairly quick suggestions about better memory access patterns to eliminate extraneous overhead.
Thanks again for the comment.
-MichaelC
Thank you for the very important information, MichaelC.
This work does not seem easy.
So I want to know: are there any high-level libraries that can automatically divide a large task and then dispatch it to several GPUs?
I know it is difficult, but I am still hopeful.
