OpenCL implementation details

Helder_Vieira · 09-05-2018

Hello.
I've been doing a lot of experiments with OpenCL in the last two months or so.
More specifically, I've been using the NOpenCL library (created by Tunnel Vision Labs) to perform OpenCL tasks in C# applications on a low-end laptop (Intel i7-4510U CPU / Intel HD Graphics 4400 + AMD Radeon R7 M260).
Being an application developer, most of my work won't fit a SIMD model. However, the performance gains of using the 4400 GPU instead of the CPU ( even when using kernels with several branching points ) are so significant that the issue becomes irrelevant.
Unfortunately, all the OpenCL-related documentation I've read so far is quite vague about how its concepts relate to specific hardware, and I couldn't find a single piece of information about how Intel chose to implement OpenCL in its GPU line(s). As such, I'm completely blind when preparing the command queues. Are work-groups in some way related to the HD 4400's 20 pipelines? How do compute units fit into the picture? By setting a local work size of 1, am I in some way forcing the use of a single thread inside a compute unit?
I'd say that in SIMD problems this type of question is probably moot, as long as one follows some general rule about dividing the task size. In any other case, it would be important to be aware of the penalties involved and the best strategies to minimize them. To do that, one needs to understand some OpenCL implementation details on specific chips or architectures.

So, if someone could share one or more links to relevant documentation on these issues, I'd be very grateful.

Thanks,

Helder Vieira

Ben_A_Intel · 09-05-2018

I did a webinar a while back that talks specifically about how Intel GPUs execute OpenCL kernels. The slides are still online, here:

https://software.intel.com/en-us/download/taking-advantage-of-intel-graphics-with-opencl

Have a look, and if you still have questions, feel free to ask here.

Note that the same concepts mostly apply to newer GPUs as well, though some of the numbers have changed, such as the number of EUs and number of sub-slices.
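To make the original question about a local work size of 1 concrete, here is a minimal Python sketch of the arithmetic involved. It does not call OpenCL. The function name `dispatch_shape` is hypothetical, and the model rests on an assumption not stated in this thread: that the work-items of one work-group are packed into the SIMD lanes of EU hardware threads, with a compiled SIMD width of 8, 16, or 32. It also assumes the global size is a multiple of the local size.

```python
import math

def dispatch_shape(global_size, local_size, simd_width=16):
    """Return (work_groups, hw_threads_per_group, lane_utilization).

    Assumption: each work-group is split into hardware threads that
    each cover `simd_width` work-items; unused lanes stay idle.
    """
    if global_size % local_size != 0:
        raise ValueError("global_size must be a multiple of local_size")
    work_groups = global_size // local_size
    # A work-group smaller than the SIMD width still needs one full thread.
    hw_threads_per_group = math.ceil(local_size / simd_width)
    lanes_reserved = work_groups * hw_threads_per_group * simd_width
    return work_groups, hw_threads_per_group, global_size / lanes_reserved

# Compare a few local sizes for the same 1024 work-items.
for local_size in (1, 16, 64):
    print(local_size, dispatch_shape(1024, local_size))
```

Under these assumptions, a local size of 1 keeps only one lane busy in each hardware thread, which is one reason to prefer a local size that is a multiple of the SIMD width.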

Michael_C_Intel1 · 09-12-2018

Duplicate thread from other forum... https://software.intel.com/en-us/node/785629

Posting the same response from a few days ago here:

Hello VieiraH,

Heads up... lots of this post is opinion... ask different folks and observe different takes...

We have a popular overview of Intel® Graphics Technology, and of the considerations for programming it, on Tech Decoded. It was delivered by one of our heterogeneous compute performance architects. https://techdecoded.intel.io/essentials/what-intel-processor-graphics-ge...

In general:

coalescing memory transfer and access and

running sufficient work-items for an embarrassingly parallel compute task

are where developers should start with OpenCL™ development anywhere, and in particular for Intel® Graphics Technology. Keep in mind that OpenCL™ is really more SPMD (single program, multiple data) than SIMD (single instruction, multiple data). Intel® Graphics Technology does have SIMD facilities that can be used to support OpenCL™, but OpenCL™ provisionings themselves are SPMD.
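The two starting points above, and the SPMD remark, can be illustrated with a small plain-Python model. This is only a sketch: it does not use OpenCL, and `run_ndrange`, `clamp_scale_kernel`, and `touched_indices` are hypothetical names made up for this example. The first part shows one program run once per work-item, branch included; the second shows which buffer indices neighboring work-items touch for a contiguous versus a strided access pattern.

```python
def run_ndrange(kernel, global_size, *args):
    """SPMD model: run the same kernel once per work-item id."""
    for gid in range(global_size):
        kernel(gid, *args)

def clamp_scale_kernel(gid, src, dst, factor):
    # Every work-item runs this same program on its own element.
    # The branch is legal in SPMD; each work-item takes its own path.
    if src[gid] < 0:
        dst[gid] = 0
    else:
        dst[gid] = src[gid] * factor

def touched_indices(first_gid, count, stride):
    """Indices read by `count` neighboring work-items for a given stride."""
    return [(first_gid + i) * stride for i in range(count)]

src = [1, -2, 3, 4]
dst = [0] * len(src)
run_ndrange(clamp_scale_kernel, len(src), src, dst, 10)
print(dst)

print(touched_indices(0, 4, 1))  # neighbors read adjacent elements
print(touched_indices(0, 4, 8))  # neighbors read elements far apart
```

With a stride of 1, neighboring work-items read adjacent elements, which is the pattern that can be coalesced into fewer memory transactions. With a larger stride, each work-item lands in a different region of the buffer.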

Khronos has the standard resource on https://www.khronos.org/registry/OpenCL/ for describing OpenCL™ provisionings. In particular, I recommend taking a look at the execution model, memory model, and programming model overviews.

There are really some good books out there that explain OpenCL™ provisionings... but I'll have to check on our rules for endorsing other commercial products in this forum (sorry).

Other resources:

OpenCL™ Developer Guide for Intel® Processor Graphics: https://software.intel.com/en-us/iocl_opg

This one is my favorite - https://software.intel.com/en-us/node/540452

Plenty of related performance consideration topics were outlined at this year's International Workshop on OpenCL (IWOCL): https://www.iwocl.org/iwocl-2018/conference-program/

We delivered a talk called "Improving Performance of OpenCL Workloads on Intel Processors with Profiling Tools", which may cover some of the topics of interest and has more references. The slide deck is linked from the IWOCL hub page.

Sidebar but worth mentioning... in terms of writing code toward a particular device... Between work-item/work-group scheduling and device implementations building for specific targets, much optimization is scheduled between the runtime and the dispatch facilities on the target. People looking for maximal tweakability may wish to look at the new open source Intel® Graphics Compute Runtime for OpenCL™ Driver implementation on GitHub to see how the Intel Graphics implementation operates. Keep in mind... this is all in user mode. For the interests laid out by the original post of this thread, this is likely more investigation than is reasonable.

On maximizing any OpenCL target device... make sure to consider the extensions available for that target. Khronos maintains the master registry of extension specifications: https://www.khronos.org/registry/OpenCL/

Specifically for the vector lane potential for Intel® Graphics Technology, see this example showing the feature in action: https://github.com/intel/compute-samples/tree/master/compute_samples/app...

For Intel® Graphics Technology this is the cl_intel_subgroups extension.
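As a rough illustration of what sub-groups make possible, here is a plain-Python model of a sub-group sum reduction. It is a sketch only: it does not call OpenCL or the extension, the name `sub_group_reduce_add_model` is hypothetical, and the sub-group size of 8 is an assumption (the real size depends on how the kernel is compiled). The idea being modeled is that work-items sharing a sub-group can combine their values directly, and each of them gets the combined result back.

```python
def sub_group_reduce_add_model(values, sub_group_size=8):
    """Model: each work-item receives the sum over its own sub-group.

    `values[i]` is the value held by work-item i. Consecutive work-items
    are assumed to form one sub-group of `sub_group_size` lanes.
    """
    result = []
    for start in range(0, len(values), sub_group_size):
        lanes = values[start:start + sub_group_size]
        total = sum(lanes)
        # Every lane in the sub-group sees the same reduced value.
        result.extend([total] * len(lanes))
    return result

# 16 work-items, two assumed sub-groups of 8 lanes each.
print(sub_group_reduce_add_model(list(range(16))))
```

In a real kernel the equivalent exchange happens between SIMD lanes rather than in a loop, which is why the extension is worth a look for lane-level data sharing.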

I hope this helps get you started and thank you for your interest. Good luck. Also... not that it's a hard requirement... please consider our OpenCL™ technology forum for future OpenCL™ specific topics: https://software.intel.com/en-us/forums/opencl. We want developer concerns to get the right eyes on them. OpenCL™ implementations do enable our computer vision developer tools, so there is a lot of crossover.

-MichaelC

Helder_Vieira · 09-13-2018

Thanks, Ben Ashbaugh and Michael C. I'm now aware that most of my doubts in the near future will be concentrated on the GPU structure, not on OpenCL. I'll be focusing for a while on the implications of both the fanning cache cascade ( forgive the expression ) and the threading mechanism ( at this moment I can't even imagine where or how it might be orchestrated ).

Again, thank you.

Helder