Analyzing OpenCL performance on NUC8i7HVK

George_V_2 · ‎01-16-2019

I have been working on analyzing the performance of our proprietary OpenCL 1.2/OpenGL 4.3 engine on the NUC8.

Using the Intel GPA tools I was able to capture profiling data for our OpenGL calls (after disabling the use of a few extensions that aren't supported) and then load the data in the Frame Analyzer.

However, due to our engine's architecture, the GPU spends about 80% of its load in the OpenCL kernels, so I would like to be able to analyze the performance of those kernels, ideally in the same overview as the OpenGL shaders, so I can see the "full picture" of what happens in a frame.

So far, I haven't been able to find out if it can be done, and if so, how it should be done. Any feedback/pointers to potential solutions welcome.

Thanks

Michael_C_Intel1 · ‎01-18-2019

Hi Grecco,

Thanks for the interest and sharing the question.

A combined overview of the shaders and the kernels, as you describe, is not currently available. As an alternative for profiling the kernels that may achieve those goals, see Intel® VTune™ Amplifier.

Intel® VTune™ Amplifier provides mechanisms to show OpenCL™ kernel performance overviews. There should be views that show system-level and software-level bottlenecks that can be mitigated for an executing OpenCL™ kernel. Intel® VTune™ Amplifier has tooltips that provide recommendations if a particular functional component of the device is under- or oversubscribed, for example texture sampler versus compute unit array subscription. While Intel® VTune™ Amplifier has less visualization functionality for graphics API frames than Intel® GPA, it does allow review of kernel hotspots. It also has some views that show the corresponding OpenCL™ API calls.

Two general suggestions for OpenCL™ programming on Intel® heterogeneous devices...

1) Such processor SKUs may share an address domain between the CPU and Intel® Graphics Technology. As a corollary, using OpenCL™ mapped pointers to marshal data to the device ("zero-copy") incurs significantly less overhead than OpenCL™ writes and reads.

2) Specifically for GEN9 and later Intel® Graphics Technology microarchitectures, please see the subgroups OpenCL™ extension. Here is the walkthrough repository. On these architectures, subgroups are needed to reach near-theoretical performance for classic workloads like matrix multiply. Using subgroups is highly advised when 'cl_intel_subgroups' is detected during OpenCL™ device interrogation.
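
As a rough illustration of the detection step in point 2, here is a minimal sketch in C. It assumes the host code has already fetched the device's space-separated extension string with clGetDeviceInfo and CL_DEVICE_EXTENSIONS; the helper name has_extension is hypothetical and not part of OpenCL. It matches whole tokens, because a plain substring search for 'cl_intel_subgroups' would also hit 'cl_intel_subgroups_short'.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper (not an OpenCL API): returns true if `name` appears
 * as a whole space-separated token in `ext_list`.
 * `ext_list` is assumed to be the string returned by
 * clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, ...). */
bool has_extension(const char *ext_list, const char *name)
{
    assert(ext_list != NULL && name != NULL);

    size_t len = strlen(name);
    if (len == 0) {
        return false;
    }

    const char *p = ext_list;
    while ((p = strstr(p, name)) != NULL) {
        /* The match must start at the beginning of the list or after a space... */
        bool starts_token = (p == ext_list) || (p[-1] == ' ');
        /* ...and must end at a space or at the end of the list. */
        bool ends_token = (p[len] == ' ') || (p[len] == '\0');
        if (starts_token && ends_token) {
            return true;
        }
        p += 1; /* keep searching after a partial (prefix) match */
    }
    return false;
}
```

The host code could then choose between a subgroup-based kernel and a generic fallback depending on the result of has_extension(extensions, "cl_intel_subgroups").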

Thanks,

-MichaelC

George_V_2 · ‎01-21-2019

Hello Michael,

Thank you for the detailed reply.

I have tried using Intel VTune Amplifier but when trying to analyze GPU Compute/Media hotspots I get the following message:

"Cannot collect GPU hardware metrics because Metrics Discovery API library was not found. Make sure the Intel Graphics Driver is installed."

The NUC8i7HVK has both an HD Graphics 630 GPU and an AMD Radeon RX Vega M GH GPU, so I'm assuming this error is generated either because the analyzer does not recognize the installed drivers (Radeon 18.9.1/Intel 25.20.100.6323) or because the HD Graphics GPU is not active.

I have been trying to find a setting that controls which GPU is used for an application, but so far I have not been able to find out how to make my application run on the HD Graphics GPU. The Radeon GPU seems to be always on, and setting a preferred GPU in the Windows 10 display settings has no effect.

Is there a way to set which GPU will be used for an application?
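
For the OpenCL side at least, the device is normally chosen by the host code itself: it enumerates platforms with clGetPlatformIDs and devices with clGetDeviceIDs, then creates its context on whichever one it picks. The sketch below shows only the selection logic, in plain C, under the assumption that the vendor strings have already been read with clGetDeviceInfo and CL_DEVICE_VENDOR. The helper names name_contains and pick_device are hypothetical, and the vendor strings in the usage note are illustrative. Whether this is enough when the OpenCL context is shared with OpenGL is a separate question, since the kernels then have to run on the device that owns the GL context.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: case-insensitive substring test.
 * Returns 1 if `needle` occurs anywhere in `haystack`, else 0. */
int name_contains(const char *haystack, const char *needle)
{
    assert(haystack != NULL && needle != NULL);
    if (*needle == '\0') {
        return 1;
    }
    for (size_t i = 0; haystack[i] != '\0'; ++i) {
        size_t j = 0;
        while (needle[j] != '\0') {
            if (haystack[i + j] == '\0') {
                return 0; /* haystack is too short for any further match */
            }
            if (tolower((unsigned char)haystack[i + j]) !=
                tolower((unsigned char)needle[j])) {
                break;
            }
            ++j;
        }
        if (needle[j] == '\0') {
            return 1;
        }
    }
    return 0;
}

/* Hypothetical helper: returns the index of the first vendor string that
 * contains `wanted`, or -1 if none does. The strings are assumed to come
 * from clGetDeviceInfo(devices[i], CL_DEVICE_VENDOR, ...). */
int pick_device(const char *const vendors[], size_t count, const char *wanted)
{
    for (size_t i = 0; i < count; ++i) {
        if (name_contains(vendors[i], wanted)) {
            return (int)i;
        }
    }
    return -1;
}
```

With vendor strings such as "Advanced Micro Devices, Inc." and "Intel(R) Corporation", calling pick_device(vendors, 2, "intel") would select the second entry, and the corresponding cl_device_id could then be passed to clCreateContext.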

Thanks