VTune GPU Hotspot (processor graphics HW events) and OpenCL case studies? sample code?

Colin_Reinhardt · 01-24-2017

Hi,

I am teaching a Master's class this quarter at Univ. of Washington EE dept on Applied GPU computing.

We are using Intel OpenCL SDK, VTune, running on Skylake (i5-6500T CPU w/ HD Graphics 530).

I have been receiving excellent help from the Intel OpenCL group (Jeffrey McAllister, Robert Ioffe, Ben Ashbaugh, Michael Chu, and others) on lots of OpenCL details.

I am planning to use VTune to provide additional in-depth analysis of OpenCL kernel performance and optimization, but I was hoping to find some tutorials and/or samples specifically showing GPU Hotspots and OpenCL capabilities (We're using the latest VTune Amplifier XE 2017). I've read Julia Fedorova's VTune article from 2013 on "Getting Started with OpenCL...", but that is now several years old (used Intel Iris Graphics), and doesn't really provide a tutorial, just a good discussion.

Specifically, I'd like to use VTune to conduct STREAM-like [McCalpin, 1995] studies of memory performance/bandwidth and cache utilization, in order to create Roofline-type performance model analysis plots of compute-/memory-bound kernel behaviors for various arithmetic intensities.
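A minimal sketch of the Roofline bound described above, assuming placeholder peak values. `PEAK_GFLOPS` and `PEAK_GBPS` are illustrative numbers only, not measured or specified figures for HD Graphics 530, and the function names are hypothetical. The STREAM triad intensity uses the usual STREAM accounting of two reads and one write per element.

```python
def triad_intensity(bytes_per_elem=4):
    """Arithmetic intensity (FLOP/byte) of STREAM triad: a[i] = b[i] + s * c[i]."""
    flops = 2                          # one multiply and one add per element
    bytes_moved = 3 * bytes_per_elem   # read b, read c, write a
    return flops / bytes_moved


def roofline_gflops(intensity, peak_gflops, peak_gbps):
    """Attainable GFLOP/s: the lower of the compute roof and the bandwidth roof."""
    return min(peak_gflops, peak_gbps * intensity)


def ridge_point(peak_gflops, peak_gbps):
    """Arithmetic intensity at which a kernel stops being memory-bound."""
    return peak_gflops / peak_gbps


# Assumption: placeholder peaks, to be replaced with measured values.
PEAK_GFLOPS = 400.0
PEAK_GBPS = 30.0

for ai in (triad_intensity(), 1.0, 4.0, 16.0, 64.0):
    bound = roofline_gflops(ai, PEAK_GFLOPS, PEAK_GBPS)
    kind = "memory-bound" if ai < ridge_point(PEAK_GFLOPS, PEAK_GBPS) else "compute-bound"
    print(f"{ai:6.2f} FLOP/byte -> {bound:7.1f} GFLOP/s ({kind})")
```

Plotting the bound against arithmetic intensity on log-log axes gives the familiar roofline shape; measured kernel points would then be placed under it.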

So two initial questions:

1) Are there any other tutorials/code samples/reference specifications for GPU Hotspot analysis types, with Processor Graphics Hardware Events (Compute Basic, Compute Extended, Full Compute (?)), and OpenCL tracing? Perhaps a real world case-study?

2) I need to understand better the metrics available for the various blocks in the Graphics/Architecture plot (Sampler (L1,L2), L3 (SLM), GTI, LLC, DRAM), and how these map to OpenCL memory-model objects (device local memory, device global memory, device constant memory).

Thank you, Colin

Julia_F_Intel · 01-25-2017

Hi Colin,

This might be good material: http://www.iwocl.org/wp-content/uploads/iwocl-2015-talk-Propel-with-OpenCL-Intel.pdf

There is a case study with a Gaussian Blur filter starting from page #51.

The VTune GPU Architecture diagram should provide some help regarding GPU metrics.

Thank you,
Julia

Colin_Reinhardt · 01-25-2017

Hi Julia,

Thanks for the IWOCL slides link, these are useful. Also, what happened to the great 4-part video you did on VTune & OpenCL? I was watching that just the other day, but it appears the links have changed and I can't find it anymore.

Maybe you can answer some more specific questions I have:

1. Regarding the 64 KB/subslice of SLM (on Gen9 Intel Processor Graphics):

a) This is what is accessed by using "__local" qualifier in OpenCL kernels, correct?

b) Is this total amount shared among all workgroups running within a subslice? There could be more than 1 workgroup running concurrently within a subslice, correct?

c) Local memory (SLM) data cannot be directly shared/accessed between workgroups (on same subslice) can it?

d) What is the remaining amount of L3 cache used for (512 KB - 3x64 KB = 320 KB)? Is it accessible using OpenCL?

2. Does the GTI memory region (shown in the VTune GPU Architecture diagram) play any role in OpenCL kernel execution? Is it ever necessary to consider when tuning/optimizing?

3. Can the Intel Memory Latency Checker tool (Intel MLC) be used to measure any GPU/Intel Processor Graphics memory latencies and bandwidths?

Thank you, Colin

Julia_F_Intel · 01-25-2017

Hi Colin,

Unfortunately, I don't know what happened to that video. It was captured by someone else, and I don't have a copy.
But if you have questions about optimizing code for Intel GPUs, I will try to answer.
(Note: I am on another project now, and I might have forgotten some GPU-related things.)

Re- your questions:

1.
a) Yes. Either declare a "__local" variable inside the kernel or pass a "__local" pointer as a kernel parameter (this way the host code can change the size at run time).

b) The total amount of SLM used is approximately the SLM size requested by a kernel multiplied by the number of workgroups running simultaneously on a subslice.
It is only approximate because SLM is allocated in fixed-size quanta (if I remember correctly; I don't know the exact size of the quantum. My guess is a few KB, probably depending on some HW characteristics. You can experiment and deduce it).

Definitely, the total SLM used on a subslice can be less than 64 KB. 64 KB is the maximum the HW allows.

SLM should be used carefully: asking for too much SLM limits the number of workgroups running in parallel, which usually leaves the GPU underutilized, and that is generally not good. But it is up to the code/algorithm designer: sometimes the work can be done faster with a smart algorithm that does not use all the available concurrency of the GPU.
(But ideally, the more workgroups running in parallel, the better.)
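The trade-off above can be sketched with a little arithmetic. This is a rough model only: the 4 KB allocation quantum is a guess in the spirit of the "few KB" estimate above, not a documented value, and the function names are hypothetical. It considers SLM alone; other limits (hardware threads, barriers) also cap the number of concurrent workgroups.

```python
import math

SLM_PER_SUBSLICE = 64 * 1024   # bytes; the 64 KB per-subslice maximum from the thread
ASSUMED_QUANTUM = 4 * 1024     # bytes; assumption, the real quantum is not known here


def slm_allocated(requested_bytes, quantum_bytes=ASSUMED_QUANTUM):
    """Round a per-workgroup SLM request up to a whole number of quanta."""
    return math.ceil(requested_bytes / quantum_bytes) * quantum_bytes


def max_workgroups_by_slm(requested_bytes, quantum_bytes=ASSUMED_QUANTUM,
                          slm_total=SLM_PER_SUBSLICE):
    """Upper bound on concurrent workgroups per subslice imposed by SLM alone.

    Returns None when the kernel requests no SLM, since SLM is then not a limit.
    """
    allocated = slm_allocated(requested_bytes, quantum_bytes)
    if allocated == 0:
        return None
    return slm_total // allocated


for request in (5000, 20 * 1024, 40 * 1024):
    print(request, slm_allocated(request), max_workgroups_by_slm(request))
```

Under these assumptions, a kernel asking for 40 KB of SLM can have only one workgroup resident per subslice, while a 5000-byte request leaves room for several.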

c) SLM is shared (same region) between the work-items/threads of one workgroup. I believe this comes from the OCL spec and is not specific to Intel. It can't be shared between several workgroups; they share global memory. Each workgroup accesses its own SLM region.

d) Almost correct, with the difference that less than 64 KB of SLM per subslice may be in use. The remaining L3 is used [transparently] for caching global data, which is generally good for the OCL kernel code.

This is another reason why SLM should be used with care: it limits the amount of cache space available for other data.
Generally, SLM is useful if data is loaded there once and reused many times.

2. AFAIK, GTI is just an interface. On it we can measure the whole memory traffic between the GPU and the Uncore: there is Sampler cache traffic that bypasses L3, and other non-cacheable structures (though these are not used in OCL), but all of it goes through GTI.
So looking at GTI bandwidth is helpful: it tells how much of the available bandwidth between the GPU and the Uncore a kernel uses, e.g. whether it is saturating it or there is still some room that could be used.
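A small sketch of that saturation check, assuming you already have a measured GTI bandwidth (for example, read from the VTune GPU Architecture diagram) and a peak figure for your platform. The peak value, the 0.8 threshold, and the function names are all hypothetical choices for illustration.

```python
def bandwidth_utilization(measured_gbps, peak_gbps):
    """Fraction of the peak GPU-to-Uncore bandwidth a kernel is using."""
    if peak_gbps <= 0:
        raise ValueError("peak_gbps must be positive")
    return measured_gbps / peak_gbps


def classify_gti(measured_gbps, peak_gbps, saturation_threshold=0.8):
    """Label a kernel by how close its GTI traffic is to the assumed peak."""
    utilization = bandwidth_utilization(measured_gbps, peak_gbps)
    if utilization >= saturation_threshold:
        return "near saturation"
    return "headroom available"


# Assumption: 30 GB/s is a placeholder peak, not a platform specification.
print(classify_gti(27.0, 30.0))
print(classify_gti(10.0, 30.0))
```

A kernel classified as near saturation is unlikely to speed up from more parallelism alone; reducing traffic (for example through SLM reuse) would be the direction to explore.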

3. Sorry, I have not used Intel MLC.
VTune measures/exposes bandwidth inside the GPU, GPU to Uncore, and Uncore to DRAM, but not latencies.
There might be some future capabilities in VTune that could help pinpoint high-latency memory accesses, though.

Sorry if my comments are somewhat messy.
Thanks
Julia

Julia_F_Intel · 01-26-2017

Correction: Sampler traffic does go through L3.
On GTI we measure what was requested from L3, plus non-cacheable / some specialized buffer traffic (e.g. for video surfaces?).