I am running my OpenCL application on an Intel HD 530 graphics device and I am seeing excessive execution times for some kernels. On all previous devices I have tried (including earlier Intel HD GPUs), the measured time of my kernel over repeated calls has been fairly constant. On the HD 530, the execution time jumps between 3 ms (normal) and 40 ms(!).
I have two questions:
- In the attached screen shot, the Platform Analyzer shows the execution times for my kernels, but I am not sure how to interpret it, as each kernel is displayed several times, on different lines, with different execution times. Is the top line of colored boxes the "actual execution time", while the lower ones indicate when the command was issued from the CPU?
- Do you have any idea why my execution time would jump between 3 ms, which is normal, and 40 ms? When I enable more computation in my kernel, the 3 ms increases, but the 40 ms stays constant. As I mentioned above, this has never happened on other devices (NVIDIA, AMD, Intel), which makes me think something else is going on on the GPU.
Platform Analyzer is a fairly limited tool: what you see on the timeline is the GPU OpenCL command queue, not the actual kernel running on the GPU. To get a better idea of what is going on, you can download Intel(R) VTune(TM) Amplifier 2016 from https://software.intel.com/en-us/intel-vtune-amplifier-xe/ - there is a free 30-day evaluation version, so you can decide first whether it is useful for your purposes.
Then take a look at this article: https://software.intel.com/en-us/articles/intel-vtune-amplifier-xe-getting-started-with-opencl-perfo... for a quick intro to using VTune for OpenCL analysis.
Please send me a screenshot of the Vtune analysis for a more complete picture.
If you can provide me a sample reproducer (you can send it to me by private message) that would be even better.
Also, please let me know the details of your system: the version of the graphics driver, the OS you are running on, whether your system is a laptop or a desktop, and whether you are measuring performance of your kernels while plugged in into a power source or not.
Thanks for the quick reply. I am currently evaluating some new computers with these Intel chips, so I need to sort this out before I can conclude whether they are good enough.
Attached you'll find a screen shot from VTune showing the OpenCL compute queue. It is the "green kernel" in the graph that confuses me. Most often it spends 25 ms doing apparently nothing (EU array stalled), but sometimes it executes in 1.6 ms, which is good.
The fast execution (1.6 ms) is visible to the left in the graph and also marked in the list above, while a slow execution is visible to the right and also in the list.
I cannot ship any sample code at the moment.
Here are some details on my system.
I have tried on two separate desktop computers, both plugged in with power during profiling.
- Intel i7-6700T with HD530 graphics, Windows 7. Graphics driver version 10.18.15.4259
This is the one with the VTune screenshot attached.
- Intel i5-6600T with HD530 graphics, Windows 10. Graphics driver "latest downloaded yesterday"
Well, looks like you are on the bleeding edge :) - first sighting of Skylake chips in the wild :)
First, could you try the latest driver from here: https://downloadcenter.intel.com/product/88345/Intel-HD-Graphics-530-for-6th-Generation-Intel-Core-P... - the driver version should be 18.104.22.16879. If you are still experiencing similar issues, we will need to figure out how to replicate the issue on our side. This could still be a driver issue, which may already be fixed on the internal driver mainline (there is a lag of ~2 months between the released version of the driver and what's on the mainline), or it could be a hardware issue.
We could get a non-disclosure agreement in place so that you can safely share your code with us or, if you can replicate the behavior on a simpler kernel that does not contain your IP, that would be great.
On the Windows 10 computer I was already running the latest driver (22.214.171.12479), but on the Win7 machine it wouldn't install properly. Either way, there was no difference in behaviour.
Anyway, I think I found the reason for the lagging behaviour. In my kernel I am working with images (image2d_t). I have multiple images that I read from and one destination image that I write to. The source images are allocated with 128-byte alignment, but my destination image can be of arbitrary size. All images are on the order of ~1000-3000 pixels wide.
It turned out that at specific destination image sizes my kernel performed as expected. This suggests I should allocate all images with a width that is a multiple of N, where N is still unknown to me. I will do some reading to see whether this multiple can be queried from the device. If you have any hints, feel free to post them :)
I think it is a good idea for N to be a multiple of 64 (or more precisely, for your row width in bytes to be a multiple of 64 bytes; so depending on your pixel size, widths that are multiples of 16, 32, or 64 pixels should be good). I am curious to know which image sizes give you trouble.
Here is an interesting explanation of your performance degradation from our performance engineer:
This VTune data points to a bandwidth-limited kernel. The low dispatch rate (GPU Computing Thread row) combined with the near-100% EU stalls (red line in the GPU Execution Units row) implies this. The kernel heavily uses images, based on the elevated Texture Sample rate.
I’d guess that the long kernel execution time is due to a large number of cache misses. If the data set used for this enqueue is not spatially localized, then you will miss the sampler cache, the L3 cache, and maybe even the LLC, and every access could potentially be fetched from DRAM. SKL sampler bandwidth is much higher than DRAM bandwidth. This could explain why the same kernel is fast sometimes and slow other times.
Experiment to test theory: Modify kernel to only sample from the same image coordinate (e.g. 0, 0). If this makes all kernel enqueues fast then this supports the theory.
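The suggested experiment could look something like the following OpenCL C sketch (kernel and argument names are hypothetical). Every work-item reads the same texel, so all samples hit the sampler cache; if bandwidth is the bottleneck, every enqueue of this variant should be fast.

```c
__constant sampler_t smpFixed = CLK_NORMALIZED_COORDS_FALSE |
                                CLK_ADDRESS_CLAMP_TO_EDGE |
                                CLK_FILTER_NEAREST;

__kernel void sampleFixedCoord(__read_only image2d_t src,
                               __write_only image2d_t dst)
{
    int2 pt = (int2)(get_global_id(0), get_global_id(1));
    /* Every work-item reads coordinate (0, 0): no cache misses, so
       execution time no longer depends on the data-set layout. */
    uint4 v = read_imageui(src, smpFixed, (int2)(0, 0));
    write_imageui(dst, pt, v);
}
```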
Why might this issue only be seen on SKL? If the customer is comparing against only discrete cards then this issue may not be as observable. Discrete cards have dedicated, fast access DRAM. It’s expected that discrete cards will have an advantage on DRAM-BW-limited workloads over integrated graphics.
Thanks Robert, both for the hints about memory alignment (image width) and the explanation of my VTune screen shot.
When using a width that is a multiple of 64 bytes, my kernel does indeed run fast all the time. I get the impression the kernel struggles at every image width except the correctly aligned ones.
We have an internal discussion on what could have possibly gone wrong in your case. Could you please answer a couple of questions regarding your workload?
1. Which earlier Intel HD devices have you tried?
2. How did you create images previously vs how are you creating them now?
3. What is the format of your images?
4. Which read_image and write_image calls do you use inside your kernels? If there are several, what is the access pattern?
This information will help us create a synthetic benchmark, since we are not in complete agreement as to what is going on in your case.
Thank you very much in advance for your help!
I'll do my best to answer your questions.
- I have previously tested the devices listed below. When running my tests on those, I didn't observe the effect where the kernel time jumps between high and low, as it does on the HD 530.
- More or less the same code has been used on all tests. At least when it comes to image creation and read/write operations.
- Here is how I create the images, and you can see the image format used:
imgFormat.image_channel_data_type = CL_UNSIGNED_INT8;
imgFormat.image_channel_order = CL_R;
dIm = clCreateImage2D(context, CL_MEM_READ_WRITE, &imgFormat, width, height, 0, nullptr, &err);
- I tried to simplify my kernel, and it boils down to something like this. pIm1-4 are all aligned to 128 bytes, but pDstIm has been of arbitrary size during testing and always smaller than pIm1-4.
__kernel void myKernel(
    __read_only image2d_t pIm1,
    __read_only image2d_t pIm2,
    __read_only image2d_t pIm3,
    __read_only image2d_t pIm4,
    __write_only image2d_t pDstIm,
    const __global float* buf1,
    const __global float* buf2,
    const __global float* buf3,
    const int width,   // destination size, assumed to be passed as arguments
    const int height)
{
    // Samplers cannot be kernel arguments with default values;
    // they are declared as constants in the kernel body instead.
    const sampler_t smp = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_LINEAR;
    const sampler_t smpNearest = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;

    int x = get_global_id(0);
    int y = get_global_id(1);
    int2 ptDst = (int2)(x, y);
    if (x < width && y < height)
    {
        // f(), g() and the index bIdx are application-specific and elided here.
        float2 ptSrc = f(x, y);
        float b1 = buf1[bIdx];
        float b2 = buf2[bIdx];
        float b3 = buf3[bIdx];
        // Note: the OpenCL spec does not allow CLK_FILTER_LINEAR with
        // read_imageui; linear filtering requires read_imagef.
        uint4 val1 = read_imageui(pIm1, smpNearest, ptSrc);
        uint4 val2 = read_imageui(pIm2, smp, ptSrc);
        uint4 val3 = read_imageui(pIm3, smp, ptSrc);
        uint4 val4 = read_imageui(pIm4, smp, ptSrc);
        uint4 val;
        val.x = g(val1, val2, val3, val4);
        write_imageui(pDstIm, ptDst, val);
    }
}