
Is branch divergence as much of a concern with Intel Graphics as it is with other GPUs?


I'm new to GPU development and to the heavily SIMD nature of GPUs. I've read a fair amount about branch divergence and how excessive conditional branching in code executed on GPUs can stall GPU threads. I assume the answer is yes, but is this still a concern when targeting Intel Graphics with OpenCL?

I ask because we have a lot of code we're trying to offload, some of which contains a large number of branch conditionals. I assume a general algorithm rewrite will be necessary, or else we'll end up running even slower on the GPU.

Are there good patterns/practices for mitigating the effects of conditionals? Or is it just a generally accepted notion that you shouldn't have any for GPU code? 



Hi Ben,

In short, yes. As for dealing with lots of branches: the first question is why you have so many branches in the first place.

a) If you are trying to handle boundary conditions, there are a couple of things you could do: 1) use images instead of buffers, since they handle boundary conditions and out-of-bounds accesses for you; 2) pad your buffers so out-of-bounds accesses can't happen; 3) split the boundary handling into a separate kernel.

b) If your algorithm is inherently branchy, you may need to rethink the algorithm.
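As an illustration of one branch-mitigation pattern (a general sketch, not code from this thread): divergent if/else chains can often be rewritten as branch-free selection, which OpenCL C's select() and the ?: operator give you directly and which compilers can lower to predicated SIMD instructions. A plain-C sketch of the mask-and-select idea:

```c
/* Branch-free clamp: instead of
 *   if (v < lo) r = lo; else if (v > hi) r = hi; else r = v;
 * build all-ones/all-zeros masks and blend, so every lane of a
 * SIMD group executes the same instructions regardless of data. */
static int clamp_select(int v, int lo, int hi)
{
    int below = -(v < lo);                 /* 0xFFFFFFFF when v < lo, else 0 */
    int above = -(v > hi);                 /* 0xFFFFFFFF when v > hi, else 0 */
    return (lo & below) | (hi & above) | (v & ~(below | above));
}
```

In OpenCL C the equivalent would typically be written with select() or min()/max(); the point is that no control flow diverges between work items.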


Thanks again for your response.

Well, let me be a bit more explicit, then. What we're trying to do is mathematical morphology on the GPU; you can read more about the general idea on Wikipedia. What's important to note is that we have to scan a "window" over the pixels of an image and analyze the pixels within that window. If the window we're currently on matches a certain set of criteria, we want to do further analysis.

To make this GPU-friendly, I'm trying to take out the branch conditionals, i.e. all of the "if" statements. So I'm doing the processing in multiple passes: first analyzing all of the windows and finding the ones we want to examine further, then actually doing that analysis. However, even this isn't performing very well.

Here is a kernel I'm trying to execute: 

__kernel void FindEdges(__global ushort* iterateValues,
    __global char* writeValues,
    int height, int width, int maxDiff)
{
    const int x      = get_global_id(0);
    const int y      = get_global_id(1);
    const int stride = get_global_size(0);

    int i = y*stride + x;

    ushort val = iterateValues[i];
    int minval = val - maxDiff;
    int maxval = val + maxDiff;
    writeValues[i] = (val > 0) && ((x > 0 && y > 0 && (iterateValues[i - width - 1] == 0 || iterateValues[i - width - 1] < minval || iterateValues[i - width - 1] > maxval)) ||
        (x > 0 && (iterateValues[i - 1] == 0 || iterateValues[i - 1] < minval || iterateValues[i - 1] > maxval)) ||
        (x > 0 && y < height - 1 && (iterateValues[i + width - 1] == 0 || iterateValues[i + width - 1] < minval || iterateValues[i + width - 1] > maxval)) ||
        (y > 0 && (iterateValues[i - width] == 0 || iterateValues[i - width] < minval || iterateValues[i - width] > maxval)) ||
        (y < height - 1 && (iterateValues[i + width] == 0 || iterateValues[i + width] < minval || iterateValues[i + width] > maxval)) ||
        (x < width - 1 && y > 0 && (iterateValues[i - width + 1] == 0 || iterateValues[i - width + 1] < minval || iterateValues[i - width + 1] > maxval)) ||
        (x < width - 1 && (iterateValues[i + 1] == 0 || iterateValues[i + 1] < minval || iterateValues[i + 1] > maxval)) ||
        (x < width - 1 && y < height - 1 && (iterateValues[i + width + 1] == 0 || iterateValues[i + width + 1] < minval || iterateValues[i + width + 1] > maxval)));
}

For every pixel in the image I want to execute the above code, which analyzes not only the pixel itself but also its surrounding pixels (a.k.a. "the window"). If the large conditional above is true, I want to set a byte in the output buffer to 1; otherwise, 0. This is how I'm attacking the problem of finding the individual windows to analyze further.
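For what it's worth, the eight hand-written neighbor clauses can be collapsed into a loop over offset tables; whether an OpenCL compiler unrolls that loop well is something to measure, so treat this as a scalar reference sketch of the same test, not a drop-in replacement:

```c
typedef unsigned short ushort_t;

/* Scalar reference of the FindEdges test for one pixel (x, y):
 * returns 1 when the pixel is nonzero and any in-bounds neighbor
 * is zero or differs from it by more than maxDiff. */
static char find_edge(const ushort_t *img, int width, int height,
                      int x, int y, int maxDiff)
{
    static const int dx[8] = {-1,  0,  1, -1, 1, -1, 0, 1};
    static const int dy[8] = {-1, -1, -1,  0, 0,  1, 1, 1};
    int i = y * width + x;
    int val = img[i];
    if (val == 0)
        return 0;
    int minval = val - maxDiff, maxval = val + maxDiff;
    for (int k = 0; k < 8; ++k) {
        int nx = x + dx[k], ny = y + dy[k];
        if (nx < 0 || nx >= width || ny < 0 || ny >= height)
            continue;                       /* out-of-bounds neighbor skipped */
        int n = img[ny * width + nx];
        if (n == 0 || n < minval || n > maxval)
            return 1;
    }
    return 0;
}
```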

The problem is that when I run the equivalent conditional serially, the CPU completes in about 0.122 ms, while the GPU takes about 1.2 ms, so about 10x slower :( At least, that's what I measure when I surround the OpenCL call (which sets the kernel arguments, executes the kernel, waits, etc.) with timing calls.

So I'm wondering whether the many conditionals in that one large statement are hurting performance due to branch divergence, and if so, whether I should keep breaking it up further.

The buffer is 512*424*2 bytes long, so it ought to fit within the L3 cache of my processor (Skylake, which has 8 megabytes of L3). So in theory I shouldn't be hitting DRAM.

Here is the output of the run I did using the VTune analyzer: [screenshot attached]

And here is the architecture diagram for the same run: [screenshot attached]

Any thoughts? 



According to your VTune screenshot, the GPU kernel takes about 0.12 ms on average. That said, the GPU is idle 52% of the time while executing your kernel, and that's not good.

Also, a lot of your checks are boundary checks (x > 0, x < width - 1), so it might make sense to exclude the boundary altogether so all those checks go away. For the boundary itself, you can create a separate kernel that just writes 0s.
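The split being suggested can be sketched in scalar C like this (illustrative names, assuming a one-pixel border): one pass over the interior with no bounds checks at all, plus a second cheap pass that zeroes the border.

```c
/* Interior pass: every neighbor of an interior pixel is guaranteed
 * in-bounds, so all eight boundary checks disappear. */
static void process_interior(const unsigned short *in, char *out,
                             int width, int height, int maxDiff)
{
    for (int y = 1; y < height - 1; ++y)
        for (int x = 1; x < width - 1; ++x) {
            int i = y * width + x;
            int v = in[i], lo = v - maxDiff, hi = v + maxDiff;
            int edge = 0;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx) {
                    if (dx == 0 && dy == 0) continue;
                    int n = in[i + dy * width + dx];
                    edge |= (n == 0 || n < lo || n > hi);
                }
            out[i] = (char)(v > 0 && edge);
        }
}

/* Boundary pass: just writes 0s, as the reply suggests. */
static void zero_border(char *out, int width, int height)
{
    for (int x = 0; x < width; ++x)
        out[x] = out[(height - 1) * width + x] = 0;
    for (int y = 0; y < height; ++y)
        out[y * width] = out[y * width + width - 1] = 0;
}
```

In OpenCL you'd get the same effect by enqueuing the interior kernel with a global work offset of (1, 1) and a global size of (width-2, height-2), then a second trivial kernel for the border.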


Yeah, I noticed the 52% idle time. 

Do you think copying the buffers to local memory in a tiled fashion would help? I see that Intel already carves local memory out of a dedicated bank in L3, so it's probably not likely this would add much benefit.

I'll try what you suggest and remove some of the boundary conditions, but any thoughts on what would cause such a high idle time?


Local memory would not help much here. It helps with certain access patterns: local memory gives you 16 banks, each of which can serve a 4-byte fetch independently, as opposed to a single 64-byte cache-line fetch from global memory (see slide 54 of this presentation):


Idle time could be caused by: a) excessive branching; b) not enough work, e.g. if you have a GT2 part (HD Graphics 530 or something like that), you have 24 EUs * 7 HW threads * 32 work items (for a SIMD32-compiled kernel) = 5376 work items that can run simultaneously on the device, so you need well above 10x that amount to utilize the device (I think you are OK here); c) not enough work per kernel invocation, e.g. you read one byte, shift it right, and write it out: not enough reading, not enough math. Most of the older optimizations don't apply to Skylake, since compilers and hardware have improved, but you can run the Modulate and Sobel samples on your hardware to get an idea of what works.

Seriously, do you accept commission in beer? If we ever meet in person I'm buying you a round.