OpenCL* for CPU

Weird performance with loop unrolling

Joose_S_
Beginner

I have a kernel that calculates motion vectors using a full search with MSE (mean squared error). There is a strange performance issue with the following loop:

  #define W_SIZE 16
  // Sum the squared differences of the two source blocks (img1V, img3V)
  // against the candidate block in img2V.
  for (int y = 0; y != W_SIZE; y++)
  {
    for (uint x = 0; x != W_SIZE; x++)
    {
      float img1 = img1V[x + y * W_SIZE];
      float img2 = img2V[x + (localID & VALUE) + (y + localID / W_SIZE) * W_SIZE * 2];
      float img3 = img3V[x + y * W_SIZE];
      float result = img1 - img2;
      float result2 = img3 - img2;
      diffs += result * result;
      diffs2 += result2 * result2;
    }
  }

The whole kernel takes about 360ms to execute. However, if I change the outer loop to iterate from -1 to W_SIZE-1 and add 1 to y inside the loop (or use any other offset, for that matter), the execution time drops to 170ms.
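For clarity, this is the kind of offset variant I mean (a sketch only, not the full kernel; the indexing is identical, just shifted so the loop no longer starts at 0):

  for (int y = -1; y != W_SIZE - 1; y++)
  {
    for (uint x = 0; x != W_SIZE; x++)
    {
      float img1 = img1V[x + (y + 1) * W_SIZE];
      float img2 = img2V[x + (localID & VALUE) + (y + 1 + localID / W_SIZE) * W_SIZE * 2];
      float img3 = img3V[x + (y + 1) * W_SIZE];
      float result = img1 - img2;
      float result2 = img3 - img2;
      diffs += result * result;
      diffs2 += result2 * result2;
    }
  }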

The only explanation I've come up with is loop unrolling, which seems to happen only when the loop iterates from 0 up to a compile-time constant, but using #pragma unroll has practically no impact on performance. I also tried collapsing the two nested loops into one, but it still took 360ms to finish.

Does anybody have any idea what is causing this and if there is any way to fix it?

allanmac1
Beginner

I don't recall if #pragma unroll actually works in Intel's OpenCL compiler.  Anyone?

You could always try manually unrolling your inner loop.

Example:

#define W_SIZE 16

#define UNROLL_16()  \
  UNROLL_X(0)        \
  UNROLL_X(1)        \
  UNROLL_X(2)        \
  UNROLL_X(3)        \
  UNROLL_X(4)        \
  UNROLL_X(5)        \
  UNROLL_X(6)        \
  UNROLL_X(7)        \
  UNROLL_X(8)        \
  UNROLL_X(9)        \
  UNROLL_X(10)       \
  UNROLL_X(11)       \
  UNROLL_X(12)       \
  UNROLL_X(13)       \
  UNROLL_X(14)       \
  UNROLL_X(15)

for (int y=0; y<W_SIZE; y++)
  {
#undef  UNROLL_X
/* Braces give each expansion its own scope, so the float declarations don't collide. */
#define UNROLL_X(x) {                                                   \
    float img1    = img1V[x + y * W_SIZE];                              \
    float img2    = img2V[x + (localID & VALUE) + (y + localID / W_SIZE) * W_SIZE * 2]; \
    float img3    = img3V[x + y * W_SIZE];                              \
    float result  = img1 - img2;                                        \
    float result2 = img3 - img2;                                        \
    diffs        += result  * result;                                   \
    diffs2       += result2 * result2;                                  \
  }

    UNROLL_16();
  }

 

Joose_S_
Beginner
It's not the inner loop that's the problem but the outer one. The inner loop gets unrolled properly, and #pragma unroll did have an effect on it, but the compiler already optimises it well without me doing anything special. It seems I'm losing performance when the outer loop is unrolled.
allanmac1
Beginner

170-360ms equates to a huge amount of potential GFLOPS... unless you're actually memory bound.

Have you verified your memory transactions are "coalesced" and some multiple of 64 bytes?

The inner loop looks like it would map well to 16 threads (work-items).
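Roughly what I have in mind (a rough sketch under my own assumptions, not your actual kernel): let 16 work-items in dimension 0 each own one value of x, so the inner loop disappears. The 16 per-work-item partial sums in diffs/diffs2 would still need a reduction afterwards, which I've omitted.

  uint x = get_local_id(0) & (W_SIZE - 1);  // assumes the local size covers the 16 x values
  for (int y = 0; y != W_SIZE; y++)
  {
    float img1 = img1V[x + y * W_SIZE];
    float img2 = img2V[x + (localID & VALUE) + (y + localID / W_SIZE) * W_SIZE * 2];
    float img3 = img3V[x + y * W_SIZE];
    float result = img1 - img2;
    float result2 = img3 - img2;
    diffs += result * result;    // per-work-item partial sums; reduce across the 16 work-items later
    diffs2 += result2 * result2;
  }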

Joose_S_
Beginner
I should probably mention that the problem only exists when the code is run on the GPU; on the CPU it doesn't occur.
Allan M. wrote:

170-360ms equates to a huge amount of potential GFLOPS... unless you're actually memory bound.

That is the runtime for the whole kernel; in the 170ms case the loop itself accounts for about 60-80ms. The only explanation I can think of is that unrolling the loop causes a huge number of cache misses, because I don't see any other reason why it would take 4-5 times longer. I'm not too familiar with the structure of Intel's integrated GPU and its caches.
Allan M. wrote:

The inner loop looks like it would map well to 16 threads (work-items).

Unfortunately that's not possible. I'm already using 3 dimensions, and while I could restructure it to use two, the memory handling wouldn't allow breaking the kernel down any further.
allanmac1
Beginner

Joose S. wrote:
I should probably mention that the problem only exists when the code is run on the GPU; on the CPU it doesn't occur.

Based on your code snippet, it's unclear to me whether or not your kernel has been explicitly converted into "work item" and "work group" form that can take advantage of "wide" SIMT/SIMD architectures like Intel's IGP or a discrete GPU.

Forgive me if you've already done this. :)

Joose_S_
Beginner
Allan M. wrote:


Joose S. wrote:
I should probably mention that the problem only exists when the code is run on the GPU; on the CPU it doesn't occur.

Based on your code snippet, it's unclear to me whether or not your kernel has been explicitly converted into "work item" and "work group" form that can take advantage of "wide" SIMT/SIMD architectures like Intel's IGP or a discrete GPU.

Forgive me if you've already done this. :)

Yeah, I have split the workload and omitted most of the code. I'm mainly interested in why the "optimisation" the runtime/compiler performs makes the code significantly slower, and what I can do to prevent it. For some reason passing the -cl-opt-disable flag to clBuildProgram did nothing for the execution time, so I'm at a bit of a loss here.
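For reference, this is roughly how I'm passing the flag on the host side (a sketch; program and device stand in for my already-created cl_program and cl_device_id):

  /* Build options go in the fourth argument of clBuildProgram. */
  const char *options = "-cl-opt-disable";
  cl_int err = clBuildProgram(program, 1, &device, options, NULL, NULL);
  /* err is checked elsewhere; on failure, clGetProgramBuildInfo with CL_PROGRAM_BUILD_LOG gives details. */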
Robert_I_Intel
Employee

Hi Joose,

What processor, OS, and driver version are you using? Could you share the whole kernel and maybe a small reproducer? I am not aware of any way to stop loop unrolling on the GPU; it typically happens automatically. I will ask our driver folks.

Joose_S_
Beginner

Robert I. (Intel) wrote:

Hi Joose,

What processor, OS, and driver version are you using? Could you share the whole kernel and maybe a small reproducer? I am not aware of any way to stop loop unrolling on the GPU; it typically happens automatically. I will ask our driver folks.

i5-4590, Windows 7 Enterprise, Intel drivers version 10.18.14.4280

I've attached files that should reproduce the problem.

Changing the outer loop from y = 0; y != W_SIZE to y = 1; y != 17, and subtracting 1 from y everywhere it's used, greatly improves the performance.

Joose_S_
Beginner

I tried changing the loop to use vectors:

  __local float* maskStart = img2V + (localID & VALUE) + (localID & (~VALUE)) * 2;
  // diffs and diffs2 are float16 accumulators in this version
  for (int x = 0; x != 16; x++)
  {
    float16 img1 = vload16(x, img1V);
    float16 img2 = vload16(2 * x, maskStart);
    float16 img3 = vload16(x, img3V);
    float16 result = img1 - img2;
    float16 result2 = img3 - img2;
    diffs += result * result;
    diffs2 += result2 * result2;
  }

This got the execution time down to 220ms, but again, changing it to

  __local float* maskStart = img2V + (localID & VALUE) + (localID & (~VALUE)) * 2;
  for (int x = 1; x != 17; x++)
  {
    float16 img1 = vload16(x - 1, img1V);
    float16 img2 = vload16(2 * (x - 1), maskStart);
    float16 img3 = vload16(x - 1, img3V);
    float16 result = img1 - img2;
    float16 result2 = img3 - img2;
    diffs += result * result;
    diffs2 += result2 * result2;
  }

gets it down to 170ms.

Robert_I_Intel
Employee

Hi Joose,

This is what I got from compiler team:

We support the OpenCL 2.0 loop unroll attribute, which can be used to control loop unrolling. Search the spec for __attribute__((opencl_unroll_hint(n))); there is a usage sketch after the caveats below.

 

That being said, there are a couple of caveats:

- It doesn’t have any effect on older drivers, where we parsed the attribute but otherwise ignored it.

- It should work fine on SKL+ on any of the 15.45 drivers.  Even today, it won’t work on any of the BDW drivers built from the 15.40 branch, but e.g. Linux BDW drivers built from mainline will work fine.  Crazy, I know.

- Last time I checked we only parsed the attribute for OpenCL 2.0 compiles (with -cl-std=CL2.0), but not for OpenCL 1.2 compiles.
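For illustration, a sketch of how the hint would attach to the outer loop from this thread (an unroll factor of 1 effectively asks the compiler not to unroll; it is only a hint, and per the caveats above it needs -cl-std=CL2.0 on a supporting driver):

  __attribute__((opencl_unroll_hint(1)))  /* 1 = don't unroll; n > 1 = unroll by factor n */
  for (int y = 0; y != W_SIZE; y++)
  {
    for (uint x = 0; x != W_SIZE; x++)
    {
      /* loop body as in the original post */
    }
  }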
