I have a kernel that calculates motion vectors with full search and MSE. There are weird performance issues with the following loop:
#define W_SIZE 16

for (int y = 0; y != W_SIZE; y++) {
    for (uint x = 0; x != W_SIZE; x++) {
        float img1 = img1V[x + y * W_SIZE];
        float img2 = img2V[x + (localID & VALUE) + (y + localID / W_SIZE) * W_SIZE * 2];
        float img3 = img3V[x + y * W_SIZE];
        float result  = img1 - img2;
        float result2 = img3 - img2;
        diffs  += result * result;
        diffs2 += result2 * result2;
    }
}
The whole kernel takes about 360ms to execute. However, if I change the outer loop to iterate from -1 to W_SIZE-1 and add 1 to y inside the loop (or use any other shifted bounds, for that matter), execution time drops to 170ms.
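In other words, the faster shifted variant looks like this (same body, both bounds moved down by one):

for (int y = -1; y != W_SIZE - 1; y++) {
    for (uint x = 0; x != W_SIZE; x++) {
        float img1 = img1V[x + (y + 1) * W_SIZE];
        float img2 = img2V[x + (localID & VALUE) + (y + 1 + localID / W_SIZE) * W_SIZE * 2];
        float img3 = img3V[x + (y + 1) * W_SIZE];
        float result  = img1 - img2;
        float result2 = img3 - img2;
        diffs  += result * result;
        diffs2 += result2 * result2;
    }
}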
The only reason I've come up with for this is loop unrolling that only happens when the loop iterates from 0 up to a compile-time constant, but using #pragma unroll has practically no impact on performance. I also tried collapsing the two nested loops into one, but it still took 360ms to finish.
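The single-loop version was structured roughly like this:

for (int i = 0; i != W_SIZE * W_SIZE; i++) {
    int y = i / W_SIZE;   /* recover row and column from the flat index */
    int x = i % W_SIZE;
    float img1 = img1V[x + y * W_SIZE];
    float img2 = img2V[x + (localID & VALUE) + (y + localID / W_SIZE) * W_SIZE * 2];
    float img3 = img3V[x + y * W_SIZE];
    float result  = img1 - img2;
    float result2 = img3 - img2;
    diffs  += result * result;
    diffs2 += result2 * result2;
}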
Does anybody have any idea what is causing this and if there is any way to fix it?
I don't recall if #pragma unroll actually works in Intel's OpenCL compiler. Anyone?
You could always try manually unrolling your inner loop.
Example:
#define W_SIZE 16

#define UNROLL_16() \
    UNROLL_X(0)  \
    UNROLL_X(1)  \
    UNROLL_X(2)  \
    UNROLL_X(3)  \
    UNROLL_X(4)  \
    UNROLL_X(5)  \
    UNROLL_X(6)  \
    UNROLL_X(7)  \
    UNROLL_X(8)  \
    UNROLL_X(9)  \
    UNROLL_X(10) \
    UNROLL_X(11) \
    UNROLL_X(12) \
    UNROLL_X(13) \
    UNROLL_X(14) \
    UNROLL_X(15)

for (int y = 0; y < W_SIZE; y++)
{
    /* Braces give each expansion its own scope so the locals
       don't collide across the 16 copies. */
#undef  UNROLL_X
#define UNROLL_X(x) {                                                                 \
        float img1 = img1V[x + y * W_SIZE];                                           \
        float img2 = img2V[x + (localID & VALUE) + (y + localID / W_SIZE) * W_SIZE * 2]; \
        float img3 = img3V[x + y * W_SIZE];                                           \
        float result  = img1 - img2;                                                  \
        float result2 = img3 - img2;                                                  \
        diffs  += result * result;                                                    \
        diffs2 += result2 * result2;                                                  \
    }
    UNROLL_16();
}
170-360ms equates to a huge amount of potential GFLOPS... unless you're actually memory bound.
Have you verified your memory transactions are "coalesced" and some multiple of 64 bytes?
The inner loop looks like it would map well to 16 threads (work-items).
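Something like this hypothetical sketch, assuming a work-group that is 16 wide in dimension 0 (localID and VALUE as in your kernel, and with the per-work-item diffs/diffs2 reduced across the work-group afterwards):

int x = get_local_id(0);   /* one work-item per column, 0..15 */

for (int y = 0; y != W_SIZE; y++) {
    float img1 = img1V[x + y * W_SIZE];
    float img2 = img2V[x + (localID & VALUE) + (y + localID / W_SIZE) * W_SIZE * 2];
    float img3 = img3V[x + y * W_SIZE];
    float result  = img1 - img2;
    float result2 = img3 - img2;
    diffs  += result * result;    /* per-work-item partial sums */
    diffs2 += result2 * result2;
}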
Allan M. wrote:
170-360ms equates to a huge amount of potential GFLOPS... unless you're actually memory bound.

That is the runtime for the whole kernel; of the 170ms run, the loop itself accounts for about 60-80ms. The only thing I can think of is that unrolling the loop causes a huge number of cache misses, because I don't see any other way it could take 4-5 times longer. I'm not too familiar with the structure of Intel's integrated GPU and its caches.

Allan M. wrote:
The inner loop looks like it would map well to 16 threads (work-items).

Unfortunately that's not possible. I'm already using 3 dimensions, and while I could change the structure to use two, the memory handling wouldn't allow breaking the kernel down any further.
Joose S. wrote:
I should probably mention that the problem only exists when the code is run on the GPU; on the CPU the problem doesn't exist.
Based on your code snippet, it's unclear to me whether or not your kernel has been explicitly converted into "work item" and "work group" form that can take advantage of "wide" SIMT/SIMD architectures like Intel's IGP or a discrete GPU.
Forgive me if you've already done this. :)
Allan M. wrote:
Based on your code snippet, it's unclear to me whether or not your kernel has been explicitly converted into "work item" and "work group" form that can take advantage of "wide" SIMT/SIMD architectures like Intel's IGP or a discrete GPU.
Forgive me if you've already done this. :)

Yeah, I have split the workload; I just omitted most of the code. I'm mainly interested in why this "optimisation" by the runtime/compiler makes the code significantly slower, and what I can do to prevent it. For some reason, passing the -cl-opt-disable flag to clBuildProgram did nothing for execution time, so I'm at a bit of a loss here.
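For reference, this is how I'm passing the flag (a minimal host-side sketch; program and device are already created, error handling omitted):

cl_int err = clBuildProgram(program,            /* program to build           */
                            1, &device,         /* build for a single device  */
                            "-cl-opt-disable",  /* disable all optimizations  */
                            NULL, NULL);        /* no completion callback     */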
Hi Joose,
What processor, OS, and driver version are you using? Could you share the whole kernel, and maybe a small reproducer? I am not aware of any way to stop loop unrolling on the GPU; it typically happens automatically. I will ask our driver folks.
Robert I. (Intel) wrote:
Hi Joose,
What processor, OS, and driver version are you using? Could you share the whole kernel, and maybe a small reproducer? I am not aware of any way to stop loop unrolling on the GPU; it typically happens automatically. I will ask our driver folks.
i5-4590, Windows 7 Enterprise, Intel drivers version 10.18.14.4280
I've attached files that should reproduce the problem.
Changing the outer loop from y = 0; y != W_SIZE to y = 1; y != 17, and subtracting 1 from y everywhere it's used, greatly increases the performance.
I tried changing the loop to use vectors:
__local float* maskStart = img2V + (localID & VALUE) + (localID & (~VALUE)) * 2;

for (int x = 0; x != 16; x++) {
    float16 img1 = vload16(x, img1V);
    float16 img2 = vload16(2 * x, maskStart);
    float16 img3 = vload16(x, img3V);
    float16 result  = img1 - img2;
    float16 result2 = img3 - img2;
    diffs  += result * result;
    diffs2 += result2 * result2;
}
That got the execution time down to 220ms, but once again, changing it to
__local float* maskStart = img2V + (localID & VALUE) + (localID & (~VALUE)) * 2;

for (int x = 1; x != 17; x++) {
    float16 img1 = vload16(x - 1, img1V);
    float16 img2 = vload16(2 * (x - 1), maskStart);
    float16 img3 = vload16(x - 1, img3V);
    float16 result  = img1 - img2;
    float16 result2 = img3 - img2;
    diffs  += result * result;
    diffs2 += result2 * result2;
}
gets it down to 170ms.
Hi Joose,
This is what I got from compiler team:
We support the OpenCL 2.0 loop unroll attribute, which can be used to control loop unrolling. Search the spec for: __attribute__((opencl_unroll_hint(n))) .
That being said, there are a couple of caveats:
- It doesn’t have any effect on older drivers, where we parsed the attribute but otherwise ignored it.
- It should work fine on SKL+ on any of the 15.45 drivers. Even today, it won’t work on any of the BDW drivers built from the 15.40 branch, but e.g. Linux BDW drivers built from mainline will work fine. Crazy, I know.
- Last time I checked, we only parsed the attribute for OpenCL 2.0 compiles (with -cl-std=CL2.0), but not for OpenCL 1.2 compiles.
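For example, to ask the compiler not to unroll the outer loop (the spec defines an unroll factor of 1 as "do not unroll"; remember to build with -cl-std=CL2.0):

__attribute__((opencl_unroll_hint(1)))   /* factor 1 = request no unrolling */
for (int y = 0; y != W_SIZE; y++) {
    for (uint x = 0; x != W_SIZE; x++) {
        /* ... loop body as before ... */
    }
}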
