OpenCL* for CPU

What's wrong with this kernel?

renegr
New Contributor I
struct RGB16
{
  ushort R, G, B;
};

__kernel void scale2D_3u16(__global ushort* restrict in, __global ushort* restrict out)
{
  int xPos = get_global_id(0);
  int yPos = get_global_id(1);
  int p = yPos*get_global_size(0)+xPos;
  
  __global struct RGB16* pRGBIn = (__global struct RGB16*)in;
  __global struct RGB16* pRGBOut = (__global struct RGB16*)out;
  pRGBOut[p] = pRGBIn[p];
}

Input data (memory view) is:

0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4 ...

On NVIDIA and AMD the output is identical to the input. But with Intel OpenCL on an HD 4600 I get:

0, 0, 0, 0, 0, 1, 2, 2, 2, 0, 0, 3, 4, 4, 4, ...

I'm totally confused!

5 Replies
renegr
New Contributor I

Additional note: the following code works as expected.

  pRGBOut[p].R = pRGBIn[p].R;
  pRGBOut[p].G = pRGBIn[p].G;
  pRGBOut[p].B = pRGBIn[p].B;

 

Robert_I_Intel
Employee

The problem is that, the way you wrote your code in the first example, two adjacent work items are reading and writing overlapping data, and there is no guarantee of the order in which the work items will write their data out. In the second case you just got lucky :) Technically, you would need barriers between those instructions in the second case to guarantee the results.

To fix, try to make sure that work items read and write non-overlapping pieces of memory.

renegr
New Contributor I

Thank you Robert,

unfortunately I don't see any overlapping within this code. Maybe I'm totally blind. It is nearly the same code as I use within a "parallel_for()". I could rewrite the code to:
 

__kernel void scale2D_3u16(__global ushort* restrict in, __global ushort* restrict out)
{
  int xPos = get_global_id(0);
  int yPos = get_global_id(1);
  int p = (yPos*get_global_size(0)+xPos)*3;

  out[p+0] = in[p+0];
  out[p+1] = in[p+1];
  out[p+2] = in[p+2];
}

I think this overlapping occurs because the addresses of pRGBOut/pRGBIn are not multiples of 4/8/16. Is this correct?
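
For anyone following the overlap and alignment question, here is a minimal host-side sketch in plain C, not OpenCL C. It assumes uint16_t stands in for ushort and that the struct has no padding, which holds on common ABIs but is checked rather than guaranteed. The helper names (first_byte, end_byte, ranges_overlap, offset_alignment) are made up for illustration. It only models the memory layout; it does not reproduce the behaviour seen on the HD 4600.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Host-side stand-in for the kernel's struct; uint16_t plays the role of ushort. */
struct RGB16 {
    uint16_t R, G, B;
};

/* Assumption made explicit: three 16-bit members, no padding. */
static_assert(sizeof(struct RGB16) == 6, "expected a packed 6-byte pixel");

/* Byte offset of the first byte that work item p touches in the pixel array. */
static size_t first_byte(size_t p) {
    return p * sizeof(struct RGB16);
}

/* Byte offset one past the last byte that work item p touches. */
static size_t end_byte(size_t p) {
    return first_byte(p) + sizeof(struct RGB16);
}

/* True if the byte ranges of work items p and q share at least one byte. */
static bool ranges_overlap(size_t p, size_t q) {
    return first_byte(p) < end_byte(q) && first_byte(q) < end_byte(p);
}

/* Largest power-of-two alignment (capped at 16) of work item p's first byte. */
static size_t offset_alignment(size_t p) {
    size_t offset = first_byte(p);
    size_t align = 16;
    while (align > 1 && offset % align != 0)
        align /= 2;
    return align;
}
```

Under these assumptions, work item p covers bytes 6*p up to 6*p+6, so neighbouring work items touch adjacent but disjoint ranges. The element offsets are multiples of 2 but not always of 4, 8 or 16, so an implementation that treats each pixel as a wider, more strictly aligned access could plausibly misbehave, although this sketch cannot confirm that this is what happens here.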

Robert_I_Intel
Employee

Hi Rene,

My bad: there is no issue with your original code - the problem is with the Intel compiler and how it handles structs, so indeed the workaround is:

pRGBOut[p].R = pRGBIn[p].R;
pRGBOut[p].G = pRGBIn[p].G;
pRGBOut[p].B = pRGBIn[p].B;
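
To check a device result against what the kernel should produce, a host-side reference can help. The following is a rough sketch in plain C, assuming uint16_t in place of ushort and nested loops in place of the 2D NDRange. The function names copy_whole_struct, copy_per_member and first_mismatch are hypothetical and are not part of any OpenCL API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Host-side stand-in for the kernel's struct; uint16_t plays the role of ushort. */
struct RGB16 {
    uint16_t R, G, B;
};

/* Original kernel body: one whole-struct assignment per work item. */
static void copy_whole_struct(const struct RGB16 *in, struct RGB16 *out,
                              size_t width, size_t height) {
    assert(in != out); /* mirrors the kernel's restrict qualifiers */
    for (size_t y = 0; y < height; ++y) {
        for (size_t x = 0; x < width; ++x) {
            size_t p = y * width + x; /* same index as yPos*get_global_size(0)+xPos */
            out[p] = in[p];
        }
    }
}

/* Workaround body: copy each member separately. */
static void copy_per_member(const struct RGB16 *in, struct RGB16 *out,
                            size_t width, size_t height) {
    assert(in != out);
    for (size_t y = 0; y < height; ++y) {
        for (size_t x = 0; x < width; ++x) {
            size_t p = y * width + x;
            out[p].R = in[p].R;
            out[p].G = in[p].G;
            out[p].B = in[p].B;
        }
    }
}

/* Index of the first pixel where a and b differ, or count if they match. */
static size_t first_mismatch(const struct RGB16 *a, const struct RGB16 *b,
                             size_t count) {
    for (size_t i = 0; i < count; ++i) {
        if (a[i].R != b[i].R || a[i].G != b[i].G || a[i].B != b[i].B)
            return i;
    }
    return count;
}
```

On the host both variants should give the same output as the input. Running first_mismatch on the buffer read back from the device against this reference points at the first pixel that went wrong.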

 
Robert_I_Intel
Employee

BTW, your last proposal should work as well. 
