OpenCL* for CPU
Ask questions and share information on Intel® SDK for OpenCL™ Applications and OpenCL™ implementations for Intel® CPU.
公告
This forum covers OpenCL* for CPU only. OpenCL* for GPU questions can be asked in the GPU Compute Software forum. Intel® FPGA SDK for OpenCL™ questions can be ask in the FPGA Intel® High Level Design forum.

Using lookup table on reduction kernel produces incorrect results on GPU

Joose_S_
初学者
1,750 次查看

I have a kernel that takes in an array of integers and returns the index of the smallest element.

#ifndef LOCAL_SIZE
#define LOCAL_SIZE 8
#endif // LOCAL_SIZE

kernel void test( global int* in, global int* out )
{   
  int id = get_local_id(0);

  local int indx[LOCAL_SIZE];

  int temp = id;
  for (int i = id; i < 1024; i += LOCAL_SIZE)
  {
    temp = in < in[temp] ? i : temp;
  }
  indx[id] = temp;
  barrier(CLK_LOCAL_MEM_FENCE);

  for(int i = LOCAL_SIZE / 2; i!= 0; i>>=1)
  {
    if(id < i)
    {
      printf( "%4d: %3d, %4d: %3d\n", indx[id], in[indx[id]], indx[id + i], in[indx[id + i]] );
      indx[id] = in[indx[id]] < in[indx[id+ i]] ? indx[id] : indx[id + i];
    }
    barrier(CLK_LOCAL_MEM_FENCE);
    if(id == 0) printf("\n");
  }
  out[0] = indx[0];
}

 

Before the first barrier each work item finds its smallest value and places it into a local buffer. Everything works fine here.

 

In the for loop the results from each work item is reduced further to find the result. However the second to last iteration fails on GPU everytime: in[indx[id]] and in[indx[id + i]] both return the same value.

Operating system: Windows 7 Enterprise

Device Driver Version: 10.18.14.4280

Device: Intel HD 4600 & Processor Intel i5-4590

Works fine on CPU and Nvidia GTX 970

I've attached the kernel and host code to reproduce

0 项奖励
6 回复数
Robert_I_Intel
1,750 次查看

I am afraid there might be in race in your code. What about this:

#ifndef LOCAL_SIZE
#define LOCAL_SIZE 8
#endif // LOCAL_SIZE

kernel void test( global int* in, global int* out )
{   
  int id = get_local_id(0);

  local int indx[LOCAL_SIZE];

  int temp = id;
  for (int i = id; i < 1024; i += LOCAL_SIZE)
  {
    temp = in < in[temp] ? i : temp;
  }
  indx[id] = temp;
  barrier(CLK_LOCAL_MEM_FENCE);

  for(int i = LOCAL_SIZE / 2; i!= 0; i>>=1)
  {
    int tmp = indx[id], tmpi = indx[id + i];
    int val = in[tmp], vali = in[tmpi];
    barrier(CLK_LOCAL_MEM_FENCE);
    if(id < i)
    {
      printf( "%4d: %3d, %4d: %3d\n", tmp, val, tmpi, vali );
      indx[id] = val < vali ? tmp : tmpi;
    }
    barrier(CLK_LOCAL_MEM_FENCE);
    if(id == 0) printf("\n");
  }
  out[0] = indx[0];
}

 

0 项奖励
Joose_S_
初学者
1,750 次查看

Robert I. (Intel) wrote:

I am afraid there might be in race in your code. What about this:

Nope, the issue still persist even with this change.

0 项奖励
Joose_S_
初学者
1,750 次查看

However doing this fixed the problem:

#ifndef LOCAL_SIZE
#define LOCAL_SIZE 8
#endif // LOCAL_SIZE

kernel void test( global int* in, global int* out )
{   
  int id = get_local_id(0);

  local int indx[LOCAL_SIZE];

  int temp = id;
  for (int i = id; i < 1024; i += LOCAL_SIZE)
  {
    temp = in < in[temp] ? i : temp;
  }
  indx[id] = temp;
  barrier(CLK_LOCAL_MEM_FENCE);

  for(int i = LOCAL_SIZE / 2; i!= 0; i>>=1)
  {
    int tmp = indx[id]; 
    int val = in[tmp];
    barrier(CLK_LOCAL_MEM_FENCE);
    int tmpi = indx[id + i];
    int vali = in[tmpi];
    if(id < i)
    {
      printf( "%4d: %3d, %4d: %3d\n", tmp, val, tmpi, vali );
      indx[id] = val < vali ? tmp : tmpi;
    }
    barrier(CLK_LOCAL_MEM_FENCE);
    if(id == 0) printf("\n");
  }
  out[0] = indx[0];
}

I just don't quite understand why the program behaves the way it does.

0 项奖励
Robert_I_Intel
1,750 次查看

Ok, looks like a compiler team needs to take a look at this one :)

0 项奖励
Joose_S_
初学者
1,750 次查看
Hi Robert, Have you found out anything regarding the issue?
0 项奖励
Robert_I_Intel
1,750 次查看

Hi Joose,

I showed this to the compiler folks and they think this is a compiler bug. Will file a bug. Thanks!

0 项奖励
回复