
Incorrect kernel execution with barrier(CLK_LOCAL_MEM_FENCE)

Consider the following kernel:

[cpp]

__kernel void test(__global float2 *output, __global float2 *input)
{
    __local float lmem[8];
    float2 a;

    const size_t tid = get_global_id(0);

    if(tid / 8 == 0)
    {
        a = input[tid];
    }
    else
    {
        return;
    }

    lmem[tid] = -a.x;
    barrier(CLK_LOCAL_MEM_FENCE);
    a.x = lmem[tid];
    barrier(CLK_LOCAL_MEM_FENCE);

    output[tid] = a;
}

[/cpp]

If I execute it with global_size == local_size == 16 and pass an array of 16 float2 elements as input:

[plain]input = [  0.+0.j   1.+0.j   2.+0.j   3.+0.j   4.+0.j   5.+0.j   6.+0.j   7.+0.j
   8.+0.j   9.+0.j  10.+0.j  11.+0.j  12.+0.j  13.+0.j  14.+0.j  15.+0.j][/plain]

and a zero-filled buffer as output, I expect the real parts of the first 8 elements to be negated in the output array, while the rest remains untouched:

[plain]output = [-0.+0.j -1.+0.j -2.+0.j -3.+0.j -4.+0.j -5.+0.j -6.+0.j -7.+0.j 0.+0.j
  0.+0.j  0.+0.j  0.+0.j  0.+0.j  0.+0.j  0.+0.j  0.+0.j][/plain]

This is what happens on Ubuntu 12.04 x64 with the NVIDIA CUDA 5 platform on a Tesla C2050 device. But on the same operating system with the Intel OpenCL XE SDK 2013 3.0.67279 on an Intel Xeon E5620, the whole resulting buffer remains untouched:

[plain]output = [ 0.+0.j  0.+0.j  0.+0.j  0.+0.j  0.+0.j  0.+0.j  0.+0.j  0.+0.j  0.+0.j
  0.+0.j  0.+0.j  0.+0.j  0.+0.j  0.+0.j  0.+0.j  0.+0.j][/plain]

The output coincides with the reference CUDA output if I do any of the following:

  1. Comment out the barriers;
  2. Use float arrays instead of float2;
  3. Initialize "a" inside the kernel instead of reading it from input (i.e. as "a = (float2)(tid, 0)").

Has anyone encountered such behavior? Is it a bug, or am I making incorrect assumptions about how barriers work?


Accepted Solutions

Hello vladimirsson!

According to the OpenCL specification (and common sense), barrier() must be encountered either by all work-items in a work-group executing the kernel or by none at all. From the OpenCL specification:

This function must be encountered by all work-items in a work-group executing the kernel. If barrier is inside a conditional statement, then all work-items must enter the conditional if any work-item enters the conditional statement and executes the barrier. If barrier is inside a loop, all work-items must execute the barrier for each iteration of the loop before any are allowed to continue execution beyond the barrier.

In your kernel only some of the work-items in the work-group call barrier(), because the remaining work-items finish execution inside the if statement via the early return. So the behavior is undefined.
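For reference, one way to restructure the kernel so that every work-item reaches the barrier is to keep all work-items alive until after the barrier and guard only the memory accesses with a predicate. This is just a sketch of that approach (I have not run it on your hardware); a single barrier between the local-memory write and read should suffice here:

[cpp]

__kernel void test(__global float2 *output, __global float2 *input)
{
    __local float lmem[8];
    float2 a;

    const size_t tid = get_global_id(0);

    // Predicate instead of an early return, so no work-item
    // skips the barrier below.
    const bool active = (tid / 8 == 0);

    if(active)
    {
        a = input[tid];
        lmem[tid] = -a.x;
    }

    // Reached by ALL work-items in the work-group.
    barrier(CLK_LOCAL_MEM_FENCE);

    if(active)
    {
        a.x = lmem[tid];
        output[tid] = a;
    }
}

[/cpp]

With this structure the inactive work-items still execute the barrier, so the work-group's behavior is defined, and the inactive work-items never touch lmem or output.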


Thank you, I must have missed that part in the specs. Something to look out for in other kernels too.
