
Incorrect kernel execution with barrier(CLK_LOCAL_MEM_FENCE)

Consider the following kernel:

[cpp]

__kernel void test(__global float2 *output, __global float2 *input)
{
    __local float lmem[8];
    float2 a;

    const size_t tid = get_global_id(0);

    if(tid / 8 == 0)
    {
        a = input[tid];
    }
    else
    {
        return;
    }

    lmem[tid] = -a.x;
    barrier(CLK_LOCAL_MEM_FENCE);
    a.x = lmem[tid];
    barrier(CLK_LOCAL_MEM_FENCE);

    output[tid] = a;
}

[/cpp]

If I execute it with global_size == local_size == 16 and pass an array of 16 float2 elements as input:

[plain]input = [  0.+0.j   1.+0.j   2.+0.j   3.+0.j   4.+0.j   5.+0.j   6.+0.j   7.+0.j
   8.+0.j   9.+0.j  10.+0.j  11.+0.j  12.+0.j  13.+0.j  14.+0.j  15.+0.j][/plain]

and a zero-filled buffer as output, I expect the real parts of the first 8 elements to be negated in the output array, while the rest remains untouched:

[plain]output = [-0.+0.j -1.+0.j -2.+0.j -3.+0.j -4.+0.j -5.+0.j -6.+0.j -7.+0.j 0.+0.j
  0.+0.j  0.+0.j  0.+0.j  0.+0.j  0.+0.j  0.+0.j  0.+0.j][/plain]

This is what happens on Ubuntu 12.04 x64 with the NVIDIA CUDA 5 platform on a Tesla C2050 device. But on the same operating system with the Intel OpenCL XE SDK 2013 3.0.67279 on an Intel Xeon E5620, the whole resulting buffer remains untouched:

[plain]output = [ 0.+0.j  0.+0.j  0.+0.j  0.+0.j  0.+0.j  0.+0.j  0.+0.j  0.+0.j  0.+0.j
  0.+0.j  0.+0.j  0.+0.j  0.+0.j  0.+0.j  0.+0.j  0.+0.j][/plain]

The output coincides with the reference CUDA output if I do any of the following:

  1. Comment out the barriers;
  2. Use float arrays instead of float2;
  3. Initialize "a" inside the kernel instead of reading it from input (i.e. as "a = (float2)(tid, 0)").

Has anyone encountered such behavior? Is it a bug, or am I making incorrect assumptions about how barriers work?


Accepted Solutions

Hello vladimirsson!

According to the OpenCL specification (and common sense), barrier() must be encountered either by all work-items in a work-group executing the kernel or by none at all. From the OpenCL specification:

This function must be encountered by all work-items in a work-group executing the kernel. If barrier is inside a conditional statement, then all work-items must enter the conditional if any work-item enters the conditional statement and executes the barrier. If barrier is inside a loop, all work-items must execute the barrier for each iteration of the loop before any are allowed to continue execution beyond the barrier.

In your kernel only some of the work-items in the work-group call barrier(), because the remaining work-items finish execution inside the if statement via the early return. So the behavior is undefined.
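For reference, one way to restructure the kernel so that every work-item reaches the barrier is to keep all work-items alive until after the barrier and guard only the memory accesses with a predicate. This is just a sketch of that approach (I have not run it on your hardware); a single barrier between the local-memory write and read should suffice here:

[cpp]

__kernel void test(__global float2 *output, __global float2 *input)
{
    __local float lmem[8];
    float2 a;

    const size_t tid = get_global_id(0);

    // Predicate instead of an early return, so no work-item
    // skips the barrier below.
    const bool active = (tid / 8 == 0);

    if(active)
    {
        a = input[tid];
        lmem[tid] = -a.x;
    }

    // Reached by ALL work-items in the work-group.
    barrier(CLK_LOCAL_MEM_FENCE);

    if(active)
    {
        a.x = lmem[tid];
        output[tid] = a;
    }
}

[/cpp]

With this structure the inactive work-items still execute the barrier, so the work-group's behavior is defined, and the inactive work-items never touch lmem or output.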


Thank you, I must have missed that part in the specs. Something to look out for in other kernels too.
