- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Consider the following kernel:
[cpp]
__kernel void test(__global float2 *output, __global float2 *input)
{
__local float lmem[8];
float2 a;
const size_t tid = get_global_id(0);
if(tid / 8 == 0)
{
a = input[tid];
}
else
{
return;
}
lmem[tid] = -a.x;
barrier(CLK_LOCAL_MEM_FENCE);
a.x = lmem[tid];
barrier(CLK_LOCAL_MEM_FENCE);
output[tid] = a;
}
[/cpp]
If I execute it with global size == local_size == 16 and pass an array of 16 float2 elements as input:
[plain]input = [ 0.+0.j 1.+0.j 2.+0.j 3.+0.j 4.+0.j 5.+0.j 6.+0.j 7.+0.j
8.+0.j 9.+0.j 10.+0.j 11.+0.j 12.+0.j 13.+0.j 14.+0.j 15.+0.j][/plain]
and a zero-filled buffer as output, I expect the first 8 elements of the output to have their real parts negated in the output array, while the rest of it remaining untouched:
[plain]output = [-0.+0.j -1.+0.j -2.+0.j -3.+0.j -4.+0.j -5.+0.j -6.+0.j -7.+0.j 0.+0.j
0.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j][/plain]
This is what happens on Ubuntu 12.04 x64, nVidia CUDA 5 platform, Tesla C2050 device. But on the same operating system, Intel OpenCL XE SDK 2013 3.0.67279, and Intel Xeon E5620 the whole resulting buffer remains untouched:
[plain]output = [ 0.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j
0.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j][/plain]
The output coincides with the reference CUDA output if I do any of the following:
- Comment out the barriers;
- Use float arrays instead of float2
- Initialize "a" inside the kernel instead of reading it from input (i.e. as "a = (float2)(tid, 0)").
Has anyone encountered such behavior? Is it a bug, or am I making incorrect assumptions about how barriers work?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hello vladimirsson!
According to OpenCL specification and common sense barrier() must be encountered by all work-items in a work-group executing the kernel or not encountered at all. From OpenCL specification:
This function must be encountered by all work-items in a work-group executing the kernel. If barrier is inside a conditional statement, then all work-items must enter the conditional if any work-item enters the conditional statement and executes the barrier. If barrier is inside a loop, all work-items must execute the barrier for each iteration of the loop before any are allowed to continue execution beyond the barrier.
In your kernel only part of work-items in work-group call barrier(), because remaining work-items finish execution inside if statement. So behavior is undefined.
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hello vladimirsson!
According to OpenCL specification and common sense barrier() must be encountered by all work-items in a work-group executing the kernel or not encountered at all. From OpenCL specification:
This function must be encountered by all work-items in a work-group executing the kernel. If barrier is inside a conditional statement, then all work-items must enter the conditional if any work-item enters the conditional statement and executes the barrier. If barrier is inside a loop, all work-items must execute the barrier for each iteration of the loop before any are allowed to continue execution beyond the barrier.
In your kernel only part of work-items in work-group call barrier(), because remaining work-items finish execution inside if statement. So behavior is undefined.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thank you, I must have missed that part in the specs. Something to look out for in other kernels too.

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page