Inner loops with OpenCL

akhal · ‎07-31-2011

Hello

I am new to OpenCL and want to parallelize some looping code that does LU factorization. The exact loop structure is shown below:

for(int k = 0; k < N-1; k++)
{
    for(int i = k+1; i < N; i++)
        S[i*N + k] = S[i*N + k] / S[k*N + k];

    for(int j = k+1; j < N; j++)
        for(int i = k+1; i < N; i++)
            S[i*N + j] -= S[i*N + k] * S[k*N + j];
}

I have written a simple OpenCL kernel that uses individual work items (no grouping). It is as follows:

int IDx = get_global_id(0);
int IDy = get_global_id(1);

for(int k = 0; k < n-1; k++)
{
    barrier(CLK_GLOBAL_MEM_FENCE);

    if(IDy > k && IDx == k)
        matrix[IDy*n + IDx] = matrix[IDy*n + IDx] / matrix[IDx*n + IDx];

    barrier(CLK_GLOBAL_MEM_FENCE);

    for(int j = k+1; j < n; j++)
    {
        if(IDy > k && IDx == j)
            matrix[IDy*n + IDx] -= matrix[IDy*n + k] * matrix[k*n + IDx];
    }
}

But I don't get correct results when compared to the serial code. This is my own attempt at an OpenCL kernel, and I am still learning how the data-parallel scheme in OpenCL works. Can you point out what I am doing wrong in the kernel?

akhal · ‎08-03-2011

Anybody there ....

stefanh · ‎02-06-2012

Hello everybody,

I have encountered exactly the same problem: a barrier within an inner loop does not seem to guarantee synchronisation between the work items of the same work group. Does anybody have an explanation for this issue?

Thank you in advance and kind regards
Stefan H.,
Germany

Amjad_A_Intel · ‎02-06-2012

Sorry for the late answer.
I will check this issue and return to you as soon as possible.
Can you give me the version of the OpenCL SDK you are using?

Thanks.

Jim_Vaughn · ‎02-07-2012

Hi Stefanh,

To be clear, a barrier only guarantees that all threads will stop at that point and wait for all memory operations to complete before continuing. That does not mean memory accesses were not reordered before or after that call.

To be fair, I could see the above problem being caused by an overly eager parser unrolling loops.

Amjad_A_Intel · ‎02-09-2012

Hi,

I went through the code and I think the issue is in the implementation of the algorithm, which requires all work groups to complete their work before any of them proceeds to the code after the barrier. However, please note that a barrier in OpenCL does not provide this mechanism (synchronization between work groups); it only synchronizes work items within the same work group.

To validate this assumption, please run the code with a single work group configuration (local size == global size).
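If the single work group run confirms it, one possible restructuring (a sketch under assumptions, not something tested against the original code) is to move the k loop to the host and launch one kernel per phase of each step. With an in-order command queue, a launch starts only after the previous one has finished, which gives the global synchronization point that the barrier cannot. The plain C below only simulates this on the CPU: `scale_column`, `update_trailing`, `lu_split` and the 4x4 test matrix are hypothetical names and data chosen for illustration, and each loop nest over the ids stands in for one kernel launch.

```c
#include <assert.h>
#include <string.h>

/* Serial reference with the same loop structure as the first post. */
void lu_serial(float *S, int N)
{
    for (int k = 0; k < N - 1; k++) {
        for (int i = k + 1; i < N; i++)
            S[i*N + k] = S[i*N + k] / S[k*N + k];
        for (int j = k + 1; j < N; j++)
            for (int i = k + 1; i < N; i++)
                S[i*N + j] -= S[i*N + k] * S[k*N + j];
    }
}

/* Hypothetical kernel body 1: work item IDy scales its entry of column k. */
void scale_column(float *matrix, int n, int k, int IDy)
{
    if (IDy > k)
        matrix[IDy*n + k] /= matrix[k*n + k];
}

/* Hypothetical kernel body 2: work item (IDx, IDy) updates one element
   of the trailing submatrix. */
void update_trailing(float *matrix, int n, int k, int IDx, int IDy)
{
    if (IDx > k && IDy > k)
        matrix[IDy*n + IDx] -= matrix[IDy*n + k] * matrix[k*n + IDx];
}

/* Host side: the k loop lives here. Each loop nest over the ids stands in
   for one kernel launch, so every work item of one phase is done before
   the next phase starts. */
void lu_split(float *matrix, int n)
{
    assert(n > 0);
    for (int k = 0; k < n - 1; k++) {
        for (int IDy = 0; IDy < n; IDy++)          /* "launch" 1 */
            scale_column(matrix, n, k, IDy);
        for (int IDy = 0; IDy < n; IDy++)          /* "launch" 2 */
            for (int IDx = 0; IDx < n; IDx++)
                update_trailing(matrix, n, k, IDx, IDy);
    }
}

/* Returns 1 if the split version agrees with the serial reference on a
   small, made-up test matrix (within a small float tolerance). */
int lu_split_matches_serial(void)
{
    float a[16] = { 4, 3, 2, 1,
                    3, 4, 3, 2,
                    2, 3, 4, 3,
                    1, 2, 3, 4 };
    float b[16];
    memcpy(b, a, sizeof a);
    lu_serial(a, 4);
    lu_split(b, 4);
    for (int i = 0; i < 16; i++) {
        float d = a[i] - b[i];
        if (d < 0) d = -d;
        if (d > 1e-5f) return 0;
    }
    return 1;
}
```

Within one phase no work item reads an element that another work item of the same phase writes, so the order in which the ids are processed should not matter. That is the property the single kernel with barriers relies on but cannot enforce across work groups.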

Please let us know what you think about our analysis.

Regards,

Amjad

Jim_Vaughn · ‎02-09-2012

"I went through the code and I think that the issue is in the implementation of the algorithm which requires that all work groups must complete their work prior to proceeding to the code after the barrier."

This was my original thought, but I didn't have enough time to verify that it was occurring and, to be honest, wasn't 100% sure how the original algorithm was intended to function. I was going to recommend testing the code on the CPU, using an outer loop to simulate the random execution environment, to ensure it was correct. After you add the outer loop, you could write a get_global_id() function that returns a random id which has not been used yet. I have used this in the past to debug issues with algorithms for the GPU, and I am sure it will work here.
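A minimal sketch of that idea in plain C, assuming the k loop and the two phases of each step are driven from outer loops on the CPU. The names `shuffle_ids`, `fake_get_global_id`, `lu_step` and the 4x4 test matrix are made up for illustration and are not code from this thread; `fake_get_global_id` flattens the two-dimensional id into a single number.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_ITEMS 64

/* Hypothetical pool of work-item ids, handed out once each in random order. */
static int id_order[MAX_ITEMS];
static int id_next, id_count;

void shuffle_ids(int count)
{
    assert(count > 0 && count <= MAX_ITEMS);
    for (int i = 0; i < count; i++)
        id_order[i] = i;
    for (int i = count - 1; i > 0; i--) {   /* Fisher-Yates shuffle */
        int j = rand() % (i + 1);
        int t = id_order[i]; id_order[i] = id_order[j]; id_order[j] = t;
    }
    id_next = 0;
    id_count = count;
}

/* Stand-in for get_global_id(): a random id that has not been used yet. */
int fake_get_global_id(void)
{
    assert(id_next < id_count);
    return id_order[id_next++];
}

/* Kernel body under test for step k: phase 0 scales column k,
   phase 1 updates the trailing submatrix. */
void lu_step(float *m, int n, int k, int phase)
{
    int id = fake_get_global_id();
    int IDx = id % n, IDy = id / n;
    if (phase == 0) {
        if (IDy > k && IDx == k)
            m[IDy*n + IDx] /= m[k*n + k];
    } else {
        if (IDy > k && IDx > k)
            m[IDy*n + IDx] -= m[IDy*n + k] * m[k*n + IDx];
    }
}

/* Outer loops: every phase runs all n*n work items in a fresh random order. */
void lu_random_order(float *m, int n)
{
    for (int k = 0; k < n - 1; k++)
        for (int phase = 0; phase < 2; phase++) {
            shuffle_ids(n * n);
            for (int w = 0; w < n * n; w++)
                lu_step(m, n, k, phase);
        }
}

/* Serial reference with the loop structure from the first post. */
void lu_serial(float *S, int N)
{
    for (int k = 0; k < N - 1; k++) {
        for (int i = k + 1; i < N; i++)
            S[i*N + k] = S[i*N + k] / S[k*N + k];
        for (int j = k + 1; j < N; j++)
            for (int i = k + 1; i < N; i++)
                S[i*N + j] -= S[i*N + k] * S[k*N + j];
    }
}

/* Returns 1 if the randomly ordered run matches the serial result. */
int random_order_matches_serial(unsigned seed)
{
    float a[16] = { 4, 3, 2, 1,  3, 4, 3, 2,  2, 3, 4, 3,  1, 2, 3, 4 };
    float b[16];
    memcpy(b, a, sizeof a);
    srand(seed);
    lu_serial(a, 4);
    lu_random_order(b, 4);
    for (int i = 0; i < 16; i++) {
        float d = a[i] - b[i];
        if (d < 0) d = -d;
        if (d > 1e-5f) return 0;
    }
    return 1;
}
```

Running the check with several seeds is the point of the exercise: if a kernel body only works for some execution orders, some seed should eventually expose the mismatch against the serial code.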

Also, I recently saw a paper on a similar subject, about semantically comparing OpenCL algorithms to their C counterparts to ensure equality. It was very interesting, as they ran their code through the Bullet physics engine's OpenCL implementation and found a number of unknown errors as well as compiler bugs. Obviously it wouldn't work when you make architecture trade-offs, but for simple porting of algorithms it could be of real benefit.