OpenCL* for CPU
Ask questions and share information on Intel® SDK for OpenCL™ Applications and OpenCL™ implementations for Intel® CPU.
Announcements
This forum covers OpenCL* for CPU only. OpenCL* for GPU questions can be asked in the GPU Compute Software forum. Intel® FPGA SDK for OpenCL™ questions can be asked in the FPGA Intel® High Level Design forum.

Inner loops with OpenCL

akhal
Beginner
724 Views

Hello

I am new to OpenCL and want to parallelize some code that performs LU factorization. The exact loop structure is shown below:

for(int k = 0; k < N-1; k++)
{
    for(int i = k+1; i < N; i++)
        S[i*N + k] = S[i*N + k] / S[k*N + k];

    for(int j = k+1; j < N; j++)
        for(int i = k+1; i < N; i++)
            S[i*N + j] -= S[i*N + k] * S[k*N + j];
}
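For reference, here is the serial code above wrapped into a self-contained function (my own sketch; the function name `lu_inplace` and the test data are mine, not from the original code):

```c
#include <assert.h>

/* In-place LU factorization (Doolittle, no pivoting) of an N x N
 * matrix stored row-major in S. Afterwards the strict lower triangle
 * holds L (unit diagonal implied) and the upper triangle holds U.
 * Same loop structure as the serial code above. */
void lu_inplace(double *S, int N)
{
    for (int k = 0; k < N - 1; k++) {
        /* Scale column k below the diagonal by the pivot. */
        for (int i = k + 1; i < N; i++)
            S[i * N + k] /= S[k * N + k];

        /* Update the trailing (N-k-1) x (N-k-1) submatrix. */
        for (int j = k + 1; j < N; j++)
            for (int i = k + 1; i < N; i++)
                S[i * N + j] -= S[i * N + k] * S[k * N + j];
    }
}
```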

I have written a simple OpenCL kernel using single work items (no grouping). It is the following:

int IDx = get_global_id(0);
int IDy = get_global_id(1);

for(int k = 0; k < n-1; k++)
{
    barrier(CLK_GLOBAL_MEM_FENCE);

    if(IDy > k && IDx == k)
        matrix[IDy*n + IDx] = matrix[IDy*n + IDx] / matrix[IDx*n + IDx];

    barrier(CLK_GLOBAL_MEM_FENCE);

    for(int j = k+1; j < n; j++)
    {
        if(IDy > k && IDx == j)
            matrix[IDy*n + IDx] -= matrix[IDy*n + k] * matrix[k*n + IDx];
    }
}

But I don't get correct results compared to the serial code. This is my personal attempt at an OpenCL kernel, and I am still learning how the data-parallel scheme in OpenCL works. Can you point out what I am doing wrong in the kernel?

0 Kudos
6 Replies
akhal
Beginner
724 Views
Anybody there ....
0 Kudos
stefanh
Beginner
724 Views
Hello everybody,

I encounter exactly the same problem: a barrier within an inner loop does not seem to guarantee synchronization among items of the same work group. Does anybody have an explanation for this issue?

Thank you in advance and kind regards
Stefan H.,
Germany
0 Kudos
Amjad_A_Intel
Employee
724 Views
Sorry for the late answer.
I will check this issue and return to you as soon as possible.
Can you tell me the version of the OpenCL SDK you are using?

Thanks.
0 Kudos
Jim_Vaughn
Beginner
724 Views
Hi Stefanh,
To be clear, a barrier only guarantees that all threads will stop at that point and wait for all memory operations to complete before continuing on. It does not mean memory accesses cannot be reordered before or after that call.
To be fair, I could also see the above problem being caused by an overly aggressive compiler unrolling the loops.
0 Kudos
Amjad_A_Intel
Employee
724 Views

Hi,

I went through the code and I think the issue is in the implementation of the algorithm, which requires that all work groups complete their work before proceeding past the barrier. However, please note that a barrier in OpenCL does not provide this mechanism (synchronization between work groups); it only synchronizes work items within the same work group.

To validate this assumption please run the code with a single work group configuration (local size == global size).
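To illustrate the point, here is a plain-C simulation I put together for this thread (not OpenCL; the function name and scaffolding are mine): it executes every work item's phase-1 body for a given k before any item starts phase 2, which is exactly the device-wide synchronization the kernel assumes the barrier provides. Run this way, the kernel's logic does reproduce the serial result.

```c
#include <assert.h>

/* Simulate the posted kernel under a true global barrier: for each k,
 * all work items finish the column-scaling phase before any item
 * starts the trailing-submatrix update. Each if-body below is the
 * same per-item statement as in the kernel. */
void lu_simulated_global_barrier(double *m, int n)
{
    for (int k = 0; k < n - 1; k++) {
        /* Phase 1: items with IDx == k, IDy > k scale column k.
         * The loop boundary plays the role of barrier(CLK_GLOBAL_MEM_FENCE). */
        for (int IDy = 0; IDy < n; IDy++)
            if (IDy > k)
                m[IDy * n + k] /= m[k * n + k];

        /* Phase 2: items with IDx == j, IDy > k update column j. */
        for (int j = k + 1; j < n; j++)
            for (int IDy = 0; IDy < n; IDy++)
                if (IDy > k)
                    m[IDy * n + j] -= m[IDy * n + k] * m[k * n + j];
    }
}
```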

Please let us know what you think about our analysis.

Regards,

Amjad

0 Kudos
Jim_Vaughn
Beginner
724 Views
"I went through the code and I think the issue is in the implementation of the algorithm, which requires that all work groups complete their work before proceeding past the barrier."

This was my original thought, but I didn't have enough time to verify that it was occurring, and to be honest I wasn't 100% sure how the original algorithm was intended to function. I was going to recommend testing the code on the CPU using an outer loop to simulate a randomized execution environment, to ensure it was correct. After you add the outer loop, you could write a get_global_id() function which returns a random id that has not been used yet. I have used this in the past to debug issues with algorithms for the GPU, and I am sure it will work here.
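The debugging technique described above could be sketched roughly like this (my own scaffolding; `shuffle` and the phase loops are hypothetical helpers, not from any post in this thread). Each barrier-delimited phase is executed with its work items in a random order; if the algorithm is correct under barrier semantics, the result must not depend on the order chosen:

```c
#include <assert.h>
#include <stdlib.h>

/* Fisher-Yates shuffle of work-item ids (hypothetical helper). */
static void shuffle(int *ids, int count)
{
    for (int i = count - 1; i > 0; i--) {
        int j = rand() % (i + 1);
        int t = ids[i]; ids[i] = ids[j]; ids[j] = t;
    }
}

/* Run each barrier-delimited phase of the kernel with the work items
 * (IDy values) in a freshly shuffled order. A correct algorithm gives
 * the same answer for every ordering. Assumes n <= 64 for the sketch. */
void lu_random_order(double *m, int n)
{
    int ids[64];
    for (int i = 0; i < n; i++) ids[i] = i;

    for (int k = 0; k < n - 1; k++) {
        shuffle(ids, n);                  /* random item order, phase 1 */
        for (int t = 0; t < n; t++) {
            int IDy = ids[t];
            if (IDy > k)
                m[IDy * n + k] /= m[k * n + k];
        }
        for (int j = k + 1; j < n; j++) {
            shuffle(ids, n);              /* random item order, phase 2 */
            for (int t = 0; t < n; t++) {
                int IDy = ids[t];
                if (IDy > k)
                    m[IDy * n + j] -= m[IDy * n + k] * m[k * n + j];
            }
        }
    }
}
```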

Also, I recently saw a paper on a similar subject about comparing OpenCL algorithms to their C counterparts semantically to ensure equality. It was very interesting: they ran their code against the Bullet physics engine's OpenCL implementation and found a number of unknown errors as well as compiler bugs. Obviously it wouldn't work when you make architecture trade-offs, but for straightforward ports of algorithms it could be a real benefit.
0 Kudos
Reply