Hello

I am new to OpenCL and want to parallelize some looping code that performs LU factorization. The exact loop structure is shown below:

for(int k = 0; k < N-1; k++)
{
    for(int i = k+1; i < N; i++)
        S[i*N + k] = S[i*N + k] / S[k*N + k];

    for(int j = k+1; j < N; j++)
        for(int i = k+1; i < N; i++)
            S[i*N + j] -= S[i*N + k] * S[k*N + j];
}

I have written a simple OpenCL kernel that uses individual work items (no grouping). It is as follows:

int IDx = get_global_id(0);
int IDy = get_global_id(1);

for(int k = 0; k < n-1; k++)
{
    barrier(CLK_GLOBAL_MEM_FENCE);

    if(IDy > k && IDx == k)
        matrix[IDy*n + IDx] = matrix[IDy*n + IDx] / matrix[IDx*n + IDx];

    barrier(CLK_GLOBAL_MEM_FENCE);

    for(int j = k+1; j < n; j++)
    {
        if(IDy > k && IDx == j)
            matrix[IDy*n + IDx] -= matrix[IDy*n + k] * matrix[k*n + IDx];
    }
}

But I don't get correct results when compared to the serial code. This is my own attempt at an OpenCL kernel, and I am still learning how the data-parallel scheme in OpenCL works. Can you point out what I am doing wrong in the kernel?

I have encountered exactly the same problem: a barrier within an inner loop does not seem to guarantee synchronization among work items of the same work group. Does anybody have an explanation for this issue?

Thank you in advance and kind regards

Stefan H.,

Germany

I will check this issue and get back to you as soon as possible.

Can you give me the version of the OpenCL SDK you are using?

Thanks.

Hi,

I went through the code, and I think the issue is in how the algorithm is implemented: it requires all work groups to complete each step before any work item proceeds to the code after the barrier. However, please note that a barrier in OpenCL does not provide this mechanism (synchronization between work groups); it only synchronizes work items within the same work group.

To validate this assumption please run the code with a single work group configuration (local size == global size).
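If more than one work group is needed, one common workaround is to keep the outer k loop on the host and enqueue a separate kernel launch for each phase, since commands on an in-order queue run one after another. The sketch below only models that restructuring in plain C, assuming no pivoting and a nonzero pivot at every step, as in the serial code. The function names (lu_scale_item, lu_update_item, lu_host_driven, lu_serial, max_abs_diff) are hypothetical, and each inner loop nest stands in for one kernel launch.

```c
#include <assert.h>
#include <stddef.h>

/* Phase 1 body for one work item: scale entry (i, k) of column k.
   Models a 1D launch over rows i = k+1 .. n-1. */
static void lu_scale_item(float *m, int n, int k, int i)
{
    m[i * n + k] /= m[k * n + k];
}

/* Phase 2 body for one work item: update entry (i, j) of the trailing block.
   Models a 2D launch over i, j = k+1 .. n-1. */
static void lu_update_item(float *m, int n, int k, int i, int j)
{
    m[i * n + j] -= m[i * n + k] * m[k * n + j];
}

/* Host-side driver: the end of each "launch" acts as the global
   synchronization point that barrier() cannot provide. */
void lu_host_driven(float *m, int n)
{
    assert(m != NULL && n > 0);
    for (int k = 0; k < n - 1; k++) {
        /* "Launch 1": items visited in reverse to show order is irrelevant. */
        for (int i = n - 1; i > k; i--)
            lu_scale_item(m, n, k, i);
        /* "Launch 2": starts only after every item of launch 1 is done. */
        for (int j = n - 1; j > k; j--)
            for (int i = n - 1; i > k; i--)
                lu_update_item(m, n, k, i, j);
    }
}

/* Reference: the serial loop from the original question. */
void lu_serial(float *S, int N)
{
    for (int k = 0; k < N - 1; k++) {
        for (int i = k + 1; i < N; i++)
            S[i * N + k] = S[i * N + k] / S[k * N + k];
        for (int j = k + 1; j < N; j++)
            for (int i = k + 1; i < N; i++)
                S[i * N + j] -= S[i * N + k] * S[k * N + j];
    }
}

/* Largest absolute element-wise difference between two arrays. */
float max_abs_diff(const float *a, const float *b, int count)
{
    float worst = 0.0f;
    for (int i = 0; i < count; i++) {
        float d = a[i] - b[i];
        if (d < 0.0f) d = -d;
        if (d > worst) worst = d;
    }
    return worst;
}
```

In a real OpenCL host program, each of the two loop nests would become one kernel enqueued per value of k, with the two item functions as the kernel bodies indexed by get_global_id(). Comparing the output against lu_serial with max_abs_diff gives a quick correctness check.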

Please let us know what you think about our analysis.

Regards,

Amjad

This was my original thought, but I didn't have enough time to verify that it was occurring, and to be honest I wasn't 100% sure how the original algorithm was intended to function. I was going to recommend testing the code on the CPU, using an outer loop that runs the work items in random order to simulate the execution environment and check that the code is correct. After you add the outer loop, you could write a get_global_id() function that returns a random id that has not been used yet. I have used this in the past to debug issues with algorithms for the GPU, and I am sure it will work here.
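A rough sketch of that CPU-side check in plain C is given here. The names (work_item_fn, shuffled_ids, run_in_random_order, square_item) are made up for illustration, and the work item id is passed as an argument rather than through a replacement get_global_id(). Because each simulated work item runs to completion before the next one starts, a kernel that contains barriers would have to be split at each barrier, with the harness applied to each piece separately.

```c
#include <assert.h>
#include <stdlib.h>

/* Body of one simulated work item; gid plays the role of get_global_id(0). */
typedef void (*work_item_fn)(int gid, void *arg);

/* Fill ids with 0..count-1, then shuffle them (Fisher-Yates). */
static void shuffled_ids(int *ids, int count, unsigned seed)
{
    srand(seed);
    for (int i = 0; i < count; i++)
        ids[i] = i;
    for (int i = count - 1; i > 0; i--) {
        int j = rand() % (i + 1);
        int tmp = ids[i];
        ids[i] = ids[j];
        ids[j] = tmp;
    }
}

/* Outer loop: run the body once per id, in random order, each id used
   exactly once. Returns 0 on success, -1 if allocation fails. */
int run_in_random_order(work_item_fn body, int count, unsigned seed, void *arg)
{
    assert(body != NULL && count > 0);
    int *ids = malloc((size_t)count * sizeof *ids);
    if (ids == NULL)
        return -1;
    shuffled_ids(ids, count, seed);
    for (int i = 0; i < count; i++)
        body(ids[i], arg);
    free(ids);
    return 0;
}

/* Example body that is order independent: each item writes only its own slot. */
static void square_item(int gid, void *arg)
{
    int *data = arg;
    data[gid] = gid * gid;
}
```

Running the same body with several different seeds and comparing the results is the point of the exercise: a body whose output changes with the seed depends on an execution order that OpenCL does not guarantee.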

Also, I recently saw a paper on a similar subject: semantically comparing OpenCL algorithms to their C counterparts to ensure equivalence. It was very interesting, as they applied it to the Bullet physics engine's OpenCL implementation and found a number of previously unknown errors as well as compiler bugs. Obviously it wouldn't work when you make architecture trade-offs, but for simple porting of algorithms it could be of real benefit.
