Issue: Synchronization between threads in GPU

Xiaoying__Y · ‎05-08-2019

Hi there,

I am currently working on an error diffusion algorithm, referring to the following link.

https://community.arm.com/developer/tools-software/graphics/b/blog/posts/when-parallelism-gets-tricky-accelerating-floyd-steinberg-on-the-mali-gpu

#detail:

a thread handles a single image line

# a pixel can only be handled once the line above it has already handled the corresponding horizontal position + 2
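To make the dependency above concrete, here is a small host-side C sketch (not the code from the linked article). It assumes each line processes one pixel per step, and that "position + 2" means the line above must have finished min(x + 2, WIDTH) pixels before pixel x may run. WIDTH, HEIGHT and build_schedule are names made up for this illustration.

```c
#include <assert.h>

#define WIDTH  8   /* hypothetical image width  */
#define HEIGHT 4   /* hypothetical image height */

/* Fill step[y][x] with the earliest step in which pixel (y, x) can be
 * processed under the assumed rule: a line handles its pixels in order,
 * and pixel x must wait until the line above has finished
 * min(x + 2, WIDTH) pixels. */
static void build_schedule(int step[HEIGHT][WIDTH])
{
    assert(WIDTH > 0 && HEIGHT > 0);
    for (int y = 0; y < HEIGHT; ++y) {
        for (int x = 0; x < WIDTH; ++x) {
            int ready = 0;
            if (x > 0)
                ready = step[y][x - 1] + 1;      /* own line is sequential */
            if (y > 0) {
                int need  = (x + 2 < WIDTH) ? x + 2 : WIDTH;
                int above = step[y - 1][need - 1] + 1; /* step after that pixel is done */
                if (above > ready)
                    ready = above;
            }
            step[y][x] = ready;
        }
    }
}
```

Under these assumptions every entry works out to x + 2 * y, so each line trails the one above it by two steps. That staggered wavefront only advances if the line above keeps making progress while the line below waits.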

#current result

I tried the code in the link above, and it stalls.

Is it possible that the threads on an Intel CPU wait for each other and never proceed?

Thank you for your help in advance.

Regards.

Michael_C_Intel1 · ‎05-22-2019

Hello YX,

Note that the article you're referring to explains that the code may stall on other platforms, because every work-item but the first in a work-group spins in a while loop.

See this article on forward progress programming guidance for related information.

Sidebar:

If you'd like to discuss a minimally required reproducer for understanding when you can expect forward progress to be portable per the OpenCL spec, please add a hanging reproducer here. Having a discussion about the topic as it relates to expectations presented by the standard is useful for users new to the parallel execution environment, particularly if they are interested in portability.

See this guide on soliciting feedback on forum posts for other information to include in the post. Also, please only use sources that you have privileges to post on this forum.

Thanks,

-MichaelC

Michael_C_Intel1 · ‎05-22-2019

Hi YX/forum viewers,

Also... printf(...) debugging within a kernel can be useful for monitoring race conditions and hangs, although the output is very verbose at first. With a little creativity to keep your terminal output from being flooded, it can make the source code leading to hangs more obvious.
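One way to keep the output manageable is to print only from a sparse subset of rows and positions. The sketch below is plain host-side C so it can be checked outside a kernel. should_log and its stride parameters are hypothetical names, not part of any OpenCL API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Return true only for every row_stride-th row and every x_stride-th
 * horizontal position, so a debug print guarded by this predicate fires
 * for a small sample of the work instead of every pixel. */
static bool should_log(size_t row, size_t x, size_t row_stride, size_t x_stride)
{
    assert(row_stride > 0 && x_stride > 0);
    return (row % row_stride == 0) && (x % x_stride == 0);
}
```

Inside a kernel the same condition could guard the printf call, for example with the work-item's local id as the row. The includes and the assert would be dropped there, since OpenCL C has bool and size_t built in.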

-MichaelC

Ben_A_Intel · ‎05-23-2019

This is a nice example. If you're able to share your code once it starts working (or even while debugging) I would encourage you to do so.

I'm not 100% sure what is going on, but if I had to wager a guess I'd say that the compiler is aggressively optimizing one or both of the loops that are waiting for prior results to complete:

while (progress[get_local_id(0) - 1] < (x + 2))

In other words, marking the pointers as "volatile" may not have the intended effect.

If you're able to, I would encourage you to switch to the OpenCL 2.0 atomics (which are based on C11 atomics) instead, which are much better specified:

https://www.khronos.org/registry/OpenCL/specs/2.2/html/OpenCL_C.html#atomic-functions

https://en.cppreference.com/w/c/atomic

If you cannot switch to the OpenCL 2.0 atomics for some reason, you may want to try experimenting with the OpenCL 1.x memory fence functions, which are at least better specified than "volatile".
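As a rough illustration of the atomics suggestion, here is a host-side C11 sketch of the progress counter with release/acquire ordering. It is not the kernel from the article. publish_progress, wait_for_progress and the max_spins bound are made up for this example, and the bound is an addition so that a missing producer shows up as a failure instead of a hang.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Producer: announce that `count` pixels of this line are finished.
 * The release store makes earlier writes (e.g. diffused error values)
 * visible to a thread that later acquires the counter. */
static void publish_progress(atomic_int *progress, int count)
{
    atomic_store_explicit(progress, count, memory_order_release);
}

/* Consumer: spin until the line above has finished at least `needed`
 * pixels, giving up after max_spins extra polls. Returns true when the
 * dependency is satisfied, false if the bound was reached. */
static bool wait_for_progress(atomic_int *progress, int needed, long max_spins)
{
    assert(max_spins >= 0);
    for (long i = 0; i <= max_spins; ++i) {
        if (atomic_load_explicit(progress, memory_order_acquire) >= needed)
            return true;
    }
    return false;
}
```

In OpenCL C 2.0 the explicit variants of atomic_store_explicit and atomic_load_explicit additionally accept a memory_scope argument. Because the load is atomic, the compiler cannot hoist it out of the loop the way it might with a plain read.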

You may find this presentation insightful - for some reason my name is in the presentation filename, but it was primarily work done by my colleague Biju George:

https://www.iwocl.org/wp-content/uploads/iwocl2017-ben-ashbaugh-wavefront.pdf

https://dl.acm.org/citation.cfm?id=3078177

Xiaoying__Y · ‎05-23-2019

Hi Michael and Hi Ben,

Thank you for your reply and advice.

I will look into the issue and the documents you provided, and then post a minimal code sample and environment information here.

#By the way, there seems to be no output from printf() in a kernel on my device (Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz)

I can't find cl_intel_printf listed in the clinfo output.

Again thank you very much for the help.

Xiaoying.

Michael_C_Intel1 · ‎05-24-2019

Hi XiaoyingY,

Are you running the application in a developer environment in either Intel® SDK for OpenCL™ Applications or Intel® System Studio: OpenCL™ Tools component? Output may go to the IDE pane if so....

Is this a Windows environment where the console terminal disappears when the process is terminated? A system("pause"), getchar(), cin.get() or piping output to a file helps here.

printf(...) is defined per OpenCL specification in OCL 1.2 and 2.0 ... Output should be going somewhere. printf was an extension before 1.2.
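To check whether a device should provide printf as a core feature, one option is to look at the device version string (CL_DEVICE_VERSION from clGetDeviceInfo, also shown by clinfo). Per the OpenCL specification, it begins with "OpenCL <major>.<minor> ". The helper below is a sketch that only parses such a string. printf_is_core is a hypothetical name, and no OpenCL runtime is needed to try it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Return true when the version string reports OpenCL 1.2 or later,
 * i.e. a version in which printf is part of the core kernel language
 * and no longer an extension. Unparseable strings yield false. */
static bool printf_is_core(const char *device_version)
{
    assert(device_version != NULL);
    int major = 0, minor = 0;
    if (sscanf(device_version, "OpenCL %d.%d", &major, &minor) != 2)
        return false;
    return major > 1 || (major == 1 && minor >= 2);
}
```

If this returns true, the absence of an extension such as cl_intel_printf in the clinfo output does not by itself mean printf is unavailable.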

-MichaelC

Xiaoying__Y · ‎05-28-2019

Hi MichaelC,

Thank you for your reply.

>Are you running the application in a developer environment in either Intel® SDK for OpenCL™ Applications or Intel® System Studio...

I am not using an IDE; it is a plain Linux program.

All I did was install the .deb files from the following link,

https://github.com/intel/compute-runtime/releases

#include the <CL/cl.h> file,

and compile the source code with -lOpenCL.

Do you mean printf() in a kernel cannot be used without an IDE?

Thank you in advance.

Michael_C_Intel1 · ‎06-10-2019

Hi XiaoyingY,

Revisiting this thread...

This style of debugging prints fine to regular stdout. Tested on Ubuntu with the May 30 2019 NEO runtimes on an Intel® Core™ i7-6770HQ.

If the program is used in an IDE debugger, such output may go to a GUI pane as opposed to the shell.

~/workspace/training_sample/ocl-tools-walkthrough-20181221-edit/cpu$ ls
CPUOpenCLProjectforLinux  CPUOpenCLProjectforLinux.cpp  Makefile  TemplateCPU.cl  utils.cpp  utils.h
~/workspace/training_sample/ocl-tools-walkthrough-20181221-edit/cpu$ cat TemplateCPU.cl
...
...

__kernel void Add(__global int* pA, __global int* pB, __global int* pC)
{
    const int x     = get_global_id(0);
    const int y     = get_global_id(1);
    const int width = get_global_size(0);

    const int id = y * width + x;

    printf("Debug\n");
    pC[id] = pA[id] + pB[id];
}

Hope this helps,

-MichaelC