Application Acceleration With FPGAs
Programmable Acceleration Cards (PACs), DCP, FPGA AI Suite, Software Stack, and Reference Designs

Why does a blocked channel in one kernel also block other kernels from running?

hiratz
Novice

Hi,

 

@HRZ

I noticed the following scenario:

 

I have two kernels, say Producer and Consumer. I put them into two DIFFERENT command queues, so they can be launched and run concurrently. Meanwhile, there is a channel of depth N between them. Each work-item in the Producer writes one value to the channel, and each work-item in the Consumer reads one value from it. I use NDRange, and the number of work-items is W.

Then

When W < N, everything is OK;

When W > N, the Producer blocks, but the Consumer also never executes.

Since the Consumer is in a different queue from the Producer, why is it also blocked? Actually, I suspect it is not even launched.

 

I show a simple code sample below. In this example, N = 8 and W = 32. The printf at the top of the consumer kernel is never executed. Even if I comment out the channel read (read_channel_intel), it still does not execute, so I conclude that the Consumer is never launched. If I'm right, why? My original thought was: even if the Producer blocks, it will eventually be unblocked as long as the Consumer is able to execute and keep reading from the channel, freeing space in it.

 

Thank you!

channel ulong ch1 __attribute__ ((depth(8)));

__attribute__((max_work_group_size(32)))
__kernel void producer(buf_addr_space const int * restrict buf_in)
{
    size_t gid = get_global_id(0);
    write_channel_intel(ch1, buf_in[gid]);
}

__attribute__((max_work_group_size(32)))
__kernel void consumer(buf_addr_space int * restrict buf_out)
{
    printf("--------- test ------------\n");
    size_t gid = get_global_id(0);

    int val = 5 + read_channel_intel(ch1);
    buf_out[gid] = val;
}

 

HRZ
Valued Contributor III

Is this with emulation or actual FPGA execution? On the actual FPGA, printf data is cached in Block RAMs and dumped after kernel execution (while in the emulator it is dumped right away); hence, if the kernel runs into a deadlock, the printf output will never be displayed (printf is useless for debugging deadlocks on the FPGA). Other than that, I don't see any reason why your code above would deadlock unless there is some problem in your host code; e.g. one kernel is waiting for an event from the other, or there is a clFinish on the producer kernel before the consumer kernel is launched. Your description very well matches a synchronization problem in the host code rather than in the kernels.
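[Editor's note] The clFinish anti-pattern described above would look like the following in the host code (a hypothetical fragment for illustration, reusing the queue and kernel names from the code posted later in the thread; it is NOT in the OP's actual code):

```c
/* Anti-pattern: waiting on the producer's queue before enqueuing the
   consumer. Once W > N, the producer is stuck on the full channel, so
   clFinish blocks forever and the consumer is never enqueued. */
status = clEnqueueNDRangeKernel(rcmd_queue0, kernel[0], 1, NULL,
                                global_work_size, local_work_size, 0, NULL, NULL);
clFinish(rcmd_queue0);   /* blocks forever: producer cannot drain the channel */
status = clEnqueueNDRangeKernel(rcmd_queue1, kernel[1], 1, NULL,
                                global_work_size, local_work_size, 0, NULL, NULL);
```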

hiratz
Novice

Thanks, HRZ.

 

I just tested this with the emulator (with the option "-emulator-channel-depth-model=strict"). My host code is very simple and no kernel waits on any event. I show the key code below (you can see I put the Producer and the Consumer into two different command queues). It is really weird.

 

#define MAX_CMD_QUEUES 4
#define N_KERNEL 2

void run_kernel(cl_command_queue cmd_queue[MAX_CMD_QUEUES], cl_kernel (&kernel)[N_KERNEL], size_t n_thread)
{
    size_t global_work_size[1] = {(size_t)n_thread};
    size_t local_work_size[1] = {(size_t)n_thread};

    cl_event event_write[1], event_exec[2];
    cl_int status;
    cl_command_queue &rcmd_queue0 = cmd_queue[0], &rcmd_queue1 = cmd_queue[1], &rcmd_queue2 = cmd_queue[2];

    // Write data into buf_in from h_in
    status = clEnqueueWriteBuffer(rcmd_queue0, buf_in, CL_TRUE, 0, bufsize, h_in, 0, NULL, event_write);
    error_check(status, "Write buf_in failed!\n");

    // Launch the kernel Producer
    status = clEnqueueNDRangeKernel(rcmd_queue0, kernel[0], 1, NULL, global_work_size, local_work_size, 1, event_write, NULL);
    error_check(status, "Run kernel Producer error!\n");

    // Launch the kernel Consumer
    status = clEnqueueNDRangeKernel(rcmd_queue1, kernel[1], 1, NULL, global_work_size, local_work_size, 0, NULL, NULL);
    error_check(status, "Run kernel Consumer error!\n");

    // Read results back to h_out from buf_out
    read_back_results(rcmd_queue1, bufsize, buf_out, (char*)h_out, NULL);

    clReleaseEvent(event_write[0]);
    clReleaseEvent(event_exec[0]);
    clReleaseEvent(event_exec[1]);
}

 

 

 

HRZ
Valued Contributor III

Have you tried commenting out the error_check calls? Depending on its implementation, that function could be serializing the kernel launches. If you provide the full code for your example so that I can compile it on my own machine, I might be able to find the source of the problem.

hiratz
Novice

Thanks! Sure, I attached the whole directory, which contains the host code, kernel code, and some necessary scripts.

 

I tried commenting out the error_check function, but the problem still exists ...

 

My code's directory has a Readme file that shows how to compile/run the code.

 

Please note: since I am working on the Intel HARP machine, I have to use some header files provided by Intel. I put them into the directory "common-fpga". You may not need them if you have your own configuration environment. The files in the directory "common" were created by myself, and most of the code in them was written by me.

 

Currently run.sh uses 16 work-items, which triggers the problem. If you change 16 to 8, the problem disappears.

HRZ
Valued Contributor III

I tested with both Quartus v16.1.2 and v18.1 (the latter with strict channel-depth emulation). I do not get any deadlock with 8, 16, or 32 work-items (though 32 gives garbled output for the second half of the data, since your buffer size is hardcoded to 16 elements). Maybe there is some problem specific to the HARP system or the version of Quartus you are using. I have two recommendations for you:

 

1- Try compiling your host and kernel code against the a10_ref BSP, which ships with all newer versions of Quartus, and see if you observe the same behavior.

2- Try allocating buf_in using CL_MEM_READ_WRITE instead of CL_MEM_ALLOC_HOST_PTR to see if it makes any difference.
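[Editor's note] Recommendation 2 amounts to a one-flag change in the clCreateBuffer call. The exact call below is an assumption (the thread's buffer-creation code is not shown); the buffer and size names are borrowed from the host code posted earlier:

```c
/* Before: host-accessible allocation (the variant that hung on the
   HARP emulator in this thread) */
buf_in = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
                        bufsize, NULL, &status);

/* After: plain device allocation, per recommendation 2 */
buf_in = clCreateBuffer(context, CL_MEM_READ_WRITE,
                        bufsize, NULL, &status);
```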

hiratz
Novice

Great!

 

Your recommendation 2 solved this problem :)

 

But the reason is still unclear. One possible explanation I considered: when using CL_MEM_ALLOC_HOST_PTR, the Producer getting stuck also stalls the host code, which then prevents the Consumer from being launched. But it turned out that is not the case: when I put a printf(...) after the second clEnqueueNDRangeKernel(...), its output is displayed. So I guess the Consumer is successfully put into the command queue but cannot execute for some unknown reason. What do you think?
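[Editor's note] The guess above ("enqueued but never executed") can be checked directly: attach an event to the consumer launch and poll its execution status. This is a hypothetical fragment reusing the queue and kernel names from the posted host code:

```c
/* Attach an event to the Consumer launch so its progress can be queried. */
cl_event consumer_ev;
status = clEnqueueNDRangeKernel(rcmd_queue1, kernel[1], 1, NULL,
                                global_work_size, local_work_size,
                                0, NULL, &consumer_ev);

/* Poll the command's execution status: CL_QUEUED / CL_SUBMITTED /
   CL_RUNNING / CL_COMPLETE. Staying at CL_QUEUED or CL_SUBMITTED would
   confirm "enqueued but never executed". */
cl_int exec_status;
clGetEventInfo(consumer_ev, CL_EVENT_COMMAND_EXECUTION_STATUS,
               sizeof(exec_status), &exec_status, NULL);
```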

 

Thanks a lot!

HRZ
Valued Contributor III

It is difficult to tell what is happening in this case. Buffer allocations that involve host pointers are implemented differently for different hardware and by different manufacturers. I have no idea how Intel's implementation works in such cases, and it might well be buggy. Moreover, it is quite surprising that this shows up in the emulator, and only for the HARP BSP; I always assumed the behavior of the emulator was independent of the BSP. If you have a contact point inside Intel, consider sending them your example code to see if it is a bug in Intel's OpenCL SDK.

hiratz
Novice

I agree that this is probably a problem with the Intel HARP platform. I'll report it in the HARP community and see if someone can confirm it.

 

Thanks again!

hiratz
Novice

Hi HRZ,

 

@HRZ I got a reply from Michael Adler in the HARP community:

 

"The forum where you asked originally is correct. I'm sorry nobody from Intel answered. I've been poking around, hoping to find someone who will."​

 

The link is: https://forums.intel.com/s/group/0F90P00000018wRSAQ/harp

or

https://forums.intel.com/s/question/0D70P000006IGaA/a-suspected-bug-about-the-opencl-channel-in-the-harp-platform

 

I'll let you know if they have more updates.

 
