Intel® Quartus® Prime Software

clEnqueueNDRangeKernel() blocking or no blocking

Altera_Forum
Honored Contributor II

Hi everyone,

I would like to know whether clEnqueueNDRangeKernel() is a blocking function. Read and write operations can be blocking or non-blocking depending on the flag passed to the respective function (CL_TRUE or CL_FALSE). Thanks for your help.

Marco Montini
5 Replies
Altera_Forum
Honored Contributor II

No, clEnqueueNDRangeKernel has no blocking option. You can have it wait on events from, say, another queue, or have it return an event for other enqueued commands to wait on, but otherwise the kernel is launched when this command gets through the queue.

Altera_Forum
Honored Contributor II

Hi,

I think clEnqueueNDRangeKernel() is a non-blocking function, but I have run into a more specific problem.

I want to implement collaborative CPU-FPGA computing using OpenCL. For the RANSAC algorithm I used data partitioning, so the CPU part and the FPGA part are independent. Now I want the CPU and the FPGA to execute RANSAC in parallel. How can I achieve this?

If I use:

......
clStatus = clEnqueueNDRangeKernel(clCommandQueue,......);
clFinish(clCommandQueue);
cpu_thread.join();
......

then I think it will block. Is that right? clFinish does not return until all queued commands in clCommandQueue have been processed and completed. Since I want the CPU and the FPGA to execute RANSAC in parallel, I tried removing clFinish, but the total execution time of CPU and FPGA became longer than with clFinish. I also tried clFlush instead of clFinish, and the total execution time was again longer.

In short, I got the shortest execution time with clFinish, but I do not know why. How can I achieve parallel processing? Can anyone help? Thanks in advance.

PS: The execution time refers only to the time spent processing RANSAC and excludes other time such as data transfer. All three implementations mentioned above produce correct results; for now I am focusing only on the time.
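The ordering asked about above (start the device work, let the CPU work while the device is busy, then wait for both) can be sketched without any OpenCL dependency. This is only a timing model under stated assumptions: `fake_fpga_ransac`, `fake_cpu_ransac`, `run_overlapped` and `run_serial` are hypothetical names, both workloads merely sleep, and `std::async` stands in for the non-blocking clEnqueueNDRangeKernel() plus clFlush(). The `wait()` call plays the role of clFinish(), which blocks only the host thread that calls it. If `cpu_thread` in the post was already started in the elided code before the enqueue, that version is already overlapped in the same way; the post does not say either way.

```cpp
#include <chrono>
#include <future>
#include <thread>

using Clock = std::chrono::steady_clock;

// Hypothetical stand-in for the FPGA kernel: it only sleeps for `ms` milliseconds.
// In real host code this work would be started by clEnqueueNDRangeKernel().
void fake_fpga_ransac(int ms) {
    std::this_thread::sleep_for(std::chrono::milliseconds(ms));
}

// Hypothetical stand-in for the CPU share of the partitioned RANSAC work.
void fake_cpu_ransac(int ms) {
    std::this_thread::sleep_for(std::chrono::milliseconds(ms));
}

// Starts the "device" work, runs the CPU work concurrently, then waits for both.
// Returns the elapsed wall-clock time in milliseconds.
long long run_overlapped(int fpga_ms, int cpu_ms) {
    auto start = Clock::now();
    // Analogue of a non-blocking enqueue followed by clFlush(): returns at once.
    std::future<void> fpga_done =
        std::async(std::launch::async, fake_fpga_ransac, fpga_ms);
    // The CPU part starts while the "device" is still busy.
    std::thread cpu_thread(fake_cpu_ransac, cpu_ms);
    fpga_done.wait();   // analogue of clFinish(): blocks this thread only
    cpu_thread.join();  // wait for the CPU part as well
    auto end = Clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
}

// Baseline for comparison: device work first, CPU work afterwards.
long long run_serial(int fpga_ms, int cpu_ms) {
    auto start = Clock::now();
    fake_fpga_ransac(fpga_ms);
    fake_cpu_ransac(cpu_ms);
    auto end = Clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
}
```

Calling `run_overlapped(200, 150)` should take close to 200 ms of wall-clock time (the longer of the two parts, plus scheduling overhead), while `run_serial(200, 150)` takes at least 350 ms. When timing the real application, the stop timestamp has to be taken after both the device wait and the `join()`, otherwise the measurement does not cover the same work in each variant.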
Altera_Forum
Honored Contributor II

@Jasmine-J, assuming that you are programming both the CPU and FPGA using OpenCL, you can create two separate queues, one for each device, run your kernels in parallel on the two queues, and use events to synchronize them. Either way, clEnqueueNDRangeKernel() is NOT a blocking call, and the best way to synchronize kernels or code segments that are supposed to run in parallel is to use events.

Altera_Forum
Honored Contributor II

 

--- Quote Start ---  

@Jasmine-J, assuming that you are programming both the CPU and FPGA using OpenCL, you can create two separate queues, one for each device, and run your kernels in parallel on the different queues and use events to synchronize them. Either way, clEnqueueNDRangeKernel() is NOT a blocking call and the best way to synchronize kernels or code segments that are supposed to run in parallel is to use events. 

--- Quote End ---

Thanks HRZ,

I did not use OpenCL for the CPU; I used C++11 threads.

What I want is simply to let the CPU start its thread after calling clEnqueueNDRangeKernel() for the FPGA, WITHOUT synchronizing.

As I understand it, with clFinish the host WAITS for the queue to complete, and without clFinish it does NOT WAIT. But I cannot understand why the execution time with clFinish is shorter than the time without clFinish.

In other words, I want the total execution time of CPU and FPGA to be less than the simple sum of the CPU execution time and the FPGA execution time.

Thanks again.
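The goal stated above can be written as a small timing model. All symbols are assumptions introduced for illustration, not measurements from this thread: $T_{\text{CPU}}$ and $T_{\text{FPGA}}$ are the times each device needs for its own partition, $N$ is the number of independent work items, $r_{\text{CPU}}$ and $r_{\text{FPGA}}$ are the sustained throughputs in items per second, and $\alpha$ is the fraction of items assigned to the FPGA. Transfer and launch overheads are ignored.

```latex
\begin{align}
T_{\text{serial}}  &= T_{\text{CPU}} + T_{\text{FPGA}} \\
T_{\text{overlap}} &= \max\left(T_{\text{CPU}},\, T_{\text{FPGA}}\right) \\
T_{\text{FPGA}}    &= \frac{\alpha N}{r_{\text{FPGA}}}, \qquad
T_{\text{CPU}}      = \frac{(1-\alpha) N}{r_{\text{CPU}}} \\
T_{\text{FPGA}} = T_{\text{CPU}}
  \;&\Rightarrow\;
  \alpha^{*} = \frac{r_{\text{FPGA}}}{r_{\text{FPGA}} + r_{\text{CPU}}}, \qquad
  T_{\text{overlap}}^{*} = \frac{N}{r_{\text{FPGA}} + r_{\text{CPU}}}
\end{align}
```

Under this model the overlapped time is the longer of the two parts, not their sum, and it is smallest when the partition is chosen so that both devices finish at the same moment.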
Altera_Forum
Honored Contributor II

The clFinish command is what is blocking, not clEnqueueNDRangeKernel.
