Why did clFinish cause a shorter execution time for collaborative computing?

Altera_Forum · ‎01-25-2018

Hi,

:( So sorry... I am new and I do not know why I cannot reply to @HRZ on this thread:

https://alteraforum.com/forum/showthread.php?t=57651

So sorry, I have to post a new thread here...

What I want to reply to HRZ is:

Thanks HRZ,

I did not use OpenCL for the CPU; I used threads in C++11.

Now I just want the CPU to start its thread right after clEnqueueNDRangeKernel(), WITHOUT synchronizing...

If I call clFinish between clEnqueueNDRangeKernel() and cpu_thread.join(), I think the CPU thread will not start until the FPGA has finished, because the host will WAIT there to synchronize. Without clFinish, I think it will NOT WAIT. However, I cannot understand why the execution time without clFinish is longer than the time with clFinish... Is my understanding wrong?

In other words, I just want the total execution time for CPU and FPGA not to be simply the sum of the CPU time and the FPGA time...

Thanks again.

And here is my previous question, posted at https://alteraforum.com/forum/showthread.php?t=57651:

--- Quote Start ---

Hi,

I think clEnqueueNDRangeKernel() is a non-blocking function. But I met a more specific problem.

I want to implement collaborative computing on CPU-FPGA using OpenCL. For the RANSAC algorithm I used data partitioning, and the CPU and FPGA parts are independent of each other. Now I want the CPU and FPGA to execute RANSAC in parallel. How can I do that?

If I use:

......
clStatus = clEnqueueNDRangeKernel(clCommandQueue,......);
clFinish(clCommandQueue);
cpu_thread.join();
......

Then I think it will be blocking. Is that right? Because clFinish does not return until all queued commands in clCommandQueue have been processed and completed. But I want the CPU and FPGA to execute RANSAC in parallel, so I tried removing clFinish, but I got a longer total execution time for CPU and FPGA (than with clFinish). I also tried using clFlush instead of clFinish, and again got a longer total execution time.

I mean, I got the shortest execution time using clFinish, but I do not know why. And how can I achieve parallel processing? Who can help me? Thanks in advance.

PS: The execution time refers only to the time spent processing RANSAC, not including other time such as transmission time. The results of all three implementations mentioned above are correct, and now I am just focusing on the time.

--- Quote End ---

And HRZ told me:

--- Quote Start ---

@Jasmine-J, assuming that you are programming both the CPU and FPGA using OpenCL, you can create two separate queues, one for each device, and run your kernels in parallel on the different queues and use events to synchronize them. Either way, clEnqueueNDRangeKernel() is NOT a blocking call and the best way to synchronize kernels or code segments that are supposed to run in parallel is to use events.

--- Quote End ---

Thanks in advance.

Altera_Forum · ‎01-25-2018

I believe that your understanding of clFinish is correct. The timing that you are seeing is strange, though. OpenCL calls are thread-safe except for clSetKernelArg(). Are there other operations that get enqueued to the same queue? Also, where are you measuring the time for the thread execution on the CPU?

Altera_Forum · ‎01-26-2018

What you are doing should work if only the thread that is running the FPGA kernel waits on clFinish(), and the total run time that you measure should be the run time of the slowest thread (which could be the FPGA or the CPU thread), rather than the sum of their run times. As fand suggested, can you tell us where your timing starts and ends?

You can also try something like this, even though I am not sure if that would work correctly with respect to multi-threaded OpenCL execution:

initialize_opencl();
define_shared_opencl_event(event);
start_timing();
fork_threads();
if (thread_num == 1)
  execute_fpga_opencl_code(&event);  // enqueue the kernel and attach the event
else
  execute_cpu_cpp_code();            // CPU share of the work
join_threads();
wait_on_opencl_event(event);         // block until the FPGA kernel has finished
end_timing();
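
For anyone who wants to try the idea above without an FPGA, here is a rough C++11 sketch of the same ordering. Please treat it as an illustration only: it does not call OpenCL at all. `fake_enqueue_kernel` is a hypothetical stand-in for a non-blocking clEnqueueNDRangeKernel(), the returned `std::future` plays the role of the OpenCL event, and `fake_cpu_work` stands in for the CPU share of RANSAC. Both simply sleep for a given number of milliseconds.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <thread>

using Clock = std::chrono::steady_clock;
using Ms = std::chrono::milliseconds;

// Hypothetical stand-in for clEnqueueNDRangeKernel(): returns immediately
// while the "kernel" runs in the background for device_ms milliseconds.
// The returned future plays the role of the OpenCL event.
std::future<void> fake_enqueue_kernel(int device_ms) {
    return std::async(std::launch::async, [device_ms] {
        std::this_thread::sleep_for(Ms(device_ms));
    });
}

// Hypothetical stand-in for the CPU share of the work.
void fake_cpu_work(int cpu_ms) {
    std::this_thread::sleep_for(Ms(cpu_ms));
}

// Starts both parts, waits for both, and returns the wall-clock time in ms.
long long run_overlapped(int device_ms, int cpu_ms) {
    auto start = Clock::now();
    std::future<void> event = fake_enqueue_kernel(device_ms);  // non-blocking
    std::thread cpu_thread(fake_cpu_work, cpu_ms);             // starts right away
    cpu_thread.join();                                         // wait for the CPU part
    event.wait();                                              // wait for the "device" part
    auto end = Clock::now();
    return std::chrono::duration_cast<Ms>(end - start).count();
}
```

Calling `run_overlapped(150, 100)` should return something close to 150 ms (the slower of the two parts) rather than 250 ms (their sum). If a real measurement looks like the sum instead, the two parts are probably not overlapping.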

Altera_Forum · ‎01-26-2018

Hi fand, hi HRZ,

thank you very much.

Yes, there are some clEnqueueWriteBuffer() calls that get enqueued to the same queue, but I also call clFinish after them. Specifically, it looks something like this (this also shows my time measurement):


......
time1 = getCurrentTimestamp();
clStatus = clEnqueueWriteBuffer(clCommandQueue,......);
clFinish(clCommandQueue);
time2 = getCurrentTimestamp();
clSetKernelArg(......);
clStatus = clEnqueueNDRangeKernel(clCommandQueue,......);
std::thread cpu_thread(......);
clFinish(clCommandQueue);
cpu_thread.join();
time3 = getCurrentTimestamp();
......

where getCurrentTimestamp() is defined in opencl.cpp. I used time3-time2 to measure the kernel time and time2-time1 to measure the transmission time. Is there any mistake? And thanks HRZ, I will try to use events then.

Thanks again.

Altera_Forum · ‎01-26-2018

I am not familiar with threading using std::threads, so I cannot really comment on the correctness of your threading mechanism. The OpenCL part looks correct to me and time3-time2 should give you the kernel execution time. Though, you might want to move "clSetKernelArg()" outside of the timing region. Also you should probably make sure your kernel run time is long enough (at least a few seconds) so that the timing or kernel launch overhead does not dominate the measured run time.
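
To show what I mean by moving the argument setup out of the timing region, here is a minimal sketch. It is only an illustration under assumptions: it uses std::chrono::steady_clock instead of getCurrentTimestamp(), and `fake_set_kernel_args` and `fake_kernel_and_cpu_work` are hypothetical stand-ins that just sleep, not real OpenCL calls. `time_region_ms` is also a helper name I made up for this example.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Runs `setup` first, then times only `work`.
// Returns the wall-clock time of `work` in milliseconds.
double time_region_ms(const std::function<void()>& setup,
                      const std::function<void()>& work) {
    setup();  // e.g. clSetKernelArg(): deliberately outside the timed region
    auto t0 = std::chrono::steady_clock::now();
    work();   // e.g. enqueue the kernel, start the CPU thread, wait for both
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Hypothetical stand-in for the argument setup (sleeps 200 ms).
void fake_set_kernel_args() {
    std::this_thread::sleep_for(std::chrono::milliseconds(200));
}

// Hypothetical stand-in for the kernel launch plus CPU work (sleeps 50 ms).
void fake_kernel_and_cpu_work() {
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
}
```

Here `time_region_ms(fake_set_kernel_args, fake_kernel_and_cpu_work)` should report roughly the 50 ms of the work itself and not the 200 ms of setup. With real kernels the same structure applies, but the run should be long enough that the launch overhead is small by comparison.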