Is parallel execution of kernels on EUs possible through an out-of-order queue (OOQ)?

rajesh_k_ · 03-13-2018

Hi,

I am developing an OpenCL kernel that uses an out-of-order execution queue (OOQ). I have read this article:

https://software.intel.com/en-us/articles/opencl-out-of-order-queue-on-intel-processor-graphics which describes the OOQ and its performance implications.

I want to understand: when two kernels are enqueued into an OOQ, will they execute simultaneously on different EUs? I am not able to conclude that from the article. My understanding is that even with an OOQ, the kernels execute serially on the EUs, not simultaneously.

Please clarify.

Best Regards,

Rajesh

Ben_A_Intel · 03-16-2018

Hi Rajesh,

If two kernels are enqueued to an out-of-order queue, they may execute concurrently. There's no guarantee that they will, or of how much their execution will overlap, but an out-of-order queue tells the OpenCL runtime that it is allowed to run them concurrently when possible. That is not allowed when they are enqueued to an in-order queue.

Note that "executing concurrently" means that work-groups from each enqueue will be assigned to concurrently running EU threads, which may or may not be running on different EUs. This is a low-level detail that may or may not matter to you; I'm listing it here for completeness.

rajesh_k_ · 03-22-2018

Thanks Ben.

Could you please tell me what changes I would have to make on the host side to tell the runtime to execute two kernels concurrently on EU threads? Please share an example if there is one.

Best Regards,

Rajesh

Ben_A_Intel · 03-22-2018

I've been trying to find a published sample that demonstrates out-of-order queue benefits but I haven't found a good one so far.

We've seen the best out-of-order queue performance by dividing work into batches to execute concurrently and separating the batches with command queue barriers. So, let's say you have two parallel streams of work, one where A produces and B consumes, and another where C produces and D consumes. If you wanted to execute A and C concurrently, then B and D concurrently, you could do something like the following:

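// queue is the out-of-order command queue.
// Batch 1: A and C are independent, so they may execute concurrently.
clEnqueueNDRangeKernel( queue, A, ... );
clEnqueueNDRangeKernel( queue, C, ... );

// Commands enqueued after the barrier wait for the commands before it.
clEnqueueBarrierWithWaitList( queue, ... );

// Batch 2: B consumes A's output and D consumes C's output.
clEnqueueNDRangeKernel( queue, B, ... );
clEnqueueNDRangeKernel( queue, D, ... );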

Give this a try and let us know if it works for you. Meanwhile, if I can't find a sample that does this I'll see if we can publish one. Thanks!

rajesh_k_ · 03-23-2018

Hi Ben,

I had already tried what you suggested. In fact, I raised this question because I did not observe parallel execution of the kernels on the EU threads in the VTune results. I have attached the results to this post.

What I have observed is that when I use an OOQ, the two kernels are not executed in parallel; instead, the second kernel starts immediately after the first kernel finishes. When I use an IOQ, there is a significant delay before the second kernel starts, because clEnqueueNDRangeKernel for the second kernel is called only after the first kernel completes. You can see this in the attached images.

From the VTune results, the gain I see comes from the reduced launch time of the second kernel in the OOQ, not from parallel execution of the kernels.

Please share your thoughts on this.

Thanks

Rajesh

Ben_A_Intel · 03-27-2018

rajesh k. wrote:

I had already tried what you suggested. In fact, I raised this question because I did not observe parallel execution of the kernels on the EU threads in the VTune results. I have attached the results to this post.

Hi Rajesh,

Unfortunately, this is a case where measuring kernel performance with VTune affects the ability of the kernels to execute simultaneously. VTune is trying to show you how long each kernel executes, which requires measuring the start and end time of each kernel, and that is inherently a synchronous operation. To see the performance improvement from overlapping execution, you'll want to measure wall-clock time rather than something like event profiling start and end times.

We're looking at ways to improve this in the future.
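
A wall-clock measurement of the kind suggested above could be sketched roughly as follows. This is only an illustration under assumptions: it uses the POSIX clock_gettime call with CLOCK_MONOTONIC, and the names workload_fn, wall_clock_seconds, and example_workload are hypothetical. In a real test the callback would enqueue all the kernels and then block until they finish (for example with clFinish), so that the whole batch is timed as one unit instead of kernel by kernel.

```c
#define _POSIX_C_SOURCE 199309L /* for clock_gettime */
#include <time.h>

/* Hypothetical callback type: enqueue every kernel of the workload,
 * then block until the device has finished (e.g. via clFinish). */
typedef void (*workload_fn)(void *user_data);

/* Returns the elapsed wall-clock time of one run of the workload, in seconds. */
double wall_clock_seconds(workload_fn run, void *user_data)
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    run(user_data); /* must not return before the work is complete */
    clock_gettime(CLOCK_MONOTONIC, &end);

    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}

/* Stand-in workload for illustration only: counts how often it was run. */
void example_workload(void *user_data)
{
    int *run_count = (int *)user_data;
    ++*run_count;
}
```

Timing the same workload once with an in-order queue and once with an out-of-order queue, ideally averaged over several runs, gives a comparison that does not depend on per-kernel profiling.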

Looking at your time graphs, though, it looks like there is a clEnqueueWaitForEvents between the two kernels in the in-order queue. Is there a reason for this? It is a serializing event that prevents both kernels from going out in the same batch, and it explains the large gap in your picture. Note that there is neither a clEnqueueWaitForEvents nor a gap in the out-of-order queue picture.