<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic parallel execution of kernels on EU's through OOQ is possible? in OpenCL* for CPU</title>
    <link>https://community.intel.com/t5/OpenCL-for-CPU/parallel-execution-of-kernels-on-EU-s-through-OOQ-is-possible/m-p/1152848#M6108</link>
    <description>&lt;P&gt;Hi,&lt;/P&gt;

&lt;P&gt;I am developing an OpenCL kernel that runs on an out-of-order execution queue (OOQ). I have read this article, which describes the OOQ and its performance implications:&lt;/P&gt;

&lt;P&gt;&lt;A href="https://software.intel.com/en-us/articles/opencl-out-of-order-queue-on-intel-processor-graphics" target="_blank"&gt;https://software.intel.com/en-us/articles/opencl-out-of-order-queue-on-intel-processor-graphics&lt;/A&gt;&lt;/P&gt;

&lt;P&gt;I want to understand: when two kernels are enqueued into an OOQ, will they execute simultaneously on different EUs? I am not able to conclude that from the article. My understanding is that even with an OOQ, the kernels are executed serially, not simultaneously, on the EUs.&lt;/P&gt;

&lt;P&gt;Please clarify this for me.&lt;/P&gt;

&lt;P&gt;Best Regards,&lt;/P&gt;

&lt;P&gt;Rajesh&lt;/P&gt;</description>
    <pubDate>Tue, 13 Mar 2018 09:22:58 GMT</pubDate>
    <dc:creator>rajesh_k_</dc:creator>
    <dc:date>2018-03-13T09:22:58Z</dc:date>
    <item>
      <title>parallel execution of kernels on EU's through OOQ is possible?</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/parallel-execution-of-kernels-on-EU-s-through-OOQ-is-possible/m-p/1152848#M6108</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;

&lt;P&gt;I am developing an OpenCL kernel that runs on an out-of-order execution queue (OOQ). I have read this article, which describes the OOQ and its performance implications:&lt;/P&gt;

&lt;P&gt;&lt;A href="https://software.intel.com/en-us/articles/opencl-out-of-order-queue-on-intel-processor-graphics" target="_blank"&gt;https://software.intel.com/en-us/articles/opencl-out-of-order-queue-on-intel-processor-graphics&lt;/A&gt;&lt;/P&gt;

&lt;P&gt;I want to understand: when two kernels are enqueued into an OOQ, will they execute simultaneously on different EUs? I am not able to conclude that from the article. My understanding is that even with an OOQ, the kernels are executed serially, not simultaneously, on the EUs.&lt;/P&gt;

&lt;P&gt;Please clarify this for me.&lt;/P&gt;

&lt;P&gt;Best Regards,&lt;/P&gt;

&lt;P&gt;Rajesh&lt;/P&gt;</description>
      <pubDate>Tue, 13 Mar 2018 09:22:58 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/parallel-execution-of-kernels-on-EU-s-through-OOQ-is-possible/m-p/1152848#M6108</guid>
      <dc:creator>rajesh_k_</dc:creator>
      <dc:date>2018-03-13T09:22:58Z</dc:date>
    </item>
    <item>
      <title>Hi Rajesh,</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/parallel-execution-of-kernels-on-EU-s-through-OOQ-is-possible/m-p/1152849#M6109</link>
      <description>&lt;P&gt;Hi Rajesh,&lt;/P&gt;

&lt;P&gt;If two kernels are enqueued to an out-of-order queue then they may execute concurrently. There's no guarantee that they will execute concurrently, or of how much their execution will overlap, but using an out-of-order queue tells the OpenCL runtime that it may run them concurrently when possible. That is not an option when they are enqueued to an in-order queue.&lt;/P&gt;
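
&lt;P&gt;For reference, a minimal sketch of how such a queue might be created on the host, assuming an OpenCL 2.0 or later runtime and an existing context and device. The helper name create_out_of_order_queue is made up for illustration and is not from an Intel sample:&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;#include &amp;lt;CL/cl.h&amp;gt;

// Hypothetical helper: creates an out-of-order command queue for an
// existing context and device. The caller owns the returned queue.
cl_command_queue create_out_of_order_queue(cl_context context,
                                           cl_device_id device,
                                           cl_int* err)
{
    // Zero-terminated list of (property, value) pairs.
    const cl_queue_properties props[] = {
        CL_QUEUE_PROPERTIES, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE,
        0
    };
    return clCreateCommandQueueWithProperties(context, device, props, err);
}&lt;/PRE&gt;

&lt;P&gt;On an OpenCL 1.x runtime, the older clCreateCommandQueue call takes the same CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE flag directly as its properties bitfield.&lt;/P&gt;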

&lt;P&gt;Note that "executing concurrently" means that work groups from the enqueue will be assigned to concurrently running EU threads, which may or may not be running on different EUs.&amp;nbsp; This is a low-level detail that may or may not matter to you, but listing it here for completeness.&lt;/P&gt;</description>
      <pubDate>Fri, 16 Mar 2018 17:14:21 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/parallel-execution-of-kernels-on-EU-s-through-OOQ-is-possible/m-p/1152849#M6109</guid>
      <dc:creator>Ben_A_Intel</dc:creator>
      <dc:date>2018-03-16T17:14:21Z</dc:date>
    </item>
    <item>
      <title>Thanks Ben.</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/parallel-execution-of-kernels-on-EU-s-through-OOQ-is-possible/m-p/1152850#M6110</link>
      <description>&lt;P&gt;Thanks Ben.&lt;/P&gt;

&lt;P&gt;Could you please let me know what changes I would have to make on the host side to tell the runtime to execute two kernels concurrently on EU threads? If there is an example, please share it with me.&lt;/P&gt;

&lt;P&gt;Best Regards,&lt;/P&gt;

&lt;P&gt;Rajesh&lt;/P&gt;</description>
      <pubDate>Thu, 22 Mar 2018 09:22:10 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/parallel-execution-of-kernels-on-EU-s-through-OOQ-is-possible/m-p/1152850#M6110</guid>
      <dc:creator>rajesh_k_</dc:creator>
      <dc:date>2018-03-22T09:22:10Z</dc:date>
    </item>
    <item>
      <title>I've been trying to find a</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/parallel-execution-of-kernels-on-EU-s-through-OOQ-is-possible/m-p/1152851#M6111</link>
      <description>&lt;P&gt;I've been trying to find a published sample that demonstrates out-of-order queue benefits but I haven't found a good one so far.&lt;/P&gt;

&lt;P&gt;We've seen the best out-of-order queue performance by dividing work into batches to execute concurrently and separating the batches with command queue barriers.&amp;nbsp; So, let's say you have two parallel streams of work, one where A produces and B consumes, and another where C produces and D consumes.&amp;nbsp; If you wanted to execute A and C concurrently, then B and D concurrently, you could do something like the following:&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;clEnqueueNDRangeKernel( A, ... );
clEnqueueNDRangeKernel( C, ... );

clEnqueueBarrierWithWaitList( ... );

clEnqueueNDRangeKernel( B, ... );
clEnqueueNDRangeKernel( D, ... );&lt;/PRE&gt;
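
&lt;P&gt;Spelled out a little further, the same pattern could look roughly like the sketch below. It is only an illustration under stated assumptions: run_two_batches is a made-up helper, the queue is assumed to have been created with CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, the four kernels are assumed to have their arguments set already, and all of them are assumed to use the same one-dimensional global size.&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;#include &amp;lt;CL/cl.h&amp;gt;

// Hypothetical helper: enqueues A and C, a barrier, then B and D on an
// out-of-order queue, and waits for everything to finish.
cl_int run_two_batches(cl_command_queue queue,
                       cl_kernel A, cl_kernel B, cl_kernel C, cl_kernel D,
                       const size_t* global_size)
{
    cl_kernel first_batch[]  = { A, C };  // independent producers
    cl_kernel second_batch[] = { B, D };  // consumers of A and C

    // No events tie A and C together, so the runtime may overlap them.
    for (cl_kernel k : first_batch) {
        cl_int err = clEnqueueNDRangeKernel(queue, k, 1, nullptr, global_size,
                                            nullptr, 0, nullptr, nullptr);
        if (err != CL_SUCCESS) return err;
    }

    // Empty wait list: the barrier waits for every command enqueued before it.
    cl_int err = clEnqueueBarrierWithWaitList(queue, 0, nullptr, nullptr);
    if (err != CL_SUCCESS) return err;

    // B and D start only after the barrier, and may again overlap each other.
    for (cl_kernel k : second_batch) {
        err = clEnqueueNDRangeKernel(queue, k, 1, nullptr, global_size,
                                     nullptr, 0, nullptr, nullptr);
        if (err != CL_SUCCESS) return err;
    }

    // Block the host until both batches have completed.
    return clFinish(queue);
}&lt;/PRE&gt;

&lt;P&gt;The barrier is coarser than per-kernel events (B also waits for C, and D for A), but it keeps the host code simple and lets each batch be submitted together.&lt;/P&gt;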

&lt;P&gt;Give this a try and let us know if it works for you.&amp;nbsp; Meanwhile, if I can't find a sample that does this I'll see if we can publish one.&amp;nbsp; Thanks!&lt;/P&gt;</description>
      <pubDate>Thu, 22 Mar 2018 15:35:55 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/parallel-execution-of-kernels-on-EU-s-through-OOQ-is-possible/m-p/1152851#M6111</guid>
      <dc:creator>Ben_A_Intel</dc:creator>
      <dc:date>2018-03-22T15:35:55Z</dc:date>
    </item>
    <item>
      <title>Hi Ben,</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/parallel-execution-of-kernels-on-EU-s-through-OOQ-is-possible/m-p/1152852#M6112</link>
      <description>&lt;P&gt;Hi Ben,&lt;/P&gt;

&lt;P&gt;I have already tried what you suggested earlier. In fact, I raised this question because I did not observe parallel execution of the kernels on the EU threads in the VTune results. I have attached the results to this post.&lt;/P&gt;

&lt;P&gt;What I have observed is that when I use an OOQ, the two kernels are not executed in parallel; instead, the second kernel starts immediately after the first kernel finishes. When I use an in-order queue (IOQ), there is a significant delay before the second kernel starts. This is because clEnqueueNDRangeKernel for the second kernel is issued only after the first kernel has completed. You can see this in the attached images.&lt;/P&gt;

&lt;P&gt;From the VTune results, the gain I see comes from the reduced launch time of the second kernel in the OOQ, not from parallel execution of the kernels.&lt;/P&gt;

&lt;P&gt;Please share your thoughts on this.&lt;/P&gt;

&lt;P&gt;Thanks&lt;/P&gt;

&lt;P&gt;Rajesh&lt;/P&gt;</description>
      <pubDate>Fri, 23 Mar 2018 10:36:43 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/parallel-execution-of-kernels-on-EU-s-through-OOQ-is-possible/m-p/1152852#M6112</guid>
      <dc:creator>rajesh_k_</dc:creator>
      <dc:date>2018-03-23T10:36:43Z</dc:date>
    </item>
    <item>
      <title>Quote:rajesh k. wrote:</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/parallel-execution-of-kernels-on-EU-s-through-OOQ-is-possible/m-p/1152853#M6113</link>
      <description>&lt;BLOCKQUOTE&gt;rajesh k. wrote:&lt;BR /&gt;

&lt;P&gt;I have already tried what you suggested earlier. In fact, I raised this question because I did not observe parallel execution of the kernels on the EU threads in the VTune results. I have attached the results to this post.&lt;/P&gt;&lt;/BLOCKQUOTE&gt;

&lt;P&gt;Hi Rajesh,&lt;/P&gt;

&lt;P&gt;Unfortunately, this is a case where measuring kernel performance with VTune affects the ability of the kernels to execute simultaneously. VTune is trying to show you how long each kernel executes, which requires measuring the start and end time of each kernel, and that is inherently a synchronous operation. To see the performance improvement from overlapping execution, you'll want to measure wall clock time rather than something like event profiling start and end times.&lt;/P&gt;
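
&lt;P&gt;One way to take such a wall clock measurement is sketched below. This is an illustration only: wall_clock_ms is a made-up helper, and enqueue_work stands in for whatever function or lambda enqueues your kernels on the given queue.&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;#include &amp;lt;CL/cl.h&amp;gt;
#include &amp;lt;chrono&amp;gt;

// Hypothetical helper: returns the host-side wall clock time, in
// milliseconds, from the first enqueue until the queue has drained.
template &amp;lt;typename F&amp;gt;
double wall_clock_ms(cl_command_queue queue, F enqueue_work)
{
    clFinish(queue);  // drain earlier work so it is not counted

    const auto start = std::chrono::steady_clock::now();
    enqueue_work(queue);  // e.g. enqueue both kernels back to back
    clFinish(queue);      // block until everything enqueued above is done
    const auto stop = std::chrono::steady_clock::now();

    return std::chrono::duration&amp;lt;double, std::milli&amp;gt;(stop - start).count();
}&lt;/PRE&gt;

&lt;P&gt;Running the same enqueue_work once on an in-order queue and once on an out-of-order queue, without the profiler attached, and comparing the two numbers should show whether overlapping execution is helping.&lt;/P&gt;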

&lt;P&gt;We're looking at ways to improve this in the future.&lt;/P&gt;

&lt;P&gt;Looking at your time graphs, though, it looks like there is a clEnqueueWaitForEvents call between the two kernels in the in-order queue. Is there a reason for this? It is a serializing call that prevents both kernels from going out in the same batch, which explains the large gap in your picture. Note that there is neither a clEnqueueWaitForEvents call nor a gap in the out-of-order queue picture.&lt;/P&gt;</description>
      <pubDate>Tue, 27 Mar 2018 21:41:34 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/parallel-execution-of-kernels-on-EU-s-through-OOQ-is-possible/m-p/1152853#M6113</guid>
      <dc:creator>Ben_A_Intel</dc:creator>
      <dc:date>2018-03-27T21:41:34Z</dc:date>
    </item>
  </channel>
</rss>

