<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic How can I reduce start latencies with OpenCL on the GPU? in OpenCL* for CPU</title>
    <link>https://community.intel.com/t5/OpenCL-for-CPU/How-can-I-reduce-start-latencies-with-OpenCL-on-the-GPU/m-p/1075466#M4550</link>
    <description>&lt;P&gt;I'm evaluating an Intel platform for use as an embedded real-time processor in our systems. Our application uses OpenCL to process incoming data on a very short real-time cycle. It is critical that the system keeps up with the input data stream. Latency between input and output is also critical, so we cannot batch up data and process it in larger quantities. For these reasons, start latency for tasks on the OpenCL command queue is as critical as kernel processing speed.&lt;/P&gt;

&lt;P&gt;One processing cycle looks something like this. Steps 7 to 11 each use an event to trigger the next step.&lt;/P&gt;

&lt;OL&gt;
	&lt;LI&gt;Map input buffer&lt;/LI&gt;
	&lt;LI&gt;Queue unmap input buffer (to be triggered by a user event)&lt;/LI&gt;
	&lt;LI&gt;Queue kernels&lt;/LI&gt;
	&lt;LI&gt;Queue map output buffer&lt;/LI&gt;
	&lt;LI&gt;Copy data in&lt;/LI&gt;
	&lt;LI&gt;Trigger unmap&lt;/LI&gt;
	&lt;LI&gt;Unmap&lt;/LI&gt;
	&lt;LI&gt;Kernel 1&lt;/LI&gt;
	&lt;LI&gt;Kernel 2&lt;/LI&gt;
	&lt;LI&gt;Kernel 3&lt;/LI&gt;
	&lt;LI&gt;Map output buffer&lt;/LI&gt;
	&lt;LI&gt;Copy data out&lt;/LI&gt;
&lt;/OL&gt;

&lt;P&gt;This sequence works very well with OpenCL on a different (non-Intel) processor but seems to suffer longer start latency than expected on this processor. Examples of the latency (in microseconds) between some of these steps are shown below.&lt;/P&gt;

&lt;UL&gt;
	&lt;LI&gt;end of 7 (unmap) to start of 8 (kernel 1): 700 - 1400&lt;/LI&gt;
	&lt;LI&gt;end of 8 (kernel 1) to start of 9 (kernel 2): 400 - 900&lt;/LI&gt;
	&lt;LI&gt;end of 9 (kernel 2) to start of 10 (kernel 3): 400 - 700&lt;/LI&gt;
	&lt;LI&gt;end of 10 (kernel 3) to start of 11 (map): 300 - 600&lt;/LI&gt;
&lt;/UL&gt;

&lt;P&gt;These times are huge for our system, which operates on a short real-time cycle.&lt;/P&gt;

&lt;P&gt;Does anyone have some insight into what might be causing this and how we could reduce the times? Some specifics of the system are given below in case they might help.&lt;/P&gt;

&lt;P&gt;Thanks, Tony&lt;/P&gt;

&lt;P&gt;Linux: Yocto from the Apollo Lake BSP release &lt;EM&gt;gold&lt;/EM&gt;, build &lt;EM&gt;core-image-sato-sdk&lt;/EM&gt;, installed on onboard eMMC.&lt;/P&gt;

&lt;P&gt;Hardware: Oxbow Hill Rev B CRB with Intel Atom E3950 and 8GB DDR3 RAM&lt;/P&gt;

&lt;P&gt;OpenCL: user-space drivers installed from SRB4, &lt;A href="https://software.intel.com/file/533571/download" target="_blank"&gt;https://software.intel.com/file/533571/download&lt;/A&gt;&lt;/P&gt;</description>
    <pubDate>Fri, 20 Jan 2017 09:55:40 GMT</pubDate>
    <dc:creator>tony_w_</dc:creator>
    <dc:date>2017-01-20T09:55:40Z</dc:date>
    <item>
      <title>How can I reduce start latencies with OpenCL on the GPU?</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/How-can-I-reduce-start-latencies-with-OpenCL-on-the-GPU/m-p/1075466#M4550</link>
      <description>&lt;P&gt;I'm evaluating an Intel platform for use as an embedded real-time processor in our systems. Our application uses OpenCL to process incoming data on a very short real-time cycle. It is critical that the system keeps up with the input data stream. Latency between input and output is also critical, so we cannot batch up data and process it in larger quantities. For these reasons, start latency for tasks on the OpenCL command queue is as critical as kernel processing speed.&lt;/P&gt;

&lt;P&gt;One processing cycle looks something like this. Steps 7 to 11 each use an event to trigger the next step.&lt;/P&gt;

&lt;OL&gt;
	&lt;LI&gt;Map input buffer&lt;/LI&gt;
	&lt;LI&gt;Queue unmap input buffer (to be triggered by a user event)&lt;/LI&gt;
	&lt;LI&gt;Queue kernels&lt;/LI&gt;
	&lt;LI&gt;Queue map output buffer&lt;/LI&gt;
	&lt;LI&gt;Copy data in&lt;/LI&gt;
	&lt;LI&gt;Trigger unmap&lt;/LI&gt;
	&lt;LI&gt;Unmap&lt;/LI&gt;
	&lt;LI&gt;Kernel 1&lt;/LI&gt;
	&lt;LI&gt;Kernel 2&lt;/LI&gt;
	&lt;LI&gt;Kernel 3&lt;/LI&gt;
	&lt;LI&gt;Map output buffer&lt;/LI&gt;
	&lt;LI&gt;Copy data out&lt;/LI&gt;
&lt;/OL&gt;
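
&lt;P&gt;A minimal sketch, in Python for brevity, of how the end-to-start gaps between steps 7 to 11 could be computed. It assumes each command's event carries start and end timestamps in nanoseconds, which is what OpenCL reports for CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END on a queue created with CL_QUEUE_PROFILING_ENABLE. The helper name start_gaps_us and the sample timestamps are hypothetical and are not measurements from this system.&lt;/P&gt;

&lt;PRE&gt;
```python
# Hypothetical helper: the name and the sample data below are illustrative only.
def start_gaps_us(commands):
    """Return (previous, next, gap_us) for each consecutive pair of commands.

    commands: list of (name, start_ns, end_ns) tuples in execution order.
    The gap is the time from the end of one command to the start of the next.
    """
    gaps = []
    for prev, nxt in zip(commands, commands[1:]):
        prev_name, _, prev_end_ns = prev
        next_name, next_start_ns, _ = nxt
        gaps.append((prev_name, next_name, (next_start_ns - prev_end_ns) / 1000.0))
    return gaps


# Made-up timestamps in nanoseconds, NOT measurements from this system.
timeline = [
    ("unmap", 0, 50_000),
    ("kernel 1", 750_000, 900_000),
    ("kernel 2", 1_300_000, 1_450_000),
]

for prev_name, next_name, gap_us in start_gaps_us(timeline):
    print(f"end of {prev_name} to start of {next_name}: {gap_us:.0f} us")
```
&lt;/PRE&gt;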

&lt;P&gt;This sequence works very well with OpenCL on a different (non-Intel) processor but seems to suffer longer start latency than expected on this processor. Examples of the latency (in microseconds) between some of these steps are shown below.&lt;/P&gt;

&lt;UL&gt;
	&lt;LI&gt;end of 7 (unmap) to start of 8 (kernel 1): 700 - 1400&lt;/LI&gt;
	&lt;LI&gt;end of 8 (kernel 1) to start of 9 (kernel 2): 400 - 900&lt;/LI&gt;
	&lt;LI&gt;end of 9 (kernel 2) to start of 10 (kernel 3): 400 - 700&lt;/LI&gt;
	&lt;LI&gt;end of 10 (kernel 3) to start of 11 (map): 300 - 600&lt;/LI&gt;
&lt;/UL&gt;
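
&lt;P&gt;Summing the ranges above gives a rough per-cycle budget for start latency alone, before any kernel execution time. The Python sketch below assumes the four gaps simply add up along the chain and that the extremes can coincide in one cycle; the variable names are illustrative.&lt;/P&gt;

&lt;PRE&gt;
```python
# Latency ranges in microseconds, copied from the list above.
gaps_us = {
    "unmap to kernel 1": (700, 1400),
    "kernel 1 to kernel 2": (400, 900),
    "kernel 2 to kernel 3": (400, 700),
    "kernel 3 to map": (300, 600),
}

# Add up the lower and upper bounds separately (assumes the gaps are additive).
best_us = sum(low for low, _ in gaps_us.values())
worst_us = sum(high for _, high in gaps_us.values())

print(f"start latency per cycle: {best_us} to {worst_us} us")
```
&lt;/PRE&gt;

&lt;P&gt;Under those assumptions the start latency alone comes to roughly 1.8 to 3.6 ms per cycle.&lt;/P&gt;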

&lt;P&gt;These times are huge for our system, which operates on a short real-time cycle.&lt;/P&gt;

&lt;P&gt;Does anyone have some insight into what might be causing this and how we could reduce the times? Some specifics of the system are given below in case they might help.&lt;/P&gt;

&lt;P&gt;Thanks, Tony&lt;/P&gt;

&lt;P&gt;Linux: Yocto from the Apollo Lake BSP release &lt;EM&gt;gold&lt;/EM&gt;, build &lt;EM&gt;core-image-sato-sdk&lt;/EM&gt;, installed on onboard eMMC.&lt;/P&gt;

&lt;P&gt;Hardware: Oxbow Hill Rev B CRB with Intel Atom E3950 and 8GB DDR3 RAM&lt;/P&gt;

&lt;P&gt;OpenCL: user-space drivers installed from SRB4, &lt;A href="https://software.intel.com/file/533571/download" target="_blank"&gt;https://software.intel.com/file/533571/download&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 20 Jan 2017 09:55:40 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/How-can-I-reduce-start-latencies-with-OpenCL-on-the-GPU/m-p/1075466#M4550</guid>
      <dc:creator>tony_w_</dc:creator>
      <dc:date>2017-01-20T09:55:40Z</dc:date>
    </item>
    <item>
      <title>Hello Tony,</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/How-can-I-reduce-start-latencies-with-OpenCL-on-the-GPU/m-p/1075467#M4551</link>
      <description>&lt;P&gt;Hello Tony,&lt;/P&gt;

&lt;P&gt;Could you provide a reproducer for the API sequence?&lt;/P&gt;

&lt;P&gt;The latency seems too high (especially deltas #2, #3 and #4), so a better understanding of the exact API calls, event sequence and resource setup would help in this case.&lt;/P&gt;</description>
      <pubDate>Fri, 20 Jan 2017 16:38:25 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/How-can-I-reduce-start-latencies-with-OpenCL-on-the-GPU/m-p/1075467#M4551</guid>
      <dc:creator>Michal_M_Intel</dc:creator>
      <dc:date>2017-01-20T16:38:25Z</dc:date>
    </item>
    <item>
      <title>Tony, this Intel presentation</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/How-can-I-reduce-start-latencies-with-OpenCL-on-the-GPU/m-p/1075468#M4552</link>
      <description>&lt;P&gt;Tony, this Intel presentation might be relevant to your work:&lt;/P&gt;

&lt;BLOCKQUOTE&gt;
	&lt;P&gt;&lt;A href="http://www.iwocl.org/wp-content/uploads/iwocl-2016-gpu-daemon.pdf" target="_blank"&gt;http://www.iwocl.org/wp-content/uploads/iwocl-2016-gpu-daemon.pdf&lt;/A&gt;&lt;/P&gt;

	&lt;H3 class="spk-title" data-fontsize="16" data-lineheight="24" style="box-sizing: border-box; margin-bottom: 10px; font-size: 16px; font-weight: 400; font-family: &amp;quot;Open Sans&amp;quot;; line-height: 1.5; color: rgb(0, 51, 51);"&gt;&lt;STRONG&gt;GPU daemon – Road to Zero Cost Submission&lt;/STRONG&gt;&lt;/H3&gt;

	&lt;H4 class="spk-author" data-fontsize="15" data-lineheight="22" style="box-sizing: border-box; margin-top: 5px; margin-bottom: 10px; font-size: 15px; font-weight: 400; font-family: &amp;quot;Open Sans&amp;quot;; line-height: 1.5; color: rgb(0, 51, 51);"&gt;Michal Mrozek and Zbigniew Zdanowicz (Intel)&lt;/H4&gt;

	&lt;P class="spk-abstract" style="box-sizing: border-box; line-height: 1.2; margin-bottom: 20px; color: rgb(62, 62, 62); font-family: &amp;quot;Open Sans&amp;quot;; font-size: 16px;"&gt;One of the biggest problems of OpenCL efficient usage is the latency submission. Time needed to pass through the driver stack is so significant that it limits the use of OpenCL on GPU in applications requiring low-latency. This presentation we present a novel approach utilizing new features of OpenCL 2.0 : Fine-Grained SVM and device enqueue_kernel that allows completely new usage models. We will present the idea of GPU daemon that operates using different modes (polling, enqueue_kernel and monitored_fence) and offers various levels of flexibility for the end user application. Part of presentation will show the data &amp;amp; code samples for each approach and will also compare each mode with the traditional submission model.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;</description>
      <pubDate>Fri, 20 Jan 2017 16:50:11 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/How-can-I-reduce-start-latencies-with-OpenCL-on-the-GPU/m-p/1075468#M4552</guid>
      <dc:creator>allanmac1</dc:creator>
      <dc:date>2017-01-20T16:50:11Z</dc:date>
    </item>
  </channel>
</rss>

