<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Ilia, in OpenCL* for CPU</title>
    <link>https://community.intel.com/t5/OpenCL-for-CPU/Poor-performance-with-opencl-CPU-driver/m-p/1015632#M3133</link>
    <description>&lt;P&gt;Ilia,&lt;/P&gt;

&lt;P&gt;You are measuring the total execution time of a program that has a number of issues:&lt;/P&gt;

&lt;P&gt;1. You are allocating and deallocating buffers in a loop, which is highly undesirable. The usual recommendation is to allocate buffers once, outside the loop.&lt;/P&gt;

&lt;P&gt;2. You are allocating buffers the wrong way for our platforms: you need to use the CL_MEM_USE_HOST_PTR flag, allocate the arrays with aligned_alloc using 4096-byte alignment, and size your buffers in multiples of 64 bytes.&lt;/P&gt;

&lt;P&gt;3. You shouldn't use clEnqueueReadBuffer and clEnqueueWriteBuffer: use clEnqueueMapBuffer instead, which should result in no copies to/from the device and almost instant execution.&lt;/P&gt;

&lt;P&gt;Please check this article on how to do performance measurements for OpenCL &lt;A href="https://software.intel.com/en-us/articles/intel-sdk-for-opencl-applications-performance-debugging-intro"&gt;https://software.intel.com/en-us/articles/intel-sdk-for-opencl-applications-performance-debugging-intro&lt;/A&gt; and this article on how to allocate "zero-copy" buffers &lt;A href="https://software.intel.com/en-us/articles/getting-the-most-from-opencl-12-how-to-increase-performance-by-minimizing-buffer-copies-on-intel-processor-graphics"&gt;https://software.intel.com/en-us/articles/getting-the-most-from-opencl-12-how-to-increase-performance-by-minimizing-buffer-copies-on-intel-processor-graphics&lt;/A&gt;&lt;/P&gt;

&lt;P&gt;Bottom line: you should be measuring kernel performance. What you are measuring is program build time, buffer allocation/deallocation, copying data back and forth, and only a little bit of kernel performance.&lt;/P&gt;</description>
    <pubDate>Thu, 24 Sep 2015 18:36:48 GMT</pubDate>
    <dc:creator>Robert_I_Intel</dc:creator>
    <dc:date>2015-09-24T18:36:48Z</dc:date>
    <item>
      <title>Poor performance with opencl CPU driver</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/Poor-performance-with-opencl-CPU-driver/m-p/1015631#M3132</link>
      <description>&lt;P&gt;&lt;A href="http://pastebin.com/FyZkMrvQ" lang="C"&gt;Link to source&lt;BR /&gt;
	http://pastebin.com/FyZkMrvQ&lt;/A&gt;&lt;BR /&gt;
	The Intel&lt;SPAN class="st"&gt;®&lt;/SPAN&gt; software used was the OpenCL CPU driver opencl_runtime_15.1_x64_5.0.0.57 from &lt;A href="https://software.intel.com/en-us/articles/opencl-drivers#lin64" target="_blank"&gt;https://software.intel.com/en-us/articles/opencl-drivers#lin64&lt;/A&gt;&lt;/P&gt;

&lt;P&gt;Comparison of Beignet (GPU, id 0) vs. the Intel&lt;SPAN class="st"&gt;®&lt;/SPAN&gt; proprietary driver (CPU, id 1) vs. pocl (CPU, id 2):&lt;/P&gt;

&lt;P&gt;user@host:~/.dev/OpenCL$&amp;nbsp;gcc perftest.c -std=c11 -O2 -lOpenCL -o perftest&lt;BR /&gt;
	user@host:~/.dev/OpenCL$ for id in 0 1 2; do time ./perftest $id; done&lt;BR /&gt;
	Succeeded to create a device group!&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;Device: 0&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;Name:&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;Intel(R) HD Graphics Haswell Ultrabook GT2 Mobile&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;Vendor:&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;Intel&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;Available:&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;Yes&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;Compute Units:&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;20&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;Clock Frequency:&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;1000 mHz&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;Global Memory:&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;2048 mb&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;Max Allocateable Memory:&amp;nbsp;&amp;nbsp; &amp;nbsp;1024 mb&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;Local Memory:&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;65536 kb&lt;/P&gt;

&lt;P&gt;Succeeded to create a compute context!&lt;BR /&gt;
	Succeeded to create a command commands!&lt;BR /&gt;
	Succeeded to create compute program!&lt;BR /&gt;
	Succeeded to create program executable!&lt;BR /&gt;
	Succeeded to create compute kernel!&lt;/P&gt;

&lt;P&gt;real&amp;nbsp;&amp;nbsp; &amp;nbsp;0m25.741s&lt;BR /&gt;
	user&amp;nbsp;&amp;nbsp; &amp;nbsp;0m0.604s&lt;BR /&gt;
	sys&amp;nbsp;&amp;nbsp; &amp;nbsp;0m17.796s&lt;BR /&gt;
	Succeeded to create a device group!&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;Device: 1&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;Name:&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;Vendor:&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;Intel(R) Corporation&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;Available:&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;Yes&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;Compute Units:&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;4&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;Clock Frequency:&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;1600 mHz&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;Global Memory:&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;5664 mb&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;Max Allocateable Memory:&amp;nbsp;&amp;nbsp; &amp;nbsp;1416 mb&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;Local Memory:&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;32768 kb&lt;/P&gt;

&lt;P&gt;Succeeded to create a compute context!&lt;BR /&gt;
	Succeeded to create a command commands!&lt;BR /&gt;
	Succeeded to create compute program!&lt;BR /&gt;
	Succeeded to create program executable!&lt;BR /&gt;
	Succeeded to create compute kernel!&lt;/P&gt;

&lt;P&gt;real&amp;nbsp;&amp;nbsp; &amp;nbsp;0m50.082s&lt;BR /&gt;
	user&amp;nbsp;&amp;nbsp; &amp;nbsp;1m21.951s&lt;BR /&gt;
	sys&amp;nbsp;&amp;nbsp; &amp;nbsp;0m40.065s&lt;BR /&gt;
	Succeeded to create a device group!&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;Device: 2&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;Name:&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;pthread-Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;Vendor:&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;GenuineIntel&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;Available:&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;Yes&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;Compute Units:&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;4&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;Clock Frequency:&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;2600 mHz&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;Global Memory:&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;5664 mb&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;Max Allocateable Memory:&amp;nbsp;&amp;nbsp; &amp;nbsp;5664 mb&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;Local Memory:&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;1643847680 kb&lt;/P&gt;

&lt;P&gt;Succeeded to create a compute context!&lt;BR /&gt;
	Succeeded to create a command commands!&lt;BR /&gt;
	Succeeded to create compute program!&lt;BR /&gt;
	Succeeded to create program executable!&lt;BR /&gt;
	Succeeded to create compute kernel!&lt;/P&gt;

&lt;P&gt;real&amp;nbsp;&amp;nbsp; &amp;nbsp;0m28.620s&lt;BR /&gt;
	user&amp;nbsp;&amp;nbsp; &amp;nbsp;0m49.843s&lt;BR /&gt;
	sys&amp;nbsp;&amp;nbsp; &amp;nbsp;0m4.252s&lt;/P&gt;

&lt;P&gt;&lt;BR /&gt;
	My clinfo output: &lt;A href="http://pastebin.com/30jkBzzs"&gt;http://pastebin.com/30jkBzzs&lt;/A&gt;&lt;BR /&gt;
	This looks strange: the open source library pocl (&lt;A href="http://portablecl.org"&gt;http://portablecl.org&lt;/A&gt;) beats the official Intel&lt;SPAN class="st"&gt;®&lt;/SPAN&gt; software in such a simple test case (ignore the reported "Clock Frequency"; under load the CPU runs at 2300 MHz in both cases). If this isn't a bug in my system, maybe it would be better for Intel&lt;SPAN class="st"&gt;®&lt;/SPAN&gt; to support pocl (which still has a lot of problems with standards support and stability) instead of developing its own driver?&lt;/P&gt;</description>
      <pubDate>Mon, 07 Sep 2015 10:34:50 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/Poor-performance-with-opencl-CPU-driver/m-p/1015631#M3132</guid>
      <dc:creator>Ilia_E_</dc:creator>
      <dc:date>2015-09-07T10:34:50Z</dc:date>
    </item>
    <item>
      <title>Ilia,</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/Poor-performance-with-opencl-CPU-driver/m-p/1015632#M3133</link>
      <description>&lt;P&gt;Ilia,&lt;/P&gt;

&lt;P&gt;You are measuring the total execution time of a program that has a number of issues:&lt;/P&gt;

&lt;P&gt;1. You are allocating and deallocating buffers in a loop, which is highly undesirable. The usual recommendation is to allocate buffers once, outside the loop.&lt;/P&gt;

&lt;P&gt;2. You are allocating buffers the wrong way for our platforms: you need to use the CL_MEM_USE_HOST_PTR flag, allocate the arrays with aligned_alloc using 4096-byte alignment, and size your buffers in multiples of 64 bytes.&lt;/P&gt;

&lt;P&gt;3. You shouldn't use clEnqueueReadBuffer and clEnqueueWriteBuffer: use clEnqueueMapBuffer instead, which should result in no copies to/from the device and almost instant execution.&lt;/P&gt;
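
&lt;P&gt;As a rough illustration of the host-side allocation rules in point 2, here is a minimal C11 sketch. The helper names (round_up, is_aligned, alloc_zero_copy_host) are hypothetical and not part of OpenCL or of the original test program. The size is rounded up to a multiple of 4096 rather than 64, on the assumption that this keeps aligned_alloc happy (C11 expects the size to be a multiple of the alignment) while still being a multiple of 64. No OpenCL calls are made; the comment only indicates where the pointer would be handed to clCreateBuffer.&lt;/P&gt;

<![CDATA[
```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Values quoted in the post for Intel platforms. */
#define HOST_PTR_ALIGNMENT   4096u  /* required alignment of the host pointer */
#define BUFFER_SIZE_MULTIPLE 64u    /* buffer size must be a multiple of this */

/* Round n up to the next multiple of m; m must be a power of two. */
size_t round_up(size_t n, size_t m)
{
    assert(m != 0 && (m & (m - 1)) == 0);
    return (n + m - 1) & ~(m - 1);
}

/* Returns 1 if p is aligned to the given power-of-two alignment, else 0. */
int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* Hypothetical helper: allocate zeroed host memory suitable for
 * CL_MEM_USE_HOST_PTR. The padded size is stored in *actual (if non-NULL).
 * Returns NULL on failure; release the memory with free(). */
void *alloc_zero_copy_host(size_t requested, size_t *actual)
{
    /* A multiple of 4096 is also a multiple of 64, so one rounding step
     * satisfies both the aligned_alloc rule and the 64-byte size rule. */
    size_t size = round_up(requested ? requested : 1, HOST_PTR_ALIGNMENT);
    void *p = aligned_alloc(HOST_PTR_ALIGNMENT, size);
    if (p == NULL)
        return NULL;
    memset(p, 0, size);
    if (actual != NULL)
        *actual = size;
    return p;
}

/* With a real OpenCL context (not created in this sketch), the pointer and
 * padded size would then be passed along these lines:
 *
 *   cl_mem buf = clCreateBuffer(context,
 *                               CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
 *                               size, host_ptr, &err);
 *
 * and the host would access the data via clEnqueueMapBuffer and
 * clEnqueueUnmapMemObject instead of read/write copies. */
```
]]>

&lt;P&gt;For example, requesting 1000 bytes through this helper would yield a 4096-byte block whose address is a multiple of 4096. Whether the runtime actually avoids a copy still depends on the driver and device.&lt;/P&gt;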

&lt;P&gt;Please check this article on how to do performance measurements for OpenCL &lt;A href="https://software.intel.com/en-us/articles/intel-sdk-for-opencl-applications-performance-debugging-intro"&gt;https://software.intel.com/en-us/articles/intel-sdk-for-opencl-applications-performance-debugging-intro&lt;/A&gt; and this article on how to allocate "zero-copy" buffers &lt;A href="https://software.intel.com/en-us/articles/getting-the-most-from-opencl-12-how-to-increase-performance-by-minimizing-buffer-copies-on-intel-processor-graphics"&gt;https://software.intel.com/en-us/articles/getting-the-most-from-opencl-12-how-to-increase-performance-by-minimizing-buffer-copies-on-intel-processor-graphics&lt;/A&gt;&lt;/P&gt;
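
&lt;P&gt;To make the measurement point concrete without depending on any OpenCL runtime, here is a small plain-C sketch that allocates once outside the loop and times setup separately from the repeated work. Everything in it (phase_times, elapsed_seconds, scale_in_place, time_phases) is a hypothetical stand-in: scale_in_place plays the role of the kernel, and wall-clock readings from timespec_get stand in for proper device timing. In real OpenCL code the more usual route is event profiling (a queue created with CL_QUEUE_PROFILING_ENABLE and clGetEventProfilingInfo), which this sketch does not attempt.&lt;/P&gt;

<![CDATA[
```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Per-phase wall-clock times plus one result value for sanity checking. */
struct phase_times {
    double setup_s;    /* allocation and initialisation, done once */
    double compute_s;  /* the repeated "kernel" work only */
    float  last_value; /* final value of the last element */
};

/* Seconds elapsed between two timespec readings. */
double elapsed_seconds(struct timespec start, struct timespec end)
{
    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}

/* Stand-in for a device kernel: scales every element in place. */
void scale_in_place(float *data, size_t n, float factor)
{
    assert(data != NULL);
    for (size_t i = 0; i < n; ++i)
        data[i] *= factor;
}

/* Allocates once, outside the loop, then times setup and compute separately.
 * Returns negative times if n is zero or the allocation fails. */
struct phase_times time_phases(size_t n, int iterations)
{
    struct phase_times t = { -1.0, -1.0, 0.0f };
    struct timespec t0, t1, t2;

    if (n == 0)
        return t;

    timespec_get(&t0, TIME_UTC);
    float *data = malloc(n * sizeof *data);
    if (data == NULL)
        return t;
    for (size_t i = 0; i < n; ++i)
        data[i] = 1.0f;
    timespec_get(&t1, TIME_UTC);

    /* Only this loop corresponds to "kernel performance". */
    for (int it = 0; it < iterations; ++it)
        scale_in_place(data, n, 2.0f);
    timespec_get(&t2, TIME_UTC);

    t.setup_s    = elapsed_seconds(t0, t1);
    t.compute_s  = elapsed_seconds(t1, t2);
    t.last_value = data[n - 1];
    free(data);
    return t;
}
```
]]>

&lt;P&gt;Reporting setup_s and compute_s separately, rather than the single figure from the shell's time command, shows how much of a run is setup overhead and how much is the work being benchmarked.&lt;/P&gt;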

&lt;P&gt;Bottom line: you should be measuring kernel performance. What you are measuring is program build time, buffer allocation/deallocation, copying data back and forth, and only a little bit of kernel performance.&lt;/P&gt;</description>
      <pubDate>Thu, 24 Sep 2015 18:36:48 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/Poor-performance-with-opencl-CPU-driver/m-p/1015632#M3133</guid>
      <dc:creator>Robert_I_Intel</dc:creator>
      <dc:date>2015-09-24T18:36:48Z</dc:date>
    </item>
  </channel>
</rss>

