<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic OpenCL overhead on empty kernel in OpenCL* for CPU</title>
    <link>https://community.intel.com/t5/OpenCL-for-CPU/OpenCL-overhead-on-empty-kernel/m-p/1127149#M5646</link>
    <description>&lt;P&gt;Hello,&lt;/P&gt;

&lt;P&gt;I am currently comparing my own implementation of GEMV in OpenCL to the MKL. I am benchmarking very small input sizes, for example 2x64. On my system, the MKL runs in around 0,001ms for this input size and my kernel runs in around 0,003ms.&lt;/P&gt;

&lt;P&gt;When executing a completely empty kernel, I get a runtime of around 0,0025ms. Where does this overhead come from, and why doesn't the MKL seem to have it? I am benchmarking my OpenCL kernel via OpenCL events and the MKL with the dsecnd() function supplied by the MKL.&lt;/P&gt;

&lt;P&gt;Thanks in advance!&lt;/P&gt;</description>
    <pubDate>Tue, 13 Jun 2017 10:54:21 GMT</pubDate>
    <dc:creator>Richard_S_7</dc:creator>
    <dc:date>2017-06-13T10:54:21Z</dc:date>
    <item>
      <title>OpenCL overhead on empty kernel</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/OpenCL-overhead-on-empty-kernel/m-p/1127149#M5646</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;

&lt;P&gt;I am currently comparing my own implementation of GEMV in OpenCL to the MKL. I am benchmarking very small input sizes, for example 2x64. On my system, the MKL runs in around 0,001ms for this input size and my kernel runs in around 0,003ms.&lt;/P&gt;

&lt;P&gt;When executing a completely empty kernel, I get a runtime of around 0,0025ms. Where does this overhead come from, and why doesn't the MKL seem to have it? I am benchmarking my OpenCL kernel via OpenCL events and the MKL with the dsecnd() function supplied by the MKL.&lt;/P&gt;
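&lt;P&gt;To put the quoted timings in proportion, the short Python sketch below works out how much of the GEMV kernel's runtime an empty launch already accounts for. It assumes that the empty-kernel time is a fixed cost that simply adds to the compute time, which is a simplification. The variable and function names are illustrative only, and the decimal commas from the post are written as decimal points.&lt;/P&gt;

```python
# Timings quoted in the post, in milliseconds.
empty_kernel_ms = 0.0025  # completely empty OpenCL kernel
gemv_kernel_ms = 0.003    # the poster's GEMV kernel on a 2x64 input
mkl_ms = 0.001            # MKL on the same input


def overhead_share(empty_ms, total_ms):
    """Fraction of the total runtime already spent by an empty launch.

    Assumption: the launch cost is additive and identical in both cases.
    """
    return empty_ms / total_ms


share = overhead_share(empty_kernel_ms, gemv_kernel_ms)
compute_ms = gemv_kernel_ms - empty_kernel_ms

print(f"launch overhead share of the GEMV kernel: {share:.0%}")
print(f"time left for the actual computation: {compute_ms:.4f} ms")
print(f"MKL time for comparison: {mkl_ms:.4f} ms")
```

&lt;P&gt;Under that assumption, roughly five sixths of the measured kernel time is launch cost rather than arithmetic, which is why such a small input is a hard case for any API with a per-call cost.&lt;/P&gt;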

&lt;P&gt;Thanks in advance!&lt;/P&gt;</description>
      <pubDate>Tue, 13 Jun 2017 10:54:21 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/OpenCL-overhead-on-empty-kernel/m-p/1127149#M5646</guid>
      <dc:creator>Richard_S_7</dc:creator>
      <dc:date>2017-06-13T10:54:21Z</dc:date>
    </item>
    <item>
      <title>OpenCL kernel enqueue and</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/OpenCL-overhead-on-empty-kernel/m-p/1127150#M5647</link>
      <description>&lt;P&gt;OpenCL kernel enqueue and launch always carry some overhead. Ideally the operation would be larger, so that this overhead is a relatively small part of the overall execution time. OpenCL's advantages are convenient threading on the CPU and access to accelerator hardware such as GPUs and FPGAs. OpenCL may be able to help more when scheduling many of these small operations, or with larger input sizes.&lt;/P&gt;</description>
      <pubDate>Tue, 13 Jun 2017 21:33:20 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/OpenCL-overhead-on-empty-kernel/m-p/1127150#M5647</guid>
      <dc:creator>Jeffrey_M_Intel1</dc:creator>
      <dc:date>2017-06-13T21:33:20Z</dc:date>
    </item>
    <item>
      <title>Take into account that MKL's</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/OpenCL-overhead-on-empty-kernel/m-p/1127151#M5648</link>
      <description>Take into account that MKL's API is highly optimized for bigger data sets (matrices, vectors).

MKL also has overheads and a simple classic matrix multiplication algorithm ( triple-for-loop / processing core is less than 10 code lines ) outperforms MKL's &lt;STRONG&gt;sgemm&lt;/STRONG&gt; for matrices up to 2,048x2,048.
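
For readers unfamiliar with the term, the following is a minimal sketch of a classic triple-loop matrix multiplication. It is written in Python purely for brevity and is not the code behind the comparison above, which was presumably compiled C or C++. The function name matmul_naive and the list-of-lists layout are assumptions made for illustration, and no performance claim is attached to it.

```python
def matmul_naive(a, b):
    """Classic triple-loop product of two matrices stored as lists of rows."""
    n = len(a)       # rows of a
    k = len(b)       # columns of a, which must equal rows of b
    m = len(b[0])    # columns of b
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):            # each row of the result
        for j in range(m):        # each column of the result
            acc = 0.0
            for p in range(k):    # dot product of row i and column j
                acc += a[i][p] * b[p][j]
            c[i][j] = acc
    return c


product = matmul_naive([[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]])
print(product)
```

The processing core is the innermost accumulation, so the whole kernel fits in well under 10 lines, and it has no setup cost beyond allocating the result.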

&amp;gt;&amp;gt;...On my system the MKL runs around 0,001ms for this input size and my kernel runs around 0,003ms...

How many times did you execute the test to get these numbers?</description>
      <pubDate>Tue, 20 Jun 2017 18:59:06 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/OpenCL-overhead-on-empty-kernel/m-p/1127151#M5648</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2017-06-20T18:59:06Z</dc:date>
    </item>
    <item>
      <title>Thank you for your feedback.</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/OpenCL-overhead-on-empty-kernel/m-p/1127152#M5649</link>
      <description>&lt;P&gt;Thank you for your feedback.&lt;/P&gt;

&lt;BLOCKQUOTE&gt;Sergey Kostrov wrote:&lt;BR /&gt;

&lt;P&gt;MKL also has overheads and a simple classic matrix multiplication algorithm ( triple-for-loop / processing core is less than 10 code lines ) outperforms MKL's &lt;STRONG&gt;sgemm&lt;/STRONG&gt; for matrices up to 2,048x2,048.&lt;/P&gt;

&lt;/BLOCKQUOTE&gt;

&lt;P&gt;Could you share the classic matrix multiplication algorithm you described? I would highly appreciate it!&lt;/P&gt;

&lt;BLOCKQUOTE&gt;Sergey Kostrov wrote:&lt;BR /&gt;

&lt;P&gt;How many times did you execute the test to get these numbers?&lt;/P&gt;

&lt;/BLOCKQUOTE&gt;
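
&lt;P&gt;For reference, a measurement of this kind is commonly structured as a few untimed warm-up runs followed by an average over several timed runs. The Python sketch below shows that pattern only. It is not the code used for the numbers in this thread, and the names benchmark, empty, and tiny_gemv are hypothetical stand-ins rather than OpenCL or MKL calls.&lt;/P&gt;

```python
import time


def benchmark(fn, warmup=3, runs=5):
    """Return the mean wall-clock time of fn() in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up runs are executed but not timed
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        total += time.perf_counter() - start
    return total / runs * 1000.0


def empty():
    """Does nothing, so its time is call and timer overhead only."""
    pass


def tiny_gemv():
    """Hypothetical stand-in for a 2x64 matrix-vector product."""
    matrix = [[1.0] * 64 for _ in range(2)]
    vector = [1.0] * 64
    return [sum(x * y for x, y in zip(row, vector)) for row in matrix]


print(f"empty call: {benchmark(empty):.6f} ms")
print(f"tiny gemv:  {benchmark(tiny_gemv):.6f} ms")
```

&lt;P&gt;With only a handful of timed runs at the microsecond scale, timer resolution and run-to-run noise can be a noticeable part of the result, so a larger run count may give steadier averages.&lt;/P&gt;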

&lt;P&gt;I made 3 warm-up runs and calculated the average runtime of the following 5 runs. The profiling was done using the C++ chrono library and, alternatively, the MKL function dsecnd(). Both profiling methods produced the same results.&lt;/P&gt;</description>
      <pubDate>Thu, 22 Jun 2017 11:08:20 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/OpenCL-overhead-on-empty-kernel/m-p/1127152#M5649</guid>
      <dc:creator>Richard_S_7</dc:creator>
      <dc:date>2017-06-22T11:08:20Z</dc:date>
    </item>
  </channel>
</rss>

