<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Topic: enqueueMapBuffer for read-only buffer in GPU Compute Software</title>
    <link>https://community.intel.com/t5/GPU-Compute-Software/enqueueMapBuffer-for-read-only-buffer/m-p/1398865#M559</link>
    <description>&lt;P&gt;Hi,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Here is my complete code:&lt;/P&gt;
&lt;P&gt;&lt;A href="https://gist.github.com/lolxdfly/43354a794f6f49cd0bcee1605027c5e8" target="_self"&gt;https://gist.github.com/lolxdfly/43354a794f6f49cd0bcee1605027c5e8&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;The lineitem.tbl is part of the TPC-H benchmark and is unfortunately too big to be uploaded.&lt;/P&gt;
&lt;P&gt;In line 72 of gpudb.cpp you will find the alignment of the size of the input buffer. Right now it is commented out, which makes the kernel run faster. If I uncomment that line, the buffer meets the requirements of a zero-copy buffer, but the kernel runs slower.&lt;/P&gt;</description>
    <pubDate>Fri, 08 Jul 2022 08:45:15 GMT</pubDate>
    <dc:creator>lolxdfly</dc:creator>
    <dc:date>2022-07-08T08:45:15Z</dc:date>
    <item>
      <title>enqueueMapBuffer for read-only buffer</title>
      <link>https://community.intel.com/t5/GPU-Compute-Software/enqueueMapBuffer-for-read-only-buffer/m-p/1385733#M468</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I have a follow-up question to &lt;A href="https://community.intel.com/t5/GPU-Compute-Software/global-work-size-and-power-of-two/m-p/1381715" target="_blank" rel="noopener"&gt;my old question here&lt;/A&gt;.&lt;/P&gt;
&lt;P&gt;In the answer, an enqueueMapBuffer call was suggested for both buffers: the read-only input buffer and the output buffer. Is it really necessary to map the input buffer after the kernel execution has finished? The input is read-only memory, and I do not care whether it stays coherent with the CPU, because the GPU does not modify it anyway. I wonder if skipping the enqueueMapBuffer for read-only buffers could lead to the GPU not reading the host's data.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;This question actually belongs to a more complicated issue I have. In that case, the enqueueMapBuffer takes a lot of time. Allocating the buffers with 4K address alignment and 64-byte size alignment makes OpenCL do a "zero-copy", which is a lot faster. However, it is even faster when I skip the enqueueMapBuffer and make sure the buffer is &lt;STRONG&gt;not&lt;/STRONG&gt; aligned. This is really strange, because as far as I know, a "zero-copy" buffer should already be the fastest way, and making the memory unaligned should result in OpenCL copying the buffer's data. This should be slower than not copying any data at all.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;PS: I have tried to create a minimal code example, but it is not that easy. The behavior is not reproducible with the simple copy-kernel from my last question, so it seems the behavior also depends on the kernel itself.&lt;/P&gt;
&lt;P&gt;Edit: I was able to create some kind of minimal code example: &lt;A href="https://gist.github.com/lolxdfly/e6209ba776680fa6acc13ce18b7e36d5" target="_self"&gt;https://gist.github.com/lolxdfly/e6209ba776680fa6acc13ce18b7e36d5&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;The two important lines are lines 36 (64-byte size alignment) and 141 (enqueueMapBuffer).&lt;/P&gt;
&lt;P&gt;Without alignment, but with MapBuffer (line 36 inactive and line 141 active) the GPU time is 17320455.&lt;/P&gt;
&lt;P&gt;With alignment and with MapBuffer (line 36 active and line 141 active) the GPU time is 2264301.&lt;/P&gt;
&lt;P&gt;With alignment, but without MapBuffer (line 36 active and line 141 inactive) the GPU time is 2200300.&lt;/P&gt;
&lt;P&gt;Without alignment and without MapBuffer (line 36 inactive and line 141 inactive) the GPU time is 1971559.&lt;/P&gt;
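&lt;P&gt;For reference, the alignment rules being toggled in the scenarios above can be sketched as a small helper (hypothetical code, not part of the gist; it assumes the 4096-byte address alignment and 64-byte size alignment described in this thread):&lt;/P&gt;

```python
# Sketch of the zero-copy alignment requirements discussed in this thread:
# the host pointer must be 4096-byte (page) aligned and the buffer size
# padded up to a 64-byte multiple. Names here are illustrative only.

ADDRESS_ALIGN = 4096  # required alignment of the host pointer, in bytes
SIZE_ALIGN = 64       # buffer size must be a multiple of this, in bytes

def round_up(value, multiple):
    """Round value up to the next multiple of `multiple`."""
    return ((value + multiple - 1) // multiple) * multiple

def zero_copy_size(nbytes):
    # the "line 36" toggle in the gist pads the size like this
    return round_up(nbytes, SIZE_ALIGN)
```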
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P class="sub_section_element_selectors"&gt;&lt;SPAN class="sub_section_element_selectors"&gt;Intel Corporation TigerLake-H GT1 [UHD Graphics] (rev 01)&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class="sub_section_element_selectors"&gt;&lt;SPAN class="sub_section_element_selectors"&gt;Ubuntu 22.04 with 5.17.0-1004-oem kernel&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class="sub_section_element_selectors"&gt;&lt;SPAN class="sub_section_element_selectors"&gt;intel-opencl-icd version: 22.14.22890-&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 20 May 2022 08:55:05 GMT</pubDate>
      <guid>https://community.intel.com/t5/GPU-Compute-Software/enqueueMapBuffer-for-read-only-buffer/m-p/1385733#M468</guid>
      <dc:creator>lolxdfly</dc:creator>
      <dc:date>2022-05-20T08:55:05Z</dc:date>
    </item>
    <item>
      <title>Re: enqueueMapBuffer for read-only buffer</title>
      <link>https://community.intel.com/t5/GPU-Compute-Software/enqueueMapBuffer-for-read-only-buffer/m-p/1386939#M479</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Thanks for reaching out to us.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;According to the letter of the spec, it is not valid to read from a host pointer used to create a USE_HOST_PTR buffer.&amp;nbsp;The OpenCL spec isn’t as explicit about this as I would like, but the SYCL specification is:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;A href="https://www.khronos.org/registry/SYCL/specs/sycl-2020/html/sycl-2020.html#_default_behavior" target="_blank" rel="noopener"&gt;https://www.khronos.org/registry/SYCL/specs/sycl-2020/html/sycl-2020.html#_default_behavior&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;“When using an SYCL buffer, the ownership of the pointer passed to the constructor of the class is, by default, passed to &lt;A href="https://www.khronos.org/registry/SYCL/specs/sycl-2020/html/sycl-2020.html#sycl-runtime" target="_self"&gt;sycl-runtime&lt;/A&gt;, and that pointer cannot be used on the host side until the buffer or image is destroyed. An SYCL application can access the contents of the memory managed by an SYCL buffer by using a host_accessor as defined in&amp;nbsp;&lt;A href="https://www.khronos.org/registry/SYCL/specs/sycl-2020/html/sycl-2020.html#subsec:accessors" target="_self"&gt;section 4.7.6&lt;/A&gt;&amp;nbsp; ”&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;That being said, in practice, the odds of a problem occurring if neither the CPU nor GPU are writing to the host pointer or the buffer are small, so things will _probably_ work without mapping.&amp;nbsp;I wouldn’t recommend doing this, and if it happened to break I’d call it an app bug, but I could certainly understand if an application does it in the cases where it is known to work.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Some other options to consider:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;- Host &lt;A href="https://www.khronos.org/registry/OpenCL/extensions/intel/cl_intel_unified_shared_memory.html" target="_self"&gt;USM&lt;/A&gt; can be read from the host and the device without mapping.&lt;/P&gt;
&lt;P&gt;- Creating the buffer with &lt;A href="https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/OpenCL_API.html#CL_MEM_COPY_HOST_PTR" target="_self"&gt;COPY_HOST_PTR&lt;/A&gt; will pay the cost of a copy, but will likely perform better on discrete GPUs, and will allow simultaneous reading from both the buffer and the host pointer.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Thanks &amp;amp; Regards,&lt;/P&gt;
&lt;P&gt;Noorjahan.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 24 May 2022 10:28:41 GMT</pubDate>
      <guid>https://community.intel.com/t5/GPU-Compute-Software/enqueueMapBuffer-for-read-only-buffer/m-p/1386939#M479</guid>
      <dc:creator>NoorjahanSk_Intel</dc:creator>
      <dc:date>2022-05-24T10:28:41Z</dc:date>
    </item>
    <item>
      <title>Re: enqueueMapBuffer for read-only buffer</title>
      <link>https://community.intel.com/t5/GPU-Compute-Software/enqueueMapBuffer-for-read-only-buffer/m-p/1388580#M489</link>
      <description>&lt;P&gt;Thank for the answer!&lt;/P&gt;
&lt;P&gt;This answers most of question. However, it is still unclear to my why my example code runs faster with a read-only non-zero-copy Buffer without mapping than with a read-only zero-copy buffer. The non-zero-copy Buffer should be copied and therefore be slower than the zero-copy Buffer, even if there is no call to map the memory back to the host.&lt;/P&gt;</description>
      <pubDate>Mon, 30 May 2022 10:22:48 GMT</pubDate>
      <guid>https://community.intel.com/t5/GPU-Compute-Software/enqueueMapBuffer-for-read-only-buffer/m-p/1388580#M489</guid>
      <dc:creator>lolxdfly</dc:creator>
      <dc:date>2022-05-30T10:22:48Z</dc:date>
    </item>
    <item>
      <title>Re: enqueueMapBuffer for read-only buffer</title>
      <link>https://community.intel.com/t5/GPU-Compute-Software/enqueueMapBuffer-for-read-only-buffer/m-p/1391064#M514</link>
      <description>&lt;P&gt;Hi, is the question why this scenario (I think this is "read-only non-zero-copy"?):&lt;/P&gt;
&lt;P&gt;&amp;gt;&amp;nbsp;&lt;SPAN&gt;Without alignment and without MapBuffer (line 36 inactive and line 141 inactive) the GPU time is 1971559.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;performs differently than this scenario (I think this is "read-only zero-copy"?)?&lt;/P&gt;
&lt;P&gt;&amp;gt;&amp;nbsp;With alignment, but without MapBuffer (line 36 active and line 141 inactive) the GPU time is 2200300.&lt;/P&gt;
&lt;P&gt;If so, I've been trying to reproduce these results, so far without much luck.&amp;nbsp; The timings I am seeing are a little noisy, but it doesn't seem that either one of these two scenarios is consistently better than the other.&amp;nbsp; I am running on a different GPU so it's admittedly not an apples-to-apples comparison.&lt;/P&gt;
&lt;P&gt;Are you seeing consistently different results?&amp;nbsp; If so, I'd be curious to see if it's a host API call or if the kernel execution (or perhaps the other call to clEnqueueMapBuffer) is taking longer in the slow case.&lt;/P&gt;
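&lt;P&gt;For example, the reproducer can be run under the OpenCL Intercept Layer's cliloader with the two timing controls enabled through environment variables ("./reproducer" is a placeholder for the binary built from the gist):&lt;/P&gt;

```shell
# Intercept Layer controls are set via CLI_-prefixed environment variables;
# cliloader preloads the intercept library around the target application.
CLI_HostPerformanceTiming=1 CLI_DevicePerformanceTiming=1 cliloader ./reproducer
```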
&lt;P&gt;Note, you can use the&amp;nbsp;&lt;A href="https://github.com/intel/opencl-intercept-layer" target="_blank" rel="noopener"&gt;OpenCL Intercept Layer&lt;/A&gt;&amp;nbsp;with HostPerformanceTiming and DevicePerformanceTiming to figure this out without needing to instrument the reproducer.&lt;/P&gt;</description>
      <pubDate>Wed, 08 Jun 2022 20:49:11 GMT</pubDate>
      <guid>https://community.intel.com/t5/GPU-Compute-Software/enqueueMapBuffer-for-read-only-buffer/m-p/1391064#M514</guid>
      <dc:creator>Ben_A_Intel</dc:creator>
      <dc:date>2022-06-08T20:49:11Z</dc:date>
    </item>
    <item>
      <title>Re: enqueueMapBuffer for read-only buffer</title>
      <link>https://community.intel.com/t5/GPU-Compute-Software/enqueueMapBuffer-for-read-only-buffer/m-p/1393288#M532</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;We haven't heard back from you. Could you please provide an update on your issue?&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Thanks &amp;amp; Regards,&lt;/P&gt;&lt;P&gt;Noorjahan.&lt;/P&gt;&lt;BR /&gt;</description>
      <pubDate>Fri, 17 Jun 2022 04:45:57 GMT</pubDate>
      <guid>https://community.intel.com/t5/GPU-Compute-Software/enqueueMapBuffer-for-read-only-buffer/m-p/1393288#M532</guid>
      <dc:creator>NoorjahanSk_Intel</dc:creator>
      <dc:date>2022-06-17T04:45:57Z</dc:date>
    </item>
    <item>
      <title>Re: enqueueMapBuffer for read-only buffer</title>
      <link>https://community.intel.com/t5/GPU-Compute-Software/enqueueMapBuffer-for-read-only-buffer/m-p/1393383#M535</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;
&lt;P&gt;sorry, I had a lot of other stuff going on recently.&lt;/P&gt;
&lt;P&gt;Yes, Ben is right. My question is why the scenario with the non-zero-copy buffer is faster than the scenario with the zero-copy buffer.&lt;/P&gt;
&lt;P&gt;The issue is hard to see with the example code I provided. It becomes clearer and more consistent in one of my other applications. I used the OpenCL Intercept Layer to record some numbers. You can find them in the attachments.&lt;/P&gt;
&lt;P&gt;I am interested in the time from clEnqueueNDRangeKernel to the end of clFinish, because this is the time of the GPU task without any setup. In the non-zero-copy scenario this takes about 2 ms, but in the zero-copy scenario it takes over 6 ms. According to the trace, the execution of the kernel itself is slower in the zero-copy case: 1.469 ms vs. 5.805 ms.&lt;/P&gt;
&lt;P&gt;I also included a trace with the MapBuffer call. It makes sense that this one, at about 19 ms, is the slowest.&lt;/P&gt;</description>
      <pubDate>Fri, 17 Jun 2022 10:45:07 GMT</pubDate>
      <guid>https://community.intel.com/t5/GPU-Compute-Software/enqueueMapBuffer-for-read-only-buffer/m-p/1393383#M535</guid>
      <dc:creator>lolxdfly</dc:creator>
      <dc:date>2022-06-17T10:45:07Z</dc:date>
    </item>
    <item>
      <title>Re: enqueueMapBuffer for read-only buffer</title>
      <link>https://community.intel.com/t5/GPU-Compute-Software/enqueueMapBuffer-for-read-only-buffer/m-p/1393517#M537</link>
      <description>&lt;P&gt;Would it be possible to run the kernel for a few more iterations to see if the difference is due to run-to-run variation or perhaps if there is only a difference on the first iteration?&amp;nbsp; One thought is that the copy in the non-zero-copy case means that the GPU is already running at a higher frequency when it executes its kernel, versus the zero-copy case where it may take some time for the GPU to ramp up to full frequency.&lt;/P&gt;
&lt;P&gt;Note: I'll be on vacation for a while but other Intel folks will monitor this thread while I am gone.&amp;nbsp; Thanks!&lt;/P&gt;</description>
      <pubDate>Fri, 17 Jun 2022 21:59:06 GMT</pubDate>
      <guid>https://community.intel.com/t5/GPU-Compute-Software/enqueueMapBuffer-for-read-only-buffer/m-p/1393517#M537</guid>
      <dc:creator>Ben_A_Intel</dc:creator>
      <dc:date>2022-06-17T21:59:06Z</dc:date>
    </item>
    <item>
      <title>Re: enqueueMapBuffer for read-only buffer</title>
      <link>https://community.intel.com/t5/GPU-Compute-Software/enqueueMapBuffer-for-read-only-buffer/m-p/1393807#M540</link>
      <description>&lt;P&gt;I did more iterations. Without tracing the times are very consistent and the non-zero-copy version is always faster than the zero-copy version. If I use the cliloader to trace the application the measured times become a bit slower and they have more fluctuations, but the outcome is still the same.&lt;/P&gt;
&lt;P&gt;I also measured the times when I execute a completely different kernel with different data right before the kernel where I do my measurements. I wanted to check what this means for the executions times of the two scenarios. You can find the result traces in the attachments. The times are much more consistent now, even with tracing enabled. It had no impact on the non-zero-copy version, but the zero-copy version became faster. The execution time was reduced from about 5.805 ms to about 4.237 ms. This means that the already ramped up frequency in the non-zero-copy explains some of the time differences, but its not the complete story. There is still the gap of about 2.768 ms between the two versions.&lt;/P&gt;</description>
      <pubDate>Mon, 20 Jun 2022 09:03:50 GMT</pubDate>
      <guid>https://community.intel.com/t5/GPU-Compute-Software/enqueueMapBuffer-for-read-only-buffer/m-p/1393807#M540</guid>
      <dc:creator>lolxdfly</dc:creator>
      <dc:date>2022-06-20T09:03:50Z</dc:date>
    </item>
    <item>
      <title>Re: enqueueMapBuffer for read-only buffer</title>
      <link>https://community.intel.com/t5/GPU-Compute-Software/enqueueMapBuffer-for-read-only-buffer/m-p/1396410#M546</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Could you please provide us with the modified code along with the steps to reproduce the issue to investigate more from our end?&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Thanks &amp;amp; Regards,&lt;/P&gt;&lt;P&gt;Noorjahan&lt;/P&gt;&lt;BR /&gt;</description>
      <pubDate>Wed, 29 Jun 2022 10:26:29 GMT</pubDate>
      <guid>https://community.intel.com/t5/GPU-Compute-Software/enqueueMapBuffer-for-read-only-buffer/m-p/1396410#M546</guid>
      <dc:creator>NoorjahanSk_Intel</dc:creator>
      <dc:date>2022-06-29T10:26:29Z</dc:date>
    </item>
    <item>
      <title>Re: enqueueMapBuffer for read-only buffer</title>
      <link>https://community.intel.com/t5/GPU-Compute-Software/enqueueMapBuffer-for-read-only-buffer/m-p/1398865#M559</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Here is my complete code:&lt;/P&gt;
&lt;P&gt;&lt;A href="https://gist.github.com/lolxdfly/43354a794f6f49cd0bcee1605027c5e8" target="_self"&gt;https://gist.github.com/lolxdfly/43354a794f6f49cd0bcee1605027c5e8&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;The lineitem.tbl is part of the TPC-H benchmark and is unfortunately too big to be uploaded.&lt;/P&gt;
&lt;P&gt;In line 72 of gpudb.cpp you will find the alignment of the size of the input buffer. Right now it is commented out, which makes the kernel run faster. If I uncomment that line, the buffer meets the requirements of a zero-copy buffer, but the kernel runs slower.&lt;/P&gt;</description>
      <pubDate>Fri, 08 Jul 2022 08:45:15 GMT</pubDate>
      <guid>https://community.intel.com/t5/GPU-Compute-Software/enqueueMapBuffer-for-read-only-buffer/m-p/1398865#M559</guid>
      <dc:creator>lolxdfly</dc:creator>
      <dc:date>2022-07-08T08:45:15Z</dc:date>
    </item>
    <item>
      <title>Re: enqueueMapBuffer for read-only buffer</title>
      <link>https://community.intel.com/t5/GPU-Compute-Software/enqueueMapBuffer-for-read-only-buffer/m-p/1398895#M560</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Thanks for providing the details.&lt;/P&gt;&lt;P&gt;We are working on your issue. We will get back to you soon.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Thanks &amp;amp; Regards,&lt;/P&gt;&lt;P&gt;Noorjahan&lt;/P&gt;&lt;BR /&gt;</description>
      <pubDate>Fri, 08 Jul 2022 10:40:05 GMT</pubDate>
      <guid>https://community.intel.com/t5/GPU-Compute-Software/enqueueMapBuffer-for-read-only-buffer/m-p/1398895#M560</guid>
      <dc:creator>NoorjahanSk_Intel</dc:creator>
      <dc:date>2022-07-08T10:40:05Z</dc:date>
    </item>
  </channel>
</rss>

