<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Could kernel executing and buffer writing be parallel? in GPU Compute Software</title>
    <link>https://community.intel.com/t5/GPU-Compute-Software/Could-kernel-executing-and-buffer-writing-be-parallel/m-p/1658676#M1709</link>
    <description>&lt;P&gt;Hello, dear Intel team,&lt;/P&gt;&lt;P&gt;My application has 30000 arrays in host memory. I want to transfer them one at a time to GPU global memory and run a simple calculation on each array.&lt;/P&gt;&lt;P&gt;I'm using clEnqueueWriteBuffer for the data transfer and clEnqueueNDRangeKernel for the calculation.&lt;/P&gt;&lt;P&gt;I want to hide the data transfer latency. The calculation takes 0.008 s per array and the data transfer takes 0.006 s per array. If the two could run in parallel, the total time would be the same as for the calculation alone.&lt;/P&gt;&lt;P&gt;My code looks like this:&lt;/P&gt;&lt;P&gt;1. Create the queue with the out-of-order option:&lt;/P&gt;&lt;P&gt;queue = clCreateCommandQueue(context, device, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &amp;amp;err);&lt;/P&gt;&lt;P&gt;2. Transfer the first array and have the write generate an event:&lt;/P&gt;&lt;P&gt;err = clEnqueueWriteBuffer(queue, srcA, CL_TRUE, 0, M*N*sizeof(*A), A, 0, NULL, &amp;amp;transfer_event);&lt;/P&gt;&lt;P&gt;3. Use a loop to overlap the two operations: launch the calculation for the current array and transfer the next array at the same time. Wait for the computation to finish before the next iteration, since computing takes longer than transferring.&lt;/P&gt;&lt;P&gt;for (i = 0; i &amp;lt; 30000; i++)&lt;/P&gt;&lt;P&gt;{&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; err = clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local, 1, &amp;amp;transfer_event, &amp;amp;compute_event);&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; err = clEnqueueWriteBuffer(queue, srcA, CL_TRUE, 0, M*N*sizeof(*A), A, 0, NULL, &amp;amp;transfer_event);&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; err = clWaitForEvents(1, &amp;amp;compute_event);&lt;/P&gt;&lt;P&gt;}&lt;/P&gt;&lt;P&gt;With this code, I am trying to hide the buffer write latency behind kernel execution. If the two can run in parallel, there should be no performance drop at all.&lt;/P&gt;&lt;P&gt;But when I test it, the total time per array is 0.015 s (0.008 s for kernel execution, 0.006 s for the buffer write, and 0.001 s of unaccounted overhead).&lt;/P&gt;&lt;P&gt;My question is: why isn't the buffer write hidden behind kernel execution, and how can I achieve that?&lt;/P&gt;&lt;P&gt;System info:&lt;/P&gt;&lt;P&gt;OS: Ubuntu 22.04&lt;/P&gt;&lt;P&gt;GPU: Intel UHD Graphics 730&lt;/P&gt;&lt;P&gt;Bandwidth for buffer transfer from host memory to GPU memory: 8 GB/s (reported by clinfo)&lt;/P&gt;&lt;P&gt;By the way, I don't want to use the zero-copy method: most GPUs have to transfer data between host memory and GPU memory over PCIe 3.0/4.0 with limited bandwidth, but they can still achieve good performance if the data transfer can be hidden.&lt;/P&gt;&lt;P&gt;Thanks a lot!&lt;/P&gt;</description>
    <pubDate>Tue, 21 Jan 2025 02:30:56 GMT</pubDate>
    <dc:creator>Scout</dc:creator>
    <dc:date>2025-01-21T02:30:56Z</dc:date>
    <item>
      <title>Could kernel executing and buffer writing be parallel?</title>
      <link>https://community.intel.com/t5/GPU-Compute-Software/Could-kernel-executing-and-buffer-writing-be-parallel/m-p/1658676#M1709</link>
      <description>&lt;P&gt;Hello, dear Intel team,&lt;/P&gt;&lt;P&gt;My application has 30000 arrays in host memory. I want to transfer them one at a time to GPU global memory and run a simple calculation on each array.&lt;/P&gt;&lt;P&gt;I'm using clEnqueueWriteBuffer for the data transfer and clEnqueueNDRangeKernel for the calculation.&lt;/P&gt;&lt;P&gt;I want to hide the data transfer latency. The calculation takes 0.008 s per array and the data transfer takes 0.006 s per array. If the two could run in parallel, the total time would be the same as for the calculation alone.&lt;/P&gt;&lt;P&gt;My code looks like this:&lt;/P&gt;&lt;P&gt;1. Create the queue with the out-of-order option:&lt;/P&gt;&lt;P&gt;queue = clCreateCommandQueue(context, device, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &amp;amp;err);&lt;/P&gt;&lt;P&gt;2. Transfer the first array and have the write generate an event:&lt;/P&gt;&lt;P&gt;err = clEnqueueWriteBuffer(queue, srcA, CL_TRUE, 0, M*N*sizeof(*A), A, 0, NULL, &amp;amp;transfer_event);&lt;/P&gt;&lt;P&gt;3. Use a loop to overlap the two operations: launch the calculation for the current array and transfer the next array at the same time. Wait for the computation to finish before the next iteration, since computing takes longer than transferring.&lt;/P&gt;&lt;P&gt;for (i = 0; i &amp;lt; 30000; i++)&lt;/P&gt;&lt;P&gt;{&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; err = clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local, 1, &amp;amp;transfer_event, &amp;amp;compute_event);&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; err = clEnqueueWriteBuffer(queue, srcA, CL_TRUE, 0, M*N*sizeof(*A), A, 0, NULL, &amp;amp;transfer_event);&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; err = clWaitForEvents(1, &amp;amp;compute_event);&lt;/P&gt;&lt;P&gt;}&lt;/P&gt;&lt;P&gt;With this code, I am trying to hide the buffer write latency behind kernel execution. If the two can run in parallel, there should be no performance drop at all.&lt;/P&gt;&lt;P&gt;But when I test it, the total time per array is 0.015 s (0.008 s for kernel execution, 0.006 s for the buffer write, and 0.001 s of unaccounted overhead).&lt;/P&gt;&lt;P&gt;My question is: why isn't the buffer write hidden behind kernel execution, and how can I achieve that?&lt;/P&gt;&lt;P&gt;System info:&lt;/P&gt;&lt;P&gt;OS: Ubuntu 22.04&lt;/P&gt;&lt;P&gt;GPU: Intel UHD Graphics 730&lt;/P&gt;&lt;P&gt;Bandwidth for buffer transfer from host memory to GPU memory: 8 GB/s (reported by clinfo)&lt;/P&gt;&lt;P&gt;By the way, I don't want to use the zero-copy method: most GPUs have to transfer data between host memory and GPU memory over PCIe 3.0/4.0 with limited bandwidth, but they can still achieve good performance if the data transfer can be hidden.&lt;/P&gt;&lt;P&gt;Thanks a lot!&lt;/P&gt;</description>
      <pubDate>Tue, 21 Jan 2025 02:30:56 GMT</pubDate>
      <guid>https://community.intel.com/t5/GPU-Compute-Software/Could-kernel-executing-and-buffer-writing-be-parallel/m-p/1658676#M1709</guid>
      <dc:creator>Scout</dc:creator>
      <dc:date>2025-01-21T02:30:56Z</dc:date>
    </item>
  </channel>
</rss>

