<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Questions about data copy when using q.memcpy() in Intel® oneAPI DPC++/C++ Compiler</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Questions-about-data-copy-when-using-q-memcpy/m-p/1682367#M4413</link>
    <description>&lt;P&gt;Thank you for the response.&lt;/P&gt;&lt;P&gt;The example is helpful. Now I understand how to overlap data transfer with computation. The question that arises from this is whether the operation that releases device memory can be given a dependency.&lt;/P&gt;&lt;P&gt;The following is the code from the provided&amp;nbsp;&lt;A href="https://www.intel.com/content/www/us/en/docs/oneapi/optimization-guide-gpu/2025-0/asynchronous-and-overlapping-data-transfers.html" target="_self"&gt;example&lt;/A&gt;. It has to allocate all the device memory it needs up front and frees it only after all the kernels are done. This might exhaust the memory on devices with limited memory.&lt;/P&gt;&lt;LI-CODE lang="cpp"&gt;  for (int it = 0; it &amp;lt; iter; it++) {
    for (int c = 0; c &amp;lt; num_chunks; c++) {
      auto add_one = [=](auto id) {
        for (int i = 0; i &amp;lt; KERNEL_ITERS; i++)
          device_data[c][id] += 1.0;
      };
      // Copy-in not dependent on previous event
      auto copy_in =
          q.memcpy(device_data[c], host_data[c], sizeof(float) * chunk_size);
      // Compute waits for copy_in
      auto compute = q.parallel_for(chunk_size, copy_in, add_one);
      auto cg = [=](auto &amp;amp;h) {
        h.depends_on(compute);
        h.memcpy(host_data[c], device_data[c], sizeof(float) * chunk_size);
      };
      // Copy out waits for compute
      auto copy_out = q.submit(cg);
      // Q: Can the user manually free the device memory here once the copy_out operation is done?
    }

    q.wait();
  }&lt;/LI-CODE&gt;&lt;P&gt;So, manually releasing device memory is one possible solution. Does SYCL provide any method to wait for the previous kernel to finish and then release the device memory?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Best,&lt;/P&gt;&lt;P&gt;TCK&lt;/P&gt;</description>
    <pubDate>Fri, 11 Apr 2025 16:17:08 GMT</pubDate>
    <dc:creator>TCK</dc:creator>
    <dc:date>2025-04-11T16:17:08Z</dc:date>
    <item>
      <title>Questions about data copy when using q.memcpy()</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Questions-about-data-copy-when-using-q-memcpy/m-p/1681774#M4401</link>
      <description>&lt;P&gt;I want to do calculations on the GPU. The code structure is as follows: copy the input data to GPU memory, do the calculation, then copy the results from the GPU back to the CPU.&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="cpp"&gt;for(int i=0;i&amp;lt;100;i++) {
    int* data_dev = malloc_device&amp;lt;int&amp;gt;(data_size, q);
    int* result_dev = malloc_device&amp;lt;int&amp;gt;(result_size, q);
    q.memcpy(data_dev, data_host, sizeof(int) * data_size).wait();// data copy to gpu
    q.submit([&amp;amp;](handler&amp;amp; h) {
        // kernel_1
    });
    q.wait();
    q.submit([&amp;amp;](handler&amp;amp; h) {
        // kernel_2, store results to result_dev 
    });
    q.wait();
    q.memcpy(result_host, result_dev , sizeof(int) * result_size).wait();// copy result back
    free(data_dev,q);
    free(result_dev,q);
}&lt;/LI-CODE&gt;&lt;P&gt;Ideally, I want the for loop to submit all of the kernels to the GPU and then wait for the calculations to finish after the loop. However, the wait() on q.memcpy() blocks the for loop (this is my understanding).&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;My question is:&lt;/P&gt;&lt;P&gt;Is there any way to do this without blocking the loop, for example by using sycl::event to make two kernels dependent on each other?&lt;/P&gt;&lt;P&gt;(P.S. The reason I don't use accessors is that the input data needs to be used in both kernel_1 and kernel_2, and using accessors results in two data copies.)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Best,&amp;nbsp;&lt;/P&gt;&lt;P&gt;TCK&lt;/P&gt;</description>
      <pubDate>Wed, 09 Apr 2025 16:41:04 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Questions-about-data-copy-when-using-q-memcpy/m-p/1681774#M4401</guid>
      <dc:creator>TCK</dc:creator>
      <dc:date>2025-04-09T16:41:04Z</dc:date>
    </item>
    <item>
      <title>Re: Questions about data copy when using q.memcpy()</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Questions-about-data-copy-when-using-q-memcpy/m-p/1682160#M4407</link>
      <description>&lt;P&gt;Using SYCL events and adding dependencies accordingly can help overlap data transfers with computation on the device. You may check an example of this &lt;A href="https://www.intel.com/content/www/us/en/docs/oneapi/optimization-guide-gpu/2025-0/asynchronous-and-overlapping-data-transfers.html" target="_self"&gt;here&lt;/A&gt;.&lt;/P&gt;</description>
      <pubDate>Thu, 10 Apr 2025 23:12:57 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Questions-about-data-copy-when-using-q-memcpy/m-p/1682160#M4407</guid>
      <dc:creator>Sravani_K_Intel</dc:creator>
      <dc:date>2025-04-10T23:12:57Z</dc:date>
    </item>
    <item>
      <title>Re: Questions about data copy when using q.memcpy()</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Questions-about-data-copy-when-using-q-memcpy/m-p/1682367#M4413</link>
      <description>&lt;P&gt;Thank you for the response.&lt;/P&gt;&lt;P&gt;The example is helpful. Now I understand how to overlap data transfer with computation. The question that arises from this is whether the operation that releases device memory can be given a dependency.&lt;/P&gt;&lt;P&gt;The following is the code from the provided&amp;nbsp;&lt;A href="https://www.intel.com/content/www/us/en/docs/oneapi/optimization-guide-gpu/2025-0/asynchronous-and-overlapping-data-transfers.html" target="_self"&gt;example&lt;/A&gt;. It has to allocate all the device memory it needs up front and frees it only after all the kernels are done. This might exhaust the memory on devices with limited memory.&lt;/P&gt;&lt;LI-CODE lang="cpp"&gt;  for (int it = 0; it &amp;lt; iter; it++) {
    for (int c = 0; c &amp;lt; num_chunks; c++) {
      auto add_one = [=](auto id) {
        for (int i = 0; i &amp;lt; KERNEL_ITERS; i++)
          device_data[c][id] += 1.0;
      };
      // Copy-in not dependent on previous event
      auto copy_in =
          q.memcpy(device_data[c], host_data[c], sizeof(float) * chunk_size);
      // Compute waits for copy_in
      auto compute = q.parallel_for(chunk_size, copy_in, add_one);
      auto cg = [=](auto &amp;amp;h) {
        h.depends_on(compute);
        h.memcpy(host_data[c], device_data[c], sizeof(float) * chunk_size);
      };
      // Copy out waits for compute
      auto copy_out = q.submit(cg);
      // Q: Can the user manually free the device memory here once the copy_out operation is done?
    }

    q.wait();
  }&lt;/LI-CODE&gt;&lt;P&gt;So, manually releasing device memory is one possible solution. Does SYCL provide any method to wait for the previous kernel to finish and then release the device memory?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Best,&lt;/P&gt;&lt;P&gt;TCK&lt;/P&gt;</description>
      <pubDate>Fri, 11 Apr 2025 16:17:08 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Questions-about-data-copy-when-using-q-memcpy/m-p/1682367#M4413</guid>
      <dc:creator>TCK</dc:creator>
      <dc:date>2025-04-11T16:17:08Z</dc:date>
    </item>
  </channel>
</rss>

