<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Overlap copy and compute on two queues in Intel® oneAPI DPC++/C++ Compiler</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Overlap-copy-and-compute-on-two-queues/m-p/1494980#M3134</link>
    <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Thanks for sharing the DPC++ code with us.&lt;/P&gt;&lt;P&gt;It would be greatly helpful if you share the CUDA code with us so that we can reproduce the issue on our end.&lt;/P&gt;&lt;P&gt;And also could you please let us know how you are observing the timeline?&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Thanks and Regards,&lt;/P&gt;&lt;P&gt;Pendyala Sesha Srinivas&lt;/P&gt;&lt;BR /&gt;</description>
    <pubDate>Mon, 12 Jun 2023 17:44:35 GMT</pubDate>
    <dc:creator>SeshaP_Intel</dc:creator>
    <dc:date>2023-06-12T17:44:35Z</dc:date>
    <item>
      <title>Overlap copy and compute on two queues</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Overlap-copy-and-compute-on-two-queues/m-p/1491260#M3113</link>
      <description>&lt;P&gt;Hello all,&lt;/P&gt;&lt;P&gt;I am trying to port my CUDA/HIP code to oneAPI with the minimal necessary changes. However, I'm encountering issues with regard to how oneAPI treats queues versus CUDA/HIP streams.&lt;/P&gt;&lt;P&gt;Specifically, I have 3 serially dependent kernels that execute back-to-back inside an outer loop and operate on memory regions allocated with malloc_device. Before executing the first kernel I want to launch a host-&amp;gt;device copy to prepare data for the next loop iteration, and this copy needs to proceed in parallel with the computation. A synchronization barrier on the copy at the end of the loop iteration ensures the data is ready for the next iteration. This is trivial to implement in CUDA/HIP using two different streams. However, there appears to be no good way to replicate it in oneAPI, for the following reasons:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;If I use a single FIFO (in-order) queue to launch the kernels and memory copies, which guarantees the serial dependency of the kernels, I do not get any overlap of the copy and the compute.&lt;/LI&gt;&lt;LI&gt;If I use a single, default (out-of-order) queue, I need a barrier between the kernel launches to ensure they complete in order. However, this barrier also blocks until the copy completes, preventing the copy from continuing behind the scenes during the execution of the second and third kernels.&lt;/LI&gt;&lt;LI&gt;If I create multiple queues (in-order or out-of-order) I can more closely replicate the CUDA/HIP stream setup, with the copy issued on one queue (the copy queue) and the kernels on the other (the compute queue), with barriers on the compute queue (or an in-order queue) to ensure kernel ordering. However, my testing, performed on ALCF Sunspot with onetrace, suggests that copies and kernels launched on different queues (even out-of-order ones) do not overlap, even when the queues are created with the same context.&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;Is what I'm asking achievable using oneAPI?&lt;/P&gt;</description>
      <pubDate>Tue, 30 May 2023 18:10:22 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Overlap-copy-and-compute-on-two-queues/m-p/1491260#M3113</guid>
      <dc:creator>ckelly1312</dc:creator>
      <dc:date>2023-05-30T18:10:22Z</dc:date>
    </item>
    <item>
      <title>Re: Overlap copy and compute on two queues</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Overlap-copy-and-compute-on-two-queues/m-p/1491539#M3116</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Thank you for posting in Intel Communities.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;It would be greatly helpful if you provide the complete CUDA/HIP code to us so that we can investigate the issue more from our end.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Thanks and Regards,&lt;/P&gt;
&lt;P&gt;Pendyala Sesha Srinivas&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 01 Jun 2023 06:17:42 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Overlap-copy-and-compute-on-two-queues/m-p/1491539#M3116</guid>
      <dc:creator>SeshaP_Intel</dc:creator>
      <dc:date>2023-06-01T06:17:42Z</dc:date>
    </item>
    <item>
      <title>Re: Overlap copy and compute on two queues</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Overlap-copy-and-compute-on-two-queues/m-p/1493166#M3123</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;We haven't heard back from you. Could you please provide an update on your issue?&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Thanks and Regards,&lt;/P&gt;&lt;P&gt;Pendyala Sesha Srinivas&lt;/P&gt;&lt;BR /&gt;</description>
      <pubDate>Tue, 06 Jun 2023 07:04:35 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Overlap-copy-and-compute-on-two-queues/m-p/1493166#M3123</guid>
      <dc:creator>SeshaP_Intel</dc:creator>
      <dc:date>2023-06-06T07:04:35Z</dc:date>
    </item>
    <item>
      <title>Re: Overlap copy and compute on two queues</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Overlap-copy-and-compute-on-two-queues/m-p/1493350#M3126</link>
      <description>&lt;P&gt;Hi Pendyala,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Apologies for the delayed response. I am wondering if my problem is more fundamental, as I have not been able to overlap copies and kernels even with a single, non-FIFO queue.&lt;/P&gt;&lt;P&gt;The code is as follows:&lt;/P&gt;&lt;LI-CODE lang="cpp"&gt;#include&amp;lt;iostream&amp;gt;
#include&amp;lt;chrono&amp;gt;
#include &amp;lt;sycl/CL/sycl.hpp&amp;gt;
#include &amp;lt;sycl/usm.hpp&amp;gt;
#include &amp;lt;level_zero/ze_api.h&amp;gt;
#include &amp;lt;sycl/ext/oneapi/backend/level_zero.hpp&amp;gt;

cl::sycl::queue *accelerator;

#define accelerator_for2dNB( iter1, num1, iter2, num2, ... ) \
  accelerator-&amp;gt;submit([&amp;amp;](cl::sycl::handler &amp;amp;cgh) {			\
  unsigned long nt=8;							\
  unsigned long unum1 = num1;						\
  unsigned long unum2 = num2;						\
									\
  unsigned long unum1_use = ((unum1 + nt - 1)/nt) * nt     ; /*round up s.t. divisible by nt*/ \
  cl::sycl::range&amp;lt;3&amp;gt; local {nt,1,1};				\
  cl::sycl::range&amp;lt;3&amp;gt; global{unum1_use,unum2,1};			\
  cgh.parallel_for(							\
		   cl::sycl::nd_range&amp;lt;3&amp;gt;(global,local),			\
		   [=] (cl::sycl::nd_item&amp;lt;3&amp;gt; item) /*mutable*/		\
		   [[intel::reqd_sub_group_size(16)]]			\
		   {							\
		     auto iter1    = item.get_global_id(0);		\
		     auto iter2    = item.get_global_id(1);		\
		     auto lane     = item.get_global_id(2);		\
		     if(iter1&amp;lt;unum1){ __VA_ARGS__ };			\
		   });							\
    });									


  
#define accelerator_barrier(){ accelerator-&amp;gt;wait(); }
inline void acceleratorCopySynchronise(void) {  accelerator-&amp;gt;wait(); }
inline void acceleratorBarrierAll(){ accelerator_barrier();  }


inline void *acceleratorAllocDevice(size_t bytes){ return malloc_device(bytes,*accelerator);};
inline void acceleratorFreeDevice(void *ptr){free(ptr,*accelerator);};
inline void acceleratorCopyDeviceToDeviceAsynch(void *from,void *to,size_t bytes)  {  accelerator-&amp;gt;memcpy(to,from,bytes);}
inline void acceleratorCopyToDevice(void *from,void *to,size_t bytes)  { accelerator-&amp;gt;memcpy(to,from,bytes); accelerator-&amp;gt;wait();}
inline void acceleratorCopyFromDevice(void *from,void *to,size_t bytes){ accelerator-&amp;gt;memcpy(to,from,bytes); accelerator-&amp;gt;wait();}
inline void acceleratorCopyToDeviceAsync(void *from,void *to,size_t bytes)  { accelerator-&amp;gt;memcpy(to,from,bytes); }



int main(int argc, char **argv){
  std::cout &amp;lt;&amp;lt; "START" &amp;lt;&amp;lt; std::endl;
  int nDevices = 1;
  cl::sycl::gpu_selector selector;
  cl::sycl::device selectedDevice { selector };
  accelerator = new sycl::queue (selectedDevice);

  size_t n = 10000000;
  size_t b = n*sizeof(double);
  double* dd = (double*)acceleratorAllocDevice(b);
  double* dh = (double*)malloc(b);
  double* rd1 = (double*)acceleratorAllocDevice(10*b);
  double* rh1 = (double*)malloc(10*b);
  
  acceleratorCopyToDevice(dh,dd,b);
  acceleratorBarrierAll();

  for(int i=0;i&amp;lt;10;i++){
    acceleratorCopyToDeviceAsync(rh1,rd1,10*b);
    accelerator_for2dNB( i, n, dummy,1, {
	for(int j=0;j&amp;lt;1000;j++)
	  dd[i] = dd[i]*dd[i] + j;
      });
    accelerator_for2dNB( i, n, dummy,1, {
	for(int j=0;j&amp;lt;1000;j++)
	  dd[i] = dd[i]*dd[i] + j;
      });

    acceleratorBarrierAll();
  }
  //acceleratorFreeDevice(dd);
  //free(dh);
  std::cout &amp;lt;&amp;lt; "END" &amp;lt;&amp;lt; std::endl;
  return 0;
}&lt;/LI-CODE&gt;&lt;P&gt;As you see, it launches a non-blocking memcpy to device as well as 2 kernels. In the output I observe the following pattern:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="image.png" style="width: 999px;"&gt;&lt;img src="https://community.intel.com/t5/image/serverpage/image-id/42266iC0DEF9C4CC30E6B3/image-size/large/is-moderation-mode/true?v=v2&amp;amp;px=999&amp;amp;whitelist-exif-data=Orientation%2CResolution%2COriginalDefaultFinalSize%2CCopyright" role="button" title="image.png" alt="image.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;where we see that the memcpy and kernels simply refuse to overlap (the two kernels do, however). I am using the following environment variables:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;export ONEAPI_DEVICE_SELECTOR=gpu,level_zero
export EnableImplicitScaling=0
export ForceThreadGroupDispatchSize=0
export SYCL_PI_LEVEL_ZERO_DEVICE_SCOPE_EVENTS=0
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export SYCL_PI_LEVEL_ZERO_USE_COPY_ENGINE=0&lt;/LI-CODE&gt;&lt;P&gt;and running on ALCF Sunspot with the default oneapi installation.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I would appreciate any help you can provide.&lt;/P&gt;</description>
      <pubDate>Tue, 06 Jun 2023 18:42:05 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Overlap-copy-and-compute-on-two-queues/m-p/1493350#M3126</guid>
      <dc:creator>ckelly1312</dc:creator>
      <dc:date>2023-06-06T18:42:05Z</dc:date>
    </item>
    <item>
      <title>Re: Overlap copy and compute on two queues</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Overlap-copy-and-compute-on-two-queues/m-p/1494980#M3134</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Thanks for sharing the DPC++ code with us.&lt;/P&gt;&lt;P&gt;It would be greatly helpful if you share the CUDA code with us so that we can reproduce the issue on our end.&lt;/P&gt;&lt;P&gt;And also could you please let us know how you are observing the timeline?&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Thanks and Regards,&lt;/P&gt;&lt;P&gt;Pendyala Sesha Srinivas&lt;/P&gt;&lt;BR /&gt;</description>
      <pubDate>Mon, 12 Jun 2023 17:44:35 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Overlap-copy-and-compute-on-two-queues/m-p/1494980#M3134</guid>
      <dc:creator>SeshaP_Intel</dc:creator>
      <dc:date>2023-06-12T17:44:35Z</dc:date>
    </item>
    <item>
      <title>Re: Overlap copy and compute on two queues</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Overlap-copy-and-compute-on-two-queues/m-p/1496376#M3138</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;We haven't heard back from you. Could you please provide an update on the issue?&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Thanks and Regards,&lt;/P&gt;&lt;P&gt;Pendyala Sesha Srinivas&lt;/P&gt;&lt;BR /&gt;</description>
      <pubDate>Fri, 16 Jun 2023 07:09:23 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Overlap-copy-and-compute-on-two-queues/m-p/1496376#M3138</guid>
      <dc:creator>SeshaP_Intel</dc:creator>
      <dc:date>2023-06-16T07:09:23Z</dc:date>
    </item>
    <item>
      <title>Re: Overlap copy and compute on two queues</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Overlap-copy-and-compute-on-two-queues/m-p/1498085#M3147</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;I have not heard back from you. This thread will no longer be monitored by Intel. If you need further assistance, please post a new question.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Thanks and Regards,&lt;/P&gt;&lt;P&gt;Pendyala Sesha Srinivas&lt;/P&gt;&lt;BR /&gt;</description>
      <pubDate>Thu, 22 Jun 2023 07:41:50 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Overlap-copy-and-compute-on-two-queues/m-p/1498085#M3147</guid>
      <dc:creator>SeshaP_Intel</dc:creator>
      <dc:date>2023-06-22T07:41:50Z</dc:date>
    </item>
  </channel>
</rss>
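Editor's note: the overlap pattern discussed in this thread (an asynchronous host-to-device copy streaming behind two serially ordered kernels) can be sketched in SYCL 2020 with two in-order queues that share a device and context, waiting on the copy's event rather than on a whole queue. This is a minimal illustration under stated assumptions, not a tested reproduction of the poster's Sunspot setup: the queue names, buffer sizes, and kernel bodies are invented for the example, and whether the copy actually overlaps the kernels still depends on the backend and driver (for instance, on the Level Zero copy engine being left enabled rather than disabled via SYCL_PI_LEVEL_ZERO_USE_COPY_ENGINE=0 as in the reproducer above).

```cpp
#include <sycl/sycl.hpp>
#include <vector>

int main() {
  // In-order (FIFO) compute queue: preserves kernel ordering without
  // queue-wide waits between launches.
  sycl::queue compute{sycl::gpu_selector_v, sycl::property::queue::in_order{}};

  // Second in-order queue on the SAME device and context, used only for
  // copies, mirroring a dedicated CUDA/HIP copy stream.
  sycl::queue copy{compute.get_context(), compute.get_device(),
                   sycl::property::queue::in_order{}};

  const size_t n = 1 << 20;
  double *dd = sycl::malloc_device<double>(n, compute);  // compute buffer
  double *rd = sycl::malloc_device<double>(n, compute);  // staging buffer
  std::vector<double> rh(n, 1.0);                        // host source

  // Initialize the compute buffer so the kernels read defined data.
  std::vector<double> hinit(n, 0.5);
  compute.memcpy(dd, hinit.data(), n * sizeof(double)).wait();

  for (int it = 0; it < 10; ++it) {
    // Stage next iteration's data on the copy queue; intended to run
    // concurrently with the kernels on the compute queue.
    sycl::event cp = copy.memcpy(rd, rh.data(), n * sizeof(double));

    // Two serially dependent kernels: the in-order compute queue keeps
    // them ordered, so no barrier is needed between the launches.
    compute.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
      dd[i] = dd[i] * dd[i] + 1.0;
    });
    compute.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
      dd[i] = dd[i] * dd[i] + 2.0;
    });

    // End-of-iteration synchronization: wait only on what must finish,
    // instead of a device-wide barrier that would serialize everything.
    cp.wait();       // staged data is ready for the next iteration
    compute.wait();  // both kernels are done
  }

  sycl::free(dd, compute);
  sycl::free(rd, compute);
  return 0;
}
```

The key difference from the single-queue reproducer quoted above is that synchronization is expressed through the copy's `sycl::event` plus a wait on the compute queue, so neither wait forces the other queue's work to drain early.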