Intel® oneAPI DPC++/C++ Compiler

Overlap copy and compute on two queues

ckelly1312
Beginner

Hello all,

I am trying to port my CUDA/HIP code to oneAPI with the minimal necessary changes. However, I'm encountering issues with how oneAPI queues behave compared to CUDA/HIP streams.

Specifically, I have 3 serially dependent kernels that execute back-to-back inside an outer loop and operate on memory regions allocated using malloc_device. Before executing the first kernel, I want to launch a host->device copy that prepares data for the next loop iteration and runs in parallel with the computation. A synchronization barrier on the copy at the end of the loop iteration ensures the data is ready for the next one. This is trivial to implement in CUDA/HIP using two streams. However, there appears to be no good way to replicate it in oneAPI, for the following reasons:

  1. If I use a single FIFO (in-order) queue to launch both the kernels and the memory copies, I ensure the serial dependency of the kernels but lose the overlap of copy and compute.
  2. If I use a single, default (non-FIFO, out-of-order) queue, I need a barrier between the kernel launches to ensure they complete in order. However, this barrier also blocks until the copy completes, preventing the copy from continuing behind the scenes during the execution of the second and third kernels.
  3. If I create multiple queues (FIFO or non-FIFO), I can more closely replicate the CUDA/HIP stream setup: the copy is issued on one queue (the copy queue) and the kernels on the other (the compute queue), with barriers on the compute queue (or a FIFO queue) ensuring kernel ordering; a sketch of this pattern follows below. However, my testing on ALCF Sunspot with onetrace suggests that copies and kernels launched on different queues (even out-of-order ones) do not overlap, even when the queues are created with the same context.

Is what I'm asking achievable using oneAPI?
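
For reference, the structure I would expect to write, by analogy with two CUDA streams, is roughly the following sketch (my assumption of the closest SYCL analog; queue names, buffer sizes, and kernel bodies are placeholders, not my production code):

#include <sycl/sycl.hpp>
#include <utility>

int main(){
  sycl::device dev{sycl::gpu_selector_v};
  sycl::context ctx{dev};
  // Two in-order queues sharing one device and context: the closest
  // SYCL analog of two CUDA streams.
  sycl::queue compute_q{ctx, dev, sycl::property::queue::in_order{}};
  sycl::queue copy_q   {ctx, dev, sycl::property::queue::in_order{}};

  const size_t n = 10000000;
  double *work    = sycl::malloc_device<double>(n, compute_q);
  double *staging = sycl::malloc_device<double>(n, compute_q);
  double *host    = sycl::malloc_host<double>(n, compute_q);

  for(int it = 0; it < 10; it++){
    // Stage data for the next iteration on the copy queue...
    sycl::event copy_done = copy_q.memcpy(staging, host, n*sizeof(double));

    // ...while the serially dependent kernels run on the compute queue;
    // the in_order property keeps them ordered without explicit barriers.
    compute_q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i){ work[i] += 1.0; });
    compute_q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i){ work[i] *= 2.0; });
    compute_q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i){ work[i] -= 1.0; });

    // End-of-iteration synchronization: both the copy and the kernels
    // must complete before the staged data is consumed.
    copy_done.wait();
    compute_q.wait();
    std::swap(work, staging); // consume the freshly staged data next iteration
  }
  sycl::free(work, compute_q);
  sycl::free(staging, compute_q);
  sycl::free(host, compute_q);
  return 0;
}

With this structure, the memcpy on copy_q has no dependency on anything submitted to compute_q, so I would expect the runtime to be free to execute it concurrently with the kernels, exactly as two CUDA streams would.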

 

SeshaP_Intel
Moderator

Hi,

Thank you for posting in Intel Communities.

It would be very helpful if you could provide the complete CUDA/HIP code so that we can investigate the issue further on our end.

Thanks and Regards,
Pendyala Sesha Srinivas
SeshaP_Intel
Moderator

Hi,

We haven't heard back from you. Could you please provide an update on your issue?

Thanks and Regards,
Pendyala Sesha Srinivas
ckelly1312
Beginner

Hi Pendyala,

 

Apologies for the delayed response. I am wondering if my problem is more fundamental, as I have not been able to overlap copies and kernels even with a single, non-FIFO (out-of-order) queue.

The code is as follows:

#include <iostream>
#include <chrono>
#include <sycl/CL/sycl.hpp>
#include <sycl/usm.hpp>
#include <level_zero/ze_api.h>
#include <sycl/ext/oneapi/backend/level_zero.hpp>

cl::sycl::queue *accelerator; // single global queue used by all helpers below

/* Launch a 2D kernel asynchronously (non-blocking) on the global queue.
   num1 is rounded up to a multiple of the work-group size nt, and the
   tail is masked off inside the kernel body. */
#define accelerator_for2dNB( iter1, num1, iter2, num2, ... ) \
  accelerator->submit([&](cl::sycl::handler &cgh) {			\
  unsigned long nt=8;							\
  unsigned long unum1 = num1;						\
  unsigned long unum2 = num2;						\
									\
  unsigned long unum1_use = ((unum1 + nt - 1)/nt) * nt     ; /*round up s.t. divisible by nt*/ \
  cl::sycl::range<3> local {nt,1,1};				\
  cl::sycl::range<3> global{unum1_use,unum2,1};			\
  cgh.parallel_for(							\
		   cl::sycl::nd_range<3>(global,local),			\
		   [=] (cl::sycl::nd_item<3> item) /*mutable*/		\
		   [[intel::reqd_sub_group_size(16)]]			\
		   {							\
		     auto iter1    = item.get_global_id(0);		\
		     auto iter2    = item.get_global_id(1);		\
		     auto lane     = item.get_global_id(2);		\
		     if(iter1<unum1){ __VA_ARGS__ };			\
		   });							\
    });									
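// Example: accelerator_for2dNB(x, N, y, M, { body; }) submits ONE non-blocking
// parallel_for over an nd_range<3> of global size {ceil(N/8)*8, M, 1} with
// work-group size {8,1,1}; it returns without waiting for completion.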


  
// Synchronization helpers: all wait on the single global queue.
#define accelerator_barrier() { accelerator->wait(); }
inline void acceleratorCopySynchronise(void) { accelerator->wait(); }
inline void acceleratorBarrierAll() { accelerator_barrier(); }


// USM allocation and copy helpers. The *Async variants enqueue the memcpy
// without waiting; the blocking variants call wait() immediately.
inline void *acceleratorAllocDevice(size_t bytes){ return malloc_device(bytes,*accelerator); }
inline void acceleratorFreeDevice(void *ptr){ free(ptr,*accelerator); }
inline void acceleratorCopyDeviceToDeviceAsynch(void *from,void *to,size_t bytes){ accelerator->memcpy(to,from,bytes); }
inline void acceleratorCopyToDevice(void *from,void *to,size_t bytes){ accelerator->memcpy(to,from,bytes); accelerator->wait(); }
inline void acceleratorCopyFromDevice(void *from,void *to,size_t bytes){ accelerator->memcpy(to,from,bytes); accelerator->wait(); }
inline void acceleratorCopyToDeviceAsync(void *from,void *to,size_t bytes){ accelerator->memcpy(to,from,bytes); }



int main(int argc, char **argv){
  std::cout << "START" << std::endl;
  cl::sycl::gpu_selector selector;
  cl::sycl::device selectedDevice { selector };
  accelerator = new cl::sycl::queue(selectedDevice); // default: out-of-order queue

  size_t n = 10000000;
  size_t b = n*sizeof(double);
  double* dd  = (double*)acceleratorAllocDevice(b);
  double* dh  = (double*)malloc(b);       // host source; contents irrelevant here
  double* rd1 = (double*)acceleratorAllocDevice(10*b);
  double* rh1 = (double*)malloc(10*b);    // host source for the large async copy
  
  // Blocking warm-up copy before the timed loop
  acceleratorCopyToDevice(dh,dd,b);
  acceleratorBarrierAll();

  for(int i=0;i<10;i++){
    // Non-blocking host->device copy, intended to overlap with the two kernels below
    acceleratorCopyToDeviceAsync(rh1,rd1,10*b);
    accelerator_for2dNB( i, n, dummy,1, {
	for(int j=0;j<1000;j++)
	  dd[i] = dd[i]*dd[i] + j;
      });
    accelerator_for2dNB( i, n, dummy,1, {
	for(int j=0;j<1000;j++)
	  dd[i] = dd[i]*dd[i] + j;
      });

    // Wait for both the copy and the kernels before the next iteration
    acceleratorBarrierAll();
  }
  //acceleratorFreeDevice(dd);
  //free(dh);
  std::cout << "END" << std::endl;
  return 0;
}

As you can see, it launches a non-blocking host-to-device memcpy as well as two kernels. In the onetrace output I observe the following pattern:

[Attached image: onetrace device timeline showing the host-to-device memcpy and the two kernels executing one after another, with no overlap between the copy and the compute.]

where the memcpy and the kernels simply refuse to overlap (the two kernels do overlap with each other, however). I am using the following environment variables:

export ONEAPI_DEVICE_SELECTOR=gpu,level_zero
export EnableImplicitScaling=0
export ForceThreadGroupDispatchSize=0
export SYCL_PI_LEVEL_ZERO_DEVICE_SCOPE_EVENTS=0
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export SYCL_PI_LEVEL_ZERO_USE_COPY_ENGINE=0

and running on ALCF Sunspot with the default oneAPI installation.
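
For completeness: I build the reproducer with something along the lines of

icpx -fsycl -O2 test.cpp -o test

and the timeline above was collected with the onetrace tool from pti-gpu, invoked roughly as

onetrace --chrome-device-timeline ./test

(the source file and binary names here are placeholders).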

 

I would appreciate any help you can provide.

SeshaP_Intel
Moderator

Hi,

Thanks for sharing the DPC++ code with us.

It would be very helpful if you could also share the CUDA code so that we can reproduce the issue on our end.

Could you also let us know how you are observing the timeline?

Thanks and Regards,
Pendyala Sesha Srinivas
SeshaP_Intel
Moderator

Hi,

We haven't heard back from you. Could you please provide an update on the issue?

Thanks and Regards,
Pendyala Sesha Srinivas
SeshaP_Intel
Moderator

Hi,

I have not heard back from you. This thread will no longer be monitored by Intel. If you need further assistance, please post a new question.

Thanks and Regards,
Pendyala Sesha Srinivas