Intel® oneAPI DPC++/C++ Compiler

Questions about data copy when using q.memcpy()

TCK
Novice

I want to do calculations on the GPU; the code structure is as follows: copy the input data to GPU memory, run the calculation, then copy the results from the GPU back to the CPU.

for (int i = 0; i < 100; i++) {
    int* data_dev = malloc_device<int>(data_size, q);
    int* result_dev = malloc_device<int>(result_size, q);
    q.memcpy(data_dev, data_host, sizeof(int) * data_size).wait(); // copy input data to GPU
    q.submit([&](handler& h) {
        // kernel_1
    });
    q.wait();
    q.submit([&](handler& h) {
        // kernel_2, stores results to result_dev
    });
    q.wait();
    q.memcpy(result_host, result_dev, sizeof(int) * result_size).wait(); // copy results back to CPU
    free(data_dev, q);
    free(result_dev, q);
}

Ideally, I want the for loop to submit all of the kernels to the GPU and then wait for the calculations to finish after the loop. However, as I understand it, the .wait() on q.memcpy() blocks the loop.

 

My question is:

Is there any way to do this without blocking the loop? For example, can sycl::event be used to make the two kernels depend on each other?
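
To make this concrete, here is a rough, untested sketch of the restructuring I have in mind. The run_all wrapper and the per-iteration host buffers (data_host_chunks / result_host_chunks) are hypothetical, introduced only so that overlapping iterations do not race on a single shared host buffer:

#include <sycl/sycl.hpp>
#include <vector>
using namespace sycl;

// Hypothetical wrapper; data_host_chunks / result_host_chunks are my own
// additions so overlapping iterations each have their own host buffers.
void run_all(queue& q, int** data_host_chunks, int** result_host_chunks,
             size_t data_size, size_t result_size) {
    std::vector<event> copy_outs;
    std::vector<int*> dev_ptrs;

    for (int i = 0; i < 100; i++) {
        int* data_dev = malloc_device<int>(data_size, q);
        int* result_dev = malloc_device<int>(result_size, q);
        dev_ptrs.push_back(data_dev);
        dev_ptrs.push_back(result_dev);

        // Asynchronous copy-in: returns an event instead of blocking.
        event copy_in =
            q.memcpy(data_dev, data_host_chunks[i], sizeof(int) * data_size);

        // kernel_1 depends only on this iteration's copy-in.
        event k1 = q.submit([&](handler& h) {
            h.depends_on(copy_in);
            // kernel_1
        });

        // kernel_2 depends on kernel_1 and writes result_dev.
        event k2 = q.submit([&](handler& h) {
            h.depends_on(k1);
            // kernel_2, stores results to result_dev
        });

        // Copy-out depends on kernel_2; keep the event for later.
        copy_outs.push_back(q.submit([&](handler& h) {
            h.depends_on(k2);
            h.memcpy(result_host_chunks[i], result_dev, sizeof(int) * result_size);
        }));
        // Freeing data_dev/result_dev here would still require a wait,
        // so the frees are deferred until after the loop.
    }

    // Wait for every copy-out (equivalently, q.wait()), then free.
    for (auto& e : copy_outs) e.wait();
    for (int* p : dev_ptrs) free(p, q);
}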

(P.S. The reason I don't use accessors is that the input data needs to be used in both kernel_1 and kernel_2, and using accessors would result in two data copies.)

 

Best, 

TCK

Sravani_K_Intel
Moderator

Using SYCL events and adding dependencies accordingly can help overlap data transfers with computation on the device. You may check an example of this here.

TCK
Novice

Thank you for the response.

The example is helpful, and I now understand how to overlap the data transfer with computation. The question that arises from this is whether we can add a dependency on the operation that releases device memory.

The following is the code from the provided example. It allocates all the device memory it needs up front and frees it only after all the kernels are done, which could run out of memory on devices with limited capacity.

  for (int it = 0; it < iter; it++) {
    for (int c = 0; c < num_chunks; c++) {
      auto add_one = [=](auto id) {
        for (int i = 0; i < KERNEL_ITERS; i++)
          device_data[c][id] += 1.0;
      };
      // Copy-in not dependent on previous event
      auto copy_in =
          q.memcpy(device_data[c], host_data[c], sizeof(float) * chunk_size);
      // Compute waits for copy_in
      auto compute = q.parallel_for(chunk_size, copy_in, add_one);
      auto cg = [=](auto &h) {
        h.depends_on(compute);
        h.memcpy(host_data[c], device_data[c], sizeof(float) * chunk_size);
      };
      // Copy out waits for compute
      auto copy_out = q.submit(cg);
      // Q: Can the user manually free device_data[c] here once copy_out is done?
    }

    q.wait();
  }

So, manually releasing the device memory is one possible solution. Does SYCL provide any method to wait for the previous kernels to finish and then release the device memory?
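
One pattern I am considering, sketched below under my own assumptions (the process_chunks wrapper and the back-pressure comment are mine, not from the linked example): keep each chunk's copy_out event together with its device pointer, and call sycl::free only after waiting on that event.

#include <sycl/sycl.hpp>
#include <utility>
#include <vector>

// Sketch under my own assumptions: keep each chunk's copy-out event with its
// device pointer, then free the buffer as soon as that event has completed.
void process_chunks(sycl::queue& q, float** host_data,
                    int num_chunks, size_t chunk_size) {
  std::vector<std::pair<sycl::event, float*>> pending;

  for (int c = 0; c < num_chunks; c++) {
    float* dev = sycl::malloc_device<float>(chunk_size, q);

    auto copy_in = q.memcpy(dev, host_data[c], sizeof(float) * chunk_size);
    auto compute = q.parallel_for(sycl::range<1>(chunk_size), copy_in,
                                  [=](auto id) { dev[id] += 1.0f; });
    auto copy_out = q.submit([&](sycl::handler& h) {
      h.depends_on(compute);
      h.memcpy(host_data[c], dev, sizeof(float) * chunk_size);
    });
    pending.push_back({copy_out, dev});

    // Possible back-pressure for small-memory devices: wait on the oldest
    // pending copy-out here before allocating the next chunk.
  }

  // Release each buffer once its own copy-in -> compute -> copy-out chain
  // has finished; ev.wait() blocks only for that chunk.
  for (auto& [ev, dev] : pending) {
    ev.wait();
    sycl::free(dev, q);
  }
}

This still performs the frees on the host after a wait, though. What I am unsure about is whether the free itself can be made a dependent operation, for example by calling sycl::free inside a host_task that depends_on(copy_out), and whether that would be safe and portable.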

 

Best,

TCK

