Intel® oneAPI DPC++/C++ Compiler

Questions about data copy when using q.memcpy()

TCK
Novice

I want to do calculations on the GPU; the code structure is as follows: copy the input data to GPU memory, run the calculation, then copy the results from the GPU back to the CPU.

for (int i = 0; i < 100; i++) {
    int* data_dev = malloc_device<int>(data_size, q);
    int* result_dev = malloc_device<int>(result_size, q);
    q.memcpy(data_dev, data_host, sizeof(int) * data_size).wait(); // copy input data to GPU
    q.submit([&](handler& h) {
        // kernel_1
    });
    q.wait();
    q.submit([&](handler& h) {
        // kernel_2, stores results to result_dev
    });
    q.wait();
    q.memcpy(result_host, result_dev, sizeof(int) * result_size).wait(); // copy results back to CPU
    free(data_dev, q);
    free(result_dev, q);
}

Ideally, I want the for loop to submit all of the kernels to the GPU and then wait for the calculations to finish after the loop. However, as I understand it, the .wait() on q.memcpy() blocks the loop.

 

My question is:

Is there any way to do this without blocking the loop? For example, can sycl::event be used to make the two kernels depend on each other?
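
To make this concrete, here is a rough, untested sketch of the restructuring I have in mind. The run_all wrapper and the per-iteration host buffers (data_host_chunks / result_host_chunks) are hypothetical, introduced only so that overlapping iterations do not race on a single shared host buffer:

#include <sycl/sycl.hpp>
#include <vector>
using namespace sycl;

// Hypothetical wrapper; data_host_chunks / result_host_chunks are my own
// additions so overlapping iterations each have their own host buffers.
void run_all(queue& q, int** data_host_chunks, int** result_host_chunks,
             size_t data_size, size_t result_size) {
    std::vector<event> copy_outs;
    std::vector<int*> dev_ptrs;

    for (int i = 0; i < 100; i++) {
        int* data_dev = malloc_device<int>(data_size, q);
        int* result_dev = malloc_device<int>(result_size, q);
        dev_ptrs.push_back(data_dev);
        dev_ptrs.push_back(result_dev);

        // Asynchronous copy-in: returns an event instead of blocking.
        event copy_in =
            q.memcpy(data_dev, data_host_chunks[i], sizeof(int) * data_size);

        // kernel_1 depends only on this iteration's copy-in.
        event k1 = q.submit([&](handler& h) {
            h.depends_on(copy_in);
            // kernel_1
        });

        // kernel_2 depends on kernel_1 and writes result_dev.
        event k2 = q.submit([&](handler& h) {
            h.depends_on(k1);
            // kernel_2, stores results to result_dev
        });

        // Copy-out depends on kernel_2; keep the event for later.
        copy_outs.push_back(q.submit([&](handler& h) {
            h.depends_on(k2);
            h.memcpy(result_host_chunks[i], result_dev, sizeof(int) * result_size);
        }));
        // Freeing data_dev/result_dev here would still require a wait,
        // so the frees are deferred until after the loop.
    }

    // Wait for every copy-out (equivalently, q.wait()), then free.
    for (auto& e : copy_outs) e.wait();
    for (int* p : dev_ptrs) free(p, q);
}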

(P.S. The reason I don't use accessors is that the input data needs to be used in both kernel_1 and kernel_2, and using accessors would result in two data copies.)

 

Best, 

TCK

Sravani_K_Intel
Moderator

Using SYCL events and adding dependencies accordingly can help overlap data transfers with computation on the device. You may check an example of this here.

TCK
Novice

Thank you for the response.

The example is helpful, and I now understand how to overlap the data transfer with computation. The question that arises from this is whether we can add a dependency on the operation that releases device memory.

The following is the code from the provided example. It allocates all the device memory it needs up front and frees it only after all the kernels are done, which could run out of memory on devices with limited capacity.

  for (int it = 0; it < iter; it++) {
    for (int c = 0; c < num_chunks; c++) {
      auto add_one = [=](auto id) {
        for (int i = 0; i < KERNEL_ITERS; i++)
          device_data[c][id] += 1.0;
      };
      // Copy-in not dependent on previous event
      auto copy_in =
          q.memcpy(device_data[c], host_data[c], sizeof(float) * chunk_size);
      // Compute waits for copy_in
      auto compute = q.parallel_for(chunk_size, copy_in, add_one);
      auto cg = [=](auto &h) {
        h.depends_on(compute);
        h.memcpy(host_data[c], device_data[c], sizeof(float) * chunk_size);
      };
      // Copy out waits for compute
      auto copy_out = q.submit(cg);
      // Q: Can the user manually free device_data[c] here once copy_out is done?
    }

    q.wait();
  }

So, manually releasing the device memory is one possible solution. Does SYCL provide any method to wait for the previous kernels to finish and then release the device memory?
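
One pattern I am considering, sketched below under my own assumptions (the process_chunks wrapper and the back-pressure comment are mine, not from the linked example): keep each chunk's copy_out event together with its device pointer, and call sycl::free only after waiting on that event.

#include <sycl/sycl.hpp>
#include <utility>
#include <vector>

// Sketch under my own assumptions: keep each chunk's copy-out event with its
// device pointer, then free the buffer as soon as that event has completed.
void process_chunks(sycl::queue& q, float** host_data,
                    int num_chunks, size_t chunk_size) {
  std::vector<std::pair<sycl::event, float*>> pending;

  for (int c = 0; c < num_chunks; c++) {
    float* dev = sycl::malloc_device<float>(chunk_size, q);

    auto copy_in = q.memcpy(dev, host_data[c], sizeof(float) * chunk_size);
    auto compute = q.parallel_for(sycl::range<1>(chunk_size), copy_in,
                                  [=](auto id) { dev[id] += 1.0f; });
    auto copy_out = q.submit([&](sycl::handler& h) {
      h.depends_on(compute);
      h.memcpy(host_data[c], dev, sizeof(float) * chunk_size);
    });
    pending.push_back({copy_out, dev});

    // Possible back-pressure for small-memory devices: wait on the oldest
    // pending copy-out here before allocating the next chunk.
  }

  // Release each buffer once its own copy-in -> compute -> copy-out chain
  // has finished; ev.wait() blocks only for that chunk.
  for (auto& [ev, dev] : pending) {
    ev.wait();
    sycl::free(dev, q);
  }
}

This still performs the frees on the host after a wait, though. What I am unsure about is whether the free itself can be made a dependent operation, for example by calling sycl::free inside a host_task that depends_on(copy_out), and whether that would be safe and portable.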

 

Best,

TCK

