topic Re: Questions about data copy when using q.memcpy() in Intel® oneAPI DPC++/C++ Compiler

Questions about data copy when using q.memcpy()

TCK — Wed, 09 Apr 2025 16:41:04 GMT

I want to run calculations on the GPU. The code structure is as follows: copy the input data to GPU memory, run the calculation, then copy the results from the GPU back to the CPU.

    for (int i = 0; i < 100; i++) {
        int* data_dev = malloc_device<int>(data_size, q);
        int* result_dev = malloc_device<int>(result_size, q);
        q.memcpy(data_dev, data_host, sizeof(int) * data_size).wait(); // data copy to gpu
        q.submit([&](handler& h) {
            // kernel_1
        });
        q.wait();
        q.submit([&](handler& h) {
            // kernel_2, store results to result_dev
        });
        q.wait();
        q.memcpy(result_host, result_dev, sizeof(int) * result_size).wait(); // copy result back
        free(data_dev, q);
        free(result_dev, q);
    }

Ideally, I want the for loop to submit all of the kernels to the GPU and then wait for the calculations to finish after the loop. However, as I understand it, the wait() on q.memcpy() blocks the for loop.

My question is:

Is there any way to do this without blocking the loop, for example with something like the sycl::event dependencies used to order two kernels?

(P.S. The reason I don't use accessors is that the input data is needed in both kernel_1 and kernel_2, and using accessors results in two data copies.)

Best,

TCK

Re: Questions about data copy when using q.memcpy()

Sravani_K_Intel — Thu, 10 Apr 2025 23:12:57 GMT

Using SYCL events and adding dependencies accordingly can help overlap data transfers with computation on the device. You may check an example of this here.

Re: Questions about data copy when using q.memcpy()

TCK — Fri, 11 Apr 2025 16:17:08 GMT

Thank you for the response.

The example is helpful. Now I understand how to overlap the data transfer with computation. The question that arises from this is whether we can add a dependency to the operation that releases device memory.

The following code is from the provided example. It has to allocate all the device memory it needs up front and free it only after all the kernels are done. This might exhaust memory on devices with limited memory.

    for (int it = 0; it < iter; it++) {
        for (int c = 0; c < num_chunks; c++) {
            auto add_one = [=](auto id) {
                for (int i = 0; i < KERNEL_ITERS; i++)
                    device_data[c][id] += 1.0;
            };
            // Copy-in not dependent on previous event
            auto copy_in = q.memcpy(device_data[c], host_data[c], sizeof(float) * chunk_size);
            // Compute waits for copy_in
            auto compute = q.parallel_for(chunk_size, copy_in, add_one);
            auto cg = [=](auto &h) {
                h.depends_on(compute);
                h.memcpy(host_data[c], device_data[c], sizeof(float) * chunk_size);
            };
            // Copy out waits for compute
            auto copy_out = q.submit(cg);
            // Q: Can user manually free device_memory in here when copy_out operation is done???
        }
        q.wait();
    }

So, manually releasing device memory is one possible solution. Does SYCL provide any way to wait for a previous kernel to finish and then release the device memory?

Best,

TCK