I want to do calculations on the GPU; the code structure is as follows: copy the input data to GPU memory, do the calculation, then copy the results from the GPU back to the CPU.
for (int i = 0; i < 100; i++) {
    int* data_dev = malloc_device<int>(data_size, q);
    int* result_dev = malloc_device<int>(result_size, q);
    q.memcpy(data_dev, data_host, sizeof(int) * data_size).wait(); // copy input to GPU
    q.submit([&](handler& h) {
        // kernel_1
    });
    q.wait();
    q.submit([&](handler& h) {
        // kernel_2, store results to result_dev
    });
    q.wait();
    q.memcpy(result_host, result_dev, sizeof(int) * result_size).wait(); // copy result back
    free(data_dev, q);
    free(result_dev, q);
}
Ideally, I want the for loop to submit all of the kernels to the GPU and then wait for the calculations to finish after the loop. However, the wait() on q.memcpy() blocks the loop (as I understand it).
My question is:
Is there any way to do this without blocking the loop? Something like using sycl::event to make the two kernels dependent on each other?
(P.S. The reason I don't use accessors is that the input data is needed in both kernel_1 and kernel_2, and using accessors would result in two data copies.)
Best,
TCK
Using SYCL events and adding dependencies accordingly can help overlap data transfers with computation on the device. You may check an example of this here.
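As a concrete illustration of that approach (a sketch, not the linked example itself; the sizes, kernel bodies, and variable names below are placeholders), the loop body from the original post can be rewritten so that each operation takes the previous operation's sycl::event as a dependency. Nothing blocks the host until the single wait() at the end:

```cpp
#include <sycl/sycl.hpp>
#include <vector>

int main() {
    sycl::queue q;
    constexpr size_t data_size = 1024;
    std::vector<int> host(data_size, 1);
    std::vector<int> result(data_size, 0);

    int* data_dev = sycl::malloc_device<int>(data_size, q);
    int* result_dev = sycl::malloc_device<int>(data_size, q);

    // Each operation carries the previous operation's event as a
    // dependency, so the runtime orders them without host-side waits.
    sycl::event copy_in =
        q.memcpy(data_dev, host.data(), sizeof(int) * data_size);
    sycl::event k1 = q.parallel_for(sycl::range<1>{data_size}, copy_in,
        [=](sycl::id<1> i) { data_dev[i] *= 2; });      // placeholder kernel_1
    sycl::event k2 = q.parallel_for(sycl::range<1>{data_size}, k1,
        [=](sycl::id<1> i) { result_dev[i] = data_dev[i] + 1; }); // placeholder kernel_2
    sycl::event copy_out = q.submit([&](sycl::handler& h) {
        h.depends_on(k2);
        h.memcpy(result.data(), result_dev, sizeof(int) * data_size);
    });
    copy_out.wait();  // single blocking point, after everything is submitted

    sycl::free(data_dev, q);
    sycl::free(result_dev, q);
    return 0;
}
```

In a loop over chunks, each iteration would build its own event chain this way, and a single q.wait() after the loop would synchronize everything.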
Thank you for the response.
The example is helpful; I now understand how to overlap the data transfer with computation. The question that arises from this is whether we can add a dependency on the operation that releases device memory.
The following code is from the provided example. It allocates all the device memory it needs up front and frees it only after all the kernels are done, which might run out of memory on devices with limited memory.
for (int it = 0; it < iter; it++) {
    for (int c = 0; c < num_chunks; c++) {
        auto add_one = [=](auto id) {
            for (int i = 0; i < KERNEL_ITERS; i++)
                device_data[c][id] += 1.0;
        };
        // Copy-in not dependent on previous event
        auto copy_in =
            q.memcpy(device_data[c], host_data[c], sizeof(float) * chunk_size);
        // Compute waits for copy_in
        auto compute = q.parallel_for(chunk_size, copy_in, add_one);
        auto cg = [=](auto &h) {
            h.depends_on(compute);
            h.memcpy(host_data[c], device_data[c], sizeof(float) * chunk_size);
        };
        // Copy out waits for compute
        auto copy_out = q.submit(cg);
        // Q: Can the user manually free device memory here once the copy_out operation is done?
    }
    q.wait();
}
So, manually releasing device memory is one possible solution. Does SYCL provide any method to wait for the previous kernel to finish and then release the device memory?
Best,
TCK
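One way this could be done without blocking the loop (a hedged sketch, assuming a SYCL 2020 implementation such as DPC++; `copy_out`, `device_data`, and `c` refer to the example above) is to submit a host_task that depends on copy_out and frees that chunk's memory when it runs:

```cpp
// Inside the inner loop of the example above, after copy_out is obtained.
// host_task is a SYCL 2020 feature: the lambda executes on the host once
// its dependencies (here, copy_out) have completed.
float* chunk_ptr = device_data[c];
q.submit([&](sycl::handler& h) {
    h.depends_on(copy_out);
    h.host_task([chunk_ptr, &q] { sycl::free(chunk_ptr, q); });
});
```

A simpler alternative, if blocking per chunk is acceptable, is to call copy_out.wait() and then sycl::free(device_data[c], q) directly. Note that with the host_task approach, the queue (or its context) must outlive the task, since the lambda captures it by reference.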