Can kernel execution and buffer writes run in parallel?

Scout · 01-20-2025

Hello, dear Intel team,

I have an application with 30000 arrays in host memory. I want to transfer them one by one to GPU global memory and run a simple calculation on each array.

I'm using clEnqueueWriteBuffer for the data transfer and clEnqueueNDRangeKernel for the calculation.

I want to hide the data transfer latency. The calculation takes 0.008 s per array and the data transfer takes 0.006 s per array. If the two can run in parallel, the total time should be about the same as the calculation alone.
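To make the expectation concrete, here is a small sketch of the arithmetic in plain C (no OpenCL involved). The function names serial_total and overlapped_total are my own, and the model assumes an ideal pipeline in which the transfer of the next array fully overlaps the current kernel.

```c
#include <assert.h>

/* Total time when every transfer and every kernel run back to back. */
double serial_total(int n, double t_xfer, double t_kernel)
{
    assert(n > 0);
    return n * (t_xfer + t_kernel);
}

/* Total time for an ideal two-stage pipeline (assumption: perfect overlap).
 * The first transfer and the last kernel are exposed; each of the n - 1
 * steps in between costs as much as the slower of the two stages. */
double overlapped_total(int n, double t_xfer, double t_kernel)
{
    assert(n > 0);
    double step = (t_xfer > t_kernel) ? t_xfer : t_kernel;
    return t_xfer + (n - 1) * step + t_kernel;
}
```

With n = 30000, a 0.006 s transfer and a 0.008 s kernel, serial_total gives about 420 s and overlapped_total gives about 240 s, so perfect overlap would save roughly 180 s over the whole run.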

My code is like this:

1. Create the queue with the out-of-order option:

queue = clCreateCommandQueue(context, device, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);

2. Transfer the first array and have it generate an event:

err = clEnqueueWriteBuffer(queue, srcA, CL_TRUE, 0, M*N*sizeof(*A), A, 0, NULL, &transfer_event);

3. Use a loop to get parallel execution: run the calculation on the current array and transfer the next array at the same time. Wait for the kernel to finish before the next iteration, since the calculation takes longer than the data transfer.

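for (i = 0; i < 30000; i++)
{
    /* kernel waits for the previous transfer to finish */
    err = clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local, 1, &transfer_event, &compute_event);
    /* blocking write (CL_TRUE) of the next array into the same buffer */
    err = clEnqueueWriteBuffer(queue, srcA, CL_TRUE, 0, M*N*sizeof(*A), A, 0, NULL, &transfer_event);
    /* wait for the kernel before starting the next iteration */
    err = clWaitForEvents(1, &compute_event);
}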

With this code, I am trying to hide the buffer write latency behind the kernel execution. If the two run in parallel, there should be no performance drop at all.

But when I test it, the total latency per array is 0.015 s (0.008 s for kernel execution, 0.006 s for the buffer write, and 0.001 s of unexplained overhead).

My question is: why isn't the buffer write hidden behind the kernel execution, and how can I achieve that?
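One thing I wondered about is whether the single srcA buffer is part of the problem, since the next write would either have to wait for the kernel that is still reading the buffer or race with it. Below is a rough timing model in plain C, not OpenCL code, of what I would expect from rotating between several device buffers with non-blocking writes. The name pipeline_total, the MAX_BUFS limit, and the assumption that the device can copy and compute at the same time are all mine. The comments only indicate which OpenCL calls each step would roughly correspond to.

```c
#include <assert.h>

#define MAX_BUFS 8  /* hypothetical upper limit for this model */

static double later(double a, double b) { return a > b ? a : b; }

/* Models n arrays flowing through nbuf device buffers.
 * Assumption: one copy path and one compute engine that can work concurrently. */
double pipeline_total(int n, int nbuf, double t_xfer, double t_kernel)
{
    assert(n > 0 && nbuf >= 1 && nbuf <= MAX_BUFS);
    double kernel_done[MAX_BUFS] = {0.0}; /* when buffer b was last read */
    double copy_free = 0.0;               /* when the copy path is idle */
    double gpu_free = 0.0;                /* when the compute engine is idle */

    for (int i = 0; i < n; i++) {
        int b = i % nbuf;
        /* Roughly: clEnqueueWriteBuffer(..., CL_FALSE, ...) into buffer b,
         * with the event of the kernel that last read b in its wait list. */
        double xfer_done = later(copy_free, kernel_done[b]) + t_xfer;
        copy_free = xfer_done;
        /* Roughly: clEnqueueNDRangeKernel(...) waiting on that write's event. */
        kernel_done[b] = later(gpu_free, xfer_done) + t_kernel;
        gpu_free = kernel_done[b];
    }
    return gpu_free; /* completion time of the last kernel */
}
```

In this model, pipeline_total(30000, 1, 0.006, 0.008) comes out at about 420 s because every write has to wait for the previous kernel, while pipeline_total(30000, 2, 0.006, 0.008) comes out at about 240 s. Whether the UHD 730 and its driver can actually overlap a copy with a kernel is exactly what I am unsure about.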

System info:

OS: Ubuntu 22.04

GPU: Intel UHD Graphics 730

Bandwidth for buffer transfers from host memory to GPU memory: 8 GB/s (reported by clinfo)

By the way, I don't want to use the zero-copy method, since most GPUs have to transfer data between host memory and GPU memory over PCIe 3.0/4.0 with limited bandwidth, but they can still achieve good performance if the data transfer can be hidden.

Thanks a lot!