Zero-copy didn't improve copy performance

Fan_Z_2 · ‎02-18-2016

I am somewhat confused about the usage scenario for zero-copy buffers. I use the CL_MEM_USE_HOST_PTR flag to create a zero-copy 2D image. The host buffer is allocated on a 4096-byte boundary and its total size is a multiple of 64 bytes, so it should be a zero-copy buffer. In my application, I need to write data to this buffer on every loop iteration, so I map the buffer for writing. The mapping takes more time than I expected, even more than a direct write. The pseudo-code is below.

posix_memalign(&host_ptr, 4096, size);
image = create_Image2d(CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, host_ptr);
for (;;)
{
    ptr = map_memobj(BLOCK_MAP, CL_MAP_WRITE, image);
    write new data to ptr;
    unmap(image, ptr);
    ...
}

I use an image2d object instead of a buffer object. Could this be the reason for the inefficiency? Also, I noticed that when the buffer is small I don't even need the map operation: I can operate directly on the host pointer and the result is still correct, which seems odd. Could you give me some clues about what may be going wrong? Thank you.
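The two conditions named in the post (a 4096-byte-aligned base address and a size that is a multiple of 64 bytes) can be checked on the host before the pointer is handed to OpenCL. The following is a minimal sketch in C, assuming those two conditions are the only ones that matter; the helper names `zc_round_size`, `zc_is_eligible`, and `zc_alloc` are hypothetical and not part of any OpenCL API. Passing the check does not by itself guarantee that the driver avoids a copy.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* for posix_memalign */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed constraints, taken from the post: 4096-byte base alignment and
   a total size that is a multiple of 64 bytes. */
enum { ZC_ALIGN = 4096, ZC_SIZE_MULTIPLE = 64 };

/* Round a requested size up to the next multiple of 64 bytes. */
static size_t zc_round_size(size_t bytes)
{
    return (bytes + ZC_SIZE_MULTIPLE - 1) / ZC_SIZE_MULTIPLE * ZC_SIZE_MULTIPLE;
}

/* Return 1 if the pointer and size satisfy both constraints, 0 otherwise. */
static int zc_is_eligible(const void *ptr, size_t bytes)
{
    return ptr != NULL
        && ((uintptr_t)ptr % ZC_ALIGN) == 0
        && bytes != 0
        && (bytes % ZC_SIZE_MULTIPLE) == 0;
}

/* Allocate a host block that meets both constraints.
   Returns NULL on failure; stores the rounded size in *actual_bytes. */
static void *zc_alloc(size_t bytes, size_t *actual_bytes)
{
    void *ptr = NULL;
    size_t rounded = zc_round_size(bytes);

    if (rounded == 0 || posix_memalign(&ptr, ZC_ALIGN, rounded) != 0)
        return NULL;
    assert(zc_is_eligible(ptr, rounded));
    if (actual_bytes != NULL)
        *actual_bytes = rounded;
    return ptr;  /* release with free() when done */
}
```

A caller would allocate with `zc_alloc(size, &actual)` and pass `actual` (not the unrounded `size`) when creating the memory object.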

Robert_I_Intel · ‎02-19-2016

Hi Fan,

Yes, using 2D images always results in a copy to the device, since images need to be tiled in device memory so that samplers and other fixed-function hardware can work with them. See https://software.intel.com/en-us/articles/using-image2d-from-buffer-extension for more information on the image2d-from-buffer extension and a sample showing how to use it.
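To illustrate why a tiled layout forces a copy, here is a toy sketch in C. It assumes a hypothetical 4x4 tile and one byte per pixel; this is not the actual tiling format used by the hardware, and the function names are made up for the example. The point is only that the same pixel lives at a different offset in the two layouts, so a linear host allocation cannot be used in place as a tiled image.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tile size, chosen only for illustration. */
enum { TILE_W = 4, TILE_H = 4 };

/* Offset of pixel (x, y) in a row-major (linear) layout. */
static size_t linear_offset(size_t x, size_t y, size_t width)
{
    return y * width + x;
}

/* Offset of pixel (x, y) when pixels are stored tile by tile.
   width must be a multiple of TILE_W. */
static size_t tiled_offset(size_t x, size_t y, size_t width)
{
    size_t tiles_per_row = width / TILE_W;
    size_t tile_index = (y / TILE_H) * tiles_per_row + (x / TILE_W);
    size_t within_tile = (y % TILE_H) * TILE_W + (x % TILE_W);
    return tile_index * (TILE_W * TILE_H) + within_tile;
}

/* Copy a linear image into the toy tiled layout.
   Every pixel is moved, which is the cost the zero-copy path tries to avoid. */
static void linear_to_tiled(const uint8_t *src, uint8_t *dst,
                            size_t width, size_t height)
{
    assert(width % TILE_W == 0 && height % TILE_H == 0);
    for (size_t y = 0; y < height; ++y)
        for (size_t x = 0; x < width; ++x)
            dst[tiled_offset(x, y, width)] = src[linear_offset(x, y, width)];
}
```

For an 8-pixel-wide image, pixel (4, 0) sits at offset 4 in the linear layout but at offset 16 in this toy tiled layout, because it is the first pixel of the second tile.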

Fan_Z_2 · ‎02-22-2016

Hi, Robert,

Thanks for clearing that up for me. Unfortunately, my device does not report the "cl_khr_image2d_from_buffer" extension, so I created a zero-copy buffer as an interchange buffer between host and device, and I use clEnqueueCopyBufferToImage and clEnqueueCopyImageToBuffer to copy data to and from the image object. These APIs don't take a blocking flag as an argument, and since both are enqueue APIs, I assumed the copy operation is non-blocking. But my test showed otherwise. My steps are below:

1. Enqueue kernel A
2. Enqueue kernel B
3. clEnqueueCopyImageToBuffer(ImageA, bufferA)
4. clEnqueueCopyImageToBuffer(ImageB, bufferB)
5. clFlush
6. Other CPU work
7. clFinish
8. Map bufferA and bufferB for read

Step 3 takes a lot more time than the other enqueue calls. Does that mean the copy call blocks until kernels A and B finish executing and bufferA finishes copying?

Robert_I_Intel · ‎02-22-2016

Technically, clEnqueueCopyImageToBuffer should be a non-blocking call, so you shouldn't have to wait long for it to return.

Maybe try adding event dependencies between steps 1 and 3 and between steps 2 and 4, then try again and see whether it has any impact on enqueue speed.

If you have a small reproducer, that would also help. What processor, OS, and driver version are you using?

Thanks!

Fan_Z_2 · ‎02-23-2016

I tried events, but they didn't improve the enqueue speed. Here is the info you asked for:

OS:

CentOS 7.1

kernel 3.10.0-229.1.2.47109.MSSr1.el7.centos.x86_64

CPU:

Intel(R) Core(TM) i7-4860EQ CPU @ 1.80GHz

GPU:
CL_DEVICE_NAME: Intel(R) HD Graphics 5200
CL_DEVICE_AVAILABLE: 1
CL_DEVICE_VENDOR: Intel(R) Corporation
CL_DEVICE_PROFILE: FULL_PROFILE
CL_DEVICE_VERSION: OpenCL 1.2
CL_DRIVER_VERSION: 16.4.4.47109
CL_DEVICE_OPENCL_C_VERSION: OpenCL C 1.2
CL_DEVICE_MAX_COMPUTE_UNITS: 40

Here are the elapsed times of the API calls:

map/memcpy/unmap zero-copy bufferA&B: 0.010 ms

enqueue copy bufferA&B to imageA'&B': 0.047 ms

enqueue kernel(A', B', C', D'): 0.008 ms

enqueue copy imageC'&D' to bufferC&D: 0.044 ms
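The post does not say how these times were taken. One common way to get per-call host-side times like the ones above is to wrap each call with a monotonic clock, sketched below in C. The helpers `elapsed_ms` and `time_call_ms` and the stand-in workload `busy_work` are hypothetical names for this example. Note that this measures only how long the call takes to return on the host; for a non-blocking enqueue call that is not the same as the time the command takes to execute on the device.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime */
#endif
#include <assert.h>
#include <time.h>

/* Difference (end - start) in milliseconds. */
static double elapsed_ms(const struct timespec *start,
                         const struct timespec *end)
{
    return (double)(end->tv_sec - start->tv_sec) * 1000.0
         + (double)(end->tv_nsec - start->tv_nsec) / 1e6;
}

/* Time a single call of fn(arg) on the monotonic clock.
   Returns the wall-clock time until fn returns, in milliseconds. */
static double time_call_ms(void (*fn)(void *), void *arg)
{
    struct timespec start, end;
    int rc = clock_gettime(CLOCK_MONOTONIC, &start);
    assert(rc == 0);
    fn(arg);
    rc = clock_gettime(CLOCK_MONOTONIC, &end);
    assert(rc == 0);
    (void)rc;
    return elapsed_ms(&start, &end);
}

/* Stand-in workload for the example; in a real test this would be a small
   wrapper around the enqueue call being measured. */
static void busy_work(void *arg)
{
    volatile unsigned long *counter = arg;
    for (int i = 0; i < 1000; ++i)
        *counter += 1;
}
```

Calling `time_call_ms(busy_work, &counter)` returns the host-side duration of one invocation; averaging over many iterations gives more stable numbers for calls in the 0.01 ms range.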