OpenCL* for CPU

zero-copy didn't improve the copy performance

Fan_Z_2
Beginner

I'm confused about the usage scenario for zero-copy buffers. I use the CL_MEM_USE_HOST_PTR flag to create a zero-copy 2D image. The host buffer is allocated on a 4096-byte boundary and its total size is a multiple of 64 bytes, so it should be a zero-copy buffer. In my application I need to write data to this buffer on every loop iteration, so I map the buffer for writing. The mapping takes more time than I expected, even more than a direct write. The pseudocode is below.

posix_memalign(&host_ptr, 4096, size);
image = create_Image2d(CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, host_ptr);

for (;;)
{
    ptr = map_memobj(BLOCK_MAP, CL_MAP_WRITE, image);
    write new data to ptr;
    unmap(image, ptr);
    ...
}

I use an image2d object instead of a buffer object; could that be the reason for the inefficiency? Also, I noticed that when the buffer is small I don't even need the map operation: I can operate directly on the host pointer and the result is still correct, which seems odd. Could you give me some clues about what may be going wrong? Thank you.
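For readers who want to check the two host-side conditions mentioned above (base address on a 4096-byte boundary, size a multiple of 64 bytes), here is a minimal sketch in plain C. The names zc_round_up_size, zc_is_eligible and zc_alloc are hypothetical helpers made up for illustration, not part of OpenCL or any Intel SDK, and the two constants are simply the values quoted in this post.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* for posix_memalign */
#endif
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed values, taken from the conditions described in the post. */
#define ZC_ALIGNMENT    4096u  /* base address alignment in bytes */
#define ZC_SIZE_GRANULE 64u    /* size must be a multiple of this */

/* Round a requested size up to the next multiple of 64 bytes.
   (Hypothetical helper; does not guard against size_t overflow.) */
size_t zc_round_up_size(size_t size)
{
    return (size + ZC_SIZE_GRANULE - 1) / ZC_SIZE_GRANULE * ZC_SIZE_GRANULE;
}

/* Return 1 if the pointer and size meet both conditions, 0 otherwise. */
int zc_is_eligible(const void *ptr, size_t size)
{
    return ptr != NULL
        && size != 0
        && ((uintptr_t)ptr % ZC_ALIGNMENT) == 0
        && (size % ZC_SIZE_GRANULE) == 0;
}

/* Allocate host memory that satisfies both conditions.
   Returns NULL on failure; the padded size is stored in *actual.
   Release the memory with free(). */
void *zc_alloc(size_t requested, size_t *actual)
{
    void *ptr = NULL;
    size_t size = zc_round_up_size(requested);

    if (size == 0 || posix_memalign(&ptr, ZC_ALIGNMENT, size) != 0)
        return NULL;
    if (actual != NULL)
        *actual = size;
    return ptr;
}
```

Passing these checks only says the host allocation has the shape described here. It does not by itself show that the runtime avoids a copy for a given memory object, so it is still worth timing the map call.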

Robert_I_Intel
Employee

Hi Fan,

Yes, using 2D images always results in copying to the device, since images need to be tiled in device memory so that samplers and other fixed-function hardware can work with them. See https://software.intel.com/en-us/articles/using-image2d-from-buffer-extension for more info on the image2d-from-buffer extension and a sample showing how to use it.

Fan_Z_2
Beginner

Hi, Robert,

Thanks for clearing that up for me. Unfortunately, I didn't find the "cl_khr_image2d_from_buffer" extension on my device, so I created a zero-copy buffer as an interchange buffer between host and device, and I use the copy APIs clEnqueueCopyBufferToImage and clEnqueueCopyImageToBuffer to copy data to and from the image object. These APIs don't take a blocking flag as an argument, and since both are enqueue APIs, I assumed the copy operation is non-blocking. But my test showed otherwise. My code steps are below:

1. enqueue kernel A
2. enqueue kernel B
3. clEnqueueCopyImageToBuffer(imageA, bufferA)
4. clEnqueueCopyImageToBuffer(imageB, bufferB)
5. clFlush
6. other CPU work
7. clFinish
8. map bufferA and bufferB for read

Step 3 takes a lot more time than the other enqueue calls. Does that mean the copy API blocks until kernels A and B finish executing and bufferA finishes copying?


Robert_I_Intel
Employee

Technically, clEnqueueCopyImageToBuffer should be a non-blocking call, so you shouldn't have to wait a long time for it to return.

Maybe try adding an event dependency between steps 1 and 3 and between steps 2 and 4, then try again and see whether it has any impact on enqueue speed.

If you have a small reproducer code, that would also help. What processor, OS, driver version are you using?

Thanks!

Fan_Z_2
Beginner

I tried events; they didn't improve the enqueue speed. Here is the info you asked for:

OS:
    CentOS 7.1
    kernel 3.10.0-229.1.2.47109.MSSr1.el7.centos.x86_64

CPU:
    Intel(R) Core(TM) i7-4860EQ CPU @ 1.80GHz

GPU:
    CL_DEVICE_NAME: Intel(R) HD Graphics 5200
    CL_DEVICE_AVAILABLE: 1
    CL_DEVICE_VENDOR: Intel(R) Corporation
    CL_DEVICE_PROFILE: FULL_PROFILE
    CL_DEVICE_VERSION: OpenCL 1.2 
    CL_DRIVER_VERSION: 16.4.4.47109
    CL_DEVICE_OPENCL_C_VERSION: OpenCL C 1.2 
    CL_DEVICE_MAX_COMPUTE_UNITS: 40
 

Here are the elapsed times of the API calls:

Operation                                   Elapsed time
map/memcpy/unmap zero-copy bufferA&B        0.010 ms
enqueue copy bufferA&B to imageA'&B'        0.047 ms
enqueue kernel(A', B', C', D')              0.008 ms
enqueue copy imageC'&D' to bufferC&D        0.044 ms
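For anyone trying to reproduce numbers like these, here is a rough sketch of how the host-side time of a single call could be measured in plain C. The post does not say how the figures above were taken, so this is an assumption. The names elapsed_ms and time_call_ms are hypothetical helpers, and the callback stands in for whichever enqueue or map call is being timed.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* for clock_gettime */
#endif
#include <time.h>

/* Milliseconds between two clock readings (end minus start). */
double elapsed_ms(struct timespec start, struct timespec end)
{
    return (double)(end.tv_sec - start.tv_sec) * 1e3
         + (double)(end.tv_nsec - start.tv_nsec) / 1e6;
}

/* Hypothetical helper: wall-clock time the host spends inside one call.
   'call' wraps the operation being measured; 'arg' is passed through. */
double time_call_ms(void (*call)(void *), void *arg)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    call(arg);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return elapsed_ms(t0, t1);
}
```

A timer like this only captures how long the host waits for the call to return, not how long the device takes to execute the command. To see device-side timing, OpenCL event profiling (a queue created with CL_QUEUE_PROFILING_ENABLE, then clGetEventProfilingInfo on the command's event) would be the thing to look at.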
