I am confused about when zero-copy buffers actually apply. I use the CL_MEM_USE_HOST_PTR flag to create a zero-copy 2D image. The host buffer is allocated on a 4096-byte boundary and its total size is a multiple of 64 bytes, so it should be a zero-copy buffer. In my application I need to write data to this buffer on every loop iteration, so I map the buffer for writing. The mapping takes more time than I expected, even more than a direct write. The pseudo-code is below.
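posix_memalign(&host_ptr, 4096, size);
/* context, queue, format, desc, origin and region are set up elsewhere */
image = clCreateImage(context, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, &format, &desc, host_ptr, &err);
for (;;)
{
    /* blocking map for write */
    ptr = clEnqueueMapImage(queue, image, CL_TRUE, CL_MAP_WRITE, origin, region, &row_pitch, NULL, 0, NULL, NULL, &err);
    /* write new data to ptr */
    clEnqueueUnmapMemObject(queue, image, ptr, 0, NULL, NULL);
    ...
}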
I use an image2d object instead of a buffer object; could this be the reason for the inefficiency? Also, I noticed that when the buffer is small I don't even need the map operation: I can operate directly on the host pointer and the result is still correct, which is odd. Could you give me some clues about what may be going wrong? Thank you.
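The two conditions named in the question (a base address on a 4096-byte boundary and a total size that is a multiple of 64 bytes) can be checked on the host before the memory object is created. The following is only a sketch: the helper names `zc_round_up_size` and `zc_is_eligible` are hypothetical, and the constants are simply the values quoted in the post. Passing this check says nothing about whether an image object created from the pointer avoids a copy.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed requirements, taken from the post: base address on a
   4096-byte boundary and total size a multiple of 64 bytes. */
#define ZC_ALIGNMENT    4096u
#define ZC_SIZE_GRANULE 64u

/* Round a requested size up to the next multiple of 64 bytes. */
static size_t zc_round_up_size(size_t size)
{
    return (size + ZC_SIZE_GRANULE - 1) / ZC_SIZE_GRANULE * ZC_SIZE_GRANULE;
}

/* Return true if the pointer and size satisfy both conditions. */
static bool zc_is_eligible(const void *ptr, size_t size)
{
    return ptr != NULL
        && ((uintptr_t)ptr % ZC_ALIGNMENT) == 0
        && size != 0
        && (size % ZC_SIZE_GRANULE) == 0;
}
```

One possible use is to pass `zc_round_up_size(size)` to posix_memalign and to check the result with `zc_is_eligible` before handing the pointer to the runtime with CL_MEM_USE_HOST_PTR.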
Hi Fan,
Yes, using 2D images always results in a copy to the device, since images need to be tiled in device memory so that samplers and other fixed-function hardware can work with them. See https://software.intel.com/en-us/articles/using-image2d-from-buffer-extension for more information and a sample showing how to use the cl_khr_image2d_from_buffer extension.
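Before relying on that extension, it may be worth checking whether the device reports it. The sketch below assumes the space-separated extension string has already been fetched with clGetDeviceInfo and CL_DEVICE_EXTENSIONS; the helper name `has_extension` is hypothetical. It matches whole tokens only, so a longer name that merely starts with the requested one is not accepted.

```c
#include <stdbool.h>
#include <string.h>

/* Return true if `name` appears as a whole token in the
   space-separated extension list `ext_list`. */
static bool has_extension(const char *ext_list, const char *name)
{
    size_t name_len = strlen(name);
    const char *p = ext_list;

    if (name_len == 0)
        return false;

    while ((p = strstr(p, name)) != NULL) {
        /* A real match starts at the list start or after a space... */
        bool starts_token = (p == ext_list) || (p[-1] == ' ');
        /* ...and ends at a space or at the end of the list. */
        bool ends_token = (p[name_len] == ' ') || (p[name_len] == '\0');
        if (starts_token && ends_token)
            return true;
        p += name_len; /* skip this partial match and keep searching */
    }
    return false;
}
```

For example, `has_extension(extensions, "cl_khr_image2d_from_buffer")` could decide between creating the image on top of a buffer and falling back to explicit copies.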
Hi, Robert,
Thanks for clearing that up. Unfortunately, I didn't find the "cl_khr_image2d_from_buffer" extension on my device, so I created a zero-copy buffer as an interchange buffer between host and device, and I use clEnqueueCopyBufferToImage and clEnqueueCopyImageToBuffer to copy data to or from the image object. These APIs don't take a blocking flag as an argument, and since both are enqueue APIs, I assumed the copy is non-blocking. But my test showed otherwise. My steps are below:
1. Enqueue kernel A
2. Enqueue kernel B
3. clEnqueueCopyImageToBuffer(imageA, bufferA)
4. clEnqueueCopyImageToBuffer(imageB, bufferB)
5. clFlush
6. Other CPU work
7. clFinish
8. Map bufferA and bufferB for reading
Step 3 takes a lot more time than the other enqueue calls. Does that mean the copy call blocks until kernels A and B finish executing and bufferA finishes copying?
Technically, clEnqueueCopyImageToBuffer should be a non-blocking call, so it shouldn't take long to return.
Maybe try adding an event dependency between steps 1 and 3 and between steps 2 and 4, then see whether it has any impact on enqueue speed.
If you have a small reproducer, that would also help. What processor, OS, and driver version are you using?
Thanks!
I tried events; they didn't improve the enqueue speed. Here is the info you asked for:
OS:
CentOS 7.1
kernel 3.10.0-229.1.2.47109.MSSr1.el7.centos.x86_64
CPU:
Intel(R) Core(TM) i7-4860EQ CPU @ 1.80GHz
GPU:
CL_DEVICE_NAME: Intel(R) HD Graphics 5200
CL_DEVICE_AVAILABLE: 1
CL_DEVICE_VENDOR: Intel(R) Corporation
CL_DEVICE_PROFILE: FULL_PROFILE
CL_DEVICE_VERSION: OpenCL 1.2
CL_DRIVER_VERSION: 16.4.4.47109
CL_DEVICE_OPENCL_C_VERSION: OpenCL C 1.2
CL_DEVICE_MAX_COMPUTE_UNITS: 40
Here are the elapsed times of the API calls:
map/memcpy/unmap zero-copy bufferA&B 0.010ms
enqueue copy bufferA&B to imageA'&B' 0.047ms
enqueue kernel(A', B', C', D') 0.008ms
enqueue copy imageC'&D' to bufferC&D 0.044ms
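Figures like these are host-side wall-clock times around each call. A minimal sketch of one way to take such a measurement follows, assuming a POSIX system with clock_gettime; the names `timed_fn`, `time_call_ms`, and `count_call` are hypothetical, and `count_call` is only a stand-in for the enqueue call being measured. This captures how long the call takes to return on the host, not how long the command runs on the device; device-side timing would need event profiling (clGetEventProfilingInfo on a queue created with CL_QUEUE_PROFILING_ENABLE).

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Hypothetical callback type: wraps whatever call is being timed. */
typedef void (*timed_fn)(void *user_data);

/* Convert a pair of timespec values to elapsed milliseconds. */
static double timespec_diff_ms(struct timespec start, struct timespec end)
{
    return (double)(end.tv_sec - start.tv_sec) * 1000.0
         + (double)(end.tv_nsec - start.tv_nsec) / 1.0e6;
}

/* Return the host-side wall-clock time, in ms, spent inside fn(user_data). */
static double time_call_ms(timed_fn fn, void *user_data)
{
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    fn(user_data);
    clock_gettime(CLOCK_MONOTONIC, &end);
    return timespec_diff_ms(start, end);
}

/* Stand-in for an enqueue call: counts how often it was invoked. */
static void count_call(void *user_data)
{
    int *calls = (int *)user_data;
    ++*calls;
}
```

In a real measurement the callback would wrap a single enqueue call, so that a slow return from one call (such as the copy in step 3) can be separated from the others.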
