Hi,
We found that the performance of the same kernel varies dramatically depending on how the input image is created. With the attached test tool:
- if the input image is created directly from a host pointer, the performance is good, e.g. for an 8K x 8K input image:
- ./blockread
- Average kernel 2.033509 ms
- if the input image is created from a buffer object (which is itself created from the same host pointer), the performance drops considerably for the same 8K x 8K image:
- ./blockread -b
- Average kernel 3.763424 ms
The buffer pitch and base address are both aligned to 4K, so I'm not sure why the performance difference is so large.
The code snippet for image creation is listed below:
if (create_image_from_buf) {
    // Path 1: wrap the host memory in a buffer, then create the image on top of that buffer.
    buf_from_hostptr = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, src_size, src_ptr, &errNum);
    if (buf_from_hostptr == 0) {
        printf("clCreateBuffer failed \n");
        exit(1);
    }
    desc.buffer = buf_from_hostptr; // flags inherited from buffer
    img_from_buf = clCreateImage(context, 0, &format, &desc, NULL, &errNum);
    if (img_from_buf == 0) {
        printf("clCreateImage failed \n");
        exit(1);
    }
} else {
    // Path 2: create the image directly from the host pointer.
    img_from_hostptr = clCreateImage(context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, &format, &desc, src_ptr, &errNum);
    if (img_from_hostptr == NULL) {
        std::cerr << "Error creating memory objects." << std::endl;
        return false;
    }
}
Thanks
-Austin
If you create an image directly from a host pointer, it isn't zero copy. During initialization the driver copies the data into a tiled format that better matches the HW design. There is some overhead, but it is a one-time cost.
As your test shows, this one-time copy overhead can often be less expensive overall than linear access. Data access remains linear when you skip the data layout update (copy) by calling clCreateBuffer first and then creating an image directly from that buffer. Here the image data is still linear, like the original buffer, which is less efficient at each access.
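To make the linear versus tiled difference concrete, here is a minimal sketch of how far apart two vertically adjacent pixels end up in memory under each layout. The 4 x 4 tile size and the function names are hypothetical and chosen only for illustration; the actual tiled layout used by the driver is not described in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tile dimensions, for illustration only. */
#define TILE_W 4
#define TILE_H 4

/* Byte offset of pixel (x, y) in a row-major (linear) image. */
static size_t linear_offset(size_t x, size_t y, size_t pitch_bytes, size_t bpp)
{
    return y * pitch_bytes + x * bpp;
}

/* Byte offset of pixel (x, y) when the image is stored as TILE_W x TILE_H
 * tiles: tiles are laid out row-major, and pixels inside each tile are
 * row-major as well. The width must be a multiple of TILE_W. */
static size_t tiled_offset(size_t x, size_t y, size_t width, size_t bpp)
{
    assert(width % TILE_W == 0);
    size_t tiles_per_row = width / TILE_W;
    size_t tile_index = (y / TILE_H) * tiles_per_row + (x / TILE_W);
    size_t in_tile = (y % TILE_H) * TILE_W + (x % TILE_W);
    return (tile_index * TILE_W * TILE_H + in_tile) * bpp;
}
```

For an 8192-pixel-wide image with 4 bytes per pixel, pixels (0, 0) and (0, 1) are 32768 bytes apart in the linear layout but only 16 bytes apart in this toy tiled layout, so a kernel that reads 2D blocks touches far fewer distinct memory regions.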
Hi, Jeffrey,
One more question, about the pitch alignment requirement for clCreateImage. It looks like creating an image from a buffer object places more restrictions on the pitch alignment. According to clinfo, the pitch alignment is 4 bytes:
-----------------------------------------------------------------------
Image support Yes
Base address alignment for 2D image buffers 4 bytes
Pitch alignment for 2D image buffers 4 bytes
-----------------------------------------------------------------------
What actually happens (using a 4x4 image as the example, where the pitch is 4 bytes):
- if the input image is created directly from a host pointer, clCreateImage succeeds
- if the input image is created from a buffer object (which is itself created from the same host pointer), clCreateImage fails with error number -39
- -39 is CL_INVALID_IMAGE_FORMAT_DESCRIPTOR (from the spec: if a 2D image is created from a buffer and the row pitch and base address alignment does not follow the rules described for creating a 2D image from a buffer)
How can this be explained? What's the pitch alignment requirement for 2?
Jeffrey M. (Intel) wrote:
As your test shows, this one time copy overhead can often be less expensive overall than linear access. Data access remains linear when you skip the data layout update (copy) by doing clCreateBuffer first then create an image directly using that buffer. Here the image data is still linear, like the original buffer, which is less efficient at each access.
Thanks, that is a very convincing explanation.
For "clCreateBuffer->clCreateImage", you mentioned that the difference comes from "skip the data layout update". Do you mean that if I follow "clCreateBuffer->clEnqueueWrite/ReadBuffer->clCreateImage2D", it will show the same performance behavior as "clCreateImage from HOST_PTR"?
Jeffrey M. (Intel) wrote:
skip the data layout update (copy)
Thanks
-Austin
The behavior you're seeing is also related to the driver implementation. When you create an image with a copy the driver can help with alignment and padding while it is converting your data to tiled layout. This approach can have better performance and fewer restrictions.
However, when you skip the copy all of the rules in the spec must be enforced. This is why you see an invalid format descriptor error with your "-b" case for the same parameters allowed by the first scenario.
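As a rough illustration of checking those rules on the host before calling clCreateImage on a buffer, here is a minimal sketch. It assumes the values returned for CL_DEVICE_IMAGE_PITCH_ALIGNMENT and CL_DEVICE_IMAGE_BASE_ADDRESS_ALIGNMENT are expressed in pixels, which is my reading of the OpenCL 2.0 spec and differs from the "bytes" label in the clinfo output quoted above, so please verify it against your driver. The helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round value up to the next multiple of a power-of-two alignment. */
static size_t align_up(size_t value, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (value + alignment - 1) & ~(alignment - 1);
}

/* Smallest row pitch in bytes that satisfies the pitch alignment.
 * ASSUMPTION: pitch_align_pixels is counted in pixels, not bytes. */
static size_t min_row_pitch_bytes(size_t width_pixels, size_t pixel_bytes,
                                  size_t pitch_align_pixels)
{
    return align_up(width_pixels, pitch_align_pixels) * pixel_bytes;
}

/* Returns 1 if the base address and row pitch look acceptable for an image
 * created from a buffer, 0 otherwise.
 * ASSUMPTION: both alignment values are counted in pixels. */
static int layout_is_valid(uintptr_t base_address, size_t row_pitch_bytes,
                           size_t width_pixels, size_t pixel_bytes,
                           size_t pitch_align_pixels, size_t base_align_pixels)
{
    if (row_pitch_bytes < width_pixels * pixel_bytes)
        return 0; /* a row does not even fit in the pitch */
    if (row_pitch_bytes % (pitch_align_pixels * pixel_bytes) != 0)
        return 0; /* pitch is not a multiple of the alignment unit */
    if (base_address % (base_align_pixels * pixel_bytes) != 0)
        return 0; /* base address is misaligned */
    return 1;
}
```

Under that assumption, a 4-pixel-wide image with a 4-byte pixel format needs a row pitch of at least 16 bytes, so a 4-byte pitch would be rejected. Whether this is what triggers the -39 in the "-b" case depends on the image format and on the driver, neither of which is confirmed in this thread.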
Jeffrey, is there a way to enable copy/tile format for "clCreateBuffer->clCreateImage"?
Jeffrey M. (Intel) wrote:
The behavior you're seeing is also related to the driver implementation. When you create an image with a copy the driver can help with alignment and padding while it is converting your data to tiled layout. This approach can have better performance and fewer restrictions.
However, when you skip the copy all of the rules in the spec must be enforced. This is why you see an invalid format descriptor error with your "-b" case for the same parameters allowed by the first scenario.
When you initialize the image this way, it forces the image data to remain exactly as it is in the buffer. You are specifying zero copy. If this is what you want, use the clCreateBuffer->clCreateImage path. If you want copy/tile, just use clCreateImage.
Note, you may also find clEnqueueCopyBufferToImage() to be useful, if you want to explicitly copy data from a buffer memory object to an already existing image memory object.
https://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/clEnqueueCopyBufferToImage.html