Big kernel performance difference between the image created from HOST_PTR and the image created from Buffer Object

Shengquan_Y_Intel · ‎11-17-2016

Hi,

We found that the performance of the same kernel varies dramatically depending on how the input image is created. With the attached test tool:

If the input image is created directly from a host pointer, the performance is good, e.g. for an 8K x 8K input image:
- ./blockread
- Average kernel 2.033509 ms
If the input image is created from a buffer object (which is itself created from the same host pointer), the performance drops significantly for the same 8K x 8K input:
- ./blockread -b
- Average kernel 3.763424 ms

The buffer pitch and base address are both aligned to 4K, so I am not sure why the performance difference is so large.

The code snippet for image creation is listed below:

    if (create_image_from_buf) {
        buf_from_hostptr = clCreateBuffer(context, CL_MEM_READ_WRITE| CL_MEM_USE_HOST_PTR, src_size, src_ptr, &errNum);

        if (buf_from_hostptr == 0) {
            printf("clCreateBuffer failed \n");
            exit(1);
        }
        desc.buffer = buf_from_hostptr;

        // flags inherited from buffer
        img_from_buf = clCreateImage(context,0, &format, &desc,NULL,&errNum);

        if (img_from_buf == 0) {
            printf("clCreateImage failed \n");
            exit(1);
        }
    } else {
        img_from_hostptr = clCreateImage(context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, &format, &desc, src_ptr, &errNum);
        if (img_from_hostptr == NULL)
        {
            std::cerr << "Error creating memory objects." << std::endl;
            return false;
        }
    }

Thanks

-Austin

Jeffrey_M_Intel1 · ‎11-18-2016

If you create an image directly from a host pointer, it isn't zero copy. During initialization the driver copies the data into a tiled format to better match the HW design. There is some overhead, but it is a one-time cost.

As your test shows, this one-time copy overhead can often be less expensive overall than linear access. Data access remains linear when you skip the data layout update (copy) by calling clCreateBuffer first and then creating an image directly from that buffer. Here the image data is still linear, like the original buffer, which is less efficient at each access.
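To make the linear versus tiled distinction concrete, here is a minimal sketch in plain C. It is not the driver's actual layout: the 8x8 tile size, the 64-byte cache line, and the helper names (`linear_offset`, `tiled_offset`, `lines_touched`) are all assumptions chosen for illustration. It counts how many distinct cache lines a small 2D block read touches under each layout.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64u  /* assumed cache-line size in bytes */
#define TILE_W     8u   /* hypothetical tile width in pixels */
#define TILE_H     8u   /* hypothetical tile height in pixels */
#define MAX_LINES  256u /* capacity of the scratch list below */

/* Byte offset of pixel (x, y) in a row-major (linear) image. */
static size_t linear_offset(size_t x, size_t y, size_t width, size_t bpp)
{
    return (y * width + x) * bpp;
}

/* Byte offset of pixel (x, y) in a simple tiled layout: tiles are stored
 * row-major, and pixels are row-major inside each tile. This is an
 * illustrative layout only, not a real hardware tiling format. */
static size_t tiled_offset(size_t x, size_t y, size_t width, size_t bpp)
{
    assert(width % TILE_W == 0); /* sketch assumes whole tiles per row */
    size_t tiles_per_row = width / TILE_W;
    size_t tile_index = (y / TILE_H) * tiles_per_row + (x / TILE_W);
    size_t in_tile = (y % TILE_H) * TILE_W + (x % TILE_W);
    return (tile_index * TILE_W * TILE_H + in_tile) * bpp;
}

typedef size_t (*offset_fn)(size_t x, size_t y, size_t width, size_t bpp);

/* Count the distinct cache lines touched by the first byte of every pixel
 * in a bw x bh block whose top-left corner is (x0, y0). Tracks at most
 * MAX_LINES distinct lines, which is enough for small blocks. */
static size_t lines_touched(offset_fn off, size_t x0, size_t y0,
                            size_t bw, size_t bh, size_t width, size_t bpp)
{
    size_t seen[MAX_LINES];
    size_t count = 0;

    for (size_t y = 0; y < bh; ++y) {
        for (size_t x = 0; x < bw; ++x) {
            size_t line = off(x0 + x, y0 + y, width, bpp) / CACHE_LINE;
            size_t i = 0;
            while (i < count && seen[i] != line)
                ++i;
            if (i == count && count < MAX_LINES)
                seen[count++] = line;
        }
    }
    return count;
}
```

Under these assumptions, an 8x8 block of 1-byte pixels at the origin of an 8192-pixel-wide image touches 8 cache lines in the linear layout (one per row, each 8192 bytes apart) but only 1 in the tiled layout, which is the kind of per-access locality difference described above.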

Shengquan_Y_Intel · ‎11-21-2016

Hi Jeffrey,

One more question about the pitch alignment requirement for clCreateImage. It looks like clCreateImage from a buffer object has more restrictions on the pitch alignment. According to clinfo, the pitch alignment is 4 bytes:

-----------------------------------------------------------------------

Image support                                   Yes
    Base address alignment for 2D image buffers   4 bytes
    Pitch alignment for 2D image buffers          4 bytes

-----------------------------------------------------------------------

What actually happens (using a 4x4 image as the example, where the pitch is 4 bytes):

1. If the input image is created directly from a host pointer, clCreateImage succeeds.
2. If the input image is created from a buffer object (which is created from the same host pointer), clCreateImage fails with error number -39.
- -39 is CL_INVALID_IMAGE_FORMAT_DESCRIPTOR (from the spec: if a 2D image is created from a buffer and the row pitch and base address alignment does not follow the rules described for creating a 2D image from a buffer).

How can this behavior be explained? What is the pitch alignment requirement for case 2?

Shengquan_Y_Intel · ‎11-21-2016

Thanks, that is a very convincing explanation.

For "clCreateBuffer->clCreateImage", you mentioned that the difference comes from skipping the data layout update. Do you mean that if I follow "clCreateBuffer->clEnqueueWrite/ReadBuffer->clCreateImage2D", it will have the same performance behavior as "clCreateImage from HOST_PTR"?

Jeffrey M. (Intel) wrote:

skip the data layout update (copy)

Thanks

-Austin

Jeffrey_M_Intel1 · ‎11-22-2016

The behavior you're seeing is also related to the driver implementation. When you create an image with a copy, the driver can help with alignment and padding while it is converting your data to a tiled layout. This approach can have better performance and fewer restrictions.

However, when you skip the copy, all of the rules in the spec must be enforced. This is why you see an invalid format descriptor error with your "-b" case for the same parameters that are allowed in the first scenario.
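As a rough illustration of the kind of rule that applies when the copy is skipped, the sketch below checks a row pitch and a host pointer against the device alignment values. It rests on an assumption worth verifying against your own headers and driver: as I read the OpenCL 2.0 specification, CL_DEVICE_IMAGE_PITCH_ALIGNMENT and CL_DEVICE_IMAGE_BASE_ADDRESS_ALIGNMENT are expressed in pixels rather than bytes, so the byte requirement scales with the pixel size. The helper names `row_pitch_ok` and `base_address_ok` are hypothetical, and a driver may apply additional constraints that this sketch does not model.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Check a row pitch for a 2D image created from a buffer.
 *   width_pixels       image width in pixels
 *   row_pitch_bytes    image_row_pitch from cl_image_desc (0 = tightly packed)
 *   pixel_size_bytes   size of one pixel for the chosen cl_image_format
 *   pitch_align_pixels value queried via CL_DEVICE_IMAGE_PITCH_ALIGNMENT,
 *                      assumed here to be in pixels */
static bool row_pitch_ok(size_t width_pixels, size_t row_pitch_bytes,
                         size_t pixel_size_bytes, size_t pitch_align_pixels)
{
    if (pixel_size_bytes == 0 || pitch_align_pixels == 0)
        return false;

    size_t min_pitch = width_pixels * pixel_size_bytes;
    if (row_pitch_bytes == 0)
        row_pitch_bytes = min_pitch;   /* a pitch of 0 means "computed" */
    if (row_pitch_bytes < min_pitch)
        return false;                  /* a row must hold a full line of pixels */

    /* Pitch must be a multiple of (alignment in pixels) * (bytes per pixel). */
    return row_pitch_bytes % (pitch_align_pixels * pixel_size_bytes) == 0;
}

/* Check the host pointer handed to clCreateBuffer with CL_MEM_USE_HOST_PTR.
 *   base_align_pixels  value queried via CL_DEVICE_IMAGE_BASE_ADDRESS_ALIGNMENT,
 *                      assumed here to be in pixels */
static bool base_address_ok(const void *host_ptr, size_t pixel_size_bytes,
                            size_t base_align_pixels)
{
    if (pixel_size_bytes == 0 || base_align_pixels == 0)
        return false;
    return (uintptr_t)host_ptr % (base_align_pixels * pixel_size_bytes) == 0;
}
```

For example, with an alignment of 4 pixels and 4-byte pixels, a 6-pixel-wide image with a tightly packed 24-byte pitch would not pass this check, while a pitch padded to 32 bytes would. Whether this is what triggers the -39 error in the 4x4 case depends on the image format used, which the thread does not state.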

Shengquan_Y_Intel · ‎11-22-2016

Jeffrey, is there a way to enable copy/tile format for "clCreateBuffer->clCreateImage"?

Jeffrey_M_Intel1 · ‎11-22-2016

When you initialize the image this way, it forces the image data to remain exactly as it is in the buffer. You are specifying zero copy. If this is what you want, use the clCreateBuffer->clCreateImage path. If you want copy/tile, just use clCreateImage.

Ben_A_Intel · ‎11-23-2016

Note, you may also find clEnqueueCopyBufferToImage() to be useful, if you want to explicitly copy data from a buffer memory object to an already existing image memory object.

https://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/clEnqueueCopyBufferToImage.html