OpenCL* for CPU
Ask questions and share information on Intel® SDK for OpenCL™ Applications and OpenCL™ implementations for Intel® CPU.
This forum covers OpenCL* for CPU only. OpenCL* for GPU questions can be asked in the GPU Compute Software forum. Intel® FPGA SDK for OpenCL™ questions can be asked in the FPGA Intel® High Level Design forum.

image2d_t direct pixel access with vload/vstore


Hi all, I have a question regarding reading and writing image2d_t pixels and hope someone can post a solution.

I am using MediaSDK to decompress images. After decompression the picture (NV12) resides in an IDirect3DSurface9.
    cl_mem memY = clCreateFromDX9MediaSurfaceKHR(context, CL_MEM_READ_ONLY, CL_ADAPTER_D3D9EX_KHR, &surfaceIn, 0, &err);
    clEnqueueAcquireDX9MediaSurfacesKHR(queue, 1, &memY, 0, NULL, NULL); // takes a pointer to the cl_mem array
This gives me a cl_mem handle (of type image2d_t) that can be passed to my kernel:
    clSetKernelArg(m_kernel, 1, sizeof(cl_mem), (void*)&memY); // srcImg
Now it's possible to use it in my kernel
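    __constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
    __kernel void Dummy(__read_only image2d_t srcY)
    {
        int2 sCoord = (int2)(get_global_id(0) * 16, get_global_id(1)); // assumed: 16 pixels per work item
        uint pix[16];
        for (int i = 0; i < 16; i++) {
            float4 val = read_imagef(srcY, smp, sCoord + (int2)(i, 0));
            pix[i] = convert_uint(val.x * 255); // val.x is the Y value
        }
    }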
This works fine, but the performance of read_imagef (single pixel access) is very low.

As explained in the Sobel tutorial, I would like to access the pixels through a uchar pointer, like:

    __global uchar* pSrcImage;
    uint16 pix = convert_uint16(vload16(0, pSrcImage));
to read 16 pixels (16 bytes, i.e. 128 bits) in a single memory access from the Y plane of the NV12 surface. This is possible when I create a cl_mem with clCreateBuffer(), but I did not find a way to get access to the image2d_t data. The only way I found to read the pixels from an image2d_t is read_imagef(), which is very slow.
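For reference, the addressing behind that vload16 call can be sketched in plain C. This is a CPU-side emulation, not OpenCL code: vload16_u8 and load_y_row16 are hypothetical helper names, and the pitch parameter is an assumption about how the Y plane would be laid out if it were available as a flat buffer. It relies only on the documented rule that vloadn(offset, p) reads n elements starting at p + offset * n.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Emulates convert_uint16(vload16(offset, p)) for a uchar pointer:
   reads 16 consecutive bytes starting at p + offset * 16 and widens
   each one to a 32-bit unsigned value. */
static void vload16_u8(size_t offset, const uint8_t *p, uint32_t out[16])
{
    assert(p != NULL && out != NULL);
    for (int i = 0; i < 16; i++)
        out[i] = p[offset * 16 + i];
}

/* Hypothetical helper: loads 16 luma pixels starting at (x, y) from a
   flat Y plane whose rows are pitch bytes apart (pitch >= width).
   One Y pixel is one byte, so 16 pixels are 16 bytes (128 bits). */
static void load_y_row16(const uint8_t *y_plane, size_t pitch,
                         size_t x, size_t y, uint32_t out[16])
{
    /* Offset 0 with an adjusted base pointer, so x need not be a
       multiple of 16. */
    vload16_u8(0, y_plane + y * pitch + x, out);
}
```

In a kernel the same idea would be a single vload16(0, pSrcImage + y * pitch + x), which is why a flat buffer is attractive here.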

My questions are:

How can I read the pixels of an image2d_t with vload() ?


Can I convert the DirectX surface to a cl_mem which is a "flat" buffer, and not a image2d_t?

Thanks for any help


1 Reply

Hi Carsten,

Please take a look at the cl_intel_subgroups extension, specifically the intel_sub_group_block_read functions:

    uint  intel_sub_group_block_read(  image2d_t image, int2 byte_coord )
    uint2 intel_sub_group_block_read2( image2d_t image, int2 byte_coord )
    uint4 intel_sub_group_block_read4( image2d_t image, int2 byte_coord )
    uint8 intel_sub_group_block_read8( image2d_t image, int2 byte_coord )

    Reads 1, 2, 4, or 8 uints of data for each work item in the subgroup
    from the specified image at the specified coordinate as a block
    operation.  Note that the coordinate is a byte coordinate, not an
    image element coordinate.  Also note that the image data is read
    without format conversion, so each work item may read multiple image
    elements (for images with element size smaller than 32-bits).

    The data is read row-by-row, so the first value read is from the row
    specified in the y-component of the provided byte_coord, the second
    value is read from the y-component of the provided byte_coord plus
    one, etc.

So, if you don't mind Intel-specific extensions and your targeted hardware supports them, you may try that one. You may also try changing the axis along which your read_imagef accesses run to see if that improves performance.
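To make the "byte coordinate" and "multiple image elements" remarks concrete, here is a small CPU-side emulation in plain C. It is only a sketch of my reading of the extension text, not the extension itself: block_read_emul and unpack4 are hypothetical names, the pitch parameter stands in for the image row stride, and the layout (the work item with subgroup-local index lane receives the 4 bytes at byte_coord.x + 4 * lane) is an assumption you should check against the cl_intel_subgroups specification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* CPU-side emulation of what one work item receives from
   intel_sub_group_block_read(image, (int2)(byte_x, row)).
   Assumption: work item `lane` gets the 4 bytes at byte_x + 4 * lane. */
static uint32_t block_read_emul(const uint8_t *img, size_t pitch,
                                size_t byte_x, size_t row, unsigned lane)
{
    uint32_t v;
    assert(img != NULL);
    memcpy(&v, img + row * pitch + byte_x + 4u * lane, sizeof v);
    return v;
}

/* Splits one uint into four 8-bit pixels, like as_uchar4() in OpenCL C.
   memcpy keeps the in-memory byte order, so px[0] is the leftmost pixel. */
static void unpack4(uint32_t v, uint8_t px[4])
{
    memcpy(px, &v, sizeof v);
}
```

Under that assumption, each work item gets four 8-bit Y pixels per uint, so a subgroup of 16 work items would cover 64 pixels of one row with a single intel_sub_group_block_read call, and the 2, 4, and 8 variants add further rows.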
