Hi Everyone,
Does anyone know how to handle alpha blending efficiently with OpenCL? I have ported the host code to OpenCL, but the performance came out very poor. Does anybody have any advice?
Can you provide a bit more detail about what you are trying to do?
Alpha blending is a fairly generic term, and I can think of quite a few situations where it applies.
Laurent
Hi Laurent,
I am working on a hardware-accelerated video effect project using OpenCL. The following is the OpenCL kernel code for alpha blending. Its performance is poor compared with the AVX code. Can you give me more suggestions, or guide me in optimizing the OpenCL kernel?
uint applyAlphaToPixel(uint uPixel, uint uAlpha)
{
    uint uColor = uPixel;
    /* blue and red channels: (c * a + 128) >> 8, two 16-bit lanes at once */
    uPixel &= 0x00FF00FF;
    uPixel *= uAlpha;
    uPixel += 0x00800080;
    uPixel &= 0xFF00FF00;
    uPixel >>= 8;
    /* green and alpha channels */
    uColor &= 0xFF00FF00;
    uColor >>= 8;
    uColor *= uAlpha;
    uColor += 0x00800080;
    uColor &= 0xFF00FF00;
    uColor |= uPixel;
    return uColor;
}

__kernel void alphaBlend_kernel(__global uint* pTarget, __global uint* pSource, const uint width, const uint height)
{
    uint gx = get_global_id(0);
    uint gy = get_global_id(1);
    if (gx >= width || gy >= height)
        return;
    uint id = gy * width + gx;
    uint tc = pTarget[id];
    uint sc = pSource[id];
    uint sa = sc >> 24;   /* source alpha */
    if (sa == 0x0) {
        pTarget[id] = tc;
    } else if (sa == 0xFF) {
        pTarget[id] = sc;
    } else {
        sc = applyAlphaToPixel(sc, sa);
        tc = applyAlphaToPixel(tc, (0xFF ^ sa));
        pTarget[id] = sc + tc;
    }
}
How about using uchar4 instead of uint and shifts? You should be able to treat your 4 channels at the same time.
Also, reading a uchar4 into a float4, doing the alpha blending in float, and storing back to uchar4 is likely to be faster, since the units are optimized for floats.
You can then remove the shifts for the tests too, by directly checking sc.w against 0 and 1.0f (once you have it declared as uchar4 and normalized the colors by multiplying them by 1.0f/255.0f).
By the way, are you sure about the 0xFF ^ sa? Should it not be 0xFF - sa (or 1.0f - sc.w in the float case)?
Another option is to use textures for that, with point sampling (they would convert the data directly from uchar4 to normalized float4 for you, and back).
However, you would need a third texture for the destination, since you can't have read/write access on the same texture. (That is, if you are using the HD4000; on the CPU I would actually stick with the code above, but still use a uchar4.)
That would probably be much faster, but only with the HD4000 as the target.
Laurent
Hi Laurent,
I have used uchar4 instead of uint and tested it with the Intel Kernel Builder. Performance improved, but it still cannot catch up with the Intel AVX counterpart using OpenMP. As you say, using a texture will require a third texture, and memory will have to be copied between the buffer and the texture; can using a texture still yield a performance gain?
The new OpenCL kernel code is as follows. Can you see what improvements can be made?
__kernel void alphaBlend_kernel(__global uchar4* pTarget, __global uchar4* pSource, const uint workSize)
{
    uint id = get_global_id(0);
    if (id >= workSize)
        return;
    uchar4 ut = pTarget[id];
    uchar4 us = pSource[id];
    float4 ft = convert_float4(ut);
    float4 fs = convert_float4(us);
    float fa = (255.0f - fs.w) / 255.0f;
    pTarget[id] = convert_uchar4(mix(fs, ft, fa));
}
The code looks a lot simpler now :)
What's your target device, CPU or GPU? For the CPU, I have serious doubts that you will be able to match an AVX implementation done with intrinsics or assembly code. Using textures on the CPU will probably not bring you any benefit, for the simple reason that there are no dedicated units for them.
For the GPU, yes, I would use a texture, and I am pretty sure it would be faster. You would also get the uchar4-to-float4 conversion for free in that case.
Hi Laurent,
My target device is the GPU. I have rewritten the kernel using an image and tested it with the Intel Kernel Builder on an i7-3770K. Compared with the buffer-based kernel, I got a performance gain, but not much. However, with the image implementation I have to copy memory from the buffer to the image before the alpha blend kernel starts, and copy the image back to the buffer after it finishes. So when I put the two implementations into the application and compare them, the buffer implementation is much faster.
The image implementation of the kernel is as follows. Do you have any suggestions?
const sampler_t iSampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_NONE | CLK_FILTER_NEAREST;

__kernel void alphaBlend_kernel(__write_only image2d_t iTarget, __read_only image2d_t iSource,
                                __read_only image2d_t iBuffer, uint width, uint height)
{
    uint gx = get_global_id(0);
    uint gy = get_global_id(1);
    if (gx >= width || gy >= height)
        return;
    int2 uv = (int2)(gx, gy);
    float4 sc = read_imagef(iSource, iSampler, uv);
    float4 bc = read_imagef(iBuffer, iSampler, uv);
    float4 tc = mix(bc, sc, sc.w);
    write_imagef(iTarget, uv, tc);
}
Looks good to me. I doubt there is more to do on the kernel side.
Well, if you are using the HD4000, then one big benefit can come from allocating your buffer properly on the CPU side.
If the buffer is aligned to a 4K page (VirtualAlloc can do that easily for you), then you can share the memory with the GPU without doing the extra copies. This is the fastest way to do it. However, that is not always possible, depending on how your images are allocated.
Look into the documentation for clCreateImage with CL_MEM_USE_HOST_PTR.
One thing I have never tried is creating two images using the same host pointer :) You could have one as the source and one as the destination and avoid the intermediate buffer completely.
As far as I can see, there is nothing in the documentation saying this is forbidden, but it might mess with the cache. In your case that should be fine, though, since you are not reading back the data written by the kernel.
Someone from Intel could probably shed some light on that possibility.
Laurent
Oh, and another thing: make sure to try various tile sizes for your kernel launches.
16x16 or 16x8 is usually quite good in my experience. You might even be able to use the Intel Offline Kernel Compiler to search for the best size for you, depending on your input formats.
This can make a kernel run 2-3 times faster.
Hi Laurent,
I have tested the kernel with various work-group sizes to find the best performance. For the image-based solution, the bottleneck lies in the memory copies, using clEnqueueCopyBufferToImage and clEnqueueCopyImageToBuffer to do the transfers. To my knowledge, it seems impossible to share memory between a buffer (clCreateBuffer) and an image (clCreateImage). Is that right?
I know that the 3D rendering APIs (OpenGL and Direct3D) can do alpha blending in the hardware rendering pipeline. Is there any way to use this hardware capability, for example through a built-in kernel or anything else?
Well, yes, you could just use OpenGL or D3D to do that with a pixel shader. That would be a question for the OpenGL or D3D forums, though.
I don't know where your images are coming from, but there are ways in OpenCL, by using CL_MEM_USE_HOST_PTR, to avoid the copies you describe (for both images and buffers). I use that all the time, since it is one of the big advantages of the HD4000.
What I am talking about here is sharing memory between CPU memory and GPU OpenCL images/buffers. There should be a few samples on the web showing how to use that. I think Intel has a sample for it too.
OpenCL 1.2 code:

cl_image_desc desc;
...   // set up your desc
cl_image_format format;
...   // set up your format
cl_int result;
m_device_image = (void*)clCreateImage(
    (cl_context)gpu_context,                   // your OpenCL context
    CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
    &format,
    &desc,
    buffer,  // the memory pointer to share; it should be aligned on a 4KB boundary
    &result
);
Hi Laurent,
My target devices are all OpenCL-enabled GPUs; the Intel HD 4000 is only one of them. As you say, sharing memory via CL_MEM_USE_HOST_PTR is one of the big advantages of the Intel HD 4000. What about the performance on discrete GPUs (NVIDIA and AMD)? Will it lead to much performance degradation?
I know how to use OpenGL or Direct3D to do alpha blending. What I don't understand is why I can't use OpenCL to do it, since the hardware has the capability. Alpha blending is such a common operation in many GPGPU-related areas. Maybe it should be exposed as an OpenCL built-in kernel, or in some other way, by the GPU vendors.
By the way, thank you for giving me so many suggestions. I really appreciate it.
Well, about NVIDIA and AMD GPUs: they simply don't support CL_MEM_USE_HOST_PTR as zero-copy (except probably AMD APUs, since they are like the HD4000 and not add-on GPUs). So your code will have to use the copy path instead, which means it will be a bit slower.
However, we are then talking about PCIe Gen2 or Gen3, and texture transfers are really fast; but since your kernel is so simple, that might indeed become a bottleneck.
Well, OpenCL is for computing; it is not OpenGL :) So it makes little sense to expose alpha blending in it, just like Z-buffer support (although depth buffers have now been added as an extension). You don't want OpenCL to become just another OpenGL ;-)
On the PC, there are a lot of things that make more sense in DirectX or OpenGL shaders if you are really targeting rendering.
Good luck with your project.