Hi,
I seem to remember that we had a similar issue with colour conversion from YCbCr 4:2:0 to BGR32 in a DirectShow filter connected to a VMR9 renderer (which, as far as I know, uses Direct3D surfaces "underneath"). That is, the YCbCr 4:2:0 source image was colour converted directly into the DirectShow media buffer (i.e. the Direct3D surface). In this situation, we also experienced a significant slow-down compared to doing what you are doing, i.e. colour converting into a temporary buffer and then simply copying that to the output buffer. I cannot remember the IPP version or the precise setup details.
What's your setup? And have you tried either a previous version of IPP (assuming you are using one of the latest releases), or using a different "lower" optimization variant of the function (e.g. if optimal is v8, then try t7)?
One potential reason for this behaviour might be that the function actually reads some bytes from the destination image buffer during the copying/conversion. Reading back from a graphics card buffer in general has a detrimental effect on performance (reading is sloooow). Well, that would explain it, but I cannot imagine why these types of functions would need to read from the destination buffer. That does not make much sense. The operations are fairly simple and should only need to read from the source, do a bit of math (in the copying, there is no math) and write the result. No need to read back anything. But it may be due to some optimization choices that work better for CPU memory accesses. The fact that the BGR to YCbCr 4:2:2 function that you are also using does not exhibit the problem could be due to different optimizations (e.g. no reading).
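To make the workaround concrete, here is a minimal sketch in plain C of the "convert into a temp. buffer, then copy" approach. It does not use IPP: convert_to_bgr32 is a hypothetical stand-in for the real colour conversion (it just replicates a luma value into B, G and R), and copy_rows_write_only is a name I made up. The assumption is that a row-by-row memcpy only writes to the destination, which is what you want for a graphics card buffer, though that depends on the memcpy implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the real colour conversion (not an IPP call).
   Writes tightly packed BGR32 pixels into a temp buffer in system memory,
   where any read-back by the routine is cheap. */
void convert_to_bgr32(const uint8_t *luma, int width, int height, uint8_t *tmp)
{
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            uint8_t v = luma[(size_t)y * width + x];
            uint8_t *px = tmp + ((size_t)y * width + x) * 4;
            px[0] = v;   /* B */
            px[1] = v;   /* G */
            px[2] = v;   /* R */
            px[3] = 255; /* unused/alpha byte */
        }
    }
}

/* Copy the temp buffer to the output buffer one row at a time.
   dst_pitch is the destination row stride in bytes (e.g. the surface pitch),
   which may be larger than the row width. Only the pixel bytes of each row
   are written; padding bytes are left alone and dst is never read here. */
void copy_rows_write_only(const uint8_t *tmp, int width, int height,
                          uint8_t *dst, size_t dst_pitch)
{
    const size_t row_bytes = (size_t)width * 4;
    assert(dst_pitch >= row_bytes);
    for (int y = 0; y < height; ++y) {
        memcpy(dst + (size_t)y * dst_pitch, tmp + (size_t)y * row_bytes, row_bytes);
    }
}
```

In a real filter, tmp would be an ordinary heap buffer, and dst and dst_pitch would come from the locked output surface. The conversion runs entirely in system memory, and the only traffic to the graphics card buffer is the sequential row writes.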
If this is indeed the case, it would be good if the behaviour were properly documented. For functions with this behaviour, it would also be good to have an option to turn off such additional optimizations, so that the function can be used with e.g. a graphics card buffer without a big penalty and without the need for a temporary step. For the Copy functions, I would actually have expected them to be optimal without additional "read back tricks". Well, I am just speculating; it may not be the actual reason, and it may also not be easy to solve...
- Jay