Hi everyone: I have a strange performance problem when I copy a raster image to a DirectDraw surface. When I'm ready to render a frame, I call Lock on the surface to obtain a pointer to it and then copy the data from the raster image to the surface. I use IPP functions to copy the image into video memory, but when I have to convert RGB24 to RGB32 using ippiCopy_8u_C3AC4R, the function takes a lot of time (I think at least 500 ms). If I use ippiCopy_8u_C3AC4R to convert into an intermediate buffer and then use memcpy to put the result on the surface, the copy is faster. When I create an overlay surface (such as YUY2 or UYVY) and use ippiBGRToYCbCr422_8u_C3C2R to convert from RGB24 to YUY2 directly in video memory, it is faster than the ippiCopy call. Do you know why ippiCopy has this strange timing behavior?
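For reference, the workaround described above (convert in system memory first, then copy to the locked surface) can be sketched in plain C. This is only an illustration under stated assumptions: `row_rgb24_to_rgb32` and `blit_rgb24_via_staging` are hypothetical helpers, not IPP or DirectDraw calls; `surface` and `surface_pitch` stand in for the pointer and pitch returned by Lock; the fourth byte is simply set to 0xFF; and a single staging row is used instead of a full-frame intermediate buffer.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Convert one packed 24-bit row to 32-bit in system memory.
   All four bytes of every pixel are written, so dst is never read.
   Writing 0xFF to the fourth byte is an assumption of this sketch;
   it is not what ippiCopy_8u_C3AC4R does. */
static void row_rgb24_to_rgb32(const uint8_t *src, uint8_t *dst, int width)
{
    for (int x = 0; x < width; ++x) {
        dst[4 * x + 0] = src[3 * x + 0];
        dst[4 * x + 1] = src[3 * x + 1];
        dst[4 * x + 2] = src[3 * x + 2];
        dst[4 * x + 3] = 0xFF;
    }
}

/* Hypothetical helper: convert each row into a staging buffer in system
   memory, then push it to the surface with one sequential memcpy.
   Pitches are in bytes. Returns 0 on success, -1 if allocation fails. */
static int blit_rgb24_via_staging(const uint8_t *src, int src_pitch,
                                  uint8_t *surface, int surface_pitch,
                                  int width, int height)
{
    size_t row_bytes = (size_t)width * 4;
    uint8_t *staging = malloc(row_bytes);
    if (staging == NULL)
        return -1;

    for (int y = 0; y < height; ++y) {
        row_rgb24_to_rgb32(src + (size_t)y * src_pitch, staging, width);
        /* The surface only ever sees sequential writes. */
        memcpy(surface + (size_t)y * surface_pitch, staging, row_bytes);
    }

    free(staging);
    return 0;
}
```

In a real renderer the staging buffer would be allocated once and reused for every frame rather than allocated per call.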
I seem to remember that we had a similar issue with colour conversion from YCbCr 4:2:0 to BGR32 in a DirectShow filter connected to a VMR9 renderer (which, as far as I know, uses Direct3D surfaces "underneath"). That is, the YCbCr 4:2:0 source image was colour converted directly into the DirectShow media buffer (which is the Direct3D surface). In that situation we also saw a significant slow-down compared to doing what you are doing, i.e. colour converting into a temporary buffer and then simply copying that to the output buffer. I cannot remember the IPP version or the precise setup details. What's your setup? And have you tried either a previous version of IPP (assuming you are using one of the latest releases), or a different "lower" optimization variant of the function (e.g. if the optimal one is v8, then try t7)?
One potential reason for this behaviour might be that the function actually reads some bytes from the destination image buffer during the copy/conversion. Reading back from a graphics card buffer generally has a detrimental effect on performance (reading is sloooow). Well, that would explain it, but I also cannot imagine why these types of functions would need to read from the destination buffer. That does not make much sense. The operations are fairly simple and should only need to read from the source, do a bit of math (in the copy case there is no math) and write the result. No need to read back anything. But it may be due to some optimization choices that work better for CPU memory accesses. The fact that the BGR to YCbCr 4:2:2 function you are also using does not exhibit the problem could be due to different optimizations (e.g. no reading).
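To make the "reads from the destination" idea concrete, here is a small plain-C sketch of two ways a 3-channel to 4-channel copy could be written. This is speculation in code form, not IPP's actual implementation, and both function names are hypothetical. As I understand the IPP naming convention, AC4 variants leave the alpha channel unmodified; one way an implementation could achieve that with full-width stores is to load the destination word, keep its top byte and merge in the new colour, which means one read from the destination per pixel.

```c
#include <stdint.h>

/* Pack three source bytes into the low 24 bits of a 32-bit word
   (first channel in the lowest byte, as it lies in memory on x86). */
static uint32_t pack_rgb24(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) | ((uint32_t)p[2] << 16);
}

/* Write-only variant: each output pixel depends on the source alone.
   The fourth byte is forced to 0xFF (an assumption of this sketch). */
static void convert_write_only(const uint8_t *src, uint32_t *dst, int width)
{
    for (int x = 0; x < width; ++x)
        dst[x] = pack_rgb24(src + 3 * x) | 0xFF000000u;
}

/* Alpha-preserving variant: keeps the destination's fourth byte,
   so every pixel costs one load from dst before the store. If dst
   is video memory, that load is the slow read-back. */
static void convert_keep_alpha(const uint8_t *src, uint32_t *dst, int width)
{
    for (int x = 0; x < width; ++x)
        dst[x] = (dst[x] & 0xFF000000u) | pack_rgb24(src + 3 * x);
}
```

Both loops produce the same colour bytes; the only difference is whether the destination is touched by a load. In ordinary system memory that difference is hard to notice, which would fit the observation that converting into an intermediate buffer first is fast.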
If this is indeed the case, it would be good if the behaviour were properly documented. For functions that behave this way, it would also be good to have an option to turn off such additional optimizations, so that the function can be used with e.g. a graphics card buffer without a big penalty and without the need for a temporary step. For the Copy functions, I would actually have expected them to be optimal without additional "read back tricks". Well, I am just speculating; it may not be the actual reason, and it may also not be that easy to solve...