I have an app that captures 6 x HD television feeds in real time via 6 separate threads. The second part of the app requires that all of the 6 HD buffers get resized (1 into 1280x720 and 5 into 640x360). The third part of the app is that once the resizing is completed then the 6 resized images are composited together to recreate one full HD image (1920x1080) which is then output back to TV.
The problem is that the final output is not stable and seems to drop frames in some of the sub-windows but not all of them. I am assuming that this is most likely a timing issue compounded by the WaitForMultipleObjects construct that I am using.
I am assuming from what I have read about TBB that there may be a more productive way of streamlining this application, but I am not sure where I should begin.
Any suggestions greatly appreciated.
Dropping frames seems to come with the territory unless you have enough cycles to spare. I can't reconcile WaitForMultipleObjects with frames being dropped only in some of the sub-windows, though.
Because this isn't just about maximising throughput, I couldn't say with much confidence whether, e.g., a TBB flow graph might help, although somebody with more experience with it might know more. My intuition says that you'll have to allow for some latency with that solution, because it's much more difficult to raise priority as a deadline approaches, and TBB does not preempt running tasks.
What are the hardware details, CPU usage, GPU usage? Maybe you can just get a bigger chip to buy yourself out of saturation territory if dropping frames is not an option?
So far it is a CPU-only project using IPP extensively; I have tried a couple of different resizing functions on a couple of different PCs.
One unthreaded function (ippiResizeYUV422) and one threaded function (ippiResizeSqrPixel) have been tested and compared with two types of interpolation as well (linear vs. supersampling). With linear interpolation the unthreaded function seemed to give the best results, but not by much. My initial reaction was that this might be because the unthreaded function was working on YUV data at 2 bytes per pixel, whereas the threaded function only worked with RGB data at 3 bytes per pixel and required extra function calls to convert from YUV to RGB and vice versa.
It was a different case when supersampling was used. The threaded function produced much more stable output (though not as good as with linear interpolation), far better than the unthreaded function. The most obvious difference in this case was that the threaded function was hitting all available cores at 100%.
Overall, I was unable to get acceptable stable output under any of the tested scenarios after which I started to think that there was probably a major bottleneck (most likely in the WaitForMultipleObjects procedure) that was holding everything up.
I hope this helps
Those seem to be deprecated functions. There's a new API for resizing that might also be more efficient when used by multiple threads, and even asynchronous functions that execute on the integrated graphics processor. You should be able to get some breathing space going that way.
But again, I don't see how WaitForMultipleObjects() would result in only dropped sub-windows.
Been working on image streaming including resizing with IPP's ResizeSqrPixel in a TBB pipeline.
Just to give you an idea of what's realistic (more optimisation is always possible, of course):
1024x768 8-bit RGB (internally processed as raw RGB) gets processed at about 400 frames/s.
Dividing that by roughly 4 for HD resolution gives about 100 fps, so at 25 fps you should be able to handle 4 channels.
But it is doing read-from file, decompress, convert to RGB and then IPP shrink in that time.
If you don't need to read from disk you should probably be able to get better results.
On the other hand, this doesn't have to operate within fixed sync moments but can go flat-out.
So what you're trying to achieve sounds possible, but challenging.
Note that these measurements are on a not-so-modern workstation (Xeon E5420, 2.5 GHz; Task Manager shows 8 cores), and it scales well on a more powerful server.
Is it possible that your main image (larger than the others) is hogging CPU or cache? Also be careful about who allocates memory (a central resource?) and when. If possible, try to reserve image buffers up front and recycle them. I'm wondering whether the threads are thrashing each other's cache lines, and whether it might be more efficient to process the images sequentially but parallelise the shrink per image.
Just my two cents,
Roel, do you have any insights on the new functionality I mentioned above? If the (graphical!) functions currently being used are indeed executed on the CPU, then it seems only logical to use alternatives that leverage the built-in GPU...
Are you double or multi-buffering?
rawFeed1 -> ringOfRawFeed1Buffers
rawFeed6 -> ringOfRawFeed6Buffers
ringOfRawFeed1Buffers -> SliceOfRingOfStridedCompressedOutputBuffer
ringOfRawFeed6Buffers -> SliceOfRingOfStridedCompressedOutputBuffer
Each input feed has a separate ring of buffers.
SliceOfRingOfStridedCompressedOutputBuffer is two or more output frames; each thread compresses to an appropriate slice of the output frame. Multiple SliceOfRingOfStridedCompressedOutputBuffer instances permit you to compose one frame while dispatching the other.