Use session joining to speed

Victor_D_ · ‎11-23-2013

My application needs to stream multiple H.264 streams of various resolutions. When I decode several 5-megapixel H.264 streams with hardware decoding, I can decode only a few of them. With software decoding, the number of streams is several times higher, but still 50% lower than with IPP 7.1 H.264 decode. I'm using low latency mode in all of these cases. My application is a 32-bit application for now, and the system has 4 GB of memory.

Is there a way to reduce the memory footprint of the H.264 decoder per stream?

OTorg · ‎11-25-2013

Use session joining to speed up several simultaneous tasks and reduce memory usage somewhat.

Victor_D_ · ‎12-25-2013

I got session joining working, which slightly reduced memory usage for software decode. But the memory footprint is still nowhere near as small as with IPP decode (H.264 with 5-megapixel streams).

Jeffrey_M_Intel1 · ‎12-27-2013

I'm looking into this and will get back to you soon. Just to clarify, does your streaming application decode as separate sessions or do your sessions have decode and encode stages?

Victor_D_ · ‎12-30-2013

My streaming application decodes many, many H.264 streams. Each decode is its own session and runs on its own thread. I'm decoding with low latency settings, which lowers the footprint because the decoder's allocator asks for fewer frames. Hardware decode runs out of memory with far fewer streams than software decode, whether or not it uses system memory for decode. Software decode uses about 50% more memory than H.264 decode using IPP.
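As a rough illustration of why a smaller frame request lowers the footprint, here is a minimal sketch that estimates the size of a decoder's frame pool. Everything in it is an assumption for illustration: the 2592x1944 resolution (one common 5-megapixel size, not stated in this thread), NV12 output at 12 bits per pixel, dimensions rounded up to a multiple of 16, and the two frame counts. Real surface pitch and the number of frames the decoder actually requests will differ.

```python
# Rough estimate of decoder frame pool size. All inputs are illustrative
# assumptions, not values reported by Media SDK or IPP.

def align_up(value, alignment=16):
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment

def nv12_frame_bytes(width, height):
    """Bytes for one NV12 frame: full-size luma plane plus half-size chroma."""
    w = align_up(width)
    h = align_up(height)
    return w * h * 3 // 2

def pool_bytes(width, height, num_frames):
    """Total bytes for a pool of num_frames frames."""
    return nv12_frame_bytes(width, height) * num_frames

if __name__ == "__main__":
    width, height = 2592, 1944        # hypothetical 5-megapixel stream
    for frames in (4, 16):            # hypothetical low latency vs. default request
        mib = pool_bytes(width, height, frames) / (1024 * 1024)
        print(f"{frames} frames: {mib:.1f} MiB")
```

The pool grows linearly with the frame count, so any setting that lets the allocator ask for fewer frames cuts the per-stream footprint by the same proportion.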

We are in the process of porting our application to 64-bit Windows, but we would still like a smaller memory footprint, so that we can decode a larger number of video streams without requiring client systems to have large amounts of memory.

Jeffrey_M_Intel1 · ‎12-30-2013

Here are some experiments with a 3136x1776 H.264 test stream that may help point toward an answer:

Product      Sample/settings                    MB used per decode
Media SDK    sample_decode, SW, system memory   80
Media SDK    sample_decode, HW, system memory   240
Media SDK    sample_decode, HW, d3d11 memory    180
IPP 8        umc_video_dec_con                  180

Experiments were with an Intel(R) Core(TM) i5-3427U processor, Media SDK 2013 R2 and driver 15.33.8.64.3345 (10.18.10.3345). MB/decode gathered via Sysinternals Process Explorer.
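To put these figures in context, here is a back-of-the-envelope sketch of how many decodes of this test stream would fit in a fixed memory budget. The 2048 MB budget is an assumption (the default user address space of a 32-bit Windows process), and the calculation ignores everything else the process allocates. It also treats all four figures as if they counted against the same budget, which may not hold for video memory.

```python
# Back-of-the-envelope estimate: decodes per memory budget, using the
# MB-per-decode figures measured above for the 3136x1776 test stream.
MB_PER_DECODE = {
    "Media SDK sample_decode, SW, system memory": 80,
    "Media SDK sample_decode, HW, system memory": 240,
    "Media SDK sample_decode, HW, d3d11 memory": 180,
    "IPP 8 umc_video_dec_con": 180,
}

# Assumption: default user address space of a 32-bit Windows process.
BUDGET_MB = 2048

def max_decodes(budget_mb, mb_per_decode):
    """Whole number of decodes that fit in the budget."""
    return budget_mb // mb_per_decode

if __name__ == "__main__":
    for name, mb in MB_PER_DECODE.items():
        print(f"{name}: about {max_decodes(BUDGET_MB, mb)} decodes")
```

Under these assumptions the software path fits roughly three times as many decodes as hardware decode into system memory, which matches the direction of the behavior reported earlier in the thread.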

I'm not seeing the low memory footprint for IPP that you are, but your implementation may not have as much overhead as the umc_video_dec_con sample and I'm looking at a different IPP version.

If your application decodes and then does work on the CPU before encoding, it can make sense to decode in software. In addition to needing extra memory, the GPU->CPU copy required for HW decode in this scenario adds a lot of overhead. Of course, it is up to you to choose which decoder makes the most sense, but please keep in mind that UMC was always a sample. It was never intended for production use without significant modification, and it has now reached end of life. Media SDK is fully productized and will continue to improve.

When using hardware sessions, there are many advantages to using video memory, and it may help decrease your memory footprint further. However, even if the memory requirements for HW decode could be reduced to the level of SW decode, you may not see a performance advantage from HW decode if your pipeline is decode -> copy to CPU -> do CPU work -> copy to GPU -> encode. The best performance scenario is to find a way to keep the entire pipeline on the GPU for most of your streams.
