Decoder input bitstream (and general performance) optimization

jnickerson · 05-17-2012

Is there anything in particular we should be aware of when trying to get the highest performance out of hardware decoding with low-latency streaming media? We're trying to run quite a few such decoding sessions in parallel (>10), and I seem to be noticing peculiar differences in the decoder's ability to keep up depending on how the input mfxBitstream is managed.

I had thought that circular buffer-style management of the bitstream read offset pointer (along with similar added write logic) would be an improvement over the method used in the example code, which involves a bitstream whose offset data is copied back to the starting offset as often as possible. It seems like only moving pointers would be much more efficient than constantly memcpying encoded video data -- but instead the best-performing solution we've found is a double-buffered approach, similar to the one used in the example except with a second input bitstream.
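For reference, the buffer handling being compared here can be sketched roughly as follows. This is only an illustration, not the sample's code: `Bitstream` is a hypothetical stand-in that mirrors four mfxBitstream fields (Data, DataOffset, DataLength, MaxLength), and `compact` and `append` are made-up helper names. The sketch copies unread bytes back to the start only when the free tail is too small, which sits between always copying back and a true circular buffer.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for the mfxBitstream fields used in this sketch.
// Real code would use mfxBitstream from the Media SDK headers instead.
struct Bitstream {
    std::uint8_t* Data;        // start of the buffer
    std::uint32_t DataOffset;  // index of the first unread byte
    std::uint32_t DataLength;  // number of unread bytes from DataOffset
    std::uint32_t MaxLength;   // total capacity of the buffer
};

// Move the unread bytes to the front so the whole tail becomes free space.
inline void compact(Bitstream& bs) {
    if (bs.DataOffset == 0) return;  // nothing to move
    std::memmove(bs.Data, bs.Data + bs.DataOffset, bs.DataLength);
    bs.DataOffset = 0;
}

// Append up to `size` bytes after the unread data and return how many
// bytes were copied. Compaction happens only if the tail is too small.
inline std::uint32_t append(Bitstream& bs, const std::uint8_t* src,
                            std::uint32_t size) {
    assert(bs.DataOffset + bs.DataLength <= bs.MaxLength);
    std::uint32_t tail = bs.MaxLength - bs.DataOffset - bs.DataLength;
    if (size > tail) {
        compact(bs);
        tail = bs.MaxLength - bs.DataLength;
    }
    const std::uint32_t n = size < tail ? size : tail;
    std::memcpy(bs.Data + bs.DataOffset + bs.DataLength, src, n);
    bs.DataLength += n;
    return n;
}
```

The decoder side would advance DataOffset and shrink DataLength as it consumes data; `append` never touches bytes in the unread range except when it compacts.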

Any idea why this is the case? Is there some reason decoding operations perform better when they tend to start from the beginning of the mfxBitstream buffer, rather than at some offset along it?

And why does the guide recommend decoding all remaining frames from the input buffer before adding more? Shouldn't write operations appending to the end of a section of memory be transparent to the decoder, which is simply reading along its own (earlier) section of memory? It's not as though there's any sort of inherent lock implemented in the bitstream data, since it's randomly accessible.

Again, this is all in a low-latency setup, with AsyncDepth set to 1 and the bitstream flag set to indicate that a complete frame is available every time DecodeFrameAsync is called. Increasing AsyncDepth and clearing that flag don't seem to improve throughput at the expense of latency, though, as I would have expected. Maybe that's because of the large number of simultaneous decoding sessions, such that the hardware can't work on any one of them in parallel anyway?
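To make that setup concrete, here is a minimal sketch of the two settings. The structs are hypothetical stand-ins carrying only the fields mentioned above; in real code these are mfxVideoParam::AsyncDepth and mfxBitstream::DataFlag from the Media SDK headers. The constant is meant to mirror MFX_BITSTREAM_COMPLETE_FRAME, and its numeric value here is an assumption that should be checked against the headers.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins; real code uses mfxVideoParam and mfxBitstream.
struct VideoParam {
    std::uint16_t AsyncDepth;  // number of operations allowed in flight
};
struct BitstreamHeader {
    std::uint32_t DataLength;  // unread bytes in the buffer
    std::uint16_t DataFlag;    // bit flags describing the buffer contents
};

// Assumed to mirror MFX_BITSTREAM_COMPLETE_FRAME; verify the value.
constexpr std::uint16_t kCompleteFrame = 0x0001;

// Low-latency choice: allow one operation in flight before syncing.
inline void configureLowLatency(VideoParam& par) {
    par.AsyncDepth = 1;
}

// Declare that the buffer holds exactly one whole frame of `frameBytes`.
inline void submitWholeFrame(BitstreamHeader& bs, std::uint32_t frameBytes) {
    assert(frameBytes > 0);  // an empty buffer cannot be a complete frame
    bs.DataLength = frameBytes;
    bs.DataFlag |= kCompleteFrame;
}

// Throughput-oriented alternative: clear the flag again.
inline void clearCompleteFrame(BitstreamHeader& bs) {
    bs.DataFlag &= static_cast<std::uint16_t>(~kCompleteFrame);
}
```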

Any advice, insights, or assistance are appreciated!

Thanks,

James

Petter_L_Intel · 05-18-2012

Hi James,

Based on your description, there is no reason performance should differ regardless of how you treat the bitstream buffer. The sample implementation just showcases one way to load data into the buffer and process it. Also, there is no specific reason that the bitstream buffer should be drained first before adding more data. The recommendation just stems from how the sample was designed, which is a simplistic approach.

I suspect the reason for the difference in performance lies somewhere else.

Media SDK does not limit the number of simultaneous ongoing sessions. The HW resources are shared equally between the running Media SDK workloads.

If you can provide some more details about your architecture, configuration and bottlenecks we may be able to help you assess this further.

Regards,

Petter