Re: H264 Decoder Latency.

david_jacksonipfx_co · ‎08-18-2009

Hi,
We have been attempting to develop a video conferencing product using IPP, and we have been having decoding latency problems. A bit of searching turned up http://software.intel.com/en-us/forums/showpost.php?p=36228, but changing the level_idc parameter does not seem to decrease the internal buffer size of the decoder. Could you clarify what needs to be done to reduce the decoder frame buffer to 1 or 2 frames? Some configuration examples would be great.

Has anybody else used the H264 codec in a real-time application? Any tips would be appreciated.
Thank you in advance,
David.

Vladimir_Dudnik · ‎08-19-2009

Hi David,

Here is a comment from our experts: threading adds a delay, in frames, equal to the number of threads. You may try running the decoder with the numThreads=1 parameter. The engineering team also has some questions for you:
- What video resolution do you use?
- What frame rate (FPS) do you want to decode at?
- What are the level_idc and profile_idc values in your case?

Regards,
Vladimir

david_jacksonipfx_co · ‎08-19-2009

1. We are experimenting with different resolutions; currently we are testing with 160x120, 320x240 and 640x480.
2. Once again, this is currently variable; 10, 30 and 60 are the numbers we have been trying.
3. The level_idc and profile_idc values have not been configured to anything specific; the encoder configuration is currently as per the example code on the Intel website (clip resolution, frame rate, bit rate and slice number).

Please let me know if there is any other information I can provide to help as we are under time pressure to get a prototype working.
Thank you,
David.

david_jacksonipfx_co · ‎08-19-2009

To provide a bit more context: we had been trying to solve our latency problem at higher frame rates, and the timings we took suggested that no stage of the processing pipeline was responsible for the size of delay we were seeing. Upon slowing the frame rate down to 1 fps, we realized that the frame that goes into the decoder is not the frame that comes out; we assume this is the 'DPB' mentioned in the IPP documents. Unfortunately, a delay of even 3 or 4 frames produces a perceptible lag, even at higher frame rates.
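
To put numbers on that lag, here is a minimal sketch of the arithmetic, assuming the decoder simply holds back a fixed number of frames. The helper name bufferDelayMs is made up for illustration and is not part of IPP or UMC.

```cpp
#include <cassert>

// Delay, in milliseconds, added by holding back `bufferedFrames` frames
// when frames arrive at `fps` frames per second.
// Hypothetical helper for illustration; not part of IPP/UMC.
inline double bufferDelayMs(int bufferedFrames, double fps) {
    assert(fps > 0.0);
    return bufferedFrames * 1000.0 / fps;
}
```

For the frame rates mentioned in this thread, a 3-frame buffer costs 300 ms at 10 fps, 100 ms at 30 fps and 50 ms at 60 fps, before any capture, encode or network time is counted.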

Vladimir_Dudnik · ‎08-20-2009

Hello, the comment from the engineering team was:

It is possible to make the delay equal to 2 frames by using a small level value and the Baseline profile, and no threading, of course. A GOP pattern like IPPPPPP should cause such a small delay.

Vladimir
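
The advice above can be turned into a small pre-flight check. This is only a sketch: StreamSettings and lowLatencyIssues are hypothetical names, not the UMC encoder or decoder parameter classes, and the level check is left out because it depends on resolution. The one value taken from the standard is profile_idc 66 for Baseline.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical settings bundle; field names are illustrative, not the UMC API.
struct StreamSettings {
    int profileIdc;      // H.264 profile_idc; 66 is Baseline
    int decoderThreads;  // number of decoder worker threads
    int bFrames;         // B-frames between reference frames (0 for IPPPP)
};

// Returns human-readable reasons why the settings may add decoder delay.
inline std::vector<std::string> lowLatencyIssues(const StreamSettings& s) {
    std::vector<std::string> issues;
    if (s.profileIdc != 66)
        issues.push_back("profile is not Baseline (profile_idc 66)");
    if (s.decoderThreads != 1)
        issues.push_back("decoder threading adds about one frame of delay per thread");
    if (s.bFrames != 0)
        issues.push_back("B-frames require output reordering");
    return issues;
}
```

An empty result means the three conditions from the post hold; it says nothing about whether the decoder actually honours them.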

Ying_H_Intel · ‎08-20-2009

Hello David,

Or, if you are using the UMC decoder and linking h264_dec.lib, you can try to modify the library code as follows:
1) In h264_dec => umc_h264_task_supplier.cpp, make sure the value of m_maxDecFrameBuffering is 0,
e.g. comment out the condition at line 921:
    // if (m_TrickModeSpeed != 1)
    // {
    m_maxDecFrameBuffering = 0;
    // }
2) At line 2318 of umc_h264_task_supplier.cpp, add two lines as below:

    if (!pSource)
    {
        AddSlice(0, 0);
        ....
    }
    else
    {
        // add two lines here
        if (MINIMAL_DATA_SIZE >= pSource->GetDataSize())
            AddSlice(0, 0);
        // decoding
        return RunDecoding(dst, pSource, force);
    }
Then rebuild h264_dec.lib.

Regards,
Ying

david_jacksonipfx_co · ‎08-20-2009

Hi Vladimir,

I greatly appreciate your help so far on this issue; however, I desperately need some further advice from you.

We have tried for days to get performance out of the IPP H.264 library comparable to other video products, such as Google Mail video (which uses H264/SVC) and Microsoft OCS, without success.

The best we have been able to achieve so far is about 200 ms in total from video capture to display, due to delays in encoding and decoding and the 1-3 frame buffer delay, depending on resolution, frame rate, etc.

A high frame rate chews up too much CPU for us, and a low frame rate causes extra delay because each buffered frame lasts longer. On the same PCs the above products work perfectly with little noticeable lag. 640x480 and above seems unusable with any settings as it consumes too much CPU.

We are still not totally clear on the actual relationship between the level_idc parameter and the buffer size, so we accept there could still be errors on our part. We believe we have tried your recommended settings, but realize there could be alternatives for our needs.

Therefore, can you please give us ALL the settings you suggest for low CPU and low latency at 320x240 or 640x480 (or, even better, sample code with all settings), or let us know if we would be better off evaluating alternative encoders/decoders if this one is not suitable for real-time video? Sorry to push the issue, but I am running out of time for this project.

Thanks again for your help
Regards
David.

Vladimir_Dudnik · ‎08-21-2009

Hi David,

The IPP engineering team commented that 200 ms for capture-encode-decode is probably the best you can do with the IPP codec. It has to maintain strict conformance to the H.264 spec, which the third-party codecs you mention may not.

Regards,
Vladimir

oxydius · ‎08-21-2009

Quoting - Vladimir Dudnik (Intel)

Hello, the comment from the engineering team was:
It is possible to make the delay equal to 2 frames by using a small level value and the Baseline profile, and no threading, of course. A GOP pattern like IPPPPPP should cause such a small delay.

Vladimir

There is no reason for such a GOP pattern to cause any delay in single-threaded mode. The I frame can be decoded instantly in the first GetFrame call, and all subsequent P frames depend on the previous frame, so they can also be returned instantly. That's why B frames are never used for realtime video applications.

The decode picture buffer described in the H.264 specification is nothing but a suggestion for buffering and synchronization purposes. It is not required to artificially delay frames to achieve conformance, unless the sequence parameter set (SPS) explicitly mentions out-of-order (B) frames are present in the stream.

j_miles · ‎08-21-2009

Well, I think you're incorrect in saying that the buffer (DPB) described in the H.264 specification is merely a suggestion. The buffering mechanism (however it is handled) has to adhere to the specification for the decoder to claim conformance (see Annex C). Note that there are two types of conformance: output timing conformance and output order conformance.

To my knowledge, the Intel implementation in the IPP samples can only deliver frames in the correct reordered output order, whereas some other codecs also allow decoding-order (immediate) output. When providing reordered output, the decoder needs to buffer to take care of the reordered frames. Usually these would be B-frames, but in H.264 they can also be P-frames. Therefore, for the GOP pattern described, we cannot really know, but we assume that no reordering is taking place (as it just adds to the delay). In general, the decoder cannot know in advance whether a reordered picture may appear at some point in the stream. Therefore, it seems that Intel has chosen a "safe path" in which the decoder uses the "worst" possible buffering (delay) that would be necessary to deliver the stream in a fluent manner. To elaborate: if the decoder did not buffer (delay) and an out-of-order picture suddenly appeared, the flow out of the decoder would contain a gap, as the out-of-order picture would need to be buffered before output. In other words, the decoder buffers up to the maximum number of pictures allowed for a given stream (I'll come to that later) so that it can deliver the frames in a fluent (one-by-one) flow.

The maximum buffering required is determined by the 'max_dec_frame_buffering' parameter, described in Annex E of the H.264 specification. It is part of the bitstream restriction fields in the VUI parameters of the SPS. As it is optional, when it is absent the value is derived from 'MaxDpbSize', which in turn is derived from the profile, the level and the coded picture resolution as defined in Annex A. Note that the 'max_dec_frame_buffering' parameter is constrained at the low end to be >= the 'num_ref_frames' parameter of the SPS. The Intel decoder uses the 'max_dec_frame_buffering' parameter to set the "worst-case" buffering, so with the right encoding parameters and the proper addition of the VUI parameters you can set this as low as possible to obtain the smallest possible buffering.
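
The level-derived limit can be sketched as follows. This assumes progressive frames and the macroblock-based form of the formula (a per-level MaxDpbMbs value divided by the frame size in macroblocks, capped at 16), which is how I recall later editions of Annex A expressing it; the function names are illustrative, and the per-level constants should be checked against Table A-1 of the edition you target.

```cpp
#include <algorithm>
#include <cassert>

// Picture dimension in macroblocks (16x16 luma samples), rounded up.
inline unsigned mbCount(unsigned pixels) { return (pixels + 15) / 16; }

// DPB capacity in frames for a level's MaxDpbMbs value and a progressive
// frame size, capped at 16 frames.
inline unsigned maxDpbFrames(unsigned maxDpbMbs, unsigned width, unsigned height) {
    assert(width > 0 && height > 0);
    const unsigned frameMbs = mbCount(width) * mbCount(height);
    return std::min(maxDpbMbs / frameMbs, 16u);
}
```

With the MaxDpbMbs values as I recall them (900 for level 1.1, 2376 for levels 1.2 to 2, 8100 for levels 2.2 and 3), a 320x240 stream gets 3 frames at level 1.1 and 7 frames at level 1.2, and a 640x480 stream gets 6 frames at level 3. That is the buffering a conservative decoder would assume when the VUI does not say otherwise.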

The 'max_dec_frame_buffering' parameter defines the maximum for the 'num_reorder_frames' parameter, which is also given in the VUI set. This thus sets a limit to the amount of reordering that can occur in a stream. This is actually the only information a decoder can derive about reordering directly from the H.264 stream. The SPS thus does not explicitly state whether or not there will be B-pictures in a stream (and P-pictures may also be reordered), and it also does not state if they actually do appear, i.e. even if the 'num_reorder_frames' parameter is >0, the stream is not required to actually use it.
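
The fallback behaviour described above can be sketched like this. The struct and function names are hypothetical; only the three syntax elements named in the comments come from the specification, and the level-derived maximum is passed in by the caller rather than computed here.

```cpp
#include <cassert>

// Subset of the VUI bitstream-restriction fields discussed in the post.
struct BitstreamRestriction {
    bool present;                   // bitstream_restriction_flag
    unsigned numReorderFrames;      // num_reorder_frames
    unsigned maxDecFrameBuffering;  // max_dec_frame_buffering
};

struct DpbLimits {
    unsigned reorderFrames;
    unsigned decFrameBuffering;
};

// Limits a decoder may assume for a stream. When the encoder did not signal
// the restriction, both values fall back to the level-derived maximum.
inline DpbLimits effectiveLimits(const BitstreamRestriction& vui,
                                 unsigned levelDerivedMax) {
    if (!vui.present)
        return {levelDerivedMax, levelDerivedMax};
    assert(vui.numReorderFrames <= vui.maxDecFrameBuffering);
    return {vui.numReorderFrames, vui.maxDecFrameBuffering};
}
```

An encoder that writes num_reorder_frames = 0 and max_dec_frame_buffering equal to its reference frame count gives any decoder that reads these fields enough information to output each frame as soon as it is decoded.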

Anyway, this does not mean you cannot handle it otherwise, especially if you have a closed-circuit system with control over both the encoder and decoder sides, as seems to be the case here. In that case, it is essential to choose the right encoding parameters and provide the right information in the stream, and/or adapt the decoder to use as little buffering as possible.

Hope this helps shed some light on the subject...

- Jay

j_miles · ‎08-21-2009

Well, let me add some more specific information in addition to my lengthy (more general) post:

As suggested, do not use multi-threading in the decoder, as it adds to the frame buffering employed.
The reason for suggesting as low a level as possible (relative to the picture resolution) is exactly that it influences the 'MaxDpbSize' parameter, which sets the 'max_dec_frame_buffering' parameter when the latter is not present. If you would like to use a higher level (for compression quality reasons, for example), you would need to specify 'max_dec_frame_buffering' directly in the VUI set.
There might be a "bug" in the decoder's handling of 'max_dec_frame_buffering' for a value of 0, and I also cannot remember whether it handles inferring the value (when not present) correctly...

Good luck with the project.

- Jay
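
Picking "as low a level as possible" for a given resolution and frame rate can be sketched as a table lookup. The table below is partial and holds the MaxMBPS and MaxFS values as I recall them from Table A-1; it ignores level 1b, bit-rate limits and the other per-level constraints, so treat it as an assumption to verify rather than a reference. The names LevelLimit, kLevels and lowestLevel are illustrative.

```cpp
#include <cassert>
#include <cstddef>

struct LevelLimit {
    int levelIdc;      // level number times 10 (e.g. 12 means level 1.2)
    unsigned maxMbps;  // MaxMBPS: macroblocks per second
    unsigned maxFs;    // MaxFS: macroblocks per frame
};

// Partial table, values as recalled from Table A-1; verify before relying on them.
static const LevelLimit kLevels[] = {
    {10, 1485, 99},    {11, 3000, 396},   {12, 6000, 396},
    {13, 11880, 396},  {20, 11880, 396},  {21, 19800, 792},
    {22, 20250, 1620}, {30, 40500, 1620}, {31, 108000, 3600},
};

// Returns the lowest level_idc in the table that fits the frame size and
// frame rate, or -1 if none of the listed levels does.
inline int lowestLevel(unsigned width, unsigned height, unsigned fps) {
    const unsigned frameMbs = ((width + 15) / 16) * ((height + 15) / 16);
    for (std::size_t i = 0; i < sizeof(kLevels) / sizeof(kLevels[0]); ++i) {
        if (frameMbs <= kLevels[i].maxFs && frameMbs * fps <= kLevels[i].maxMbps)
            return kLevels[i].levelIdc;
    }
    return -1;
}
```

Under these assumptions, 320x240 at 30 fps lands on level 1.3 and 640x480 at 30 fps on level 3. Signalling a higher level than needed raises the level-derived DPB size unless 'max_dec_frame_buffering' is written explicitly.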

oxydius · ‎08-21-2009

In other words, I meant the IPP decoder achieves conformance by assuming upfront the worst case buffering/reordering scenario. It is therefore, latency-wise, sub-optimal in all cases, even though it does not exhibit any failure.

For a given H.264 stream level and resolution, you can deduce the worst case to be a 14-frame reordering delay. That doesn't mean you need to hold the first IDR frame for 14 frames. While a 14-frame buffer would be allocated, it could be used only if needed, such that getting an IDR frame plus a sequence of P frames that only reference the previous frame would introduce zero latency. That is typically why videoconferencing software uses the baseline profile, as it prohibits complex decoder re-ordering.

Conformance can be maintained for high profile streams, when upon the first encounter of a P or B frame that needs re-ordering, the decoder returns UMC_ERR_NOT_ENOUGH_DATA until enough frame dependencies are resolved to fully decode and output the oldest frame. This may only require buffering 2-3 frames rather than the worst case of 14. I think IPP provides a great decoder and use it in all my projects, but it could use improvements in dynamic re-ordering dependency handling.

By the way, for the baseline profile with no re-ordering (decode order = present order), Ying's fix works perfectly to eliminate artificial latency.
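
The difference between a worst-case delay and a delay sized to the actual reordering can be illustrated with a toy output queue. This is a simplified model, not the UMC decoder or the full Annex C bumping process: frames are identified only by picture order count (POC), duplicate POCs and IDR resets are ignored, and ReorderQueue is a made-up name.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Toy output-order model: frames enter in decode order, identified by
// picture order count (POC), and leave in ascending POC order.
class ReorderQueue {
public:
    explicit ReorderQueue(std::size_t reorderDepth) : depth_(reorderDepth) {}

    // Adds one decoded frame and returns the POCs that become ready for
    // output. A frame is released once more than `depth_` frames are waiting.
    std::vector<int> push(int poc) {
        pending_.insert(poc);
        std::vector<int> out;
        while (pending_.size() > depth_) {
            out.push_back(*pending_.begin());
            pending_.erase(pending_.begin());
        }
        return out;
    }

private:
    std::size_t depth_;
    std::set<int> pending_;
};
```

With a depth of 0, an IPPPP stream is output one frame in, one frame out. With a depth of 2, the first frame only appears after the third has been decoded, even though no reordering ever happens, which is the artificial latency discussed in this thread.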

j_miles · ‎08-24-2009

Excellent points, oxydius. I can see I got into a bit of late night standards rambling... Let me add a few comments:

Quoting - oxydius

In other words, I meant the IPP decoder achieves conformance by assuming upfront the worst case buffering/reordering scenario. It is therefore, latency-wise, sub-optimal in all cases, even though it does not exhibit any failure.

Yes, but you can (possibly with a few modifications to the code - bugfixes?) deliver this information embedded in the stream, in the bitstream restrictions of the VUI parameters. This would allow other decoder implementations, including hardware, to also handle it with little latency, depending on their individual implementation/handling of buffering. I always suggest putting as much information as possible directly in the stream. The small overhead it adds is seldom a problem...

For a given H.264 stream level and resolution, you can deduce the worst case to be a 14-frame reordering delay. That doesn't mean you need to hold the first IDR frame for 14 frames. While a 14-frame buffer would be allocated, it could be used only if needed, such that getting an IDR frame plus a sequence of P frames that only reference the previous frame would introduce zero latency. That is typically why videoconferencing software uses the baseline profile, as it prohibits complex decoder re-ordering.

From what I remember, the specification defines a maximum of 16, but that hardly changes anything in this scenario. Do keep in mind that this is the worst case across any stream at any level; for a specific stream at a given level the number may well be lower (defined by the derived 'MaxDpbSize' parameter). For small picture resolutions such as those discussed here, though, the level value needs to be very low to avoid a large 'MaxDpbSize' (or the upper limit of 16, whichever is smaller).

Conformance can be maintained for high profile streams, when upon the first encounter of a P or B frame that needs re-ordering, the decoder returns UMC_ERR_NOT_ENOUGH_DATA until enough frame dependencies are resolved to fully decode and output the oldest frame. This may only require buffering 2-3 frames rather than the worst case of 14. I think IPP provides a great decoder and use it in all my projects, but it could use improvements in dynamic re-ordering dependency handling.

By the way, for the baseline profile with no re-ordering (decode order = present order), Ying's fix works perfectly to eliminate artificial latency.

Agree - it is a good decoder implementation but could need improvements in various aspects including to some degree the frame buffer handling...
Good point on low-delay handling for streams with reordering. In a general solution, one may opt to go up to the max. decoder buffering when the first out-of-order picture arrives as this would only create a single "gap" in the stream out of the decoder, whereas going adaptively with the re-ordering as it arrives could require several changes (increases) in the buffering - although that is probably not a common situation. And it all depends entirely on how the pictures are handled once out of the decoder.

- Jay

david_jacksonipfx_co · ‎08-24-2009

Firstly, thank you for all the replies. This exchange has certainly shed some much-needed light on the pros and cons of the IPP implementation. It seems to me that the ability to configure the IPP decoder for real-time applications (i.e. not having the buffer) should have been a no-brainer, seeing as the code change to achieve this is so trivial.

To Ying: Thank you very much for your code suggestion. It certainly does cut out the buffering; however, it does not seem to work when slicing is enabled. Is it possible to reconcile these two requirements (i.e. no buffering and slicing)?

Kind Regards,
David.

david_jacksonipfx_co · ‎08-30-2009

Bump: An answer would be appreciated; if it is not possible to reconcile no buffering with slicing, I shall have to start implementing (yet more) code to work around this problem.

Ying_H_Intel · ‎08-31-2009

Hello David,

Sorry, our engineering experts have no further suggestions; there seems to be no easy workaround. Please go ahead and implement according to your requirements.

Regards,
Ying

Emmanuel_W_ · ‎09-01-2009

Hi,

I was able to accommodate both requirements, multiple slices and low latency, for IPPPP sequences, but this involved some refactoring. I do not remember it being too hard, but I have enough changes that it is difficult for me to pinpoint the exact modifications. I force decoding upon receiving a marker bit in the RTP stream (so as not to wait for the next packet to detect a new frame), I disable threading, and I set the DPB buffering to 0 as indicated in this thread.
With a fully optimized framework from capture to playback, using the IPP encoder/decoder and some video filtering, I am able to get a delay as low as 120 ms on a Windows box with a 3 GHz i7.
Sorry I can't be of more help.

Emmanuel
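
The marker-bit approach mentioned above can be sketched as follows. This is not the code from the post: FrameAssembler is a hypothetical class, and it assumes a plain 12-byte RTP header with no CSRC list, extension or padding, in-order delivery with no loss, and payloads that can simply be concatenated. Real H.264 RTP payloads (single NAL unit, STAP-A, FU-A) need proper depacketization before they reach the decoder.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// True when the RTP marker (M) bit is set: the top bit of the second
// header byte. Packets shorter than the fixed 12-byte header are rejected.
inline bool rtpMarkerSet(const std::uint8_t* packet, std::size_t size) {
    return size >= 12 && (packet[1] & 0x80) != 0;
}

// Collects payload bytes until a packet with the marker bit arrives, so the
// frame can be handed to the decoder without waiting for the next packet.
class FrameAssembler {
public:
    // Returns true when frame() holds all payload bytes of one frame.
    bool addPacket(const std::uint8_t* packet, std::size_t size) {
        if (size < 12) return false;  // shorter than a fixed RTP header
        if (complete_) {              // previous frame was consumed; start over
            frame_.clear();
            complete_ = false;
        }
        frame_.insert(frame_.end(), packet + 12, packet + size);
        complete_ = rtpMarkerSet(packet, size);
        return complete_;
    }

    const std::vector<std::uint8_t>& frame() const { return frame_; }

private:
    std::vector<std::uint8_t> frame_;
    bool complete_ = false;
};
```

When addPacket returns true, the caller would pass frame() to the decoder right away instead of waiting for the first packet of the next frame to signal the boundary.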

david_jacksonipfx_co · ‎09-03-2009

Thank you for your reply. I think I'll end up implementing it as some type of buffering external to the decoder.

All these modifications just to get a certain use case working rather defeat the notion of a library as a reducer of complexity, in my opinion. The whole thing seems to have been implemented as an intellectual exercise as opposed to meeting real-world demands.