Media (Intel® Video Processing Library, Intel Media SDK)

How to change encoder's base pts-dts shift?

New Contributor III

Consider a scenario where two h264-encoded segments (produced by different codecs) need to be joined at idr points. There should be no gap in either pts or dts at the junction point.
The imsdk h264 encoder always produces a first encoded frame with dts = minimal_input_pts - 80 ms (-2 frames at 25 fps), except in low-delay I/P-only configurations, where the shift is zero.
Is there a way to instruct the encoder to make the base shift (on the idr frame) equal to, e.g., -1 or -3 frames? I tried different encoder settings (GopRefDist, NumRefFrame, InitialDelayInKB, etc.), but always got a delta of -2 frames.
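
For readers following the arithmetic, here is a minimal sketch of how the shift can be expressed in frames and why two segments with different base shifts cannot be joined without a dts discontinuity. It assumes timestamps in 90 kHz units; the function names are hypothetical and not part of any SDK.

```python
from fractions import Fraction

# Assumption: timestamps are expressed in 90 kHz ticks.
TICKS_PER_SECOND = 90_000


def frame_ticks(fps):
    """Duration of one frame in ticks at a fixed frame rate."""
    return Fraction(TICKS_PER_SECOND) / Fraction(fps)


def base_shift_frames(first_dts, min_input_pts, fps):
    """Base shift in frames: dts of first output frame minus minimal input pts."""
    return Fraction(first_dts - min_input_pts) / frame_ticks(fps)


def dts_gap_at_junction(shift_a, shift_b):
    """dts discontinuity (in frames) when segment B follows segment A
    with continuous pts. Positive = gap, negative = overlap, 0 = seamless."""
    return shift_b - shift_a


# 80 ms at 25 fps is 7200 ticks, i.e. 2 frames.
print(base_shift_frames(-7200, 0, 25))   # -2
# Joining a segment with shift -1 to one with shift -2 overlaps dts by 1 frame.
print(dts_gap_at_junction(-1, -2))       # -1
```

The junction is seamless in both pts and dts only when both segments have the same base shift, which is why being able to control it matters here.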

2 Replies


There is no API to adjust or offset the encoded timestamp; the Media SDK encoding step simply uses the timestamp of the frame it is supplied. There are some cases where the timestamp is calculated (frame rate conversion, etc.), and it sounds like you are seeing generated timestamps. The 2-frame offset you are seeing could be generated for the encoded GOP pattern (2 frames before the IDR frame). Setting a different pattern (GopRefDist, etc.) should make a difference, but MSDK may adapt (optimize/change) the request if it decides another pattern is better. You can force the GOP behavior by using the "MFX_GOP_STRICT" option of GopOptFlag.

Can you provide some more details about the src clip framerates, encode GOP patterns and TIMESTAMP flags you are using?


New Contributor III

Hi, Tony
As far as I understand h264 coding (and any coding with forward prediction, in general), the main purpose of the base dts-pts shift is to leave enough time to transfer and decode the frames used for forward prediction, throughout the whole gop and all subsequent gops (and not only to accommodate B-frames located before the IDR in presentation order).
In other words, a base shift of -1 frame allows at most 1 reference frame for forward prediction, a shift of -2 frames allows at most 2 forward reference frames, and so on.
This conclusion follows from a system of rules (restrictions, inequalities): pts >= dts for each frame, a fixed frame rate, no gaps at the decoder output, and monotonically increasing dts.
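
The rules above can be sketched as a small calculation. Assuming dts advances by exactly one frame per decoded frame (fixed frame rate, no gaps, monotonic dts), the constraint pts >= dts bounds the base shift for a given decode order. The decode orders below are hypothetical illustrations, not patterns confirmed for the imsdk encoder.

```python
def max_base_shift(decode_order_pts):
    """Largest (closest to zero) base shift, in frames, that keeps pts >= dts.

    decode_order_pts[k] is the presentation index (pts in frames) of the
    k-th frame in decode order; its dts is base_shift + k.
    """
    return min(pts - k for k, pts in enumerate(decode_order_pts))


def satisfies_rules(decode_order_pts, base_shift):
    """Check pts >= dts for every frame under the given base shift."""
    return all(pts >= base_shift + k for k, pts in enumerate(decode_order_pts))


# Hypothetical decode orders (values are presentation indices):
ip_only = [0, 1, 2, 3]        # I0 P1 P2 P3, no reordering
ibbp = [0, 3, 1, 2]           # I0 P3 B1 B2, one forward reference
pyramid = [0, 4, 2, 1, 3]     # I0 P4 B2 B1 B3, two frames decoded ahead

print(max_base_shift(ip_only))       # 0
print(max_base_shift(ibbp))          # -1
print(max_base_shift(pyramid))       # -2
print(satisfies_rules(pyramid, -1))  # False: B1 would have dts > pts
```

Under these assumptions, each additional frame that must be decoded ahead of its display position pushes the required shift one frame further negative.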
On the other hand, I see no sense in using a shift larger than the maximum number of reference frames: it would only increase decoding delay and memory requirements for no benefit.
Thus, we get a direct correlation between the base dts-pts shift and NumRefFrame (once again, I define the base dts-pts shift as "base_shift = dts_of_first_output_frame - pts_of_first_input_frame").
Therefore, varying the NumRefFrame parameter was the first thing I tried. But it did not bring the desired result.
I'll try MFX_GOP_STRICT and then post my results here.
Thank you!

By the way, you did not reply to another one of my posts:

So, here are the results of my tests.
GopPicSize = 30 for all tests.
GopRefDist = 1, NumRefFrame = any: shift = 0 (these are low-delay I/P-only sequences).
GopRefDist >= 4, NumRefFrame = 1 or 2: shift = -1 frame.
GopRefDist >= 4, NumRefFrame >= 3: shift = -2 frames, never more.
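
As a compact restatement, the observations above can be written as a lookup. This is only a sketch of the results reported in this post (GopPicSize = 30), not documented encoder behavior; the function name is hypothetical, and combinations that were not reported (for example GopRefDist = 2 or 3) return None rather than a guess.

```python
def observed_shift_frames(gop_ref_dist, num_ref_frame):
    """Base dts-pts shift in frames as observed in the tests above.

    Returns None for combinations the tests did not cover.
    """
    if gop_ref_dist == 1:
        return 0                      # low-delay I/P-only sequences
    if gop_ref_dist >= 4:
        if num_ref_frame in (1, 2):
            return -1
        if num_ref_frame >= 3:
            return -2                 # never more, even for large NumRefFrame
    return None                       # not reported in the tests


print(observed_shift_frames(1, 4))    # 0
print(observed_shift_frames(4, 2))    # -1
print(observed_shift_frames(8, 10))   # -2
print(observed_shift_frames(2, 3))    # None
```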

It is quite interesting how the imsdk h264 encoder splits reference frames between the backward and forward classes (this concerns B-frame decoding) in the NumRefFrame >= 3 cases.
The maximum number of backward reference frames (L0) is NumRefFrame-1, while the number of forward reference frames (L1) is always 1 or 2, no matter how large NumRefFrame is. These numbers apply to progressive content; for interlaced input they are naturally multiplied by 2.
E.g., NumRefFrame = 10 results in at most 9 frames being used for backward prediction and at most 2 frames for forward prediction!
This asymmetric approach to coding is the cause of the maximum dts-pts shift of -2 frames that I saw.

What can be said about that? I think it would be a good idea to give programmers control over a NumForwardRefFrame parameter (a variable at the same level as GopPicSize/GopRefDist/NumRefFrame), because it strongly affects coding quality on the one hand and coding/decoding delay on the other.
