High latency when decoding an H.264 stream with Media SDK

Brooks_L_
New Contributor I
2,635 Views

Folks, I am having trouble decoding an H.264 stream with Media SDK (Media SDK 2014 for Clients). To decode a real-time H.264 stream from an encoder, I followed every latency-reduction tip in the Media SDK documentation and in Media_SDK_video_conferencing, so the configuration on my decoder side is as follows:

m_mfxVideoParams.mfx.CodecId = MFX_CODEC_AVC;
m_mfxVideoParams.IOPattern = MFX_IOPATTERN_OUT_VIDEO_MEMORY;
m_mfxVideoParams.AsyncDepth = 1;

...

 m_mfxBS.MaxLength = 1024 * 1024;
 m_mfxBS.Data = new mfxU8[m_mfxBS.MaxLength];
 m_mfxBS.DataFlag = MFX_BITSTREAM_COMPLETE_FRAME;

...

However, the decoding latency is still more than 10 seconds. I measured it by comparing the timestamp shown on the source video screen with the one on the presentation screen at the decoding side (I set this up on purpose). Is there a limit on this in Media SDK, or did I miss something?

Previously, I decoded the stream with ffmpeg. The latency was quite satisfying, less than 1 second.

I would highly appreciate any hints!

Thanks in advance!

Brooks Li.

0 Kudos
19 Replies
Surbhi_M_Intel
Employee
2,634 Views

Hi Brooks Li,

Thank you for the question. One thing I see right away: change the async depth to 4 or more to parallelize the decoding loop. With this you should definitely see an increase in decoding speed, which should reduce the latency.
I need a few details to analyze this issue better:
1. System configuration. Please run the Media SDK system analyzer (details can be found here: https://software.intel.com/en-us/articles/details-necessary-for-filing-forum-questions).
2. What input are you using, and at what resolution?
3. Can you reproduce this issue using the existing video decoding sample? In general, the decoding speed of the Media SDK AVC decoder is pretty good, around 1000 frames per second for a 1080p HD bitstream.

Thanks,
Surbhi

0 Kudos
Brooks_L_
New Contributor I
2,634 Views

Hi, Surbhi

Thank you for the quick response. I increased the async depth to 4, but the result is not what we expected: I see no change in latency. Following the document you mentioned in #1, I collected the analyzer log, which shows my system and details of the installed SDK, and the tracer log, which shows my API calls during the decoding process. Both are attached here. The video resolution is 1280x720. I could not reproduce the issue with the original sample code, because the sample reads from an input file while my scenario is a live stream fed over local Ethernet.

Looking forward to your further help after you check the logs. If you have any questions, please do not hesitate to ask.

B.R.

Brooks

0 Kudos
Sravanthi_K_Intel
2,634 Views

Hello there,

You mention that your app is for video conferencing. I am wondering whether the encoding process is tailored for low latency as well? For low-latency mode, we have a tutorial, simple_6_transcode_opaque_lowlat, that shows the parameters to set for low latency for both decode and encode. Can you please run that application and check whether you see the same behavior?

    // Configuration for low latency for encode
    mfxEncParams.AsyncDepth = 1;    //1 is best for low latency
    mfxEncParams.mfx.GopRefDist = 1;        //1 is best for low latency, I and P frames only

Can you please update your graphics driver to the latest? You seem to be running an older version.

Regarding your tracer.log, there are some messages at the end that suggest the stream may have changed some video parameters at frame #3 or so. Can you verify that as well?

=====mfxSTATUS MAP=====
DecodeFrameAsync: EXTERNAL : MFX_ERR_MORE_SURFACE (167)
DecodeFrameAsync: EXTERNAL : MFX_ERR_NONE (167)
DecodeFrameAsync: EXTERNAL : MFX_WRN_VIDEO_PARAM_CHANGED (27)
SyncOperation(D): EXTERNAL : MFX_ERR_NONE (167)
0 Kudos
Brooks_L_
New Contributor I
2,634 Views

Hi Sravanthi

Actually, the encoder has indeed been tailored for my conferencing scenario and is configured as the tutorial shows. I am sure the problem lies only on the decoder side, because I have two versions of the decoding application. The first uses ffmpeg to decode the stream and works well, with about half a second of latency as I observed. The second is the current one, built on top of Intel Media SDK. This version produces more than 10, even 20, seconds of latency.

This morning I updated the driver to the latest version. Your last suggestion about the parameter change helped me a lot. After checking my code carefully, I eventually found that I had made a mistake when feeding the buffer to the Intel decoder: the pointer offset was wrong. With that fixed, it now works fine and the latency is around 0.5 seconds, slightly longer than the ffmpeg version, but satisfying enough for me.
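The thread does not show the corrected feeding code, so the following is only a sketch of one common way to keep the offsets consistent. `Bitstream` is a hypothetical stand-in that mirrors the `Data`, `DataOffset`, `DataLength`, and `MaxLength` fields of `mfxBitstream`, and `AppendFrame` is an illustrative helper, not an SDK function. Unconsumed bytes are first moved to the start of the buffer, then the new encoded frame is appended behind them.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in mirroring the mfxBitstream fields used when feeding data.
struct Bitstream {
    std::vector<std::uint8_t> Data;  // backing buffer; Data.size() plays the role of MaxLength
    std::uint32_t DataOffset = 0;    // first byte the decoder has not consumed yet
    std::uint32_t DataLength = 0;    // number of valid bytes starting at DataOffset
};

// Appends one encoded frame behind the unconsumed bytes.
// Returns false, leaving bs unchanged, if the frame does not fit.
bool AppendFrame(Bitstream& bs, const std::uint8_t* frame, std::uint32_t size) {
    if (size > bs.Data.size() - bs.DataLength) return false;

    // Move leftover bytes to the front. The regions may overlap, hence memmove.
    if (bs.DataOffset != 0 && bs.DataLength != 0)
        std::memmove(bs.Data.data(), bs.Data.data() + bs.DataOffset, bs.DataLength);
    bs.DataOffset = 0;

    // New data always lands right after the leftover bytes, never at a stale offset.
    if (size != 0)
        std::memcpy(bs.Data.data() + bs.DataLength, frame, size);
    bs.DataLength += size;
    return true;
}
```

With this shape, the decoder call would always see `DataOffset == 0` right after a refill, and `DataLength` would cover exactly the leftover bytes plus the new frame.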

I had hoped all the issues were gone. Unfortunately, another problem has appeared: about every 2 or 3 seconds, the Intel decoder outputs a badly distorted picture. (In fact, for about the first 2 seconds after initialization, the decoder always outputs distorted pictures. After that period the picture becomes normal, but a distorted one still appears periodically, as I said.) I captured a log with the tracer and have attached tracer2.txt here for your professional insight. Hoping you can save me again. :)

Would you please be so kind as to help me out once more?

 

B.R.

Brooks Li.

0 Kudos
Sravanthi_K_Intel
2,634 Views

Regarding the decoder distorting the output every so often, I suspect it has something to do with the bitstream it is decoding. The decoder has been pretty stable and non-buggy, so you may be running into a corner case. Can you try big_buck_bunny or other commonly used bitstreams to test this theory? Please let me know if you can narrow the issue down a little more. Also, if you could send in the bitstream, we can take a look at it.

A few more suggestions: check the bitrate, frame rate, etc. If the bitrate is too low or too high, or there are quick scene changes, that could play a role as well.

0 Kudos
Brooks_L_
New Contributor I
2,634 Views

Hey, Sravanthi, I just found the root cause of the distortion problem. My decoding control flow was not the same as the sample's. I rewrote the whole loop, in which I dequeue the incoming H.264 encoded frames one by one and then pass them to the Intel decoder, so that it matches the 3D decoding sample. I think the most important change is that I now wait when the status code indicates 'busy', and I load new encoded frames only when the status says more data is required. It now works well, generating continuous, clean pictures.
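The rewritten loop itself is not posted, so what follows is a hedged sketch of the control flow described above rather than the actual code. `Status` is a hypothetical enum standing in for `MFX_ERR_NONE`, `MFX_ERR_MORE_DATA`, `MFX_ERR_MORE_SURFACE`, and `MFX_WRN_DEVICE_BUSY`; `RunDecodeLoop`, `LoopStats`, and the two callbacks are illustrative names. The callbacks stand in for a `DecodeFrameAsync`-style call and for dequeuing the next encoded frame.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical stand-ins for the Media SDK status codes named in the lead-in.
enum class Status { Ok, MoreData, MoreSurface, DeviceBusy };

struct LoopStats {
    int framesOut = 0;      // decoded frames produced
    int inputsFed = 0;      // encoded frames handed to the decoder
    int busyWaits = 0;      // retries after a 'busy' status
    int surfacesTaken = 0;  // times another free surface was requested
};

// decodeStep: one decode call on the current input.
// feedInput:  loads the next encoded frame; returns false when the stream ends.
LoopStats RunDecodeLoop(const std::function<Status()>& decodeStep,
                        const std::function<bool()>& feedInput) {
    LoopStats stats;
    if (!feedInput()) return stats;  // nothing to decode
    ++stats.inputsFed;

    for (;;) {
        switch (decodeStep()) {
        case Status::DeviceBusy:
            // Hardware is busy: wait briefly and retry with the SAME input.
            ++stats.busyWaits;
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
            break;
        case Status::MoreSurface:
            // Decoder needs another work surface: pick a free one, keep the same input.
            ++stats.surfacesTaken;
            break;
        case Status::MoreData:
            // Only now is it correct to load a new encoded frame.
            if (!feedInput()) return stats;
            ++stats.inputsFed;
            break;
        case Status::Ok:
            // A frame is ready; real code would synchronize and render it here.
            ++stats.framesOut;
            break;
        }
    }
}
```

The point of the structure is that new input is loaded in exactly one branch. Feeding a fresh frame on 'busy' or 'more surface' would overwrite data the decoder has not consumed yet, which is one plausible way to end up with corrupted pictures.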

So far, the decoding functionality is OK, but the performance is not as good as expected. On an Intel Core i5-3470, one 1080p 30 fps stream leads to around 25% CPU usage, which is obviously too high. After some tracing, I was surprised to find that the most expensive part is the memory copy that moves NV12 data from the Intel decoder surface to another place. Any good suggestions on how to improve this part?

0 Kudos
Brooks_L_
New Contributor I
2,634 Views

Additionally, when I comment out the memcpy there, CPU usage drops dramatically to about 5%. However, my decoder application has several other places where a memcpy of a data chunk of almost the same size takes place. So I think the memcpy at this spot must be much heavier than in any other place, and the root cause should NOT lie in memcpy itself. What do you think?

0 Kudos
Sravanthi_K_Intel
2,634 Views

Hello Brooks - Can you explain where you are doing a "memcpy"? (Correct me if my assumptions are wrong) If you are using Video memory and surfaces (basically HW implementation), there are no mallocs and memcpy from system to video memory. As soon as you use memcpy, you are operating in the system memory which is known to be less performant than video memory.

Are you referring to the YUV to NV12 conversion, where we read a YUV file and convert it to NV12 for processing? If so, the tutorials take the easy route for color conversion (coding it up in C++!). Alternatively, you can use VPP for color conversion, which will surely give you better performance.

In any case, can you give me some more information? For example, if you paste the code snippet where the memcpy happens, I can help you with that.

0 Kudos
Brooks_L_
New Contributor I
2,634 Views

Thanks, Sravanthi, for your warmhearted and patient help. Please see the relevant code snippet below:

m_sts = m_pMfxDEC->DecodeFrameAsync(&m_mfxBS, m_pmfxSurfaces[nIndex], &pmfxOutSurface, &syncp);

// Ignore warnings if output is available;
// if there is no output and no action is required, just repeat the DecodeFrameAsync call
if (MFX_ERR_NONE < m_sts && syncp)
    m_sts = MFX_ERR_NONE;

if (MFX_ERR_NONE == m_sts)
    m_sts = m_mfxSession.SyncOperation(syncp, 60000); // Synchronize. Wait until the decoded frame is ready

if (MFX_ERR_NONE == m_sts)
{
    // Surface locking is required when reading/writing D3D surfaces
    m_sts = m_mfxAllocator.Lock(m_mfxAllocator.pthis, pmfxOutSurface->Data.MemId, &(pmfxOutSurface->Data));
    MSDK_RETURN_ON_ERROR(m_sts);

    mfxFrameInfo *pInfo = &pmfxOutSurface->Info;
    mfxFrameData *pData = &pmfxOutSurface->Data;

    // Copy Y data to the temporary buffer, one cropped row at a time
    BYTE* pY = pData->Y + (pData->Pitch * pInfo->CropY + pInfo->CropX);
    for (int row = 0; row < pInfo->CropH; row++)
    {
        memcpy(pTmpBuf, pY, pInfo->CropW);
        pTmpBuf += pInfo->CropW;
        pY += pData->Pitch;
    }

    // Copy interleaved UV data to the temporary buffer.
    // The UV plane has half as many rows as Y, so its vertical crop offset is CropY / 2.
    BYTE* pUV = pData->UV + (pData->Pitch * (pInfo->CropY / 2) + pInfo->CropX);
    for (int row = 0; row < pInfo->CropH / 2; row++)
    {
        memcpy(pTmpBuf, pUV, pInfo->CropW);
        pTmpBuf += pInfo->CropW;
        pUV += pData->Pitch;
    }

    m_sts = m_mfxAllocator.Unlock(m_mfxAllocator.pthis, pmfxOutSurface->Data.MemId, &(pmfxOutSurface->Data));
    MSDK_RETURN_ON_ERROR(m_sts);
}

 

After a frame is decoded successfully, I need to get the decoded NV12 data from the surface, copy it to a temporary buffer, and then send it to the presentation buffer list. My renderer takes NV12 buffers from that list one by one and draws them on screen with D3D.
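To put a number on that copy, here is a self-contained sketch of the same pitched-to-packed NV12 copy. `Nv12View`, `Nv12PackedSize`, and `CopyNv12Packed` are illustrative names, with fields modeled on the `Pitch` and `Crop*` members used in the snippet above. For a 1920x1080 frame the packed size comes to 3,110,400 bytes, so at 30 fps roughly 93 MB is read out of the locked surface every second. Reads from locked video memory are commonly reported to be much slower than reads from ordinary system memory, which would fit the observation that this particular memcpy dominates. The UV offset uses `CropY / 2` because the chroma plane has half as many rows as luma.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical read-only view of a locked NV12 surface.
struct Nv12View {
    const std::uint8_t* Y;   // start of the luma plane
    const std::uint8_t* UV;  // start of the interleaved chroma plane
    std::size_t Pitch;       // bytes per row in memory, which may exceed the visible width
    std::size_t CropX, CropY, CropW, CropH;
};

// Bytes needed for a tightly packed NV12 frame: full-size Y plus half-size UV.
constexpr std::size_t Nv12PackedSize(std::size_t width, std::size_t height) {
    return width * height * 3 / 2;
}

// Copies the cropped region row by row into dst and returns the bytes written.
// dst must hold at least Nv12PackedSize(CropW, CropH) bytes.
std::size_t CopyNv12Packed(const Nv12View& s, std::uint8_t* dst) {
    std::uint8_t* out = dst;

    const std::uint8_t* y = s.Y + s.Pitch * s.CropY + s.CropX;
    for (std::size_t row = 0; row < s.CropH; ++row, y += s.Pitch, out += s.CropW)
        std::memcpy(out, y, s.CropW);

    // Chroma rows cover two luma rows each, so the vertical offset is halved.
    const std::uint8_t* uv = s.UV + s.Pitch * (s.CropY / 2) + s.CropX;
    for (std::size_t row = 0; row < s.CropH / 2; ++row, uv += s.Pitch, out += s.CropW)
        std::memcpy(out, uv, s.CropW);

    return static_cast<std::size_t>(out - dst);
}
```

The copy itself is correct; the cost comes from where the source bytes live. Rendering the decoder's surface directly, without a round trip through system memory, avoids it entirely.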

Maybe there is another possible improvement here: moving data from the decoder output surface to the D3D surface directly. But how should I do it?

B.R.

Brooks Li.

 

0 Kudos
Sravanthi_K_Intel
2,636 Views

Hello Brooks - If you look at our sample_decode, we provide the option of writing the decoded stream to a file, rendering on screen, or running without producing any output (which is not what you want). The first option is known to slow down performance, since fwrite of a raw video stream can choke the system (the disk and CPU in particular). That looks similar to what you are doing here.

So, I would suggest you use the render option that we have in sample_decode:pipeline_decode.cpp -> mfxStatus CDecodingPipeline::CreateRenderingWindow(sInputParams *pParams, bool try_s3d)

This function has the details you are looking for - how to operate (or render) on d3d surfaces. Hope this helps.

0 Kudos
Sravanthi_K_Intel
2,634 Views

Hello Brooks - For your question on compiling the samples, "I also tried to compile it on another system, win8.1 + Intel Media SDK 2014 for Clients, but still failed. It seems that some new macros are added into SDK, is it? Should I upgrade SDK to newest one?"

You need to use the R2 version of the Media SDK for Clients, which is still available online. Our latest SDK offering can be found here: https://software.intel.com/en-us/intel-media-server-studio

0 Kudos
Brooks_L_
New Contributor I
2,634 Views

Hi, Sravanthi, I do not know how to express my appreciation and excitement! After integrating the pipeline solution, the average CPU usage is down to 5% for one real-time 1080p 30 fps stream. The performance improvement is awesome!

Thank you so much!

Brooks Li.

0 Kudos
Sravanthi_K_Intel
2,634 Views

Hello Brooks - Thanks for your kind words. Very glad you are satisfied with the performance.

0 Kudos
Brooks_L_
New Contributor I
2,634 Views

Hi, Sravanthi

I've got a new problem with this pipeline solution, which is based on the sample_decode code. For some reason, I need to capture the decoded picture and save its RGB data to a local file, as BMP, JPG, or something else. But when I try to do it with the DirectX API D3DXSaveSurfaceToFile, the return value is always E_FAIL. My code snippet looks like this:

        IDirect3DSurface9* pDecSurf = (IDirect3DSurface9*)pSurface->Data.MemId;
        IDirect3DDevice9* pD3Device = m_hwdev->GetD3DDevice();

        if( pD3Device )
        {
            HRESULT hr = D3DXSaveSurfaceToFile( pDestFileName, D3DXIFF_BMP, pDecSurf, NULL, NULL );
        }

Above, pSurface is the decoder output, delivered to the CDecodeD3DRender::RenderFrame method as one of its input parameters. Is there anything wrong?

Hope you could recall what we once discussed in this thread...

 

0 Kudos
Sravanthi_K_Intel
2,634 Views

Hi Brooks,

Your code is missing a few steps: mapping pSurface to a D3D9 surface, locking it when writing, and so on. Take a look at the D3DFrameAllocator::LockFrame(mfxMemId mid, mfxFrameData *ptr) function in sample_common/src/d3d_allocator.cpp to understand how to do that. Once you do these correctly, you can save the surface in the format you want (make note of the pSurface format as well).
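The thread does not say how the surface was finally saved. One likely reason the format matters is that decoder output is normally NV12 rather than RGB, so pixel data read from a locked surface has to be converted before it can go into a BMP. The sketch below is an assumption-laden illustration, not the sample's code: it assumes 8-bit BT.601 studio-range video and uses the widely published integer approximation. `Rgb`, `Nv12PixelToRgb`, and `Nv12At` are illustrative names, and an HD stream may call for BT.709 coefficients instead.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

struct Rgb {
    std::uint8_t r, g, b;
};

// Clamps an intermediate result to the 0..255 range of an 8-bit channel.
inline std::uint8_t Clamp8(int v) {
    return static_cast<std::uint8_t>(std::min(255, std::max(0, v)));
}

// Converts one studio-range YUV sample to RGB (BT.601 integer approximation, assumed).
Rgb Nv12PixelToRgb(std::uint8_t y, std::uint8_t u, std::uint8_t v) {
    const int c = y - 16;
    const int d = u - 128;
    const int e = v - 128;
    return {Clamp8((298 * c + 409 * e + 128) / 256),
            Clamp8((298 * c - 100 * d - 208 * e + 128) / 256),
            Clamp8((298 * c + 516 * d + 128) / 256)};
}

// Reads pixel (x, row) from pitched NV12 planes.
// Each interleaved U,V pair is shared by a 2x2 block of luma samples.
Rgb Nv12At(const std::uint8_t* yPlane, const std::uint8_t* uvPlane,
           std::size_t pitch, std::size_t x, std::size_t row) {
    const std::uint8_t* uv = uvPlane + (row / 2) * pitch + (x / 2) * 2;
    return Nv12PixelToRgb(yPlane[row * pitch + x], uv[0], uv[1]);
}
```

A per-pixel CPU conversion like this is fine for an occasional snapshot. For every frame of a live stream, a hardware path such as the VPP color conversion mentioned earlier in the thread would be the better fit.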

0 Kudos
Brooks_L_
New Contributor I
2,634 Views

Very well, it works fine. How I wish I could mark 2 replies as 'Best Reply' in this thread!

Thank you very much, Sravanthik!

0 Kudos
Sravanthi_K_Intel
2,634 Views

Thank you, Brooks, glad it worked for you. Happy coding!

(For any other questions, may I suggest you start a new thread and reference this inside it if needed? I will close this one as solved).

0 Kudos
Brooks_L_
New Contributor I
2,634 Views

Sure, I will start a new thread for any other discussion in the future. You can close this one as resolved.

0 Kudos