Intel® Integrated Performance Primitives

Is H.264 decoder threading broken in 5.3 upgrade?

rcpdesigns
Beginner
838 Views

After upgrading from 5.1 to 5.3, our H.264 decoder now uses 100% of one core regardless of the size of the movie it is decoding. For example, a simple 160x120 movie that would normally take less than 1% of the CPU now takes 50%, as one of the cores is maxed out at 100%.

Our client has rejected the current version of the player because of this extra CPU utilization. We can't, however, go back to 5.1, as there were a couple of decode errors. So we are now stuck.

At least two other posts have pointed this problem out and there has been no response from Intel. We need to know whether this problem has been verified and when it will be fixed.

Please, please respond to this post!

0 Kudos
19 Replies
Vladimir_Dudnik
Employee
838 Views

Hello,

could you please provide as much detail as possible about this issue? Specifically: the operating system, hardware platform, whether you link IPP statically or dynamically, and whether you changed anything in the UMC sources or are using them as is. It would also be helpful if you could attach the problem stream here.

Regards,
Vladimir

0 Kudos
rcpdesigns
Beginner
838 Views

Thank you for your response.

The 'simple_player.exe' that comes with the 5.3 samples will show this problem. Just run 'simple_player.exe' with an H.264 file and you will see that one core (of a dual-core system) is very, very busy while the other core does very little. Use a video that is very easy to decode, such as a small one (160x120). All videos show this problem, but with more complex videos it is harder to notice that one core is much busier than the other.

The OS is WinXP on a dual-core machine; static and dynamic linking produce the same results. We didn't change anything in the UMC sources.

Thanks!

0 Kudos
rcpdesigns
Beginner
838 Views

Also.... If we remove the 'threading' option by setting numThreads to 1, then the CPU utilization returns to normal (i.e. very low for a small video). Unfortunately this is not an option for larger files, as both cores are needed.
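
For reference, this is roughly how we set the thread count (a minimal sketch following the UMC sample API that also appears later in this thread; the surrounding setup and error handling are trimmed):

#include "umc_video_decoder.h"
#include "umc_h264_dec.h"

// Minimal sketch (UMC H.264 decoder, as in the IPP samples).
// numThreads = 1 keeps CPU usage normal; numThreads = 2 pegs one core.
void SetupDecoder(UMC::MediaData *pDataIn)
{
    UMC::VideoDecoderParams params;
    params.m_pData    = pDataIn;                 // wrapped input bitstream
    params.lFlags     = UMC::FLAG_VDEC_REORDER;
    params.numThreads = 1;                       // set to 2 to reproduce the issue
    UMC::H264VideoDecoder decoder;
    if (UMC::UMC_OK != decoder.Init(&params))
        return;                                  // init failed
    // ... GetFrame() loop goes here ...
}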

0 Kudos
rcpdesigns
Beginner
838 Views

BUMP!!!

It has been two months now and no solutions or offers of help. I have had to sit through two demos and explain to the Navy why our product takes 100% of a core to decode a simple H264 file. I have yet another demo next week and I really don't want to go through this again.

Could someone *please* look at this problem. As posted above, you can run the ipp-sample player to see this problem.

I am running an Intel Core 2 Duo (2.2 GHz). I believe we are using dynamic linking (we are linking with the LIBs in stublib).

Running Windows XP SP 2.

With numThreads=1 the CPU utilization is normal.

With numThreads=2 100% of one core is used and very little of the other.

0 Kudos
Vladimir_Dudnik
Employee
838 Views

Hello,

sorry for the delay. We tried to reproduce the issue but without success. Could you please attach a sample bitstream where you see the issue? It is important, as it will give us a chance to fix the issue (if any) in an upcoming IPP update. I've created an issue report on Intel Premier Support, so you will be notified of the status of the investigation.

Regards,
Vladimir

0 Kudos
rcpdesigns
Beginner
838 Views

It happens for all bitstreams.

Please note that sschaem@adobe.com also posted exactly the same problem a couple of months ago.

We did the following:

1) Download 5.3 samples

2) Build using build32.bat and 'cl8'

3) Ran simple_player.exe on MP4 files found on the Internet. There is a Harry Potter trailer on the 'net that is a small size (640x480). Running with one thread it uses about 2% of the CPU. Running with two threads it uses 52% of the CPU (100% of one core, 2% of the other).

It is important to realize that the overall performance is great; it is just that one of the CPU cores is spinning at 100% (and not actually doing anything). If you play the most complicated 1920x1080p H.264 file, it will still decode in real time.

Our problem is that when the customer looks at the CPU utilization, the product appears not to meet requirements, even though it really does.

The smallest sample I have is 10MB. I uploaded the files as an attachment several times, but each time at the end of the upload it said that an error occurred. Maybe it is too big?

Also, I don't know how to use Premier Support. Do you have a link I could follow to learn more?

0 Kudos
shyaki
Beginner
838 Views

I would recommend you use umc_h264_dec_con to reproduce the problem. My guess is that there was something wrong in simple_player when you did your test.

0 Kudos
franknatoli
New Contributor I
838 Views

I can confirm the same observation. When VideoDecoderParams::numThreads is set to one, and an H.264 file is decoded, WinXP SP2 Task Manager shows very modest CPU utilization on a dual core CPU. But when numThreads is set to two, the same file pegs the dual core CPU utilization. Would like to paste the Task Manager screen shots but cannot figure out how to do that with this forum.

0 Kudos
franknatoli
New Contributor I
838 Views
I found that the audio renderer has a number of vm_time_sleep(0) spins waiting on UMC_ERR_NOT_ENOUGH_BUFFERS. While no data was flowing through the audio renderer, the CPU was pegged at 100%. Look at the source of whatever UMC library you are using. If you see any vm_time_sleep(0), or vm_time_sleep(TIME_TO_SLEEP) where TIME_TO_SLEEP is zero (true except for WinCE), then you probably have your bad boy. I've alerted Intel to this and suggested replacing all spin loops with proper synchronization objects. If you prefer to debug UMC yourself, as I did, I can give you an updated Makefile that recompiles the libraries with debugging enabled. Will advise of their reply.
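
The pattern looks roughly like this (a simplified sketch for illustration, not the actual UMC source; the real loop lives in the audio renderer's ThreadProc):

// Simplified sketch of the busy-wait (illustrative only, not verbatim UMC code).
// vm_time_sleep(0) yields the time slice but returns immediately, so the loop
// keeps a core at 100% until data finally flows through the renderer.
void RendererLoopSketch()
{
    UMC::Status status;
    do
    {
        status = FeedAudioRenderer();      // hypothetical helper standing in for the real call
        if (UMC::UMC_ERR_NOT_ENOUGH_BUFFERS == status)
            vm_time_sleep(0);              // "sleep" for zero ms == spin
    } while (UMC::UMC_ERR_NOT_ENOUGH_BUFFERS == status);
}
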
0 Kudos
rcpdesigns2
Beginner
838 Views

Frank,

Thank you for verifying this and making a note. Hopefully something will now get done about it. We have since decided to use other codecs, but would like to go back to using IPP if this can get fixed.

I will look for the vm_time_sleep(0) calls and try to fix it myself. I would like to spend more time on it; however, my client pays by the hour and they DON'T like the idea of paying me to fix something they bought. They would much rather just buy something else if IPP isn't going to work, which is much, much more cost effective for them. I'd rather go through the code and fix/optimize it. :)

0 Kudos
rcpdesigns2
Beginner
838 Views
shyaki:

I would recommend you use umc_h264_dec_con to reproduce the problem. My guess is that there was something wrong in simple_player when you did your test.

As Frank confirms, this is NOT a simple_player problem, but rather a problem within the H.264 decoder threading system.

0 Kudos
Chao_Y_Intel
Moderator
838 Views

Hello,

I wrote a simple test code for this problem; I have copied my code below. My test shows that this problem is related to the simple_player application, not to the H.264 decoder.

When I use the following test code, which uses 2 threads, to decode a QCIF H.264 file, it has very low CPU usage, similar to single threading.

When I use simple_player to decode the same H.264 file, I see the problem: the CPU usage reaches almost 50%.

How are you measuring the decoder performance? You can use the following code to test how it works on your system.

Regards,
Chao


#include <stdio.h>
#include <stdlib.h>

#include "ipp.h"
#include "umc_defs.h"
#include "umc_video_decoder.h"
#include "umc_video_data.h"
#include "umc_h264_dec.h"
#include "umc_h264_video_encoder.h"
#include "umc_structures.h"
#include "umc_video_encoder.h"
#include "vm_time.h"

#define MAXFRAME 2000
#define BUFFSIZE 100000000


Ipp8u *cVideoData = NULL;
int VideoDataSize=0;

Ipp8u *cYUVData = NULL;
int imgWidth,imgHeight, imgSize, frameNumber;
struct vm_timeval TPOLD,TPCUR;


void DecodeStream()
{

UMC::MediaData * pDataIn;
UMC::VideoData * pDataOut;
UMC::Status status;
UMC::VideoDecoderParams *params = NULL;
UMC::VideoDecoder* pH264Decoder=NULL;
int exit_flag, i=0,numDecodedFrames, ret;
int Width, Height,ImageSize=0;
Ipp8u * pCurrOutput=NULL;

int nExpectedFrameNumber=0;
double elapse;

pH264Decoder = new UMC::H264VideoDecoder();
params = new UMC::VideoDecoderParams();
pDataIn = new UMC::MediaData();
pDataOut = new UMC::VideoData();

// Decoder Parameters
params->m_pData = pDataIn;
params->lFlags = UMC::FLAG_VDEC_REORDER;
params->pPostProcessing=NULL;
params->numThreads=2;
params->lpMemoryAllocator=NULL;


// INITIALIZATION
pDataIn->SetBufferPointer(cVideoData,VideoDataSize);
pDataIn->SetDataSize(VideoDataSize);

status = pH264Decoder->Init(params);
if (UMC::UMC_OK != status)
return;

status=pH264Decoder->GetInfo(params); // Getting WIDTH and HEIGHT
if(UMC::UMC_OK != status)
return;
Width=params->info.clip_info.width;
Height=params->info.clip_info.height;
ImageSize=Width*Height*1.5; // we support YUV420 here.

pDataOut->Init(Width,Height,UMC::YV12,8);
pCurrOutput=cYUVData; // We assume the input data is aligned; if it is not aligned, please handle it yourself
pDataOut->SetBufferPointer(pCurrOutput,ImageSize);


numDecodedFrames= 0;
exit_flag=0;
ret=UMC::UMC_OK;
vm_time_gettimeofday( &TPOLD, NULL);
while ((numDecodedFrames < MAXFRAME)&&(exit_flag!=1) )
{
while(numDecodedFrames>nExpectedFrameNumber)
{
vm_time_sleep(10);
vm_time_gettimeofday( &TPCUR, NULL);
elapse=((double) (TPCUR.tv_sec-TPOLD.tv_sec))*1000000 + TPCUR.tv_usec-TPOLD.tv_usec;
nExpectedFrameNumber = elapse/1000000*30;
}


ret = pH264Decoder->GetFrame(pDataIn, pDataOut);

if (ret == UMC::UMC_OK)
{
numDecodedFrames++;
pCurrOutput= pCurrOutput+ImageSize;
pDataOut->SetBufferPointer(pCurrOutput,ImageSize);
}

if ((ret !=UMC::UMC_OK)) {
exit_flag = 1;
}
}

do
{
ret = pH264Decoder->GetFrame(NULL, pDataOut);
if (ret == UMC::UMC_OK)
{
numDecodedFrames++;
//pCurrOutput = pCurrOutput+ImageSize; // All frames are decoded into pCurrOutput; the system does not have a large enough buffer
// for all decoded YUV data.
pDataOut->SetBufferPointer(pCurrOutput,ImageSize);

}
}while (ret == UMC::UMC_OK);


// memory deallocation, de-initialization
delete pH264Decoder;
delete params;
delete pDataIn;
delete pDataOut;

//save video info to store to file
imgWidth=Width; imgHeight=Height;
imgSize=ImageSize; frameNumber=numDecodedFrames;
return;
}


int ReadVideoData(char* strFilename)
{
int i=0;
FILE* fp = fopen(strFilename, "rb");
if(fp==NULL)
return 0;
while(!feof(fp)) {
i += fread(cVideoData + i, 1, 10, fp);
if(i==BUFFSIZE)
break;
}
fclose(fp);
return i;
}


int main(int argc, vm_char* argv[]) {
ippStaticInit();

cVideoData = ippsMalloc_8u(BUFFSIZE);
cYUVData = ippsMalloc_8u(BUFFSIZE);

imgWidth=176;
imgHeight=144;
imgSize = imgWidth*imgHeight*1.5;

VideoDataSize=ReadVideoData("n4.h264");
DecodeStream();

double decodetime= (TPCUR.tv_sec-TPOLD.tv_sec)+ ((double)(TPCUR.tv_usec-TPOLD.tv_usec))/1000000;
printf("time = %f ",decodetime);

return 0;
}

0 Kudos
sschaem
Beginner
838 Views

You are not using this in a real situation; you are decoding frames as fast as possible.

The player schedules the decoding, like a video player should.

I just don't understand why Intel sits on this problem month after month with no resolution.

It's real...

0 Kudos
Chao_Y_Intel
Moderator
838 Views

Hello,

The test code is not trying to decode as fast as possible. It tries to keep the frame rate at 30 fps so you can watch what the CPU usage is. Please note the following code in the test; it is used to keep the frame rate around 30 fps:

while(numDecodedFrames>nExpectedFrameNumber)
{
vm_time_sleep(10);
vm_time_gettimeofday( &TPCUR, NULL);
elapse=((double) (TPCUR.tv_sec-TPOLD.tv_sec))*1000000 + TPCUR.tv_usec-TPOLD.tv_usec;
nExpectedFrameNumber = elapse/1000000*30;
}
.....

Could you please let us know how you are measuring the performance? That will help us to understand the problem. If the code cannot be posted here, could you please submit the issue to https://premier.intel.com

I can see the issue in simple_player. We are working on a fix for it.

Thanks,
Chao

0 Kudos
franknatoli
New Contributor I
838 Views
Simple player also includes audio, which your demo above does not. I've already documented, on https://premier.intel.com issue 469411, that the audio renderer [used by simple_player] has a vm_time_sleep(0) that pegs the CPU at 100%. I observed and suggested the following:

Take a look at basic_audio_render.cpp, ThreadProc, line 114, see vm_time_sleep(0). This is in a UMC_ERR_NOT_ENOUGH_DATA loop that pegs the CPU until data flows through the renderer. Other sources making use of this technique are umc_spl_base.cpp and avsync.cpp. Further, avsync.cpp defines macro TASK_SWITCH to invoke vm_time_sleep(0).

There are numerous examples elsewhere in the API of spin waits with small timeouts (e.g. 5 or 10 milliseconds) for similar events.

I must strongly recommend that all spin waits be replaced with waits on event objects. I realize that synchronization objects in general are very operating system specific, but spin waits are not acceptable.
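
For example, the shape of the fix is roughly this (a minimal Win32 sketch, since the report is on WinXP; the UMC vm layer has its own event wrapper that the real sources should use instead):

#include <windows.h>

static HANDLE hDataReady;                        // auto-reset event shared by both sides

void ConsumerWaits()
{
    // Instead of `while (not enough data) vm_time_sleep(0);` ...
    WaitForSingleObject(hDataReady, INFINITE);   // blocks without burning a core
    // ... render the data that is now available ...
}

void ProducerDelivers()
{
    // ... queue the data, then wake the waiting consumer ...
    SetEvent(hDataReady);
}

int main()
{
    hDataReady = CreateEvent(NULL, FALSE, FALSE, NULL);  // auto-reset, initially unsignaled
    // real code would start producer/consumer threads here
    CloseHandle(hDataReady);
    return 0;
}
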
0 Kudos
rcpdesigns
Beginner
838 Views

Frank,

I too cannot understand why Intel does not acknowledge and fix this problem. They have spent more time trying to prove it doesn't exist than they would have spent just fixing it. I know this because I have found and fixed the problem. It is very much related to the same issues you have talked about; however, in this case they are actually using an event to pace the thread... they just didn't initialize the event. The problem was that someone either didn't understand (or forgot) how their dynamic array class works. This doesn't speak very highly of the rest of the code, and further investigation proved that to be correct. I guess we have to understand that these are just 'samples' to show how you might use the IPP libs. I think we will be looking at other codecs that are sold as products, not as samples for a library of primitives.

0 Kudos
Chao_Y_Intel
Moderator
838 Views

Dear All,

Our engineering team has identified a threading problem in the codec. Here is a fix for the problem:

File: codec\h264_dec\src\umc_h264_task_broker.cpp:


bool TaskBrokerTwoThread::Init(Ipp32s iConsumerNumber)
{
    if (!TaskBroker::Init(iConsumerNumber))
        return false;

    m_nWaitingThreads = 0;

    // initialize event(s)
    for (Ipp32s i = 0; i < m_iConsumerNumber; i += 1)
    {
        if (NULL == m_eWaiting)
        {
            if (false == m_eWaiting.AllocateOneMore())
                return false;

            if (UMC_OK != m_eWaiting->Init(0, 0))
                return false;
        }
    }
    return true;
}

Change to:

bool TaskBrokerTwoThread::Init(Ipp32s iConsumerNumber)
{
    if (!TaskBroker::Init(iConsumerNumber))
        return false;

    m_nWaitingThreads = 0;
    m_eWaiting.Init(m_iConsumerNumber);

    // initialize event(s)
    for (Ipp32s i = 0; i < m_iConsumerNumber; i += 1)
    {
        if (UMC_OK != m_eWaiting->Init(0, 0))
            return false;
    }

    return true;
}

Could you please check whether this fixes the problem for you?
Thank you all again for your patience with this problem.

Regards,
Chao

0 Kudos
rcpdesigns
Beginner
838 Views

Yes, that is the problem.

I really don't understand why your team is so reluctant to admit and fix problems. The only reason this problem got fixed was because I fixed it and called them out on it. No one is perfect and everyone makes mistakes... it is quite silly to think otherwise.

0 Kudos
samv
Beginner
838 Views
This doesn't fix the problem for me.

I'm statically linking the IPP libraries on MacOS X and decoding a 720x576 standard definition H.264 stream. The hardware is a Mac Pro with 2x2.66 Dual Core Xeons, i.e. four cores.

I have code that uses QuickTime to decode the stream and it barely registers any CPU usage at all during decoding. When I substitute the Intel code the CPU usage goes to 60% on all four cores which is completely unacceptable. Most of the additional CPU usage is in the kernel.

The above change makes little or no difference to the problem. Interestingly, I can make the problem go away simply by commenting out the AutomaticUMCMutex at the top of TaskBroker::GetNextTask(). Note that this only works if the previously suggested fix _isn't_ applied. If both are applied the code will deadlock.

I haven't had the time to investigate the locking model in this code so I can't recommend my change to anyone, but it does work. What's the story with the LIGHT_SYNC preprocessor definition and why are TryRealGet() and TryToGet() commented out at the top of the file?

I don't understand why the thread synchronisation code is burning CPU cycles. If the locking is too coarse grained then threads should simply sleep when they could otherwise be doing work. Something else must be going on that's causing threads to wake up unnecessarily and sleep again all the time. I wonder if it's related to the use of recursive mutexes in vm_mutex_init. Making them non-recursive causes an immediate deadlock, so the code is certainly re-entering critical sections. This rang alarm bells for me and I wouldn't mind betting that it's related to the CPU usage problem.
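
To illustrate the re-entrancy point (a simplified sketch, not UMC code; a recursive mutex silently allows the same thread to lock twice, which is exactly what hides the re-entering):

#include <pthread.h>

static pthread_mutex_t g_lock;

void InitLock(int recursive)
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    if (recursive)
        pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    pthread_mutex_init(&g_lock, &attr);
    pthread_mutexattr_destroy(&attr);
}

void Inner()
{
    pthread_mutex_lock(&g_lock);             // second lock by the same thread
    pthread_mutex_unlock(&g_lock);
}

void Outer()
{
    pthread_mutex_lock(&g_lock);
    Inner();                                 // recursive: fine; non-recursive: deadlocks here
    pthread_mutex_unlock(&g_lock);
}
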

If I get the time I'll look into it further, but quite frankly, I shouldn't have to. We're currently evaluating the Intel IPP libraries and my first impression is that the supporting code needs a lot of work before it can be considered production quality.
0 Kudos