Why is my encode application's performance so low?

Jin_Shuyang · ‎03-14-2012

I wrote an encoding application that encodes streams in real time; it is a console app. I have 3 simultaneous 1080i streams. My processing pipeline is: VPP: color conversion (YUY2 -> NV12) and deinterlacing (1080i -> 1080p), then ENC (YUV -> H.264).
My CPU is a 2600K; OS: Windows 7 Ultimate 32-bit.
My configuration: H.264 codec, 1920x1080, 25 fps, 4 Mbps, MSDK balanced quality setting, system memory surfaces.
My problems and questions:

1. With 2 simultaneous 1080i inputs it can output 25 fps 1080p (CPU utilization is 6% per stream), but with 3 simultaneous inputs the output is only 18-20 fps (CPU utilization is 8% per stream). With 6 simultaneous 720x576i 30 fps inputs it can output 30 fps, but with 7 the output drops to 25 fps, and with 8 it drops to 15-18 fps. Why? Petter Larsson said it has great performance.

2. Does the memory surface type greatly influence performance?

3. Is a 64-bit OS better than a 32-bit OS for encode performance?

4. I set bd3dAlloc=true. In the following code (pipeline_encode.cpp, function Allocframes()), call (1) returns MFX_ERR_NONE but call (2) returns MFX_ERR_MEMORY_ALLOC. Why, and how can I resolve it? With bd3dAlloc=false there is no problem.

EncRequest.NumFrameMin = nEncSurfNum;
EncRequest.NumFrameSuggested = nEncSurfNum;
memcpy(&(EncRequest.Info), &(m_mfxEncParams.mfx.FrameInfo), sizeof(mfxFrameInfo));
EncRequest.Type = MFX_MEMTYPE_EXTERNAL_FRAME | MFX_MEMTYPE_FROM_ENCODE;
if (m_pmfxVPP)
{
    EncRequest.Type |= MFX_MEMTYPE_FROM_VPPOUT; // surfaces are shared between vpp output and encode input
}

// add info about memory type to request
EncRequest.Type |= m_bd3dAlloc ? MFX_MEMTYPE_VIDEO_MEMORY_DECODER_TARGET : MFX_MEMTYPE_SYSTEM_MEMORY;

// alloc frames for encoder
sts = m_pMFXAllocator->Alloc(m_pMFXAllocator->pthis, &EncRequest, &m_EncResponse); // (1)
MSDK_CHECK_RESULT(sts, MFX_ERR_NONE, sts);

// alloc frames for vpp if vpp is enabled
if (m_pmfxVPP)
{
    VppRequest[0].NumFrameMin = nVppSurfNum;
    VppRequest[0].NumFrameSuggested = nVppSurfNum;
    memcpy(&(VppRequest[0].Info), &(m_mfxVppParams.mfx.FrameInfo), sizeof(mfxFrameInfo));
    VppRequest[0].Type = MFX_MEMTYPE_EXTERNAL_FRAME | MFX_MEMTYPE_FROM_VPPIN;

    // add info about memory type to request
    VppRequest[0].Type |= m_bd3dAlloc ? MFX_MEMTYPE_VIDEO_MEMORY_DECODER_TARGET : MFX_MEMTYPE_SYSTEM_MEMORY;
    VppRequest[1].Type |= m_bd3dAlloc ? MFX_MEMTYPE_VIDEO_MEMORY_DECODER_TARGET : MFX_MEMTYPE_SYSTEM_MEMORY;

    sts = m_pMFXAllocator->Alloc(m_pMFXAllocator->pthis, &(VppRequest[0]), &m_VppResponse); // (2)
    MSDK_CHECK_RESULT(sts, MFX_ERR_NONE, sts);
}

Petter_L_Intel · ‎03-14-2012

Hi Jin,

Let me try to address your questions one by one.

1. I did a quick test by modifying the Media SDK sample_encode to enable a pipeline like yours. For two concurrent streams I see ~60-70 fps per stream; 4 streams: 35 fps per stream; 6 streams: 25 fps per stream.
I am not sure what may be different in your setup. Possibly other bottlenecks such as file access, implicit surface copies, or color conversions?

One way to determine pipeline efficiency is to use the Intel Graphics Performance Analyzer (GPA). This tool is free; for more info please refer to this white paper: http://software.intel.com/en-us/articles/using-intel-graphics-performance-analyzer-gpa-to-analyze-intel-media-software-development-kit-enabled-applications/

2. Yes, the type of memory surfaces used will have a large impact on HW accelerated path performance, especially for a coupled pipeline like yours. I used only D3D surfaces for the benchmark above. If you use system memory to store your surfaces, many surface copies will occur.

3 & 4: A 32-bit OS only has access to a limited amount of graphics memory. Since you are running on a 32-bit OS, you are likely running out of memory when using D3D surfaces (bd3dAlloc=true). For heavy loads such as yours I encourage you to use a 64-bit OS instead, with plenty of RAM.
Performance itself should not differ much between 32-bit and 64-bit.

Some questions:
- What driver version are you using?

- Is each pipeline (VPP+Encode) hosted in separate thread?

Regards,

Petter

Jin_Shuyang · ‎03-15-2012

1. My driver version is 8.15.10.2509.
2. Yes, my app is a command-line app. It has one pipeline thread (VPP + Encode) that processes a video stream, and another thread that handles things like AAC and the network. It processes the input stream and sends it to the network. The network is not the bottleneck, because it runs in another thread and has a list as a buffer. I run n instances of the app on one computer.

I ran the mchecker tool and got a report, which I have uploaded. Petter, can you check it for me? Thanks.
I think my VPP uses too much of the GPU; is that so? Does your test case use VPP?
My input stream is 1920x1080i YUY2.
First, I use VPP to convert the input to 1920x1080p NV12.
Second, I use the encoder to encode it to an H.264 stream.

I am preparing to move this project to an x64 OS.

Petter_L_Intel · ‎03-15-2012

Hi Jin,

I do not recommend using Media Checker. It is a tool that has not been updated in a while and is known not to work well in many scenarios. If you want to capture more detailed Media SDK API traces, I recommend instead using the tracer tool that is part of the Media SDK package (per-frame logging provides more details).

So you are using one thread per Media SDK session?

Yes, the setup I created to replicate your scenario is YV12 interlaced -> (VPP) -> NV12 progressive -> (Encode) -> H.264
The only difference is that you are using YUY2. I do not expect any difference in performance.

Regards,

Petter

Jin_Shuyang · ‎03-15-2012

Yes, I am using one thread per Media SDK session.

Does that have a great impact on performance?

Petter_L_Intel · ‎03-16-2012

Threading is not necessarily required for high performance, but it certainly makes it easier to implement concurrent sessions that way.
Regards,

Petter