Media (Intel® oneAPI Video Processing Library, Intel Media SDK)
Access community support for transcoding, decoding, and encoding in applications using media tools like the Intel® oneAPI Video Processing Library and Intel® Media SDK.

Squeezing the best performance from H.264 encode/decode


My configuration is as follows:

    Graphics Devices:
        Name                                         Version             State
        AMD Radeon HD 7900 Series                    16.150.2211.0       Active
        Intel(R) HD Graphics 4600                 08

    System info:
        CPU:    Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
        OS:    Microsoft Windows 10 Pro
        Arch:    64-bit

With an API 1.16 session (Intel(R)_Media_SDK_2016.0.2), I use the following parameters to encode H.264:

    parms.AsyncDepth = 4;
    parms.mfx.CodecId = MFX_CODEC_AVC;
    parms.mfx.CodecProfile = MFX_PROFILE_AVC_MAIN;
    parms.mfx.EncodedOrder = 0;
    parms.mfx.FrameInfo.FourCC = MFX_FOURCC_NV12;
    parms.mfx.FrameInfo.ChromaFormat = MFX_CHROMAFORMAT_YUV420;
    parms.mfx.FrameInfo.PicStruct = MFX_PICSTRUCT_PROGRESSIVE;
    parms.mfx.FrameInfo.Width = 1280;
    parms.mfx.FrameInfo.Height = 720;
    parms.mfx.FrameInfo.CropX = 0;
    parms.mfx.FrameInfo.CropY = 0;
    parms.mfx.FrameInfo.CropW = 1280;
    parms.mfx.FrameInfo.CropH = 720;
    parms.mfx.GopRefDist = 3;
    parms.mfx.GopPicSize = 60;
    parms.mfx.IdrInterval = 0;
    parms.mfx.NumRefFrame = 1;
    parms.mfx.NumSlice = 0;
    parms.mfx.RateControlMethod = MFX_RATECONTROL_CBR;
    parms.mfx.TargetUsage = MFX_TARGETUSAGE_BALANCED;
    parms.mfx.TargetKbps = 5000;

I use a D3D11FrameAllocator and initialise the session with MFX_IMPL_HARDWARE_ANY | MFX_IMPL_VIA_D3D11, which resolves to MFX_IMPL_HARDWARE2 since my main graphics device is the AMD.
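For reference, a minimal sketch of that session initialisation (the version and implementation flags match what I described above; this is illustrative, not my exact code):

```cpp
// Request any hardware implementation, routed via the D3D11 infrastructure.
mfxVersion ver = { {16, 1} };   // API 1.16 (Minor, Major)
mfxSession session = nullptr;
mfxStatus sts = MFXInit(MFX_IMPL_HARDWARE_ANY | MFX_IMPL_VIA_D3D11, &ver, &session);
// MFX_IMPL_HARDWARE2 pins the session to the second adapter explicitly,
// which is what HARDWARE_ANY resolves to here since the primary adapter is the AMD.
```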

My encode speed was 179ms per GOP (60 frames, 1280 x 720, 5000kbps), which is around 3ms per frame.  My decode speed was 197ms per GOP, which also rounds to 3ms per frame, so let's call them the same: approximately 333 frames per second.

How do these numbers compare to the theoretical maximums?  We want to encode two streams of 1280 x 720 at 60 fps and decode 2, 3, 4 or more streams simultaneously.  We have our own pipeline that further processes decoded GOPs for a scientific/industrial application, so we aren't using VPP.  Apart from TargetUsage, is there any other way of squeezing more performance out of the encoder or decoder?  I noticed when profiling that the vast majority of processor time (> 85%) is spent locking the D3D11 surfaces.  Can we speed this up in any way?  For example, is there optimised code available for converting UYUV420 to NV12 (and back again)?  Our base format is UYUV420, and I haven't been able to find anything suitable on Google.

Note that the numbers I have here compare favourably with the output of sample_encode with the -calc_latency flag, so I suspect my implementation is close to optimal (assuming yours is!).

Thanks for any advice you can give me.


You should be able to go above the FPS you're currently seeing.  For an estimate of your HW peak performance, start with sample_multi_transcode.  Unlike CPU implementations, where transcode time can be estimated as decode + encode, our hardware is asynchronous, with hardware blocks that can execute simultaneously.

Not sure if I've got the right pixel format for you, but if it is close to YV12, you could use VPP right before encode in your pipeline for the color conversion.  This gives you HW-accelerated color conversion: just set YV12 as the input FourCC and NV12 as the output FourCC.
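A minimal sketch of that VPP configuration (field names are from the Media SDK mfxVideoParam structure; the resolution and frame rate are assumptions taken from the original post):

```cpp
// VPP setup for HW color conversion: YV12 in, NV12 out.
mfxVideoParam vppParams = {};
vppParams.vpp.In.FourCC        = MFX_FOURCC_YV12;
vppParams.vpp.In.ChromaFormat  = MFX_CHROMAFORMAT_YUV420;
vppParams.vpp.In.PicStruct     = MFX_PICSTRUCT_PROGRESSIVE;
vppParams.vpp.In.Width         = 1280;
vppParams.vpp.In.Height        = 720;
vppParams.vpp.In.CropW         = 1280;
vppParams.vpp.In.CropH         = 720;
vppParams.vpp.In.FrameRateExtN = 60;
vppParams.vpp.In.FrameRateExtD = 1;
vppParams.vpp.Out = vppParams.vpp.In;            // same size and rate
vppParams.vpp.Out.FourCC       = MFX_FOURCC_NV12; // convert to NV12
// System memory in, video memory out lets VPP do the upload in the same step.
vppParams.IOPattern = MFX_IOPATTERN_IN_SYSTEM_MEMORY |
                      MFX_IOPATTERN_OUT_VIDEO_MEMORY;
```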

The only tricky part is mapping your data to the expected layout.  Below is an example for reading a file in I420 format (YV12 with reversed chroma UV).  If this isn't the format you're looking for, you could probably still improve performance over the SW implementation you're currently using by writing the color conversion in OpenCL.  Alternatively, you could let VPP convert from system memory to D3D11 in one step by setting the input surface type to system memory and the output to video memory.


mfxStatus LoadRawI420Frame(mfxFrameSurface1* pSurface, FILE* fSource)
{
    // I420 is a 12 bpp 4:2:0 planar format: Y (WxH), U (W/2 x H/2), V (W/2 x H/2).
    // For VPP acceleration, treat it as YV12 with the U and V planes swapped.
    size_t nBytesRead;
    mfxU16 w, h;
    mfxFrameInfo* pInfo = &pSurface->Info;
    mfxFrameData* pData = &pSurface->Data;

    w = pInfo->Width;
    h = pInfo->Height;

    // read Y (luminance) plane, one row at a time to honor the surface pitch
    for (int i = 0; i < h; i++) {
        nBytesRead = fread(pData->Y + i * pData->Pitch, 1, w, fSource);
        if (w != nBytesRead)
            return MFX_ERR_MORE_DATA;
    }

    // chroma planes are half width and half height
    h /= 2;
    w /= 2;

    // read U (Cb) plane
    for (int i = 0; i < h; i++) {
        nBytesRead = fread(pData->U + i * pData->Pitch / 2, 1, w, fSource);
        if (w != nBytesRead)
            return MFX_ERR_MORE_DATA;
    }

    // read V (Cr) plane
    for (int i = 0; i < h; i++) {
        nBytesRead = fread(pData->V + i * pData->Pitch / 2, 1, w, fSource);
        if (w != nBytesRead)
            return MFX_ERR_MORE_DATA;
    }

    return MFX_ERR_NONE;
}
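If VPP conversion isn't an option and the repack has to stay on the CPU, note that the I420-to-NV12 step itself is just interleaving the two chroma planes (the luma plane is identical in both formats).  A minimal standalone sketch, with no Media SDK dependency (a real implementation would want SIMD or OpenCL, and would honor the destination surface pitch rather than assuming it equals the width):

```cpp
#include <cstdint>

// Interleave separate U and V planes (I420) into the single UV plane of NV12.
// w and h are the luma dimensions; the chroma planes are (w/2) x (h/2).
// Assumes the NV12 UV plane has a pitch equal to w.
void I420ToNV12Chroma(const uint8_t* u, const uint8_t* v,
                      uint8_t* uv, int w, int h)
{
    const int cw = w / 2, ch = h / 2;
    for (int y = 0; y < ch; ++y) {
        for (int x = 0; x < cw; ++x) {
            uv[y * w + 2 * x]     = u[y * cw + x]; // Cb
            uv[y * w + 2 * x + 1] = v[y * cw + x]; // Cr
        }
    }
}
```

The reverse direction (NV12 to I420) is the same loop with the assignments swapped.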



You are using target usage 4 (balanced), which isn't the fastest; try target usage 7 (speed) instead.
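In Media SDK terms that's a one-line change to the encode parameters shown above (MFX_TARGETUSAGE_BEST_SPEED is the named constant for target usage 7):

```cpp
parms.mfx.TargetUsage = MFX_TARGETUSAGE_BEST_SPEED;  // TU7: fastest, lowest quality
```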


Thanks for your replies, gentlemen.  Performing the transcode (encode -> decode) together on the same PC may be an optimisation we can make, though our use case mandates that the decode can happen remotely (streaming services), and at the architecture level our encode and decode pipelines really are separate processes.

When profiling I don't see the UYUV -> NV12 conversion impacting performance much, so optimising it may be premature.  The main point I take away is that we're close to the theoretical limits and not doing anything obviously wrong.  Changing between best, balanced and speed does change performance (about 30ms per 60 frames between best and speed); I'm going to stick with balanced for now.

I went digging a bit more into the D3D11 surface lock and it appears to dispatch a compute shader, so I'm not surprised we're spending most of our time in there.