Media (Intel® oneAPI Video Processing Library, Intel Media SDK)
Access community support for transcoding, decoding, and encoding in applications using media tools like the Intel® oneAPI Video Processing Library and Intel® Media SDK.

Squeezing the best performance from H.264 encode/decode


My configuration is as follows:

    Graphics Devices:
        Name                                         Version             State
        AMD Radeon HD 7900 Series                    16.150.2211.0       Active
        Intel(R) HD Graphics 4600                 08

    System info:
        CPU:    Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
        OS:    Microsoft Windows 10 Pro
        Arch:    64-bit

With an API 1.16 session (Intel(R)_Media_SDK_2016.0.2), I use the following parameters to encode H.264:

    parms.AsyncDepth = 4;
    parms.mfx.CodecId = MFX_CODEC_AVC;
    parms.mfx.CodecProfile = MFX_PROFILE_AVC_MAIN;
    parms.mfx.EncodedOrder = 0;
    parms.mfx.FrameInfo.FourCC = MFX_FOURCC_NV12;
    parms.mfx.FrameInfo.ChromaFormat = MFX_CHROMAFORMAT_YUV420;
    parms.mfx.FrameInfo.PicStruct = MFX_PICSTRUCT_PROGRESSIVE;
    parms.mfx.FrameInfo.Width = 1280;
    parms.mfx.FrameInfo.Height = 720;
    parms.mfx.FrameInfo.CropX = 0;
    parms.mfx.FrameInfo.CropY = 0;
    parms.mfx.FrameInfo.CropW = 1280;
    parms.mfx.FrameInfo.CropH = 720;
    parms.mfx.GopRefDist = 3;
    parms.mfx.GopPicSize = 60;
    parms.mfx.IdrInterval = 0;
    parms.mfx.NumRefFrame = 1;
    parms.mfx.NumSlice = 0;
    parms.mfx.RateControlMethod = MFX_RATECONTROL_CBR;
    parms.mfx.TargetUsage = MFX_TARGETUSAGE_BALANCED;
    parms.mfx.TargetKbps = 5000;

I use a D3D11FrameAllocator and initialise the session with MFX_IMPL_HARDWARE_ANY | MFX_IMPL_VIA_D3D11, which resolves to MFX_IMPL_HARDWARE2 since my main graphics device is the AMD.
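For reference, a minimal sketch of that session initialisation (the version and implementation flags match what I described above; this is illustrative, not my exact code):

```cpp
// Request any hardware implementation, routed via the D3D11 infrastructure.
mfxVersion ver = { {16, 1} };   // API 1.16 (Minor, Major)
mfxSession session = nullptr;
mfxStatus sts = MFXInit(MFX_IMPL_HARDWARE_ANY | MFX_IMPL_VIA_D3D11, &ver, &session);
// MFX_IMPL_HARDWARE2 pins the session to the second adapter explicitly,
// which is what HARDWARE_ANY resolves to here since the primary adapter is the AMD.
```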

My encode speed was 179ms per GOP (60 frames, 1280 x 720, 5000kbps), which is around 3ms per frame.  My decode speed was 197ms per GOP, which also rounds to 3ms per frame, so let's call them the same: approximately 333 frames per second.

How do these numbers compare to the theoretical maximums?  We want to encode two streams of 1280 x 720 at 60 fps and decode 2, 3, 4 or more streams simultaneously.  We have our own pipeline that further processes decoded GOPs for a scientific/industrial application, so we aren't using VPP.  Apart from TargetUsage, is there any other way of squeezing more performance out of the encoder or decoder?  I noticed when profiling that the vast majority of processor time (> 85%) is spent locking the D3D11 surfaces.  Can we speed this up in any way?  For example, is there optimised code available for converting UYUV420 to NV12 (and back again)?  Our base format is UYUV420, and I haven't been able to find anything suitable on Google.

Note that the numbers I have here compare favourably with the output of sample_encode with the -calc_latency flag, so I suspect my implementation is close to optimal (assuming yours is!).

Thanks for any advice you can give me.


You should be able to go above the FPS you're currently seeing.  For an estimate of your HW peak performance, start with sample_multi_transcode.  Unlike CPU implementations, where transcode time can be estimated as decode + encode, our hardware is asynchronous, with hardware blocks that can execute simultaneously.

Not sure if I've got the right pixel format for you, but if it is close to YV12, you could use VPP right before encode in your pipeline for the color conversion.  This gives you HW-accelerated color conversion: just set YV12 as the input FourCC and NV12 as the output FourCC.
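A minimal sketch of that VPP configuration (field names are from the Media SDK mfxVideoParam structure; the resolution and frame rate are assumptions taken from the original post):

```cpp
// VPP setup for HW color conversion: YV12 in, NV12 out.
mfxVideoParam vppParams = {};
vppParams.vpp.In.FourCC        = MFX_FOURCC_YV12;
vppParams.vpp.In.ChromaFormat  = MFX_CHROMAFORMAT_YUV420;
vppParams.vpp.In.PicStruct     = MFX_PICSTRUCT_PROGRESSIVE;
vppParams.vpp.In.Width         = 1280;
vppParams.vpp.In.Height        = 720;
vppParams.vpp.In.CropW         = 1280;
vppParams.vpp.In.CropH         = 720;
vppParams.vpp.In.FrameRateExtN = 60;
vppParams.vpp.In.FrameRateExtD = 1;
vppParams.vpp.Out = vppParams.vpp.In;            // same size and rate
vppParams.vpp.Out.FourCC       = MFX_FOURCC_NV12; // convert to NV12
// System memory in, video memory out lets VPP do the upload in the same step.
vppParams.IOPattern = MFX_IOPATTERN_IN_SYSTEM_MEMORY |
                      MFX_IOPATTERN_OUT_VIDEO_MEMORY;
```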

The only tricky part is mapping your data to the expected layout.  Below is an example for reading a file in I420 format (YV12 with reversed chroma UV).  If this isn't the format you're looking for, you could probably still improve performance over the SW implementation you're currently using by writing the color conversion in OpenCL.  Alternatively, you could let VPP convert from system memory to D3D11 in one step by setting the input surface type to system memory and the output to video memory.


mfxStatus LoadRawI420Frame(mfxFrameSurface1* pSurface, FILE* fSource)
{
    // I420 is a 12 bpp 4:2:0 planar format: Y (WxH), U (W/2 x H/2), V (W/2 x H/2).
    // For VPP acceleration, treat it as YV12 with the U and V planes swapped.
    size_t nBytesRead;
    mfxU16 w, h;
    mfxFrameInfo* pInfo = &pSurface->Info;
    mfxFrameData* pData = &pSurface->Data;

    w = pInfo->Width;
    h = pInfo->Height;

    // read Y (luminance) plane, one row at a time to honor the surface pitch
    for (int i = 0; i < h; i++) {
        nBytesRead = fread(pData->Y + i * pData->Pitch, 1, w, fSource);
        if (w != nBytesRead)
            return MFX_ERR_MORE_DATA;
    }

    // chroma planes are half width and half height
    h /= 2;
    w /= 2;

    // read U (Cb) plane
    for (int i = 0; i < h; i++) {
        nBytesRead = fread(pData->U + i * pData->Pitch / 2, 1, w, fSource);
        if (w != nBytesRead)
            return MFX_ERR_MORE_DATA;
    }

    // read V (Cr) plane
    for (int i = 0; i < h; i++) {
        nBytesRead = fread(pData->V + i * pData->Pitch / 2, 1, w, fSource);
        if (w != nBytesRead)
            return MFX_ERR_MORE_DATA;
    }

    return MFX_ERR_NONE;
}
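If VPP conversion isn't an option and the repack has to stay on the CPU, note that the I420-to-NV12 step itself is just interleaving the two chroma planes (the luma plane is identical in both formats).  A minimal standalone sketch, with no Media SDK dependency (a real implementation would want SIMD or OpenCL, and would honor the destination surface pitch rather than assuming it equals the width):

```cpp
#include <cstdint>

// Interleave separate U and V planes (I420) into the single UV plane of NV12.
// w and h are the luma dimensions; the chroma planes are (w/2) x (h/2).
// Assumes the NV12 UV plane has a pitch equal to w.
void I420ToNV12Chroma(const uint8_t* u, const uint8_t* v,
                      uint8_t* uv, int w, int h)
{
    const int cw = w / 2, ch = h / 2;
    for (int y = 0; y < ch; ++y) {
        for (int x = 0; x < cw; ++x) {
            uv[y * w + 2 * x]     = u[y * cw + x]; // Cb
            uv[y * w + 2 * x + 1] = v[y * cw + x]; // Cr
        }
    }
}
```

The reverse direction (NV12 to I420) is the same loop with the assignments swapped.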



You are using target usage 4 (balanced), which isn't the fastest; try target usage 7 (speed) instead.
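In Media SDK terms that's a one-line change to the encode parameters shown above (MFX_TARGETUSAGE_BEST_SPEED is the named constant for target usage 7):

```cpp
parms.mfx.TargetUsage = MFX_TARGETUSAGE_BEST_SPEED;  // TU7: fastest, lowest quality
```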


Thanks for your replies, gentlemen.  Performing the transcode (encode -> decode) together on the same PC may be an optimisation we can make, though our use case mandates that the decode can happen remotely (streaming services), and at the architecture level our encode and decode pipelines really are separate processes.

When profiling I don't see the UYUV -> NV12 conversion impacting performance much, so optimising it may be premature.  The main point I take away is that we're close to the theoretical limits and not doing anything obviously wrong.  Changing between best, balanced and speed does change performance (about 30ms per 60 frames between best and speed); I'm going to stick with balanced for now.

I went digging a bit more into the D3D11 surface lock and it appears to dispatch a compute shader, so I'm not surprised we're spending most of our time in there.