My configuration is as follows:
Graphics Devices:
Name                        Version        State
AMD Radeon HD 7900 Series   16.150.2211.0  Active
Intel(R) HD Graphics 4600   10.18.15.4256

System info:
CPU: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
OS: Microsoft Windows 10 Pro
Arch: 64-bit
With a 1.16 session (Intel(R)_Media_SDK_2016.0.2), using the following parameters to encode H.264:

```cpp
parms.AsyncDepth = 4;
parms.IOPattern = MFX_IOPATTERN_IN_VIDEO_MEMORY;
parms.mfx.CodecId = MFX_CODEC_AVC;
parms.mfx.CodecProfile = MFX_PROFILE_AVC_MAIN;
parms.mfx.EncodedOrder = 0;
parms.mfx.FrameInfo.FourCC = MFX_FOURCC_NV12;
parms.mfx.FrameInfo.ChromaFormat = MFX_CHROMAFORMAT_YUV420;
parms.mfx.FrameInfo.PicStruct = MFX_PICSTRUCT_PROGRESSIVE;
parms.mfx.FrameInfo.Width = 1280;
parms.mfx.FrameInfo.Height = 720;
parms.mfx.FrameInfo.CropX = 0;
parms.mfx.FrameInfo.CropY = 0;
parms.mfx.FrameInfo.CropW = 1280;
parms.mfx.FrameInfo.CropH = 720;
parms.mfx.GopRefDist = 3;
parms.mfx.GopPicSize = 60;
parms.mfx.IdrInterval = 0;
parms.mfx.NumRefFrame = 1;
parms.mfx.NumSlice = 0;
parms.mfx.RateControlMethod = MFX_RATECONTROL_CBR;
parms.mfx.TargetUsage = MFX_TARGETUSAGE_BALANCED;
parms.mfx.TargetKbps = 5000;
```
The session uses a D3D11FrameAllocator and was created with MFX_IMPL_HARDWARE_ANY | MFX_IMPL_VIA_D3D11, which resolves to MFX_IMPL_HARDWARE2 (my main graphics device is the AMD).
My encode speed was 179 ms per GOP (60 frames, 1280 x 720, 5000 kbps), which is around 3 ms per frame. My decode speed was 197 ms per GOP, which also rounds to about 3 ms per frame, so let's call them the same: approximately 333 frames per second.
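For what it's worth, a quick sanity check of the arithmetic (timings quoted from above, not new measurements; the ~333 fps figure comes from rounding to 3 ms/frame, the exact encode number works out slightly higher):

```cpp
// Throughput arithmetic for the measured GOP timings above.
// 179 ms per 60-frame GOP for encode.
double MsPerFrame(double gopMs, double framesPerGop)
{
    return gopMs / framesPerGop;            // 179/60 ≈ 2.98 ms/frame
}

double Fps(double gopMs, double framesPerGop)
{
    return framesPerGop * 1000.0 / gopMs;   // 60000/179 ≈ 335 fps
}
```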
How do these numbers compare to theoretical maximums? We want to encode 2 streams of 1280 x 720 at 60 fps and decode 2, 3 or 4 (or more) streams simultaneously. We have our own pipeline that further processes decoded GOPs for a scientific/industrial application, so we aren't using VPP. Apart from TargetUsage, is there any other way of squeezing more performance out of the encoder or decoder? I noticed when profiling that the vast majority of processor time (> 85%) is spent locking the D3D11 surfaces. Can we speed this up in any way? For example, is there optimised code around for converting UYUV420 to NV12 (and back again)? Our base format is UYUV420, and I haven't been able to find anything suitable on Google.
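In case it's useful context: if "UYUV420" here means a planar 4:2:0 layout (I420/YV12-like), the CPU-side conversion to NV12 is just a chroma interleave. This is a hypothetical, unoptimised sketch (the function name and the packed-plane assumption are mine, and it assumes even dimensions with pitch equal to width):

```cpp
#include <cstdint>

// Interleave planar 4:2:0 chroma (U plane then V plane, I420-style)
// into the single UV-interleaved chroma plane that NV12 expects.
// Assumes tightly packed planes (pitch == width) and even width/height.
void I420ToNV12(const uint8_t* y, const uint8_t* u, const uint8_t* v,
                uint8_t* dstY, uint8_t* dstUV, int width, int height)
{
    // The luma plane is identical in both formats; copy it through.
    for (int i = 0; i < width * height; ++i)
        dstY[i] = y[i];

    // Chroma is at quarter resolution; NV12 stores U,V pairs interleaved.
    const int chromaCount = (width / 2) * (height / 2);
    for (int i = 0; i < chromaCount; ++i) {
        dstUV[2 * i]     = u[i];
        dstUV[2 * i + 1] = v[i];
    }
}
```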
Note the numbers I have here compare favourably with the output of sample_encode with the -calc_latency flag, so I'm thinking my implementation is close to optimal (assuming yours is!).
Thanks for any advice you can give me.
You should be able to go above the FPS you're currently seeing. For an estimate of what your HW peak performance is, start with sample_multi_transcode. Unlike CPU implementations where transcode time can be estimated as decode+encode, our hardware is asynchronous with hardware blocks that can execute simultaneously.
Not sure if I've got the right pixel format for you, but if it is close to YV12 you could use VPP right before encode in your pipeline for the color conversion. This means you get HW accelerated color conversion. Just set YV12 as the input fourcc and NV12 as the output fourcc.
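To illustrate, a minimal sketch of what that VPP setup might look like (a hypothetical fragment using the standard mfxVideoParam fields; the resolution and frame rate are placeholders for your stream, and error handling is omitted):

```cpp
mfxVideoParam vppParams = {};

// System memory in, video memory out lets VPP do the upload in one step.
vppParams.IOPattern = MFX_IOPATTERN_IN_SYSTEM_MEMORY |
                      MFX_IOPATTERN_OUT_VIDEO_MEMORY;

// Input: planar YV12 frames from your pipeline.
vppParams.vpp.In.FourCC        = MFX_FOURCC_YV12;
vppParams.vpp.In.ChromaFormat  = MFX_CHROMAFORMAT_YUV420;
vppParams.vpp.In.PicStruct     = MFX_PICSTRUCT_PROGRESSIVE;
vppParams.vpp.In.Width         = 1280;
vppParams.vpp.In.Height        = 720;
vppParams.vpp.In.CropW         = 1280;
vppParams.vpp.In.CropH         = 720;
vppParams.vpp.In.FrameRateExtN = 60;
vppParams.vpp.In.FrameRateExtD = 1;

// Output: NV12, ready to feed straight into Encode.
vppParams.vpp.Out        = vppParams.vpp.In;
vppParams.vpp.Out.FourCC = MFX_FOURCC_NV12;

// mfxStatus sts = MFXVideoVPP_Init(session, &vppParams);
```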
The only tricky part is mapping your data to the expected layout. Below is an example for reading a file in I420 format (YV12 with the chroma U and V planes reversed). If this isn't the format you're looking for, you could probably improve performance over the SW implementation you're currently using by writing the color conversion in OpenCL. Alternatively, you could try letting VPP convert from system memory to D3D11 in one step by setting the input surface type to system memory and the output to video memory.
```cpp
mfxStatus LoadRawI420Frame(mfxFrameSurface1* pSurface, FILE* fSource)
{
    // I420 is a 12 bpp 4:2:0 planar format: Y (WxH), U (W/2 x H/2), V (W/2 x H/2).
    // For VPP acceleration, treat it as YV12 with the U and V planes reversed.
    size_t nBytesRead;
    mfxU16 w, h;
    mfxFrameInfo* pInfo = &pSurface->Info;
    mfxFrameData* pData = &pSurface->Data;
    w = pInfo->Width;
    h = pInfo->Height;

    // Read the Y (luminance) plane.
    for (int i = 0; i < h; i++) {
        nBytesRead = fread(pData->Y + i * pData->Pitch, 1, w, fSource);
        if (w != nBytesRead) return MFX_ERR_MORE_DATA;
    }

    h /= 2;
    w /= 2;

    // Read the U (Cb) plane; chroma pitch is half the luma pitch.
    for (int i = 0; i < h; i++) {
        nBytesRead = fread(pData->U + i * pData->Pitch / 2, 1, w, fSource);
        if (w != nBytesRead) return MFX_ERR_MORE_DATA;
    }

    // Read the V (Cr) plane.
    for (int i = 0; i < h; i++) {
        nBytesRead = fread(pData->V + i * pData->Pitch / 2, 1, w, fSource);
        if (w != nBytesRead) return MFX_ERR_MORE_DATA;
    }

    return MFX_ERR_NONE;
}
```
You are using target usage 4 (MFX_TARGETUSAGE_BALANCED), which isn't the fastest; try target usage 7 (MFX_TARGETUSAGE_BEST_SPEED) instead.
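In terms of the parameter block posted above, that would be this one-line change:

```cpp
// Target usage trades quality for speed: 1 = best quality ... 7 = best speed.
// TU4 (MFX_TARGETUSAGE_BALANCED) is the middle ground used above.
parms.mfx.TargetUsage = MFX_TARGETUSAGE_BEST_SPEED;  // TU7
```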
Thanks for your replies, gentlemen. Transcoding (encode -> decode) on the same PC may be an optimisation we can perform, though our use case mandates that decode can happen remotely (streaming services), and at an architectural level our encode and decode pipelines really are separate processes.
When profiling, I don't see the UYUV -> NV12 conversion impacting performance that much, so optimising it may be premature. The main point I take away from this is that we're close to theoretical limits and not doing anything obviously wrong. Yes, changing between best, balanced and speed does change the performance (about 30 ms per 60 frames between best and speed); I'm going to stick with balanced for now.
I went digging a bit more into the D3D11 surface lock and it appears to dispatch a compute shader, so I'm not surprised we're spending most of our time in there.