Hamza_U_1 · ‎03-24-2016

Hi,

When running the sample_vpp app, we're noticing that memcpy is very slow when reading a frame from the output VAAPI surface into sysmem. For a 4K (3840x2160) NV12 frame, memcpy takes roughly 100 milliseconds, whereas memcpy from host to vmem takes 6-7 milliseconds. We want to do 4K @ 60fps. Is there an optimized version of memcpy for this transfer? What could be going wrong?
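To put the numbers above in perspective, here is a rough back-of-the-envelope sketch. It assumes a tightly packed NV12 frame (12 bits per pixel, no pitch padding), which a real VAAPI surface may not be; the function names are made up for this illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// NV12 is a full-resolution 8-bit luma plane followed by an interleaved,
// 2x2-subsampled UV plane: 1.5 bytes per pixel. Pitch padding is ignored here.
std::uint64_t nv12FrameBytes(std::uint64_t width, std::uint64_t height) {
    assert(width % 2 == 0 && height % 2 == 0);
    return width * height * 3 / 2;
}

// Sustained copy bandwidth (bytes/s) needed to move one frame per frame period.
double requiredBytesPerSecond(std::uint64_t frameBytes, double fps) {
    return static_cast<double>(frameBytes) * fps;
}

// Bandwidth actually achieved when a single frame copy takes copyMillis.
double achievedBytesPerSecond(std::uint64_t frameBytes, double copyMillis) {
    assert(copyMillis > 0.0);
    return static_cast<double>(frameBytes) * 1000.0 / copyMillis;
}

// Prints the copy budget for one configuration.
void printCopyBudget(std::uint64_t width, std::uint64_t height,
                     double fps, double copyMillis) {
    const std::uint64_t bytes = nv12FrameBytes(width, height);
    std::printf("frame size      : %llu bytes\n",
                static_cast<unsigned long long>(bytes));
    std::printf("per-frame budget: %.2f ms\n", 1000.0 / fps);
    std::printf("required        : %.1f MB/s\n",
                requiredBytesPerSecond(bytes, fps) / 1e6);
    std::printf("achieved        : %.1f MB/s\n",
                achievedBytesPerSecond(bytes, copyMillis) / 1e6);
}
```

Calling printCopyBudget(3840, 2160, 60.0, 100.0) shows a 12,441,600-byte frame that needs about 746 MB/s to sustain 60 fps, while a 100 ms copy moves only about 124 MB/s, so the readback alone is roughly six times over the 16.67 ms per-frame budget.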

I'm using an E3-1245 v5 with centos-release-7-2.1511.el7 64-bit and the latest version of MediaSDK and MediaSamples_Linux_2016. The numbers are the same on Skylake and E3-V3.

I have attached the source files, the par file, and the modified sample_vpp_utils.cpp code, where I just replaced fwrite with memcpy.

Following command was used: sudo -E ./__bin/sample_vpp -composite par_4k_nv12_rgb.txt -lib hw -vaapi -o out.yuv

Roman_T_ · ‎03-24-2016

Hi, Hamza,

Copying from vmem to sysmem and from sysmem to vmem are weak points of MSDK in general.

A great strength of MSDK is that it eliminates the sysmem-to-vmem copy during decoding and playback: decoding is performed in video memory, so the decompressed image can later be displayed very quickly because it is already in video memory.
But in your case the output data has to be placed in system memory because of the VPP procedure.

One possible solution is to use system memory for both the input and the output of the video decoder before VPP.

Best regards,
Roman

Hamza_U_1 · ‎03-24-2016

Hi Roman,

Thanks for the quick reply. Our use case involves reading from a file (sysmem), decoding the bitstream, using VPP to do some processing, then getting the NV12 frame out to sysmem and handing it to the SDI card/driver. The SDI driver needs a buffer in sysmem; that's the reason behind copying from vmem to sysmem.

On Linux, blending of NV12+RGB works only when VAAPI surfaces are used, so I am forced to use vmem.

Can you suggest a solution?

James_B_9 · ‎03-24-2016

Hi,

Have you read this article?

https://software.intel.com/en-us/articles/copying-accelerated-video-decode-frame-buffers

I tried this a while back in a simple decode with a VPP. The speed was almost as good as making the VPP input surface MFX_IOPATTERN_IN_VIDEO_MEMORY and the output surface MFX_IOPATTERN_OUT_SYSTEM_MEMORY. I didn't have your restriction of having to use MFX_IOPATTERN_OUT_SYSTEM_MEMORY for the VPP output surfaces, so I abandoned the approach mentioned in the paper, but you may find it useful.

James.

Surbhi_M_Intel · ‎03-24-2016

Hi Hamza,

memcpy is one of the ways shown by Intel to copy the content, but it is definitely not the most efficient way of doing so. Can you please tell us which IOPattern you are using? As James said in their last post, if you want to copy the output to system memory, it is best to do it through the IOPattern: VPP input should be in video memory (MFX_IOPATTERN_IN_VIDEO_MEMORY), so that blending of NV12+RGB4 is done on VAAPI surfaces (better performance), and VPP output should go to system memory (MFX_IOPATTERN_OUT_SYSTEM_MEMORY). This has been designed as an efficient path for copies from video to system memory, so hopefully you will see better performance.
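For readers following along, the IOPattern combination described above could be set roughly as follows. This is only a sketch: when the Media SDK headers are not on the include path it falls back to minimal stand-in definitions (my assumption of the flag values from mfxstructures.h, so that the snippet compiles on its own), and the helper names useVideoInSystemOut and outputsToSystemMemory are hypothetical.

```cpp
#include <cassert>

#ifdef __has_include
#  if __has_include(<mfxvideo.h>)
#    include <mfxvideo.h>   // real Media SDK definitions, when installed
#    define SKETCH_HAVE_MSDK 1
#  endif
#endif

#ifndef SKETCH_HAVE_MSDK
// Stand-ins so the sketch builds without the SDK. The values are assumed to
// mirror the SDK enum; use the real headers in an actual application.
typedef unsigned short mfxU16;
enum {
    MFX_IOPATTERN_IN_VIDEO_MEMORY   = 0x01,
    MFX_IOPATTERN_IN_SYSTEM_MEMORY  = 0x02,
    MFX_IOPATTERN_OUT_VIDEO_MEMORY  = 0x10,
    MFX_IOPATTERN_OUT_SYSTEM_MEMORY = 0x20
};
struct mfxVideoParam { mfxU16 IOPattern; };
#endif

// VPP reads its inputs from video memory (so composition runs on VAAPI
// surfaces) and writes its output straight into system memory.
void useVideoInSystemOut(mfxVideoParam& par) {
    par.IOPattern = static_cast<mfxU16>(MFX_IOPATTERN_IN_VIDEO_MEMORY |
                                        MFX_IOPATTERN_OUT_SYSTEM_MEMORY);
}

// True when the configured VPP output lands in system memory.
bool outputsToSystemMemory(const mfxVideoParam& par) {
    return (par.IOPattern & MFX_IOPATTERN_OUT_SYSTEM_MEMORY) != 0;
}
```

With this pattern the output frames handed back by VPP are already in system memory, so the application-side memcpy from a mapped VAAPI surface is no longer needed. The same mfxVideoParam would also have to be passed to QueryIOSurf so that the output surface request matches.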

Thanks,
Surbhi

Hamza_U_1 · ‎03-25-2016

Hi James and Surbhi,

On changing the output IOPATTERN to system memory, I see an improvement.

Thanks for the suggestions.

Now I'm facing an issue while using MFXVideoVPP_Reset.

Initially 5 surfaces are allocated, then I change the count dynamically using Reset(). Once the number of surfaces has been reduced (say to 4), changing it back to 5 gives MFX_ERR_INCOMPATIBLE_VIDEO_PARAM, which suggests it needs additional memory. Shouldn't VPP keep track of the maximum number of surfaces instead of using the last num_surfaces?

Surbhi_M_Intel · ‎03-25-2016

Hi Hamza,

Great, glad you see improvements. Regarding changing the number of surfaces: which VPP operation are you doing, and can you please give the reason for changing the number of surfaces? Were you seeing any issue before the dynamic adjustments? MFX_ERR_INCOMPATIBLE_VIDEO_PARAM means the video params are invalid; when returned by Reset, it means the function cannot process the specified configuration with the existing structures and frame buffers, and the incompatibility cannot be resolved. The application should be able to query the number of surfaces using MFXVideoVPP_QueryIOSurf, along with the other pieces in the pipeline. To understand allocation, please refer to "Surface Pool Allocation" in the Media SDK manual.

Thanks,
Surbhi

Hamza_U_1 · ‎03-25-2016

Hi Surbhi,

> which VPP operation are you doing
>> I'm doing blending, using composite with PixelAlphaEnable=1 (NV12+RGB). The number of surfaces is changed using the "NumInputStream" parameter of the composite structure. Please note that at initialization, while allocating surfaces, it's set to the maximum number of surfaces that would be used at any point in time.

> and can you please provide reason for changing number of surfaces,
>> The number of graphics and their positions change dynamically in our use case. However, there's a maximum limit, so that many surfaces are requested at Init using MFXVideoVPP_QueryIOSurf() and Alloc().
I'm using one surface for each graphic, so as the number of graphics increases or decreases, the number of surfaces changes, never exceeding the initial maximum.
Another option was to load multiple non-overlapping graphics onto a single surface and then have a fixed number of such surfaces. This approach cannot be used because it requires clearing the whole surface for each frame, since the graphics can be animations too. Memset takes too long to be real time.

> were you seeing any issue before dynamic adjustments?
>> As I said in my previous post, I see the issue only when increasing the number of surfaces. Changing x,y dynamically while keeping the number of surfaces the same works.

I use the function below to change x,y dynamically, and another similar function that changes numStreams as well (sample_vpp/src/sample_vpp_parser.cpp, similar to vppParseInputString()):
mfxStatus amgSetParams(sInputParams* pParams)
{
static int count = 2;
pParams->numStreams   = 2;
pParams->inFrameInfo[VPP_IN].nWidth   = 3840;
pParams->inFrameInfo[VPP_IN].nHeight = 2160;
pParams->inFrameInfo[VPP_IN].CropX    = 0;
pParams->inFrameInfo[VPP_IN].CropY    = 0;
pParams->inFrameInfo[VPP_IN].CropW    = 3840;
pParams->inFrameInfo[VPP_IN].CropH    = 2160;
pParams->inFrameInfo[VPP_IN].FourCC       = MFX_FOURCC_NV12;
pParams->inFrameInfo[VPP_IN].PicStruct    = MFX_PICSTRUCT_PROGRESSIVE;
pParams->inFrameInfo[VPP_IN].dFrameRate   = 25;

pParams->inFrameInfo[1].nWidth   = 494;
pParams->inFrameInfo[1].nHeight = 390;
pParams->inFrameInfo[1].CropX    = 0;
pParams->inFrameInfo[1].CropY    = 0;
pParams->inFrameInfo[1].CropW    = 494;
pParams->inFrameInfo[1].CropH    = 390;
pParams->inFrameInfo[1].FourCC       = MFX_FOURCC_RGB4;
pParams->inFrameInfo[1].PicStruct    = MFX_PICSTRUCT_PROGRESSIVE;
pParams->inFrameInfo[1].dFrameRate   = 25;

pParams->outFrameInfo.nWidth   = 3840;
pParams->outFrameInfo.nHeight = 2160;
pParams->outFrameInfo.CropX    = 0;
pParams->outFrameInfo.CropY    = 0;
pParams->outFrameInfo.CropW    = 3840;
pParams->outFrameInfo.CropH    = 2160;
pParams->outFrameInfo.FourCC       = MFX_FOURCC_NV12;
pParams->outFrameInfo.PicStruct    = MFX_PICSTRUCT_PROGRESSIVE;
pParams->outFrameInfo.dFrameRate   = 25;

/* Video Enhancement Algorithms */
//sDIParam            deinterlaceParam;
//sDenoiseParam       denoiseParam;
//sDetailParam        detailParam;
//sProcAmpParam       procampParam;
//sVideoAnalysisParam vaParam;
//sIStabParam         istabParam;
sCompositionStreamInfo streamInfo[MAX_INPUT_STREAMS];
msdk_char             streamName[MSDK_MAX_FILENAME_LEN];
mfxVPPCompInputStream compStream;
pParams->compositionParam.mode                          = VPP_FILTER_ENABLED_CONFIGURED;
pParams->compositionParam.streamInfo[0].compStream.DstX = 0;
pParams->compositionParam.streamInfo[0].compStream.DstY = 0;
pParams->compositionParam.streamInfo[0].compStream.DstW = 3840;
pParams->compositionParam.streamInfo[0].compStream.DstH = 2160;
pParams->compositionParam.streamInfo[0].compStream.GlobalAlphaEnable = 0;
pParams->compositionParam.streamInfo[0].compStream.GlobalAlpha = 0;
pParams->compositionParam.streamInfo[0].compStream.PixelAlphaEnable = 1;
pParams->compositionParam.streamInfo[0].compStream.LumaKeyEnable = 0;
pParams->compositionParam.streamInfo[0].compStream.LumaKeyMin = 0;
pParams->compositionParam.streamInfo[0].compStream.LumaKeyMax = 0;
switch (count)
{
          case 0:
                  pParams->compositionParam.streamInfo[1].compStream.DstX = 0;
                  pParams->compositionParam.streamInfo[1].compStream.DstY = 0;
                  count++;
                  break;
          case 1:
                  pParams->compositionParam.streamInfo[1].compStream.DstX = 100;
                  pParams->compositionParam.streamInfo[1].compStream.DstY = 100;
                  count++;
                  break;
          case 2:
                  pParams->compositionParam.streamInfo[1].compStream.DstX = 200;
                  pParams->compositionParam.streamInfo[1].compStream.DstY = 200;
                  count++;
                  break;
          case 3:
                  pParams->compositionParam.streamInfo[1].compStream.DstX = 300;
                  pParams->compositionParam.streamInfo[1].compStream.DstY = 300;
                  count++;
                  break;
          case 4:
                  pParams->compositionParam.streamInfo[1].compStream.DstX = 400;
                  pParams->compositionParam.streamInfo[1].compStream.DstY = 400;
                  count++;
                  break;
          case 5:
                  pParams->compositionParam.streamInfo[1].compStream.DstX = 500;
                  pParams->compositionParam.streamInfo[1].compStream.DstY = 500;
                  count++;
                  break;
          case 6:
                  pParams->compositionParam.streamInfo[1].compStream.DstX = 600;
                  pParams->compositionParam.streamInfo[1].compStream.DstY = 600;
                  count = 0;
                  break;
}
pParams->compositionParam.streamInfo[1].compStream.DstW = 494;
pParams->compositionParam.streamInfo[1].compStream.DstH = 390;
pParams->compositionParam.streamInfo[1].compStream.GlobalAlphaEnable = 0;
pParams->compositionParam.streamInfo[1].compStream.GlobalAlpha = 0;
pParams->compositionParam.streamInfo[1].compStream.PixelAlphaEnable = 1;
pParams->compositionParam.streamInfo[1].compStream.LumaKeyEnable = 0;
pParams->compositionParam.streamInfo[1].compStream.LumaKeyMin = 0;
pParams->compositionParam.streamInfo[1].compStream.LumaKeyMax = 0;

// flag describes type of memory
// true - frames in video memory (d3d surfaces),
// false - in system memory
pParams->memType = VAAPI_MEMORY;

// required implementation of MediaSDK library
pParams->impLib = MFX_IMPL_HARDWARE;

/* Use extended API (RunFrameVPPAsyncEx) */
bool use_extapi = false;
bool need_plugin = false;
return MFX_ERR_NONE;
}
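
As a side note, the position-cycling switch in the function above steps the overlay along the diagonal in increments of 100 and wraps after 600. A more compact equivalent could look like the sketch below; OverlayPos and nextOverlayPos are hypothetical names introduced for illustration and are not part of sample_vpp.

```cpp
#include <cassert>

// Destination of the overlay stream inside the composed frame.
struct OverlayPos {
    int x;
    int y;
};

// Returns the position for the current step, then advances the step.
// Steps 0..6 map to (0,0), (100,100), ..., (600,600); after step 6 the
// sequence wraps back to 0, matching the seven-case switch.
OverlayPos nextOverlayPos(int& step) {
    const int kSteps  = 7;    // number of cases in the original switch
    const int kStride = 100;  // pixels moved per step on both axes
    assert(step >= 0 && step < kSteps);
    OverlayPos pos = { step * kStride, step * kStride };
    step = (step + 1) % kSteps;
    return pos;
}
```

Inside amgSetParams the returned x and y would be assigned to streamInfo[1].compStream.DstX and DstY, with the existing static count (which starts at 2) passed as the step.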

Hamza_U_1 · ‎04-05-2016

Hi Surbhi,

Do you have any update on this ?

Surbhi_M_Intel · ‎04-06-2016

Hi Hamza,

Sorry I missed your thread. I went over the details and see a couple of concerns in your application. I am discussing this with the composition developer and will get back to you soon.

Thanks,
Surbhi

Surbhi_M_Intel · ‎04-10-2016

Hi again,

With the current MSDK, composition doesn't support changing the number of surfaces or the number of streams on the fly. So every time the parameters change, the application needs to call Reset with the updated parameters, including the composition ext buffer. Please keep in mind that Reset works only within a few limitations:

The number of surfaces must be less than or equal to the number provided with the params on Init
Resolutions must be less than or equal to the resolutions provided with the params on Init
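
The two limitations above can be written down as a simple pre-check that an application might run before calling Reset. This is only a sketch: CompositionConfig and fitsWithinInit are hypothetical names, and the check encodes nothing beyond the two rules listed here. The SDK may apply further checks of its own.

```cpp
#include <cassert>

// Hypothetical summary of a composition setup: how many input streams
// (surfaces) are composed and the largest resolution involved.
struct CompositionConfig {
    unsigned numStreams;
    unsigned width;
    unsigned height;
};

// True when `next` stays within what was provided at Init: no more
// surfaces and no larger resolution than the initial configuration.
bool fitsWithinInit(const CompositionConfig& init,
                    const CompositionConfig& next) {
    return next.numStreams <= init.numStreams &&
           next.width      <= init.width &&
           next.height     <= init.height;
}
```

For example, with an Init configuration of 5 streams at 3840x2160, a new configuration of 4 streams at the same resolution passes the check, while 6 streams does not.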

If Reset is not able to handle the new params, it returns an error and the application needs to call Close, then Init with the new params.

One example of changing parameters: if the application needs to change the number of composed surfaces from 5 to 4, it could look like this.
As a regular pipeline:

Prepare params for 5->1 composition
Call QueryIOSurf with the params and allocate the requested number of surfaces
Init
RunFrameAsync 5 times for input layers until ERR_NONE is received
Call SyncOperation. Now output is ready.

Now let's say the app needs to change the number of composed streams to 4, or change the layout of the composed streams:

Prepare params for the new configuration
Call Reset with the new params. If Reset returns ERR_NONE, the app can start working with 4 layers now

No memory re-allocation is needed in this case; the previously allocated surfaces can be reused.
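
The Reset-then-fall-back flow described above could be sketched as follows. It is written as a template so it works with any object exposing Reset, Close and Init in the style of the MFXVideoVPP C++ wrapper from mfxvideo++.h; the reconfigure helper, FakeVpp and FakeParams are hypothetical and exist only so the sketch runs without the SDK. It assumes negative status codes are errors and that all in-flight frames have been synced before the call.

```cpp
#include <cassert>

// Tries Reset first; if Reset reports an error, closes and re-initializes
// with the same params. Returns the final status (negative means error).
// *reinitialized tells the caller whether Close + Init was needed, in which
// case QueryIOSurf should be re-checked and surfaces possibly re-allocated.
template <typename Vpp, typename Params>
int reconfigure(Vpp& vpp, Params* par, bool* reinitialized = nullptr) {
    if (reinitialized) *reinitialized = false;
    const int resetSts = static_cast<int>(vpp.Reset(par));
    if (resetSts >= 0) return resetSts;  // success or warning: keep going
    vpp.Close();
    if (reinitialized) *reinitialized = true;
    return static_cast<int>(vpp.Init(par));
}

// Stand-in params and VPP used only to exercise the flow without the SDK.
struct FakeParams {
    int numStreams;
};

struct FakeVpp {
    int capacity   = 0;  // stream count accepted at the last Init
    int initCalls  = 0;
    int closeCalls = 0;

    int Init(FakeParams* p) { capacity = p->numStreams; ++initCalls; return 0; }
    // Mimics the documented rule: Reset cannot exceed the Init-time count.
    // -14 stands in for MFX_ERR_INCOMPATIBLE_VIDEO_PARAM.
    int Reset(FakeParams* p) { return p->numStreams <= capacity ? 0 : -14; }
    int Close() { capacity = 0; ++closeCalls; return 0; }
};
```

With the real SDK, vpp would be an MFXVideoVPP instance and par an mfxVideoParam* carrying the updated composition ext buffer. The fake only models the "not more than at Init" rule, so it does not reproduce any other reason Reset might fail.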

Hope this clarifies some of your application issue. Let us know if we can clarify anything more.

Thanks,
Surbhi

Hamza_U_1 · ‎04-13-2016

Hi Surbhi,

Thank you for the details.

I'm doing exactly as you described. The issue I'm facing is this:

Prepare params for 5->1 composition
Call QueryIOSurf with the params and allocate the requested number of surfaces
Init
RunFrameAsync 5 times for input layers until ERR_NONE is received
Call SyncOperation. Now output is ready.
Prepare params for 4->1 composition
Call Reset. (This reset works fine)
Process a few frames with these params and call Sync
Prepare params for 5->1 composition
Call Reset.
The above Reset call fails with MFX_ERR_INCOMPATIBLE_VIDEO_PARAM.

Basically, once I reduce the number of surfaces through Reset, changing back to the number of surfaces used at Init() fails.

To rephrase: reducing the number of surfaces through Reset() works fine, but increasing it back to the initial number of surfaces always fails.

Hamza_U_ · ‎04-19-2016

Hi Surbhi,

Were you able to reproduce/confirm the issue at your end ?

Memcpy from Vmem to Sysmem is very slow