The VPP performance

marina_golovkina · ‎04-21-2011

Hello,

We encode a stream with your AVC encoder (Media SDK 2.0.10) using hardware acceleration. Until now, color space conversion from YV12 to NV12 was done by copying bytes from the source buffer into the encoder's buffer. We have now switched to your VPP for the conversion, and performance has dropped considerably.

For example:
frame size 720x480, color format YV12
without VPP: 584 fps
with VPP: 346 fps

What could be the problem?

Thanks,
Marina

IDZ_A_Intel · ‎04-22-2011

Hi Marina,

Could you elaborate a bit on your solution? Is it based on the Media SDK samples such as "sample_encode" or the H.264 DirectShow sample filter?

What is your input? Raw YV12 data from file or from another source?

What platform are you running your test on?

Regards,

Petter

marina_golovkina · ‎04-24-2011

Hi Petter,

It is based on the "sample_encode".
Input is raw YV12 data from file.
Platform: Windows 7 64-bit
CPU: Sandy Bridge D2
Processor: Intel Core i7-2600 CPU

Here is how we use it:

// Add frame
int32_t PutFrame(uint8_t *pb_src, int32_t src_line_size, int32_t src_width, int32_t src_height, uint32_t fourcc)
{
    mfxStatus sts = MFX_ERR_NONE;
    CTask *task = NULL;

    pTaskPool->GetFreeTask(&task);

    int32_t surface_idx = GetFreeSurface(pEncSurfacePool, nSurfacePoolSize);
    mfxFrameSurface1 * pEncSurface = pEncSurfacePool + surface_idx;

    surface_idx = GetFreeSurface(pVppSurfacePool, nSurfacePoolSize);
    mfxFrameSurface1 * pVPPSurface = pVppSurfacePool + surface_idx;

    //copy input data (YV12 format) into surface for VPP
    //-----------------------------------------------------
    mfxU8 *ptrY = pVPPSurface->Data.Y;
    uint8_t * src_ptr = pb_src;
    for (int i = 0; i < src_height; i++)
    {
        memcpy(ptrY, src_ptr, src_width);
        ptrY += pVPPSurface->Data.Pitch;
        src_ptr += src_line_size;
    }

    int chroma_h = src_height >> 1;
    int chroma_w = src_width >> 1;
    int chroma_stride = src_line_size >> 1;

    uint8_t *pSrcV = pb_src + src_line_size*src_height;
    uint8_t *pSrcU = pSrcV + chroma_stride*src_height/2;

    mfxU8 *ptrV = pVPPSurface->Data.V;
    mfxU8 *ptrU = pVPPSurface->Data.U;

    for (int i = 0; i < chroma_h; i++)
    {
        memcpy(ptrV, pSrcV, chroma_w);
        memcpy(ptrU, pSrcU, chroma_w);

        pSrcV += chroma_stride;
        pSrcU += chroma_stride;
        ptrU += pVPPSurface->Data.Pitch >> 1;
        ptrV += pVPPSurface->Data.Pitch >> 1;
    }
    //---Copying END----------------------------------------

    sts = MFXVideoVPP_RunFrameVPPAsync(mfx_session, pVPPSurface, pEncSurface, NULL, &task->EncSyncP);
    if (sts < MFX_ERR_NONE)
        return H264ERROR_FAILED;

    for (;;)
    {
        sts = MFXVideoENCODE_EncodeFrameAsync(mfx_session, NULL, pEncSurface, &task->mfxBS, &task->EncSyncP);
        if (MFX_ERR_NOT_ENOUGH_BUFFER == sts)
        {
            sts = AllocateMFXBufstream(mfx_session, &task->mfxBS);
            if (sts < MFX_ERR_NONE)
                return H264ERROR_FAILED;
        }
        else if (sts == MFX_WRN_DEVICE_BUSY)
        {
            MCSleep(5);
        }
        else
        {
            break;
        }
    }

    if (sts < MFX_ERR_NONE && sts != MFX_ERR_MORE_DATA)
        return H264ERROR_FAILED;

    if (task->EncSyncP != NULL)
        pTaskPool->PushNewTask(task);
    else
        pTaskPool->FreeTask(task);

    return H264ERROR_NONE;
}

// Get coded data
ThreadFunc()
{
    do
    {
        CTask * pTask;
        pTaskPool->GetNextTask(&pTask);
        if (pTask == NULL)
            MCThreadReturn;

        mfxStatus ret;
        do
        {
            // WAIT_INTERVAL = 10000
            ret = MFXVideoCORE_SyncOperation(mfx_session, pTask->EncSyncP, WAIT_INTERVAL);
        } while (ret == MFX_WRN_IN_EXECUTION);

        if (ret != MFX_ERR_NONE)
        {
            pTaskPool->FreeTask(pTask);
            continue;
        }

        //Here is writing data (pTask->mfxBS) into a output file...
        //.........
        //--------------------------------------------------------

        pTask->EncSyncP = NULL;
        pTask->mfxBS.DataLength = 0;
        pTask->mfxBS.DataOffset = 0;
        pTaskPool->FreeTask(pTask);

    } while (1);

    ThreadReturn
}

Thanks,
Marina

IDZ_A_Intel · ‎04-25-2011

Hi Marina,

Thanks for providing some more details.

It is expected that adding VPP to your pipeline (even if it is HW accelerated) will impact overall performance. However, the difference you report is quite large, which leads me to believe there might be something else going on.

Are you using D3D or system memory surfaces? For the case when VPP was not used, was the NV12 surface read from file and managed in the same way as the YV12 surface in your code?

It would be interesting to know how much the memory copy impacts performance. Did you have a similar copy segment for the case of using an NV12 surface?

Also, for the case of using VPP, make sure that Media SDK does not fall back to the SW implementation. This can happen for certain initialization parameter settings. You can check the selected implementation by calling "MFXQueryIMPL" after completing Encoder and VPP component initialization.

Thanks,

Petter

marina_golovkina · ‎04-26-2011

Hi Petter,

I am using system memory surfaces.
I checked the selected implementation by calling "MFXQueryIMPL". The implementation has not fallen back to SW.

I think you have misunderstood me. When VPP is not used, the input data is YV12 as well; it is converted to NV12 without VPP.

Here is the code for the case without VPP:

// Add frame
int32_t PutFrame(uint8_t *pb_src, int32_t src_line_size, int32_t src_width, int32_t src_height, uint32_t fourcc)
{
    mfxStatus sts = MFX_ERR_NONE;
    CTask *task = NULL;

    pTaskPool->GetFreeTask(&task);

    int32_t surface_idx = GetFreeSurface(pEncSurfacePool, nSurfacePoolSize);
    mfxFrameSurface1 * pEncSurface = pEncSurfacePool + surface_idx;

    // Conversion YV12 to NV12
    mfxU8 *ptrY = pEncSurface->Data.Y;
    uint8_t * src_ptr = pb_src;
    for (int i = 0; i < src_height; i++)
    {
        memcpy(ptrY, src_ptr, src_width);
        ptrY += pEncSurface->Data.Pitch;
        src_ptr += src_line_size;
    }

    int chroma_h = src_height >> 1;
    int chroma_w = src_width >> 1;
    int chroma_stride = src_line_size >> 1;

    uint8_t *pSrcV, *pSrcU;
    pSrcV = pb_src + src_line_size*src_height;
    pSrcU = pSrcV + chroma_stride*src_height/2;

    // NV12 interleaves chroma: Data.U and Data.V point one byte apart into the same UV plane
    mfxU8 *ptrV = pEncSurface->Data.V;
    mfxU8 *ptrU = pEncSurface->Data.U;

    for (int i = 0; i < chroma_h; i++)
    {
        for(int j = 0; j < chroma_w; j++)
        {
            // read sample j of each planar row, write it to every second byte
            ptrV[j*2] = pSrcV[j];
            ptrU[j*2] = pSrcU[j];
        }
        pSrcV += chroma_stride;
        pSrcU += chroma_stride;

        ptrU += pEncSurface->Data.Pitch;
        ptrV += pEncSurface->Data.Pitch;
    }
    //---Conversion END----------------------------------------

    for (;;)
    {
        sts = MFXVideoENCODE_EncodeFrameAsync(mfx_session, NULL, pEncSurface, &task->mfxBS, &task->EncSyncP);
        if (MFX_ERR_NOT_ENOUGH_BUFFER == sts)
        {
            sts = AllocateMFXBufstream(mfx_session, &task->mfxBS);
            if (sts < MFX_ERR_NONE)
                return H264ERROR_FAILED;
        }
        else if (sts == MFX_WRN_DEVICE_BUSY)
        {
            MCSleep(5);
        }
        else
        {
            break;
        }
    }

    if (sts < MFX_ERR_NONE && sts != MFX_ERR_MORE_DATA)
        return H264ERROR_FAILED;

    if (task->EncSyncP != NULL)
        pTaskPool->PushNewTask(task);
    else
        pTaskPool->FreeTask(task);

    return H264ERROR_NONE;
}

// Get coded data
ThreadFunc()
{
    do
    {
        CTask * pTask;
        pTaskPool->GetNextTask(&pTask);
        if (pTask == NULL)
            MCThreadReturn;

        mfxStatus ret;
        do
        {
            // WAIT_INTERVAL = 10000
            ret = MFXVideoCORE_SyncOperation(mfx_session, pTask->EncSyncP, WAIT_INTERVAL);
        } while (ret == MFX_WRN_IN_EXECUTION);

        if (ret != MFX_ERR_NONE)
        {
            pTaskPool->FreeTask(pTask);
            continue;
        }

        //Here is writing data (pTask->mfxBS) into a output file...
        //.........
        //--------------------------------------------------------

        pTask->EncSyncP = NULL;
        pTask->mfxBS.DataLength = 0;
        pTask->mfxBS.DataOffset = 0;
        pTaskPool->FreeTask(pTask);

    } while (1);

    ThreadReturn
}

Thanks,
Marina

IDZ_A_Intel · ‎04-26-2011

Hi Marina,

Thanks for clarifying the way you handle the case when not using VPP.

The key difference between the two cases is that without VPP you perform the copy and the conversion in a single operation, whereas with VPP you copy first and then convert separately, which naturally costs more than a combined copy/convert.
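
As a rough illustration of the combined copy/convert, here is a minimal self-contained sketch in plain C++ with no Media SDK calls. The function name Yv12ToNv12 and its parameters are hypothetical. It assumes a YV12 source laid out as Y, then V, then U, with a chroma stride of half the luma stride (as in Marina's code), and an NV12 destination whose interleaved UV plane uses the same pitch as the Y plane.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: converts one YV12 frame to NV12 in a single pass.
// src       : YV12 frame (Y plane, then V plane, then U plane)
// src_pitch : bytes per luma row in src; chroma rows use src_pitch / 2
// dst_y     : NV12 luma plane
// dst_uv    : NV12 interleaved chroma plane (U0 V0 U1 V1 ...)
// dst_pitch : bytes per row for both NV12 planes
void Yv12ToNv12(const uint8_t* src, int src_pitch, int width, int height,
                uint8_t* dst_y, uint8_t* dst_uv, int dst_pitch)
{
    assert(width % 2 == 0 && height % 2 == 0);
    assert(src_pitch >= width && dst_pitch >= width);

    // Luma is identical in both formats: copy it row by row.
    for (int row = 0; row < height; ++row)
        std::memcpy(dst_y + row * dst_pitch, src + row * src_pitch, width);

    const int chroma_w = width / 2;
    const int chroma_h = height / 2;
    const int chroma_pitch = src_pitch / 2;
    const uint8_t* src_v = src + src_pitch * height;
    const uint8_t* src_u = src_v + chroma_pitch * chroma_h;

    // Chroma: read the two planar rows and write them interleaved.
    for (int row = 0; row < chroma_h; ++row) {
        uint8_t* out = dst_uv + row * dst_pitch;
        for (int col = 0; col < chroma_w; ++col) {
            out[2 * col]     = src_u[col];
            out[2 * col + 1] = src_v[col];
        }
        src_u += chroma_pitch;
        src_v += chroma_pitch;
    }
}
```

With a Media SDK NV12 surface, dst_y would typically correspond to Data.Y, dst_uv to Data.U, and dst_pitch to Data.Pitch. The source is read once and the destination written once, with no intermediate YV12 surface.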

There is not really a way around this unless you have the ability to read from file straight into the surface buffer.

To assess the impact of the copy before VPP, try removing it and measuring the performance again (I know this will result in garbage frame data, but it will give you a sense of the copy's performance impact).
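
One way to run such a measurement is a small timing helper like the sketch below. It is only an illustration in standard C++; the name MeasureFps is hypothetical and nothing in it comes from Media SDK.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: calls frame_fn `frames` times and returns the
// achieved rate in frames per second (0 if the clock saw no elapsed time).
template <typename FrameFn>
double MeasureFps(FrameFn frame_fn, int frames)
{
    assert(frames > 0);
    using Clock = std::chrono::steady_clock;

    const Clock::time_point start = Clock::now();
    for (int i = 0; i < frames; ++i)
        frame_fn();
    const std::chrono::duration<double> elapsed = Clock::now() - start;

    if (elapsed.count() <= 0.0)
        return 0.0;
    return frames / elapsed.count();
}
```

The lambda passed in would wrap one iteration of the pipeline, once with the copy loop and once without. Since PutFrame only queues asynchronous work, the timed region should also cover waiting for the last task to complete, otherwise the result reflects submission rate rather than encode throughput.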

For the case of using HW acceleration, I suggest you try using D3D surfaces instead of surfaces in system memory. This will eliminate some internal Media SDK copies from system memory to D3D surfaces. I do not know the greater context of your implementation, so I'm not sure that approach is feasible for you, but if possible I suggest exploring this option.

Thanks,

Petter

marina_golovkina · ‎05-03-2011

Hi Petter,

I apologize for the late reply.

Following your advice, I removed the copy before VPP and measured the performance again, but it made no difference; the problem with VPP remains.
For the case without VPP, I also tried copying and converting separately, and that made no difference either.
The impact of the copy is insignificant.

You suggested using D3D surfaces instead of surfaces in system memory, but that approach is not suitable for us.

Thanks,
Marina

IDZ_A_Intel · ‎05-05-2011

Hi Marina,

I did some experiments on my side, and in your case I think the best solution is to handle color conversion yourself and not use VPP. The reason is that your method lets you handle copy and convert efficiently in one operation. With VPP, you would instead need to copy YV12 to an MSDK surface (which involves some internal MSDK surface copies, more on that below) and transfer to and from HW. If you do not plan on using other VPP preprocessing operations such as scaling, then using your own color conversion routine may be the most efficient option.

I'd like to expand a bit on the implicit difference between using D3D vs. system memory surfaces with Media SDK, specifically for encode.

Correct me if I'm wrong, but my assumption is that you are comparing the following two scenarios, both using HW encode and system memory surfaces:

1. Not using VPP but your own copy convert routine:

raw file -> raw data -> copy/convert from YV12 to NV12 -> MSDK:Copy surface from sysmem to D3D -> MSDK:HW encode -> MSDK:Copy BS from D3D to sysmem

2. Using VPP:

raw file -> raw data -> copy raw data to YV12 surface -> MSDK:Copy surface from sysmem to D3D -> MSDK:HW VPP -> Copy surface from D3D to sysmem -> NV12 surface (sys mem) -> MSDK:Copy surface from sysmem to D3D -> MSDK:HW encode -> MSDK:Copy BS from D3D to sysmem

As you can see, because system memory is used with a HW target, the VPP case involves several additional copies between system memory and D3D memory (required since HW only operates on D3D surfaces). These additional copies are quite costly and impact performance.
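
To put the numbers reported at the start of the thread in perspective, the sketch below converts the two frame rates into per-frame times and computes the size of one 720x480 surface. The helper names are hypothetical, and pitch padding is ignored.

```cpp
#include <cassert>
#include <cstddef>

// Per-frame time in milliseconds for a given throughput.
double FrameTimeMs(double fps)
{
    assert(fps > 0.0);
    return 1000.0 / fps;
}

// Size in bytes of one 8-bit 4:2:0 frame. NV12 and YV12 both use
// 12 bits per pixel, so the size is width * height * 3 / 2.
std::size_t Frame420Bytes(int width, int height)
{
    assert(width > 0 && height > 0);
    return static_cast<std::size_t>(width) * height * 3 / 2;
}
```

For 584 fps and 346 fps this gives roughly 1.7 ms and 2.9 ms per frame, a difference of about 1.2 ms, and each 720x480 surface copy moves 518400 bytes. Scenario 2 lists three surface copies between system memory and D3D where scenario 1 lists one.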

Compare the above to the scenario below that is using D3D surfaces with HW VPP:

raw file -> raw data -> copy raw data to YV12 surface -> MSDK:HW VPP -> NV12 surface (D3D) -> MSDK:HW encode -> MSDK:Copy BS from D3D to sysmem

(Note that the above illustrates your setup; the copy of raw data to the YV12 surface can be eliminated if the file is read directly into the surface.)

Are you sure using D3D surfaces is not an option for you? It should not affect the way you feed data to or extract data from MSDK, and using D3D would speed up your solution with your own copy-convert too.

Thanks,

Petter