Problems understanding Decode + VPP workflow for JPEG -> RGB4

Jesse_K_
Beginner

Hello!

I'm attempting to use Intel Media SDK 2016.0.2 to decode full HD JPEG images into RGB4. The goal is to find out whether it's feasible to build a JPEG decompressor for certain cameras that output their live feed as JPEG, and whether it would be faster than SIMD-optimized libjpeg-turbo. I've built my program by basing it on the sample_decode example and removing the portions I don't need, to keep the program small. For my test, I'm feeding the program a single JPEG image file with no JFIF information (linked below). My CPU is an Intel i5-4430 @ 3.00 GHz running Windows 8.1, and my GPU is an NVIDIA GTX 1060 6GB; no Intel GPU is present. For now I'm forcing the SDK to run in software mode, but ultimately I'd like to use hardware acceleration on a computer that supports it.

I'm having trouble understanding several key concepts about using the decode pipeline together with VPP.

1) Is it possible to decode JPEG directly into RGB4? Several posts I found on these forums suggest it's not, and that I need to use VPP to convert NV12 into RGB4. I attempted to customize the simple_decode example so it outputs RGB4, but I always received NV12 in the output. I wrote my own NV12 -> RGB4 converter, but it's relatively slow even when multithreaded, and there are some artifacts.

2) When allocating mfxFrameSurface1 instances for MFXVideoDECODE and MFXVideoVPP, do I require 2 or 3 pools? My understanding is this:

* Allocate one pool for decoder input, with its type set to whatever MFXVideoDECODE_DecodeHeader finds and NV12 as the chroma format.
* Allocate one pool for VPP input, into which the decoder writes its output in the DecodeFrameAsync() call. This is also the input for RunFrameVPPAsync().
* Allocate one pool for VPP output, into which RunFrameVPPAsync() writes its data. This is the final output surface, which will contain the RGB4 data.

3) How many times do I need to call my memory allocator when using both DECODE and VPP? Since VPP requires two mfxFrameAllocRequests, can I reuse those for DECODE as well, or do I need to call my allocator separately for decoder-specific data?

4) How do I use the sync points properly? My understanding of the asynchronous functions is that they work this way:

* Call DecodeFrameAsync(..., ..., syncPointA);
* Call RunFrameVPPAsync(..., ..., syncPointB);
* Call session.SyncOperation(syncPointB, time);

In this example, RunFrameVPPAsync() won't start working at all until syncPointA is triggered, and SyncOperation() will wait until syncPointB is done. Is this correct?
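
In code form, the loop I have in mind looks roughly like this (a simplified sketch of my understanding; surface selection and handling of MFX_ERR_MORE_DATA / MFX_WRN_DEVICE_BUSY are omitted, and names like mfxDEC, mfxVPP, pSurfDecWork and pSurfVPPOut are placeholders):

    mfxSyncPoint syncPointA = NULL;
    mfxSyncPoint syncPointB = NULL;
    mfxFrameSurface1* pDecOut = NULL;

    // Decode one JPEG from the bitstream into an NV12 surface (decode-out / VPP-in pool)
    sts = mfxDEC.DecodeFrameAsync(&mfxBS, pSurfDecWork, &pDecOut, &syncPointA);

    // Hand the decoded surface to VPP for the NV12 -> RGB4 conversion (VPP-out pool)
    sts = mfxVPP.RunFrameVPPAsync(pDecOut, pSurfVPPOut, NULL, &syncPointB);

    // Synchronize only on the last sync point in the chain
    sts = session.SyncOperation(syncPointB, 60000);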

In the linked source code, the main problem is that the output buffer does not contain the RGB4 data as expected; instead, the program crashes when the buffers are read. Secondly, if I skip the VPP stage and instead try to output the NV12 data when DecodeFrameAsync() is done, the Y and UV buffers appear to contain garbage.

Source: http://plantmonster.net/koodailut/cplusplus/intel_media_sdk/main.cpp
Test JPEG: http://plantmonster.net/koodailut/cplusplus/intel_media_sdk/sample.jpg

Kamal_Devanga
Beginner

Something I noticed is that you're calling Lock on your surfaces when you allocate them. You only need to call Lock/Unlock when you want to read from or write to them, i.e. for encode when you're passing a new frame in to be encoded, and for decode when a new frame is ready for you to copy out. Also, every Lock must have a corresponding Unlock, though I'm unsure whether this makes a difference with a general allocator.
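
The typical pattern is to bracket just the CPU access, roughly like this (a sketch only; mfxAllocator and pSurf stand in for your allocator and surface):

    // Lock fills in pSurf->Data (plane pointers, pitch) for CPU access
    sts = mfxAllocator.Lock(mfxAllocator.pthis, pSurf->Data.MemId, &pSurf->Data);

    // ... read the decoded planes via pSurf->Data.Y and pSurf->Data.U here ...

    // Unlock when done so the SDK can reuse the surface
    sts = mfxAllocator.Unlock(mfxAllocator.pthis, pSurf->Data.MemId, &pSurf->Data);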

 

 

Jeffrey_M_Intel1
Employee

In general you can assume that the codecs work in NV12.  VPP is there to help with color conversions.

You only need 2 surface pools. 

  1. decode out, VPP in
  2. VPP out

Two pools are required because VPP in and out in this pipeline have different formats.
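
For sizing the pools, the usual pattern is to query both components and combine the requests, roughly like this (a sketch along the lines of the tutorial; names follow the tutorial, error checks omitted):

    // Ask DECODE and VPP how many surfaces they each suggest
    mfxFrameAllocRequest DecRequest;
    memset(&DecRequest, 0, sizeof(DecRequest));
    sts = mfxDEC.QueryIOSurf(&mfxVideoParams, &DecRequest);

    mfxFrameAllocRequest VPPRequest[2];     // [0] = VPP in, [1] = VPP out
    memset(&VPPRequest, 0, sizeof(mfxFrameAllocRequest) * 2);
    sts = mfxVPP.QueryIOSurf(&VPPParams, VPPRequest);

    // Pool 1 is shared by decode output and VPP input, so add the suggestions together
    mfxU16 nSurfNumDecVPP = DecRequest.NumFrameSuggested + VPPRequest[0].NumFrameSuggested;
    // Pool 2 holds the RGB4 VPP output
    mfxU16 nSurfNumVPPOut = VPPRequest[1].NumFrameSuggested;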

You can use the simple_6_decode_vpp_postproc tutorial for reference.  Tutorials can be downloaded here: https://software.intel.com/en-us/intel-media-server-studio-support/code-samples.

Just change the VPP out configuration to do the color conversion and not the resize in the original:

    VPPParams.vpp.Out.FourCC = MFX_FOURCC_RGB4;
    VPPParams.vpp.Out.ChromaFormat = MFX_CHROMAFORMAT_YUV420;
...
    VPPParams.vpp.Out.CropW = VPPParams.vpp.In.CropW;   
    VPPParams.vpp.Out.CropH = VPPParams.vpp.In.CropH;

 

and the VPP out surface allocations:

    // Allocate surfaces for VPP Out
    // - Width and height of buffer must be aligned, a multiple of 32
    // - Frame surface array keeps pointers to all surface planes and general frame info
    width = (mfxU16) MSDK_ALIGN32(VPPRequest[1].Info.Width);
    height = (mfxU16) MSDK_ALIGN32(VPPRequest[1].Info.Height);
    bitsPerPixel = 32;      // RGB4 is a 32 bits per pixel format
    surfaceSize = width * height * bitsPerPixel / 8;
    mfxU8* surfaceBuffers2 = (mfxU8*) new mfxU8[surfaceSize * nSurfNumVPPOut];

    mfxFrameSurface1** pmfxSurfaces2 = new mfxFrameSurface1 *[nSurfNumVPPOut];
    MSDK_CHECK_POINTER(pmfxSurfaces2, MFX_ERR_MEMORY_ALLOC);
    for (int i = 0; i < nSurfNumVPPOut; i++) {
        pmfxSurfaces2[i] = new mfxFrameSurface1;
        memset(pmfxSurfaces2[i], 0, sizeof(mfxFrameSurface1));
        memcpy(&(pmfxSurfaces2[i]->Info), &(VPPParams.vpp.Out), sizeof(mfxFrameInfo));
        pmfxSurfaces2[i]->Data.B = &surfaceBuffers2[surfaceSize * i];
        pmfxSurfaces2[i]->Data.G = pmfxSurfaces2[i]->Data.B + 1;
        pmfxSurfaces2[i]->Data.R = pmfxSurfaces2[i]->Data.B + 2;
        pmfxSurfaces2[i]->Data.A = pmfxSurfaces2[i]->Data.B + 3;
        pmfxSurfaces2[i]->Data.Pitch = width * 4;
    }

Output after VPP should now be in RGB4.

In case it helps, here is a function you can add to output your RGB data.  You can check the result with ffmpeg, e.g. 'ffmpeg -f rawvideo -pix_fmt bgra -s {geom} -i RGB4_file out.png'.

mfxStatus WriteRawFrameRGB(mfxFrameSurface1* pSurface, FILE* fSink)
{
    mfxFrameInfo* pInfo = &pSurface->Info;
    mfxFrameData* pData = &pSurface->Data;
    mfxStatus sts = MFX_ERR_NONE;

    mfxU32 w = pInfo->Width;
    mfxU32 h = pInfo->Height;

    // RGB4/BGRA is interleaved, so one fwrite per row is enough
    for (mfxU32 i = 0; i < h; i++) {
        fwrite(pData->B + i * pData->Pitch, 1, w * 4, fSink);
    }

    return sts;
}
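
A minimal way to call it after the VPP sync point completes could look like this (the file name, syncPointB and nIndexVPPOut are placeholders):

    FILE* fSink = fopen("out_1920x1080.rgb4", "wb");    // open once, before the decode loop

    ...

    sts = session.SyncOperation(syncPointB, 60000);
    if (MFX_ERR_NONE == sts)
        WriteRawFrameRGB(pmfxSurfaces2[nIndexVPPOut], fSink);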

 

Jesse_K_
Beginner

RobinsonJ wrote:

Something I noticed is that you're calling Lock on your surfaces when you allocate them. You only need to call Lock/Unlock when you want to read from or write to them, i.e. for encode when you're passing a new frame in to be encoded, and for decode when a new frame is ready for you to copy out. Also, every Lock must have a corresponding Unlock, though I'm unsure whether this makes a difference with a general allocator.

This is from sample_decode, where the general memory allocator implements a "lock" function, but the contents of that function don't seem to do anything related to locking. Instead, the function performs some kind of memory alignment, though I didn't fully understand what it does. Not performing that "lock" results in MFX_ERR_INVALID_HANDLE in either of the async calls. We figured the function is probably called "lock" because it derives from the abstract memory allocator interface, and for some allocators the lock does perform actual locking.

Jeffrey M. (Intel) wrote:

In general you can assume that the codecs work in NV12.  VPP is there to help with color conversions.

You only need 2 surface pools.

    decode out, VPP in
    VPP out

Two pools are required because VPP in and out in this pipeline have different formats.

Good to know.

Jeffrey M. (Intel) wrote:

You can use the simple_6_decode_vpp_postproc tutorial for reference.  Tutorials can be downloaded here: https://software.intel.com/en-us/intel-media-server-studio-support/code-samples.

Oh, I didn't notice that there were tutorials. I only got the "samples". I'll spend a while going through the tutorials now that I found them and see if they help.

Jeffrey M. (Intel) wrote:

In case it helps, here is a function you can add to output your RGB data.  You can check the result with ffmpeg, e.g. 'ffmpeg -f rawvideo -pix_fmt bgra -s {geom} -i RGB4_file out.png'.

Useful, thanks.

 

Jesse_K_
Beginner

With the example "simple_decode_vpp_pp" I was able to get this running properly. Thank you for the help.

Jesse_K_
Beginner

While I was able to get the software implementation working just fine, I'm having difficulties decoding with the hardware implementation. I've been looking at the "simple_6_encode_vmem_vpp_preproc" tutorial as well as "sample_decode" from the samples package, and from those I've copied the memory allocator and hardware initialization functions. One new development is that I've enabled the integrated Intel GPU on my motherboard, so my current computer now has the Intel HD 4600 integrated GPU as well as the GeForce GTX 1060. If I don't have the Intel GPU enabled, MFXInit() fails with MFX_ERR_UNSUPPORTED. I've not tried removing the NVIDIA GPU. From searching these forums, I found some people who solved the problem by moving their monitor to the Intel GPU's motherboard connector, but I haven't tried this as I don't currently have a DVI monitor handy. I installed the latest Intel GPU drivers yesterday. My compiler is Visual Studio 2013, 64-bit.

If I use D3D11, my problems start at the MFXVideoVPP::QueryIOSurf() call. With both the unmodified "simple_6_encode_vmem_vpp_preproc" and my own version adapted for decoding, this function throws an exception somewhere inside the SDK that I'm not able to catch. According to the Visual Studio debugger the exception is "_com_error at memory location...", but I can't get more details because it happens inside the function call. QtCreator, in which I develop the decoder, tells me the exception is "d3d11!ThrowFailure", which I assume means there's some problem in the exception handling itself.

The second batch of exceptions happens in MFXVideoVPP::Init(), which produces three more of the same exceptions.

In my decoding program, if I ignore these exceptions and let the program continue, everything proceeds without errors through DecodeFrameAsync() and RunFrameVPPAsync(), but SyncOperation() returns MFX_ERR_DEVICE_FAILED along with the same exception as above. With "simple_6_encode_vmem_vpp_preproc", no errors occur during the processing loop and the encoding proceeds normally (I'm using the simulated input, so I'm not sure whether the encoder actually works).

Attempting to use DirectX 9 results in D3DERR_INVALIDCALL (0x8876086c) from IDirect3D9Ex::CreateDeviceEx(), both with "simple_6_encode_vmem_vpp_preproc" and my decoder adapted for DX9. According to Microsoft documentation this error means "The method call is invalid. For example, a method's parameter may not be a valid pointer." The installed DirectX SDK version is "Microsoft DirectX SDK (June 2010)".

Then, assuming these get fixed, I have a question about initializing the surfaces. In the software implementation it was necessary to set the surface data pointers like so:

    for (int i = 0; i < nSurfNumDecVPP; i++)
    { 
      pmfxSurfaces[i] = new mfxFrameSurface1;
      if (!pmfxSurfaces[i])
      { 
        LOG_ERROR << "Failed allocating surfaces for decompressor";
        return false;
      }
      memset(pmfxSurfaces[i], 0, sizeof(mfxFrameSurface1));
      memcpy(&(pmfxSurfaces[i]->Info), &(jpegParams.mfx.FrameInfo), sizeof(mfxFrameInfo));
      pmfxSurfaces[i]->Data.MemId = mfxResponse.mids[i];
      
      // Is this needed for vram?
      pmfxSurfaces[i]->Data.Y = &surfaceBuffers[surfaceSize * i]; 
      pmfxSurfaces[i]->Data.U = pmfxSurfaces[i]->Data.Y + width * height;
      pmfxSurfaces[i]->Data.V = pmfxSurfaces[i]->Data.U + 1;
      pmfxSurfaces[i]->Data.Pitch = width;
    }

    ...

    for (int i = 0; i < nSurfNumVPPOut; i++)
    {
      pmfxSurfaces2[i] = new mfxFrameSurface1;
      if (!pmfxSurfaces2[i])
      {
        LOG_ERROR << "Failed allocating surfaces for VPP";
        return false;
      }
      memset(pmfxSurfaces2[i], 0, sizeof(mfxFrameSurface1));
      memcpy(&(pmfxSurfaces2[i]->Info), &(VPPParams.vpp.Out), sizeof(mfxFrameInfo));

      // Is this needed for vram?
      pmfxSurfaces2[i]->Data.MemId = mfxResponseVPPOut.mids[i];
      pmfxSurfaces2[i]->Data.B = &surfaceBuffers[surfaceSize * i];
      pmfxSurfaces2[i]->Data.G = pmfxSurfaces2[i]->Data.B + 1;
      pmfxSurfaces2[i]->Data.R = pmfxSurfaces2[i]->Data.B + 2;
      pmfxSurfaces2[i]->Data.A = pmfxSurfaces2[i]->Data.B + 3;
      pmfxSurfaces2[i]->Data.Pitch = width * 4;
    }

Is this required when using DirectX?

 

The code: http://plantmonster.net/koodailut/cplusplus/intel_media_sdk/hardware_decoder.cpp
The test JPEG image I'm using: http://plantmonster.net/koodailut/cplusplus/intel_media_sdk/sample.jpg
DxDiag: http://plantmonster.net/koodailut/cplusplus/intel_media_sdk/DxDiag.txt

 

Jeffrey_M_Intel1
Employee

Do you see the errors you described when running the samples and tutorials too?

For video memory, you're correct: setting up the YUV and BGRA pointers isn't necessary.  We've set up the tutorials so you can compare the system-memory and video-memory implementations side by side in an editor and quickly see the implementation requirements for both modes.
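
In other words, with video memory the surface setup reduces to attaching the allocator's MemId, something like this (a sketch using the names from your snippet):

    // Video memory: only the MemId is needed; no CPU plane pointers or Pitch
    for (int i = 0; i < nSurfNumVPPOut; i++) {
        pmfxSurfaces2[i] = new mfxFrameSurface1;
        memset(pmfxSurfaces2[i], 0, sizeof(mfxFrameSurface1));
        memcpy(&(pmfxSurfaces2[i]->Info), &(VPPParams.vpp.Out), sizeof(mfxFrameInfo));
        pmfxSurfaces2[i]->Data.MemId = mfxResponseVPPOut.mids[i];   // D3D surface handle from the allocator
        // Data.B/G/R/A and Pitch stay zero; the allocator's Lock fills mfxFrameData when CPU access is needed
    }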

 
