Media (Intel® Video Processing Library, Intel Media SDK)
Access community support with transcoding, decoding, and encoding in applications using media tools like Intel® oneAPI Video Processing Library and Intel® Media SDK
Announcements
The Intel Media SDK project is no longer active. For continued support and access to new features, Intel Media SDK users are encouraged to read the transition guide on upgrading from Intel® Media SDK to Intel® Video Processing Library (VPL), and to move to VPL as soon as possible.
For more information, see the VPL website.

Some questions about the H.264 encoder sample

AaronL
Beginner
413 Views

There are some aspects of how the H.264 encoder sample (sample_encode) is implemented that are unclear to me.

1.  Why _doesn't_ the sample use VPP for color conversion from YV12 to NV12?  There are two supported color formats for the input file, YV12 and NV12, with YV12 being the default.  However, regardless of the input file color format, it configures VPP in and out to use the NV12 color format.  When it gets to the following lines:

    if (pParams->nWidth  != pParams->nDstWidth ||
        pParams->nHeight != pParams->nDstHeight ||
        m_mfxVppParams.vpp.In.FourCC != m_mfxVppParams.vpp.Out.FourCC)
    {
        m_pmfxVPP = new MFXVideoVPP(m_mfxSession);
        MSDK_CHECK_POINTER(m_pmfxVPP, MFX_ERR_MEMORY_ALLOC);
    }

which are used to decide whether or not to use VPP, it only creates an instance of MFXVideoVPP if the destination width and/or height differ from the input width and height.  There is no situation in the code where the VPP in/out FourCC values are different.  Instead of using VPP for color conversion, it does the conversion in software, in CSmplYUVReader::LoadNextFrame (in sample_utils.cpp).  Line 185 handles the case where the file input FourCC value is YV12 while the frame is expected to have a FourCC value of NV12, so it converts in software.  However, LoadNextFrame has clearly also been implemented to handle YV12->YV12 per the code that starts on line 238, but that code is never used in the sample_encode sample.

Why isn't VPP being used for color format conversion in this case?  I made slight alterations to the code to use VPP for color conversion, and while there is no noticeable difference when using sample_encode.exe on the test_stream.yuv file found in the sample video content ZIP, it does result in a noticeable performance increase for a 40 second 720p59.94 YV12 raw .yuv file that I generated, going from roughly 8.8 seconds to 7.2 seconds for the full encode.  At the very least, it seems like this ought to be exposed via an option in the sample.  But, perhaps there is a good reason for not using VPP in this case as well.
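For reference, the software path in LoadNextFrame amounts to a planar-to-semi-planar interleave.  Here is a minimal, hypothetical sketch of what a YV12 to NV12 conversion looks like (illustrative only; the sample's real code also has to honor the surface pitch):

```cpp
#include <algorithm>
#include <cstdint>

// Illustrative YV12 -> NV12 conversion (not the sample's actual code).
// YV12 stores the Y plane, then the full V plane, then the full U plane;
// NV12 stores the Y plane, then one plane of interleaved U/V byte pairs.
// Assumes tightly packed buffers (pitch == width), unlike real surfaces.
void Yv12ToNv12(const uint8_t* src, uint8_t* dst, int width, int height)
{
    const int lumaSize   = width * height;
    const int chromaSize = lumaSize / 4;      // each chroma plane is (w/2) x (h/2)
    const uint8_t* srcV  = src + lumaSize;    // V plane comes first in YV12
    const uint8_t* srcU  = srcV + chromaSize;

    // The luma plane is identical in both layouts.
    std::copy(src, src + lumaSize, dst);

    // Interleave U and V into the NV12 UV plane (U in even bytes, V in odd).
    uint8_t* dstUV = dst + lumaSize;
    for (int i = 0; i < chromaSize; ++i) {
        dstUV[2 * i]     = srcU[i];
        dstUV[2 * i + 1] = srcV[i];
    }
}
```

This is the per-frame work that moving the conversion into VPP would offload to the GPU.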

2.  Why doesn't it OR MFX_IMPL_VIA_D3D9 with impl in the case that Direct3D 9 is being used?  On line 936 of pipeline_encode.cpp, it ORs MFX_IMPL_VIA_D3D11 in the case that Direct3D 11 is used (only relevant for Windows 8 and above though), but it does nothing if Direct3D 9 is used, which, according to the documentation, means that MFX_IMPL_VIA_ANY will be used.  It is unclear from the documentation what precisely happens when MFX_IMPL_VIA_ANY is used.

3.  In CD3D9Device::Init() (in d3d_device.cpp), why does it call CreateDeviceEx() with the D3DCREATE_SOFTWARE_VERTEXPROCESSING flag?  The Media SDK documentation discusses using the D3DCREATE_MULTITHREADED and D3DCREATE_FPU_PRESERVE flags but makes no mention of any of the vertex processing flags.  There are two other options, D3DCREATE_HARDWARE_VERTEXPROCESSING and D3DCREATE_MIXED_VERTEXPROCESSING, that may be used instead.  According to the documentation, one of the three must be specified, but why choose the software-only option over the others?

Thanks,

Aaron Levinson

7 Replies
ME
New Contributor I

#2. The MFX_IMPL_VIA_D3D9 ... values are sequential.  ORing is not correct.  It's a bug in the sample(s).  You might like the tutorial samples better.  They are a lot simpler.

Sravanthi_K_Intel

Hey Aaron,

Thanks for reporting your numbers for implementing the color conversion in VPP for the samples. Yes, implementing color conversion using VPP can be faster. My understanding is that the current implementation was chosen to simplify (!) the samples and/or to share that code across multiple samples. And the samples are not by any means production quality; they are there to help developers get started.

In any case, thanks for the input. I am going to implement the color conversion in VPP and check the performance. And I like the idea of making that an option for execution. If you would like to share your code, that would be great too.

Regarding (2) and (3), I will get back to you. I like your observations in (1) - thanks again!

AaronL
Beginner

betlet wrote:

#2. The MFX_IMPL_VIA_D3D9 ... values are sequential.  ORing is not correct.  It's a bug in the sample(s).  You might like the tutorial samples better.  They are a lot simpler.

I took a look at the tutorial.  It uses ORing as well, so I don't think there is anything wrong with the sample in this case.  Here's the code from the tutorial, which comes from common\common_utils_window.cpp:

mfxStatus Initialize(mfxIMPL impl, mfxVersion ver, MFXVideoSession* pSession, mfxFrameAllocator* pmfxAllocator, bool bCreateSharedHandles)
{
    mfxStatus sts = MFX_ERR_NONE;

#ifdef DX11_D3D
    impl |= MFX_IMPL_VIA_D3D11;
#endif

Aaron

AaronL
Beginner

SRAVANTHI K. (Intel) wrote:

Thanks for reporting your numbers for implementing the color conversion in VPP for the samples. Yes, implementing color conversion using VPP can be faster. My understanding is that the current implementation was chosen to simplify (!) the samples and/or to share that code across multiple samples. And the samples are not by any means production quality; they are there to help developers get started.

In any case, thanks for the input. I am going to implement the color conversion in VPP and check the performance. And I like the idea of making that an option for execution. If you would like to share your code, that would be great too.

I've attached a patch that provides two new options:

  • -cchw:  when turned on, causes VPP to be used for color conversion, if appropriate
  • -duration:  when turned on, reports the duration of the encode operation.

I also made a few minor changes to clean up some comments and to remove an extra (benign) semicolon that I noticed.

Here's some data for running sample_encode.exe with various options.  This is for a 32-bit binary of sample_encode.exe run on an Ivy Bridge (HD4000) Windows 7 64-bit system for a 40 second raw YV12 720p59.94 clip.  Here's the basic command-line:

sample_encode.exe h264 -i raw_video_stream_yv12.yuv -o raw_video_stream.h264 -hw -w 1280 -h 720 -f 59.94 -duration

So, it will always use hardware encoding for the following examples:

  • Software buffers + software color format conversion (no additional options):  ~6.5-6.9 s
  • Software buffers + VPP for color format conversion (-cchw):  ~9.2-9.4 s
  • Hardware buffers (Direct3D 9) + software color format conversion (-d3d):  ~8.6-8.9 s
  • Hardware buffers (Direct3D 9) + VPP for color format conversion (-d3d -cchw):  ~7.0-7.2 s

I don't have the ability to test with Direct3D 11 (-d3d11), as this is not supported on Windows 7.

Yesterday, when I first reported my results, I hadn't tried with software buffers only, so I had been under the impression that the best results were possible with hardware buffers plus VPP for color conversion.  But, that is clearly not the case, since I saw the best results with just software buffers and software color conversion.

I admit that I would have expected the results with hardware buffers to be superior to those with software buffers, but this isn't at all what happened.  Perhaps the fact that it has to load from and write to files accounts for this, even if it is loading directly into or from hardware buffers.  This probably warrants further investigation.

Aaron

ME
New Contributor I
I withdraw that comment, then. ORing in this case - where the high word is known to be 0 - would work as expected. My attempted point was to not combine these values; MFX_IMPL_VIA_D3D9 | MFX_IMPL_VIA_D3D11 would be incorrect.

One aspect you may not be considering is CPU vs GPU use. It may be faster, and if that is the goal, better. It may be that the goal is to free the CPU for other tasks.

And then there is still the fact that this is demo code. Plus, doing any sort of file I/O with these huge sizes dwarfs the time spent everywhere else.

Premature optimization is ... whatever Abrash said it was.
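To make the bit-layout point concrete, here are the relevant values as I believe they appear in mfxcommon.h (an assumption worth verifying against your own SDK headers):

```cpp
// MFX_IMPL_* values reproduced from mfxcommon.h as I understand them;
// verify against your headers before relying on these numbers. The key
// point is that the VIA_* codes live in the second byte as *sequential
// values*, not as independent bit flags.
enum {
    MFX_IMPL_SOFTWARE  = 0x0001,
    MFX_IMPL_HARDWARE  = 0x0002,
    MFX_IMPL_VIA_ANY   = 0x0100,
    MFX_IMPL_VIA_D3D9  = 0x0200,
    MFX_IMPL_VIA_D3D11 = 0x0300,
};

// ORing a single VIA_* code into a bare impl is fine, because the impl's
// "via" byte is still zero, so the OR behaves like addition:
//   MFX_IMPL_HARDWARE | MFX_IMPL_VIA_D3D9   yields 0x0202
// Combining two VIA_* codes is the real mistake; 0x0200 | 0x0300 is
// 0x0300, which silently means "via D3D11 only":
//   MFX_IMPL_VIA_D3D9 | MFX_IMPL_VIA_D3D11  equals MFX_IMPL_VIA_D3D11
```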
AaronL
Beginner

betlet wrote:

I withdraw that comment, then. ORing in this case - where the high word is known to be 0 - would work as expected. My attempted point was to not combine these values; MFX_IMPL_VIA_D3D9 | MFX_IMPL_VIA_D3D11 would be incorrect.

One aspect you may not be considering is CPU vs GPU use. It may be faster, and if that is the goal, better. It may be that the goal is to free the CPU for other tasks.

And then there is still the fact that this is demo code. Plus, doing any sort of file I/O with these huge sizes dwarfs the time spent everywhere else.

Premature optimization is ... whatever Abrash said it was.

Regarding your comment about file sizes: I think that's mainly a concern the first time sample_encode.exe is run for a particular input file, which is many times slower than subsequent runs because at least some of the data is cached by the OS.  The numbers I reported are from subsequent runs, not the initial run.  Regardless of the options chosen, it has to load the file and write it out (I commented out the writing of the file, and the impact was negligible).  So, by comparing relative times, I can ascertain which options result in a faster color conversion and encode.  I'm using the sample as a proxy to determine which choices make the most sense for the software I've been developing; I currently use all software buffers and software color conversion, and I wanted to get an idea of whether changing would be better.  This gives me a better idea.

ME
New Contributor I
Look at it this way, and stop me where I am wrong, but the demo is doing the YUV to NV12 conversion as a simple byte-move de-interleave. This is naive (straightforward), and surely anything could do it faster. SIMD, for example. The Y plane can stay where it is; deinterleave the UV plane to U and V planes in a half-sized work buffer, and copy it back to the original (less moving, less memory). Even without SIMD it could be faster right there. Then, one could move on to doing half the job in each of two threads (assuming that helps the bottom line). And then do it using SIMD (if it's worth the pain) and really get things moving a lot faster.

Then there's doing it in VPP. I would presume it does the conversion as well as it can. Maybe not as fast as SIMD, but if you are already using video memory, it may be faster overall, plus your CPU is more available. So, basically, somehow, you are seeing the most obviously slow way being faster than the other way you tested.

The formal samples are done by more than one person (for the same program), and look to be the result of adding on until they do all the things anyone could think they ever could. The tutorial samples do things simply. Better to see how best to do things when you only see what you need to.

"Premature optimization..." was apparently Knuth, not Abrash. I think Abrash was the "benchmark it" guy. Measure, measure, measure. 8254 and all.
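The two-thread split suggested above could be sketched like this (a hypothetical scalar version, ignoring surface pitch; measure before assuming it wins):

```cpp
#include <cstdint>
#include <thread>

// Hypothetical sketch of splitting a UV de-interleave across two threads:
// separate an NV12-style interleaved UV plane into planar U and V,
// giving each thread half of the sample pairs. Whether this beats the
// single-threaded loop depends on buffer size and memory bandwidth.
static void DeinterleaveRange(const uint8_t* uv, uint8_t* u, uint8_t* v,
                              size_t begin, size_t end)
{
    for (size_t i = begin; i < end; ++i) {
        u[i] = uv[2 * i];      // even bytes are U
        v[i] = uv[2 * i + 1];  // odd bytes are V
    }
}

void DeinterleaveUvTwoThreads(const uint8_t* uv, uint8_t* u, uint8_t* v,
                              size_t pairCount)
{
    const size_t half = pairCount / 2;
    std::thread worker(DeinterleaveRange, uv, u, v, size_t{0}, half);
    DeinterleaveRange(uv, u, v, half, pairCount);  // second half on this thread
    worker.join();
}
```

A SIMD version would replace the inner loop with wide unpack operations, but as noted above, only benchmarking can say whether either variant helps once file I/O is in the picture.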