Media SDK Hardware Implementation Decoding Slow Down on Ultrabook

Ronny_Bodach · ‎05-28-2012

Hello to all the members,

I integrated the Intel Media SDK into my own application, which only decodes frames and saves them to individual files, and tested it on the new Intel Ultrabook.

I measured performance to see how much the Intel SDK would improve decoding, and the results surprised me.

I tested both the software and the hardware implementation, each with system and D3D memory allocation.

Why was I surprised? The software implementation with system memory is very fast, much faster than the decoding routines I used before.

But if I use the hardware implementation, decoding takes about twice as long as with the software implementation, and it is also much slower than the routine I used before.
I tested the hardware implementation with both system and D3D memory allocation, and with system memory allocation the decoding is faster than with D3D memory.

Finally, I used the software implementation with D3D memory, which is slower than the hardware implementation with system memory.

For D3D memory allocation I adopted the D3D allocator from the SDK sample_decode project.

To rule out side effects from my own application, I also measured decoding with the SDK sample_decode project, which gave the same results as described above.

For the measurement I used GPA trace, so that I could measure the decoding process (Decode_FrameAsync) and the frame reading process separately.

Was I wrong to assume that hardware decoding is faster than software decoding?

Does the hardware implementation only benefit applications that display the results on a graphics device?

I hope someone can explain the underlying implementation to me in a little more detail.

Best regards

Ronny

Petter_L_Intel · ‎05-29-2012

Hi Ronny,

First, what decoder performance are you measuring? MPEG2, VC1 or H.264? H.264 will give you the best acceleration vs. SW.

What resolution is your content? Note that at lower resolutions, the HW acceleration benefits diminish due to the relative CPU-GPU transfer overhead.

If you are using the Media SDK "sample_decode" sample, please note the very large overhead of writing the raw content to file. If you want to assess the true performance, please remove raw file writing (or reading) when you perform benchmarking.

To achieve best performance it is recommended to use D3D surfaces when using HW acceleration and system memory surfaces when using SW. This reduces the overhead of copying between system and GPU memory.
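The pairing described above can be written down as a small lookup. This is only a sketch: the enums `Implementation` and `SurfaceMemory` and both functions are hypothetical stand-ins, not SDK types. In the SDK itself the choice is expressed through the implementation requested at session init (`MFX_IMPL_HARDWARE` / `MFX_IMPL_SOFTWARE`) and the `IOPattern` field of the video parameters (`MFX_IOPATTERN_OUT_VIDEO_MEMORY` / `MFX_IOPATTERN_OUT_SYSTEM_MEMORY`).

```cpp
// Hypothetical stand-ins for the SDK's implementation and surface memory choices.
enum class Implementation { Software, Hardware };
enum class SurfaceMemory { System, D3D };

// Recommended pairing from the post: D3D surfaces for HW, system memory for SW.
constexpr SurfaceMemory recommended_memory(Implementation impl) {
    return impl == Implementation::Hardware ? SurfaceMemory::D3D
                                            : SurfaceMemory::System;
}

// True when a configuration deviates from the recommendation and is therefore
// expected to pay for extra copies between system and GPU memory.
constexpr bool needs_cross_memory_copy(Implementation impl, SurfaceMemory mem) {
    return mem != recommended_memory(impl);
}
```

Of the four combinations tested in this thread, only HW with D3D memory and SW with system memory match the recommendation; the other two are flagged by `needs_cross_memory_copy`.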

So, if you adhere to the above suggestions, you should observe that HW accelerated decode workloads complete faster than if processed by the processor.

Regards,

Petter

Ronny_Bodach · ‎05-29-2012

Hi Petter,

Thanks for your answer.

It is clearly logical to use D3D surfaces for HW acceleration and system memory surfaces for SW.

But I really did measure only the decoding time, using GPA trace points. I inserted one trace mark before DecodeFrameAsync and one after it to analyze the performance. This excludes the other tasks from the measurement.

And that is how I found out that HW acceleration takes twice as long as SW.
I cannot imagine why.
What is interesting is that HW acceleration with system memory is faster than HW acceleration with D3D surfaces.

I used an MPEG stream, which may be too low in resolution. I will check with a higher resolution tomorrow.

Maybe it is a configuration issue? The Ultrabook HD driver or the Intel system driver? (But all are on the latest version.)

Best regards

Ronny

Petter_L_Intel · ‎05-29-2012

Hi Ronny,

Measuring the performance using GPA in the way you describe may give you a very wide spread of results due to the asynchronous nature of the operations. I suggest instead that you decode video content with at least 1000 frames and compare the total time to completion. In this way you'll get a better measurement of throughput (or average fps).
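A minimal sketch of that kind of whole-run measurement is shown below. It assumes a hypothetical callable `decode_next_frame` that decodes one frame (including any synchronization) and returns false at end of stream; the names `measure_throughput` and `ThroughputResult` are made up for illustration and are not part of the SDK or of sample_decode.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Result of one benchmark run.
struct ThroughputResult {
    std::size_t frames;  // frames actually decoded
    double seconds;      // wall-clock time for the whole run
    double fps;          // average frames per second
};

// Times the whole run instead of individual calls.
// `decode_next_frame` is a hypothetical callable: it decodes one frame and
// returns false once the stream is exhausted.
template <typename DecodeFn>
ThroughputResult measure_throughput(DecodeFn decode_next_frame,
                                    std::size_t max_frames) {
    assert(max_frames > 0);
    using clock = std::chrono::steady_clock;

    const auto start = clock::now();
    std::size_t frames = 0;
    while (frames < max_frames && decode_next_frame()) {
        ++frames;
    }
    const std::chrono::duration<double> elapsed = clock::now() - start;

    const double seconds = elapsed.count();
    const double fps = seconds > 0.0 ? frames / seconds : 0.0;
    return {frames, seconds, fps};
}
```

Running this once for the HW path and once for the SW path on the same stream, with file writing disabled, gives two average fps values that can be compared directly.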

I quickly ran a similar workload using slightly modified sample_decode (with file writing omitted), decoding a 1080p MPEG2 stream. HW decode is about 1.5-2x faster vs. SW on the machine I used.
For decode of lower resolution content you will see that HW speedup over SW decreases due to the overhead, as stated earlier. It is likely that you will measure SW decode to be faster than HW decode for low resolution content.

Regards,

Petter

Ronny_Bodach · ‎06-05-2012

Thanks Petter,

I was able to test again with HD video files and measured the total time for the sample_decode project on the Ultrabook. You're right: hardware acceleration can be about twice as fast as the software implementation.

Unfortunately, the Decode_FrameAsync call itself is still slower with hardware than with software. But since the processing around this call is that much faster, the goal was reached.

It all comes down to how the measurement and the measurement markers are set :-)
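The effect of marker placement can be illustrated with a small stand-in for an asynchronous call. This sketch does not use the Media SDK: `std::async` plus a sleep simulates work that finishes after the submitting call has returned, and `time_async_call` and `AsyncTiming` are hypothetical names. In Media SDK terms, the waiting step would correspond to a synchronization call such as `MFXVideoCORE_SyncOperation`.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <thread>

struct AsyncTiming {
    double submit_ms;  // time spent inside the submitting call only
    double total_ms;   // time from submission until the result is available
};

// Stand-in for an asynchronous decode call: the work (simulated by sleeping
// for `work_ms`) runs on another thread while the submitting call returns.
inline AsyncTiming time_async_call(int work_ms) {
    assert(work_ms >= 0);
    using clock = std::chrono::steady_clock;
    using ms = std::chrono::duration<double, std::milli>;

    const auto t0 = clock::now();
    std::future<int> pending = std::async(std::launch::async, [work_ms] {
        std::this_thread::sleep_for(std::chrono::milliseconds(work_ms));
        return work_ms;
    });
    const auto t1 = clock::now();  // marker right after the async call
    pending.get();                 // wait for the result (the sync point)
    const auto t2 = clock::now();  // marker after synchronization

    return {ms(t1 - t0).count(), ms(t2 - t0).count()};
}
```

Markers placed only around the submitting call capture `submit_ms`, which says little about how long the frame really took; only `total_ms`, or a whole-run measurement, reflects the throughput.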

Best regards

Ronny