Hello to all members,
I integrated the Intel Media SDK into my own application, which only decodes frames and saves them to individual files, and tested it on the new Intel Ultrabook.
I measured the performance to see how the Media SDK would improve decoding.
The results surprised me.
I tested both the software and the hardware implementation, each with system memory and with D3D memory allocation.
Why was I surprised? The software implementation with system memory is very fast, much faster than the decoding routines I used before.
But with the hardware implementation, decoding takes about twice as long as with the software implementation and is also much slower than my previous routine.
I tested the hardware implementation with both system and D3D memory allocation, and with system memory the decoding is faster than with D3D memory.
Finally, I tried the software implementation with D3D memory, which is slower than the hardware implementation with system memory.
For the D3D memory allocation I adopted the D3D allocator from the SDK sample_decode project.
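To make the four combinations clearer: the implementation is chosen with the flag passed at session init, the surface type with the decoder's IOPattern. Here is a rough sketch with the plain C API (not my exact code; error handling and the D3D allocator setup are left out, and the helper function is only for illustration):

#include "mfxvideo.h"   // Intel Media SDK C API
#include <string.h>     // memset

// Rough sketch: pick SW/HW implementation and system/D3D output surfaces.
// Not my exact code -- only the relevant init calls, error handling trimmed.
mfxStatus InitDecode(mfxSession* session, mfxVideoParam* par,
                     mfxBitstream* bs, bool useHardware, bool useD3DMemory)
{
    mfxIMPL    impl = useHardware ? MFX_IMPL_HARDWARE : MFX_IMPL_SOFTWARE;
    mfxVersion ver  = { {0, 1} };                      // request API 1.0 or later

    mfxStatus sts = MFXInit(impl, &ver, session);      // SW vs. HW library
    if (sts != MFX_ERR_NONE) return sts;

    memset(par, 0, sizeof(*par));
    par->mfx.CodecId = MFX_CODEC_MPEG2;                // my MPEG test stream
    sts = MFXVideoDECODE_DecodeHeader(*session, bs, par);
    if (sts != MFX_ERR_NONE) return sts;

    // System memory vs. D3D surfaces is selected here. For video memory the
    // D3D device handle and the sample_decode D3D allocator are additionally
    // attached via MFXVideoCORE_SetHandle / MFXVideoCORE_SetFrameAllocator.
    par->IOPattern = useD3DMemory ? MFX_IOPATTERN_OUT_VIDEO_MEMORY
                                  : MFX_IOPATTERN_OUT_SYSTEM_MEMORY;

    return MFXVideoDECODE_Init(*session, par);
}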
To rule out effects from my own application, I also measured decoding with the SDK sample_decode project, which produced the same results as described above.
For the measurement I used GPA tracing, so that the whole decoding call (DecodeFrameAsync) and the frame reading are measured separately.
Am I wrong in assuming that hardware-accelerated decoding is faster than software decoding?
Does the hardware implementation only benefit applications that display the results on the graphics device?
I hope someone can explain the underlying implementations to me a little more.
Best regards
Ronny
4 Replies
Hi Ronny,
First, which decoder's performance are you measuring? MPEG2, VC1 or H.264? H.264 will give you the best acceleration vs. SW. What resolution is your content? Note that at lower resolutions, the HW acceleration benefits diminish due to the relative CPU-GPU transfer overhead.
If you are using the Media SDK "sample_decode" sample, please note the very large overhead of writing the raw content to file. If you want to assess the true performance, please remove raw file writing (or reading) when you benchmark.
To achieve the best performance it is recommended to use D3D surfaces when using HW acceleration and system memory surfaces when using SW, as in the sketch below. This reduces the overhead of copying between system and GPU memory.
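In code this pairing boils down to the decoder IOPattern and, for the D3D case, attaching the device handle and an external allocator before decoder Init. A rough sketch (the helper function and variable names are just for illustration; the allocator stands in for the D3D allocator used in sample_decode):

#include "mfxvideo.h"

// Sketch of the recommended pairing:
// HW implementation -> D3D surfaces, SW implementation -> system memory.
void ConfigureSurfaces(mfxSession session, mfxVideoParam& par, bool useHardware,
                       mfxHDL d3d9DeviceManager,        // IDirect3DDeviceManager9*
                       mfxFrameAllocator* d3dAllocator) // e.g. sample_decode's D3D allocator
{
    if (useHardware) {
        par.IOPattern = MFX_IOPATTERN_OUT_VIDEO_MEMORY;   // decode into D3D surfaces
        MFXVideoCORE_SetHandle(session, MFX_HANDLE_D3D9_DEVICE_MANAGER, d3d9DeviceManager);
        MFXVideoCORE_SetFrameAllocator(session, d3dAllocator);
    } else {
        par.IOPattern = MFX_IOPATTERN_OUT_SYSTEM_MEMORY;  // decode into system memory
    }
}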
So, if you adhere to the above suggestions, you should observe that HW-accelerated decode workloads complete faster than when processed on the CPU.
Regards,
Petter
Hi Petter,
Thanks for your answer.
It is clearly logical to use D3D surfaces for HW acceleration and system memory surfaces for SW.
But I really measured only the decoding time with GPA trace points. I inserted trace marks before and after DecodeFrameAsync to analyze the performance, so all other tasks are excluded (roughly as in the sketch below).
And this way I found that HW acceleration is twice as slow as SW.
I cannot imagine why.
The interesting thing is that using HW acceleration with system memory is faster than using HW acceleration with D3D surfaces.
I used an MPEG stream, which may be too low in resolution. I will check with a higher resolution tomorrow.
Maybe it is a configuration issue? Ultrabook HD graphics driver or Intel system driver? (But everything is on the latest version.)
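The trace marks are placed roughly like this (a simplified sketch using the ITT task API that GPA picks up; the domain name, task name and helper function are just examples, not my exact code):

#include <ittnotify.h>   // ITT API, consumed by GPA
#include "mfxvideo.h"

// Bracket only the decode call with ITT task marks, so GPA shows
// DecodeFrameAsync separately from file reading and writing.
static __itt_domain*        g_domain   = __itt_domain_create("MyDecoder");
static __itt_string_handle* g_taskName = __itt_string_handle_create("DecodeFrameAsync");

mfxStatus DecodeOneFrame(mfxSession session, mfxBitstream* bs,
                         mfxFrameSurface1* work, mfxFrameSurface1** out,
                         mfxSyncPoint* syncp)
{
    __itt_task_begin(g_domain, __itt_null, __itt_null, g_taskName);
    mfxStatus sts = MFXVideoDECODE_DecodeFrameAsync(session, bs, work, out, syncp);
    __itt_task_end(g_domain);
    return sts;
}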
Best regards
Ronny
Hi Ronny,
Measuring the performance with GPA in the way you describe may give you a very wide spread due to the asynchronous nature of the operations. I suggest instead decoding video content with at least 1000 frames and comparing the total time to completion. That way you get a better measurement of throughput (or average fps), e.g. along the lines of the sketch below.
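Something like this (a simplified sketch; the helper name and the 60 s sync timeout are just examples, and the surface pool management and MFX_ERR_MORE_DATA / MFX_ERR_MORE_SURFACE handling from sample_decode are omitted):

#include <chrono>
#include <cstdio>
#include "mfxvideo.h"

// Time the whole run of N frames and report average fps, instead of
// timing individual DecodeFrameAsync calls.
double MeasureAverageFps(mfxSession session, mfxBitstream* bs,
                         mfxFrameSurface1* workSurface, int framesToDecode)
{
    using clock = std::chrono::steady_clock;
    clock::time_point start = clock::now();

    int decoded = 0;
    while (decoded < framesToDecode) {
        mfxFrameSurface1* out   = nullptr;
        mfxSyncPoint      syncp = nullptr;
        mfxStatus sts = MFXVideoDECODE_DecodeFrameAsync(session, bs, workSurface, &out, &syncp);

        if (sts == MFX_ERR_NONE && syncp) {
            MFXVideoCORE_SyncOperation(session, syncp, 60000);  // wait for this frame
            ++decoded;
        } else if (sts == MFX_ERR_MORE_DATA || sts == MFX_ERR_MORE_SURFACE) {
            break;  // real code refills the bitstream / picks a free surface here
        } else if (sts < MFX_ERR_NONE) {
            break;  // decode error
        }
    }

    double seconds = std::chrono::duration<double>(clock::now() - start).count();
    double fps = (seconds > 0.0) ? decoded / seconds : 0.0;
    std::printf("Decoded %d frames in %.2f s -> %.1f fps\n", decoded, seconds, fps);
    return fps;
}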
I quickly ran a similar workload using a slightly modified sample_decode (with file writing omitted), decoding a 1080p MPEG2 stream. HW decode is about 1.5-2x faster vs. SW on the machine I used.
For decoding of lower-resolution content you will see that the HW speedup over SW decreases due to the overhead, as stated earlier. It is likely that you will measure SW decode to be faster than HW decode for low-resolution content.
Regards,
Petter
Thanks, Petter,
I was able to test again with HD video files and measured the total time for the sample_decode project on the Ultrabook, and you are right: hardware acceleration is roughly twice as fast as the software implementation.
But the decoding call DecodeFrameAsync itself is still slower with hardware than with software acceleration. Since the processing around this call is that much faster, though, the goal is reached.
It all comes down to how the measurement and the measurement markers are set. :-)
Best regards
Ronny
