CPU load growth with D3D memory

itv-axxon · ‎06-18-2013

Hi,

I am integrating Media SDK into a video surveillance system.
The idea is to use HD Graphics for decoding video channels. I decided to start with the Media SDK sample_decode sample.
Our system design requires the decoded frame to reside in system memory.
So, initially, I used SYSTEM_MEMORY for decoded surfaces.
While benchmarking, I feed H.264 frames at a constant frame rate.
What do I see? Media SDK with the hardware implementation has a higher CPU load than the FFmpeg H.264 decoder.
Both decoders make a single copy of the output frame from the decoder buffer to the target buffer.
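
To make the single copy mentioned above concrete, here is a minimal sketch of what such a copy might look like. It assumes the decoded frame is NV12 (a luma plane followed by an interleaved chroma plane) and that source rows are padded to a pitch wider than the visible width. The function name and parameters are hypothetical and not taken from sample_decode.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Copies one NV12 frame from a pitched source into a tightly packed
// destination buffer. All names here are illustrative assumptions.
//   src_y, src_uv : start of the luma plane and the interleaved UV plane
//   pitch         : bytes between consecutive source rows (pitch >= width)
//   dst           : must hold width * height * 3 / 2 bytes
void copy_nv12_packed(const std::uint8_t* src_y, const std::uint8_t* src_uv,
                      std::size_t pitch, std::size_t width, std::size_t height,
                      std::uint8_t* dst) {
    assert(pitch >= width);
    assert(width % 2 == 0 && height % 2 == 0);

    // Luma: 'height' rows of 'width' bytes each, skipping the row padding.
    for (std::size_t row = 0; row < height; ++row) {
        std::memcpy(dst, src_y + row * pitch, width);
        dst += width;
    }
    // Chroma: 'height / 2' rows; U and V are interleaved, so each row is
    // again 'width' bytes wide.
    for (std::size_t row = 0; row < height / 2; ++row) {
        std::memcpy(dst, src_uv + row * pitch, width);
        dst += width;
    }
}
```

When the pitch equals the width, the row loops could be collapsed into one memcpy per plane; the row-by-row form is kept here because padded surfaces are the general case.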

Then I found that maximum performance can be achieved with D3D memory.
So I switched the memory to D3D9_MEMORY, and the CPU load ... roughly doubled :(

All allocators are the original ones from the sample_decode sample.
Why has the CPU load grown? What could be wrong?

Another observation seems very strange to me:
Initially I feed compressed frames to the decoder until the first decoded frame is received at the output.
Then I stop feeding frames: I just drop the data instead of feeding it, without closing the decoding session.
The CPU load remains high, even higher than while decoding. So decoding is less expensive than idling.

How is that possible?

Yaroslav.

Petter_L_Intel · ‎06-18-2013

Hi Yaroslav,

How are you using the sample_decode sample project? By default, the sample code writes the decoded frames to disk, which naturally has a large impact on performance and CPU load. To assess true HW acceleration performance, you need to eliminate writing raw frames to disk. For the decode case there is also a sample in the Media SDK tutorial (http://software.intel.com/en-us/articles/intel-media-sdk-tutorial) which provides an easy way to benchmark by disabling output.

If you still see high CPU load even after eliminating raw data writes, then I suspect the issue is that your workload is not HW accelerated. If so, please make sure you have a recent Intel graphics driver installed and that your system supports HW acceleration.

Regards,
Petter

itv-axxon · ‎06-18-2013

1. "Based on sample_decode" means that I use the allocators from there.
2. Data output to file is off. The decoder is integrated into our system, which has its own decoded stream handling.
I compare HD Graphics with FFmpeg in the same system.
3. I built the binary on one machine but run it on another one. So the software implementation is disabled.
I attach the mediasdk_sys_analyzer report, which confirms it.

Petter_L_Intel · ‎06-18-2013

Hi Yaroslav,

Thanks for providing details. As you say, it seems that HW acceleration is working fine.

I suspect there is something else impacting the load, such as memory copy or color space conversion overhead.

Is the SW (ffmpeg) pipeline identical to the Media SDK HW pipeline with regards to the memory copies, color space conversions etc.?

Keep in mind that HW acceleration always implies an overhead vs. the SW path when your surfaces reside in system memory, due to the required copy between system memory and the D3D surface used for HW processing (performed either explicitly by the application or implicitly via Media SDK).

So depending on the pipeline configuration, if the frame resolution is small, it may be more efficient to execute decode operations using a SW codec. However, for multi-channel usage and for a low power footprint, it may still be beneficial to execute on HW.
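
To get a feel for how much data such a copy moves, a rough estimate can be sketched as below. It assumes tightly packed NV12 frames (12 bits per pixel) and one copy per decoded frame. The function names are made up for illustration, and the resolution and frame rate in the usage note are assumed values.

```cpp
#include <cassert>
#include <cstdint>

// Bytes in one tightly packed NV12 frame: a full-resolution luma plane plus
// an interleaved chroma plane of half that size (assumes even dimensions).
std::uint64_t nv12_frame_bytes(std::uint64_t width, std::uint64_t height) {
    assert(width % 2 == 0 && height % 2 == 0);
    return width * height * 3 / 2;
}

// Bytes per second moved when every decoded frame on every channel is
// copied exactly once.
std::uint64_t copy_bytes_per_second(std::uint64_t width, std::uint64_t height,
                                    std::uint64_t fps, std::uint64_t channels) {
    return nv12_frame_bytes(width, height) * fps * channels;
}
```

For example, with an assumed 1920x1080 stream at 25 fps, one frame is 3,110,400 bytes, and 9 channels add up to 699,840,000 bytes per second for a single copy per frame. Real surfaces are usually padded, so actual figures would be somewhat higher.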

Regards,
Petter

itv-axxon · ‎06-18-2013

Petter,

Thank you for the responses.
Now I can ask my questions differently.

1. Taking into account that I need the decoded frame to be placed in system memory (without displaying it), does it make any sense
to decode video into D3D memory?
Comparing decoding into system memory against decoding into D3D memory: in which memory will the decoded frame appear earlier?
2. I still wonder why the CPU load increases if I stop feeding compressed data to the decoder. I configured the surveillance system for 9 channels and monitor CPU load with Process Explorer.
When I feed compressed data at a constant frame rate, the CPU load is about 35%.
If I stop feeding data after the first frame is decoded, the CPU load increases to 65%.
Do you have any idea why this can happen?

Petter_L_Intel · ‎06-19-2013

Hi Yaroslav,

1. If you require decoded surfaces in sys mem, then you should configure Media SDK so that the output surfaces are of sys mem type (Media SDK will handle the required D3D->sysmem copy internally).

2. Not sure. The only thing I can think of is the impact of the pipeline draining stage, where frames residing in the decoder are flushed at the end of a workload (as illustrated in the Media SDK samples). But that would not explain a greater load on your processor compared to the previous pipeline stage. I suggest you explore your implementation, making sure there are no undesired computations at this stage.
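
One pattern worth ruling out when checking for undesired computations is a feeding loop that keeps polling for input when none is available. Nothing in this thread confirms that this is the cause here, so the following is only a sketch of the alternative: a queue whose consumer blocks until data arrives. `Packet`, `PacketQueue`, and `consume_all` are hypothetical names, not Media SDK types.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical container for one compressed frame; not a Media SDK type.
using Packet = std::vector<std::uint8_t>;

class PacketQueue {
public:
    // Producer side: enqueue one packet and wake a waiting consumer.
    void push(Packet p) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            assert(!closed_);
            queue_.push_back(std::move(p));
        }
        ready_.notify_one();
    }

    // Signal that no more packets will arrive.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
    }

    // Consumer side: blocks without spinning until a packet is available
    // or the queue has been closed. Returns false once closed and empty.
    bool pop(Packet& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !queue_.empty() || closed_; });
        if (queue_.empty()) return false;
        out = std::move(queue_.front());
        queue_.pop_front();
        return true;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::deque<Packet> queue_;
    bool closed_ = false;
};

// Drains the queue and returns the total number of bytes seen. A real
// decode loop would hand each packet to the decoder at this point.
std::size_t consume_all(PacketQueue& q) {
    std::size_t total = 0;
    Packet p;
    while (q.pop(p)) total += p.size();
    return total;
}
```

With this shape, a channel that receives no input leaves its decode thread asleep inside `pop` rather than looping, which is one way to check whether idle load comes from the feeding logic or from somewhere else.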

Regards,
Petter