Software implementation (without "-hw"): 1497 fps.
Hardware implementation (with "-hw"): 556 fps.
My question is: why is the performance with hardware acceleration slower than with the software implementation?
Thank you in advance!
Actually, the test results for both settings with system memory are very fast, because the measurement counts only the decoding functions (DecodeFrameAsync and m_mfxSession.SyncOperation(syncp, DEC_WAIT_INTERVAL)) and excludes any stream read/write functions. It tests only the performance of the decoding functions.
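The measurement described above (timing only the decode calls, not the file I/O around them) can be sketched roughly as follows. This is a minimal illustration, not the sample's actual code: `DecodeTimer`, `time_call`, and `ms_per_frame` are hypothetical names, and the callable passed to `time_call` is assumed to wrap the DecodeFrameAsync and SyncOperation calls.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Accumulates wall-clock time spent inside decode calls only, so stream
// reads and writes between the calls do not count toward the result.
class DecodeTimer {
public:
    using Clock = std::chrono::steady_clock;

    // Runs one decode step (a hypothetical callable) and adds its duration.
    template <typename Fn>
    void time_call(Fn&& decode_step) {
        const Clock::time_point start = Clock::now();
        decode_step();
        add(Clock::now() - start);
    }

    // Adds an already measured duration for one frame.
    void add(Clock::duration d) {
        total_ += d;
        ++frames_;
    }

    std::uint64_t frames() const { return frames_; }

    // Frames per second over the accumulated decode time (0 if none recorded).
    double fps() const {
        const double seconds = std::chrono::duration<double>(total_).count();
        return seconds > 0.0 ? static_cast<double>(frames_) / seconds : 0.0;
    }

private:
    Clock::duration total_{Clock::duration::zero()};
    std::uint64_t frames_{0};
};

// Converts a frame rate to the average time per frame in milliseconds.
inline double ms_per_frame(double fps) {
    assert(fps > 0.0);
    return 1000.0 / fps;
}
```

By this conversion, 1497 fps is about 0.67 ms per frame and 556 fps is about 1.80 ms per frame, so the two configurations differ by roughly 1.1 ms per frame.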
With HW acceleration and system memory the decoder performance drops somewhat, I think, because the decoder needs to copy the decoded data from GPU memory to system memory. The memory copy operation may affect the performance. Am I right?
I also tested the decoder with both the hw and d3d options; the performance is far faster than with any other option.
Because our decoder application has to use system memory and HW acceleration (it cannot use D3D surfaces), we still need to improve the performance. Any suggestions?
Sorry for the delay. I was actually writing the response to the original post as you posted the 2nd message. To answer your question: yes, the decoder must copy the data from the GPU back to system memory. There will be an inherent performance penalty when not using D3D surfaces.
Thanks for your answer. How does the decoder with HW acceleration copy data from GPU to system memory (one copy of the whole frame's data, or many copies of macroblock data)?
Thank you in advance!
I'm considering adding support for DXVA decoding to my DirectShow video renderer. Due to quality concerns, I need to do the chroma upsampling and color conversion myself, though. Because of that, I fear I have to transfer the decoded data from GPU memory back to system memory. I've done some benchmarks and found that the GPU RAM -> system RAM transfer runs at only about 5 fps (H55) up to 20 fps (G45) for 1080p content. I did the transfer with a simple LockRect(READ_ONLY) on the DXVA NV12 D3D surface. Now I'm wondering whether the Media SDK uses a faster method for the GPU RAM -> system RAM transfer. If so, how fast can the Media SDK transfer decoded 1080p frames with Intel GPUs? And is the Media SDK transfer method available for use outside of the Media SDK, too?
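For reference, the kind of read-back described above can be sketched as a row-by-row copy from a pitched NV12 surface into a tightly packed system memory buffer. This is only an illustration under assumptions: `copy_nv12_to_packed` is a hypothetical helper, `src` and `src_pitch` stand in for the `pBits` and `Pitch` fields of the `D3DLOCKED_RECT` returned by LockRect, and the chroma plane is assumed to start directly after `height` rows of luma (some drivers pad the luma plane to an aligned height). It takes plain pointers so it can be compiled without Direct3D, and it does not by itself make the transfer any faster.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Copies an NV12 frame from a pitched source (such as a locked surface)
// into a tightly packed destination of width * height * 3 / 2 bytes.
// Assumes even width and height, and that the interleaved UV plane starts
// right after `height` rows of luma in the source.
inline void copy_nv12_to_packed(const std::uint8_t* src, std::size_t src_pitch,
                                std::size_t width, std::size_t height,
                                std::uint8_t* dst) {
    assert(src != nullptr && dst != nullptr);
    assert(src_pitch >= width && width % 2 == 0 && height % 2 == 0);

    // Luma plane: `height` rows of `width` bytes each.
    for (std::size_t y = 0; y < height; ++y) {
        std::memcpy(dst + y * width, src + y * src_pitch, width);
    }

    // Chroma plane: `height / 2` rows of interleaved U and V samples,
    // `width` bytes per row.
    const std::uint8_t* src_uv = src + height * src_pitch;
    std::uint8_t* dst_uv = dst + height * width;
    for (std::size_t y = 0; y < height / 2; ++y) {
        std::memcpy(dst_uv + y * width, src_uv + y * src_pitch, width);
    }
}
```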
(Just for your information: DxvaNv12Surface.LockRect(READ_ONLY) is painfully slow with ATI GPUs, too, while I get up to 600fps with NVidia GPUs. Weird stuff...)
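To put those frame rates in terms of memory throughput, here is a small back-of-the-envelope calculation. It assumes that "1080p" means 1920x1080 NV12 with no row or height padding; `nv12_frame_bytes` and `nv12_throughput_mb_per_s` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdint>

// Bytes in one tightly packed NV12 frame: a full-resolution luma plane plus
// a half-height interleaved chroma plane (12 bits per pixel, no padding).
constexpr std::uint64_t nv12_frame_bytes(std::uint64_t width, std::uint64_t height) {
    return width * height * 3 / 2;
}

// Sustained read throughput in megabytes per second (1 MB = 10^6 bytes)
// needed to transfer `fps` such frames every second.
inline double nv12_throughput_mb_per_s(std::uint64_t width, std::uint64_t height,
                                       double fps) {
    assert(fps >= 0.0);
    return static_cast<double>(nv12_frame_bytes(width, height)) * fps / 1e6;
}
```

Under those assumptions one frame is 3,110,400 bytes, so 5 fps corresponds to about 15.6 MB/s, 20 fps to about 62.2 MB/s, and 600 fps to about 1866 MB/s.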
If this is true, is any postprocessing in the transcoding pipeline implemented as a GPU kernel with zero buffer copies (or near zero :)?
And if it is true, how can I load my own postprocessing GPU kernels into this pipeline with the same zero-copy approach?
I have been testing memcpy() from video memory to system memory after H.264 decoding (1080p).
However, CPU usage varies greatly with the number of streams, as follows (Windows 7 32-bit, i5-2400).
With 1~5 streams, CPU usage is 1~2%.
With more than 6 streams, CPU usage is 90%. (I think HW decoding changes to SW decoding.)
I saw your replies saying that CopyFrame and CopyBuffer use less CPU than memcpy().
CopyFrame and CopyBuffer are accessed through the mfxCoreInterface of an mfxPlugin.
How can I use CopyFrame and CopyBuffer without an mfxPlugin?
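For comparison with the plain memcpy() being tested, here is a rough sketch of one general x86 technique that is often described for reading from write-combined memory: SSE4.1 streaming loads. This is not a Media SDK API and nothing in this thread says it is what CopyFrame or CopyBuffer do internally. `copy_with_streaming_loads` is a hypothetical name, and whether the locked video memory is actually write-combined on a given driver is an assumption. The code falls back to memcpy() when SSE4.1 is not enabled at compile time or the source is not 16-byte aligned.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#if defined(__SSE4_1__)
#include <smmintrin.h>  // _mm_stream_load_si128 (SSE4.1)
#endif

// Copies `size` bytes from `src` to `dst`. When built with SSE4.1 enabled
// and `src` is 16-byte aligned, full 16-byte blocks are read with streaming
// loads (MOVNTDQA); everything else falls back to std::memcpy.
inline void copy_with_streaming_loads(void* dst, const void* src, std::size_t size) {
    assert(dst != nullptr && src != nullptr);
    std::uint8_t* d = static_cast<std::uint8_t*>(dst);
    const std::uint8_t* s = static_cast<const std::uint8_t*>(src);
    std::size_t done = 0;
#if defined(__SSE4_1__)
    if (reinterpret_cast<std::uintptr_t>(s) % 16 == 0) {
        for (; done + 16 <= size; done += 16) {
            // Older headers declare the intrinsic with a non-const pointer.
            __m128i* p = reinterpret_cast<__m128i*>(const_cast<std::uint8_t*>(s + done));
            const __m128i block = _mm_stream_load_si128(p);
            // Unaligned store, so the destination needs no special alignment.
            _mm_storeu_si128(reinterpret_cast<__m128i*>(d + done), block);
        }
    }
#endif
    // Remaining tail bytes, or the whole buffer on the fallback path.
    std::memcpy(d + done, s + done, size - done);
}
```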