HW acceleration slower than software implementation for H.264 decoder on Sandy Bridge system

lshang_2000
Novice
984 Views
I use sample_decode.exe to decode H.264 (720p) with and without the "-hw" option on a Sandy Bridge system (3 GHz, 8 GB RAM, Windows 7 64-bit). The test uses system memory (not D3D surfaces). The results are as follows:
Software implementation (without "-hw"): 1497 fps.
Hardware implementation (with "-hw"): 556 fps.
My question is: why is the performance with hardware acceleration slower than with the software implementation?
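For reference, the invocations were roughly as follows (the input file name is a placeholder, and the exact option spelling may differ between Media SDK versions):

sample_decode.exe h264 -i input_720p.h264
sample_decode.exe h264 -i input_720p.h264 -hw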

Thank you in advance!

Regards
lshang
0 Kudos
11 Replies
lshang_2000
Novice
984 Views
No answers yet. Maybe the question was not very clear.

Actually, the test results of both settings with system memory are very fast, because the results are based on counting only the decoding functions (DecodeFrameAsync and m_mfxSession.SyncOperation(syncp, DEC_WAIT_INTERVAL)), and do not include any stream read/write functions. The test measures only the performance of the decoding functions, as in the sketch below.
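The timed region is essentially the following (a rough sketch, assuming an already initialized MFXVideoSession and MFXVideoDECODE plus a pool of frame surfaces; GetFreeSurface is a placeholder helper, not an SDK call, and 60000 ms stands in for DEC_WAIT_INTERVAL):

#include "mfxvideo++.h"
#include <chrono>

// Placeholder helper: pick a surface the decoder is not currently using.
static mfxFrameSurface1* GetFreeSurface(mfxFrameSurface1* pool, int n)
{
    for (int i = 0; i < n; ++i)
        if (pool[i].Data.Locked == 0)
            return &pool[i];
    return nullptr;
}

// Measures decode throughput only: DecodeFrameAsync + SyncOperation,
// no bitstream reading or output writing inside the timed region.
double MeasureDecodeFps(MFXVideoSession& session, MFXVideoDECODE& decoder,
                        mfxBitstream& bs, mfxFrameSurface1* surfaces, int numSurfaces)
{
    int decoded = 0;
    auto start = std::chrono::steady_clock::now();
    for (;;) {
        mfxFrameSurface1* out = nullptr;
        mfxSyncPoint syncp = nullptr;
        mfxStatus sts = decoder.DecodeFrameAsync(&bs, GetFreeSurface(surfaces, numSurfaces),
                                                 &out, &syncp);
        if (sts == MFX_WRN_DEVICE_BUSY || sts == MFX_ERR_MORE_SURFACE)
            continue;                              // retry (real code would sleep briefly)
        if (sts == MFX_ERR_MORE_DATA || sts < MFX_ERR_NONE)
            break;                                 // bitstream exhausted or a real error
        if (syncp) {
            session.SyncOperation(syncp, 60000);   // wait for the decoded frame to be ready
            ++decoded;
        }
    }
    double seconds = std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
    return decoded / seconds;
}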

Using HW acceleration with system memory, the decoder performance drops a little, I think, because the decoder needs to copy the decoded data from GPU memory to system memory. That memory copy operation may affect the performance. Am I right?

I also tested the decoder with both the hw and d3d options; the performance is far faster than with any other options.

Because our decoder application has to use system memory with HW acceleration (it cannot use D3D surfaces), we still need to improve the performance. Any suggestions?

Thanks,
lshang

0 Kudos
Eric_S_Intel
Employee
984 Views

Hi,

Sorry for the delay. I was actually writing the response to your original post as you posted the second message. To answer your question: yes, the decoder must copy the data from the GPU back to system memory. There is an inherent performance penalty when not using D3D surfaces.
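To illustrate, the difference comes down to the IOPattern chosen when the decoder is initialized, roughly along these lines (a sketch, not code from the SDK samples):

#include "mfxvideo++.h"

// Hypothetical helper showing the two output paths discussed above.
// With video memory the decoded frames stay on the GPU (fast path);
// with system memory the SDK copies every decoded frame back to the CPU side.
mfxStatus InitDecoder(MFXVideoDECODE& decoder, mfxVideoParam& par, bool useD3DSurfaces)
{
    par.IOPattern = useD3DSurfaces ? MFX_IOPATTERN_OUT_VIDEO_MEMORY
                                   : MFX_IOPATTERN_OUT_SYSTEM_MEMORY;
    return decoder.Init(&par);
}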

Eric

0 Kudos
lshang_2000
Novice
984 Views
Hi,

Thanks for your answer. How does the decoder with HW acceleration copy data from GPU to system memory: one copy of the whole frame at a time, or many copies of macroblock-sized data?

Thank you in advance!

lshang
0 Kudos
Eric_S_Intel
Employee
984 Views

You're welcome! The Media SDK always copies complete frames when copying data. Hope this helps.

-Eric

0 Kudos
lshang_2000
Novice
984 Views
Thanks.

lshang
0 Kudos
madshi_net
Beginner
984 Views
Hi Eric,

I'm considering adding support for DXVA decoding to my DirectShow video renderer. Due to quality concerns, I need to do the chroma upsampling and color conversion myself, though. Because of that I fear I have to transfer the decoded data from GPU memory back to system memory. I've done some benchmarks and found that the GPU RAM -> system RAM transfer runs at only about 5 fps (H55) up to 20 fps (G45) for 1080p content. I've done the transfer with a simple LockRect(READ_ONLY) on the DXVA NV12 D3D surface, as sketched below. Now I'm wondering: does the Media SDK use a faster method for the GPU RAM -> system RAM transfer? If so, how fast can the Media SDK transfer decoded 1080p frames with Intel GPUs? And is the Media SDK transfer method available for use outside of the Media SDK, too?
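The readback I benchmarked is essentially the following (a sketch only; the surface pointer and frame dimensions are placeholders):

#include <d3d9.h>
#include <cstring>
#include <vector>

// Read an NV12 DXVA surface back to system memory via LockRect(READ_ONLY).
// This is the slow path measured above.
bool ReadBackNV12(IDirect3DSurface9* surface, int width, int height,
                  std::vector<unsigned char>& out)
{
    D3DLOCKED_RECT lr;
    if (FAILED(surface->LockRect(&lr, nullptr, D3DLOCK_READONLY)))
        return false;

    out.resize(width * height * 3 / 2);                // NV12: Y plane + interleaved UV plane
    const unsigned char* src = static_cast<const unsigned char*>(lr.pBits);
    unsigned char* dst = out.data();
    for (int row = 0; row < height + height / 2; ++row) {
        std::memcpy(dst, src, width);                  // copy one row, honoring the surface pitch
        src += lr.Pitch;
        dst += width;
    }
    surface->UnlockRect();
    return true;
}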

(Just for your information: DxvaNv12Surface.LockRect(READ_ONLY) is painfully slow with ATI GPUs, too, while I get up to 600fps with NVidia GPUs. Weird stuff...)

Thanks, Mathias.
0 Kudos
Nina_K_Intel
Employee
984 Views
Hi Mathias,

Intel Media SDK does use its own accelerated method for GPU RAM -> system RAM transfer on Intel architectures. This method is accessible through the mfxCoreInterface::CopyFrame function, which is part of the API extension for user plugins. Please check out mediasdkusr_man.pdf under \doc for details on how to get access to and use this function.
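Roughly, a user plugin gets hold of that interface and calls CopyFrame like this (a sketch; please verify the exact signatures against mediasdkusr_man.pdf):

#include "mfxplugin.h"   // user-plugin API: mfxPlugin, mfxCoreInterface

// Keep the core interface handed to the plugin at initialization time.
static mfxCoreInterface g_core;

static mfxStatus MFX_CDECL MyPluginInit(mfxHDL /*pthis*/, mfxCoreInterface* core)
{
    g_core = *core;          // cache the SDK's function table for later use
    return MFX_ERR_NONE;
}

// Copy one decoded frame from a video-memory surface to a system-memory surface
// using the SDK's accelerated path instead of a plain memcpy.
static mfxStatus CopyDecodedFrame(mfxFrameSurface1* sysSurface, mfxFrameSurface1* videoSurface)
{
    return g_core.CopyFrame(g_core.pthis, sysSurface, videoSurface);
}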

As for performance figures, they may vary, so I suggest you benchmark on your particular system. All I can say is that it will be much faster than LockRect.
Best regards,
Nina
0 Kudos
madshi_net
Beginner
984 Views
Thank you, Nina, that's very helpful! :)
0 Kudos
aegor
Beginner
984 Views
Nina, if I understand correctly, HW enc/dec on SNB exists in a form similar to HW enc/dec on the Atom 6xx family, i.e. as a dedicated HW core, such as the PowerVR VXE IP?
If that is true, is any post-processing in the transcoding pipeline realized as a GPU kernel with zero buffer copies (or near zero :))?
And if so, how can I load my own post-processing GPU kernels into this pipeline with the same zero-copy approach?
0 Kudos
Nina_K_Intel
Employee
984 Views
With Sandy Bridge it's not possible to write custom processing filters on the GPU.

On future Intel platforms, access to GPU acceleration will become available through Intel OpenCL.

Regards,
Nina
0 Kudos
nrson
Beginner
984 Views

Hi, Nina

I have been testing memcpy() from video memory to system memory after H.264 decoding (1080p).

However, CPU occupancy varies a lot with the stream count (Windows 7 32-bit, i5-2400).

With 1-5 streams, CPU occupancy is 1-2%.

With 6 or more streams, CPU occupancy is 90%. (I think HW decoding falls back to SW decoding.)

I saw from your earlier replies that CopyFrame and CopyBuffer use less CPU than memcpy().

CopyFrame and CopyBuffer are used through the mfxCoreInterface of mfxplugin.

How can I use CopyFrame and CopyBuffer without using mfxplugin?
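For reference, the per-frame copy I am measuring looks roughly like this (assuming an NV12 system-memory surface that the application has already locked or mapped):

#include "mfxvideo.h"
#include <cstring>

// Plain memcpy of an NV12 mfxFrameSurface1 into a contiguous buffer;
// 'dst' is assumed to be width*height*3/2 bytes.
void CopyNV12SurfaceWithMemcpy(const mfxFrameSurface1* surf, unsigned char* dst)
{
    const mfxFrameInfo& info = surf->Info;
    const mfxFrameData& data = surf->Data;
    const int width  = info.CropW ? info.CropW : info.Width;
    const int height = info.CropH ? info.CropH : info.Height;

    // Y plane: 'height' rows of 'width' bytes, honoring the surface pitch.
    for (int row = 0; row < height; ++row)
        std::memcpy(dst + row * width, data.Y + row * data.Pitch, width);

    // Interleaved UV plane: height/2 rows of 'width' bytes.
    unsigned char* dstUV = dst + width * height;
    for (int row = 0; row < height / 2; ++row)
        std::memcpy(dstUV + row * width, data.UV + row * data.Pitch, width);
}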

Best regards,
nrson

0 Kudos