I am using the simple_decode_d3d example from the MediaSDK tutorials to decode 4K H.264 video.
Usage Example: simple_decode_d3d -i Test4KVideo.h264
simple_decode_d3d runs at 300 fps (without saving or copying the output).
However, when I copy the decoded frame in NV12 format (raw frame size ~12 MB) to a local buffer using memcpy, the rate drops to under 10 fps.
Searching the internet, I found the article "Copying Accelerated Video Decode Frame Buffers" and tried its CopyFrame function. Using CopyFrame gave 25 fps.
Going from 300 fps without copying to 25 fps with copying is a severe drop.
Can anyone suggest a better way to copy a decoded frame from the video surface to system memory? Any example or link would be helpful.
System Configuration: i7 4770, 8GB DDR3 RAM
Video: H.264, 3840x2160 (4K)
Thanks for the information you have given, but there are a few things I need in order to debug this issue.
>>What driver version are you using? You can check it with the system analyzer tool, described here
>>You are copying YUV frames, which are huge for 4K input. That is why you see a large drop in fps.
Can I ask what your pipeline is? Where do you want to feed the decode output? (This will help me suggest a better method.)
>>Have you tried writing the output to a file? What fps do you get?
>>Any particular reason for not using d3d11?
Thanks for the reply.
- Driver Version: 10.18.10.3345, Intel(R) HD Graphics 4600. Will updating the driver help?
- But is a drop from 300 fps to 10 fps with memcpy() expected? memcpy() on RAM is fast; we can easily copy 1 GB/s.
- Pipeline: We have a series of video processing filters. From the decoder filter we want to feed the raw frames to our own filter, which does further processing in RAM (not as costly as decoding; for example, applying overlays).
- Writing the output to a file: I will measure and let you know soon.
- No particular reason. I can try d3d11 and report the performance. Will it help?
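For reference, the arithmetic behind the frame size and frame rates quoted in this thread can be sketched as follows. This is only an illustration: the helper names nv12FrameBytes and copyBytesPerSecond are made up here and are not part of Media SDK, and the sketch assumes a tightly packed NV12 frame with no row padding (1.5 bytes per pixel).

```cpp
#include <cstdint>

// NV12 stores a full-resolution Y plane followed by an interleaved UV plane
// with half as many rows, so a packed frame takes 1.5 bytes per pixel.
// (Hypothetical helper, not a Media SDK function.)
constexpr std::uint64_t nv12FrameBytes(std::uint64_t width, std::uint64_t height) {
    return width * height * 3 / 2;
}

// Bytes that must be copied every second to sustain the given frame rate.
// (Hypothetical helper, not a Media SDK function.)
constexpr std::uint64_t copyBytesPerSecond(std::uint64_t width,
                                           std::uint64_t height,
                                           std::uint64_t fps) {
    return nv12FrameBytes(width, height) * fps;
}
```

For 3840x2160 this gives 12,441,600 bytes per frame, so 10 fps corresponds to roughly 124 MB/s of copy traffic, while keeping up with 300 fps would need roughly 3.7 GB/s.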
- I updated my driver to 10.18.10.3496; there is no change in performance.
- I used d3d11 with memcpy() to copy the raw frame from the video surface to RAM, and it improved performance: with DirectX 9 it was 8 fps, with DirectX 11 it is ~40 fps. However, CopyFrame() from the linked article didn't improve performance much, only from ~30 fps to 40 fps.
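When copying a locked NV12 surface with memcpy(), the source rows are usually padded to a pitch larger than the frame width, so the copy has to go plane by plane and row by row. A minimal sketch of that layout handling is below. The function name copyNV12ToPacked and its parameters are hypothetical; in Media SDK the plane pointers and pitch would presumably come from the locked surface's frame data, but that wiring is not shown. The sketch only shows the layout handling and does not by itself make reads from video memory any faster.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Copies one NV12 frame from a pitched source into a tightly packed
// destination of width * height * 3 / 2 bytes.
// srcY / srcUV: start of the Y plane and of the interleaved UV plane.
// pitch: bytes per source row (may exceed width because of padding).
// (Hypothetical helper, not a Media SDK function.)
void copyNV12ToPacked(const std::uint8_t* srcY, const std::uint8_t* srcUV,
                      std::size_t pitch, std::size_t width, std::size_t height,
                      std::uint8_t* dst) {
    assert(pitch >= width && height % 2 == 0);

    if (pitch == width) {
        // No padding: each plane is contiguous, so one copy per plane suffices.
        std::memcpy(dst, srcY, width * height);
        std::memcpy(dst + width * height, srcUV, width * height / 2);
        return;
    }

    // Y plane: `height` rows of `width` bytes each.
    for (std::size_t row = 0; row < height; ++row) {
        std::memcpy(dst + row * width, srcY + row * pitch, width);
    }

    // UV plane: height / 2 rows, each holding `width` bytes of interleaved U and V.
    std::uint8_t* dstUV = dst + width * height;
    for (std::size_t row = 0; row < height / 2; ++row) {
        std::memcpy(dstUV + row * width, srcUV + row * pitch, width);
    }
}
```

Each source byte is read exactly once and in address order, which is generally the friendliest access pattern for surface memory.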
After discussing this issue internally, here are a few things we can recommend:
--Writing YUV to RAM is a huge overhead, especially when you are writing 4K YUV. This will fill up the system RAM quickly. It would be best if your filters could use video memory. You can check the decode_vpp tutorial to see how the video memory surface can be shared. If you want to use any other memory, you can use OpenCL or OpenGL to implement surface sharing.
--I don't know what end output you want or what kind of filters you have, but as a suggestion: could you do transcoding instead? Decode ---> Filters ---> Encode. This way you are not writing YUV, which means much less overhead for the CPU and memory. Please check the transcode_vpp_vmem tutorial or the multi_transcode sample.
--Use system memory if you want to write the output back to disk. This should cut the overhead of copying from video memory to RAM. You might want to measure the performance here. Since your system has graphics hardware, the SDK should still use video memory internally and then do an optimized copy into system memory (this is the Media SDK's internal implementation).
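When measuring, it can help to first establish a baseline: how fast plain memcpy() runs between two ordinary heap buffers of the same size on the same machine. A rough sketch of such a measurement is below. The function name measureMemcpyBytesPerSecond is hypothetical and this is not Media SDK code; results depend on the machine, so no expected figure is given.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

// Returns the average memcpy throughput in bytes per second between two
// ordinary heap buffers of `frameBytes` each, over `iterations` copies.
// (Hypothetical helper for baseline measurement only.)
double measureMemcpyBytesPerSecond(std::size_t frameBytes, int iterations) {
    assert(frameBytes > 0 && iterations > 0);
    std::vector<unsigned char> src(frameBytes, 0x80);
    std::vector<unsigned char> dst(frameBytes);
    volatile unsigned char sink = 0;

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        // Vary the input and read back the output so the copy cannot be skipped.
        src[0] = static_cast<unsigned char>(i);
        std::memcpy(dst.data(), src.data(), frameBytes);
        sink = dst[static_cast<std::size_t>(i) % frameBytes];
    }
    const auto end = std::chrono::steady_clock::now();
    (void)sink;

    const double seconds = std::chrono::duration<double>(end - start).count();
    if (seconds <= 0.0) {
        return 0.0;  // Clock too coarse to measure; the caller should use more iterations.
    }
    return static_cast<double>(frameBytes) * iterations / seconds;
}
```

Calling it with the 4K NV12 frame size (12,441,600 bytes) and a few hundred iterations gives a number to compare against the rate achieved when the source is a decoded video surface.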
Just to be clear, our tutorials and samples are a great place to start, but they are not optimized for best performance.
Also, the latest driver for your system is 10.18.10.4080, which you can get from here (this won't make a difference to decoder performance, but at least you will have the latest).
Hope this helps as a starting point.