I've done a few decoding tests with samples from the tutorial on the same machine (i7 4770K, HD 4600) under Linux (Ubuntu 12.04 Server, kernel 3.8.0-23) and Windows 8.1 and got the following results. (Numbers in the cells are fps; the big_buck_bunny sequence is used, and output file writing is turned off.)
Windows 8.1:
         System memory     Video memory
         sw       hw       sw       hw
480p     2500     1800     1350     5000
1080p    530      770      230      1060

Linux:
         System memory     Video memory
         sw       hw       sw       hw
480p     ----     1070     ----     1240
1080p    ----     250      ----     570
The Linux results unpleasantly surprised me, so I tried to rule out file-reading overhead: I first read the whole file into system memory and only then measured the decoding time. This gave some benefit (about 10%), but the results were still much worse than on Windows.
After that I tested another sample (sample_decode_drm) and got the following results (with file reading included).
         System memory     Video memory (-vaapi)     Video memory + async 15
         sw       hw       sw       hw                hw
480p     ----     1310     ----     3630              4660
1080p    ----     320      ----     920               940
So it looks like the simple_2_decode_vmem sample uses video memory ineffectively. What is the fundamental difference between the simple_2_decode_vmem and sample_decode_drm samples? Is this a known issue?
P.S. Experts, do you know a good VA-API manual or some useful articles for a better understanding of the Linux graphics stack? Is mediasdk-manual.pdf (Intel® Media Software Development Kit Reference Manual) enough to get started with?
Thanks for sharing your analysis with us - the performance numbers you are seeing on Linux for the tutorial are not expected. We expect the sample and tutorial numbers to be close, since the tutorials are much simpler implementations of the samples. I will investigate this issue and get back to you.
In the meantime, if you can provide the command lines you used to run each of these, it will be helpful for reproduction purposes.
I have some updates regarding the comparison of simple_2_decode_vmem and sample_decode_drm, and why the performance differs.
The performance difference between these two applications can be attributed to how many asynchronous operations they perform before requiring explicit synchronization. In simple_2_decode_vmem, synchronization is performed after EVERY call to the RunAsync function. In sample_decode, the default AsyncDepth is set to 5, meaning synchronization occurs only after every 5 RunAsync operations. Increasing the number of asynchronous operations in flight before synchronizing gives a good performance boost, so using a larger AsyncDepth is always recommended.
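For reference, the pipelining pattern looks roughly like this (a C-style pseudocode sketch against the Media SDK decode API, not compilable as-is; error handling, surface allocation, and end-of-stream draining are omitted):

```
// Sketch: keep up to AsyncDepth decode operations in flight and only
// synchronize the oldest one when the queue is full.
mfxVideoParam par = {};
par.AsyncDepth = 5;                    // instead of syncing every frame
// ... MFXVideoDECODE_Init(session, &par);

std::queue<mfxSyncPoint> inflight;
while (more_bitstream) {
    mfxFrameSurface1* out = nullptr;
    mfxSyncPoint syncp = nullptr;
    MFXVideoDECODE_DecodeFrameAsync(session, &bitstream, work_surface,
                                    &out, &syncp);
    if (syncp)
        inflight.push(syncp);          // decode submitted, not finished
    if (inflight.size() >= par.AsyncDepth) {
        // queue full: wait only for the OLDEST operation, so the newer
        // ones keep the hardware busy while we wait
        MFXVideoCORE_SyncOperation(session, inflight.front(), 60000);
        inflight.pop();
    }
}
// drain the remaining sync points after the bitstream ends ...
```

Syncing after every DecodeFrameAsync call (the simple_2_decode_vmem pattern) collapses this queue to depth 1, which serializes the CPU and GPU work.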
I hope this answers your question. In the meantime, I will verify the performance numbers on Windows versus Linux and get back to you. As mentioned in my previous post, it would be helpful if you could share the command line you used for the analysis.
I apologize for the delay. Here are the command lines I used; the decoding speed is shown in parentheses.
System memory: simple_decode.exe big_buck_bunny_480p.264 (1800 fps)
simple_decode.exe big_buck_bunny_1080p.264 (770 fps)
Video memory: simple_decode_d3d.exe big_buck_bunny_480p.264 (5000 fps)
simple_decode_d3d.exe big_buck_bunny_1080p.264 (1060 fps)
System memory: ./simple_decode big_buck_bunny_480p.264 (1070 fps)
./simple_decode big_buck_bunny_1080p.264 (250 fps)
Video memory: ./simple_decode_vmem big_buck_bunny_480p.264 (1240 fps)
./simple_decode_vmem big_buck_bunny_1080p.264 (570 fps)
System memory: ./sample_decode_drm h264 -hw -i big_buck_bunny_480p.264 (1310 fps)
./sample_decode_drm h264 -hw -i big_buck_bunny_1080p.264 (320 fps)
Video memory: ./sample_decode_drm h264 -hw -i big_buck_bunny_480p.264 -vaapi (3630 fps)
./sample_decode_drm h264 -hw -i big_buck_bunny_1080p.264 -vaapi (920 fps)
./sample_decode_drm h264 -hw -i big_buck_bunny_480p.264 -vaapi -async 15 (4660 fps)
./sample_decode_drm h264 -hw -i big_buck_bunny_1080p.264 -vaapi -async 15 (940 fps)
Thank you for sending the command lines you used. Things have been a little swamped here, hence the slightly late response. I will do the experiments and get back to you soon.
If there is any pressing question in the meantime, let me know. Also, regarding your question on the performance difference between simple_decode and sample_decode, I already answered it in my previous post. I hope that was helpful to you.
In my own experiments on decode and transcode performance (using your command lines), the performance gap between Linux and Windows was much smaller. Something to note is that the variation between individual runs was also non-negligible.
In general, the performance experiments you (and I) performed have the following limitations: (1) the test streams are short; (2) such test cases do not stress the system enough to produce stable performance numbers (OS scheduling, power management, frequency turbo behavior all add noise); (3) the load is quite small and does not cause the frequency to turbo up. In short, for performance comparisons, the test streams should be longer, so that the system reaches a steady state and is properly stressed.
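To reduce the impact of run-to-run variation, it also helps to repeat each measurement and report the median rather than a single run. A minimal harness along these lines is sketched below; `sleep 0.05` is only a placeholder workload, which you would replace with the actual decode command line:

```python
import statistics
import subprocess
import time

def median_runtime(cmd, runs=5):
    """Run cmd several times; return the median wall-clock time in seconds."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        times.append(time.perf_counter() - start)
    return statistics.median(times)

if __name__ == "__main__":
    # Placeholder workload; substitute the real benchmark, e.g.
    # ["./sample_decode_drm", "h264", "-hw", "-i", "long_stream.264", "-vaapi"]
    t = median_runtime(["sleep", "0.05"], runs=3)
    print(f"median wall time: {t:.3f} s")
```

Dividing the known frame count of the stream by the median wall time then gives a steadier fps figure than a single timed run.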
Having said that, we are doing some internal experiments, and will share our observations when they are ready.