Performance: Linux vs Windows.

Timophey_S_

Hi,

I ran a few decoding tests with the tutorial samples on the same machine (i7 4770K, HD 4600) under Linux (Ubuntu 12.04 Server, kernel 3.8.0-23) and Windows 8.1 and got the following results. (The numbers in the cells are fps; the big_buck_bunny sequence was used, and output file writing was turned off.)

Windows:

             System memory          Video memory
             sw        hw           sw        hw
480p         2500      1800         1350      5000
1080p        530       770          230       1060

Linux:

             System memory          Video memory
             sw        hw           sw        hw
480p         ----      1070         ----      1240
1080p        ----      250          ----      570

The Linux results were an unpleasant surprise, so I tried to rule out file reading as the cause: I first read the whole file into system memory and only then started measuring the time. That helped a little (about 10%), but the results were still much worse than on Windows.
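
Keeping file I/O out of the timed region, as described above, can be sketched in Python as follows. This is only an illustration: `fps_excluding_read` is a hypothetical helper, and `decode` is a hypothetical stand-in for whatever consumes the buffer and returns a frame count (it is not a Media SDK call).

```python
import time

def fps_excluding_read(path, decode):
    """Return frames per second, timing only the decode step.

    `decode` is a hypothetical callable: it takes the stream as bytes
    and returns the number of frames it processed.
    """
    # Read the whole stream first so disk I/O is not part of the measurement.
    with open(path, "rb") as f:
        data = f.read()

    start = time.perf_counter()
    frames = decode(data)
    elapsed = time.perf_counter() - start

    # Guard against a zero interval on very fast or trivial decoders.
    return frames / elapsed if elapsed > 0 else float("inf")
```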

After that I tested another sample (sample_decode_drm) and got the following results (this time including file reading).

             System memory          Video memory (-vaapi)      Video memory + async 15
             sw        hw           sw        hw               hw
480p         ----      1310         ----      3630             4660
1080p        ----      320          ----      920              940

So it looks like the simple_2_decode_vmem sample uses video memory inefficiently. What is the fundamental difference between the simple_2_decode_vmem and sample_decode_drm samples? Is this a known issue?

P.S. Experts, do you know of a good vaapi manual or some useful articles for better understanding the Linux graphics stack? Is mediasdk-manual.pdf (Intel® Media Software Development Kit Reference Manual) enough to get started?

Sravanthi_K_Intel

Hello Timophey,

Thanks for sharing your analysis with us. The performance numbers you are seeing on Linux for the tutorial are not expected. We expect the sample and tutorial numbers to be close, since the tutorials are much simpler implementations of the samples. I will investigate this issue and get back to you.

In the meantime, if you can provide the command lines you used to run each of these, it will help us reproduce the issue.

Sravanthi_K_Intel

Hello Timophey,

I have some updates regarding comparison of simple_2_decode_vmem and sample_decode_drm, and why the performance differs.

The performance difference between these two applications can be attributed to how many asynchronous operations they perform before requiring explicit synchronization. In simple_2_decode_vmem, synchronization is performed after EVERY call to the RunAsync function. In sample_decode, the default AsyncDepth is set to 5, meaning that synchronization occurs after every 5 RunAsync operations. Increasing the number of asynchronous operations before synchronization gives a good performance boost, so using a larger AsyncDepth is recommended.
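
To illustrate why the number of in-flight operations matters, here is a toy timing model in Python. It is not Media SDK code: `simulate`, `submit_ms` and `hw_ms` are hypothetical names, and the model assumes a single hardware queue that processes one frame at a time, with the application allowed at most `depth` outstanding operations before it has to wait for the oldest one.

```python
from collections import deque

def simulate(frames, depth, submit_ms, hw_ms):
    """Return the total time (ms) to process `frames` frames in a toy model.

    submit_ms: assumed application-side cost of issuing one async call.
    hw_ms:     assumed hardware time to process one frame.
    depth:     maximum number of operations in flight before syncing.
    """
    cpu_t = 0.0          # current time on the application thread
    hw_free = 0.0        # time at which the hardware becomes idle
    in_flight = deque()  # completion times of submitted operations

    for _ in range(frames):
        if len(in_flight) == depth:
            # Pipeline is full: block until the oldest operation completes.
            cpu_t = max(cpu_t, in_flight.popleft())
        cpu_t += submit_ms                # issue the next async call
        start = max(cpu_t, hw_free)       # hardware starts when it is free
        hw_free = start + hw_ms
        in_flight.append(hw_free)

    # Drain: wait for everything still outstanding.
    while in_flight:
        cpu_t = max(cpu_t, in_flight.popleft())
    return cpu_t

print(simulate(100, depth=1, submit_ms=1.0, hw_ms=1.0))  # 200.0
print(simulate(100, depth=5, submit_ms=1.0, hw_ms=1.0))  # 101.0
```

In this model, syncing after every call serializes the application and the hardware, while a deeper pipeline lets them overlap. Real gains depend on the actual submit and hardware costs, so the numbers above should be read as an illustration only.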

I hope this answers your question. In the meantime, I will verify the performance numbers on Windows versus Linux and get back to you. As mentioned in my previous post, it would be helpful if you could share the command line you used for the analysis.

Timophey_S_

Hello, Sravanthi!

I apologize for the delay. Here are the command lines I used, with the decoding speed in parentheses.

Windows:

    System memory:  simple_decode.exe big_buck_bunny_480p.264        (1800 fps)
                    simple_decode.exe big_buck_bunny_1080p.264       (770 fps)

    Video memory:   simple_decode_d3d.exe big_buck_bunny_480p.264    (5000 fps)
                    simple_decode_d3d.exe big_buck_bunny_1080p.264   (1060 fps)

Linux:

  Tutorials:

    System memory:  ./simple_decode big_buck_bunny_480p.264          (1070 fps)
                    ./simple_decode big_buck_bunny_1080p.264         (250 fps)

    Video memory:   ./simple_decode_vmem big_buck_bunny_480p.264     (1240 fps)
                    ./simple_decode_vmem big_buck_bunny_1080p.264    (570 fps)

  SDK Samples:

    System memory:  ./sample_decode_drm h264 -hw -i big_buck_bunny_480p.264                      (1310 fps)
                    ./sample_decode_drm h264 -hw -i big_buck_bunny_1080p.264                     (320 fps)

    Video memory:   ./sample_decode_drm h264 -hw -i big_buck_bunny_480p.264 -vaapi               (3630 fps)
                    ./sample_decode_drm h264 -hw -i big_buck_bunny_1080p.264 -vaapi              (920 fps)
                    ./sample_decode_drm h264 -hw -i big_buck_bunny_480p.264 -vaapi -async 15     (4660 fps)
                    ./sample_decode_drm h264 -hw -i big_buck_bunny_1080p.264 -vaapi -async 15    (940 fps)

Thanks,

Timophey.

Sravanthi_K_Intel

Hello Timophey,

Thank you for sending the command lines you used. Things have been busy here, hence the late response. I will run the experiments and get back to you soon.

If you have any pressing questions in the meantime, let me know. Also, regarding your question about the performance difference between simple_decode and sample_decode, I answered that in my previous post. I hope it was helpful.

Timophey_S_

Hi!

Are there any updates? Were you able to reproduce the problem? Can I help by sending any additional information?

I found the same problem with the transcoding samples.

 

Sravanthi_K_Intel

In my own experiments on decode and transcode performance (using your command lines), the performance gap between Linux and Windows was much smaller. Note that the variation between individual runs was also non-negligible.

In general, the performance experiments you (and I) performed had the following limitations: (1) the test streams were short, (2) such short test cases do not stress the system enough to give stable performance numbers (OS scheduling, power management, and turbo frequency behavior all add noise), and (3) the load is quite small and does not cause the frequency to ramp up to turbo. In short, for a performance comparison, the test streams should be long enough to reach a steady state and stress the underlying system.
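
Given the run-to-run variation mentioned above, one possible way to make such comparisons more robust is to repeat each measurement and report the median together with the spread. The sketch below is an assumption about how that could be done, not something the SDK provides: `summarize_fps` is a hypothetical helper, and the sample numbers are made up for illustration rather than measured.

```python
import statistics

def summarize_fps(samples):
    """Return (median, relative_spread) for a list of fps measurements.

    relative_spread is (max - min) / median, a rough indicator of how
    much individual runs disagree with each other.
    """
    if not samples:
        raise ValueError("need at least one measurement")
    median = statistics.median(samples)
    spread = (max(samples) - min(samples)) / median
    return median, spread

# Made-up numbers standing in for repeated runs of the same command line.
runs = [900.0, 1000.0, 1100.0]
median, spread = summarize_fps(runs)
print(f"median {median:.0f} fps, spread {spread:.0%}")  # median 1000 fps, spread 20%
```

If the spread within one platform is comparable to the gap between platforms, the gap is probably not meaningful until longer streams or more runs are used.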

Having said that, we are doing some internal experiments, and will share our observations when they are ready.
