VPP NV12->RGB4 performance

Mihail_P_ · ‎10-27-2015

Hi.

I'm working on a multithreaded decoder app. I'm using Intel INDE Professional 2015 Update 2 on a Core i7-4790 with HD Graphics 4600, driver version 10.18.14.4264, on Windows 8.1.

My pipeline looks like this:

h264 stream --> decoder (video memory) --> render (DirectX) --> VPP (NV12->RGB4) --> system memory

Without the VPP stage, performance is about 500 fps (20 Full HD streams at 25 fps each), but as soon as I enable VPP it drops to around 100 fps. Is it expected that the VPP operation is this heavy in terms of performance?
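To put the two figures side by side, here is a small back-of-the-envelope sketch. It is not taken from the application itself; the function names are hypothetical, and the only inputs are the numbers quoted in this post (500 fps without VPP, about 100 fps with it, 25 fps per stream).

```python
def ms_per_frame(aggregate_fps):
    """Average wall-clock time available per frame at a given aggregate rate."""
    return 1000.0 / aggregate_fps


def realtime_streams(aggregate_fps, stream_fps=25):
    """How many streams of stream_fps can be sustained at this aggregate rate."""
    return int(aggregate_fps // stream_fps)


for label, fps in (("without VPP", 500), ("with VPP", 100)):
    print(label, ms_per_frame(fps), "ms/frame,",
          realtime_streams(fps), "real-time streams")
```

Read this way, enabling VPP raises the average cost per frame from 2 ms to 10 ms, which leaves room for only 4 real-time streams instead of 20.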

Sravanthi_K_Intel · ‎10-27-2015

Hi Mihail,

A couple of points. System memory is not as performant as video memory, so if your VPP stage operates on video-in/system-out surfaces, there will be some performance loss due to the video-to-system copy. Can you clarify something in your pipeline, though: are you rendering (or playing back) the stream using D3D surfaces and then applying VPP filters? As you are probably aware, specifying video memory for decode (in and out) with D3D surfaces means all surfaces are available in video memory (or directly accessible as D3D surfaces), after which you can extend the media pipeline by calling VPP color conversion. That pipeline should not drop the fps as much as you are seeing. You can test a sample pipeline like this with the simple_6_decode_vpp_postproc_vmem tutorial from https://software.intel.com/en-us/intel-media-server-studio-support/training.

Let me know if this helps. Otherwise, could you modify one of the tutorials to reproduce your pipeline and send us the code so that we can reproduce the behavior?

Mihail_P_ · ‎10-28-2015

Hi Sravanthi,

"Although, can you clarify something in your pipeline -> Are you rendering (or playing back) the stream using D3D surface and then applying VPP filters?"

Yes, I perform D3D rendering (the renderer code is based on sample_decode, with some modifications for a multithreaded environment); after that I apply VPP if the current thread needs it.

I will try to modify the tutorial code. It'll take some time; I'll get back with a reproducer.

In the meantime I have one question. My pipeline is split between two threads: the first is for decoding, the second is for rendering and optional VPP (as in sample_decode, but with a call to RunFrameVPPAsync + SyncOperation in DeliverOutput). Could this be the cause of the performance drop?

Sravanthi_K_Intel · ‎11-06-2015

Hi Mihail - apologies for the delayed response.

I can't say for certain whether threading is hurting performance; that depends on the implementation. But if you want to evaluate the performance of a media pipeline with and without VPP (to check whether the VPP stage is the culprit), may I suggest our sample_multi_transcode sample for experimentation: https://software.intel.com/en-us/intel-media-server-studio-support/code-samples. You can set up any pipeline (decode, encode, or VPP) with sample_multi_transcode and observe the performance impact. The readme file inside the sample folder is very helpful. Let me know if that helps.

Mihail_P_ · ‎12-04-2015

Hi Sravanthi,

Sorry for the long delay. After some experiments we found that the main bottleneck was not VPP itself but the vmem-to-sysmem transfer (see the pipeline config in my first message). We were able to achieve the desired performance by downscaling the image during the VPP stage, thus reducing the transfer volume.

Thanks for the assistance; the thread can be closed.
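The effect of downscaling on transfer volume can be estimated with simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, the 960x540 target is an assumed example (the actual downscale factor is not stated here), and the sizes ignore any pitch or height alignment padding the driver may add.

```python
def frame_bytes(width, height, fourcc):
    """Unpadded size of one frame in bytes."""
    pixels = width * height
    if fourcc == "NV12":   # 8-bit luma plane + half-size interleaved chroma
        return pixels * 3 // 2
    if fourcc == "RGB4":   # 4 bytes per pixel
        return pixels * 4
    raise ValueError("unsupported format: " + fourcc)


def transfer_bytes_per_sec(width, height, fourcc, fps):
    """Bytes copied from video to system memory per second."""
    return frame_bytes(width, height, fourcc) * fps


# 20 Full HD streams at 25 fps each = 500 frames per second in total
full = transfer_bytes_per_sec(1920, 1080, "RGB4", 500)
# Assumed example: halving both dimensions during VPP
half = transfer_bytes_per_sec(960, 540, "RGB4", 500)

print(full / 1e9, "GB/s at 1920x1080")
print(half / 1e9, "GB/s at 960x540")
print(full // half, "x less data to copy")
```

Under these assumptions a full-size RGB4 output needs roughly 4.1 GB/s of video-to-system copies at 500 fps, while halving each dimension cuts that by a factor of 4. Note also that an RGB4 frame is about 2.7 times larger than the NV12 frame it was converted from, so the color conversion itself increases the amount of data that has to cross to system memory.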

Roman_T_ · ‎12-04-2015

Mihail P. wrote:

...we found that main bottleneck was not VPP itself, but vmem-to-sysmem transfer...

Hi all!

Does the sentence above mean that all Media SDK advantages are multiplied by zero during the vmem-to-sysmem transfer?

Best regards,
Roman

Mihail_P_ · ‎12-04-2015

Roman, that depends on your task. For example, our task couldn't be solved without the MSDK's advantages...

Sravanthi_K_Intel · ‎12-04-2015

Hi Roman, as Mihail mentioned, it depends on the application and your requirements. Using system memory instead of video memory can cause some performance degradation (roughly a 10-15% drop in my experiments), depending on the system you have, the pipeline you are executing, and the streams/codecs being used.