Media (Intel® oneAPI Video Processing Library, Intel Media SDK)
Access community support with transcoding, decoding, and encoding in applications using media tools from Intel. This includes Intel® oneAPI Video Processing Library and Intel® Media SDK.

VTune+MediaSDK, EU stalled/idle, bottleneck at Streamer

OTorg
New Contributor III
927 Views

Hi,

I've noticed oddities in GPU video encoding/transcoding performance.

VTune analysis shows the following:

- the EUs are in a stalled/idle state 96% of the time;
- the Command Streamer load is 90-94%.

[three VTune screenshots attached]

A similar situation is observed on different GPUs/CPUs (HD Graphics 530, UHD Graphics 630, Iris Plus Graphics 640, Iris Xe, and others), with different applications, and on different Windows versions (7, 10 Enterprise 2016 LTSB, 10 Pro 2004, etc.).

I'm talking about genuine encoding, not the QuickSync Fixed Function (low-power HW) mode. The encodes/transcodes are performed with the Intel Media SDK library under Windows.

 

Am I correct in understanding that the bottleneck is the Command Streamer, and that only a small part of the GPU's computing power is involved?

Is this typical for video-encoding workloads, or is something wrong with the implementation?


8 Replies
ArunJ_Intel
Moderator
819 Views

Hi 

 

How are you performing encode and decode with the Intel Media SDK: are you writing custom code, or trying one of the provided samples? If you are using one of the Media SDK samples, let us know which one you are trying this with.

 

One possible reason for low performance/low GPU utilization is implicit copies and a synchronous implementation. In the sample below, video memory is used and multiple encode tasks are in flight simultaneously. Could you try this sample and see whether it improves the GPU utilization of the encode?

 

https://github.com/Intel-Media-SDK/MediaSDK/tree/master/tutorials/simple_3_encode_vmem_async
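To illustrate why a synchronous implementation caps utilization, here is a toy timing model in Python (not Media SDK code; the stage durations are invented). With only one task in flight, every frame pays the surface copy plus the encode back to back; with several tasks in flight, the copy of one frame overlaps the encode of the previous one, so throughput is limited by the slower stage alone:

```python
# Toy model: cost of a two-stage pipeline (surface copy + encode)
# with and without overlap between consecutive frames.

def total_time(frames, copy_ms, encode_ms, async_depth):
    """Total wall time in ms to process `frames` frames."""
    if async_depth <= 1:
        # Synchronous: each frame pays copy + encode sequentially.
        return frames * (copy_ms + encode_ms)
    # Pipelined: after the first frame fills the pipeline, the
    # per-frame cost is the slower of the two stages.
    return copy_ms + encode_ms + (frames - 1) * max(copy_ms, encode_ms)

sync = total_time(100, 4, 6, 1)   # 100 * (4 + 6) = 1000 ms
deep = total_time(100, 4, 6, 4)   # 4 + 6 + 99 * 6 =  604 ms
print(sync, deep)
```

In the real pipeline, this overlap is what `AsyncDepth > 1` together with video-memory surfaces enables; the model only shows where the gap comes from.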

 

We need to check with the concerned teams whether the command streamer could be a bottleneck. For that, we need information on how you have implemented encode/decode; please share a reproducer if yours is a custom implementation.

 

Thanks

Arun

 

OTorg
New Contributor III
814 Views

I've experimented with both the official samples (sample_encode.exe, vpl-encode.exe) and my own code (HW surfaces, AsyncDepth = 4). The results are nearly identical.

Today I tested simple_encode_d3d11_async.exe from your link. The command-line arguments were: -hw -g 720x576 -b 2500 -f 25/1 _input.i420 NUL:

Result is still the same:

[ten VTune screenshots attached]

 

Today's tests were performed on UHD Graphics 630 (i7-10700), driver version 27.20.100.9316 (latest), Windows 10 Pro 2004. The VTune version is 2021.1.1.

 

OTorg
New Contributor III
764 Views

Hi,

I want to add one more consideration.

The VTune report is also confirmed by my own performance measurements.

HD Graphics 530 and Iris Plus Graphics 640 can encode the SAME number of live SD signals; they cannot encode more in real time, even though the 640 should outperform the 530 judging by its EU count.

That also suggests a bottleneck.

ArunJ_Intel
Moderator
650 Views

Hi,

 

Encoding and decoding on the GPU is done by the video codec. The video codec is also called Fixed Function; it is separate hardware from the general-purpose GPU part.

 

In the VTune report screen capture, you are referring to GPGPU statistics. The GPGPU statistics don't include Fixed Function. To see the video codec (Fixed Function) metrics in your VTune report, change your view to GPU Offload. This will give you the video codec GPU utilization percentage, and you can see the video codec utilization markers in the VTune timeline as well.

<vtune_screenshot attached>

On a Windows machine, an alternative way to read GPU performance is through Task Manager: launch Task Manager, go to the Performance tab, and select the "GPU" section. You can see the decode indicator and others; you can't view the encoder directly, but you can watch the 3D engine to get an idea of your GPU utilization while the Media SDK code is running.

 

<task_manager_screenshot attached>

 

In your last response you mentioned that, in your own performance measurements, "HD Graphics 530 and Iris Plus Graphics 640 can encode the SAME number of live SD-signals". Could you please provide more information on how you carried out these tests: which samples you used to reach this conclusion, and the command line used?

Could you also verify whether you see a change in GPU utilization by the video codec when trying this on different hardware?

 

 

Thanks

Arun

 

OTorg
New Contributor III
637 Views

ArunJ, thanks for your answer.

It steered my experiments in a direction I hadn't considered before.

 

 

1.

I made new VTune measurements with H.264 encoding of 1, 4, and 7 live 720x576 50i signals. In all cases the load of "Render/GPGPU Command Streamer" was 92-95% (strange). But that turned out to be a misapprehension caused by the resolution of the VTune charts.

 

The original approach/view is:

[VTune screenshot: zoomed-out timeline]

 

And here's what we get after zooming in to the smallest detail:

[VTune screenshot: timeline zoomed to finest detail]

 

Yes, the Render/GPGPU Command Streamer is loaded at 90-100%, but not all the time :) When there is work (a frame or frames), it is loaded to the maximum; otherwise the load is 0. At a large scale, hovering the mouse simply doesn't take the "rests" into account (it shows the maximum, not the average). OK, we'll keep that in mind in the future.
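The aggregation effect described above can be reproduced in a few lines of plain Python (made-up 0/100% samples, not VTune data): downsampling a bursty load with `max` per bucket reports constant 100%, while the mean exposes the true duty cycle.

```python
# A bursty load: 100% for 2 samples, then idle for 8 (20% duty cycle).
samples = ([100] * 2 + [0] * 8) * 50   # 500 samples

def buckets(data, size, agg):
    """Downsample `data` into buckets of `size`, aggregated by `agg`."""
    return [agg(data[i:i + size]) for i in range(0, len(data), size)]

per_bucket_max = buckets(samples, 10, max)
per_bucket_avg = buckets(samples, 10, lambda b: sum(b) / len(b))

print(per_bucket_max[0])   # 100  -> chart looks pegged at maximum
print(per_bucket_avg[0])   # 20.0 -> actual average load
```

Whether a real profiler's tooltip shows bucket maxima or means at a given zoom level is a rendering choice; this only demonstrates how large the difference can be.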

 

 

2.

You said GPU encoding/decoding is performed by the Fixed Function video codec. But in your screenshot and in my measurements, I can see that the "Render and GPGPU" entity is also involved. Utilization of both grows with the number of encoding channels:

[three VTune screenshots attached]

 

And I see that the "Video Codecs" utilization is far from 100%. So my next question is: does the productivity of "Render and GPGPU" depend on the number of EUs? Or does it depend on the processor class (i3/i7/Xeon)?

 

 

3.

Is it possible to read the "Video Codec", "Video Codec 2", and "Render and GPGPU" utilization levels in (my own) software under Windows? Or are they VTune-specific measurements?

 

 

Kind regards,

Oleksandr

ArunJ_Intel
Moderator
596 Views

Hi,

 

Your first question is more related to VTune than to the Media SDK, and could be better addressed if raised in the Analyzers forum. Please find the link to the Analyzers forum:

https://community.intel.com/t5/Analyzers/bd-p/analyzers

 

For question 2): in video encode/decode, the GPU utilization depends mainly on the input video. Since utilization depends on the video, it is difficult to say whether increasing the EU count or upgrading the processor would help in a given scenario; the answer varies case by case with the video input.

 

 

Thanks

Arun

 

 

OTorg
New Contributor III
590 Views

Hi,

 

>> Your first question is more related to vtune than media sdk...

"First" - it wasn't a question. I was simply explaining why I misunderstood VTune's readings, and thus misnamed the topic of this forum. That's all, let's move on.

 

>> In video encode/decode the GPU utilization would depend mainly on the input video...

Hmm. I understand that utilization depends on the input video's complexity and dimensions. It also depends on encoding quality, the number of reference frames, etc. But why does that prevent you from answering the question: is there a relationship with the number of EUs or not?

 

Your attempts to provide answers gave me a basis for googling, and I've found high-quality answers:

https://github.com/Intel-Media-SDK/MediaSDK/blob/master/doc/MFE-Overview.md#problem-statement

https://01.org/sites/default/files/documentation/intel-gfx-prm-osrc-skl-vol08-media_vdbox.pdf

They are from the Linux world, but they explain perfectly what is happening in the hardware.


Thank you!

ArunJ_Intel
Moderator
556 Views

As you have accepted your solution, we take it that your issue is solved. As for your VTune queries, we can see you have posted a new question in the Analyzers forum, which should be addressed soon.

 

https://community.intel.com/t5/Analyzers/GPU-Engines-utilization-readings/m-p/1265534#M20166

 

If you need any additional information here, please submit a new question, as this thread (the current one in the Media SDK forum) will no longer be monitored.

 

 

Thanks

Arun

 
