VASILY_V_ · ‎05-18-2015

Hi,

I've started testing hardware transcoding based on Media SDK (2015 R5) on Ubuntu 12.04 with a patched vanilla kernel 3.14.5; LibVA and LibDRM are also patched.

I'm testing on this processor: http://ark.intel.com/ru/products/75107/Intel-Core-i3-4010U-Processor-3M-Cache-1_70-GHz and I have a question about the performance of the transcoding process.

I've tried FFmpeg with the QuickSync codec implementation, and also a small wrapper around the API that definitely uses hardware decoding, video processing, and encoding. I think FFmpeg uses software decoding and video filtering, so I'm asking only about the sample API wrapper, which uses the hardware capabilities. I see that even with a fully hardware-based transcoding pipeline I have a huge CPU load! Transcoding a video file from an MPEG-TS container (H.264) to MPEG-TS (H.264) with a sample utility like simple_transcode_opaque_async_vppresize loads my CPU to 60% in total, which is more than half (top shows about 190%-220% usage). My options are -hw -b 2000 -f 25/1, nothing unusual.

I can't figure out why so much CPU is being used. How can I decrease it? Which processes load the CPU if everything, including muxing and demuxing, runs on the GPU? Is such performance "OK" or is it too slow?

Also, if I use a software-only transcoding implementation (also via the Media SDK API), transcoding becomes 3 times slower. So the hardware implementation gives me only 3 times faster transcoding. Is that too small a gain or not?

Thank you!

Surbhi_M_Intel · ‎05-18-2015

Hi Vasily,

A few questions, just to make sure I understand your pipeline correctly:

Are you using Media SDK for HW transcoding and ffmpeg for demuxing?
If you are interested in using HW decoding, you can use video memory instead of opaque memory.
Are you checking GPU activity for the test you are doing?
"What processes load CPU if everything included muxing-demuxing flows on GPU?" Can you please explain how your muxing and demuxing is happening on the GPU?

One test you can do is to run the sample and check the CPU utilization, then add ffmpeg and see whether the utilization goes up. This test could confirm the bottleneck.

Thanks,
Surbhi

VASILY_V_ · ‎05-18-2015

Hi,

I'm using Media SDK for encoding, as an FFmpeg codec. So FFmpeg decodes the stream in software, and also muxes and demuxes it. I think the main problem is in the decoding process, because I've tried your sample (multi_transcode) and see that CPU utilization drops by a factor of 4 compared to FFmpeg with the same encoding parameters.

OK, I understand that the mux/demux process should run on the CPU, but the multi_transcode sample does not mux the stream, it only demuxes as far as I understand, and this process takes about 12% CPU in total (the top command shows 50%). Is that OK? I mean, here I get an H.264 stream without muxing it into a TS container, for example. So do you think that such high utilization is related to the demuxing process or to something else?

As for the GPU: I see that if I use HW-accelerated GPU transcoding, the GPU load stays at 100% until the transcoding process exits. And if there is no transcoding process, the GPU load is 0%. So it looks like this: 0%, then within two seconds it rapidly rises to 70%, then to 100%, and then while the transcoding process runs it stays at about 98%-100%. This behavior does not depend on how many transcoding processes I'm running, so I can't figure out how to measure GPU load, because it essentially has two states: "work hard" and "idle". Can you comment on this?
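The two CPU figures above (50% in top, about 12% in total) can be related with a little arithmetic. The sketch below assumes that top reports usage as a percentage of one logical CPU and that the i3-4010U exposes 4 logical CPUs (2 cores with Hyper-Threading); neither assumption is stated in the thread, and the function name is made up for illustration.

```python
def top_to_total_percent(top_percent, logical_cpus=4):
    """Convert a per-process top reading (100% = one logical CPU)
    into a share of the whole machine.

    logical_cpus=4 is an assumption for the i3-4010U, not a value
    taken from the thread.
    """
    if logical_cpus <= 0:
        raise ValueError("logical_cpus must be positive")
    return top_percent / logical_cpus


# 50% in top on an assumed 4 logical CPUs is 12.5% of the machine.
print(top_to_total_percent(50.0))
```

Under the same assumption, a top reading of 200% would correspond to half of the machine.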

Alexey_F_Intel · ‎05-18-2015

Vasily, can you add details on which tool you use to watch the GPU load %? Is it Metrics Monitor or something else?

Thanks, Alexey

VASILY_V_ · ‎05-18-2015

Alexey Fadeev (Intel) wrote:

Vasily, can you add details on which tool you use to watch the GPU load %? Is it Metrics Monitor or something else?

Thanks, Alexey

Metrics_monitor or a self-written tool that uses the API; the results are the same:

RENDER usage: 0.00,   VIDEO usage: 0.00,   VIDEO_E usage: 0.00
RENDER usage: 21.00,   VIDEO usage: 21.00,   VIDEO_E usage: 0.00
RENDER usage: 100.00,   VIDEO usage: 100.00,   VIDEO_E usage: 0.00
RENDER usage: 100.00,   VIDEO usage: 100.00,   VIDEO_E usage: 0.00
RENDER usage: 100.00,   VIDEO usage: 100.00,   VIDEO_E usage: 0.00
RENDER usage: 100.00,   VIDEO usage: 100.00,   VIDEO_E usage: 0.00
RENDER usage: 100.00,   VIDEO usage: 100.00,   VIDEO_E usage: 0.00
RENDER usage: 100.00,   VIDEO usage: 100.00,   VIDEO_E usage: 0.00
RENDER usage: 100.00,   VIDEO usage: 100.00,   VIDEO_E usage: 0.00
RENDER usage: 67.00,   VIDEO usage: 67.00,   VIDEO_E usage: 0.00
RENDER usage: 0.00,   VIDEO usage: 0.00,   VIDEO_E usage: 0.00
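Readings in this format are easier to compare once they are averaged. The following is a minimal sketch that parses lines shaped like the output above and reports, per engine, the mean usage and the share of samples at or above a threshold. The function names and the 99% threshold are illustrative choices, and the parser assumes nothing beyond the "NAME usage: value" layout shown in this post.

```python
import re
from statistics import mean

# Matches pairs such as "VIDEO_E usage: 97.00" anywhere in a line.
SAMPLE = re.compile(r"(\w+) usage:\s*([\d.]+)")


def parse_line(line):
    """Return {'RENDER': 21.0, 'VIDEO': 21.0, ...} for one output line."""
    return {name: float(value) for name, value in SAMPLE.findall(line)}


def summarise(lines, busy_threshold=99.0):
    """Per-engine mean usage and share of samples >= busy_threshold."""
    samples = [parse_line(line) for line in lines if "usage:" in line]
    if not samples:
        return {}
    summary = {}
    for engine in samples[0]:
        values = [s[engine] for s in samples]
        busy = sum(v >= busy_threshold for v in values)
        summary[engine] = {
            "mean": mean(values),
            "busy_share": busy / len(values),
        }
    return summary


log = [
    "RENDER usage: 0.00,   VIDEO usage: 0.00,   VIDEO_E usage: 0.00",
    "RENDER usage: 21.00,   VIDEO usage: 21.00,   VIDEO_E usage: 0.00",
    "RENDER usage: 100.00,   VIDEO usage: 100.00,   VIDEO_E usage: 0.00",
]
for engine, stats in summarise(log).items():
    print(engine, stats)
```

A summary like this makes it easier to tell a short ramp-up from a sustained plateau when several runs are compared.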

Surbhi_M_Intel · ‎05-19-2015

Vasily,

sample_multi_transcode takes a bitstream as input and gives a bitstream as output, so it doesn't do any muxing or demuxing. There is another sample, sample_full_transcode, which does muxing and demuxing and for which you need a particular version of ffmpeg (mentioned in the readme and the Media Samples guide). CPU usage will be seen for muxing, demuxing, and reading and writing the bitstream to a file.

For GPU load: I have seen multiple states in the past when I worked on it. It shouldn't be only idle or 100% occupied. I will check that again and get back to you.

Thanks,
Surbhi

VASILY_V_ · ‎05-21-2015

Hello, any news?

Surbhi_M_Intel · ‎05-21-2015

Hi Vasily,

I saw the behavior you mentioned. It happens because the encoding/transcoding process is making use of all the GPU resources, hence the 100% video usage.

There is a parameter in Media SDK, async depth, which specifies how many asynchronous operations are performed before synchronizing. If you set this parameter to 1 (meaning no asynchronous operations), you would see video usage below 100% at, let's say, 700 fps: no operations are running asynchronously, so video usage doesn't reach its maximum.
Now change the async depth to 4 or 5. You should see video usage rise to 100%, making use of all the available GPU, with fps > 700. If you limit the transcoding speed to 30 fps (real time), then you shouldn't see 100% GPU usage.

Also, you can get better results by setting a small sampling period in the metrics monitor application, but that will increase the overhead.

My past experiments with metrics monitor were with HEVC, which is not fully hardware accelerated and uses only partial hardware acceleration for transcoding, which is why I saw lower GPU load and higher CPU load.

Hope that answers your question. Let us know if you have any further questions about this.
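The effect of async depth described above can be illustrated with a toy model. This is not Media SDK code: it is a sketch that assumes every frame needs a fixed CPU submission time followed by a fixed GPU execution time, and that at most async_depth frames may be in flight at once. The timings (0.4 ms and 1.0 ms) are invented so that the depth-1 case lands near the 700 fps used as an example.

```python
def simulate(frames, submit_ms, gpu_ms, async_depth):
    """Return (gpu_utilisation, fps) for a simplified submit/execute loop.

    Hypothetical model: the CPU may submit frame i only once frame
    i - async_depth has completed on the GPU.
    """
    cpu_free = 0.0   # time at which the CPU can start the next submission
    gpu_free = 0.0   # time at which the GPU finishes its current frame
    done = []        # completion time of every frame
    for i in range(frames):
        start = cpu_free
        if i >= async_depth:
            # Synchronise: wait for the oldest in-flight frame to finish.
            start = max(start, done[i - async_depth])
        submitted = start + submit_ms
        gpu_start = max(submitted, gpu_free)
        gpu_free = gpu_start + gpu_ms
        done.append(gpu_free)
        cpu_free = submitted
    total_ms = done[-1]
    utilisation = frames * gpu_ms / total_ms
    fps = frames / (total_ms / 1000.0)
    return utilisation, fps


for depth in (1, 4):
    util, fps = simulate(frames=1000, submit_ms=0.4, gpu_ms=1.0, async_depth=depth)
    print(f"async_depth={depth}: GPU busy {util:.0%}, about {fps:.0f} fps")
```

In this model a depth of 1 leaves the GPU idle during each submission, while a larger depth lets submissions overlap with execution, so GPU usage approaches 100% and throughput rises.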

Thanks,
Surbhi

VASILY_V_ · ‎06-01-2015

Hi Surbhi!

I've tried to limit the transcoding speed to 25 fps, but metrics_monitor returns these values:

RENDER usage: 97.00,	VIDEO usage: 93.00,	VIDEO_E usage: 97.00
RENDER usage: 100.00,	VIDEO usage: 97.00,	VIDEO_E usage: 100.00
RENDER usage: 100.00,	VIDEO usage: 94.00,	VIDEO_E usage: 100.00
RENDER usage: 100.00,	VIDEO usage: 95.00,	VIDEO_E usage: 100.00
RENDER usage: 97.00,	VIDEO usage: 94.00,	VIDEO_E usage: 97.00
RENDER usage: 93.00,	VIDEO usage: 95.00,	VIDEO_E usage: 93.00
RENDER usage: 100.00,	VIDEO usage: 96.00,	VIDEO_E usage: 100.00
RENDER usage: 100.00,	VIDEO usage: 96.00,	VIDEO_E usage: 100.00
RENDER usage: 100.00,	VIDEO usage: 96.00,	VIDEO_E usage: 100.00
RENDER usage: 100.00,	VIDEO usage: 96.00,	VIDEO_E usage: 100.00
RENDER usage: 100.00,	VIDEO usage: 97.00,	VIDEO_E usage: 95.00
RENDER usage: 100.00,	VIDEO usage: 96.00,	VIDEO_E usage: 100.00
RENDER usage: 97.00,	VIDEO usage: 93.00,	VIDEO_E usage: 97.00
RENDER usage: 90.00,	VIDEO usage: 96.00,	VIDEO_E usage: 90.00

I also have two questions:

1) Can you explain more precisely what "all the GPU resources" means? Why 100%? Which resources, how does the SDK utilize them, etc.?

2) If the GPU is loaded at 100% the whole time, how can I tell whether I can load it more to achieve better compression quality, or whether I can transcode sound on the GPU and there are resources available for that?

Surbhi_M_Intel · ‎06-01-2015

Hi Vasily,

Can you please copy the command line for the test, if you are using any of the existing samples? How are you limiting the transcoding speed?
For transcode only, you shouldn't see Video_E usage. Video_E usage should reflect occupancy only if you have a VPP pipeline.

If you look at page 5 of the metrics monitor manual, you will see what the hardware units (GPU resources) such as Render, Video, and Video_E usage mean and which should be busy at what time. If you want a detailed view of the different hardware units, you can try Intel VTune Amplifier.

To see how many transcodes you can run, or how much you can stress the GPU, you should check the output fps. One experiment you can do: let's say 1 transcode of 500 frames (resolution: 3840x2160) completes in 6 seconds, i.e. 83.3 fps. You can increase the number of transcodes until you reach 30 fps (real time). Another method is to limit the pipeline speed (which you are doing).
There is a correlation between pipeline speed and GPU usage. If you see low GPU usage, it means either that the task is too small to stress the system (hardware units) or that the transcoding pipeline is not using the hardware units effectively, especially video usage, which is dedicated to video decoding and PAK. I hope that helps you estimate the number of transcodes you can achieve comfortably. You can also check the GPU frequency of your system, which is another factor that affects performance.
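The estimate above can be written out as a short calculation. This is only a sketch: it assumes that total throughput stays roughly constant as more sessions are added and is shared evenly between them, which a real system may not do, and the function names are made up for illustration.

```python
import math


def measured_fps(frames, seconds):
    """Throughput of a single unthrottled transcode."""
    return frames / seconds


def realtime_streams(total_fps, target_fps=30.0):
    """Rough number of sessions that can each sustain target_fps,
    assuming total throughput is shared evenly (an assumption)."""
    return math.floor(total_fps / target_fps)


fps = measured_fps(500, 6)        # the 500 frames in 6 seconds example
print(round(fps, 1), realtime_streams(fps))   # prints: 83.3 2
```

The result is a starting point for the experiment, and the actual limit is found by adding transcodes until the per-session fps drops to the target.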

-Surbhi

CPU usage with QuickSync transcoding