Liu__Chao
Beginner
324 Views

QSV CPU usage

Hi,

Based on my experience, QSV hardware decoding/encoding still uses quite a bit of CPU, and it uses multiple threads internally, which makes it difficult to measure the real CPU usage.

So my questions are:

- What is all that CPU used for? Memory copies?

- Is there any option to make it use only one thread?

FYI, I use QSV by running ffmpeg on Linux.

13 Replies
Surbhi_M_Intel
Employee

Hi, 

I am assuming you are referring to h264 encode/decode. CPU usage will depend on the application you are using to access hardware acceleration. H264 decode happens entirely on a fixed-function block, known as the MFX codec or VDBox, which resides inside the GPU. Apart from reading the input and writing the output, which involve copies, I don't expect CPU utilization here. For h264 encode, execution happens on the EUs (execution units), the VME engine, and the MFX codec/VDBox. Again, reading the YUV input and writing the output will involve memcpy.
In the Media SDK API there is an option to keep the entire transcode pipeline in video memory, by choosing the input and output IOPattern accordingly. This provides a performance boost in customer applications, but when writing the output one will again see CPU utilization.
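For illustration, here is a minimal sketch of the video-memory pipeline option described above, written against the public Media SDK headers (my own fragment, not code from this thread; it needs the SDK installed to build and assumes a shared session with a video-memory frame allocator):

```c
/* Sketch only: assumes mfxvideo.h from the Media SDK and that the decoder
 * and encoder share a session with a video-memory frame allocator. */
#include <mfxvideo.h>

static void keep_transcode_in_video_memory(mfxVideoParam *dec_par,
                                           mfxVideoParam *enc_par) {
    /* Decoder writes its output surfaces to video memory... */
    dec_par->IOPattern = MFX_IOPATTERN_OUT_VIDEO_MEMORY;
    /* ...and the encoder reads its input from video memory, so decoded
     * frames never cross back to system memory until the encoded
     * bitstream is written out. */
    enc_par->IOPattern = MFX_IOPATTERN_IN_VIDEO_MEMORY;
}
```

With system-memory IOPatterns instead, every decoded surface is copied to system RAM and back, which is where much of the CPU cost discussed in this thread comes from.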

Let me know if I misunderstood your query or if you need more details on any point.

Thanks,
Surbhi

Liu__Chao
Beginner

Hi,

Thanks for the detailed information. That's very useful!

I am trying to figure out whether there is still room for optimization in ffmpeg.

I use ffmpeg on Linux (i3-4030U) to transcode a 25fps 3-megapixel h264 video.

The CPU usage is about 20%. Does that number make sense to you?

Liu__Chao
Beginner

Another test result:

Using ffmpeg on Linux (i5-5575R with Iris Pro 6200) to transcode a 30fps 1080p h264 video, the CPU usage is 30%.

This is a bit higher than I expected. I haven't dug into the related ffmpeg code yet; I will if you think this is definitely higher than necessary.

Liu__Chao
Beginner

I profiled the CPU usage with valgrind. The majority of the CPU time is spent in libmfxhw64-p.so.1.17. Without the source code I don't know what's going on, but note that memcpy takes only a very small fraction of the CPU time. It looks like there is something inside libmfxhw64-p.so.1.17 that could be optimized a bit.

614,933,220  ???:0x0000000000547cf0 [/4s/chao/bin/mfx/libmfxhw64-p.so.1.17]
231,278,286  ???:0x000000000056ba90 [/4s/chao/bin/mfx/libmfxhw64-p.so.1.17]
 40,020,376  /build/eglibc-oGUzwX/eglibc-2.19/string/../sysdeps/x86_64/memset.S:memset [/lib/x86_64-linux-gnu/libc-2.19.so]
 15,532,413  ???:0x000000000025a9a0 [/4s/chao/bin/drivers/iHD_drv_video.so]
 14,893,756  ???:do_bo_emit_reloc [/4s/chao/bin/libdrm_intel.so.1.0.0]
 14,283,758  /home/chao/projects/ffmpeg/libavformat/avc.c:ff_avc_find_startcode [/4s/chao/bin/ffmpeg_g]
 12,652,344  ???:0x0000000000225e40 [/4s/chao/bin/mfx/libmfxhw64-p.so.1.17]
 10,387,725  /build/eglibc-oGUzwX/eglibc-2.19/string/../sysdeps/x86_64/multiarch/memcpy-ssse3-back.S:__memmove_ssse3_back [/lib/x86_64-linux-gnu/libc-2.19.so]

 

Surbhi_M_Intel
Employee

Hi there, 

When you see 30% CPU utilization, what is the test scenario? Media SDK encode and decode shouldn't take 30% CPU. Are you including muxing, demuxing, and I/O-related activities? If you are unsure, you can use VTune Amplifier to inspect thread occupancy and see whether anything can be optimized further.
One quick test you could do: run the transcode using the MSDK framework, check the utilization, and compare it with an ffmpeg run of the same test.

-Surbhi

 

Liu__Chao
Beginner

Hi Surbhi,

The 30% CPU usage includes everything: I/O, demuxing, decoding, encoding, and muxing. However, according to the profile I posted earlier, the CPU time is mostly spent in functions inside MFX.

I haven't run the test you suggested yet. I was reading the ffmpeg code and comparing it with sample_decode. The ffmpeg code looks pretty messy to me. For example, it opens files like /dev/dri/renderD without doing anything with the opened files.

More suspicious is the following decoding loop:

        do {
            /* Poll: retry while the hardware reports busy, sleeping a fixed
               500 us between attempts instead of syncing an in-flight surface. */
            ret = MFXVideoDECODE_DecodeFrameAsync(q->session, flush ? NULL : &bs,
                                                  insurf, &outsurf, &sync);
            if (ret != MFX_WRN_DEVICE_BUSY)
                break;
            av_usleep(500);
        } while (1);

In sample_decode, SyncOutputSurface is called after DecodeFrameAsync returns MFX_WRN_DEVICE_BUSY.

I guess many (if not most) users access QSV through ffmpeg. Why doesn't Intel help the ffmpeg developers maintain/improve the related code?

 

Liu__Chao
Beginner

The CPU usage figures I reported for ffmpeg were based on live streams. It's not easy to run the Media SDK samples on live streams, so to compare sample_decode with ffmpeg I saved the 3MP live stream to an h264 file and ran both on the file:

-  sample_decode h264 -i test.h264

- ffmpeg -c:v h264_qsv -async_depth 10 -i test.h264 -c:v rawvideo -f null /dev/null

Both sample_decode and ffmpeg use 100% of one CPU core, though sample_decode is much faster: 370 fps vs 170 fps.

Also, in both cases callgrind shows that the majority of the CPU time is spent in libmfxhw64-p.so:

1,175,917,806  < ???:0x000000000055ad90 (2658x) [/opt/intel/mediasdk/lib64/libmfxhw64-p.so.1.17]
1,175,917,806  *  ???:0x000000000056ba90 [/opt/intel/mediasdk/lib64/libmfxhw64-p.so.1.17]

      518,403  < ???:0x000000000021adc0 (4x) [/opt/intel/mediasdk/lib64/libmfxhw64-p.so.1.17]
   81,533,711  < ???:0x0000000000224de0 (1439x) [/opt/intel/mediasdk/lib64/libmfxhw64-p.so.1.17]
   81,702,935  *  ???:0x0000000000225e40 [/opt/intel/mediasdk/lib64/libmfxhw64-p.so.1.17]

   26,678,539  < ???:0x00000000002259a0 (1417x) [/opt/intel/mediasdk/lib64/libmfxhw64-p.so.1.17]
   26,678,539  *  ???:0x0000000000225e20 [/opt/intel/mediasdk/lib64/libmfxhw64-p.so.1.17]

          366  < ???:sInputParams::sInputParams() (1x) [/4s/chao/projects/samples/samples/__bin/sample_decode]
          138  < ???:CDecodingPipeline::CDecodingPipeline() (6x) [/4s/chao/projects/samples/samples/__bin/sample_decode]
   18,015,084  < ???:0x00000000001b7620 (97287x) [/opt/intel/mediasdk/lib64/iHD_drv_video.so]
        6,209  < /build/eglibc-oGUzwX/eglibc-2.19/malloc/malloc.c:calloc (99x) [/lib/x86_64-linux-gnu/libc-2.19.so]
      748,598  < ???:0x00000000001b6cf0 (86x) [/opt/intel/mediasdk/lib64/libmfxhw64-p.so.1.17]
           33  < ???:0x00000000001b7810 (2x) [/opt/intel/mediasdk/lib64/iHD_drv_video.so]
      269,793  < ???:0x00000000001b44d0 (31x) [/opt/intel/mediasdk/lib64/libmfxhw64-p.so.1.17]
       30,490  < ???:0x0000000000215280 (1332x) [/opt/intel/mediasdk/lib64/libmfxhw64-p.so.1.17]
       23,613  < ???:0x0000000000099a70 (8x) [/opt/intel/mediasdk/lib64/libmfxhw64-p.so.1.17]
           15  < ???:0x00000000001b21c0 (1x) [/opt/intel/mediasdk/lib64/libmfxhw64-p.so.1.17]
           30  < ???:void init_ext_buffer<mfxExtThreadsParam>(mfxExtThreadsParam&) (1x) [/4s/chao/projects/samples/samples/__bin/sample_decode]
       98,432  < ???:0x00000000002259a0 (1417x) [/opt/intel/mediasdk/lib64/libmfxhw64-p.so.1.17]
       25,308  < ???:0x00000000002156d0 (1332x) [/opt/intel/mediasdk/lib64/libmfxhw64-p.so.1.17]
        1,866  < ???:0x00000000000b1610 (5x) [/opt/intel/mediasdk/lib64/libmfxhw64-p.so.1.17]
           52  < ???:CDecodingPipeline::Init(sInputParams*) (2x) [/4s/chao/projects/samples/samples/__bin/sample_decode]
           30  < ???:0x00000000001d3860 (2x) [/opt/intel/mediasdk/lib64/libmfxhw64-p.so.1.17]
      493,626  < ???:0x00000000001b75a0 (9444x) [/opt/intel/mediasdk/lib64/iHD_drv_video.so]
           66  < ???:CDecodingPipeline::AllocFrames() (3x) [/4s/chao/projects/samples/samples/__bin/sample_decode]
           79  < ???:drmMalloc (5x) [/4s/chao/bin/libdrm.so.2.4.0]
          128  < ???:sPluginParams::sPluginParams() (1x) [/4s/chao/projects/samples/samples/__bin/sample_decode]
   19,713,956  *  /build/eglibc-oGUzwX/eglibc-2.19/string/../sysdeps/x86_64/memset.S:memset [/lib/x86_64-linux-gnu/libc-2.19.so]

 

So there might indeed be some problems in ffmpeg's implementation; I'll report that to the ffmpeg developers. That aside, even for sample_decode, one full CPU core to decode 3MP video at 370 fps is still pretty expensive for me, which is about 3 ms per frame. Do you think this is expected?

Liu__Chao
Beginner

I ran the command "sample_decode h264 -vaapi -i test.h264".

This time the CPU usage is about 30% (0.3 of a core), and I got 700 fps.

So it looks like the CPU usage is caused by copying the data from the GPU to the CPU. Based on my numbers, it's >2 ms per 2048x1536 frame. The test CPU is an i3-4030U 1.9GHz with HD 4400. Does this number make sense?

Kamal_Devanga
Beginner

These are about the same numbers I get, and they sound about right. In my profiling, the majority of the time is indeed spent in D3D11 at the transfer boundary, though it looked to me as if some compute shaders were being dispatched, not just memory copies. Even a sparse implementation (which mine was, i.e. not testing with sample_encode, no disk read/write) didn't improve matters much. Performance improves about 20% between profiles (speed, quality).

 

 

Liu__Chao
Beginner

If the CPU usage is legitimate, I am curious what causes it. One thing I have noticed is that only decoding seems to have this high CPU usage problem; I couldn't find anything similar for encoding. Does this mean that copying to system memory is expensive, while copying from it is not? That doesn't sound right. Could anyone from Intel help me understand this better?

Bono
Beginner

yjl wrote:

If the CPU usage is legitimate, I am curious what causes it. One thing I have noticed is that only decoding seems to have this high CPU usage problem; I couldn't find anything similar for encoding. Does this mean that copying to system memory is expensive, while copying from it is not? That doesn't sound right. Could anyone from Intel help me understand this better?

Hello,

We tried the new build (Intel Media SDK 2017) as well, and we are also seeing increased GPU and CPU load: in comparison with the previous build (Intel Media SDK 2016), we see 2x the CPU usage and a 2.5x increase in GPU load.
Everything is tested on the same source. On the old version the CPU load is 16%, while on the new SDK 2017 we see a load of 37%; GPU load on the previous build was ~4%, and on the new one it is around 10%, sometimes jumping to 11%.

Did you manage to find anything in the meantime to lower the usage?
Everything is tested on a Xeon E3-1275 v5 on Linux. The only reason we are testing the new version is that we are experiencing instability after 80-90 days of server uptime: the server is up and running, but new transcoding jobs cannot be started. This happens on multiple servers.

Regards
 

Liu__Chao
Beginner

Hi Bono,

I didn't find anything after I posted here. As I wrote earlier in this thread, the CPU usage happens in libmfxhw64-p.so, which is closed source. It looks like this thread has already fallen off the Intel folks' radar, or maybe they just don't think it is important...

david123456
Beginner

We use MSS under Linux for an 8-channel HDMI live HD encoder and have found a strange case:

When using MSS 2016 on an i5-4590, CPU usage is about 32%. When using MSS 2017 on an i5-6500, CPU usage is about 60%.