Solved: High CPU and PCIe usage with hardware transcoding on Intel dGPUs

zsc_IM · ‎03-18-2023

Hello,

We are seeing high CPU usage with hardware transcoding on Intel dGPUs (DG1/A380/A770).

Several details of this issue are:

1. So far I have only tested on Windows.

2. CPU and PCIe usage is positively correlated with transcoding FPS: the simpler the video (e.g. 360p at 200 kbit/s), the higher the transcoding speed (hundreds of FPS) and the higher the CPU and PCIe usage.

3. It appears in the two most commonly used tools, ffmpeg and QSVEnc, as well as in Intel's sample_encode.exe.

4. I have reviewed the NVIDIA technical documentation (https://developer.nvidia.com/blog/nvidia-ffmpeg-transcoding-guide/), which says: "Adding the -hwaccel cuvid option means the raw decoded frames will not be copied and the transcoding will be faster and use less system resources". I think that explains it. With that option enabled, transcoding on an NVIDIA GPU uses only about 1% of the CPU and of the PCIe bandwidth. With it disabled, it uses about 5% CPU and about 1 GB/s of PCIe read and write bandwidth. I can't measure real-time PCIe traffic on Intel GPUs (there is no equivalent of "nvidia-smi"), but when I reduce the number of PCIe lanes available to the DG1 from 8 to 1, CPU consumption rises and FPS drops.

5. I checked the ffmpeg and Intel documentation and found a parameter called "gpucopy", but enabling it didn't solve the problem.

6. I have attached a low-bitrate video for testing (chosen to obtain a high transcoding FPS), along with a possible test command. Known test results:

CPU                     GPU    CPU usage   FPS
Intel 8700K/AMD 5500    DG1    20%         900
Intel 1200K             A380   5%          960
AMD 5600G               A770   20%         1200

sample_encode.exe h265 -dGfx -timeout 86400 -lowpower:on -hw -icq 28 -u speed -i 3.mkv -o 3.mp4 -w 1280 -h 720 -gpucopy::on
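
For a rough sense of how much PCIe traffic a raw frame copy would imply, here is a back-of-the-envelope sketch in Python. It assumes decoded frames are 8-bit NV12 (1.5 bytes per pixel) and that every frame crosses the bus once per direction when frames pass through system memory. The function names and the nominal per-lane throughput figures are illustrative assumptions, not measurements from these cards.

```python
# Nominal usable throughput per PCIe lane in bytes/s (after 128b/130b encoding).
# Assumed round figures for illustration, not measured values.
PCIE_BYTES_PER_LANE = {3: 0.985e9, 4: 1.969e9}


def nv12_frame_bytes(width: int, height: int) -> int:
    """Size of one 8-bit NV12 frame: full-resolution luma plus half-size chroma."""
    return width * height * 3 // 2


def copy_traffic_bytes_per_s(width: int, height: int, fps: float) -> float:
    """One-direction traffic if every raw frame crosses the bus once."""
    return nv12_frame_bytes(width, height) * fps


def link_utilization(traffic: float, gen: int, lanes: int) -> float:
    """Fraction of the nominal link bandwidth that the traffic would occupy."""
    return traffic / (PCIE_BYTES_PER_LANE[gen] * lanes)


if __name__ == "__main__":
    # 720p at 900 FPS, as in the DG1 row of the table above.
    traffic = copy_traffic_bytes_per_s(1280, 720, 900)
    print(f"{traffic / 1e9:.2f} GB/s per direction")
    for lanes in (8, 1):
        share = link_utilization(traffic, 3, lanes)
        print(f"x{lanes}: {share:.0%} of a nominal PCIe 3.0 link")
```

Under these assumptions, 720p at 900 FPS works out to roughly 1.24 GB/s per direction, which is more than a single PCIe 3.0 lane can nominally carry. That would be consistent with the FPS drop seen when the DG1 is limited to one lane, but only if raw frames are in fact being copied across the bus.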

zsc_IM · ‎03-18-2023

The hardware transcoding optimization diagram and settings that NVIDIA provides are concise and useful. Processing decoded frames directly on the GPU is a natural idea, but I searched Intel's official website and the GitHub documentation and found nothing comparable.

AlekhyaV_Intel · ‎03-20-2023

Hi,

Thank you for posting in Intel communities. We have contacted the admin team to answer your questions. We will get back to you soon with an update.

Regards,

Alekhya

zsc_IM · ‎03-26-2023

Hi,

Have you made any progress on this issue?

Thank you

AlekhyaV_Intel · ‎03-27-2023

Hi,

Thank you for your patience. Please find the update below:

sample_encode.exe is expected to have high CPU usage, because it copies frames from CPU memory to GPU memory for encoding, and that copy dominates overall performance. For performance analysis, we recommend running experiments with the sample_multi_transcode.exe tool.
1-N and N-N transcoding can perform even better than a single sample_multi_transcode pipeline. These pipelines can be implemented using the 'join' feature of oneVPL, which joins a session with other session(s); by default, sessions are not joined.
On Linux, the parallel_encoding feature can improve transcoding performance even further. Some examples can be found here.
For ffmpeg-qsv, the command line needs to be set up correctly to keep HW surfaces in video memory.
Sample command line for ffmpeg-qsv:

ffmpeg -hwaccel qsv -c:v h264_qsv -i input.mp4 -c:v h264_qsv -b:v 5M -look_ahead 1 output.mp4

If set up properly, sample_multi_transcode on an Intel dGPU should not need to use much CPU at all, just enough to provide bitstream I/O and synchronization. All the steps of sample_multi_transcode (e.g. decode, VPP, and encode) should use GPU resources.
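
As an illustration of setting up the command line, the sketch below assembles an ffmpeg-qsv invocation in Python. It uses only options that appear in command lines in this thread (-hwaccel qsv, -hwaccel_output_format qsv, the h264_qsv and hevc_qsv codecs, -global_quality). The helper name and its defaults are hypothetical. Whether surfaces actually stay in video memory still depends on the ffmpeg build and the driver, so treat this as a starting point and not a guaranteed configuration.

```python
import shlex
from typing import List


def qsv_transcode_args(src: str, dst: str,
                       decoder: str = "h264_qsv",
                       encoder: str = "hevc_qsv",
                       quality: int = 28) -> List[str]:
    """Build an ffmpeg argument list for a QSV decode -> QSV encode transcode.

    Hypothetical helper: the options are copied from commands in this thread.
    """
    return [
        "ffmpeg",
        # Input options must come before -i so they apply to the decoder.
        "-hwaccel", "qsv",
        "-hwaccel_output_format", "qsv",  # ask for decoded frames as QSV surfaces
        "-c:v", decoder,
        "-i", src,
        # Output options follow the input.
        "-c:v", encoder,
        "-global_quality", str(quality),
        "-y", dst,
    ]


if __name__ == "__main__":
    args = qsv_transcode_args("3.mkv", "3.mp4")
    # Print the command; pass `args` to subprocess.run() to execute it.
    print(shlex.join(args))
```

Building the command as a list keeps the input options (before -i) visibly separate from the output options, which is where hardware-decode setups most often go wrong.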

If this resolves your issue, please accept this as the solution. This helps others with similar issues. Thank you!

Regards,

Alekhya

zsc_IM · ‎03-27-2023

Hi,

I tested sample_multi_transcode.exe and the result is still the same: it consumes 6% of the CPU at around 350 FPS. I can't reproduce "All the steps of sample_multi_transcode e.g., Decode, VPP, and encode should use the GPU resources."

My test command is as follows, where 3.264 is extracted from 3.mkv in the previously uploaded 3.zip. If my command is incorrect, please help me correct it.

sample_multi_transcode.exe -i::h264 3.264 -o::h264 3.mp4 -hw -dGfx 1 -lowpower:on -b 3000 -u speed

I am sure I have configured ffmpeg correctly. The following command consumes 25% CPU at around 700 FPS. I also can't "keep HW surfaces in video memory".

ffmpeg -v verbose -hwaccel qsv -hwaccel_output_format qsv -qsv_device 3 -vcodec h264_qsv -i 3.mkv -c:v hevc_qsv -low_power 1 -preset veryfast -profile:v rext -global_quality 28 -y 3.mp4 -async 1

As you can see, I used -v verbose to print more detailed logs. The important information is in bold. I think it indicates that both decoding and encoding are done on the GPU, but the high CPU usage suggests otherwise.

ffmpeg -v verbose -hwaccel qsv -hwaccel_output_format qsv -qsv_device 0 -vcodec h264_qsv -i 3.mkv -c:v hevc_qsv -low_power 1 -preset veryfast -profile:v rext -global_quality 28 -y 3.mp4 -async 1
ffmpeg version N-110014-ga6e9d01f88-20230315 Copyright (c) 2000-2023 the FFmpeg developers
built with gcc 12.2.0 (crosstool-NG 1.25.0.90_cf9beb1)
configuration: --prefix=/ffbuild/prefix --pkg-config-flags=--static --pkg-config=pkg-config --cross-prefix=x86_64-w64-mingw32- --arch=x86_64 --target-os=mingw32 --enable-gpl --enable-version3 --disable-debug --disable-w32threads --enable-pthreads --enable-iconv --enable-libxml2 --enable-zlib --enable-libfreetype --enable-libfribidi --enable-gmp --enable-lzma --enable-fontconfig --enable-libvorbis --enable-opencl --disable-libpulse --enable-libvmaf --disable-libxcb --disable-xlib --enable-amf --enable-libaom --enable-libaribb24 --enable-avisynth --enable-chromaprint --enable-libdav1d --enable-libdavs2 --disable-libfdk-aac --enable-ffnvcodec --enable-cuda-llvm --enable-frei0r --enable-libgme --enable-libkvazaar --enable-libass --enable-libbluray --enable-libjxl --enable-libmp3lame --enable-libopus --enable-librist --enable-libssh --enable-libtheora --enable-libvpx --enable-libwebp --enable-lv2 --disable-libmfx --enable-libvpl --enable-openal --enable-libopencore-amrnb --enable-libopencore-amrwb --enable-libopenh264 --enable-libopenjpeg --enable-libopenmpt --enable-librav1e --enable-librubberband --enable-schannel --enable-sdl2 --enable-libsoxr --enable-libsrt --enable-libsvtav1 --enable-libtwolame --enable-libuavs3d --disable-libdrm --disable-vaapi --enable-libvidstab --enable-vulkan --enable-libshaderc --enable-libplacebo --enable-libx264 --enable-libx265 --enable-libxavs2 --enable-libxvid --enable-libzimg --enable-libzvbi --extra-cflags=-DLIBTWOLAME_STATIC --extra-cxxflags= --extra-ldflags=-pthread --extra-ldexeflags= --extra-libs=-lgomp --extra-version=20230315
libavutil 58. 3.100 / 58. 3.100
libavcodec 60. 6.101 / 60. 6.101
libavformat 60. 4.100 / 60. 4.100
libavdevice 60. 2.100 / 60. 2.100
libavfilter 9. 4.100 / 9. 4.100
libswscale 7. 2.100 / 7. 2.100
libswresample 4. 11.100 / 4. 11.100
libpostproc 57. 2.100 / 57. 2.100
[AVHWDeviceContext @ 0000023dbb94e4c0] Defaulting child_device_type to AV_HWDEVICE_TYPE_D3D11VA for oneVPL.Please explicitly set child device type via "-init_hw_device" option if needed.
[AVHWDeviceContext @ 0000023dbb949bc0] Using device 8086:4908 (Intel(R) Iris(R) Xe Graphics).
[AVHWDeviceContext @ 0000023dbb94e4c0] Use Intel(R) oneVPL to create MFX session, API version is 2.8, the required implementation version is 1.3
[AVHWDeviceContext @ 0000023dbb94e4c0] Initialize MFX session: implementation version is 2.8
[h264 @ 0000023dbd5ce6c0] Reinit context to 1280x720, pix_fmt: yuv420p
Input #0, matroska,webm, from '3.mkv':
Metadata:
ENCODER : ShanaEncoder
Duration: 00:29:30.86, start: 0.000000, bitrate: 781 kb/s
Stream #0:0: Video: h264 (Baseline), 1 reference frame, yuv420p(progressive, left), 1280x720 [SAR 1:1 DAR 16:9], 25 fps, 25 tbr, 1k tbn (default)
Metadata:
DURATION : 00:29:30.800000000
Stream #0:1: Audio: aac (LC), 48000 Hz, stereo, fltp (default)
Metadata:
DURATION : 00:29:30.858000000
Stream mapping:
Stream #0:0 -> #0:0 (h264 (h264_qsv) -> hevc (hevc_qsv))
Stream #0:1 -> #0:1 (aac (native) -> aac (native))
Press [q] to stop, [?] for help
[h264_qsv @ 0000023dcbb700c0] Decoder: output is video memory surface
[h264_qsv @ 0000023dcbb700c0] Use Intel(R) oneVPL to create MFX session with the specified MFX loader
[h264_qsv @ 0000023dcbb700c0] Decoder: output is video memory surface
[h264_qsv @ 0000023dcbb700c0] Use Intel(R) oneVPL to create MFX session with the specified MFX loader
[graph_1_in_0_1 @ 0000023dbd599cc0] tb:1/48000 samplefmt:fltp samplerate:48000 chlayout:stereo
[graph 0 input from stream 0:0 @ 0000023dbd599ec0] w:1280 h:720 pixfmt:qsv tb:1/1000 fr:25/1 sar:1/1
[hevc_qsv @ 0000023dcbf3f2c0] Using input frames context (format qsv) with hevc_qsv encoder.
[hevc_qsv @ 0000023dcbf3f2c0] Encoder: input is video memory surface
[hevc_qsv @ 0000023dcbf3f2c0] Use Intel(R) oneVPL to create MFX session with the specified MFX loader
[hevc_qsv @ 0000023dcbf3f2c0] Using the intelligent constant quality (ICQ) ratecontrol method
[hevc_qsv @ 0000023dcbf3f2c0] profile: hevc rext; level: 40
[hevc_qsv @ 0000023dcbf3f2c0] GopPicSize: 65535; GopRefDist: 1; GopOptFlag:; IdrInterval: 1
[hevc_qsv @ 0000023dcbf3f2c0] TargetUsage: 7; RateControlMethod: ICQ
[hevc_qsv @ 0000023dcbf3f2c0] ICQQuality: 28
[hevc_qsv @ 0000023dcbf3f2c0] NumSlice: 1; NumRefFrame: 1
[hevc_qsv @ 0000023dcbf3f2c0] RateDistortionOpt: unknown
[hevc_qsv @ 0000023dcbf3f2c0] RecoveryPointSEI: unknown
[hevc_qsv @ 0000023dcbf3f2c0] VDENC: ON
[hevc_qsv @ 0000023dcbf3f2c0] NalHrdConformance: OFF; VuiNalHrdParameters: OFF
[hevc_qsv @ 0000023dcbf3f2c0] FrameRateExtD: 1; FrameRateExtN: 25
[hevc_qsv @ 0000023dcbf3f2c0] IntRefType: 0; IntRefCycleSize: 0; IntRefQPDelta: 0
[hevc_qsv @ 0000023dcbf3f2c0] MaxFrameSize: 0; MaxSliceSize: 0
[hevc_qsv @ 0000023dcbf3f2c0] BitrateLimit: unknown; MBBRC: OFF; ExtBRC: OFF
[hevc_qsv @ 0000023dcbf3f2c0] Trellis: auto
[hevc_qsv @ 0000023dcbf3f2c0] RepeatPPS: OFF; NumMbPerSlice: 0; LookAheadDS: unknown
[hevc_qsv @ 0000023dcbf3f2c0] AdaptiveI: unknown; AdaptiveB: unknown; BRefType:off
[hevc_qsv @ 0000023dcbf3f2c0] MinQPI: 10; MaxQPI: 51; MinQPP: 10; MaxQPP: 51; MinQPB: 10; MaxQPB: 51
[hevc_qsv @ 0000023dcbf3f2c0] DisableDeblockingIdc: 0
[hevc_qsv @ 0000023dcbf3f2c0] SkipFrame: no_skip
[hevc_qsv @ 0000023dcbf3f2c0] PRefType: simple
[hevc_qsv @ 0000023dcbf3f2c0] GPB: ON
[hevc_qsv @ 0000023dcbf3f2c0] TransformSkip: ON
[hevc_qsv @ 0000023dcbf3f2c0] IntRefCycleDist: 0
[hevc_qsv @ 0000023dcbf3f2c0] LowDelayBRC: OFF
[hevc_qsv @ 0000023dcbf3f2c0] MaxFrameSizeI: 0; MaxFrameSizeP: 0
[hevc_qsv @ 0000023dcbf3f2c0] ScenarioInfo: 0
[hevc_qsv @ 0000023dcbf3f2c0] NumTileColumns: 1; NumTileRows: 1
Output #0, mp4, to '3.mp4':
Metadata:
encoder : Lavf60.4.100
Stream #0:0: Video: hevc, 1 reference frame (hev1 / 0x31766568), qsv(tv, progressive, left), 1280x720 (0x0) [SAR 1:1 DAR 16:9], q=2-31, 1000 kb/s, 25 fps, 12800 tbn (default)
Metadata:
DURATION : 00:29:30.800000000
encoder : Lavc60.6.101 hevc_qsv
Side data:
cpb: bitrate max/min/avg: 0/0/1000000 buffer size: 0 vbv_delay: N/A
Stream #0:1: Audio: aac (LC) (mp4a / 0x6134706D), 48000 Hz, stereo, fltp, delay 1024, 128 kb/s (default)
Metadata:
DURATION : 00:29:30.858000000
encoder : Lavc60.6.101 aac
[in#0/matroska,webm @ 0000023dbd5fae40] EOF while reading inputrate=1038.9kbits/s speed=29.4x
[in#0/matroska,webm @ 0000023dbd5fae40] Terminating demuxer thread
[h264_qsv @ 0000023dcbb700c0] A decode call did not consume any data: expect more data at input (-10)
Last message repeated 2 times
No more output streams to write to, finishing.
[out#0/mp4 @ 0000023dbd66eec0] All streams finished
[out#0/mp4 @ 0000023dbd66eec0] Terminating muxer thread
[AVIOContext @ 0000023dbd5d3e80] Statistics: 230991698 bytes written, 24 seeks, 898 writeouts
frame=44270 fps=734 q=-0.0 Lsize= 225578kB time=00:29:30.83 bitrate=1043.5kbits/s speed=29.4x
video:195957kB audio:28276kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.599755%
Input file #0 (3.mkv):
Input stream #0:0 (video): 44270 packets read (136683362 bytes); 44270 frames decoded;
Input stream #0:1 (audio): 83008 packets read (35381580 bytes); 83007 frames decoded (84999168 samples);
Total: 127278 packets (172064942 bytes) demuxed
Output file #0 (3.mp4):
Output stream #0:0 (video): 44270 frames encoded; 44270 packets muxed (200659606 bytes);
Output stream #0:1 (audio): 83007 frames encoded (84999168 samples); 83008 packets muxed (28954920 bytes);
Total: 127278 packets (229614526 bytes) muxed
[aac @ 0000023dbd5bcf40] Qavg: 262.611
[AVIOContext @ 0000023dbd5f8ec0] Statistics: 172966959 bytes read, 0 seeks
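
To check a verbose log like this one mechanically, a small Python sketch can scan for the "memory surface" lines that the QSV decoder and encoder print. It relies only on the wording of the log lines shown above. The function names are hypothetical, and it reports whatever memory kind the log states without assuming which other values ffmpeg may print.

```python
import re
from typing import Dict, Set, Tuple

# Matches e.g. "[h264_qsv @ 0000023dcbb700c0] Decoder: output is video memory surface"
_SURFACE = re.compile(
    r"\[(\w+) @ [0-9a-fA-F]+\] (Decoder|Encoder): (?:output|input) is (\w+) memory surface"
)


def surface_memory(log_text: str) -> Dict[Tuple[str, str], Set[str]]:
    """Map (role, codec) to the set of memory kinds reported in the log."""
    found: Dict[Tuple[str, str], Set[str]] = {}
    for match in _SURFACE.finditer(log_text):
        codec, role, kind = match.groups()
        found.setdefault((role, codec), set()).add(kind)
    return found


def stays_in_video_memory(found: Dict[Tuple[str, str], Set[str]]) -> bool:
    """True if both a decoder and an encoder report only video memory surfaces."""
    roles = {role for role, _codec in found}
    return roles == {"Decoder", "Encoder"} and all(
        kinds == {"video"} for kinds in found.values()
    )


if __name__ == "__main__":
    sample = (
        "[h264_qsv @ 0000023dcbb700c0] Decoder: output is video memory surface\n"
        "[hevc_qsv @ 0000023dcbf3f2c0] Encoder: input is video memory surface\n"
    )
    result = surface_memory(sample)
    print(result)
    print("all surfaces in video memory:", stays_in_video_memory(result))
```

For the log above, both the h264_qsv decoder and the hevc_qsv encoder report video memory surfaces. These lines only say where the surfaces are allocated; they do not measure CPU time spent elsewhere, such as demuxing, muxing, or the native AAC audio transcode that also appears in the stream mapping.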

I would appreciate it if you could actually test the above command.

Thanks!

AlekhyaV_Intel · ‎04-12-2023

Hi,

Before we explain what we tried, let's review the transcoding process flow:

Some CPU usage is unavoidable because of file I/O: reading the input encoded stream and writing the output encoded stream.
No CPU is involved during processing (decode -> encode); only GPU resources are used.

We ran the same command as you did, but CPU usage was always below 3%.

Configuration
- Platform: TGL + DG1
- GFX Driver: Intel® Arc™ & Iris® Xe Graphics - WHQL - Windows* (gfx_win_101.4255.exe)
- oneVPL: https://github.com/oneapi-src/oneVPL - master branch
Command
- sample_multi_transcode.exe -i::h264 3.h264 -o::h264 3_new.h264 -hw -dGfx 1 -lowpower:on -b 3000 -u speed

We couldn't reproduce your issue. We would like your graphics driver details for further debugging. You could also try again with the latest graphics driver, the one we used, and let us know if the issue persists.

Regards,

Alekhya

AlekhyaV_Intel · ‎04-24-2023

Hi,

Has the solution provided helped? Could you please give us an update regarding this issue?

Regards,

Alekhya

zsc_IM · ‎04-24-2023

Hi AlekhyaV_Intel,

I'm sorry for my late reply. I have tried your solution and got the correct result.

I agree with the following explanation, and thank you for teaching me this:

Before we explain what we tried, let's review the transcoding process flow:

Some CPU usage is unavoidable because of file I/O: reading the input encoded stream and writing the output encoded stream.
No CPU is involved during processing (decode -> encode); only GPU resources are used.

In addition, let me summarize my findings:

1. CPU utilization is positively correlated with transcoding FPS, on both NVIDIA and Intel GPUs.

2. The CPU utilization displayed in Windows Task Manager is inaccurate. I obtained the correct CPU utilization using HWMonitor and HWiNFO. Typically, a transcoding task running on a DG1 at 1000 FPS corresponds to approximately 5% utilization of an EPYC 7D12 CPU.

3. However, at the same transcoding FPS, CPU utilization with the Intel GPU (DG1) is approximately 50% higher than with the NVIDIA GPU (Tesla P4).
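
Because CPU utilization scales with FPS, raw percentages are hard to compare across machines with different core counts or runs at different speeds. One way to normalize them, sketched here in Python, is to convert whole-machine utilization into CPU time per transcoded frame. The function names are hypothetical, and the logical CPU count of 64 in the example is a placeholder assumption; substitute the values of the machine under test.

```python
def cpu_ms_per_frame(cpu_percent: float, logical_cpus: int, fps: float) -> float:
    """CPU time per transcoded frame, in milliseconds of one logical CPU.

    cpu_percent is whole-machine utilization (0-100) while transcoding.
    """
    busy_cpus = cpu_percent / 100.0 * logical_cpus  # logical CPUs kept busy
    return busy_cpus / fps * 1000.0


def relative_overhead(cost_a: float, cost_b: float) -> float:
    """How much higher cost_a is than cost_b, as a fraction (0.5 means 50% higher)."""
    return cost_a / cost_b - 1.0


if __name__ == "__main__":
    # Placeholder figures: 5% utilization at 1000 FPS on an assumed 64 logical CPUs.
    per_frame = cpu_ms_per_frame(5.0, 64, 1000.0)
    print(f"{per_frame:.2f} ms of CPU time per frame")
    # A GPU that needs 1.5x the CPU time per frame at the same FPS is 50% higher.
    print(f"{relative_overhead(1.5 * per_frame, per_frame):.0%} higher")
```

Expressed this way, a run at 350 FPS and a run at 1000 FPS, or runs on CPUs with different core counts, can be compared on the same footing.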

I apologize for hastily relying on the CPU usage displayed in Task Manager when I raised this question. Thank you for your patience and continued support. I hope my findings can help you surpass NVIDIA in "CPU efficiency" for video transcoding tasks in the future.

In summary, this issue stemmed from a misunderstanding. I was unsure, however, which answer button to click to best help others.

Thanks!

AlekhyaV_Intel · ‎05-04-2023

Hi,

Thank you for trying out our solution and sharing your findings. We consider them valuable feedback. As your issue is resolved, can we close this thread?

Regarding the answer button, you will find an "Accept as solution" button below every response. Click it to mark that response as the solution.

Regards,

Alekhya

AlekhyaV_Intel · ‎05-11-2023

Hi,

Thank you for accepting a solution. Glad to know that your issue is resolved. If you need any further assistance, please post a new question as this thread will no longer be monitored by Intel.

Regards,

Alekhya