Re: GPU hangs when decoding 2 HEVC UHD streams 444 10 bits (Y410 pixel format).

njean · ‎11-21-2022

On Linux, when we decode two HEVC UHD 444 10-bit streams on an i5 Alder Lake (also reproduced on an i7 Tiger Lake), we get a GPU hang.

We reproduce the bug with the sample app, with the following command:
sample_decode h265 -i ./test-444-10.h265 -hw -vaapi -o /dev/null & sample_decode h265 -i ./test-444-10.h265 -hw -vaapi -o /dev/null
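
The one-liner above starts one decode in the background and a second one in the foreground. As a rough sketch of the same reproduction step, the Python below launches several copies of a command at once and collects their exit codes. The helper name `run_concurrently` is hypothetical and not part of oneVPL; the `sample_decode` arguments are copied from the post and assume the tool is on the PATH.

```python
import subprocess
import sys

# Arguments copied from the post; the input file is the reporter's own stream.
DECODE_CMD = ["sample_decode", "h265", "-i", "./test-444-10.h265",
              "-hw", "-vaapi", "-o", "/dev/null"]

def run_concurrently(cmd, copies=2):
    """Start `copies` instances of `cmd` at once and return their exit codes."""
    procs = [subprocess.Popen(cmd) for _ in range(copies)]
    # The children run in parallel; wait() only blocks until each one exits.
    return [p.wait() for p in procs]

# Harmless self-check that uses the Python interpreter instead of sample_decode.
print(run_concurrently([sys.executable, "-c", "pass"]))

# On a machine with oneVPL installed, the reproduction would be:
# run_concurrently(DECODE_CMD, copies=2)
```

Raising `copies` may make the hang easier to hit, but that is an assumption; the thread only reports the two-stream case.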

The bug is a GPU hang in the Intel Gfx stack.

i915 kernel messages:

kernel: [26204.741232] i915 0000:00:02.0: [drm] Resetting vcs1 for preemption time out
kernel: [26204.741258] i915 0000:00:02.0: [drm] sample_decode[36374] context reset due to GPU hang
kernel: [26204.775905] i915 0000:00:02.0: [drm] GPU HANG: ecode 12:10:28fffffd, in sample_decode [36374]
kernel: [26213.573237] i915 0000:00:02.0: [drm] Resetting vcs1 for preemption time out
kernel: [26213.573256] i915 0000:00:02.0: [drm] sample_decode[36373] context reset due to GPU hang
kernel: [26213.613847] i915 0000:00:02.0: [drm] GPU HANG: ecode 12:10:28fffffd, in sample_decode [36373]
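
To compare runs, it can help to pull the hang records out of the kernel log automatically. The sketch below is only an illustration: the regular expressions are derived from the six lines quoted above, other kernel versions may word these messages differently, and the function name `summarize_hangs` is hypothetical.

```python
import re

# Patterns derived only from the log lines quoted in this post.
HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<ecode>[0-9a-fA-F:]+), in (?P<proc>\S+) \[(?P<pid>\d+)\]")
RESET_RE = re.compile(r"Resetting (?P<engine>\w+) for preemption time out")

def summarize_hangs(log_text):
    """Return (hang records, sorted engine names) found in i915 log text."""
    hangs, engines = [], set()
    for line in log_text.splitlines():
        hang = HANG_RE.search(line)
        if hang:
            hangs.append({"ecode": hang["ecode"],
                          "process": hang["proc"],
                          "pid": int(hang["pid"])})
        reset = RESET_RE.search(line)
        if reset:
            engines.add(reset["engine"])
    return hangs, sorted(engines)

SAMPLE = """\
kernel: [26204.741232] i915 0000:00:02.0: [drm] Resetting vcs1 for preemption time out
kernel: [26204.775905] i915 0000:00:02.0: [drm] GPU HANG: ecode 12:10:28fffffd, in sample_decode [36374]
"""
hangs, engines = summarize_hangs(SAMPLE)
print(hangs, engines)
```

The same function can be fed the full text of a saved dmesg capture.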

Version: OneVPL GPU Runtime 2022Q3 release - 22.5.4. (Released on Oct 14):

oneVPL GPU Runtime: https://github.com/oneapi-src/oneVPL-intel-gpu/releases/tag/intel-onevpl-22.5.4
oneVPL Dispatcher and Samples: https://github.com/oneapi-src/oneVPL/releases/tag/v2022.2.2
Driver: https://github.com/intel/media-driver/releases/tag/intel-media-22.5.4
Gmmlib: https://github.com/intel/gmmlib/releases/tag/intel-gmmlib-22.2.0
libva: https://github.com/intel/libva/releases/tag/2.16.0
libva-utils: https://github.com/intel/libva-utils/releases/tag/2.16.0

Ubuntu 22.04 LTS
Kernel:
- i5 Alder Lake: 5.18.0-051800rc1-generic
Type: N/A Mobo: Intel model: PELM12HBI516 v: M47315-301 serial: BTHB2120092B UEFI: Intel v: HBADL357.0038.2022.0310.0956 date: 03/10/2022
CPU: 10-core (2-mt/8-st) 12th Gen Intel Core i5-1235U (-MST AMCP-)
speed/min/max: 614/400/4400:3300 MHz Kernel: 5.18.0-051800-generic x86_64 Up: 6d 20h 41m
Mem: 748.1/15577.8 MiB (4.8%) Storage: 238.47 GiB (12.0% used) Procs: 233 Shell: Bash

- i7 Tiger Lake: 5.15.0-46-generic
Type: Desktop Mobo: ASRock model: NUC-TGL serial: M8P-DC000400058
UEFI: American Megatrends LLC. v: P1.10 date: 12/24/2020
CPU: quad core 11th Gen Intel Core i7-1165G7 (-MT MCP-) speed/min/max: 775/400/2701 MHz
Kernel: 5.15.0-46-generic x86_64 Up: 18h 34m Mem: 2971.0/15332.4 MiB (19.4%)
Storage: 232.89 GiB (22.4% used) Procs: 210 Shell: Bash

RemyaP_Intel · ‎11-22-2022

Hi,

Thank you for posting in Intel Communities.

Thanks for sharing the details with us. We are trying to reproduce the issue. Could you please also share the input file used?

Regards,

Remya Premdas

njean · ‎11-22-2022

Hi,

Here is the input file used.

Thanks,

Nicolas

RemyaP_Intel · ‎11-28-2022

Hi,

Thanks for sharing the input file. We are trying to reproduce your issue. We'll get back soon with an update.

Regards,

Remya Premdas

RemyaP_Intel · ‎12-07-2022

Hi,

We are still working on your issue. Sorry for the delay.

Regards,

Remya Premdas

RemyaP_Intel · ‎12-11-2022

Hi,

We tried the same sample_decode command with the input file present in the oneVPL repo at /examples/content/cars_320x240.h265. It ran without any hang or errors.

We are checking the same with the input file you have shared. Meanwhile, could you please try running on your machine with the cars_320x240.h265 input file and see if there is any GPU hang or errors?

Regards,

Remya Premdas

njean · ‎12-12-2022

Hello Remya,

/examples/content/cars_320x240.h265 runs without a problem because it is a 444 8-bit stream.

The bug occurs with 444 10-bit streams.

Thanks,

Nicolas

RemyaP_Intel · ‎12-18-2022

Hi,

Sorry for the delay. Our team is working on this issue internally and will get back to you soon with an update.

Regards,

Remya Premdas

njean · ‎02-22-2023

Hello Remya, do you have any update on this issue? Have you been able to reproduce it on your side?

RemyaP_Intel · ‎02-28-2023

Hi,

Sorry for the delay. After analysis, our development team has confirmed that the issue is with the driver and not VPL. They are working on fixing it, and currently we do not have an ETA for this.

Regards,

Remya Premdas

njean · ‎03-08-2023

Hello Remya,

We went back to the issue and retested with the latest oneVPL release (v2023.1.0):

oneVPL GPU Runtime: https://github.com/oneapi-src/oneVPL-intel-gpu/releases/tag/intel-onevpl-22.6.5

oneVPL Dispatcher and Samples: https://github.com/oneapi-src/oneVPL/releases/tag/v2023.1.0

Driver: https://vpg-src1:8443/projects/MVX3/repos/intel-media-driver/browse?at=refs%2Fheads%2Fintel-media-Matrox-22.6.6

Gmmlib: https://github.com/intel/gmmlib/releases/tag/intel-gmmlib-22.3.3

libva: https://github.com/intel/libva/releases/tag/2.17.0

libva-utils: https://github.com/intel/libva-utils/releases/tag/2.17.1

We also used the latest kernel available (6.2.0-060200-generic_6.2.0).

The GPU hang is still happening.

We also reproduced the problem with 444 8-bit streams (AYUV).

We have also observed that the issue is much easier to reproduce with streams that contain B frames. Without B frames it is more difficult to reproduce, but we still observe errors in the i915 driver.

RemyaP_Intel · ‎03-09-2023

Hi,

Thanks for sharing the observations with us. We will share this with our team. As said earlier, currently we do not have an ETA for this fix. We are following up on this issue with our internal team. We will let you know if there are any updates.

Regards,

Remya Premdas

RemyaP_Intel · ‎06-07-2023

Hi,

Thanks for your patience. This is to inform you that we found the hang does not occur on Ubuntu 20.04, and that we are working on fixing the hang that occurs with this example on Ubuntu 22.04. We will let you know when there is any update.

Regards,

Remya Premdas

njean · ‎07-12-2023

Hello Remya,

I'm a bit surprised by your observation about the Ubuntu version. On our side, I'm pretty sure that the first time we had the problem we were using Ubuntu 20.04. It looks more related to the Intel i915 driver than to the Ubuntu version.

Recently, we retried the tests and reproduced it with the following:

Processor: RL: i5-1335U
The error in the Linux dmesg output still points to a GPU hang in the i915 driver of the Intel Gfx stack:
i915 kernel messages:
kernel: [26204.741232] i915 0000:00:02.0: [drm] Resetting vcs1 for preemption time out
kernel: [26204.741258] i915 0000:00:02.0: [drm] sample_decode[36374] context reset due to GPU hang
kernel: [26204.775905] i915 0000:00:02.0: [drm] GPU HANG: ecode 12:10:28fffffd, in sample_decode [36374]
kernel: [26213.573237] i915 0000:00:02.0: [drm] Resetting vcs1 for preemption time out
kernel: [26213.573256] i915 0000:00:02.0: [drm] sample_decode[36373] context reset due to GPU hang
kernel: [26213.613847] i915 0000:00:02.0: [drm] GPU HANG: ecode 12:10:28fffffd, in sample_decode [36373]

RemyaP_Intel · ‎07-18-2023

Hi,

Apologies for the delay in getting back.

We initially reported that Ubuntu 20.04 is good and that the hang occurs on 22.04, which we are working on. However, with all the variables at play, we have not been able to determine why we cannot reproduce the hang on 20.04.

We do know that there is a root cause in our hardware implementation on TGL and ADL-S. We discovered the issue and made a change in the hardware functionality to avoid the hang. However, TGL and ADL-S do not support this new functionality. (See below for how to execute the 2-stream command on TGL and ADL-S.)

The new functionality is supported on ADL-P.

So, in summary:

- You should be able to run your command on ADL-P without a hang. (Please let us know if you do get a hang in the 2-stream case. As stated, there are many variables at play in and between the OS and the platform.)
- You can run your command on TGL and ADL-S with modifications:
  - The hang is often triggered when scalability and MMC are enabled. If you want to decode 2 streams simultaneously, first disable scalability and MMC. (Later platforms will support the new functionality that was implemented to avoid this hang, so there is no need to disable these features on them.)

Regards,

Remya Premdas

njean · ‎07-21-2023

Thanks Remya for your answer.

It is really great that you have a workaround for us. We would like to try it on our side, but we need more details about the scalability and MMC features you mention.

Would you have more details on those features and how we should disable them?

Regards,

Nicolas

RemyaP_Intel · ‎07-28-2023

Hi,

Please follow the steps below to disable scalability and MMC:

1. Go to /etc/

2. Modify igfx_user_feature.txt and igfx_user_feature_next.txt:

igfx_user_feature.txt:

Add the values below under [KEY]:

[KEY]
0x00000001
UFKEY_INTERNAL\LibVa
.......

[VALUE]
Enable VP MMC
4
0

[VALUE]
Enable Codec MMC
4
0

[VALUE]
Enable Vebox Decompress
4
0

[VALUE]
Enable Media RenderEngine MMC
4
0

[VALUE]
Enable HCP Scalability Decode
4
0

igfx_user_feature_next.txt:

Add the lines below under [config]:

Enable HCP Scalability Decode=0
Enable VP MMC=0
Enable Codec MMC=0
Enable Media RenderEngine MMC=0

3. How to double-check that the keys were set successfully:

Check the [report] section in igfx_user_feature.txt; the key below should be 0:

Decode MMC In Use=0x0
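
The check in step 3 can also be scripted. The sketch below is a minimal illustration that assumes the [report] section holds plain `key=value` lines, as in the single example given in this post; the full file format is not documented here, and the helper names `read_section` and `decode_mmc_disabled` are hypothetical.

```python
def read_section(text, section):
    """Collect key=value pairs that appear under [section] in INI-like text."""
    values, current = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("[") and line.endswith("]"):
            # A new bracketed header such as [report] or [config] starts here.
            current = line[1:-1].lower()
        elif current == section.lower() and "=" in line:
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip()
    return values

def decode_mmc_disabled(report_text):
    """True when [report] shows 'Decode MMC In Use' with a value of 0."""
    value = read_section(report_text, "report").get("Decode MMC In Use")
    if value is None:
        return False
    try:
        return int(value, 0) == 0  # base 0 accepts both "0" and "0x0"
    except ValueError:
        return False

SAMPLE = "[report]\nDecode MMC In Use=0x0\n"
print(decode_mmc_disabled(SAMPLE))
```

On a real system the text would come from the file named in step 3, for example `open("/etc/igfx_user_feature.txt").read()`. The same `read_section` helper can list the [config] entries of igfx_user_feature_next.txt to confirm they are all 0.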

Let us know if you face any issues.

--

Thanks,

Remya Premdas

njean · ‎08-03-2023

Hello Remya,

We have tried your modification to the configuration files and we are no longer able to reproduce the hang. This is really, really, really great!

I don't know what those features are or what the impact of disabling them is. Do you have some details or a link that explains how they work?

Thank you very much!

RemyaP_Intel · ‎08-15-2023

Hi,

Glad to know that your issue was resolved. Unfortunately, we do not have documentation related to the scalability and MMC features.

If this workaround has helped you, kindly make sure to accept this as a solution. This would help others with similar issues.

Also, let me know if I can go ahead and close this thread.

--

Regards,

Remya Premdas

RemyaP_Intel · ‎08-22-2023

Hi,

As confirmed, we are closing this thread. If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.

Regards,

Remya Premdas