
GPU Max 1100: temperature reported as N/A and non-fatal errors when running xpu-smi?

Tore

We are creating an agent to pull data from XPUs. I have a node running Ubuntu 22.04.5 LTS:

root@n022:~# uname -a
Linux n022 5.15.0-122-generic #132-Ubuntu SMP Thu Aug 29 13:45:52 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

 

Hardware: 
root@n022:~# clinfo -l
Platform #0: Intel(R) OpenCL Graphics
+-- Device #0: Intel(R) Data Center GPU Max 1100
`-- Device #1: Intel(R) Data Center GPU Max 1100

 

Driver:

root@n022:~# dkms status | grep -i i9
AUXILIARY_BUS is enabled for 5.15.0-122-generic.
intel-i915-dkms/1.23.10.72.231129.76, 5.15.0-122-generic, x86_64: installed

 

The temperature sensors return N/A for both GPU core and GPU memory temperature:

 

root@n022:~# xpu-smi dump -d "-1" -m1,2,3,4,5,18 -i 1 -n1
Timestamp, DeviceId, GPU Power (W), GPU Frequency (MHz), GPU Core Temperature (Celsius Degree), GPU Memory Temperature (Celsius Degree), GPU Memory Utilization (%), GPU Memory Used (MiB)
11:12:08.956, 0, 52.37, 0, N/A, N/A, 0.05, 28.13
11:12:08.956, 1, 49.37, 0, N/A, N/A, 0.05, 27.99
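
For the agent, here is a minimal sketch of how this dump output could be parsed so that N/A readings do not break numeric handling. It assumes the CSV layout is exactly what the command above printed (comma plus space separators, a single header row); the function name `parse_dump` is hypothetical, and in the real agent the text would come from running `xpu-smi dump` rather than from a pasted sample.

```python
import csv
import io

# Sample copied from the xpu-smi dump output shown above.
SAMPLE = """\
Timestamp, DeviceId, GPU Power (W), GPU Frequency (MHz), GPU Core Temperature (Celsius Degree), GPU Memory Temperature (Celsius Degree), GPU Memory Utilization (%), GPU Memory Used (MiB)
11:12:08.956, 0, 52.37, 0, N/A, N/A, 0.05, 28.13
11:12:08.956, 1, 49.37, 0, N/A, N/A, 0.05, 27.99
"""


def parse_dump(text):
    """Parse xpu-smi dump CSV text into a list of dicts.

    'N/A' values become None so callers can tell a missing sensor
    reading apart from a real value of zero.
    """
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    rows = [row for row in reader if row]  # skip empty lines
    header = [name.strip() for name in rows[0]]
    records = []
    for row in rows[1:]:
        record = {}
        for name, raw in zip(header, row):
            value = raw.strip()
            if name == "Timestamp":
                record[name] = value          # keep as string
            elif name == "DeviceId":
                record[name] = int(value)
            elif value == "N/A":
                record[name] = None           # sensor not available
            else:
                record[name] = float(value)
        records.append(record)
    return records


records = parse_dump(SAMPLE)
for rec in records:
    missing = [name for name, value in rec.items() if value is None]
    print(rec["DeviceId"], rec["GPU Power (W)"], "missing:", missing)
```

Mapping N/A to None rather than 0 keeps the agent from reporting a misleading temperature of zero degrees.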

 

Is this a known limitation, and is a fix planned soon?

 

Lastly, when I run xpu-smi, the following non-fatal errors appear in syslog. Are these known issues?

 

The example code works well. The host has a Xeon Max 8480+.

 

[84460.492225] i915 0000:9a:00.0: [drm] *ERROR* [Hardware Error]: GT0 detected NONFATAL error GFX_MSTR_INTR:0x08000000
[84460.504560] i915 0000:9a:00.0: [drm] *ERROR* [Hardware Error]: GT0 detected DEV_ERR_STAT_REG_NONFATAL:0x00010000
[84460.516620] i915 0000:9a:00.0: [drm] *ERROR* [Hardware Error]: GT0 detected SOC NONFATAL error
[84460.526912] i915 0000:9a:00.0: [drm] *ERROR* [Hardware Error]: GT0 detected SOC_GLOBAL_ERR_STAT_MASTER_REG_NONFATAL:0x00000002
[84460.540884] i915 0000:9a:00.0: [drm] *ERROR* [Hardware Error]: GT0 detected SOC_GLOBAL_ERR_STAT_SLAVE_REG_NONFATAL:0x00010000
[84460.554187] i915 0000:9a:00.0: [drm] *ERROR* GT0 [INTERRUPT] Invalid HBM SS3: Channel7 SOC NONFATAL error
[84460.565586] i915 0000:9a:00.0: [drm] *ERROR* [Hardware Error]: GT0 detected NONFATAL error GFX_MSTR_INTR:0x08000000
[84460.577907] i915 0000:9a:00.0: [drm] *ERROR* [Hardware Error]: GT0 detected DEV_ERR_STAT_REG_NONFATAL:0x00010000
[84460.589952] i915 0000:9a:00.0: [drm] *ERROR* [Hardware Error]: GT0 detected SOC NONFATAL error
[84460.600246] i915 0000:9a:00.0: [drm] *ERROR* [Hardware Error]: GT0 detected SOC_GLOBAL_ERR_STAT_MASTER_REG_NONFATAL:0x00000002
[84460.614211] i915 0000:9a:00.0: [drm] *ERROR* [Hardware Error]: GT0 detected SOC_GLOBAL_ERR_STAT_SLAVE_REG_NONFATAL:0x00010000
[84460.627510] i915 0000:9a:00.0: [drm] *ERROR* GT0 [INTERRUPT] Invalid HBM SS3: Channel7 SOC NONFATAL error
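
To make the repetition in these messages easier to see, here is a rough sketch that groups the non-fatal i915 lines by PCI address and error register. The regular expressions are assumptions derived only from the lines pasted above, so other driver versions may format these messages differently; the function name `summarize` is hypothetical.

```python
import re
from collections import Counter

# Patterns inferred from the log lines in this post (an assumption,
# not a documented i915 log format).
LINE_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\] i915 (?P<pci>[0-9a-f:.]+): \[drm\] \*ERROR\* (?P<msg>.*)"
)
REG_RE = re.compile(r"(?P<reg>[A-Z0-9_]+):(?P<val>0x[0-9a-fA-F]+)")

# A few lines copied from the syslog excerpt above.
SAMPLE = """\
[84460.492225] i915 0000:9a:00.0: [drm] *ERROR* [Hardware Error]: GT0 detected NONFATAL error GFX_MSTR_INTR:0x08000000
[84460.504560] i915 0000:9a:00.0: [drm] *ERROR* [Hardware Error]: GT0 detected DEV_ERR_STAT_REG_NONFATAL:0x00010000
[84460.554187] i915 0000:9a:00.0: [drm] *ERROR* GT0 [INTERRUPT] Invalid HBM SS3: Channel7 SOC NONFATAL error
[84460.565586] i915 0000:9a:00.0: [drm] *ERROR* [Hardware Error]: GT0 detected NONFATAL error GFX_MSTR_INTR:0x08000000
"""


def summarize(text):
    """Count NONFATAL i915 messages per (PCI address, register, value).

    Messages without a REGISTER:0x... pair are keyed by their full text.
    """
    counts = Counter()
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if not match or "NONFATAL" not in match.group("msg"):
            continue
        msg = match.group("msg")
        reg = REG_RE.search(msg)
        if reg:
            key = (match.group("pci"), reg.group("reg"), reg.group("val"))
        else:
            key = (match.group("pci"), msg, None)
        counts[key] += 1
    return counts


counts = summarize(SAMPLE)
for (pci, name, value), n in counts.most_common():
    print(n, pci, name, value)
```

Run against the full excerpt, this kind of grouping shows that the same six messages appear twice for device 0000:9a:00.0.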

 

Best regards,

ToreL

 

 

 
