GPU Compute Software
Ask questions about Intel® Graphics Compute software technologies, such as OpenCL* GPU driver and oneAPI Level Zero

GPU Max 1100 temperature returns N/A and nonfatal errors when running xpu-smi?

Tore

We are creating an agent to pull data from XPUs. The node I am testing on runs Ubuntu 22.04.5 LTS:

root@n022:~# uname -a
Linux n022 5.15.0-122-generic #132-Ubuntu SMP Thu Aug 29 13:45:52 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

 

Hardware: 
root@n022:~# clinfo -l
Platform #0: Intel(R) OpenCL Graphics
+-- Device #0: Intel(R) Data Center GPU Max 1100
`-- Device #1: Intel(R) Data Center GPU Max 1100

 

Driver:

root@n022:~# dkms status | grep -i i9
AUXILIARY_BUS is enabled for 5.15.0-122-generic.
intel-i915-dkms/1.23.10.72.231129.76, 5.15.0-122-generic, x86_64: installed

 

xpu-smi returns N/A for both the GPU core temperature and the GPU memory temperature:

 

root@n022:~# xpu-smi dump -d "-1" -m1,2,3,4,5,18 -i 1 -n1
Timestamp, DeviceId, GPU Power (W), GPU Frequency (MHz), GPU Core Temperature (Celsius Degree), GPU Memory Temperature (Celsius Degree), GPU Memory Utilization (%), GPU Memory Used (MiB)
11:12:08.956, 0, 52.37, 0, N/A, N/A, 0.05, 28.13
11:12:08.956, 1, 49.37, 0, N/A, N/A, 0.05, 27.99
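
For context on the agent side, here is a minimal Python sketch of how this output could be parsed. It assumes the comma-separated layout shown above stays stable, and the function name parse_dump is hypothetical. N/A is mapped to None so that a missing sensor is not mistaken for a reading of 0.

```python
import csv
import io

# Sample copied from the xpu-smi dump output above.
SAMPLE = """\
Timestamp, DeviceId, GPU Power (W), GPU Frequency (MHz), GPU Core Temperature (Celsius Degree), GPU Memory Temperature (Celsius Degree), GPU Memory Utilization (%), GPU Memory Used (MiB)
11:12:08.956, 0, 52.37, 0, N/A, N/A, 0.05, 28.13
11:12:08.956, 1, 49.37, 0, N/A, N/A, 0.05, 27.99
"""


def parse_dump(text):
    """Parse xpu-smi dump text into a list of dicts (hypothetical helper).

    Assumes the first non-empty line is the header and that every metric
    other than Timestamp and DeviceId is numeric or the literal N/A.
    """
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    header = [name.strip() for name in next(reader)]
    rows = []
    for fields in reader:
        if not fields:  # skip blank lines
            continue
        row = {}
        for name, raw in zip(header, fields):
            value = raw.strip()
            if name == "Timestamp":
                row[name] = value
            elif name == "DeviceId":
                row[name] = int(value)
            elif value == "N/A":
                row[name] = None  # sensor not reported
            else:
                row[name] = float(value)
        rows.append(row)
    return rows


rows = parse_dump(SAMPLE)
for row in rows:
    missing = [name for name, value in row.items() if value is None]
    print(row["DeviceId"], "missing metrics:", missing)
```

With this approach the agent can report which metrics are unavailable per device instead of failing on the N/A strings.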

 

Is a fix for the missing temperature readings planned?

 

Lastly, when I run xpu-smi, the non-fatal errors below appear in syslog. Are these known issues?

The example code works well. The host has a Xeon Max 8480+.

 

[84460.492225] i915 0000:9a:00.0: [drm] *ERROR* [Hardware Error]: GT0 detected NONFATAL error GFX_MSTR_INTR:0x08000000
[84460.504560] i915 0000:9a:00.0: [drm] *ERROR* [Hardware Error]: GT0 detected DEV_ERR_STAT_REG_NONFATAL:0x00010000
[84460.516620] i915 0000:9a:00.0: [drm] *ERROR* [Hardware Error]: GT0 detected SOC NONFATAL error
[84460.526912] i915 0000:9a:00.0: [drm] *ERROR* [Hardware Error]: GT0 detected SOC_GLOBAL_ERR_STAT_MASTER_REG_NONFATAL:0x00000002
[84460.540884] i915 0000:9a:00.0: [drm] *ERROR* [Hardware Error]: GT0 detected SOC_GLOBAL_ERR_STAT_SLAVE_REG_NONFATAL:0x00010000
[84460.554187] i915 0000:9a:00.0: [drm] *ERROR* GT0 [INTERRUPT] Invalid HBM SS3: Channel7 SOC NONFATAL error
[84460.565586] i915 0000:9a:00.0: [drm] *ERROR* [Hardware Error]: GT0 detected NONFATAL error GFX_MSTR_INTR:0x08000000
[84460.577907] i915 0000:9a:00.0: [drm] *ERROR* [Hardware Error]: GT0 detected DEV_ERR_STAT_REG_NONFATAL:0x00010000
[84460.589952] i915 0000:9a:00.0: [drm] *ERROR* [Hardware Error]: GT0 detected SOC NONFATAL error
[84460.600246] i915 0000:9a:00.0: [drm] *ERROR* [Hardware Error]: GT0 detected SOC_GLOBAL_ERR_STAT_MASTER_REG_NONFATAL:0x00000002
[84460.614211] i915 0000:9a:00.0: [drm] *ERROR* [Hardware Error]: GT0 detected SOC_GLOBAL_ERR_STAT_SLAVE_REG_NONFATAL:0x00010000
[84460.627510] i915 0000:9a:00.0: [drm] *ERROR* GT0 [INTERRUPT] Invalid HBM SS3: Channel7 SOC NONFATAL error
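
To track how often these occur, the agent could also scan the kernel log. The Python sketch below is only an illustration: the regular expressions are inferred from the lines above, not from any documented i915 log format, and the function name parse_i915_errors is hypothetical.

```python
import re
from collections import Counter

# Three representative lines copied from the syslog excerpt above.
SAMPLE = """\
[84460.492225] i915 0000:9a:00.0: [drm] *ERROR* [Hardware Error]: GT0 detected NONFATAL error GFX_MSTR_INTR:0x08000000
[84460.504560] i915 0000:9a:00.0: [drm] *ERROR* [Hardware Error]: GT0 detected DEV_ERR_STAT_REG_NONFATAL:0x00010000
[84460.554187] i915 0000:9a:00.0: [drm] *ERROR* GT0 [INTERRUPT] Invalid HBM SS3: Channel7 SOC NONFATAL error
"""

# Assumed line shape: [timestamp] i915 <pci address>: [drm] *ERROR* <message>
LINE_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+i915\s+(?P<pci>[0-9a-fA-F:.]+):\s+"
    r"\[drm\]\s+\*ERROR\*\s+(?P<msg>.*)$"
)
# Optional REGISTER:0xVALUE pair inside the message.
REG_RE = re.compile(r"(?P<reg>[A-Z0-9_]+):(?P<val>0x[0-9a-fA-F]+)")


def parse_i915_errors(text):
    """Return one dict per i915 NONFATAL error line (hypothetical helper)."""
    events = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match or "NONFATAL" not in match.group("msg"):
            continue
        reg = REG_RE.search(match.group("msg"))
        events.append({
            "timestamp": float(match.group("ts")),
            "pci": match.group("pci"),
            "register": reg.group("reg") if reg else None,
            "value": int(reg.group("val"), 16) if reg else None,
            "message": match.group("msg"),
        })
    return events


events = parse_i915_errors(SAMPLE)
per_device = Counter(event["pci"] for event in events)
print(per_device)
```

Counting per PCI address would show whether the errors come from one card only (here every line refers to 0000:9a:00.0) or from both.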

 

Brgds,

ToreL

 

 

 
