We are building an agent to pull data from XPUs. The node runs Ubuntu 22.04.5 LTS:
root@n022:~# uname -a
Linux n022 5.15.0-122-generic #132-Ubuntu SMP Thu Aug 29 13:45:52 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Hardware:
root@n022:~# clinfo -l
Platform #0: Intel(R) OpenCL Graphics
+-- Device #0: Intel(R) Data Center GPU Max 1100
`-- Device #1: Intel(R) Data Center GPU Max 1100
Driver:
root@n022:~# dkms status | grep -i i9
AUXILIARY_BUS is enabled for 5.15.0-122-generic.
intel-i915-dkms/1.23.10.72.231129.76, 5.15.0-122-generic, x86_64: installed
The temperature sensor readings come back as N/A:
root@n022:~# xpu-smi dump -d "-1" -m1,2,3,4,5,18 -i 1 -n1
Timestamp, DeviceId, GPU Power (W), GPU Frequency (MHz), GPU Core Temperature (Celsius Degree), GPU Memory Temperature (Celsius Degree), GPU Memory Utilization (%), GPU Memory Used (MiB)
11:12:08.956, 0, 52.37, 0, N/A, N/A, 0.05, 28.13
11:12:08.956, 1, 49.37, 0, N/A, N/A, 0.05, 27.99
Is a fix for the N/A temperature readings planned soon?
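Until the temperatures are reported, the agent has to tolerate N/A fields. A minimal Python sketch, assuming the comma-separated `xpu-smi dump` format shown above (one header line, then one row per device); `parse_xpu_smi_dump` is a hypothetical helper, and it only parses captured text rather than invoking xpu-smi itself:

```python
import csv
from typing import Dict, List, Optional


def _to_number(value: str) -> Optional[float]:
    # xpu-smi prints "N/A" for metrics it cannot read (here: the temperatures)
    if value == "N/A":
        return None
    return float(value)


def parse_xpu_smi_dump(text: str) -> List[Dict[str, object]]:
    """Parse captured `xpu-smi dump` output into one dict per sample row.

    Assumes the first non-empty line is the column header, as in the
    output above. N/A metrics become None instead of raising on float().
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # skipinitialspace drops the space xpu-smi prints after each comma
    reader = csv.reader(lines, skipinitialspace=True)
    header = next(reader)
    samples = []
    for fields in reader:
        sample: Dict[str, object] = {}
        for name, value in zip(header, fields):
            if name == "Timestamp":
                sample[name] = value          # keep the wall-clock string as-is
            elif name == "DeviceId":
                sample[name] = int(value)
            else:
                sample[name] = _to_number(value)
        samples.append(sample)
    return samples


SAMPLE = """Timestamp, DeviceId, GPU Power (W), GPU Frequency (MHz), GPU Core Temperature (Celsius Degree), GPU Memory Temperature (Celsius Degree), GPU Memory Utilization (%), GPU Memory Used (MiB)
11:12:08.956, 0, 52.37, 0, N/A, N/A, 0.05, 28.13
11:12:08.956, 1, 49.37, 0, N/A, N/A, 0.05, 27.99
"""

if __name__ == "__main__":
    for s in parse_xpu_smi_dump(SAMPLE):
        print(s["DeviceId"], s["GPU Power (W)"], s["GPU Core Temperature (Celsius Degree)"])
```

Mapping N/A to None keeps the agent running and lets it distinguish "no reading" from a real 0, which matters here because GPU Frequency legitimately reads 0 on idle devices.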
The example code works well. The host has a Xeon Max 8480+.
Lastly, when I run xpu-smi, the following non-fatal errors appear in syslog. Are these known issues?
[84460.492225] i915 0000:9a:00.0: [drm] *ERROR* [Hardware Error]: GT0 detected NONFATAL error GFX_MSTR_INTR:0x08000000
[84460.504560] i915 0000:9a:00.0: [drm] *ERROR* [Hardware Error]: GT0 detected DEV_ERR_STAT_REG_NONFATAL:0x00010000
[84460.516620] i915 0000:9a:00.0: [drm] *ERROR* [Hardware Error]: GT0 detected SOC NONFATAL error
[84460.526912] i915 0000:9a:00.0: [drm] *ERROR* [Hardware Error]: GT0 detected SOC_GLOBAL_ERR_STAT_MASTER_REG_NONFATAL:0x00000002
[84460.540884] i915 0000:9a:00.0: [drm] *ERROR* [Hardware Error]: GT0 detected SOC_GLOBAL_ERR_STAT_SLAVE_REG_NONFATAL:0x00010000
[84460.554187] i915 0000:9a:00.0: [drm] *ERROR* GT0 [INTERRUPT] Invalid HBM SS3: Channel7 SOC NONFATAL error
[84460.565586] i915 0000:9a:00.0: [drm] *ERROR* [Hardware Error]: GT0 detected NONFATAL error GFX_MSTR_INTR:0x08000000
[84460.577907] i915 0000:9a:00.0: [drm] *ERROR* [Hardware Error]: GT0 detected DEV_ERR_STAT_REG_NONFATAL:0x00010000
[84460.589952] i915 0000:9a:00.0: [drm] *ERROR* [Hardware Error]: GT0 detected SOC NONFATAL error
[84460.600246] i915 0000:9a:00.0: [drm] *ERROR* [Hardware Error]: GT0 detected SOC_GLOBAL_ERR_STAT_MASTER_REG_NONFATAL:0x00000002
[84460.614211] i915 0000:9a:00.0: [drm] *ERROR* [Hardware Error]: GT0 detected SOC_GLOBAL_ERR_STAT_SLAVE_REG_NONFATAL:0x00010000
[84460.627510] i915 0000:9a:00.0: [drm] *ERROR* GT0 [INTERRUPT] Invalid HBM SS3: Channel7 SOC NONFATAL error
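If the agent should also watch for these messages, a minimal Python sketch that counts i915 `[drm] *ERROR*` lines per PCI address and message text. It assumes the kernel log line format shown above; `count_i915_errors` is a hypothetical helper, and it only matches text, without interpreting the register values:

```python
import re
from collections import Counter
from typing import Iterable

# Matches kernel log lines such as:
# [84460.492225] i915 0000:9a:00.0: [drm] *ERROR* <message>
I915_ERROR = re.compile(
    r"^\[\s*\d+\.\d+\]\s+i915\s+"
    r"(?P<pci>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]):\s+"
    r"\[drm\]\s+\*ERROR\*\s+(?P<msg>.*)$"
)


def count_i915_errors(lines: Iterable[str]) -> Counter:
    """Count i915 *ERROR* messages keyed by (PCI address, message text)."""
    counts: Counter = Counter()
    for line in lines:
        m = I915_ERROR.match(line.strip())
        if m:  # non-i915 lines are ignored
            counts[(m.group("pci"), m.group("msg"))] += 1
    return counts
```

Keying on the PCI address (here 0000:9a:00.0) lets the agent attribute repeated errors, such as the HBM Channel7 message above, to a specific card.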
Brgds,
ToreL