Hello, we have encountered a device acquisition failure on Gaudi-2.
On our on-premise Gaudi-2 server, 2 cards out of 8 are not working as intended. Any access to the device fails with the message synStatus=26 Device acquire failed.
In such cases, dmesg reports:
[1194030.096641] habanalabs 0000:43:00.0: PMMU_AXI_ERR_RSP: err cause: page fault
[1194030.097687] habanalabs 0000:43:00.0: PMMU page fault on va 0xfff0000100200980
[1194030.098693] habanalabs 0000:43:00.0: PMMU_AXI_ERR_RSP: err cause: slave error
[1194030.148378] habanalabs 0000:43:00.0: Going to reset device
[1194030.148405] habanalabs 0000:43:00.0: PDMA_CH1_AXI_ERR_RSP: err cause: qman sei intr
[1194030.149384] habanalabs 0000:43:00.0: PDMA_0-RAZWI SHARED RR HBW RD error, address 0x1fffffff0001380, Initiator coordinates 0x4
[1194030.154116] habanalabs 0000:43:00.0: PMMU0_PAGE_FAULT_WR_PERM: err cause: page fault
[1194030.155179] habanalabs 0000:43:00.0: PMMU0_PAGE_FAULT_WR_PERM: err cause: slave error
[1194030.158437] habanalabs 0000:43:00.0: PDMA1_QM: stream0. err cause: CQ AXI HBW error, qid = 4
[1194030.169474] habanalabs 0000:cc:00.0: Card 4 Port 6: link down
[1194030.169482] habanalabs 0000:cc:00.0: Card 4 Port 7: link down
[1194030.171222] habanalabs 0000:cc:00.0: Card 4 Port 7: link up
[1194030.171385] habanalabs 0000:cc:00.0: Card 4 Port 7: link down
[1194030.173414] habanalabs 0000:cc:00.0: Card 4 Port 9: link down
[1194030.175515] habanalabs 0000:19:00.0: Card 2 Port 16: link down
[1194030.175536] habanalabs 0000:19:00.0: Card 2 Port 17: link down
[1194030.177023] habanalabs 0000:19:00.0: Card 2 Port 17: link up
[1194030.177381] habanalabs 0000:19:00.0: Card 2 Port 17: link down
[1194030.179498] habanalabs 0000:19:00.0: Card 2 Port 18: link down
[1194030.187405] habanalabs 0000:b3:00.0: Card 6 Port 16: link down
[1194030.187421] habanalabs 0000:b3:00.0: Card 6 Port 17: link down
[1194030.189120] habanalabs 0000:b3:00.0: Card 6 Port 17: link up
[1194030.189388] habanalabs 0000:b3:00.0: Card 6 Port 17: link down
[1194030.191409] habanalabs 0000:b3:00.0: Card 6 Port 18: link down
[1194030.253917] habanalabs 0000:43:00.0: Killing CS 1.1
[1194030.253949] habanalabs 0000:43:00.0: CS 1 has been aborted while user process is waiting for it
[1194035.678482] habanalabs 0000:19:00.0: Card 2 Port 19: link down
[1194035.680487] habanalabs 0000:19:00.0: Card 2 Port 19: link up
[1194035.685452] habanalabs 0000:b3:00.0: Card 6 Port 19: link down
[1194035.687223] habanalabs 0000:b3:00.0: Card 6 Port 19: link up
[1194035.799391] habanalabs 0000:cc:00.0: Card 4 Port 3: link down
[1194035.801186] habanalabs 0000:cc:00.0: Card 4 Port 3: link up
[1194042.502358] habanalabs 0000:43:00.0: Driver version: 1.20.0-bd87f71
[1194042.555186] habanalabs 0000:43:00.0: Loading secured firmware to device, may take some time...
[1194042.680209] habanalabs 0000:43:00.0: preboot full version: 'Preboot version hl-gaudi2-1.20.0-fw-58.0.0-sec-9 (Jan 16 2025 - 17:51:04)'
[1194057.033814] habanalabs 0000:43:00.0: boot-fit version 58.0.0-sec-9
[1194059.622128] habanalabs 0000:43:00.0: Successfully loaded firmware to device
[1194060.706321] habanalabs 0000:43:00.0: Linux version 58.0.0-sec-9
[1194071.475317] habanalabs 0000:43:00.0: Successfully finished resetting the 0000:43:00.0 device
We can see the Uncor-Events counter increasing in hl-smi. In hl-smi, the malfunctioning devices have indices 5 and 6. We tried resetting the devices with hl-smi -r -i 0000:43:00.0 and 0000:44:00.0 (for devices 5 and 6), but that did not help.
Our SynapseAI driver version is 1.20.0 (1.20.0-bd87f71 on hl-smi), and we are using vault.habana.ai/gaudi-docker/1.20.0/ubuntu22.04/habanalabs/pytorch-installer-2.6.0 as the docker image.
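For anyone triaging a similar log, the dmesg excerpt above can be condensed with a couple of small filters. This is only a sketch: the function names habana_reset_devices and habana_error_causes are made up for illustration, and the patterns assume the exact message wording shown in this post ("Going to reset device", "err cause: ..."), which may differ in other driver versions.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of any Habana tooling). Both read dmesg
# text on stdin and assume the message format shown in this thread.

# Print the PCI addresses for which the driver logged a device reset.
habana_reset_devices() {
    grep 'habanalabs' \
        | grep 'Going to reset device' \
        | grep -oE '[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9]' \
        | sort -u
}

# Print the distinct "err cause" strings logged for one PCI address ($1).
# Anything after a comma (for example ", qid = 4") is dropped.
habana_error_causes() {
    grep -F "habanalabs $1:" \
        | sed -nE 's/.*err cause: ([^,]*).*/\1/p' \
        | LC_ALL=C sort -u
}

# Example usage (reading the kernel log may require root):
#   dmesg | habana_reset_devices
#   dmesg | habana_error_causes 0000:43:00.0
```

Run against the log in this post, the first filter should point at 0000:43:00.0, the only device with a "Going to reset device" line; the other addresses only report link up/down events.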
We resolved this issue with a reboot, and have been observing stable performance for ~18 hours now.
If anyone stumbles upon a similar problem, rebooting the server might be a solution.
The hl-smi output looked like this (nothing peculiar except that Uncor-Events increased every time I tried to acquire the device):
+-----------------------------------------------------------------------------+
| HL-SMI Version:                              hl-1.20.0-fw-58.1.1.1          |
| Driver Version:                                     1.20.0-bd87f71          |
| Nic Driver Version:                                 1.20.0-e4fe12d          |
|-------------------------------+----------------------+----------------------+
| AIP  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncor-Events|
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | AIP-Util  Compute M. |
|===============================+======================+======================|
|   0  HL-225              N/A  | 0000:19:00.0     N/A |                   0  |
| N/A   28C   P0    89W / 600W  |  98304MiB / 98304MiB |     0%          100% |
|-------------------------------+----------------------+----------------------+
|   1  HL-225              N/A  | 0000:b3:00.0     N/A |                   0  |
| N/A   29C   P0    85W / 600W  |  98304MiB / 98304MiB |     0%          100% |
|-------------------------------+----------------------+----------------------+
|   2  HL-225              N/A  | 0000:1a:00.0     N/A |                   0  |
| N/A   30C   P0    86W / 600W  |  82527MiB / 98304MiB |     0%           83% |
|-------------------------------+----------------------+----------------------+
|   3  HL-225              N/A  | 0000:43:00.0     N/A |                   0  |
| N/A   31C   P0    85W / 600W  |  82500MiB / 98304MiB |     0%           83% |
|-------------------------------+----------------------+----------------------+
|   4  HL-225              N/A  | 0000:44:00.0     N/A |                   0  |
| N/A   28C   P0    99W / 600W  |  20698MiB / 98304MiB |     0%           21% |
|-------------------------------+----------------------+----------------------+
|   5  HL-225              N/A  | 0000:b4:00.0     N/A |                  86  |
| N/A   31C   P0   114W / 600W  |    768MiB / 98304MiB |     0%            0% |
|-------------------------------+----------------------+----------------------+
|   6  HL-225              N/A  | 0000:cc:00.0     N/A |                  73  |
| N/A   30C   P0    88W / 600W  |    768MiB / 98304MiB |     0%            0% |
|-------------------------------+----------------------+----------------------+
|   7  HL-225              N/A  | 0000:cd:00.0     N/A |                   0  |
| N/A   29C   P0    86W / 600W  |    768MiB / 98304MiB |     0%            0% |
|-------------------------------+----------------------+----------------------+
| Compute Processes:                                               AIP Memory |
|  AIP       PID   Type   Process name                                 Usage  |
|=============================================================================|
|    0    135509   C      python3                                   97536MiB  |
|    1    135580   C      python3                                   97536MiB  |
|    2    202781   C      python                                    81759MiB  |
|    3    203767   C      python                                    81732MiB  |
|    4    774811   C      python                                    19930MiB  |
|    5       N/A   N/A    N/A                                            N/A  |
|    6       N/A   N/A    N/A                                            N/A  |
|    7       N/A   N/A    N/A                                            N/A  |
+=============================================================================+
The reset attempt with hl-smi -r had already been run outside of the Docker container. hl-smi -r itself worked as intended, so it produced no error log on that side; I could only see the errors in dmesg.
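Since the only visible symptom in hl-smi was the Uncor-Events counter, a small filter can pull out the rows where it is non-zero together with the bus ID from the same row. The sketch below parses the table layout exactly as it appears in this post; hl_smi_uncor_events is a hypothetical name, and the column positions are an assumption that may not hold for other hl-smi versions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads hl-smi table text on stdin and prints
# "<index> <bus-id> <uncor-events>" for every device row whose
# Uncor-Events counter is non-zero. Assumes the layout shown in this thread.
hl_smi_uncor_events() {
    awk -F'|' '
        # Device rows are the ones with a PCI address in the second column.
        $3 ~ /[0-9a-f]+:[0-9a-f]+:[0-9a-f]+\.[0-9]/ {
            split($2, left, " ")   # left[1] is the AIP index
            split($3, mid, " ")    # mid[1] is the bus ID
            events = $4 + 0        # numeric value of the Uncor-Events cell
            if (events > 0) print left[1], mid[1], events
        }'
}

# Example usage:
#   hl-smi | hl_smi_uncor_events
```

For the output posted above, the non-zero counters sit on indices 5 and 6, whose rows list bus IDs 0000:b4:00.0 and 0000:cc:00.0. Because a reset by bus ID needs the address from the matching row, it may be worth double-checking that mapping before issuing one.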
Can you run hl-smi on the system directly, i.e. not in a Docker container, and provide the full output?
Next time the issue occurs, a less intrusive option may be to unload and reload the drivers:
modprobe -rf habanalabs
modprobe habanalabs
This will reset all devices and re-initialize the drivers. This is intrusive, but less so than a complete reboot. Can we consider this ticket closed?
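The two commands can be wrapped so they are previewed before anything is unloaded. This is a minimal sketch: reload_habanalabs and the DRY_RUN variable are hypothetical names introduced here, and only the two modprobe invocations come from the reply above.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the two commands suggested above.
# With DRY_RUN=1 it only prints the commands; otherwise it runs them,
# which needs root and stops at the first failure.
reload_habanalabs() {
    for cmd in "modprobe -rf habanalabs" "modprobe habanalabs"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "$cmd"
        else
            # Intentionally unquoted so the string splits into command + args.
            $cmd || { echo "failed: $cmd" >&2; return 1; }
        fi
    done
}

# Example usage:
#   DRY_RUN=1 reload_habanalabs    # preview only
#   sudo bash -c "$(declare -f reload_habanalabs); reload_habanalabs"
```

Because the reload resets all devices, any jobs still running on the healthy cards would be interrupted as well, so it is probably best done when the server is otherwise idle.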
Thanks for the tip! Our server has been stable since the reboot. I will try modprobe next time.
I wanted to accept this as the answer, but somehow clicked mine instead. Still, I closed the issue. Thank you.