
Gaudi-2 device acquisition failure

taesukim_squeezebits

Hello, we have encountered a device acquisition failure on Gaudi-2.

On our on-premise Gaudi-2 server, 2 of the 8 cards are not working as intended. Any access to these devices fails with the message synStatus=26 Device acquire failed.
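
For reference, even a trivial device acquisition triggers the error. A rough reproduction, assuming the Habana PyTorch bridge that ships with the image we use (pinning a module via HABANA_VISIBLE_MODULES is our guess at the simplest way to target one card):

# inside the container: pin the suspect module, then try to acquire it
HABANA_VISIBLE_MODULES=5 python3 -c "import torch, habana_frameworks.torch.core; print(torch.ones(1).to('hpu'))"
# -> synStatus=26 Device acquire failed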

In such cases, dmesg reports:

[1194030.096641] habanalabs 0000:43:00.0: PMMU_AXI_ERR_RSP: err cause: page fault
[1194030.097687] habanalabs 0000:43:00.0: PMMU page fault on va 0xfff0000100200980
[1194030.098693] habanalabs 0000:43:00.0: PMMU_AXI_ERR_RSP: err cause: slave error
[1194030.148378] habanalabs 0000:43:00.0: Going to reset device
[1194030.148405] habanalabs 0000:43:00.0: PDMA_CH1_AXI_ERR_RSP: err cause: qman sei intr
[1194030.149384] habanalabs 0000:43:00.0: PDMA_0-RAZWI SHARED RR HBW RD error, address 0x1fffffff0001380, Initiator coordinates 0x4
[1194030.154116] habanalabs 0000:43:00.0: PMMU0_PAGE_FAULT_WR_PERM: err cause: page fault
[1194030.155179] habanalabs 0000:43:00.0: PMMU0_PAGE_FAULT_WR_PERM: err cause: slave error
[1194030.158437] habanalabs 0000:43:00.0: PDMA1_QM: stream0. err cause: CQ AXI HBW error, qid = 4
[1194030.169474] habanalabs 0000:cc:00.0: Card 4 Port 6: link down
[1194030.169482] habanalabs 0000:cc:00.0: Card 4 Port 7: link down
[1194030.171222] habanalabs 0000:cc:00.0: Card 4 Port 7: link up
[1194030.171385] habanalabs 0000:cc:00.0: Card 4 Port 7: link down
[1194030.173414] habanalabs 0000:cc:00.0: Card 4 Port 9: link down
[1194030.175515] habanalabs 0000:19:00.0: Card 2 Port 16: link down
[1194030.175536] habanalabs 0000:19:00.0: Card 2 Port 17: link down
[1194030.177023] habanalabs 0000:19:00.0: Card 2 Port 17: link up
[1194030.177381] habanalabs 0000:19:00.0: Card 2 Port 17: link down
[1194030.179498] habanalabs 0000:19:00.0: Card 2 Port 18: link down
[1194030.187405] habanalabs 0000:b3:00.0: Card 6 Port 16: link down
[1194030.187421] habanalabs 0000:b3:00.0: Card 6 Port 17: link down
[1194030.189120] habanalabs 0000:b3:00.0: Card 6 Port 17: link up
[1194030.189388] habanalabs 0000:b3:00.0: Card 6 Port 17: link down
[1194030.191409] habanalabs 0000:b3:00.0: Card 6 Port 18: link down
[1194030.253917] habanalabs 0000:43:00.0: Killing CS 1.1
[1194030.253949] habanalabs 0000:43:00.0: CS 1 has been aborted while user process is waiting for it
[1194035.678482] habanalabs 0000:19:00.0: Card 2 Port 19: link down
[1194035.680487] habanalabs 0000:19:00.0: Card 2 Port 19: link up
[1194035.685452] habanalabs 0000:b3:00.0: Card 6 Port 19: link down
[1194035.687223] habanalabs 0000:b3:00.0: Card 6 Port 19: link up
[1194035.799391] habanalabs 0000:cc:00.0: Card 4 Port 3: link down
[1194035.801186] habanalabs 0000:cc:00.0: Card 4 Port 3: link up
[1194042.502358] habanalabs 0000:43:00.0: Driver version: 1.20.0-bd87f71
[1194042.555186] habanalabs 0000:43:00.0: Loading secured firmware to device, may take some time...
[1194042.680209] habanalabs 0000:43:00.0: preboot full version: 'Preboot version hl-gaudi2-1.20.0-fw-58.0.0-sec-9 (Jan 16 2025 - 17:51:04)'
[1194057.033814] habanalabs 0000:43:00.0: boot-fit version 58.0.0-sec-9
[1194059.622128] habanalabs 0000:43:00.0: Successfully loaded firmware to device
[1194060.706321] habanalabs 0000:43:00.0: Linux version 58.0.0-sec-9
[1194071.475317] habanalabs 0000:43:00.0: Successfully finished resetting the 0000:43:00.0 device

In hl-smi, we can observe the Uncor-Events count increasing, and the malfunctioning devices show up as indices 5 and 6. We tried resetting them with hl-smi -r -i 0000:43:00.0 and 0000:44:00.0 (for devices 5 and 6), but had no luck.
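
Run on the host, outside the container, the reset attempts looked like this:

hl-smi -r -i 0000:43:00.0   # reset the first failing card
hl-smi -r -i 0000:44:00.0   # reset the second failing card
hl-smi                      # Uncor-Events kept increasing afterwards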

Our SynapseAI driver version is 1.20.0 (1.20.0-bd87f71 in hl-smi), and we are using vault.habana.ai/gaudi-docker/1.20.0/ubuntu22.04/habanalabs/pytorch-installer-2.6.0 as the Docker image.

1 Solution
taesukim_squeezebits

We resolved this issue with a reboot and have been observing stable performance for ~18 hours now.

If anyone stumbles upon a similar problem, rebooting the server might be a solution.

The hl-smi output looked like this (nothing peculiar except the Uncor-Events count increasing every time I tried to acquire the device):

+-----------------------------------------------------------------------------+
| HL-SMI Version:                              hl-1.20.0-fw-58.1.1.1          |
| Driver Version:                                     1.20.0-bd87f71          |
| Nic Driver Version:                                 1.20.0-e4fe12d          |
|-------------------------------+----------------------+----------------------+
| AIP  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncor-Events|
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | AIP-Util  Compute M. |
|===============================+======================+======================|
|   0  HL-225              N/A  | 0000:19:00.0     N/A |                   0  |
| N/A   28C   P0   89W /  600W  | 98304MiB /  98304MiB |     0%          100% |
|-------------------------------+----------------------+----------------------+
|   1  HL-225              N/A  | 0000:b3:00.0     N/A |                   0  |
| N/A   29C   P0   85W /  600W  | 98304MiB /  98304MiB |     0%          100% |
|-------------------------------+----------------------+----------------------+
|   2  HL-225              N/A  | 0000:1a:00.0     N/A |                   0  |
| N/A   30C   P0   86W /  600W  | 82527MiB /  98304MiB |     0%           83% |
|-------------------------------+----------------------+----------------------+
|   3  HL-225              N/A  | 0000:43:00.0     N/A |                   0  |
| N/A   31C   P0   85W /  600W  | 82500MiB /  98304MiB |     0%           83% |
|-------------------------------+----------------------+----------------------+
|   4  HL-225              N/A  | 0000:44:00.0     N/A |                   0  |
| N/A   28C   P0   99W /  600W  | 20698MiB /  98304MiB |     0%           21% |
|-------------------------------+----------------------+----------------------+
|   5  HL-225              N/A  | 0000:b4:00.0     N/A |                  86  |
| N/A   31C   P0  114W /  600W  |   768MiB /  98304MiB |     0%            0% |
|-------------------------------+----------------------+----------------------+
|   6  HL-225              N/A  | 0000:cc:00.0     N/A |                  73  |
| N/A   30C   P0   88W /  600W  |   768MiB /  98304MiB |     0%            0% |
|-------------------------------+----------------------+----------------------+
|   7  HL-225              N/A  | 0000:cd:00.0     N/A |                   0  |
| N/A   29C   P0   86W /  600W  |   768MiB /  98304MiB |     0%            0% |
|-------------------------------+----------------------+----------------------+
| Compute Processes:                                               AIP Memory |
|  AIP       PID   Type   Process name                             Usage      |
|=============================================================================|
|   0       135509     C   python3                                 97536MiB   |
|   1       135580     C   python3                                 97536MiB   |
|   2       202781     C   python                                  81759MiB   |
|   3       203767     C   python                                  81732MiB   |
|   4       774811     C   python                                  19930MiB   |
|   5        N/A   N/A    N/A                                      N/A        |
|   6        N/A   N/A    N/A                                      N/A        |
|   7        N/A   N/A    N/A                                      N/A        |
+=============================================================================+

The reset attempt with hl-smi -r had already been done outside of Docker. hl-smi -r itself worked as intended, so there was no error log on that side; I could only observe the errors in dmesg.

3 Replies
James_Edwards
Employee

Can you run hl-smi directly on the system, i.e. not in a Docker container, and provide the full output?
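
For example, something along these lines on the host (the exact dmesg filter is just a suggestion):

hl-smi                                      # full table from the host driver
sudo dmesg | grep habanalabs | tail -n 50   # recent driver messages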

James_Edwards
Employee

Next time the issue occurs, a less intrusive option may be to unload and reload the drivers:

modprobe -rf habanalabs
modprobe habanalabs

This will reset all devices and re-initialize the drivers. This is intrusive, but less so than a complete reboot. Can we consider this ticket closed?
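
After the reload, it is worth confirming that the devices re-enumerated cleanly, with something like:

lsmod | grep habanalabs   # driver modules are loaded again
hl-smi                    # all 8 cards visible, Uncor-Events back at 0
sudo dmesg | tail         # no fresh habanalabs errors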

taesukim_squeezebits

Thanks for the tip! Our server has been functioning stably since the reboot. I will try modprobe next time.

I wanted to accept this as the answer, but somehow clicked on my own post. Still, I closed the issue. Thank you.
