Intel® Gaudi® AI Accelerator
Support for the Intel® Gaudi® AI Accelerator

Gaudi2 - Devices 1 & 3 Showing N/A Status in hl-smi and Unusable

RajashekarK_Intel

Hi,

We're currently working with a Gaudi2 machine (8 Gaudi® 2 HL-225H mezzanine cards with 3rd Gen Xeon® processors) and have encountered a problem with device recognition. When I use the hl-smi tool to check the status of the devices, I see an "N/A" status for devices 1 and 3.

Is this expected behavior under certain conditions? If not, could anyone provide insights into what might be causing this and how to resolve it?

Any help would be greatly appreciated!

RajashekarK_Intel

Update: unfortunately, after a reboot only 4 cards are visible, and

sudo hl-smi

is also giving the same output.

RajashekarK_Intel

We've tried removing and reloading the kernel module, but the issue still persists:

rmmod habanalabs 
modprobe habanalabs 
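
A reload like the one above can be followed by a look at what the driver logged while it came back up. The sketch below is only an illustration, not a procedure from Intel: the function names, the HL_MODULE variable, the 5-second pause, and the 50-line limit are all hypothetical choices, and it assumes it is run as root.

```shell
# Module name as used in this thread; override HL_MODULE if yours differs.
HL_MODULE="${HL_MODULE:-habanalabs}"

# Read kernel log text on stdin and keep only lines that mention the
# driver, limited to the last N lines (N defaults to 50).
filter_driver_log() {
    grep -i "$HL_MODULE" | tail -n "${1:-50}"
}

# Hypothetical helper: unload and reload the module, then print the
# driver's most recent kernel messages. Assumes root privileges.
reload_and_check() {
    rmmod "$HL_MODULE" || return 1
    modprobe "$HL_MODULE" || return 1
    sleep 5   # arbitrary pause so the driver has time to log its probe
    dmesg | filter_driver_log 50
}
```

Calling reload_and_check as root prints the filtered log, which can then be pasted into the thread alongside the hl-smi output.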
James_Edwards
Employee

Can you try resetting one of the devices using the hl-smi command? The reset command needs the -i option to specify the device:


hl-smi -r -i 0000:cc:00.0


This should reset device 0. Please post the dmesg output after executing the reset.

James_Edwards
Employee

I misspoke: this command will reset device 1 (one of the problem devices):


hl-smi -r -i 0000:cc:00.0


This command will reset device 3:


hl-smi -r -i 0000:cd:00.0


Please send the dmesg output after the reset command has completed for both.
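
For convenience, the two resets and the dmesg capture could be scripted roughly as follows. This is a minimal sketch under stated assumptions: the PCI addresses are the ones given above, while the function names, the HL_SMI and LOG_DIR variables, the 10-second wait, and the 200-line tail are hypothetical choices, and the script is assumed to run as root.

```shell
# PCI addresses from this reply (devices 1 and 3).
DEVICES="0000:cc:00.0 0000:cd:00.0"
HL_SMI="${HL_SMI:-hl-smi}"   # path to hl-smi; overridable (assumption)
LOG_DIR="${LOG_DIR:-.}"      # hypothetical directory for the saved logs

# Build a log file name from a PCI address,
# e.g. 0000:cc:00.0 -> dmesg_0000_cc_00.0.log
log_name() {
    printf 'dmesg_%s.log\n' "$(printf '%s' "$1" | tr ':' '_')"
}

# Reset each device in turn and save the tail of the kernel log
# to one file per device. Assumes root privileges.
reset_and_capture() {
    for dev in $DEVICES; do
        echo "Resetting $dev"
        "$HL_SMI" -r -i "$dev"
        sleep 10   # arbitrary wait; assumed long enough for the reset
        dmesg | tail -n 200 > "$LOG_DIR/$(log_name "$dev")"
    done
}
```

Running reset_and_capture leaves one log file per device in LOG_DIR, which can be attached to a reply.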

RajashekarK_Intel

Thanks for your reply; we'll keep these noted.

We've moved to a different instance. Thank you!

James_Edwards
Employee

Can we close this issue?
