Intel® Gaudi® AI Accelerator
Support for the Intel® Gaudi® AI Accelerator
19 Discussions

Does multi-node work?

fmohamm
Employee
5,055 Views

I am trying to get a multi-node run working on Gaudi2 (v1.20.0) by following the README here: https://github.com/HabanaAI/Megatron-LM/tree/main. I am working with containers. Even after multiple attempts, I cannot get the multi-node code running on the two Gaudi nodes that are available to me.

Is the readme file updated for latest  

 

0 Kudos
8 Replies
James_Edwards
Employee
5,004 Views

The documentation should be up to date. However, to debug your issue I will need more information about what you are trying to accomplish and the errors you are seeing. Please describe the example you are trying to run, how you are launching it for multi-node, and the errors you are receiving.

fmohamm
Employee
4,937 Views

I have two bare-metal Gaudi machines, and I am following the instructions at https://github.com/HabanaAI/Megatron-LM/blob/1.20.0/examples/llama/README.md#setup and https://docs.habana.ai/en/latest/Installation_Guide/Driver_Installation.html#driver-installation.

Both machines are reached from the same jump server. During setup, I check the accelerator interface status using the following commands. The status is "up" on one machine and "down" on the other.

 

/opt/habanalabs/qual/[gaudi3,gaudi2,gaudi1]/bin/manage_network_ifs.sh --up

 /opt/habanalabs/qual/[gaudi3,gaudi2,gaudi1]/bin/manage_network_ifs.sh --status 

 

Are there any steps missing from the README?

 

Thanks 

 

James_Edwards
Employee
4,926 Views

The customer has given me the following output on the "bad" node:


/opt/habanalabs/qual/gaudi2/bin/manage_network_ifs.sh --status
accel0
3 ports down (8, 22, 23)
accel1
3 ports down (8, 22, 23)
accel2
3 ports down (8, 22, 23)
accel3
3 ports down (8, 22, 23)
accel4
3 ports down (8, 22, 23)
accel5
3 ports down (8, 22, 23)
accel6
3 ports down (8, 22, 23)
accel7
3 ports down (8, 22, 23)


I requested that he contact the lab admin to make sure the Gaudi platform has been wired to the accelerator network correctly.
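
For readers who want to compare nodes quickly, here is a minimal Python sketch that summarizes output of the kind pasted above. It assumes the `--status` output has exactly the shape shown in this thread (a device line such as `accel0` followed by a line such as `3 ports down (8, 22, 23)`). The function names `parse_status` and `common_down_ports` are made up for this illustration and are not part of any Habana tool. How the script reports a device with all ports up is not shown in this thread, so such a device is simply recorded with an empty list.

```python
import re

# Patterns match only the two line shapes visible in the pasted output.
DEVICE_RE = re.compile(r"^(accel\d+)$")
DOWN_RE = re.compile(r"^(\d+) ports? down \(([\d,\s]+)\)$")


def parse_status(text):
    """Map each device name to the list of port numbers reported as down."""
    down_ports = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        match = DEVICE_RE.match(line)
        if match:
            current = match.group(1)
            down_ports[current] = []  # stays empty if no "ports down" line follows
            continue
        match = DOWN_RE.match(line)
        if match and current is not None:
            down_ports[current] = [int(p) for p in match.group(2).split(",")]
    return down_ports


def common_down_ports(down_ports):
    """Return the sorted ports that are down on every parsed device."""
    port_sets = [set(ports) for ports in down_ports.values()]
    if not port_sets:
        return []
    return sorted(set.intersection(*port_sets))


# Two devices copied from the output in this thread.
SAMPLE = """accel0
3 ports down (8, 22, 23)
accel1
3 ports down (8, 22, 23)
"""

if __name__ == "__main__":
    parsed = parse_status(SAMPLE)
    for device, ports in parsed.items():
        print(device, ports)
    print("down on every device:", common_down_ports(parsed))
```

Running the same summary on both nodes makes it easy to see whether the same ports are down on every device, as they are in the output above.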

James_Edwards
Employee
4,792 Views

Is there any update on this issue? Has the problem been resolved?

fmohamm
Employee
4,771 Views

We built another machine from scratch. The new machine has the same issue, with exactly the same error I was getting on the previous machine.

James_Edwards
Employee
4,762 Views

Is the machine you built from scratch linked into the same switch used on the accelerator network?

fmohamm
Employee
4,567 Views

I talked to IT, and they said that the machines are on different switches. How is the port status being `down` related to the machines being on different switches?

 

[Attached image: fmohamm_0-1745513410041.png]

 

James_Edwards
Employee
4,563 Views

IT basically didn't answer the question: the new machine could be on the switch that was used previously, or it could be on a different one. If the "new" system is connected to the switch correctly, IT should see the LED link indicators illuminated, showing that the connection is working. If the LEDs are on, the ports for the system should be up. If they aren't, something is wrong with the switch or the cabling.


Whatever the case, if the two systems are on different switches and those switches are not connected through a "spine" switch, the boxes will not be able to communicate with one another.
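
The spine-switch point can be illustrated with a small reachability check. This is only a sketch of the reasoning, not a diagnostic tool: the device names (`node-a`, `leaf-1`, `spine`, and so on) and the `can_reach` helper are hypothetical, and the link list would have to come from whoever knows the actual cabling.

```python
from collections import deque


def can_reach(links, src, dst):
    """Breadth-first search over undirected links between named devices."""
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    seen = {src}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False


# Hypothetical cabling: each node is plugged into its own leaf switch.
separate_switches = [("node-a", "leaf-1"), ("node-b", "leaf-2")]

# Same cabling, plus both leaf switches uplinked to a shared spine switch.
with_spine = separate_switches + [("leaf-1", "spine"), ("leaf-2", "spine")]

if __name__ == "__main__":
    print(can_reach(separate_switches, "node-a", "node-b"))
    print(can_reach(with_spine, "node-a", "node-b"))
```

In the first layout there is no path between the two nodes. Adding the spine uplinks creates one.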

 
