Intel® Gaudi® AI Accelerator

Does multi-node work?

fmohamm
Employee
478 Views

I am trying to get multi-node training running on Gaudi2 (v1.20.0) by following the README here: https://github.com/HabanaAI/Megatron-LM/tree/main. I am working with containers. Even after multiple attempts, I cannot get the multi-node code running on the two Gaudi nodes that are available to me.

Is the readme file updated for latest  

 

6 Replies
James_Edwards
Employee
427 Views

The documentation should be up to date. However, to debug your issue I will need more information regarding what you are trying to accomplish and the errors you are seeing. Please provide information on the example you are trying to run, how you are executing the example to run multi-node and the errors you are receiving.

fmohamm
Employee
360 Views

I have two bare-metal Gaudi machines, and I am following the setup steps from https://github.com/HabanaAI/Megatron-LM/blob/1.20.0/examples/llama/README.md#setup and https://docs.habana.ai/en/latest/Installation_Guide/Driver_Installation.html#driver-installation.

Both machines are reached through the same jump server. During setup, I check the accelerator interface status using the following commands. The status on one machine is "up", while the status on the other machine is "down".

 

/opt/habanalabs/qual/[gaudi3,gaudi2,gaudi1]/bin/manage_network_ifs.sh --up

 /opt/habanalabs/qual/[gaudi3,gaudi2,gaudi1]/bin/manage_network_ifs.sh --status 
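In the commands above, `[gaudi3,gaudi2,gaudi1]` is a placeholder for a single generation directory. On a Gaudi2 machine the path uses `gaudi2`, as the output quoted later in this thread shows. A minimal Python sketch of running both steps follows. The helper names `manage_network_ifs_cmd` and `run_manage_network_ifs` are hypothetical and introduced only for illustration. The script path and the `--up` / `--status` flags are taken from the commands above, and nothing else about the script's behavior is assumed.

```python
import subprocess

# Install root as quoted in the thread; adjust if your installation differs.
QUAL_ROOT = "/opt/habanalabs/qual"

GENERATIONS = ("gaudi1", "gaudi2", "gaudi3")
ACTIONS = ("--up", "--status")  # the only two flags used in this thread


def manage_network_ifs_cmd(generation, action):
    """Build the argv list for manage_network_ifs.sh (hypothetical helper)."""
    if generation not in GENERATIONS:
        raise ValueError(f"unknown generation: {generation!r}")
    if action not in ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")
    return [f"{QUAL_ROOT}/{generation}/bin/manage_network_ifs.sh", action]


def run_manage_network_ifs(generation, action):
    """Run the script locally and return its stdout as text."""
    result = subprocess.run(
        manage_network_ifs_cmd(generation, action),
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit is left for the caller to inspect
    )
    return result.stdout
```

On each node, one would call `run_manage_network_ifs("gaudi2", "--up")` followed by `run_manage_network_ifs("gaudi2", "--status")`, then compare the two nodes' status text.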

 

Are there any steps missing from the README?

 

Thanks 

 

James_Edwards
Employee
349 Views

The customer has given me the following output on the "bad" node:


/opt/habanalabs/qual/gaudi2/bin/manage_network_ifs.sh --status
accel0
3 ports down (8, 22, 23)
accel1
3 ports down (8, 22, 23)
accel2
3 ports down (8, 22, 23)
accel3
3 ports down (8, 22, 23)
accel4
3 ports down (8, 22, 23)
accel5
3 ports down (8, 22, 23)
accel6
3 ports down (8, 22, 23)
accel7
3 ports down (8, 22, 23)


I requested that he contact the lab admin to make sure the Gaudi platform has been wired to the accelerator network correctly.
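To make the pattern in the status output easier to see, here is a small Python sketch that parses text in the format quoted above and lists the ports that are down on every accelerator. It assumes only the two line shapes shown in this thread: an `accelN` line followed by an `N ports down (...)` line. The output format on a healthy node is not shown here, so any other lines are ignored. The function names are hypothetical.

```python
import re

# Status text copied from the "bad" node output quoted in this thread.
SAMPLE = "\n".join(
    f"accel{i}\n3 ports down (8, 22, 23)" for i in range(8)
)


def parse_status(text):
    """Map each accelerator name to the list of port numbers reported down."""
    down = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if re.fullmatch(r"accel\d+", line):
            current = line
            down[current] = []  # stays empty if no "ports down" line follows
            continue
        match = re.search(r"ports? down \(([^)]*)\)", line)
        if match and current is not None:
            down[current] = [
                int(p) for p in match.group(1).split(",") if p.strip()
            ]
    return down


def ports_down_everywhere(down):
    """Return the sorted ports that are down on every parsed accelerator."""
    if not down:
        return []
    common = set.intersection(*(set(ports) for ports in down.values()))
    return sorted(common)


status = parse_status(SAMPLE)
print(len(status), "accelerators parsed")
print("down on every accelerator:", ports_down_everywhere(status))
```

For the quoted output, the same three ports (8, 22, 23) are down on all eight accelerators. A uniform pattern like this fits a cabling or switch problem on the accelerator network better than a fault in one card, which is why checking the wiring with the lab admin is a reasonable next step.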

James_Edwards
Employee
215 Views

Is there any status on this issue? Has the problem been resolved?

fmohamm
Employee
194 Views

We built another machine from scratch. The new machine has the same issue, with exactly the same error I was getting on the previous machine.

James_Edwards
Employee
185 Views

Is the machine you built from scratch connected to the same switch that is used for the accelerator network?
