Intel® Gaudi® AI Accelerator
Support for the Intel® Gaudi® AI Accelerator

Does multi-node work?

fmohamm
Employee
5,048 views

I am trying to get a multi-node run working on Gaudi2 (v1.20.0) by following the README here: https://github.com/HabanaAI/Megatron-LM/tree/main. I am working with containers. Even after multiple attempts, I cannot get the multi-node code running on the two Gaudi nodes that are available to me.

Is the readme file updated for latest  

 

0 points
8 replies
James_Edwards
Employee
4,997 views

The documentation should be up to date. However, to debug your issue I will need more information regarding what you are trying to accomplish and the errors you are seeing. Please provide information on the example you are trying to run, how you are executing the example to run multi-node and the errors you are receiving.

0 points
fmohamm
Employee
4,930 views

I have two bare-metal Gaudi machines, and I am trying the script from https://github.com/HabanaAI/Megatron-LM/blob/1.20.0/examples/llama/README.md#setup and https://docs.habana.ai/en/latest/Installation_Guide/Driver_Installation.html#driver-installation.

Both machines are accessed from the same jump server. During setup, I check the accelerator interface status using the following commands. The status on one machine is "up", while the status on the other machine is "down".

 

/opt/habanalabs/qual/[gaudi3,gaudi2,gaudi1]/bin/manage_network_ifs.sh --up

 /opt/habanalabs/qual/[gaudi3,gaudi2,gaudi1]/bin/manage_network_ifs.sh --status 
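The bracketed list in the path above appears to be a placeholder for a single device generation; a later post in this thread runs the script from `/opt/habanalabs/qual/gaudi2/bin/`. A minimal Python sketch that builds the concrete command for one generation follows. The helper names `manage_ifs_command` and `run_on_this_node` are hypothetical, and the only flags used are the `--up` and `--status` options quoted above.

```python
import subprocess

# Script location as quoted in this thread; the generation directory varies per machine.
SCRIPT_TEMPLATE = "/opt/habanalabs/qual/{generation}/bin/manage_network_ifs.sh"
GENERATIONS = ("gaudi1", "gaudi2", "gaudi3")
ACTIONS = ("--up", "--status")


def manage_ifs_command(generation, action):
    """Return the argv list for one generation and one action (hypothetical helper)."""
    if generation not in GENERATIONS:
        raise ValueError(f"unknown generation: {generation!r}")
    if action not in ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")
    return [SCRIPT_TEMPLATE.format(generation=generation), action]


def run_on_this_node(generation, action):
    """Run the script locally and return its stdout; only meaningful on a Gaudi host."""
    result = subprocess.run(
        manage_ifs_command(generation, action),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Running `run_on_this_node("gaudi2", "--up")` followed by `run_on_this_node("gaudi2", "--status")` on each node would mirror the two commands above, under the assumption that both nodes are Gaudi2 systems.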

 

Are there any steps missing from the README?

 

Thanks 

 

0 points
James_Edwards
Employee
4,919 views

The customer has given me the following output on the "bad" node:


/opt/habanalabs/qual/gaudi2/bin/manage_network_ifs.sh --status
accel0
3 ports down (8, 22, 23)
accel1
3 ports down (8, 22, 23)
accel2
3 ports down (8, 22, 23)
accel3
3 ports down (8, 22, 23)
accel4
3 ports down (8, 22, 23)
accel5
3 ports down (8, 22, 23)
accel6
3 ports down (8, 22, 23)
accel7
3 ports down (8, 22, 23)


I requested that he contact the lab admin to make sure the Gaudi platform has been wired to the accelerator network correctly.
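To compare nodes quickly, the `--status` output quoted above can be reduced to a per-device list of down ports. The following is a rough Python sketch that assumes the output always has the two-line shape shown in this thread (a device name such as `accel0`, then an `N ports down (...)` line). The format printed for a healthy node is not shown here, so the parser simply leaves such devices with an empty list. `parse_status` is a hypothetical helper, not part of any Gaudi tooling.

```python
import re

# Two devices copied from the output quoted in this thread.
SAMPLE = """accel0
3 ports down (8, 22, 23)
accel1
3 ports down (8, 22, 23)
"""


def parse_status(text):
    """Map each device name to the list of port numbers reported as down."""
    down = {}
    device = None
    for raw in text.splitlines():
        line = raw.strip()
        if re.fullmatch(r"accel\d+", line):
            device = line
            down[device] = []
        elif device is not None:
            match = re.search(r"ports? down \(([\d,\s]+)\)", line)
            if match:
                down[device] = [int(p) for p in match.group(1).split(",")]
    return down


def same_ports_on_all_devices(down):
    """True when every device reports the identical set of down ports."""
    return len({tuple(ports) for ports in down.values()}) == 1
```

In the output above, all eight devices report the same three ports (8, 22, 23) as down, which is the pattern `same_ports_on_all_devices` detects.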

0 points
James_Edwards
Employee
4,785 views

Is there any update on this issue? Has the problem been resolved?

0 points
fmohamm
Employee
4,764 views

We built another machine from scratch. This new machine has the same issue: exactly the same error that I was getting on the previous machine.

0 points
James_Edwards
Employee
4,755 views

Is the machine you built from scratch connected to the same switch used on the accelerator network?

0 points
fmohamm
Employee
4,560 views

I talked to IT, and they said that the machines are on a different switch. How is the port status being `down` related to being on a different switch?

 

[Attached image: fmohamm_0-1745513410041.png]

 

0 points
James_Edwards
Employee
4,556 views

IT basically didn't answer the question: the new machine could be on the switch that was used previously, or it could be on a different one. If the "new" system is connected to the switch correctly, IT should see the LED indicator lights illuminated, showing that the connection is working. If the LEDs are on, the ports for the system should be up. If they aren't, something is wrong with the switch or the cabling.


Whatever the case, if the two systems are on different switches and those switches are not connected through a "spine" switch, the boxes will not be able to communicate with one another.
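The spine-switch point can be pictured as a reachability check on the cabling graph. The Python sketch below is purely illustrative: the node and switch names are made up, the link lists are not taken from any real lab, and it models physical connectivity only (it says nothing about ports that are down for other reasons).

```python
from collections import deque


def can_communicate(links, node_a, node_b):
    """Breadth-first search over undirected links; True if a path exists."""
    neighbors = {}
    for left, right in links:
        neighbors.setdefault(left, set()).add(right)
        neighbors.setdefault(right, set()).add(left)

    seen = {node_a}
    queue = deque([node_a])
    while queue:
        current = queue.popleft()
        if current == node_b:
            return True
        for nxt in neighbors.get(current, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


# Hypothetical topologies: two nodes on separate switches, without and with a spine.
isolated = [("node1", "switchA"), ("node2", "switchB")]
with_spine = isolated + [("switchA", "spine"), ("switchB", "spine")]
```

With these hypothetical link lists, `can_communicate(isolated, "node1", "node2")` returns `False`, while the same call on `with_spine` returns `True`.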

 

0 points