I am trying to get a multi-node run working on Gaudi2 (v1.20.0) by following the README here: https://github.com/HabanaAI/Megatron-LM/tree/main. I am working with containers. Even after multiple attempts, I cannot get the multi-node code running on the two Gaudi nodes that are available to me.
Is the readme file updated for latest
The documentation should be up to date. However, to debug your issue I will need more information regarding what you are trying to accomplish and the errors you are seeing. Please provide information on the example you are trying to run, how you are executing the example to run multi-node and the errors you are receiving.
I have two bare-metal Gaudi machines, and I am following the setup steps from https://github.com/HabanaAI/Megatron-LM/blob/1.20.0/examples/llama/README.md#setup and https://docs.habana.ai/en/latest/Installation_Guide/Driver_Installation.html#driver-installation.
Both machines are reached through the same jump server. During setup, I check the accelerator interface status using the following commands. The status on one machine is "up", while the status on the other machine is "down".
/opt/habanalabs/qual/[gaudi3,gaudi2,gaudi1]/bin/manage_network_ifs.sh --up
/opt/habanalabs/qual/[gaudi3,gaudi2,gaudi1]/bin/manage_network_ifs.sh --status
Are there any steps missing from the README?
Thanks
The customer has given me the following output on the "bad" node:
.
/opt/habanalabs/qual/gaudi2/bin/manage_network_ifs.sh --status
accel0
3 ports down (8, 22, 23)
accel1
3 ports down (8, 22, 23)
accel2
3 ports down (8, 22, 23)
accel3
3 ports down (8, 22, 23)
accel4
3 ports down (8, 22, 23)
accel5
3 ports down (8, 22, 23)
accel6
3 ports down (8, 22, 23)
accel7
3 ports down (8, 22, 23)
.
I requested that he contact the lab admin to make sure the Gaudi platform has been wired to the accelerator network correctly.
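The status output above can be summarized per device with a short script. This is only a sketch: it assumes the `--status` output keeps the two-line shape shown in this thread (an `accelN` line followed by an `N ports down (...)` line), and the function names `parse_down_ports` and `common_down_ports` are hypothetical helpers, not part of any Habana tooling. Lines in any other shape are ignored.

```python
import re

# Excerpt of the output posted in this thread (first two devices only).
SAMPLE = """\
accel0
3 ports down (8, 22, 23)
accel1
3 ports down (8, 22, 23)
"""


def parse_down_ports(status_text):
    """Map each accelN device to the list of port numbers reported down.

    Assumes the two-line layout seen in this thread; a device with no
    recognizable "ports down" line is recorded with an empty list.
    """
    down = {}
    current = None
    for raw in status_text.splitlines():
        line = raw.strip()
        if re.fullmatch(r"accel\d+", line):
            current = line
            down[current] = []
        elif current is not None:
            match = re.fullmatch(r"\d+ ports? down \(([\d,\s]+)\)", line)
            if match:
                down[current] = [int(p) for p in match.group(1).split(",")]
    return down


def common_down_ports(down):
    """Return the ports that are down on every parsed device."""
    port_sets = [set(ports) for ports in down.values()]
    return sorted(set.intersection(*port_sets)) if port_sets else []


if __name__ == "__main__":
    parsed = parse_down_ports(SAMPLE)
    for device, ports in parsed.items():
        print(device, ports)
    print("down on every device:", common_down_ports(parsed))
```

If the same ports show up as down on every device, as they do here (8, 22, 23), that pattern may point to something shared across the node, such as cabling or switch configuration, rather than a single faulty card. That reading is a guess and still needs to be confirmed by the lab admin.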
Is there any status on this issue? Has the problem been resolved?
We built another machine from scratch. The new machine has the same issue: exactly the same error that I was getting on the previous machine.
Is the machine you built from scratch connected to the same switch that is used for the accelerator network?
