<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Does multi-node  works? in Intel® Gaudi® AI Accelerator</title>
    <link>https://community.intel.com/t5/Intel-Gaudi-AI-Accelerator/Does-multi-node-works/m-p/1678408#M71</link>
    <description>&lt;P&gt;The documentation should be up to date. However, to debug your issue I will need more information regarding what you are trying to accomplish and the errors you are seeing. Please provide information on the example you are trying to run, how you are executing the example to run multi-node and the errors you are receiving.&lt;/P&gt;</description>
    <pubDate>Thu, 27 Mar 2025 15:41:14 GMT</pubDate>
    <dc:creator>James_Edwards</dc:creator>
    <dc:date>2025-03-27T15:41:14Z</dc:date>
    <item>
      <title>Does multi-node  works?</title>
      <link>https://community.intel.com/t5/Intel-Gaudi-AI-Accelerator/Does-multi-node-works/m-p/1678073#M67</link>
      <description>&lt;P&gt;&lt;SPAN&gt;I am trying to get multi-node training running on Gaudi2 (v1.20.0) by following the README here: &lt;A class="" title="https://github.com/habanaai/megatron-lm/tree/main" href="https://github.com/HabanaAI/Megatron-LM/tree/main" target="_blank" rel="noreferrer noopener"&gt;https://github.com/HabanaAI/Megatron-LM/tree/main&lt;/A&gt;. I am working with containers. Even after multiple attempts, I cannot get the multi-node code running on the two Gaudi nodes that are available to me.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Is the README file updated for latest&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 26 Mar 2025 18:19:55 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Gaudi-AI-Accelerator/Does-multi-node-works/m-p/1678073#M67</guid>
      <dc:creator>fmohamm</dc:creator>
      <dc:date>2025-03-26T18:19:55Z</dc:date>
    </item>
    <item>
      <title>Re: Does multi-node  works?</title>
      <link>https://community.intel.com/t5/Intel-Gaudi-AI-Accelerator/Does-multi-node-works/m-p/1678408#M71</link>
      <description>&lt;P&gt;The documentation should be up to date. However, to debug your issue I will need more information regarding what you are trying to accomplish and the errors you are seeing. Please provide information on the example you are trying to run, how you are executing the example to run multi-node and the errors you are receiving.&lt;/P&gt;</description>
      <pubDate>Thu, 27 Mar 2025 15:41:14 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Gaudi-AI-Accelerator/Does-multi-node-works/m-p/1678408#M71</guid>
      <dc:creator>James_Edwards</dc:creator>
      <dc:date>2025-03-27T15:41:14Z</dc:date>
    </item>
    <item>
      <title>Re: Does multi-node  works?</title>
      <link>https://community.intel.com/t5/Intel-Gaudi-AI-Accelerator/Does-multi-node-works/m-p/1679613#M79</link>
      <description>&lt;P&gt;I have two bare-metal Gaudi machines, and I am trying the script from &lt;A href="https://github.com/HabanaAI/Megatron-LM/blob/1.20.0/examples/llama/README.md#setup" target="_blank" rel="noopener"&gt;https://github.com/HabanaAI/Megatron-LM/blob/1.20.0/examples/llama/README.md#setup&lt;/A&gt; and &lt;A href="https://docs.habana.ai/en/latest/Installation_Guide/Driver_Installation.html#driver-installation" target="_blank" rel="noopener"&gt;https://docs.habana.ai/en/latest/Installation_Guide/Driver_Installation.html#driver-installation&lt;/A&gt;.&lt;/P&gt;&lt;P&gt;Both machines are accessed through the same jump server. During setup, I check the accelerator interface status using the following commands. The status on one machine is "up" while the status on the other machine is "down".&lt;/P&gt;&lt;PRE&gt;/opt/habanalabs/qual/[gaudi3,gaudi2,gaudi1]/bin/manage_network_ifs.sh --up
/opt/habanalabs/qual/[gaudi3,gaudi2,gaudi1]/bin/manage_network_ifs.sh --status&lt;/PRE&gt;&lt;P&gt;Are there any steps missing from the README?&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;</description>
      <pubDate>Tue, 01 Apr 2025 17:56:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Gaudi-AI-Accelerator/Does-multi-node-works/m-p/1679613#M79</guid>
      <dc:creator>fmohamm</dc:creator>
      <dc:date>2025-04-01T17:56:00Z</dc:date>
    </item>
    <item>
      <title>Re: Does multi-node  works?</title>
      <link>https://community.intel.com/t5/Intel-Gaudi-AI-Accelerator/Does-multi-node-works/m-p/1679638#M80</link>
      <description>&lt;P&gt;The customer has given me the following output from the "bad" node:&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;/opt/habanalabs/qual/gaudi2/bin/manage_network_ifs.sh --status&lt;BR /&gt;accel0&lt;BR /&gt;3 ports down (8, 22, 23)&lt;BR /&gt;accel1&lt;BR /&gt;3 ports down (8, 22, 23)&lt;BR /&gt;accel2&lt;BR /&gt;3 ports down (8, 22, 23)&lt;BR /&gt;accel3&lt;BR /&gt;3 ports down (8, 22, 23)&lt;BR /&gt;accel4&lt;BR /&gt;3 ports down (8, 22, 23)&lt;BR /&gt;accel5&lt;BR /&gt;3 ports down (8, 22, 23)&lt;BR /&gt;accel6&lt;BR /&gt;3 ports down (8, 22, 23)&lt;BR /&gt;accel7&lt;BR /&gt;3 ports down (8, 22, 23)&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;I requested that he contact the lab admin to make sure the Gaudi platform has been wired to the accelerator network correctly.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 01 Apr 2025 19:38:03 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Gaudi-AI-Accelerator/Does-multi-node-works/m-p/1679638#M80</guid>
      <dc:creator>James_Edwards</dc:creator>
      <dc:date>2025-04-01T19:38:03Z</dc:date>
    </item>
    <item>
      <title>Re: Does multi-node  works?</title>
      <link>https://community.intel.com/t5/Intel-Gaudi-AI-Accelerator/Does-multi-node-works/m-p/1682804#M83</link>
      <description>&lt;P&gt;Is there any update on this issue? Has the problem been resolved?&lt;/P&gt;</description>
      <pubDate>Mon, 14 Apr 2025 15:36:09 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Gaudi-AI-Accelerator/Does-multi-node-works/m-p/1682804#M83</guid>
      <dc:creator>James_Edwards</dc:creator>
      <dc:date>2025-04-14T15:36:09Z</dc:date>
    </item>
    <item>
      <title>Re: Does multi-node  works?</title>
      <link>https://community.intel.com/t5/Intel-Gaudi-AI-Accelerator/Does-multi-node-works/m-p/1682922#M85</link>
      <description>&lt;P&gt;We built another machine from scratch. The new machine has the same issue, with exactly the same error that I was getting on the previous machine.&lt;/P&gt;</description>
      <pubDate>Tue, 15 Apr 2025 00:41:07 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Gaudi-AI-Accelerator/Does-multi-node-works/m-p/1682922#M85</guid>
      <dc:creator>fmohamm</dc:creator>
      <dc:date>2025-04-15T00:41:07Z</dc:date>
    </item>
    <item>
      <title>Re: Does multi-node  works?</title>
      <link>https://community.intel.com/t5/Intel-Gaudi-AI-Accelerator/Does-multi-node-works/m-p/1682925#M86</link>
      <description>&lt;P&gt;Is the machine you built from scratch linked into the same switch used on the accelerator network?&lt;/P&gt;</description>
      <pubDate>Tue, 15 Apr 2025 00:52:14 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Gaudi-AI-Accelerator/Does-multi-node-works/m-p/1682925#M86</guid>
      <dc:creator>James_Edwards</dc:creator>
      <dc:date>2025-04-15T00:52:14Z</dc:date>
    </item>
    <item>
      <title>Re: Does multi-node  works?</title>
      <link>https://community.intel.com/t5/Intel-Gaudi-AI-Accelerator/Does-multi-node-works/m-p/1685172#M87</link>
      <description>&lt;P&gt;I talked to IT, and they said that the machines are on different switches. How is the port status being `down` related to the machines being on different switches?&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="fmohamm_0-1745513410041.png" style="width: 400px;"&gt;&lt;img src="https://community.intel.com/t5/image/serverpage/image-id/65122i2642001BF889713C/image-size/medium?v=v2&amp;amp;px=400&amp;amp;whitelist-exif-data=Orientation%2CResolution%2COriginalDefaultFinalSize%2CCopyright" role="button" title="fmohamm_0-1745513410041.png" alt="fmohamm_0-1745513410041.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 24 Apr 2025 16:52:43 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Gaudi-AI-Accelerator/Does-multi-node-works/m-p/1685172#M87</guid>
      <dc:creator>fmohamm</dc:creator>
      <dc:date>2025-04-24T16:52:43Z</dc:date>
    </item>
    <item>
      <title>Re: Does multi-node  works?</title>
      <link>https://community.intel.com/t5/Intel-Gaudi-AI-Accelerator/Does-multi-node-works/m-p/1685175#M88</link>
      <description>&lt;P&gt;IT didn't really answer the question: the new machine could be on the switch that was used previously, or it could be on a different one. If the "new" system is connected to the switch correctly, IT should see the LED indicator lights illuminated, showing that the connection is working. If the LEDs are on, the ports for the system should be up. If they aren't, something is wrong with the switch or the cabling.&lt;/P&gt;&lt;P&gt;Whatever the case, if the two systems are on different switches and those switches are not connected through a "spine" switch, the boxes will not be able to communicate with one another.&lt;/P&gt;</description>
      <pubDate>Thu, 24 Apr 2025 17:04:15 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Gaudi-AI-Accelerator/Does-multi-node-works/m-p/1685175#M88</guid>
      <dc:creator>James_Edwards</dc:creator>
      <dc:date>2025-04-24T17:04:15Z</dc:date>
    </item>
  </channel>
</rss>

