<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re:Intel MPI connect problems in Intel® MPI Library</title>
    <link>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-connect-problems/m-p/1346474#M9045</link>
    <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;We assume that your issue is resolved. If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Thanks &amp;amp; Regards,&lt;/P&gt;&lt;P&gt;Santosh&lt;/P&gt;&lt;BR /&gt;</description>
    <pubDate>Fri, 24 Dec 2021 10:18:54 GMT</pubDate>
    <dc:creator>SantoshY_Intel</dc:creator>
    <dc:date>2021-12-24T10:18:54Z</dc:date>
    <item>
      <title>Intel MPI connect problems</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-connect-problems/m-p/1338791#M8962</link>
      <description>&lt;P&gt;I have a three-node cluster connected by a Myrinet InfiniBand switch. I have a common NFS home directory for each node, so the startup sequence for each node (bashrc, etc.) is identical. Let's call the nodes node1, node2, and node3. I have disabled the firewalls on each server (which are running the latest version of CentOS 7).&lt;/P&gt;
&lt;P&gt;Each node runs mpi on itself without problem, e.g.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;mpirun -n 32 -host localhost hostname&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Now is the confusing part. Running the following works fine:&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;node1&amp;gt; mpirun -n 32 -host node2 hostname
node1&amp;gt; mpirun -n 32 -host node3 hostname&lt;/LI-CODE&gt;
&lt;P&gt;I can even specify multiple nodes:&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;node1&amp;gt; mpirun -n 96 -hosts localhost,node2,node3 hostname&lt;/LI-CODE&gt;
&lt;P&gt;This runs fine and I can see the output of node1, node2, and node3.&lt;/P&gt;
&lt;P&gt;Running the same commands from node2 works with node1, but hangs when attempted with node3, e.g.&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;node2&amp;gt; mpirun -n 32 -host node3 hostname&lt;/LI-CODE&gt;
&lt;P&gt;hangs. The opposite also does not work:&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;node3&amp;gt; mpirun -n 32 -host node2 hostname&lt;/LI-CODE&gt;
&lt;P&gt;However, the following two combinations work fine:&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;node2&amp;gt; mpirun -n 32 -host node1 hostname
node3&amp;gt; mpirun -n 32 -host node1 hostname&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I am unsure how to troubleshoot at this point. &amp;nbsp;Any suggestions would be gratefully received.&lt;/P&gt;
&lt;P&gt;The oneAPI version is the latest (as of today):&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Intel(R) MPI Library for Linux* OS, Version 2021.4 Build 20210831 (id: 758087adf)&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Copyright 2003-2021, Intel Corporation.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 24 Nov 2021 04:39:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-connect-problems/m-p/1338791#M8962</guid>
      <dc:creator>paul312</dc:creator>
      <dc:date>2021-11-24T04:39:00Z</dc:date>
    </item>
    <item>
      <title>Re: Intel MPI connect problems</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-connect-problems/m-p/1338860#M8964</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Thanks for reaching out to us.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The Intel MPI Library uses an SSH mechanism to access remote nodes. If SSH prompts for a password, the MPI application may hang.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;So, could you please check whether you can do passwordless ssh from node2 to node3 &amp;amp; from node3 to node2?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If passwordless ssh fails, then you need to establish a passwordless SSH connection between node2 and node3 to ensure proper communication of MPI processes.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;You can try to do:&lt;/P&gt;
&lt;P&gt;1. Check the SSH settings.&lt;/P&gt;
&lt;P&gt;2. Make sure that the passwordless authorization by public keys is enabled and configured.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If the issue still persists, then could you please provide the debug log for the below command on &lt;STRONG&gt;&lt;I&gt;node2&lt;/I&gt;&lt;/STRONG&gt;?&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;I_MPI_DEBUG=30 FI_LOG_LEVEL=debug mpirun -v -n 32 -host node3 hostname&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Also, please provide the debug log for either of the below two commands on &lt;STRONG&gt;&lt;I&gt;node2&lt;/I&gt;&lt;/STRONG&gt;.&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;clck -f nodefile -Fhealth_user //for user
clck -f nodefile -Fhealth_admin //for admin&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Thanks &amp;amp; Regards,&lt;/P&gt;
&lt;P&gt;Santosh&lt;/P&gt;</description>
      <pubDate>Wed, 24 Nov 2021 10:36:08 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-connect-problems/m-p/1338860#M8964</guid>
      <dc:creator>SantoshY_Intel</dc:creator>
      <dc:date>2021-11-24T10:36:08Z</dc:date>
    </item>
    <item>
      <title>Re: Intel MPI connect problems</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-connect-problems/m-p/1339140#M8975</link>
      <description>&lt;P&gt;I discovered that the vendor had made an error in the static IPv4 assignment: in the /etc/hosts file on node2, the address of node3 actually pointed to node1. Of course, I had checked that password-less ssh worked on both the 10 Gb ethernet and infiniband, but I didn't notice that when I logged into node3 from node2, the node1&amp;gt; prompt came up. In short, all nodes now connect via password-less ssh without error (and to the correct node!). Note that the 10 Gb ethernet addresses referred to here are node1, node2, and node3; the infiniband addresses are node1-ib, node2-ib, and node3-ib.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The problem with node2 persists and is, if anything, stranger than before. The details are listed below, but in short: node1 connects to all three nodes without problem; node3 connects to node1 and itself without problem; node2 connects to itself, but not to node1 or node3. Firewalls are disabled (for the moment); a "systemctl status firewalld" confirms this, as does the fact that node1 can connect to all three nodes.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Any idea of what to try next?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;1. From node1&lt;/P&gt;
&lt;P&gt;mpirun -n 96 -host localhost,node2-ib,node3-ib hostname&lt;/P&gt;
&lt;P&gt;&amp;gt; works on all three hosts without error&lt;/P&gt;
&lt;P&gt;2. from node1&lt;/P&gt;
&lt;P&gt;mpirun -n 32 -host node2 hostname&lt;/P&gt;
&lt;P&gt;&amp;gt; works without error&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;3. From node2&lt;/P&gt;
&lt;P&gt;mpirun -n 32 -host localhost hostname&lt;/P&gt;
&lt;P&gt;&amp;gt; works without problem&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;4. from node2&lt;/P&gt;
&lt;P&gt;mpirun -n 32 -host node3 hostname&lt;/P&gt;
&lt;P&gt;&amp;gt; hangs&lt;/P&gt;
&lt;P&gt;5. The command&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;I_MPI_DEBUG=30 FI_LOG_LEVEL=debug mpirun -v -n 32 -host node3 hostname&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&amp;gt; hangs and gives the output (in box) below. The "&lt;/SPAN&gt;&lt;SPAN&gt;clck -f nodefile -Fhealth_user" command, whether run as user or root, returns the error message:&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Nodefile could not be accessed: nodefile&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;(base) paulfons@tau:~&amp;gt;I_MPI_DEBUG=30 FI_LOG_LEVEL=debug mpirun -v -n 32 -host node3-ib  hostname
[mpiexec@tau] Launch arguments: /usr/bin/ssh -q -x node3-ib /opt/intel/oneapi/mpi/2021.4.0/bin//hydra_bstrap_proxy --upstream-host tau --upstream-port 35949 --pgid 0 --launcher ssh --launcher-number 0 --base-path /opt/intel/oneapi/mpi/2021.4.0/bin/ --tree-width 16 --tree-level 1 --time-left -1 --launch-type 2 --debug --proxy-id 0 --node-id 0 --subtree-size 1 /opt/intel/oneapi/mpi/2021.4.0/bin//hydra_pmi_proxy --usize -1 --auto-cleanup 1 --abort-signal 9 
[mpiexec@tau] check_exit_codes (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:117): unable to run bstrap_proxy on neutrino-ib (pid 20019, exit code 768)
[mpiexec@node2] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:159): check exit codes error
[mpiexec@node2] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:212): poll for event error
[mpiexec@node2] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:1062): error waiting for event
[mpiexec@tau] HYD_print_bstrap_setup_error_message (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:1015): error setting up the bootstrap proxies
[mpiexec@node2] Possible reasons:
[mpiexec@node2] 1. Host is unavailable. Please check that all hosts are available.
[mpiexec@node2] 2. Cannot launch hydra_bstrap_proxy or it crashed on one of the hosts. Make sure hydra_bstrap_proxy is available on all hosts and it has right permissions.
[mpiexec@node2] 3. Firewall refused connection. Check that enough ports are allowed in the firewall and specify them with the I_MPI_PORT_RANGE variable.
[mpiexec@ node2] 4. Ssh bootstrap cannot launch processes on remote host. Make sure that passwordless ssh connection is established across compute hosts.
[mpiexec@node2]    You may try using -bootstrap option to select alternative launcher.
(base) paulfons@node2:~&amp;gt;clck -f nodefile -Fhealth_admin 
Intel(R) Cluster Checker 2021 Update 4 (build 20210910)

Nodefile could not be accessed: nodefile
&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;mpirun -n 32 -host node1-ib hostname&lt;/P&gt;
&lt;P&gt;&amp;gt; hangs with the same error below:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;(base) me@node2:/data/temp&amp;gt;mpirun -n 32 -host node1  hostname
[mpiexec@node2] check_exit_codes (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:117): unable to run bstrap_proxy on muon (pid 24420, exit code 768)
[mpiexec@node2] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:159): check exit codes error
[mpiexec@node2] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:212): poll for event error
[mpiexec@node2] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:1062): error waiting for event
[mpiexec@node2] HYD_print_bstrap_setup_error_message (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:1015): error setting up the bootstrap proxies
[mpiexec@node2] Possible reasons:
[mpiexec@node2] 1. Host is unavailable. Please check that all hosts are available.
[mpiexec@node2] 2. Cannot launch hydra_bstrap_proxy or it crashed on one of the hosts. Make sure hydra_bstrap_proxy is available on all hosts and it has right permissions.
[mpiexec@node2] 3. Firewall refused connection. Check that enough ports are allowed in the firewall and specify them with the I_MPI_PORT_RANGE variable.
[mpiexec@node2] 4. Ssh bootstrap cannot launch processes on remote host. Make sure that passwordless ssh connection is established across compute hosts.
[mpiexec@node2]    You may try using -bootstrap option to select alternative launcher.
&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;mpirun -n 32 -host node3-ib hostname&lt;/P&gt;
&lt;P&gt;&amp;gt; hangs with the same error message below:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;base) me@node2:/data/temp&amp;gt;mpirun -n 32 -host node3 hostname
[mpiexec@node2] check_exit_codes (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:117): unable to run bstrap_proxy on neutrino (pid 24383, exit code 768)
[mpiexec@node2] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:159): check exit codes error
[mpiexec@node2] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:212): poll for event error
[mpiexec@node2] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:1062): error waiting for event
[mpiexec@tau] HYD_print_bstrap_setup_error_message (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:1015): error setting up the bootstrap proxies
[mpiexec@node2] Possible reasons:
[mpiexec@node2] 1. Host is unavailable. Please check that all hosts are available.
[mpiexec@node2] 2. Cannot launch hydra_bstrap_proxy or it crashed on one of the hosts. Make sure hydra_bstrap_proxy is available on all hosts and it has right permissions.
[mpiexec@node2] 3. Firewall refused connection. Check that enough ports are allowed in the firewall and specify them with the I_MPI_PORT_RANGE variable.
[mpiexec@node2] 4. Ssh bootstrap cannot launch processes on remote host. Make sure that passwordless ssh connection is established across compute hosts.
[mpiexec@node2]    You may try using -bootstrap option to select alternative launcher.
&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;6. from node3&lt;/P&gt;
&lt;P&gt;mpirun -n 32 -host node1 hostname&lt;/P&gt;
&lt;P&gt;&amp;gt; runs without error&lt;/P&gt;
&lt;P&gt;mpirun -n 32 -host node1-ib hostname&lt;/P&gt;
&lt;P&gt;&amp;gt; runs without error&lt;/P&gt;
&lt;P&gt;mpirun -n 32 -host node3 hostname&lt;/P&gt;
&lt;P&gt;&amp;gt; runs without error&lt;/P&gt;
&lt;P&gt;mpirun -n 32 -host node3-ib hostname&lt;/P&gt;
&lt;P&gt;&amp;gt; runs without error&lt;/P&gt;
&lt;P&gt;mpirun -n 32 -host node2-ib hostname&lt;/P&gt;
&lt;P&gt;&amp;gt; runs without error&lt;/P&gt;
&lt;P&gt;mpirun -n 32 -host node2 hostname&lt;/P&gt;
&lt;P&gt;&amp;gt; runs without error&lt;/P&gt;
&lt;P&gt;7. From node2&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 25 Nov 2021 08:54:46 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-connect-problems/m-p/1339140#M8975</guid>
      <dc:creator>paul312</dc:creator>
      <dc:date>2021-11-25T08:54:46Z</dc:date>
    </item>
    <item>
      <title>Re: Intel MPI connect problems</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-connect-problems/m-p/1339158#M8976</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;"nodefile" is a file containing the list of available nodes in the cluster.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="background-color: var(--lwc-colorbackgroundinput,#ffffff); color: var(--lwc-colortextweak,#3e3e3c);"&gt;If the nodefile doesn't exist, then it will throw the below error:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="nodefile.png" style="width: 562px;"&gt;&lt;img src="https://community.intel.com/t5/image/serverpage/image-id/24324i7751DE2C95C19BB9/image-size/large?v=v2&amp;amp;px=999&amp;amp;whitelist-exif-data=Orientation%2CResolution%2COriginalDefaultFinalSize%2CCopyright" role="button" title="nodefile.png" alt="nodefile.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;So, create a nodefile with the list of available nodes.&lt;/P&gt;
&lt;P&gt;For example:&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;$ cat nodefile
node1
node2
node3&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Now run the below command on node2:&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;clck -f nodefile -Fhealth_user&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The above command will generate clck_results.log &amp;amp; clck_execution_warnings.log along with some analysis as seen in the attached image.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Please provide us the clck_results.log &amp;amp; clck_execution_warnings.log&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Thanks &amp;amp; Regards,&lt;/P&gt;
&lt;P&gt;Santosh&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 25 Nov 2021 11:10:45 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-connect-problems/m-p/1339158#M8976</guid>
      <dc:creator>SantoshY_Intel</dc:creator>
      <dc:date>2021-11-25T11:10:45Z</dc:date>
    </item>
    <item>
      <title>Re: Intel MPI connect problems</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-connect-problems/m-p/1339392#M8978</link>
      <description>&lt;P&gt;Dear Santosh,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;I attempted to try the clck process, but the process appears to hang without progress (four hours or more). I attached the content below for reference.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;(base) paulfons@tau:~&amp;gt;I_MPI_DEBUG=30 FI_LOG_LEVEL=debug mpirun -v -n 32 -host neutrino  hostname
[mpiexec@tau] Launch arguments: /usr/bin/ssh -q -x neutrino /opt/intel/oneapi/mpi/2021.4.0/bin//hydra_bstrap_proxy --upstream-host tau --upstream-port 37470 --pgid 0 --launcher ssh --launcher-number 0 --base-path /opt/intel/oneapi/mpi/2021.4.0/bin/ --tree-width 16 --tree-level 1 --time-left -1 --launch-type 2 --debug --proxy-id 0 --node-id 0 --subtree-size 1 /opt/intel/oneapi/mpi/2021.4.0/bin//hydra_pmi_proxy --usize -1 --auto-cleanup 1 --abort-signal 9 
[mpiexec@tau] check_exit_codes (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:117): unable to run bstrap_proxy on neutrino (pid 43869, exit code 768)
[mpiexec@tau] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:159): check exit codes error
[mpiexec@tau] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:212): poll for event error
[mpiexec@tau] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:1062): error waiting for event
[mpiexec@tau] HYD_print_bstrap_setup_error_message (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:1015): error setting up the bootstrap proxies
[mpiexec@tau] Possible reasons:
[mpiexec@tau] 1. Host is unavailable. Please check that all hosts are available.
[mpiexec@tau] 2. Cannot launch hydra_bstrap_proxy or it crashed on one of the hosts. Make sure hydra_bstrap_proxy is available on all hosts and it has right permissions.
[mpiexec@tau] 3. Firewall refused connection. Check that enough ports are allowed in the firewall and specify them with the I_MPI_PORT_RANGE variable.
[mpiexec@node2] 4. Ssh bootstrap cannot launch processes on remote host. Make sure that passwordless ssh connection is established across compute hosts.
[mpiexec@node2]    You may try using -bootstrap option to select alternative launcher.
(base) paulfons@tau:~&amp;gt;clck -f nodefile -Fhealth_user
Intel(R) Cluster Checker 2021 Update 4 (build 20210910)

Running Collect

...
&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;In addition, I also immediately tried the referenced ssh command "/usr/bin/ssh -x -q node3" and it worked without error.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Do you have any further suggestions?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 26 Nov 2021 08:16:01 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-connect-problems/m-p/1339392#M8978</guid>
      <dc:creator>paul312</dc:creator>
      <dc:date>2021-11-26T08:16:01Z</dc:date>
    </item>
    <item>
      <title>Re: Intel MPI connect problems</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-connect-problems/m-p/1340885#M8997</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Please try the below steps on the &lt;STRONG&gt;node2&lt;/STRONG&gt; command prompt for adding the IP addresses of node1 and node3 to the list of known hosts on &lt;STRONG&gt;node2&lt;/STRONG&gt;:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;ssh &amp;lt;ip address of node1&amp;gt;
ssh &amp;lt;ip address of node3&amp;gt;&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The above commands might give you the below statement:&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;"Are you sure you want to continue connecting (yes/no/[fingerprint])?"&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;**&lt;/EM&gt;For the above statement provide your answer as "&lt;STRONG&gt;yes&lt;/STRONG&gt;".&lt;/P&gt;
&lt;P&gt;A warning will be displayed which confirms that the IP address is added to the list of known hosts as below:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;Warning: Permanently added &amp;lt;ip address of node 1&amp;gt;[&amp;lt;ip address of node3&amp;gt;] to the list of known hosts.&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;After a successful ssh to node1/node3, you will be at node1's/node3's command prompt, so run the command "&lt;STRONG&gt;exit&lt;/STRONG&gt;" to return to the node2 terminal.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;After adding both&lt;STRONG&gt; IP addresses&lt;/STRONG&gt; of node1 and node3 to the list of known hosts on node2, try the below command on &lt;STRONG&gt;node2&lt;/STRONG&gt;:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;mpirun -bootstrap ssh -n 6 -ppn 2 -hosts &amp;lt;IP address of node1&amp;gt;,&amp;lt;IP address of node 2&amp;gt;,&amp;lt;IP address of node3&amp;gt; hostname&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;We tried the above steps and were able to run successfully as shown in the below screenshot.&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="node2.png" style="width: 999px;"&gt;&lt;img src="https://community.intel.com/t5/image/serverpage/image-id/24490iC46BB468F4C04E41/image-size/large?v=v2&amp;amp;px=999&amp;amp;whitelist-exif-data=Orientation%2CResolution%2COriginalDefaultFinalSize%2CCopyright" role="button" title="node2.png" alt="node2.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;So, could you please try the above steps and let us know whether it resolves your issue?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Also, we have observed an inconsistency in the hostname (tau &amp;amp; node2) displayed in the debug log you provided (as highlighted in yellow &amp;amp; green in the below screenshot).&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="inconsistent.png" style="width: 999px;"&gt;&lt;img src="https://community.intel.com/t5/image/serverpage/image-id/24491i1F81E082ADC90867/image-size/large?v=v2&amp;amp;px=999&amp;amp;whitelist-exif-data=Orientation%2CResolution%2COriginalDefaultFinalSize%2CCopyright" role="button" title="inconsistent.png" alt="inconsistent.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;A hostname that changes dynamically while an MPI program is running might also cause the problem. Could you please let us know whether you have any idea why the hostname is changing at your end?&lt;/P&gt;
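As a quick sanity check (a sketch, not from the original reply), the kernel node name and the resolved hostname can be compared on each node; on Linux the two should agree and remain stable between runs:

```shell
# The two names below should match and stay stable across invocations;
# a mismatch, or a name that changes between runs, can confuse the
# Hydra process launcher.
uname -n
hostname
```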
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Thanks &amp;amp; Regards,&lt;/P&gt;
&lt;P&gt;Santosh&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 02 Dec 2021 12:34:51 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-connect-problems/m-p/1340885#M8997</guid>
      <dc:creator>SantoshY_Intel</dc:creator>
      <dc:date>2021-12-02T12:34:51Z</dc:date>
    </item>
    <item>
      <title>Re:Intel MPI connect problems</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-connect-problems/m-p/1343072#M9013</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;We haven't heard back from you. Could you please provide an update on your issue? Please get back to us if the issue still persists.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Thanks &amp;amp; Regards,&lt;/P&gt;&lt;P&gt;Santosh&lt;/P&gt;&lt;BR /&gt;</description>
      <pubDate>Fri, 10 Dec 2021 08:19:29 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-connect-problems/m-p/1343072#M9013</guid>
      <dc:creator>SantoshY_Intel</dc:creator>
      <dc:date>2021-12-10T08:19:29Z</dc:date>
    </item>
    <item>
      <title>Re: Re:Intel MPI connect problems</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-connect-problems/m-p/1343077#M9014</link>
      <description>&lt;P&gt;Dear Santosh,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;I am sorry to be so tardy in replying. I did check that ssh works between the nodes. The hostname difference was me trying to be clever: the real node name is tau, but I used an editor to change the node names to node1, node2, and node3 to make the logins easier to follow. Below is what happens when I connect to node2 (tau) and try to connect to node3 or node1. Note that the "-ib" suffix refers to the infiniband network; a name without a suffix is the 10 Gb ethernet port. There are no problems logging in, but a simple "mpirun -n 32 -host node-ib hostname" hangs.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;(base) user@node2:/data/Vasp/Cu/relax&amp;gt;ssh node3-ib&lt;BR /&gt;Last login: Tue Nov 30 12:28:58 2021 from 172.17.69.249&lt;BR /&gt;(base) user@node3:~&amp;gt;exit&lt;BR /&gt;logout&lt;BR /&gt;Connection to node3-ib closed.&lt;BR /&gt;(base) user@node2:/data/Vasp/Cu/relax&amp;gt;ssh node1-ib&lt;BR /&gt;Last login: Fri Dec 10 16:11:22 2021 from 172.17.69.249&lt;BR /&gt;(base) user@node1:~&amp;gt;exit&lt;BR /&gt;logout&lt;BR /&gt;Connection to node1-ib closed.&lt;BR /&gt;(base) user@node2:/data/Vasp/Cu/relax&amp;gt;ssh node3&lt;BR /&gt;Last login: Fri Dec 10 17:32:02 2021 from 192.168.1.3&lt;BR /&gt;(base) user@node3:~&amp;gt;exit&lt;BR /&gt;logout&lt;BR /&gt;Connection to node3 closed.&lt;BR /&gt;(base) user@node2:/data/Vasp/Cu/relax&amp;gt;ssh node1&lt;BR /&gt;Last login: Fri Dec 10 17:30:26 2021 from 192.168.1.3&lt;BR /&gt;(base) user@node1:~&amp;gt;exit&lt;BR /&gt;logout&lt;BR /&gt;Connection to node1 closed.&lt;BR /&gt;(base) user@node2:/data/Vasp/Cu/relax&amp;gt;ssh node3&lt;BR /&gt;Last login: Fri Dec 10 17:32:27 2021 from 172.17.69.3&lt;BR /&gt;ssh (base) user@node3:~&amp;gt;ssh node2-ib&lt;BR /&gt;Last login: Fri Dec 10 16:11:38 2021 from 172.17.69.249&lt;BR /&gt;(base) user@node2:~&amp;gt;exit&lt;BR /&gt;logout&lt;BR /&gt;Connection to node2-ib closed.&lt;BR /&gt;(base) user@node3:~&amp;gt;ssh node2&lt;BR /&gt;Last login: Fri Dec 10 17:31:08 2021 from 192.168.1.4&lt;BR /&gt;(base) user@node2:~&amp;gt;exit&lt;BR /&gt;logout&lt;BR /&gt;Connection to node2 closed.&lt;BR /&gt;(base) user@node3:~&amp;gt;exit&lt;BR /&gt;logout&lt;BR /&gt;Connection to node3 closed.&lt;/P&gt;</description>
      <pubDate>Fri, 10 Dec 2021 08:37:17 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-connect-problems/m-p/1343077#M9014</guid>
      <dc:creator>paul312</dc:creator>
      <dc:date>2021-12-10T08:37:17Z</dc:date>
    </item>
    <item>
      <title>Re: Intel MPI connect problems</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-connect-problems/m-p/1343103#M9015</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;&amp;gt;&amp;gt;"There are no problems logging in, but a simple "mpirun -n 32 -host node-ib hostname" hangs."&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;Could you please try using the "&lt;STRONG&gt;IP address of node-ib&lt;/STRONG&gt;" instead of "node-ib"?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;FI_PROVIDER=mlx mpirun -n 32 -host IPaddress-of-node hostname&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;For finding the IP address, use the below command on node-ib:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;ifconfig&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Please let us know whether &lt;EM&gt;"mpirun -n 32 -host IPaddress-of-node hostname"&lt;/EM&gt; works or still hangs.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Thanks &amp;amp; Regards,&lt;/P&gt;
&lt;P&gt;Santosh&lt;/P&gt;</description>
      <pubDate>Fri, 24 Dec 2021 10:22:25 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-connect-problems/m-p/1343103#M9015</guid>
      <dc:creator>SantoshY_Intel</dc:creator>
      <dc:date>2021-12-24T10:22:25Z</dc:date>
    </item>
    <item>
      <title>Re:Intel MPI connect problems</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-connect-problems/m-p/1344923#M9030</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;We haven't heard back from you. Could you please provide an update on your issue? Please get back to us if the issue still persists.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Thanks &amp;amp; Regards,&lt;/P&gt;&lt;P&gt;Santosh&lt;/P&gt;&lt;BR /&gt;</description>
      <pubDate>Fri, 17 Dec 2021 12:19:46 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-connect-problems/m-p/1344923#M9030</guid>
      <dc:creator>SantoshY_Intel</dc:creator>
      <dc:date>2021-12-17T12:19:46Z</dc:date>
    </item>
    <item>
      <title>Re:Intel MPI connect problems</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-connect-problems/m-p/1346474#M9045</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;We assume that your issue is resolved. If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Thanks &amp;amp; Regards,&lt;/P&gt;&lt;P&gt;Santosh&lt;/P&gt;&lt;BR /&gt;</description>
      <pubDate>Fri, 24 Dec 2021 10:18:54 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-connect-problems/m-p/1346474#M9045</guid>
      <dc:creator>SantoshY_Intel</dc:creator>
      <dc:date>2021-12-24T10:18:54Z</dc:date>
    </item>
    <item>
      <title>Re: Re:Intel MPI connect problems</title>
      <link>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-connect-problems/m-p/1346593#M9048</link>
      <description>&lt;P&gt;Hi Santosh,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;I am sorry for the delay in replying. In effect, I have been working around the problem, since I can use mpirun to run jobs on any combination of the three nodes from node1 and node3. The problem is that I cannot submit jobs from node2 (except to node2 itself). The mpirun command hangs on node2 when directed to run a job on node1 or node3, regardless of whether the infiniband or the 10 Gb ethernet IP addresses are used. I am at a loss as to what to try next: infiniband and mpi run fine when initiated from node1 or node3, which implies that node2 is connected properly to the network. I also installed the latest oneAPI version to make sure that the software was the same on all three nodes (it ostensibly was before, but a version from earlier this year). Any ideas as to how to debug next with this new info?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Best wishes,&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Paul&lt;/P&gt;</description>
      <pubDate>Sat, 25 Dec 2021 06:15:38 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-MPI-Library/Intel-MPI-connect-problems/m-p/1346593#M9048</guid>
      <dc:creator>paul312</dc:creator>
      <dc:date>2021-12-25T06:15:38Z</dc:date>
    </item>
  </channel>
</rss>

