I have just installed the Intel MPI Library 5.1.0.038 into a shared NFS space meant for use by my team, who share four machines connected by Ethernet. I have a machines.LINUX file, saved in the NFS space, listing the hostname of each machine. My intention is for us to use the library over the Ethernet connection, but I am not sure that I have configured it correctly to do this. I am not the administrator of the machines and so I do not have explicit root access; however, I am in the sudoers group. During the installation, I was not able to use the "Run using sudo privileges and password for system wide access for all users" option because permission to the NFS space is denied to root. I think this happens because root does not have access to the NFS space (even though I am running the installation from this space). I have only been able to install the library as myself, "as current user to limit access to user level." Are there any workarounds I should try so that my team can use an NFS-installed MPI library?
Thank you,
Rashawn
Rashawn:
For your installation of the Intel MPI Library ("as current user to limit access to user level"), and depending on their shell environment, can the team members successfully source either:
.../impi/5.1.0.038/intel64/bin/mpivars.sh intel64
or,
.../impi/5.1.0.038/intel64/bin/mpivars.csh intel64
If so, you and the team members should be able to successfully run MPI applications on your cluster.
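As a rough sketch (the prefix below is only a placeholder for wherever your shared NFS installation actually lives), each team member could source the appropriate script once per session, for example from their shell startup file:
# bash-family shells, e.g. in ~/.bashrc (hypothetical NFS prefix)
. /nfs/shared/intel/impi/5.1.0.038/intel64/bin/mpivars.sh
# csh-family shells, e.g. in ~/.cshrc
source /nfs/shared/intel/impi/5.1.0.038/intel64/bin/mpivars.csh
After sourcing, "which mpirun" should resolve into the shared installation.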
-Steve
In reply to Steve, all members of the team are able to invoke the MPI environment with the mpivars.* scripts:
source <pathToIMPIbuild>/MPI/intel-5.1.0.038/bin/compilervars.csh -arch intel64 -platform linux
We are all able to run a very simple mpirun command intended to run the sample application, test.c, on all four of the nodes using tcp:
[rlknapp@fxcoral001 rashawn]$ mpirun -n 4 -perhost 1 ./testc-intelMPI
Hello world: rank 0 of 4 running on fxcoral001
Hello world: rank 1 of 4 running on fxcoral001
Hello world: rank 2 of 4 running on fxcoral001
Hello world: rank 3 of 4 running on fxcoral001
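(In case it helps anyone reproducing this, the bundled test.c can be built with the MPI compiler wrappers roughly as follows; the source path is a placeholder for wherever the sample sits in the installation.)
# mpicc uses GCC; mpiicc uses the Intel compilers, if those are also in the environment
mpicc <pathTo>/test.c -o testc-intelMPI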
I have a machines.LINUX file which lists the hostname of each of the four nodes:
fxcoral001
fxcoral002
fxcoral003
fxcoral004
When I point the mpirun statement to the machine file like this:
mpirun -n 4 -perhost 1 -hostfile <pathTo>/machines.LINUX ./testc-intelMPI
I get this set of errors:
[proxy:0:1@fxcoral002] HYDU_sock_connect (../../utils/sock/sock.c:268): unable to connect from "fxcoral002" to "fxcoral001.fx.intel.com" (No route to host)
[proxy:0:1@fxcoral002] main (../../pm/pmiserv/pmip.c:414): unable to connect to server fxcoral001.fx.intel.com at port 51788 (check for firewalls!)
[proxy:0:2@fxcoral003] HYDU_sock_connect (../../utils/sock/sock.c:268): unable to connect from "fxcoral003" to "fxcoral001.fx.intel.com" (No route to host)
[proxy:0:2@fxcoral003] main (../../pm/pmiserv/pmip.c:414): unable to connect to server fxcoral001.fx.intel.com at port 51788 (check for firewalls!)
[proxy:0:3@fxcoral004] HYDU_sock_connect (../../utils/sock/sock.c:268): unable to connect from "fxcoral004" to "fxcoral001.fx.intel.com" (No route to host)
[proxy:0:3@fxcoral004] main (../../pm/pmiserv/pmip.c:414): unable to connect to server fxcoral001.fx.intel.com at port 51788 (check for firewalls!)
I have to abort the run when this happens.
I get the same output when I call:
mpiexec.hydra -f <pathTo>/machines.LINUX -n 4 -ppn 1 ./testc-intelMPI
My goal is to run Intel MPI applications using tcp as the fabric. I experimented with setting I_MPI_FABRICS_LIST to tcp, and I also passed I_MPI_FABRICS on the command line:
mpirun -n 4 -perhost 1 -env I_MPI_FABRICS tcp -hostfile <pathTo>/machines.LINUX ./testc-intelMPI
The result was the same set of error messages shown above.
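For reference, the environment-variable form of those settings in a bash-family shell looks roughly like this (I_MPI_DEBUG is optional and only adds startup verbosity):
export I_MPI_FABRICS=tcp        # use tcp for both intra- and inter-node communication
export I_MPI_FABRICS_LIST=tcp   # restrict the fallback list to tcp
export I_MPI_DEBUG=5            # optional: print which fabric is actually selected
mpirun -n 4 -perhost 1 -hostfile <pathTo>/machines.LINUX ./testc-intelMPI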
What should I do to enable running Intel MPI over tcp?
Thank you,
Rashawn
Hello Rashawn,
It seems the proxies are unable to connect from the hosts specified in your hostfile (machines.LINUX, with short hostnames) back to the launching node (which is detected automatically by its FQDN).
Could you please check that the following network paths work via ssh:
fxcoral002 <-> fxcoral001.fx.intel.com
fxcoral003 <-> fxcoral001.fx.intel.com
fxcoral004 <-> fxcoral001.fx.intel.com
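For example, something like this from each side (a sketch; hostnames as in your hostfile, the FQDN as Hydra reports it):
# from fxcoral001 to each compute node
ssh fxcoral002 hostname
ssh fxcoral003 hostname
ssh fxcoral004 hostname
# and from each compute node back to the launching node
ssh fxcoral001.fx.intel.com hostname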
Artem,
Thank you for reminding me to check this basic thing; there was an issue with one of the nodes where I had to delete the offending key from my ~/.ssh/known_hosts file. I thought for sure this would clear up the problem, but it has not.
Here is the output from testing the SSH logins from fxcoral001.fx.intel.com to the other three machines and in the other direction:
[rlknapp@fxcoral002 ~]$ ssh -2 -Y fxcoral001.fx.intel.com
Last login: Wed Jul 22 14:26:55 2015 from fxtcarilab030.fx.intel.com
[rlknapp@fxcoral001 ~]$
[rlknapp@fxcoral001 ~]$ ssh -2 -Y fxcoral002
Last login: Wed Jul 22 14:32:31 2015 from fxtcarilab030.fx.intel.com
[rlknapp@fxcoral002 ~]$
[rlknapp@fxcoral003 ~]$ ssh -2 -Y fxcoral001.fx.intel.com
Last login: Wed Jul 22 14:33:25 2015 from fxcoral002.fx.intel.com
[rlknapp@fxcoral001 ~]$ ssh -2 -Y fxcoral003
Last login: Wed Jul 22 14:35:25 2015 from fxtcarilab030.fx.intel.com
[rlknapp@fxcoral003 ~]$
[rlknapp@fxcoral004 ~]$ ssh -2 -Y fxcoral001.fx.intel.com
Last login: Wed Jul 22 14:36:37 2015 from fxcoral003.fx.intel.com
[rlknapp@fxcoral001 ~]$ ssh -2 -Y fxcoral004
Last login: Wed Jul 22 14:38:56 2015 from fxtcarilab030.fx.intel.com
[rlknapp@fxcoral004 ~]$
When I execute the application with mpirun, I get the errors reported previously:
[rlknapp@fxcoral001 rashawn]$ mpirun -n 4 -perhost 1 -env I_MPI_FABRICS tcp -hostfile <pathTo>/machines.LINUX ./testc-intelMPI
[proxy:0:2@fxcoral003] HYDU_sock_connect (../../utils/sock/sock.c:268): unable to connect from "fxcoral003" to "fxcoral001.fx.intel.com" (No route to host)
[proxy:0:2@fxcoral003] main (../../pm/pmiserv/pmip.c:414): unable to connect to server fxcoral001.fx.intel.com at port 52783 (check for firewalls!)
[proxy:0:1@fxcoral002] HYDU_sock_connect (../../utils/sock/sock.c:268): unable to connect from "fxcoral002" to "fxcoral001.fx.intel.com" (No route to host)
[proxy:0:1@fxcoral002] main (../../pm/pmiserv/pmip.c:414): unable to connect to server fxcoral001.fx.intel.com at port 52783 (check for firewalls!)
[proxy:0:3@fxcoral004] HYDU_sock_connect (../../utils/sock/sock.c:268): unable to connect from "fxcoral004" to "fxcoral001.fx.intel.com" (No route to host)
[proxy:0:3@fxcoral004] main (../../pm/pmiserv/pmip.c:414): unable to connect to server fxcoral001.fx.intel.com at port 52783 (check for firewalls!)
^C[mpiexec@fxcoral001] Sending Ctrl-C to processes as requested
[mpiexec@fxcoral001] Press Ctrl-C again to force abort
[mpiexec@fxcoral001] HYDU_sock_write (../../utils/sock/sock.c:417): write error (Bad file descriptor)
[mpiexec@fxcoral001] HYD_pmcd_pmiserv_send_signal (../../pm/pmiserv/pmiserv_cb.c:244): unable to write data to proxy
[mpiexec@fxcoral001] ui_cmd_cb (../../pm/pmiserv/pmiserv_pmci.c:175): unable to send signal downstream
[mpiexec@fxcoral001] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@fxcoral001] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:501): error waiting for event
[mpiexec@fxcoral001] main (../../ui/mpich/mpiexec.c:1119): process manager error waiting for completion
I am not sure what I am doing wrong at this point, but I believe it should be possible to run Intel MPI using tcp as the fabric. Are there other options I should provide to mpirun, or is this a possible system issue (e.g., a configuration file that overrides the SSH settings)?
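Since ssh clearly works in both directions, the next thing I plan to check (a sketch; assuming nc and the usual firewall tools are available on these nodes) is whether an arbitrary TCP port on the launch node is reachable at all, and whether a firewall is active:
# on fxcoral001: listen on a test port (any free high port; nc syntax can vary between netcat variants)
nc -l 51788
# on fxcoral002/003/004: try to reach that port
nc -vz fxcoral001.fx.intel.com 51788
# check for an active firewall on each node (one of these, depending on the distribution)
sudo systemctl status firewalld
sudo iptables -L -n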
Thank you,
Rashawn
Update:
Firewalls were enabled on the nodes; a system administrator in FX, where the machines are located, disabled them. This solved the issue:
[rlknapp@fxcoral001 rashawn]$ mpirun -n 4 -perhost 1 -env I_MPI_FABRICS tcp -hostfile machines.LINUX ./testc-intelMPI
Hello world: rank 0 of 4 running on fxcoral001
Hello world: rank 1 of 4 running on fxcoral002
Hello world: rank 2 of 4 running on fxcoral003
Hello world: rank 3 of 4 running on fxcoral004
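For anyone who hits the same errors but cannot have the firewall turned off entirely, an alternative worth discussing with the administrator (a sketch only; the exact commands and the subnet depend on the system) is to trust the cluster-internal subnet so that the ports Hydra picks dynamically are reachable:
# firewalld-based systems (subnet is a hypothetical example)
sudo firewall-cmd --permanent --zone=trusted --add-source=10.10.0.0/24
sudo firewall-cmd --reload
# iptables-based systems
sudo iptables -I INPUT -s 10.10.0.0/24 -j ACCEPT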
Thank you all for your assistance,
Rashawn
