I have just installed the Intel MPI Library 5.1.0.038 into a shared NFS space meant for use by my team, who share four machines connected by Ethernet. I have a machines.LINUX file, saved in the NFS space, listing the hostname of each machine. My intention is for us to use the library over the Ethernet connection, but I am not sure that I have configured it correctly to do this. I am not the administrator of the machines and so I do not have explicit root access; however, I am in the sudoers group. During the installation, I was not able to use the "Run using sudo privileges and password for system wide access for all users" option because permission to the NFS space is denied to root. I think this happens because root does not have access to the NFS space (even though I am running the installation from this space). I have only been able to install the library as myself, "as current user to limit access to user level." Are there any workarounds I should try so that my team can use an NFS-installed MPI library?
Thank you,
Rashawn
Rashawn:
For your installation of the Intel MPI Library ("as current user to limit access to user level"), and depending on their shell environment, can the team members successfully source either:
.../impi/5.1.0.038/intel64/bin/mpivars.sh intel64
or,
.../impi/5.1.0.038/intel64/bin/mpivars.csh intel64
If so, you and the team members should be able to successfully run MPI applications on your cluster.
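As a rough sketch (the prefix below is only a placeholder for wherever your shared NFS installation actually lives), each team member could source the appropriate script once per session, for example from their shell startup file:
# bash-family shells, e.g. in ~/.bashrc (hypothetical NFS prefix)
. /nfs/shared/intel/impi/5.1.0.038/intel64/bin/mpivars.sh
# csh-family shells, e.g. in ~/.cshrc
source /nfs/shared/intel/impi/5.1.0.038/intel64/bin/mpivars.csh
After sourcing, "which mpirun" should resolve into the shared installation.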
-Steve
In reply to Steve, all members of the team are able to invoke the MPI environment with the mpivars.* scripts:
source <pathToIMPIbuild>/MPI/intel-5.1.0.038/bin/compilervars.csh -arch intel64 -platform linux
We are all able to run a very simple mpirun command intended to run the sample application, test.c, on all four of the nodes using tcp:
[rlknapp@fxcoral001 rashawn]$ mpirun -n 4 -perhost 1 ./testc-intelMPI
Hello world: rank 0 of 4 running on fxcoral001
Hello world: rank 1 of 4 running on fxcoral001
Hello world: rank 2 of 4 running on fxcoral001
Hello world: rank 3 of 4 running on fxcoral001
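(In case it helps anyone reproducing this, the bundled test.c can be built with the MPI compiler wrappers roughly as follows; the source path is a placeholder for wherever the sample sits in the installation.)
# mpicc uses GCC; mpiicc uses the Intel compilers, if those are also in the environment
mpicc <pathTo>/test.c -o testc-intelMPI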
I have a machines.LINUX file which lists the hostname of each of the four nodes:
fxcoral001
fxcoral002
fxcoral003
fxcoral004
When I point the mpirun statement to the machine file like this:
mpirun -n 4 -perhost 1 -hostfile <pathTo>/machines.LINUX ./testc-intelMPI
I get this set of errors:
[proxy:0:1@fxcoral002] HYDU_sock_connect (../../utils/sock/sock.c:268): unable to connect from "fxcoral002" to "fxcoral001.fx.intel.com" (No route to host)
[proxy:0:1@fxcoral002] main (../../pm/pmiserv/pmip.c:414): unable to connect to server fxcoral001.fx.intel.com at port 51788 (check for firewalls!)
[proxy:0:2@fxcoral003] HYDU_sock_connect (../../utils/sock/sock.c:268): unable to connect from "fxcoral003" to "fxcoral001.fx.intel.com" (No route to host)
[proxy:0:2@fxcoral003] main (../../pm/pmiserv/pmip.c:414): unable to connect to server fxcoral001.fx.intel.com at port 51788 (check for firewalls!)
[proxy:0:3@fxcoral004] HYDU_sock_connect (../../utils/sock/sock.c:268): unable to connect from "fxcoral004" to "fxcoral001.fx.intel.com" (No route to host)
[proxy:0:3@fxcoral004] main (../../pm/pmiserv/pmip.c:414): unable to connect to server fxcoral001.fx.intel.com at port 51788 (check for firewalls!)
I have to abort the run when this happens.
I get the same output when I call:
mpiexec.hydra -f <pathTo>/machines.LINUX -n 4 -ppn 1 ./testc-intelMPI
My goal is to run Intel MPI applications using tcp as the fabric. I experimented with setting I_MPI_FABRICS_LIST to tcp, and I also passed I_MPI_FABRICS on the command line:
mpirun -n 4 -perhost 1 -env I_MPI_FABRICS tcp -hostfile <pathTo>/machines.LINUX ./testc-intelMPI
The result was the same set of error messages shown above.
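For reference, the environment-variable form of those settings in a bash-family shell looks roughly like this (I_MPI_DEBUG is optional and only adds startup verbosity):
export I_MPI_FABRICS=tcp        # use tcp for both intra- and inter-node communication
export I_MPI_FABRICS_LIST=tcp   # restrict the fallback list to tcp
export I_MPI_DEBUG=5            # optional: print which fabric is actually selected
mpirun -n 4 -perhost 1 -hostfile <pathTo>/machines.LINUX ./testc-intelMPI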
What should I do to enable running Intel MPI over tcp?
Thank you,
Rashawn
Hello Rashawn,
It seems the proxies are unable to connect from the hosts specified in your hostfile (machines.LINUX, with short hostnames) back to the launching node (which is detected automatically by its FQDN).
Could you please check that the following network paths work via ssh:
fxcoral002 <-> fxcoral001.fx.intel.com
fxcoral003 <-> fxcoral001.fx.intel.com
fxcoral004 <-> fxcoral001.fx.intel.com
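For example, something like this from each side (a sketch; hostnames as in your hostfile, the FQDN as Hydra reports it):
# from fxcoral001 to each compute node
ssh fxcoral002 hostname
ssh fxcoral003 hostname
ssh fxcoral004 hostname
# and from each compute node back to the launching node
ssh fxcoral001.fx.intel.com hostname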
Artem,
Thank you for reminding me to check this basic thing; there was an issue with one of the nodes where I had to delete the offending key from my ~/.ssh/known_hosts file. I thought for sure this would clear up the problem, but it has not.
Here is the output from testing the SSH logins from fxcoral001.fx.intel.com to the other three machines and in the other direction:
[rlknapp@fxcoral002 ~]$ ssh -2 -Y fxcoral001.fx.intel.com
Last login: Wed Jul 22 14:26:55 2015 from fxtcarilab030.fx.intel.com
[rlknapp@fxcoral001 ~]$
[rlknapp@fxcoral001 ~]$ ssh -2 -Y fxcoral002
Last login: Wed Jul 22 14:32:31 2015 from fxtcarilab030.fx.intel.com
[rlknapp@fxcoral002 ~]$
[rlknapp@fxcoral003 ~]$ ssh -2 -Y fxcoral001.fx.intel.com
Last login: Wed Jul 22 14:33:25 2015 from fxcoral002.fx.intel.com
[rlknapp@fxcoral001 ~]$ ssh -2 -Y fxcoral003
Last login: Wed Jul 22 14:35:25 2015 from fxtcarilab030.fx.intel.com
[rlknapp@fxcoral003 ~]$
[rlknapp@fxcoral004 ~]$ ssh -2 -Y fxcoral001.fx.intel.com
Last login: Wed Jul 22 14:36:37 2015 from fxcoral003.fx.intel.com
[rlknapp@fxcoral001 ~]$ ssh -2 -Y fxcoral004
Last login: Wed Jul 22 14:38:56 2015 from fxtcarilab030.fx.intel.com
[rlknapp@fxcoral004 ~]$
When I execute the application with mpirun, I get the errors reported previously:
[rlknapp@fxcoral001 rashawn]$ mpirun -n 4 -perhost 1 -env I_MPI_FABRICS tcp -hostfile <pathTo>/machines.LINUX ./testc-intelMPI
[proxy:0:2@fxcoral003] HYDU_sock_connect (../../utils/sock/sock.c:268): unable to connect from "fxcoral003" to "fxcoral001.fx.intel.com" (No route to host)
[proxy:0:2@fxcoral003] main (../../pm/pmiserv/pmip.c:414): unable to connect to server fxcoral001.fx.intel.com at port 52783 (check for firewalls!)
[proxy:0:1@fxcoral002] HYDU_sock_connect (../../utils/sock/sock.c:268): unable to connect from "fxcoral002" to "fxcoral001.fx.intel.com" (No route to host)
[proxy:0:1@fxcoral002] main (../../pm/pmiserv/pmip.c:414): unable to connect to server fxcoral001.fx.intel.com at port 52783 (check for firewalls!)
[proxy:0:3@fxcoral004] HYDU_sock_connect (../../utils/sock/sock.c:268): unable to connect from "fxcoral004" to "fxcoral001.fx.intel.com" (No route to host)
[proxy:0:3@fxcoral004] main (../../pm/pmiserv/pmip.c:414): unable to connect to server fxcoral001.fx.intel.com at port 52783 (check for firewalls!)
^C[mpiexec@fxcoral001] Sending Ctrl-C to processes as requested
[mpiexec@fxcoral001] Press Ctrl-C again to force abort
[mpiexec@fxcoral001] HYDU_sock_write (../../utils/sock/sock.c:417): write error (Bad file descriptor)
[mpiexec@fxcoral001] HYD_pmcd_pmiserv_send_signal (../../pm/pmiserv/pmiserv_cb.c:244): unable to write data to proxy
[mpiexec@fxcoral001] ui_cmd_cb (../../pm/pmiserv/pmiserv_pmci.c:175): unable to send signal downstream
[mpiexec@fxcoral001] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@fxcoral001] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:501): error waiting for event
[mpiexec@fxcoral001] main (../../ui/mpich/mpiexec.c:1119): process manager error waiting for completion
I am not sure what I am doing wrong at this point, but I believe it should be possible to run Intel MPI using tcp as the fabric. Are there other options I should provide to mpirun, or is this a possible system issue (e.g., a configuration file that overrides the SSH settings)?
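Since ssh clearly works in both directions, the next thing I plan to check (a sketch; assuming nc and the usual firewall tools are available on these nodes) is whether an arbitrary TCP port on the launch node is reachable at all, and whether a firewall is active:
# on fxcoral001: listen on a test port (any free high port; nc syntax can vary between netcat variants)
nc -l 51788
# on fxcoral002/003/004: try to reach that port
nc -vz fxcoral001.fx.intel.com 51788
# check for an active firewall on each node (one of these, depending on the distribution)
sudo systemctl status firewalld
sudo iptables -L -n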
Thank you,
Rashawn
Update:
Firewalls were enabled on the nodes; a system administrator in FX, where the machines are located, disabled them. This solved the issue:
[rlknapp@fxcoral001 rashawn]$ mpirun -n 4 -perhost 1 -env I_MPI_FABRICS tcp -hostfile machines.LINUX ./testc-intelMPI
Hello world: rank 0 of 4 running on fxcoral001
Hello world: rank 1 of 4 running on fxcoral002
Hello world: rank 2 of 4 running on fxcoral003
Hello world: rank 3 of 4 running on fxcoral004
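For anyone who hits the same errors but cannot have the firewall turned off entirely, an alternative worth discussing with the administrator (a sketch only; the exact commands and the subnet depend on the system) is to trust the cluster-internal subnet so that the ports Hydra picks dynamically are reachable:
# firewalld-based systems (subnet is a hypothetical example)
sudo firewall-cmd --permanent --zone=trusted --add-source=10.10.0.0/24
sudo firewall-cmd --reload
# iptables-based systems
sudo iptables -I INPUT -s 10.10.0.0/24 -j ACCEPT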
Thank you all for your assistance,
Rashawn
