
Cannot start MPI job from host system [me too, but SOLVED]

Anders_H_
Beginner

Hi, 

I wasn't sure whether to start a new thread, since I seem to have exactly the same problem described here: https://software.intel.com/en-us/forums/intel-many-integrated-core/topic/599473 , but nothing in that thread solved it for me.

I am running RHEL 6.5 and OFED 3.18.1 with 4 Xeon Phi cards in an external static bridge. The IP addresses are:
Host: 192.168.1.16
mic0: 192.168.1.17
mic1: 192.168.1.18
mic2: 192.168.1.19
mic3: 192.168.1.20

and the MICs can connect to each other and to the host via SSH keys. The host has no firewall running, and both openibd and ofed-mic are running. I have tried the Intel cluster tools 2015 Update 3, 2015 Update 5 and 2016 Update 1, all with the same problem. When I try to run anything from the host on a coprocessor:

[on host]:
export I_MPI_MIC=1
export I_MPI_DYNAMIC_CONNECTION=1
source /opt/intel/impi_latest/intel64/bin/mpivars.sh
mpiexec.hydra -n 1 -hosts mic0 $I_MPI_ROOT/mic/bin/IMB-MPI1

the result is:
[proxy:0:0@kif-mic0] HYDU_sock_connect (../../utils/sock/sock.c:268): unable to connect from "kif-mic0" to "127.0.0.1" (Connection refused)
[proxy:0:0@kif-mic0] main (../../pm/pmiserv/pmip.c:415): unable to connect to server 127.0.0.1 at port 35690 (check for firewalls!)

Attached is the verbose output from mpiexec.hydra (log.txt). Several people seem to have this issue, but I haven't found a solution that works for me yet. What do I do here?
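
In case anyone wants to double-check the same things on their side: on RHEL 6 the host firewall state can be confirmed with something like the following (standard RHEL 6 commands, reconstructed here rather than copied from my session):

service iptables status
chkconfig --list iptables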

Anders_H_
Beginner

Update: 

It works if I specify -iface br0 (see ifconfig.txt for the current ifconfig output). So the command

mpirun -n 1 -iface br0 -hosts mic0 $I_MPI_ROOT/mic/bin/IMB-MPI1

works as expected. It also works when running on all four cards (see the sketch below).
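
For completeness, the all-card run was along these lines (a sketch assuming one rank per card; the exact rank counts in my runs varied):

mpirun -n 4 -ppn 1 -iface br0 -hosts mic0,mic1,mic2,mic3 $I_MPI_ROOT/mic/bin/IMB-MPI1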

TL;DR: I needed to specify -iface br0 in the mpirun command. 

Edit: Since I use virtual InfiniBand, I switched to -iface virbr0.
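
As far as I know, the interface can also be selected through the environment instead of the command line; the I_MPI_HYDRA_IFACE variable should be equivalent to -iface (I have not re-verified this on every version listed above):

export I_MPI_HYDRA_IFACE=virbr0
mpirun -n 1 -hosts mic0 $I_MPI_ROOT/mic/bin/IMB-MPI1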
