Cannot start MPI job from host system

Michael_R_10
Beginner
4,051 Views

Hi All,

I am trying to run an application from the host machine on a coprocessor. When I execute the test command

mpirun -n 1 -host mic0 hostname

I get the following error message

[proxy:0:0@machine-mic0.domain] HYDU_sock_connect (./utils/sock/sock.c:241): unable to connect from "machine-mic0.domain" to "127.0.0.1" (Connection refused)
[proxy:0:0@machine-mic0.domain] main (./pm/pmiserv/pmip.c:353): unable to connect to server 127.0.0.1 at port 42661 (check for firewalls!)
 

When I ssh into the coprocessor and run the same command, I get the expected output.

I have checked that the environment variable I_MPI_MIC is set to 'enable', I have disabled the host firewall, and, since /opt/intel is not available over NFS, I have copied the necessary libraries to the coprocessor. I'm not sure how to proceed from here.

Best Regards,

Michael

7 Replies
Frances_R_Intel
Employee

The 127.0.0.1 address is the loopback address. I am not sure why MPI is trying to use that address. Could you check the /etc/hosts file on the coprocessor to make sure the host and coprocessor are listed there with the right addresses and with the names MPI is using?

Michael_R_10
Beginner

The /etc/hosts file on the coprocessor reads

127.0.0.1       localhost.localdomain localhost
::1             localhost.localdomain localhost
172.31.1.254    host machine.domain
172.31.1.1     machine-mic0.domain mic0
172.31.2.1     machine-mic1.domain mic1

I can ssh into the mic and ssh back into the host.

Artem_R_Intel1
Employee

Hi Michael,

It looks like the command fails due to specific network settings.

Could you please provide output of the following commands (from the host side):
hostname -i
I_MPI_MIC=1 mpirun -v -n 1 -host mic0 hostname
mpirun -V
cat /etc/hosts

Could you please also try the following command:
I_MPI_MIC=1 mpirun -localhost 172.31.1.254 -n 1 -host mic0 hostname

 

Michael_R_10
Beginner

Hi Artem,
Thanks for your reply. I ran the commands on the host and list the results below.

hostname -i
127.0.0.1

I_MPI_MIC=1 mpirun -v -n 1 -host mic0 hostname
I've attached the file mpirun_output.txt with the output

mpirun -V
Intel(R) MPI Library for Linux* OS, Version 4.1.0 Build 20120831
Copyright (C) 2003-2012, Intel Corporation. All rights reserved

cat /etc/hosts
#127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
127.0.0.1   localhost machine.domain machine
#::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
172.31.1.1      mic0.local mic0
172.31.2.1      mic1.local mic1
172.31.1.1      machine-mic0.domain mic0 #Generated-by-micctrl
172.31.2.1      machine-mic1.domain mic1 #Generated-by-micctrl

I_MPI_MIC=1 mpirun -localhost 172.31.1.254 -n 1 -host mic0 hostname
The mpirun command didn't recognize the -localhost option and displayed the help screen

 

Thank you again,
Michael

Artem_R_Intel1
Employee

Hi Michael,

Is it possible for you to try the latest Intel MPI Library versions (5.1.x)? The '-localhost' option is implemented there and should help with this issue.

Otherwise you should correct your network settings. 'hostname -i' on the host should report IP address '172.31.1.254'.
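
The outputs posted above show the host's /etc/hosts listing machine.domain on the 127.0.0.1 line, which is consistent with 'hostname -i' printing the loopback address. The following is only a rough sketch of how one might check a hosts-format file for that condition. The function names lookup_in_hosts and is_loopback are made up for illustration, and the script reads the file directly, so it does not account for DNS or other resolver sources that 'hostname -i' may also consult.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Intel MPI or MPSS).

# Print the first address that a hosts-format file assigns to a name.
# $1 = name to look up, $2 = path to the hosts file (default: /etc/hosts).
lookup_in_hosts() {
    name=$1
    file=${2:-/etc/hosts}
    awk -v name="$name" '
        { sub(/#.*/, "") }      # strip comments; awk re-splits the fields
        NF < 2 { next }         # skip blank or address-only lines
        {
            for (i = 2; i <= NF; i++)
                if ($i == name) { print $1; exit }
        }
    ' "$file"
}

# Succeed (exit status 0) when the given address is a loopback address.
is_loopback() {
    case "$1" in
        127.*|::1) return 0 ;;
        *)         return 1 ;;
    esac
}

# Example use on the host:
#   addr=$(lookup_in_hosts "$(hostname)")
#   if is_loopback "$addr"; then
#       echo "host name maps to loopback ($addr) in /etc/hosts"
#   fi
```

If the check reports a loopback address, the host name needs an entry that maps it to the address the coprocessor can reach (172.31.1.254 in this thread) instead of 127.0.0.1.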

Michael_R_10
Beginner

Hi Artem,

I added the appropriate IP addresses to the /etc/hosts file, which enabled the mpirun commands to run hostname successfully. Thank you!

To continue testing, I followed the instructions here and compiled the montecarlo.c program. I can run it successfully on the host machine or on mic0 from the host, but if I try to use both the host and mic0 to process the program with

mpirun -n 1 -host machine /tmp/montecarlo : -n 1 -host mic0 /tmp/montecarlo

I receive a list of error messages:
machine-mic0.domain:SCM:2fa8:afe08700: 245 us(245 us):  open_hca: ibv_get_device_list() failed
machine-mic0.domain:SCM:2fa8:afe08700: 201 us(201 us):  open_hca: ibv_get_device_list() failed
machine-mic0.domain:CMA:2fa8:afe08700: 621 us(621 us):  open_hca: getaddr_netdev ERROR:No such device. Is ib0 configured?
machine-mic0.domain:CMA:2fa8:afe08700: 559 us(559 us):  open_hca: getaddr_netdev ERROR:No such device. Is ib1 configured?
... (full log attached, ran with -v)

I am not sure if this is relevant, but the network between the host and the mic is configured as a static pair. Any advice for solving this would also be welcome.

Thanks in advance,
Michael

Artem_R_Intel1
Employee

Hi Michael,

It looks like the ofed-mic service isn't running; please check. See the Intel® Manycore Platform Software Stack (Intel® MPSS) User's Guide for details.

You can find a similar example in the Intel® MPI Library Troubleshooting Guide.
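
As a rough illustration, a saved mpirun log can be scanned for the open_hca messages quoted above before checking the service. The function name has_open_hca_errors and the log file name are hypothetical, and the 'service ofed-mic status' line in the comment is an assumption about how the service is managed on the host; the MPSS User's Guide remains the authority on the exact commands.

```shell
#!/bin/sh
# Hypothetical helper (not part of Intel MPI or MPSS).
# Succeeds (exit status 0) if the log file named in $1 contains either of
# the open_hca failure messages quoted in this thread.
has_open_hca_errors() {
    grep -qF -e 'open_hca: ibv_get_device_list() failed' \
             -e 'open_hca: getaddr_netdev ERROR' "$1"
}

# Example use, assuming the -v output was saved to mpirun.log:
#   if has_open_hca_errors mpirun.log; then
#       # Assumption: the service is controlled through the 'service' command.
#       service ofed-mic status
#   fi
```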
