Hi everybody!
I just installed ICTCE (Intel Cluster Toolkit Compiler Edition) on my test machine (1 PC, 2 VMs as nodes). When I try to get an MPI ring up and running, this happens:
> mpdboot -n 3 -f mpd.hosts
LAUNCHED mpd on istanbul via
RUNNING: mpd on istanbul
LAUNCHED mpd on cnode1 via istanbul
Traceback (most recent call last):
File "", line 918, in
File "", line 669, in mpdboot
File "", line 758, in launch_one_mpd
File "/usr/lib/python2.6/subprocess.py", line 595, in __init__
errread, errwrite)
File "/usr/lib/python2.6/subprocess.py", line 1106, in _execute_child
raise child_exception
OSError: [Errno 2] No such file or directory
where mpd.hosts looks like this:
istanbul
cnode1
cnode2
mpdcheck -f mpd.hosts -v gives:
obtaining hostname via gethostname and getfqdn
gethostname gives istanbul
getfqdn gives istanbul.site
checking out unqualified hostname; make sure is not "localhost", etc.
checking out qualified hostname; make sure is not "localhost", etc.
obtain IP addrs via qualified and unqualified hostnames; make sure other than 127.0.0.1
gethostbyname_ex: ('istanbul.site', ['istanbul'], ['192.168.220.105'])
gethostbyname_ex: ('istanbul.site', ['istanbul'], ['192.168.220.105'])
checking that IP addrs resolve to same host
now do some gethostbyaddr and gethostbyname_ex for machines in hosts file
checking gethostbyXXX for unqualified istanbul
gethostbyname_ex: ('istanbul.site', ['istanbul'], ['192.168.220.105'])
checking gethostbyXXX for qualified istanbul
gethostbyname_ex: ('istanbul.site', ['istanbul'], ['192.168.220.105'])
checking gethostbyXXX for unqualified cnode1
gethostbyname_ex: ('cnode1.site', ['cnode1'], ['192.168.220.118'])
checking gethostbyXXX for qualified cnode1
gethostbyname_ex: ('cnode1.site', ['cnode1'], ['192.168.220.118'])
checking gethostbyXXX for unqualified cnode2
gethostbyname_ex: ('cnode2.site', ['cnode2'], ['192.168.220.119'])
checking gethostbyXXX for qualified cnode2
gethostbyname_ex: ('cnode2.site', ['cnode2'], ['192.168.220.119'])
obtain IP addrs via localhost name; make sure that it equal to 127.0.0.1
gethostbyname_ex: ('localhost', ['ipv6-localhost', 'ipv6-loopback'], ['127.0.0.1'])
ssh cnode1 and so on works perfectly well. lamboot mpd.hosts also works, so I'm pretty sure that establishing connections to the other nodes is not the problem.
Any ideas?
Thanks in advance.
4 Replies
Quoting - ictceeval
ssh cnode1 and so on works perfectly well. lamboot mpd.hosts also works, so I'm pretty sure that establishing connections to the other nodes is not the problem.
Hi ictceeval,
Thanks for posting. Since you're using ssh for remote shell access, you need to specify this on the mpdboot command line:
$ mpdboot -r ssh -n 3 -f mpd.hosts
The default for the Intel MPI Library is rsh.
Let us know how it goes.
Regards,
~Gergana
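For reference, a minimal sketch of the corrected invocation plus a sanity check, using the hostnames from this thread (mpdtrace and mpdallexit are the companion mpd commands for listing and shutting down the ring; mpdallexit also appears in the verbose output later in this thread):
$ mpdboot -r ssh -n 3 -f mpd.hosts -v
$ mpdtrace      # should list istanbul, cnode1 and cnode2 once the ring is up
$ mpdallexit    # shuts the ring down again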
Quoting - Gergana Slavova (Intel)
Hi ictceeval,
Thanks for posting. Since you're using ssh for remote shell access, you need to specify this on the mpdboot command line:
$ mpdboot -r ssh -n 3 -f mpd.hosts
The default for the Intel MPI Library is rsh.
Let us know how it goes.
Regards,
~Gergana
Thanks for your quick help! I didn't know that. Unfortunately this seems to lead to another issue:
> mpdboot -n 3 -f mpd.hosts -r ssh -v
running mpdallexit on istanbul
LAUNCHED mpd on istanbul via
RUNNING: mpd on istanbul
LAUNCHED mpd on cnode1 via istanbul
LAUNCHED mpd on cnode2 via istanbul
mpdboot_istanbul (handle_mpd_output 828): Failed to establish a socket connection with cnode1:41650 : [Errno 111] Connection refused
mpdboot_istanbul (handle_mpd_output 845): failed to connect to mpd on cnode1
How do I interpret that output? It says "LAUNCHED mpd on cnode1" and then again "Failed to establish...."?!
UPDATE:
Somehow, things seem to get out of hand. Now, I'm getting this message:
> mpdboot -n 3 -f mpd.hosts -r ssh -v --chkup
checking cnode1
checking cnode2
there are 3 hosts up (counting local)
running mpdallexit on istanbul
LAUNCHED mpd on istanbul via
RUNNING: mpd on istanbul
LAUNCHED mpd on cnode1 via istanbul
LAUNCHED mpd on cnode2 via istanbul
mpdboot_istanbul (handle_mpd_output 837): failed to ping mpd on cnode1; received output={}
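The LAUNCHED lines only mean that the remote shell managed to start an mpd process on each node; the ring is formed afterwards, when the daemons and the master connect to one another over sockets, and that handshake is what is failing here. One way to probe connectivity in both directions, using the hostnames from this thread (a hedged sketch, not an official diagnostic):
$ ssh cnode1
cnode1$ ping -c 1 istanbul       # fails if istanbul is missing from cnode1's /etc/hosts
cnode1$ ssh istanbul hostname    # should print istanbul without a password prompt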
Hi Gergana!
I'm glad to report: problem solved. The other issue was that the slave nodes couldn't talk back to the master node due to missing entries in /etc/hosts and missing ssh keys. Having fixed that, I am now able to set up an MPI ring.
Thank you very much for your help!
ictceeval
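For completeness, a sketch of the two fixes described above, assuming the names and addresses from the mpdcheck output earlier in the thread (adjust to your own network):
# on each slave node, /etc/hosts needs an entry for the master:
192.168.220.105   istanbul.site   istanbul
# and each slave needs passwordless ssh back to the master:
cnode1$ ssh-keygen -t rsa
cnode1$ ssh-copy-id istanbul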
Quoting - ictceeval
I'm glad to report: problem solved. The other issue was that the slave nodes couldn't talk back to the master node due to missing entries in /etc/hosts and missing ssh keys. Having fixed that, I am now able to set up an MPI ring.
Thanks for letting me know, ictceeval. I'm glad things are working for you now. Have fun using the Intel Cluster Tools!
Regards,
~Gergana