I am sometimes able to run parallel jobs, but very often they fail with errors - most often with:
mpdboot_cl1n052 (handle_mpd_output 575): failed to ping mpd on cl1n038; recvd output={}
but sometimes the error is:
mpdboot_cl1n003 (handle_mpd_output 583): failed to connect to mpd on cl1n040
The node names (cl1nNNN) in the error messages are not always the same, so I suspect it is something systemic.
The mpd commands I use are:
mpdallexit
mpdboot -n 64 -r ssh -f ${NODEFILE}
mpdtrace
mpiexec -np 64 ./a.out
mpdallexit
Can anyone give a suggestion? I should mention that we have both TCP and InfiniBand, but our InfiniBand is broken at the moment. Typically Intel MPI doesn't mind that very much and fails over to TCP. In case it helps, our /etc/hosts is appended below.
Thanks,
Sean
127.0.0.1 localhost.localdomain localhost
10.11.12.7 files.tae.mysite.com files.mysite.com files
# special IPv6 addresses
::1 localhost ipv6-localhost ipv6-loopback
fe00::0 ipv6-localnet
ff00::0 ipv6-mcastprefix
ff02::1 ipv6-allnodes
ff02::2 ipv6-allrouters
ff02::3 ipv6-allhosts
#The following was added by scance. Do not remove:
10.0.1.1 cl1n001
10.0.1.10 cl1n010
10.0.1.11 cl1n011
10.0.1.12 cl1n012
10.0.1.13 cl1n013
10.0.1.14 cl1n014
10.0.1.15 cl1n015
10.0.1.16 cl1n016
10.0.1.17 cl1n017
10.0.1.18 cl1n018
10.0.1.19 cl1n019
10.0.1.2 cl1n002
10.0.1.20 cl1n020
10.0.1.21 cl1n021
10.0.1.22 cl1n022
10.0.1.23 cl1n023
10.0.1.24 cl1n024
10.0.1.25 cl1n025
10.0.1.26 cl1n026
10.0.1.27 cl1n027
10.0.1.28 cl1n028
10.0.1.29 cl1n029
10.0.1.3 cl1n003
10.0.1.30 cl1n030
10.0.1.31 cl1n031
10.0.1.32 cl1n032
10.0.1.33 cl1n033
10.0.1.34 cl1n034
10.0.1.35 cl1n035
10.0.1.36 cl1n036
10.0.1.37 cl1n037
10.0.1.38 cl1n038
10.0.1.39 cl1n039
10.0.1.4 cl1n004
10.0.1.40 cl1n040
10.0.1.41 cl1n041
10.0.1.42 cl1n042
10.0.1.43 cl1n043
10.0.1.44 cl1n044
10.0.1.45 cl1n045
10.0.1.46 cl1n046
10.0.1.47 cl1n047
10.0.1.48 cl1n048
10.0.1.49 cl1n049
10.0.1.5 cl1n005
10.0.1.50 cl1n050
10.0.1.51 cl1n051
10.0.1.52 cl1n052
10.0.1.53 cl1n053
10.0.1.54 cl1n054
10.0.1.55 cl1n055
10.0.1.56 cl1n056
10.0.1.57 cl1n057
10.0.1.58 cl1n058
10.0.1.59 cl1n059
10.0.1.6 cl1n006
10.0.1.60 cl1n060
10.0.1.61 cl1n061
10.0.1.62 cl1n062
10.0.1.63 cl1n063
10.0.1.64 cl1n064
10.0.1.7 cl1n007
10.0.1.8 cl1n008
10.0.1.9 cl1n009
10.0.10.1 taz3.americas.sgi.com taz3
10.0.40.1 cl1n001-bmc
10.0.40.10 cl1n010-bmc
10.0.40.11 cl1n011-bmc
10.0.40.12 cl1n012-bmc
10.0.40.13 cl1n013-bmc
10.0.40.14 cl1n014-bmc
10.0.40.15 cl1n015-bmc
10.0.40.16 cl1n016-bmc
10.0.40.17 cl1n017-bmc
10.0.40.18 cl1n018-bmc
10.0.40.19 cl1n019-bmc
10.0.40.2 cl1n002-bmc
10.0.40.20 cl1n020-bmc
10.0.40.21 cl1n021-bmc
10.0.40.22 cl1n022-bmc
10.0.40.23 cl1n023-bmc
10.0.40.24 cl1n024-bmc
10.0.40.25 cl1n025-bmc
10.0.40.26 cl1n026-bmc
10.0.40.27 cl1n027-bmc
10.0.40.28 cl1n028-bmc
10.0.40.29 cl1n029-bmc
10.0.40.3 cl1n003-bmc
10.0.40.30 cl1n030-bmc
10.0.40.31 cl1n031-bmc
10.0.40.32 cl1n032-bmc
10.0.40.33 cl1n033-bmc
10.0.40.34 cl1n034-bmc
10.0.40.35 cl1n035-bmc
10.0.40.36 cl1n036-bmc
10.0.40.37 cl1n037-bmc
10.0.40.38 cl1n038-bmc
10.0.40.39 cl1n039-bmc
10.0.40.4 cl1n004-bmc
10.0.40.40 cl1n040-bmc
10.0.40.41 cl1n041-bmc
10.0.40.42 cl1n042-bmc
10.0.40.43 cl1n043-bmc
10.0.40.44 cl1n044-bmc
10.0.40.45 cl1n045-bmc
10.0.40.46 cl1n046-bmc
10.0.40.47 cl1n047-bmc
10.0.40.48 cl1n048-bmc
10.0.40.49 cl1n049-bmc
10.0.40.5 cl1n005-bmc
10.0.40.50 cl1n050-bmc
10.0.40.51 cl1n051-bmc
10.0.40.52 cl1n052-bmc
10.0.40.53 cl1n053-bmc
10.0.40.54 cl1n054-bmc
10.0.40.55 cl1n055-bmc
10.0.40.56 cl1n056-bmc
10.0.40.57 cl1n057-bmc
10.0.40.58 cl1n058-bmc
10.0.40.59 cl1n059-bmc
10.0.40.6 cl1n006-bmc
10.0.40.60 cl1n060-bmc
10.0.40.61 cl1n061-bmc
10.0.40.62 cl1n062-bmc
10.0.40.63 cl1n063-bmc
10.0.40.64 cl1n064-bmc
10.0.40.7 cl1n007-bmc
10.0.40.8 cl1n008-bmc
10.0.40.9 cl1n009-bmc
10.11.12.9 taz.mysite.com taz
192.168.10.1 linux.site linux
#End scance-section
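Since the run depends on Intel MPI failing over from the broken InfiniBand to TCP, it may be worth confirming which device the library actually selects. A minimal sketch, assuming the I_MPI_DEBUG variable behaves in this release as in other Intel MPI 3.x versions (a level of 2 or higher prints the selected device during startup):

export I_MPI_DEBUG=2      # ask the library to report the device it selects
mpiexec -np 64 ./a.out    # the startup output should show whether sock (TCP) is in use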
OK, so how should I specify TCP/IP? I tried this:
export I_MPI_DEVICE=rdssm:sock
It failed to ping again as before. Is my syntax wrong?
I_MPI_DEVICE=ssm
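As a minimal sketch, assuming the same node file and binary as in the original post, the setting would slot into the existing sequence like this:

export I_MPI_DEVICE=ssm               # sockets between nodes, shared memory within a node
mpdallexit                            # clear out any previous ring
mpdboot -n 64 -r ssh -f ${NODEFILE}
mpdtrace                              # should list all 64 nodes before launching
mpiexec -np 64 ./a.out
mpdallexit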
Hi Sean,
To specify using TCP/IP, you need to set I_MPI_DEVICE=ssm. This will run over sockets across nodes and use the shm device within a node.
Additionally, the error you're seeing could be due to a failed connection to the node, an inability to start the mpd daemon on the remote node, etc. Can you verify that you're using the latest version of the Intel MPI Library, 3.2 Update 1? You can do so by running "mpiexec -V".
Also, make sure no leftover mpd python processes exist on the nodes. You can check by running "ps aux | grep mpd". Go ahead and kill any leftover mpd.py processes you find.
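As a rough sketch, assuming passwordless ssh to every node listed in ${NODEFILE} (the same file passed to mpdboot), the check and cleanup could be done in one pass:

for node in $(cat ${NODEFILE}); do
    echo "=== $node ==="
    ssh $node "ps aux | grep '[m]pd'"   # list stale mpd daemons; the [m] keeps grep itself out of the results
    ssh $node "pkill -f mpd.py"         # kill any leftover mpd.py processes
done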
Regards,
~Gergana
Maybe it is related to SELinux or the firewall. You can stop those services and try again.
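For reference, a minimal sketch of checking and temporarily disabling both on a node, assuming a RHEL-style system of that era (run as root, and re-enable once testing is done):

sestatus                  # show whether SELinux is enforcing
setenforce 0              # switch SELinux to permissive for the test
service iptables status   # show whether the firewall is running
service iptables stop     # stop the firewall for the test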
