Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

Problem starting an mpd ring

tamuer
Beginner
This is the information I got:

yukai@hc-abs:/home_sas/yukai => mpdboot -d -v -r ssh -f mpd.hosts -n 7
debug: starting
running mpdallexit on hc-abs
LAUNCHED mpd on hc-abs via
debug: launch cmd= env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 /opt/intel/impi/3.2.2.006/bin64/mpd.py --ncpus=1 --myhost=hc-abs -e -d -s 7
debug: mpd on hc-abs on port 40529
RUNNING: mpd on hc-abs
debug: info for running mpd: {'ip': '', 'ncpus': 1, 'list_port': 40529, 'entry_port': '', 'host': 'hc-abs', 'entry_host': '', 'ifhn': ''}
LAUNCHED mpd on n10 via hc-abs
debug: launch cmd= ssh -x -n n10 env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME HOSTTYPE=$HOSTTYPE MACHTYPE=$MACHTYPE OSTYPE=$OSTYPE /opt/intel/impi/3.2.2.006/bin64/mpd.py -h hc-abs -p 40529 --ifhn=192.168.0.160 --ncpus=1 --myhost=n10 --myip=192.168.0.160 -e -d -s 7
LAUNCHED mpd on n11 via hc-abs
debug: launch cmd= ssh -x -n n11 env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME HOSTTYPE=$HOSTTYPE MACHTYPE=$MACHTYPE OSTYPE=$OSTYPE /opt/intel/impi/3.2.2.006/bin64/mpd.py -h hc-abs -p 40529 --ifhn=192.168.0.161 --ncpus=1 --myhost=n11 --myip=192.168.0.161 -e -d -s 7
LAUNCHED mpd on n12 via hc-abs
debug: launch cmd= ssh -x -n n12 env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME HOSTTYPE=$HOSTTYPE MACHTYPE=$MACHTYPE OSTYPE=$OSTYPE /opt/intel/impi/3.2.2.006/bin64/mpd.py -h hc-abs -p 40529 --ifhn=192.168.0.162 --ncpus=1 --myhost=n12 --myip=192.168.0.162 -e -d -s 7
LAUNCHED mpd on n13 via hc-abs
debug: launch cmd= ssh -x -n n13 env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME HOSTTYPE=$HOSTTYPE MACHTYPE=$MACHTYPE OSTYPE=$OSTYPE /opt/intel/impi/3.2.2.006/bin64/mpd.py -h hc-abs -p 40529 --ifhn=192.168.0.163 --ncpus=1 --myhost=n13 --myip=192.168.0.163 -e -d -s 7
debug: mpd on n10 on port 32896
mpdboot_hc-abs (handle_mpd_output 886): failed to ping mpd on n10; received output={}


I am sure ssh works perfectly (passwordless).
mpd.hosts:
n10
n11
n12
n13
n14
n15
n16

mpirun works fine for each node.
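
One quick way to double-check that non-interactive ssh really works for every host in mpd.hosts (assuming the file is in the current directory) is a loop like the one below; BatchMode=yes makes ssh fail instead of prompting, so any host that would ask for a password shows up immediately:

for h in $(cat mpd.hosts); do ssh -o BatchMode=yes -x $h hostname; done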

yukai@hc-abs:/home_sas/yukai => cpuinfo
Intel Xeon Processor (Intel64 Dunnington)
===== Processor composition =====
Processors(CPUs) : 16
Packages(sockets) : 4
Cores per package : 4
Threads per core : 1
===== Processor identification =====
Processor Thread Id. Core Id. Package Id.
0 0 0 0
1 0 0 1
2 0 0 2
3 0 0 3
4 0 2 0
5 0 2 1
6 0 2 2
7 0 2 3
8 0 1 0
9 0 1 1
10 0 1 2
11 0 1 3
12 0 3 0
13 0 3 1
14 0 3 2
15 0 3 3
===== Placement on packages =====
Package Id. Core Id. Processors
0 0,2,1,3 0,4,8,12
1 0,2,1,3 1,5,9,13
2 0,2,1,3 2,6,10,14
3 0,2,1,3 3,7,11,15
===== Cache sharing =====
Cache Size Processors
L1 32 KB no sharing
L2 3 MB (0,8)(1,9)(2,10)(3,11)(4,12)(5,13)(6,14)(7,15)
L3 8 MB (0,4,8,12)(1,5,9,13)(2,6,10,14)(3,7,11,15)


/etc/hosts looks fine.

Any help and suggestions will be greatly appreciated!

tamuer
Beginner
Problem solved!

Now I have a question about mpd.hosts.

The ring doesn't work if I don't put the head node on the first line.

Now the question is how I can avoid this, because I don't want to use the head node (I'd like to leave it for the other system programs).

Either start the ring without the head node, or submit jobs only to specified nodes in the ring?

Dmitry_K_Intel2
Employee
Hi,

Have you tried the '-nolocal' option?
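
Assuming '-nolocal' here means the mpiexec option (with ./my_app just a placeholder for the real binary), keeping ranks off the head node could look roughly like:

mpdboot -r ssh -f mpd.hosts -n 8      # the ring still includes the head node; adjust -n to your host count
mpiexec -nolocal -n 16 ./my_app       # -nolocal avoids placing ranks on the host mpiexec was launched from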

Regards!
Dmitry
compbio
Beginner
Hi tamuer,

Could you please tell me how you resolved that? I'm having the same problem.

Thanks,
Tuan
Dmitry_K_Intel2
Employee
Hi Tuan,

Could you clarify what problem you have?
What library version do you use?
Could you post the commands and error messages here, and I'll try to help you.

Regards!
Dmitry
tamuer
Beginner
Hi Tuan, I just did what Dmitry told me. He is a wonderful expert.
Dmitry_K_Intel2
Employee
Thanks, Tamuer.
Daniel_Redig
Beginner
What was the fix? I've just begun experiencing the problem on a cluster that was working previously. Thanks.
Dmitry_K_Intel2
Employee
Hi Daniel,

Could you clarify what the problem is? What version of the Intel MPI Library do you use?
Usually there are some log files in the /tmp directory. Try 'ls /tmp | grep mpd'.
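
If that turns anything up, dumping the matching files can save a round trip (just a sketch; any sockets among the matches will simply print nothing):

for f in /tmp/*mpd*; do echo "== $f =="; cat "$f" 2>/dev/null; done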

Please provide as much information as possible and I'll try to help you.

Regards!
Dmitry
Daniel_Redig
Beginner
Hi,
I'm using ICT 3.2.2. Everything works fine on all nodes except two. The install is on a shared filesystem. The log file is empty. If I run with -d I get:
[root@test1 ~]# mpdboot -n 2 -r ssh -f machines -d
debug: starting
running mpdallexit on test1
debug: launch cmd= env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 /opt/intel/impi/3.2.2.006/bin64/mpd.py --ncpus=1 --myhost=test1 -e -d -s 2
debug: mpd on test1 on port 37556
debug: info for running mpd: {'ip': '10.11.178.192', 'ncpus': 1, 'list_port': 37556, 'entry_port': '', 'host': 'test1', 'entry_host': '', 'ifhn': ''}
debug: launch cmd= ssh -x -n test2 env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME /opt/intel/impi/3.2.2.006/bin64/mpd.py -h test1 -p 37556 --ifhn=10.11.179.27 --ncpus=1 --myhost=test2 --myip=10.11.179.27 -e -d -s 2
debug: mpd on test2 on port 50042
mpdboot_test1 (handle_mpd_output 886): failed to ping mpd on test2; received output={}
Dmitry_K_Intel2
Employee
Daniel,

Could you check that you can log in without entering a password (or passphrase) from test1 to test2 and vice versa?
[root@test1 ~] ssh test2

A passwordless ssh connection is one of the requirements.
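
A stricter test is to force non-interactive mode in both directions (hostnames as in your output); if either command prompts or fails, the keys are not set up the way mpdboot needs:

[root@test1 ~] ssh -o BatchMode=yes -x test2 hostname
[root@test2 ~] ssh -o BatchMode=yes -x test1 hostname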

Regards!
Dmitry
Daniel_Redig
Beginner
passwordless ssh is working properly.
Dmitry_K_Intel2
Employee
Daniel,

It looks like there are some limitations on the network ports. Do you use a firewall? Or maybe some ssh ports are restricted? Could you please check with your system administrator?
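
For example, you could check the packet filter on both hosts and probe the mpd listening port from the other side. Here 37556 is just the port from your -d output (it changes on every run), and nc may not be installed everywhere:

[root@test1 ~] /sbin/iptables -L -n
[root@test2 ~] nc -z -w 3 test1 37556 && echo open || echo closed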

Regards!
Dmitry
Daniel_Redig
Beginner
Hi Dmitry,
No firewall rules are defined, and SELinux is disabled. I can use ibping to ping between the machines and get replies, but I still cannot create a ring. MPDs can start locally. Passwordless ssh works perfectly. Authentication is from the same NIS server as all the other nodes in the cluster that do work. It is an odd problem, IMHO!! Any more suggestions? Thanks,
Dan
Dmitry_K_Intel2
Employee
Hi Daniel,

Let's compare ssh versions! I'm using:
[root@cluster1002 ~]$ ssh -V
ssh: Reflection for Secure IT 6.1.2.1 (build 3005) on x86_64-redhat-linux-gnu (64-bit)

Could you check for mpd processes on both nodes?
[root@test1 ~] ps ux
[root@test1 ~] ssh -x test2 ps ux
If there is an mpd process, please kill it.

[root@test1 ~] echo test1 > mpd.hosts
[root@test1 ~] echo test2 >> mpd.hosts
[root@test1 ~] mpdboot -r ssh -n 2 -d
Check the ring:
[root@test1 ~] mpdtrace
If there is no ring, let's try to create it by hand:

[root@test1 ~] env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 /opt/intel/impi/3.2.2.006/bin64/mpd.py --ncpus=1 --myhost=test1 -e -d -s 2
You'll get a port number (port_number), which will be used in the next command.

[root@test1 ~] ssh -x -n test2 env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME /opt/intel/impi/3.2.2.006/bin64/mpd.py -h test1 -p port_number --ifhn=10.11.179.27 --ncpus=1 --myhost=test2 --myip=10.11.179.27 -e -d -s 2

If ssh works correctly, a new mpd ring will be created.
[root@test1 ~] mpdtrace
test1
test2

If it doesn't work, it means that you have some configuration issues. If it does work, send me the output; probably your ssh outputs information in a different format.
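
When you are done testing, the hand-built ring can be shut down from test1 with mpdallexit:

[root@test1 ~] mpdallexit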

Regards!
Dmitry