Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

Problem starting the mpd ring

tamuer
This is the information I got:

yukai@hc-abs:/home_sas/yukai => mpdboot -d -v -r ssh -f mpd.hosts -n 7
debug: starting
running mpdallexit on hc-abs
LAUNCHED mpd on hc-abs via
debug: launch cmd= env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 /opt/intel/impi/3.2.2.006/bin64/mpd.py --ncpus=1 --myhost=hc-abs -e -d -s 7
debug: mpd on hc-abs on port 40529
RUNNING: mpd on hc-abs
debug: info for running mpd: {'ip': '', 'ncpus': 1, 'list_port': 40529, 'entry_port': '', 'host': 'hc-abs', 'entry_host': '', 'ifhn': ''}
LAUNCHED mpd on n10 via hc-abs
debug: launch cmd= ssh -x -n n10 env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME HOSTTYPE=$HOSTTYPE MACHTYPE=$MACHTYPE OSTYPE=$OSTYPE /opt/intel/impi/3.2.2.006/bin64/mpd.py -h hc-abs -p 40529 --ifhn=192.168.0.160 --ncpus=1 --myhost=n10 --myip=192.168.0.160 -e -d -s 7
LAUNCHED mpd on n11 via hc-abs
debug: launch cmd= ssh -x -n n11 env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME HOSTTYPE=$HOSTTYPE MACHTYPE=$MACHTYPE OSTYPE=$OSTYPE /opt/intel/impi/3.2.2.006/bin64/mpd.py -h hc-abs -p 40529 --ifhn=192.168.0.161 --ncpus=1 --myhost=n11 --myip=192.168.0.161 -e -d -s 7
LAUNCHED mpd on n12 via hc-abs
debug: launch cmd= ssh -x -n n12 env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME HOSTTYPE=$HOSTTYPE MACHTYPE=$MACHTYPE OSTYPE=$OSTYPE /opt/intel/impi/3.2.2.006/bin64/mpd.py -h hc-abs -p 40529 --ifhn=192.168.0.162 --ncpus=1 --myhost=n12 --myip=192.168.0.162 -e -d -s 7
LAUNCHED mpd on n13 via hc-abs
debug: launch cmd= ssh -x -n n13 env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME HOSTTYPE=$HOSTTYPE MACHTYPE=$MACHTYPE OSTYPE=$OSTYPE /opt/intel/impi/3.2.2.006/bin64/mpd.py -h hc-abs -p 40529 --ifhn=192.168.0.163 --ncpus=1 --myhost=n13 --myip=192.168.0.163 -e -d -s 7
debug: mpd on n10 on port 32896
mpdboot_hc-abs (handle_mpd_output 886): failed to ping mpd on n10; received output={}


I am sure ssh works perfectly (passwordless).
mpd.hosts:
n10
n11
n12
n13
n14
n15
n16

mpirun works fine for each node.
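
One quick way to double-check the passwordless ssh claim for every host in mpd.hosts (a sketch, assuming a bash shell; BatchMode makes ssh fail immediately instead of prompting for a password):

for h in $(cat mpd.hosts); do
    # BatchMode=yes: never prompt; fail at once if a password would be required
    ssh -x -o BatchMode=yes $h hostname || echo "ssh to $h failed"
done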

yukai@hc-abs:/home_sas/yukai => cpuinfo
Intel Xeon Processor (Intel64 Dunnington)
===== Processor composition =====
Processors(CPUs) : 16
Packages(sockets) : 4
Cores per package : 4
Threads per core : 1
===== Processor identification =====
Processor Thread Id. Core Id. Package Id.
0 0 0 0
1 0 0 1
2 0 0 2
3 0 0 3
4 0 2 0
5 0 2 1
6 0 2 2
7 0 2 3
8 0 1 0
9 0 1 1
10 0 1 2
11 0 1 3
12 0 3 0
13 0 3 1
14 0 3 2
15 0 3 3
===== Placement on packages =====
Package Id. Core Id. Processors
0 0,2,1,3 0,4,8,12
1 0,2,1,3 1,5,9,13
2 0,2,1,3 2,6,10,14
3 0,2,1,3 3,7,11,15
===== Cache sharing =====
Cache Size Processors
L1 32 KB no sharing
L2 3 MB (0,8)(1,9)(2,10)(3,11)(4,12)(5,13)(6,14)(7,15)
L3 8 MB (0,4,8,12)(1,5,9,13)(2,6,10,14)(3,7,11,15)


/etc/hosts looks fine.

Any help and suggestions will be greatly appreciated!

tamuer
Problem solved!

Now I have a question about mpd.hosts.

The ring doesn't work if I don't put the head node in the first line.

Now the question is how I can avoid this, because I don't want to use the head node (I'd like to leave it for other system programs).

Can I either start the ring without the head node, or submit jobs only to specified nodes in the ring?
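
For the second alternative, a machine file passed to mpiexec might work (a sketch, assuming the MPD-based mpiexec here accepts -machinefile as in MPICH2; the host subset, process count, and ./my_app are placeholders):

# run only on n10 and n11, even though the ring spans all seven nodes
cat > worker.hosts << EOF
n10
n11
EOF
mpiexec -machinefile worker.hosts -n 8 ./my_app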

Dmitry_K_Intel2
Hi,

Have you tried the '-nolocal' option?
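
For example, keeping the flags from the original mpdboot command (a sketch only; it assumes '-nolocal' is accepted as spelled here, and -n may need adjusting once the head node is excluded):

mpdboot -d -v -r ssh -f mpd.hosts -n 7 -nolocal
mpdtrace    # the head node hc-abs should no longer appear in the ring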

Regards!
Dmitry
compbio
Hi tamuer,

Could you please tell me how you resolved that? I'm having the same problem.

Thanks,
Tuan
Dmitry_K_Intel2
Hi Tuan,

Could you clarify what problem you have?
What library version do you use?
Could you post the commands and error messages here and I'll try to help you.

Regards!
Dmitry
tamuer
Hi Tuan, I just did what Dmitry told me. He is a wonderful expert.
Dmitry_K_Intel2
Thanks, Tamuer.
Daniel_Redig
What was the fix? I've just begun experiencing the problem on a cluster that was working previously. Thanks.
Dmitry_K_Intel2
Hi Daniel,

Could you clarify what the problem is? What version of the Intel MPI Library do you use?
Usually there are some log files in the /tmp directory. Try 'ls /tmp | grep mpd'.
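
The listing typically shows per-user console and log files; the names below are only an illustration (the exact file names and user suffix may differ on your system):

ls /tmp | grep mpd
# mpd2.console_root     <- mpd console socket (name assumed)
# mpd2.logfile_root     <- mpdboot log file (name assumed)
cat /tmp/mpd2.logfile_root    # look for startup errors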

Please provide as much information as possible and I'll try to help you.

Regards!
Dmitry
Daniel_Redig
Hi,
I'm using ICT 3.2.2. Everything works fine on all nodes except two. The install is on a shared filesystem. The logfile is empty. If I run with -d I get:
[root@test1 ~]# mpdboot -n 2 -r ssh -f machines -d
debug: starting
running mpdallexit on test1
debug: launch cmd= env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 /opt/intel/impi/3.2.2.006/bin64/mpd.py --ncpus=1 --myhost=test1 -e -d -s 2
debug: mpd on test1 on port 37556
debug: info for running mpd: {'ip': '10.11.178.192', 'ncpus': 1, 'list_port': 37556, 'entry_port': '', 'host': 'test1', 'entry_host': '', 'ifhn': ''}
debug: launch cmd= ssh -x -n test2 env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME /opt/intel/impi/3.2.2.006/bin64/mpd.py -h test1 -p 37556 --ifhn=10.11.179.27 --ncpus=1 --myhost=test2 --myip=10.11.179.27 -e -d -s 2
debug: mpd on test2 on port 50042
mpdboot_test1 (handle_mpd_output 886): failed to ping mpd on test2; received output={}
Dmitry_K_Intel2
Daniel,

Could you check that you can log in without entering a password (or passphrase) from test1 to test2 and vice versa?
[root@test1 ~] ssh test2

A passwordless ssh connection is one of the requirements.

Regards!
Dmitry
Daniel_Redig
Passwordless ssh is working properly.
Dmitry_K_Intel2
Daniel,

It looks like there are some limitations on the network ports. Do you use a firewall? Or maybe some ports are restricted? Could you please check with your system administrator?
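
A quick way to check whether an arbitrary high TCP port is reachable between the two nodes, since the mpd daemons listen on dynamically chosen ports (37556, 50042 in the output above) rather than just the ssh port (a sketch, assuming netcat is installed; the port number is arbitrary and the listen syntax differs between nc versions):

# on test2: listen on a high port
nc -l 50042            # some nc builds need: nc -l -p 50042
# on test1: connect and type a line; it should appear on test2
nc test2 50042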

Regards!
Dmitry
Daniel_Redig
Hi Dmitry,
No firewall rules are defined, and SELinux is disabled. I can use ibping to ping between the machines and get replies, but I still cannot create a ring. MPDs can start locally. Passwordless ssh works perfectly. Authentication is from the same NIS server as all the other nodes in the cluster that do work. It is an odd problem, IMHO! Any more suggestions? Thanks,
Dan
Dmitry_K_Intel2
Hi Daniel,

Let's compare ssh versions! I'm using:
[root@cluster1002 ~]$ ssh -V
ssh: Reflection for Secure IT 6.1.2.1 (build 3005) on x86_64-redhat-linux-gnu (64-bit)

Could you check for mpd processes on both nodes?
[root@cluster1002 ~] ps ux
[root@cluster1002 ~] ssh -x test2 ps ux
If there is an mpd process please kill it.

[root@cluster1002 ~] echo test1 > mpd.hosts
[root@cluster1002 ~] echo test2 >> mpd.hosts
[root@cluster1002 ~] mpdboot -r ssh -n 2 -d
Check the ring:
[root@cluster1002 ~] mpdtrace
If there is no ring, let's try to create it by hand:

[root@test1 ~] env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 /opt/intel/impi/3.2.2.006/bin64/mpd.py --ncpus=1 --myhost=test1 -e -d -s 2
You'll get a port number (port_number), which will be used in the next command.

[root@test1 ~] ssh -x -n test2 env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME /opt/intel/impi/3.2.2.006/bin64/mpd.py -h test1 -p port_number --ifhn=10.11.179.27 --ncpus=1 --myhost=test2 --myip=10.11.179.27 -e -d -s 2

If ssh works correctly, a new mpd ring will be created.
[root@test1 ~] mpdtrace
test1
test2

If it doesn't work, it means that you have some configuration issues. If it works, send me the output - probably your ssh outputs the information in another format.
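
Either way, the hand-built test ring can be shut down afterwards with mpdallexit (the same command mpdboot itself runs at startup, as seen in the debug output above) before retrying mpdboot:

[root@test1 ~] mpdallexit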

Regards!
Dmitry