tamuer
Beginner

Problem starting the mpd ring

This is the information I got:

yukai@hc-abs:/home_sas/yukai => mpdboot -d -v -r ssh -f mpd.hosts -n 7
debug: starting
running mpdallexit on hc-abs
LAUNCHED mpd on hc-abs via
debug: launch cmd= env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 /opt/intel/impi/3.2.2.006/bin64/mpd.py --ncpus=1 --myhost=hc-abs -e -d -s 7
debug: mpd on hc-abs on port 40529
RUNNING: mpd on hc-abs
debug: info for running mpd: {'ip': '', 'ncpus': 1, 'list_port': 40529, 'entry_port': '', 'host': 'hc-abs', 'entry_host': '', 'ifhn': ''}
LAUNCHED mpd on n10 via hc-abs
debug: launch cmd= ssh -x -n n10 env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME HOSTTYPE=$HOSTTYPE MACHTYPE=$MACHTYPE OSTYPE=$OSTYPE /opt/intel/impi/3.2.2.006/bin64/mpd.py -h hc-abs -p 40529 --ifhn=192.168.0.160 --ncpus=1 --myhost=n10 --myip=192.168.0.160 -e -d -s 7
LAUNCHED mpd on n11 via hc-abs
debug: launch cmd= ssh -x -n n11 env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME HOSTTYPE=$HOSTTYPE MACHTYPE=$MACHTYPE OSTYPE=$OSTYPE /opt/intel/impi/3.2.2.006/bin64/mpd.py -h hc-abs -p 40529 --ifhn=192.168.0.161 --ncpus=1 --myhost=n11 --myip=192.168.0.161 -e -d -s 7
LAUNCHED mpd on n12 via hc-abs
debug: launch cmd= ssh -x -n n12 env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME HOSTTYPE=$HOSTTYPE MACHTYPE=$MACHTYPE OSTYPE=$OSTYPE /opt/intel/impi/3.2.2.006/bin64/mpd.py -h hc-abs -p 40529 --ifhn=192.168.0.162 --ncpus=1 --myhost=n12 --myip=192.168.0.162 -e -d -s 7
LAUNCHED mpd on n13 via hc-abs
debug: launch cmd= ssh -x -n n13 env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME HOSTTYPE=$HOSTTYPE MACHTYPE=$MACHTYPE OSTYPE=$OSTYPE /opt/intel/impi/3.2.2.006/bin64/mpd.py -h hc-abs -p 40529 --ifhn=192.168.0.163 --ncpus=1 --myhost=n13 --myip=192.168.0.163 -e -d -s 7
debug: mpd on n10 on port 32896
mpdboot_hc-abs (handle_mpd_output 886): failed to ping mpd on n10; received output={}


I am sure ssh works perfectly (passwordless).
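For example, a non-interactive command runs on each node without any prompt (just an illustrative check; n10 stands for any node in the list below):

yukai@hc-abs:/home_sas/yukai => ssh -x -n n10 hostname
n10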
mpd.hosts:
n10
n11
n12
n13
n14
n15
n16

mpirun works fine on each node individually.

yukai@hc-abs:/home_sas/yukai => cpuinfo
Intel Xeon Processor (Intel64 Dunnington)
===== Processor composition =====
Processors(CPUs) : 16
Packages(sockets) : 4
Cores per package : 4
Threads per core : 1
===== Processor identification =====
Processor Thread Id. Core Id. Package Id.
0 0 0 0
1 0 0 1
2 0 0 2
3 0 0 3
4 0 2 0
5 0 2 1
6 0 2 2
7 0 2 3
8 0 1 0
9 0 1 1
10 0 1 2
11 0 1 3
12 0 3 0
13 0 3 1
14 0 3 2
15 0 3 3
===== Placement on packages =====
Package Id. Core Id. Processors
0 0,2,1,3 0,4,8,12
1 0,2,1,3 1,5,9,13
2 0,2,1,3 2,6,10,14
3 0,2,1,3 3,7,11,15
===== Cache sharing =====
Cache Size Processors
L1 32 KB no sharing
L2 3 MB (0,8)(1,9)(2,10)(3,11)(4,12)(5,13)(6,14)(7,15)
L3 8 MB (0,4,8,12)(1,5,9,13)(2,6,10,14)(3,7,11,15)


/etc/hosts looks fine.

Any help and suggestions will be greatly appreciated!

tamuer
Beginner

Problem solved!

Now I have a question about mpd.hosts.

The ring doesn't work if I don't put the head node in the first line.
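That is, it only comes up with a hosts file like this (hc-abs, the head node, on the first line, followed by the compute nodes):

hc-abs
n10
n11
n12
n13
n14
n15
n16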

Now the question is how I can avoid this, because I don't want to use the head node (I'd like to leave it for the other system programs).

Either by starting the ring without the head node, or by submitting jobs only to specified nodes in the ring?

Dmitry_K_Intel2
Employee

Hi,

Have you tried the '-nolocal' option?
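For example, something along these lines (a sketch only; I'm assuming this mpdboot build accepts -nolocal and that mpd.hosts lists just the seven compute nodes, so -n matches their count):

mpdboot -d -v -r ssh -f mpd.hosts -n 7 -nolocal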

Regards!
Dmitry
compbio
Beginner

Hi tamuer,

Could you please tell me how you resolved it? I'm having the same problem.

Thanks,
Tuan
Dmitry_K_Intel2
Employee

Hi Tuan,

Could you clarify what problem you have?
What library version do you use?
Could you post the commands and error messages here, and I'll try to help you.

Regards!
Dmitry
tamuer
Beginner

Hi Tuan, I just did what Dmitry told me. He is a wonderful expert.
Dmitry_K_Intel2
Employee

Thanks, Tamuer.
Daniel_Redig
Beginner

What was the fix? I've just begun experiencing the problem on a cluster that was working previously. Thanks.
Dmitry_K_Intel2
Employee

Hi Daniel,

Could you clarify what the problem is? What version of the Intel MPI Library do you use?
Usually there are some log files in the /tmp directory. Try 'ls /tmp | grep mpd'.
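For example (the names below are the usual mpd2 defaults; they include the user name, so they may differ on your system):

ls /tmp | grep mpd
cat /tmp/mpd2.logfile_<user>

Checking the same files on the remote node may also help.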

Please provide as much information as possible and I'll try to help you.

Regards!
Dmitry
Daniel_Redig
Beginner

Hi,
I'm using ICT 3.2.2. Everything works fine on all nodes except two. The install is on a shared filesystem. The log file is empty. If I run with -d I get:
[root@test1 ~]# mpdboot -n 2 -r ssh -f machines -d
debug: starting
running mpdallexit on test1
debug: launch cmd= env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 /opt/intel/impi/3.2.2.006/bin64/mpd.py --ncpus=1 --myhost=test1 -e -d -s 2
debug: mpd on test1 on port 37556
debug: info for running mpd: {'ip': '10.11.178.192', 'ncpus': 1, 'list_port': 37556, 'entry_port': '', 'host': 'test1', 'entry_host': '', 'ifhn': ''}
debug: launch cmd= ssh -x -n test2 env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME /opt/intel/impi/3.2.2.006/bin64/mpd.py -h test1 -p 37556 --ifhn=10.11.179.27 --ncpus=1 --myhost=test2 --myip=10.11.179.27 -e -d -s 2
debug: mpd on test2 on port 50042
mpdboot_test1 (handle_mpd_output 886): failed to ping mpd on test2; received output={}
Dmitry_K_Intel2
Employee

Daniel,

Could you check that you can log in without entering a password (or passphrase) from test1 to test2 and vice versa?
[root@test1 ~] ssh test2

A passwordless ssh connection is one of the requirements.
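A strict way to check is to forbid prompts altogether, in both directions (BatchMode turns any password prompt into an immediate error):

[root@test1 ~] ssh -o BatchMode=yes test2 hostname
[root@test2 ~] ssh -o BatchMode=yes test1 hostname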

Regards!
Dmitry
Daniel_Redig
Beginner

Passwordless ssh is working properly.
Dmitry_K_Intel2
Employee

Daniel,

It looks like there are some limitations on the network ports. Do you use a firewall? Or maybe some ssh ports are restricted? Could you please check with your system administrator?
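For example, you could list the firewall rules on both nodes and try reaching the mpd listening port reported by 'mpdboot -d' (a rough check only; the port changes on every start, 50042 is just the one from your last output):

[root@test1 ~] /sbin/iptables -L -n
[root@test1 ~] telnet test2 50042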

Regards!
Dmitry
Daniel_Redig
Beginner

Hi Dmitry,
No firewall rules are defined, and selinux is disabled. I can use ibping to ping between the machines and get replies. I still cannot create a ring. MPDs can start locally. Passwordless ssh works perfectly. Authentication is from the same NIS server as all the other nodes in the cluster that do work. It is an odd problem, IMHO!! Any more suggestions? Thanks,
Dan
Dmitry_K_Intel2
Employee

Hi Daniel,

Let's compare ssh versions! I'm using:
[root@cluster1002 ~]$ ssh -V
ssh: Reflection for Secure IT 6.1.2.1 (build 3005) on x86_64-redhat-linux-gnu (64-bit)

Could you check for mpd processes on both nodes?
[root@cluster1002 ~] ps ux
[root@cluster1002 ~] ssh -x test2 ps ux
If there is an mpd process running, please kill it.

[root@cluster1002 ~] echo test1 > mpd.hosts
[root@cluster1002 ~] echo test2 >> mpd.hosts
[root@cluster1002 ~] mpdboot -r ssh -n 2 -d
Check the ring:
[root@cluster1002 ~] mpdtrace
If there is no ring, let's try to create it by hand:

[root@test1 ~] env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 /opt/intel/impi/3.2.2.006/bin64/mpd.py --ncpus=1 --myhost=test1 -e -d -s 2
You'll get a port number (port_number), which will be used in the next command:

[root@test1 ~] ssh -x -n test2 env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME /opt/intel/impi/3.2.2.006/bin64/mpd.py -h test1 -p port_number --ifhn=10.11.179.27 --ncpus=1 --myhost=test2 --myip=10.11.179.27 -e -d -s 2

If ssh works correctly, a new mpd ring will be created:
[root@test1 ~] mpdtrace
test1
test2

If it doesn't work, it means that you have some configuration issues. If it works, send me the output - probably your ssh outputs information in a different format.
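One way to see whether ssh adds extra output that could confuse mpdboot's parsing (banners, motd, or messages printed by shell startup files):

[root@test1 ~] ssh -x -n test2 echo MPD_TEST

Anything beyond the single line 'MPD_TEST' in the output is a likely culprit.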

Regards!
Dmitry