Intel® MPI Library

mpd error

altlogic09
Beginner

Hi!

I have a problem with Altair PBS Pro + Intel MPI. I can launch a task with the mpiexec command on several nodes. But when I try to launch the same task on several nodes under PBS, I get an error.

What I'm doing:
1) Starting mpd on nodes:
qwer@mgr:/mnt/share/piex> cat mpd.hosts
ib-mgr:10
ib-cn01:16
ib-cn02:16
ib-cn03:16
ib-cn04:16
ib-cn05:16
qwer@mgr:/mnt/share/piex> mpdboot -n 6 -f mpd.hosts -r ssh


2) Checking:
qwer@mgr:/mnt/share/piex> mpdtrace
ib-mgr
ib-cn04
ib-cn03
ib-cn02
ib-cn01
ib-cn05


3) Starting the MPI program without PBS:
qwer@mgr:/mnt/share/piex> mpiexec -ppn 10 -n 50 /mnt/share/piex/pi -nolocal
Process 24 on ib-cn04
Process 22 on ib-cn04
Process 13 on ib-mgr
[Why is -nolocal ignored?]
Process 29 on ib-cn04
Process 21 on ib-cn04

...
Process 25 on ib-cn04
Process 26 on ib-cn04
Process 36 on ib-cn03

pi = 3.1415926535897931
time = 0.435737 sec.

OK. The task was launched on all nodes correctly.


4) Making a job file for PBS:
qwer@mgr:/mnt/share/piex> cat test.job
#!/bin/bash

#PBS -q long
#PBS -l nodes=5:ppn=10,mem=100mb,walltime=1:30:00
#PBS -S /bin/bash
#PBS -N piex

echo " Start date:`/bin/date`"
mpiexec -ppn 10 -n 50 /mnt/share/piex/pi -nolocal
echo " End date:`/bin/date`"


5) Starting the MPI program under PBS:
qwer@mgr:/mnt/share/piex> qsub test.job
673.mgr

6) Where is my job?
qwer@mgr:/mnt/share/piex> qstat

7) What happened?
qwer@mgr:/mnt/share/piex> cat piex.o673
Start date: 27 13:55:47 VLAT 2009
mpiexec_mgr: cannot connect to local mpd (/tmp/pbs.673.mgr/mpd2.console_mgr_qwer); possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a "console" (-n option)
End date: 27 13:55:47 VLAT 2009


8) Is mpd really not running?
qwer@mgr:/mnt/share/piex> mpdtrace -l
ib-mgr_60696 (10.10.0.1)
ib-cn04_41952 (10.10.0.14)
ib-cn03_43736 (10.10.0.13)
ib-cn02_45542 (10.10.0.12)
ib-cn01_52394 (10.10.0.11)
ib-cn05_44083 (10.10.0.15)


What else I have tried:
a) Setting an environment variable:
qwer@mgr: I_MPI_CPUINFO=/proc/cpuinfo
Result: nothing changed.
b) Trying to find the connection port that PBS locks for mpd. I think PBS is looking for the connection to the mpd daemon on the wrong port.

What is the reason for my problem?

About my system:

mgr:~ # cat /etc/SuSE-release
SUSE Linux Enterprise Server 10 (x86_64)
VERSION = 10
PATCHLEVEL = 1

qwer@mgr:/mnt/share/piex> mpiexec -V
Intel MPI Library for Linux, 64-bit applications, Version 3.2.1 Build 20090312
Copyright (C) 2003-2009 Intel Corporation. All rights reserved.

mgr:~ # qstat -Bf
Server: mgr
server_state = Active
server_host = extmgr.hp
scheduling = True
total_jobs = 1
state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:1 Exiting:0 Begun:0
acl_roots = foo,root@mgr
default_queue = workq
log_events = 511
mail_from = adm
query_other_jobs = True
resources_default.ncpus = 1
default_chunk.ncpus = 1
resources_assigned.mem = 0kb
resources_assigned.ncpus = 1
resources_assigned.nodect = 1
scheduler_iteration = 600
FLicenses = 95
resv_enable = True
node_fail_requeue = 310
max_array_size = 10000
pbs_license_file_location = 7788@mgr
pbs_license_min = 0
pbs_license_max = 2147483647
pbs_license_linger_time = 3600
license_count = Avail_Global:95 Avail_Local:0 Used:1 High_Use:96
pbs_version = PBSPro_10.0.0.82981
eligible_time_enable = False

qwer@mgr:/mnt/share/piex> cpuinfo
Architecture : x86_64
Hyperthreading: disabled
Packages : 4
Cores : 16
Processors : 16
===== Processor identification =====
Processor Thread Core Package
0 0 0 0
1 0 0 2
2 0 0 4
3 0 0 6
4 0 1 0
5 0 1 2
6 0 1 4
7 0 1 6
8 0 2 0
9 0 2 2
10 0 2 4
11 0 2 6
12 0 3 0
13 0 3 2
14 0 3 4
15 0 3 6
===== Processor placement =====
Package Cores Processors
0 0,1,2,3 0,4,8,12
2 0,1,2,3 1,5,9,13
4 0,1,2,3 2,6,10,14
6 0,1,2,3 3,7,11,15
===== Cache sharing =====
Cache Size Processors
L1 32 KB no sharing
L2 4 MB (0,4)(1,5)(2,6)(3,7)(8,12)(9,13)(10,14)(11,15)

8 Replies
altlogic09
Beginner
Available resources:
qwer@mgr:/mnt/share/piex> pbsnodes -a
mgr
Mom = extmgr.hp
ntype = PBS
state = free
pcpus = 16
Priority = 0
resources_available.arch = linux
resources_available.host = extmgr
resources_available.mem = 32960976kb
resources_available.ncpus = 16
resources_available.vnode = mgr
resources_assigned.mem = 0kb
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared

cn01
Mom = cn01.hp
ntype = PBS
state = free
pcpus = 16
resources_available.arch = linux
resources_available.host = cn01
resources_available.mem = 32960896kb
resources_available.ncpus = 16
resources_available.vnode = cn01
resources_assigned.mem = 0kb
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared

cn02
Mom = cn02.hp
ntype = PBS
state = free
pcpus = 16
resources_available.arch = linux
resources_available.host = cn02
resources_available.mem = 32960896kb
resources_available.ncpus = 16
resources_available.vnode = cn02
resources_assigned.mem = 0kb
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared

cn03
Mom = cn03.hp
ntype = PBS
state = free
pcpus = 16
resources_available.arch = linux
resources_available.host = cn03
resources_available.mem = 32960896kb
resources_available.ncpus = 16
resources_available.vnode = cn03
resources_assigned.mem = 0kb
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared

cn04
Mom = cn04.hp
ntype = PBS
state = free
pcpus = 16
resources_available.arch = linux
resources_available.host = cn04
resources_available.mem = 32960896kb
resources_available.ncpus = 16
resources_available.vnode = cn04
resources_assigned.mem = 0kb
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared

cn05
Mom = cn05.hp
ntype = PBS
state = free
pcpus = 16
resources_available.arch = linux
resources_available.host = cn05
resources_available.mem = 32960896kb
resources_available.ncpus = 16
resources_available.vnode = cn05
resources_assigned.mem = 0kb
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared



TimP
Honored Contributor III
Quoting - altlogic09


7) What happened?
qwer@mgr:/mnt/share/piex> cat piex.o673
Start date: 27 13:55:47 VLAT 2009
mpiexec_mgr: cannot connect to local mpd (/tmp/pbs.673.mgr/mpd2.console_mgr_qwer); possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a "console" (-n option)
End date: 27 13:55:47 VLAT 2009

In the PBS script, run mpdboot (preferably with -r ssh), using the PBS_NODEFILE node list, so your job has its own mpd on the assigned group of nodes. At the end of your script, run mpdallexit. If you prefer, use mpirun so as to combine mpdboot, mpiexec, and mpdallexit.
It used to be OK to expect the PBS script to inherit the mpivars path settings from the session where you submit the job. Lately, it's necessary to set up the entire environment in the script.
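
A minimal sketch of such a job script, modeled on the test.job above (the mpivars.sh install path is an assumption; adjust it to where Intel MPI is installed on your cluster):

#!/bin/bash
#PBS -q long
#PBS -l nodes=5:ppn=10,mem=100mb,walltime=1:30:00
#PBS -S /bin/bash
#PBS -N piex

# Set up the Intel MPI environment inside the job; PBS may not inherit it
# from the submission session (the install path below is an assumption):
source /opt/intel/impi/3.2.1/bin64/mpivars.sh

cd $PBS_O_WORKDIR

# Build a per-job hosts file from the nodes PBS actually assigned
sort -u $PBS_NODEFILE > mpd.hosts.$PBS_JOBID
NHOSTS=`wc -l < mpd.hosts.$PBS_JOBID`

# Boot an mpd ring on the assigned nodes, run the job, then tear the ring down
mpdboot -n $NHOSTS -f mpd.hosts.$PBS_JOBID -r ssh
mpiexec -ppn 10 -n 50 /mnt/share/piex/pi
mpdallexit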

altlogic09
Beginner
What does this mean?
"Lately, it's necessary to set up the entire environment in the script."

Which variables should I set?

My new batch file for the job:


#!/bin/bash

#PBS -q long
#PBS -l nodes=6
#PBS -l ncpus=90
#PBS -l mem=2GB
#PBS -l walltime=240:00:00
#PBS -S /bin/bash
#PBS -N v3

echo " Start date:`/bin/date`"

cd /mnt/share/testfort/v3_cp

#mpdboot -n 6 -f mpd.hosts -r ssh
#mpiexec -n 90 ./vl_2
#mpdallexit

mpirun -r ssh -n 90 -f mpd.hosts ./vl_2

echo " End date:`/bin/date`"

I get this error message if I launch the task with mpdboot-mpiexec-mpdallexit:


mpdboot_mgr (handle_mpd_output 828): Failed to establish a socket connection with ib-cn01:58575 : (111, 'Connection refused')
mpdboot_mgr (handle_mpd_output 845): failed to connect to mpd on ib-cn01
mpiexec_mgr: cannot connect to local mpd (/tmp/pbs.741.mgr/mpd2.console_mgr_zaytsev); possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a "console" (-n option)
mpdallexit: cannot connect to local mpd (/tmp/pbs.741.mgr/mpd2.console_mgr_zaytsev); possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a "console" (-n option)


I get this error message if I launch the task with mpirun:

mpdboot_mgr (handle_mpd_output 837): failed to ping mpd on cn01; received output={}

If I launch the task outside PBS, everything is OK (both with mpirun and with mpdboot-mpiexec-mpdallexit).

What do the mpd error codes 827, 828, and 845 mean?

TimP
Honored Contributor III
Quoting - altlogic09


#!/bin/bash

#PBS -q long
#PBS -l nodes=6
#PBS -l ncpus=90
#PBS -l mem=2GB
#PBS -l walltime=240:00:00
#PBS -S /bin/bash
#PBS -N v3

echo " Start date:`/bin/date`"

cd /mnt/share/testfort/v3_cp

mpirun -r ssh -n 90 -f mpd.hosts ./vl_2

Where did you get mpd.hosts? In the PBS installations I've seen, the assigned node list appears in $PBS_NODEFILE or some such. If you want to use mpd.hosts, you would have to replace its contents by copying the list passed to your job by PBS. Is ib-cn01 one of the nodes allocated to your job by PBS? How do you know?
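
For example, something like this inside the job script would regenerate mpd.hosts from the nodes PBS actually allocated (a sketch; one entry per unique host):

# overwrite mpd.hosts with the unique hosts PBS assigned to this job
sort -u $PBS_NODEFILE > mpd.hosts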
Gergana_S_Intel
Employee

Hi altlogic09,

Just as a quick clarification on Tim's comments above: the Intel MPI Library is integrated with PBS Pro well enough that you don't have to specify a hosts file when running under the scheduler. I recommend you change your batch file to the following:

#!/bin/bash
#PBS -q long
#PBS -l nodes=6
#PBS -l ncpus=90
#PBS -l mem=2GB
#PBS -l walltime=240:00:00
#PBS -S /bin/bash
#PBS -N v3
echo " Start date:`/bin/date`"
cd /mnt/share/testfort/v3_cp
mpirun -r ssh -n 90 ./vl_2
echo " End date:`/bin/date`"

Note how you don't have to specify the -f option. That's because the Intel MPI Library grabs the list of hosts from PBS directly. Of course, make sure you run mpdallexit to clean up any existing MPDs on the cluster before you submit your new job.
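
For example, from the session on mgr where the manual mpd ring was booted:

mpdallexit   # shuts down the entire mpd ring started from this console
mpdtrace     # should now complain that it cannot connect to a local mpd, confirming the ring is gone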

You can certainly also use the mpdboot-mpiexec-mpdallexit schema under PBS, but that would involve you making sure you're picking up the correct hosts file. Here's a sample based on your batch script:

#!/bin/bash
#PBS -q long
#PBS -l nodes=6
#PBS -l ncpus=90
#PBS -l mem=2GB
#PBS -l walltime=240:00:00
#PBS -S /bin/bash
#PBS -N v3
echo " Start date:`/bin/date`"
cd /mnt/share/testfort/v3_cp
NHOSTS=`cat $PBS_NODEFILE|wc -l`
mpdboot -n $NHOSTS -f $PBS_NODEFILE -r ssh
mpiexec -n 90 ./vl_2
mpdallexit
echo " End date:`/bin/date`"

As you can see, using mpirun is easier. I hope this helps. Let us know how it goes.

Regards,
~Gergana

altlogic09
Beginner

zaytsev@mgr:/mnt/share/testfort/v3_cp> mpdboot -f mpd.hosts -n 6 -r ssh
mpdboot_mgr (handle_mpd_output 837): failed to ping mpd on cn01; received output={}

What is the error?

I have passwordless access to all nodes. There are no identity.pub and identity files in the ~/.ssh directory. Is that OK?

zaytsev@mgr:~/.ssh> l
53316
drwx------ 2 zaytsev toguusers 4096 2009-11-03 18:04 ./
drwx------ 27 zaytsev users 4096 2009-10-28 10:58 ../
-rw-r--r-- 1 zaytsev users 393 2009-07-17 15:55 authorized_keys2
-rw-r--r-- 1 zaytsev users 0 2009-11-03 18:04 cat
-rw------- 1 zaytsev users 1675 2009-07-17 15:55 id_rsa
-rw-r--r-- 1 zaytsev users 393 2009-07-17 15:55 id_rsa.pub
-rw-r--r-- 1 zaytsev users 2930 2009-11-03 17:30 known_hosts
-rw-r--r-- 1 zaytsev users 54511357 2009-10-22 22:49 VNI.IMSL.Fortran.Numerical.Library.v6.0.for.Sun.Studio.12.LINUX.EM64T-TBE.rar

zaytsev@mgr:~/.ssh> cat known_hosts
cn01,10.0.0.11 ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEAngZyLEl/RS+Rxo5tmGxT/bX13OjQlRWGOmzgMI0dOvANRxC8OwknURkm50yDU/cOkJf8JZc1g0AJCNUZs4dvXZWcmJlOzJO+j7VRv7Ei/R2XHur6pmyeCQcl0dgb4piL2HAd/cH8t9A4bP1RWzlfwyNIHd2/f68SqmeHHmdzelU=
cn02,10.0.0.12 ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEAngZyLEl/RS+Rxo5tmGxT/bX13OjQlRWGOmzgMI0dOvANRxC8OwknURkm50yDU/cOkJf8JZc1g0AJCNUZs4dvXZWcmJlOzJO+j7VRv7Ei/R2XHur6pmyeCQcl0dgb4piL2HAd/cH8t9A4bP1RWzlfwyNIHd2/f68SqmeHHmdzelU=
cn03,10.0.0.13 ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEAngZyLEl/RS+Rxo5tmGxT/bX13OjQlRWGOmzgMI0dOvANRxC8OwknURkm50yDU/cOkJf8JZc1g0AJCNUZs4dvXZWcmJlOzJO+j7VRv7Ei/R2XHur6pmyeCQcl0dgb4piL2HAd/cH8t9A4bP1RWzlfwyNIHd2/f68SqmeHHmdzelU=
cn04,10.0.0.14 ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEAngZyLEl/RS+Rxo5tmGxT/bX13OjQlRWGOmzgMI0dOvANRxC8OwknURkm50yDU/cOkJf8JZc1g0AJCNUZs4dvXZWcmJlOzJO+j7VRv7Ei/R2XHur6pmyeCQcl0dgb4piL2HAd/cH8t9A4bP1RWzlfwyNIHd2/f68SqmeHHmdzelU=
cn05,10.0.0.15 ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEAngZyLEl/RS+Rxo5tmGxT/bX13OjQlRWGOmzgMI0dOvANRxC8OwknURkm50yDU/cOkJf8JZc1g0AJCNUZs4dvXZWcmJlOzJO+j7VRv7Ei/R2XHur6pmyeCQcl0dgb4piL2HAd/cH8t9A4bP1RWzlfwyNIHd2/f68SqmeHHmdzelU=
ib-cn01,10.10.0.11 ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEAngZyLEl/RS+Rxo5tmGxT/bX13OjQlRWGOmzgMI0dOvANRxC8OwknURkm50yDU/cOkJf8JZc1g0AJCNUZs4dvXZWcmJlOzJO+j7VRv7Ei/R2XHur6pmyeCQcl0dgb4piL2HAd/cH8t9A4bP1RWzlfwyNIHd2/f68SqmeHHmdzelU=
ib-cn02,10.10.0.12 ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEAngZyLEl/RS+Rxo5tmGxT/bX13OjQlRWGOmzgMI0dOvANRxC8OwknURkm50yDU/cOkJf8JZc1g0AJCNUZs4dvXZWcmJlOzJO+j7VRv7Ei/R2XHur6pmyeCQcl0dgb4piL2HAd/cH8t9A4bP1RWzlfwyNIHd2/f68SqmeHHmdzelU=
ib-cn03,10.10.0.13 ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEAngZyLEl/RS+Rxo5tmGxT/bX13OjQlRWGOmzgMI0dOvANRxC8OwknURkm50yDU/cOkJf8JZc1g0AJCNUZs4dvXZWcmJlOzJO+j7VRv7Ei/R2XHur6pmyeCQcl0dgb4piL2HAd/cH8t9A4bP1RWzlfwyNIHd2/f68SqmeHHmdzelU=
ib-cn04,10.10.0.14 ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEAngZyLEl/RS+Rxo5tmGxT/bX13OjQlRWGOmzgMI0dOvANRxC8OwknURkm50yDU/cOkJf8JZc1g0AJCNUZs4dvXZWcmJlOzJO+j7VRv7Ei/R2XHur6pmyeCQcl0dgb4piL2HAd/cH8t9A4bP1RWzlfwyNIHd2/f68SqmeHHmdzelU=
ib-cn05,10.10.0.15 ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEAngZyLEl/RS+Rxo5tmGxT/bX13OjQlRWGOmzgMI0dOvANRxC8OwknURkm50yDU/cOkJf8JZc1g0AJCNUZs4dvXZWcmJlOzJO+j7VRv7Ei/R2XHur6pmyeCQcl0dgb4piL2HAd/cH8t9A4bP1RWzlfwyNIHd2/f68SqmeHHmdzelU=
ib-mgr,10.10.0.1 ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEAulRy7M+gVL2+mvg7+QGzhEbW8Hk2H7AxtqEjmZ6iZkaxwdbVMEfxpsgsrJ9EcWQWiGJ4K3qfKz+9dpfq0AskZNOnI0cZdeolpSObgLiQva6g/69dYrzx1WLlf98bU1YMuZ5Cll2PTcHHpoTCC30hkDVeRcifKzR9FRSIr9MtF+s=
mgr,10.0.0.1 ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEAulRy7M+gVL2+mvg7+QGzhEbW8Hk2H7AxtqEjmZ6iZkaxwdbVMEfxpsgsrJ9EcWQWiGJ4K3qfKz+9dpfq0AskZNOnI0cZdeolpSObgLiQva6g/69dYrzx1WLlf98bU1YMuZ5Cll2PTcHHpoTCC30hkDVeRcifKzR9FRSIr9MtF+s=
10.10.190.10 ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEAulRy7M+gVL2+mvg7+QGzhEbW8Hk2H7AxtqEjmZ6iZkaxwdbVMEfxpsgsrJ9EcWQWiGJ4K3qfKz+9dpfq0AskZNOnI0cZdeolpSObgLiQva6g/69dYrzx1WLlf98bU1YMuZ5Cll2PTcHHpoTCC30hkDVeRcifKzR9FRSIr9MtF+s=



Gergana_S_Intel
Employee

Hi altlogic09,

Well, since you have an account where this works, and an account where this doesn't, I would say compare the environments of the two and see how they differ.

For example, on our local clusters, my account has the authorized_keys file under the .ssh directory, not authorized_keys2. I'm not sure if the ssh settings require a specific name. Your known_hosts file looks good enough, assuming no corruption in the encryption lines.
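
One quick way to check the ssh side is to confirm that every node in your mpd.hosts accepts a non-interactive login from the problem account, for example:

# BatchMode=yes makes ssh fail immediately instead of prompting for a password
for h in ib-mgr ib-cn01 ib-cn02 ib-cn03 ib-cn04 ib-cn05; do
    ssh -o BatchMode=yes $h hostname || echo "ssh to $h failed"
done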

Also, the Intel MPI Library creates some logfiles for the user in the /tmp directory on the mgr and cn01 nodes. Those would be good to look at.
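
For example (the exact file names are an assumption based on the usual mpd naming scheme; look for anything mpd-related under /tmp):

# on the head node (mgr)
ls -l /tmp/mpd2.*
tail -n 50 /tmp/mpd2.logfile_*

# on the compute node that refuses the connection
ssh cn01 'ls -l /tmp/mpd2.*; tail -n 50 /tmp/mpd2.logfile_*'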

Regards,
~Gergana

altlogic09
Beginner
I was mistaken. I don't currently have an account where mpdboot works right! I can launch a task only with the mpirun command (without PBS). I get this error:

qwer@mgr:/mnt/share/piex> mpdboot -r ssh -n 6 -f mpd.hosts
mpdboot_mgr (handle_mpd_output 828): Failed to establish a socket connection with ib-cn01:42335 : (111, 'Connection refused')
mpdboot_mgr (handle_mpd_output 845): failed to connect to mpd on ib-cn01

!!!

After killing all processes for user qwer on all nodes, I tried starting mpdboot manually again. Now mpdboot boots correctly! And it also boots correctly under PBS. WHY???
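
One likely explanation is that stale mpd daemons or leftover console sockets from earlier runs were still present on some nodes, and the new mpdboot could not connect to them. A sketch of clearing them out before booting a fresh ring, using the mpdcleanup utility that ships with the mpd tools:

# kill leftover mpd daemons and remove their sockets on every host in mpd.hosts
mpdcleanup -f mpd.hosts -r ssh

# then boot and verify a clean ring
mpdboot -n 6 -f mpd.hosts -r ssh
mpdtrace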
