Hi!
I have a problem with Altair PBS Pro + Intel MPI. I can launch a task with the mpiexec command on several nodes, but when I try to launch the same task on several nodes under PBS, I get an error.
What I am doing:
1) Start mpd on the nodes:
qwer@mgr:/mnt/share/piex> cat mpd.hosts
ib-mgr:10
ib-cn01:16
ib-cn02:16
ib-cn03:16
ib-cn04:16
ib-cn05:16
qwer@mgr:/mnt/share/piex> mpdboot -n 6 -f mpd.hosts -r ssh
2) Check:
qwer@mgr:/mnt/share/piex> mpdtrace
ib-mgr
ib-cn04
ib-cn03
ib-cn02
ib-cn01
ib-cn05
3) Start the MPI program without PBS:
qwer@mgr:/mnt/share/piex> mpiexec -ppn 10 -n 50 /mnt/share/piex/pi -nolocal
Process 24 on ib-cn04
Process 22 on ib-cn04
Process 13 on ib-mgr [Why is -nolocal ignored?]
Process 29 on ib-cn04
Process 21 on ib-cn04
...
Process 25 on ib-cn04
Process 26 on ib-cn04
Process 36 on ib-cn03
pi = 3.1415926535897931
time = 0.435737 sec.
OK. The task was launched correctly on all nodes.
4) Make a job file for PBS:
qwer@mgr:/mnt/share/piex> cat test.job
#!/bin/bash
#PBS -q long
#PBS -l nodes=5:ppn=10,mem=100mb,walltime=1:30:00
#PBS -S /bin/bash
#PBS -N piex
echo " Start date:`/bin/date`"
mpiexec -ppn 10 -n 50 /mnt/share/piex/pi -nolocal
echo " End date:`/bin/date`"
5) Start the MPI program under PBS:
qwer@mgr:/mnt/share/piex> qsub test.job
673.mgr
6) Where is my job?
qwer@mgr:/mnt/share/piex> qstat
7) What happened?
qwer@mgr:/mnt/share/piex> cat piex.o673
Start date: 27 13:55:47 VLAT 2009
mpiexec_mgr: cannot connect to local mpd (/tmp/pbs.673.mgr/mpd2.console_mgr_qwer); possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a "console" (-n option)
End date: 27 13:55:47 VLAT 2009
8) Is mpd really not running?
qwer@mgr:/mnt/share/piex> mpdtrace -l
ib-mgr_60696 (10.10.0.1)
ib-cn04_41952 (10.10.0.14)
ib-cn03_43736 (10.10.0.13)
ib-cn02_45542 (10.10.0.12)
ib-cn01_52394 (10.10.0.11)
ib-cn05_44083 (10.10.0.15)
What else I have tried:
a) Setting the environment variable:
qwer@mgr: I_MPI_CPUINFO=/proc/cpuinfo
Result: nothing changed.
b) Trying to find the connection port that PBS uses for mpd. I think PBS is looking for the connection to the mpd daemon on the wrong port.
What is the reason for my problem?
About my system:
mgr:~ # cat /etc/SuSE-release
SUSE Linux Enterprise Server 10 (x86_64)
VERSION = 10
PATCHLEVEL = 1
qwer@mgr:/mnt/share/piex> mpiexec -V
Intel MPI Library for Linux, 64-bit applications, Version 3.2.1 Build 20090312
Copyright (C) 2003-2009 Intel Corporation. All rights reserved.
mgr:~ # qstat -Bf
Server: mgr
server_state = Active
server_host = extmgr.hp
scheduling = True
total_jobs = 1
state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:1 Exiting:0 Begun:0
acl_roots = foo,root@mgr
default_queue = workq
log_events = 511
mail_from = adm
query_other_jobs = True
resources_default.ncpus = 1
default_chunk.ncpus = 1
resources_assigned.mem = 0kb
resources_assigned.ncpus = 1
resources_assigned.nodect = 1
scheduler_iteration = 600
FLicenses = 95
resv_enable = True
node_fail_requeue = 310
max_array_size = 10000
pbs_license_file_location = 7788@mgr
pbs_license_min = 0
pbs_license_max = 2147483647
pbs_license_linger_time = 3600
license_count = Avail_Global:95 Avail_Local:0 Used:1 High_Use:96
pbs_version = PBSPro_10.0.0.82981
eligible_time_enable = False
qwer@mgr:/mnt/share/piex> cpuinfo
Architecture : x86_64
Hyperthreading: disabled
Packages : 4
Cores : 16
Processors : 16
===== Processor identification =====
Processor Thread Core Package
0 0 0 0
1 0 0 2
2 0 0 4
3 0 0 6
4 0 1 0
5 0 1 2
6 0 1 4
7 0 1 6
8 0 2 0
9 0 2 2
10 0 2 4
11 0 2 6
12 0 3 0
13 0 3 2
14 0 3 4
15 0 3 6
===== Processor placement =====
Package Cores Processors
0 0,1,2,3 0,4,8,12
2 0,1,2,3 1,5,9,13
4 0,1,2,3 2,6,10,14
6 0,1,2,3 3,7,11,15
===== Cache sharing =====
Cache Size Processors
L1 32 KB no sharing
L2 4 MB (0,4)(1,5)(2,6)(3,7)(8,12)(9,13)(10,14)(11,15)
mgr
Mom = extmgr.hp
ntype = PBS
state = free
pcpus = 16
Priority = 0
resources_available.arch = linux
resources_available.host = extmgr
resources_available.mem = 32960976kb
resources_available.ncpus = 16
resources_available.vnode = mgr
resources_assigned.mem = 0kb
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared
cn01
Mom = cn01.hp
ntype = PBS
state = free
pcpus = 16
resources_available.arch = linux
resources_available.host = cn01
resources_available.mem = 32960896kb
resources_available.ncpus = 16
resources_available.vnode = cn01
resources_assigned.mem = 0kb
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared
cn02
Mom = cn02.hp
ntype = PBS
state = free
pcpus = 16
resources_available.arch = linux
resources_available.host = cn02
resources_available.mem = 32960896kb
resources_available.ncpus = 16
resources_available.vnode = cn02
resources_assigned.mem = 0kb
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared
cn03
Mom = cn03.hp
ntype = PBS
state = free
pcpus = 16
resources_available.arch = linux
resources_available.host = cn03
resources_available.mem = 32960896kb
resources_available.ncpus = 16
resources_available.vnode = cn03
resources_assigned.mem = 0kb
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared
cn04
Mom = cn04.hp
ntype = PBS
state = free
pcpus = 16
resources_available.arch = linux
resources_available.host = cn04
resources_available.mem = 32960896kb
resources_available.ncpus = 16
resources_available.vnode = cn04
resources_assigned.mem = 0kb
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared
cn05
Mom = cn05.hp
ntype = PBS
state = free
pcpus = 16
resources_available.arch = linux
resources_available.host = cn05
resources_available.mem = 32960896kb
resources_available.ncpus = 16
resources_available.vnode = cn05
resources_assigned.mem = 0kb
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared
7) What happened?
qwer@mgr:/mnt/share/piex> cat piex.o673
Start date: 27 13:55:47 VLAT 2009
mpiexec_mgr: cannot connect to local mpd (/tmp/pbs.673.mgr/mpd2.console_mgr_qwer); possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a "console" (-n option)
End date: 27 13:55:47 VLAT 2009
It used to be OK to expect the PBS script to inherit the mpivars path settings from the session where you submit the job. Lately, it's necessary to set up the entire environment in the script.
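For example, here is a minimal sketch of what that might look like at the top of the job script (the mpivars.sh install path below is only an assumption; adjust it to wherever Intel MPI 3.2.1 actually lives on your cluster):
#!/bin/bash
#PBS -q long
#PBS -l nodes=5:ppn=10,mem=100mb,walltime=1:30:00
#PBS -S /bin/bash
#PBS -N piex
# Assumed install prefix: replace with your actual Intel MPI location
source /opt/intel/impi/3.2.1/bin64/mpivars.sh
echo " Start date:`/bin/date`"
mpiexec -ppn 10 -n 50 /mnt/share/piex/pi
echo " End date:`/bin/date`"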
Which variables should I set?
My new batch file for the job:
#!/bin/bash
#PBS -q long
#PBS -l nodes=6
#PBS -l ncpus=90
#PBS -l mem=2GB
#PBS -l walltime=240:00:00
#PBS -S /bin/bash
#PBS -N v3
echo " Start date:`/bin/date`"
cd /mnt/share/testfort/v3_cp
#mpdboot -n 6 -f mpd.hosts -r ssh
#mpiexec -n 90 ./vl_2
#mpdallexit
mpirun -r ssh -n 90 -f mpd.hosts ./vl_2
echo " End date:`/bin/date`"
I get this error message if I launch the task with mpdboot-mpiexec-mpdallexit:
mpdboot_mgr (handle_mpd_output 828): Failed to establish a socket connection with ib-cn01:58575 : (111, 'Connection refused')
mpdboot_mgr (handle_mpd_output 845): failed to connect to mpd on ib-cn01
mpiexec_mgr: cannot connect to local mpd (/tmp/pbs.741.mgr/mpd2.console_mgr_zaytsev); possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a "console" (-n option)
mpdallexit: cannot connect to local mpd (/tmp/pbs.741.mgr/mpd2.console_mgr_zaytsev); possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a "console" (-n option)
I get this error message if I launch the task with mpirun:
mpdboot_mgr (handle_mpd_output 837): failed to ping mpd on cn01; received output={}
If I launch the task outside PBS, everything is OK (both with mpirun and with mpdboot-mpiexec-mpdallexit).
What do the MPI error codes 827, 828, and 845 mean???
#!/bin/bash
#PBS -q long
#PBS -l nodes=6
#PBS -l ncpus=90
#PBS -l mem=2GB
#PBS -l walltime=240:00:00
#PBS -S /bin/bash
#PBS -N v3
echo " Start date:`/bin/date`"
cd /mnt/share/testfort/v3_cp
mpirun -r ssh -n 90 -f mpd.hosts ./vl_2
Hi altlogic09,
Just as a quick clarification on Tim's comments above: the Intel MPI Library is integrated with PBS Pro well enough that you don't have to specify a hosts file when running under the scheduler. I recommend you change your batch file to the following:
#!/bin/bash
#PBS -q long
#PBS -l nodes=6
#PBS -l ncpus=90
#PBS -l mem=2GB
#PBS -l walltime=240:00:00
#PBS -S /bin/bash
#PBS -N v3
echo " Start date:`/bin/date`"
cd /mnt/share/testfort/v3_cp
mpirun -r ssh -n 90 ./vl_2
echo " End date:`/bin/date`"
Note how you don't have to specify the -f option. That's because the Intel MPI Library grabs the list of hosts from PBS directly. Of course, make sure you run mpdallexit to clean up any existing MPDs on the cluster before you submit your new job.
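For instance, something like this on the head node before qsub (a rough sketch; mpdcleanup options can vary between versions, so check mpdcleanup --help on your install):
mpdallexit                       # shut down the mpd ring started from mgr
mpdcleanup -f mpd.hosts -r ssh   # remove any leftover mpd daemons/console files on the listed hosts
qsub test.job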
You can certainly also use the mpdboot-mpiexec-mpdallexit scheme under PBS, but that would involve you making sure you're picking up the correct hosts file. Here's a sample based on your batch script:
#!/bin/bash
#PBS -q long
#PBS -l nodes=6
#PBS -l ncpus=90
#PBS -l mem=2GB
#PBS -l walltime=240:00:00
#PBS -S /bin/bash
#PBS -N v3
echo " Start date:`/bin/date`"
cd /mnt/share/testfort/v3_cp
NHOSTS=`cat $PBS_NODEFILE|wc -l`
mpdboot -n $NHOSTS -f $PBS_NODEFILE -r ssh
mpiexec -n 90 ./vl_2
mpdallexit
echo " End date:`/bin/date`"
As you can see, using mpirun is easier. I hope this helps. Let us know how it goes.
Regards,
~Gergana
zaytsev@mgr:/mnt/share/testfort/v3_cp> mpdboot -f mpd.hosts -n 6 -r ssh
mpdboot_mgr (handle_mpd_output 837): failed to ping mpd on cn01; received output={}
zaytsev@mgr:~/.ssh> l
53316
drwx------ 2 zaytsev toguusers 4096 2009-11-03 18:04 ./
drwx------ 27 zaytsev users 4096 2009-10-28 10:58 ../
-rw-r--r-- 1 zaytsev users 393 2009-07-17 15:55 authorized_keys2
-rw-r--r-- 1 zaytsev users 0 2009-11-03 18:04 cat
-rw------- 1 zaytsev users 1675 2009-07-17 15:55 id_rsa
-rw-r--r-- 1 zaytsev users 393 2009-07-17 15:55 id_rsa.pub
-rw-r--r-- 1 zaytsev users 2930 2009-11-03 17:30 known_hosts
-rw-r--r-- 1 zaytsev users 54511357 2009-10-22 22:49 VNI.IMSL.Fortran.Numerical.Library.v6.0.for.Sun.Studio.12.LINUX.EM64T-TBE.rar
cn01,10.0.0.11 ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEAngZyLEl/RS+Rxo5tmGxT/bX13OjQlRWGOmzgMI0dOvANRxC8OwknURkm50yDU/cOkJf8JZc1g0AJCNUZs4dvXZWcmJlOzJO+j7VRv7Ei/R2XHur6pmyeCQcl0dgb4piL2HAd/cH8t9A4bP1RWzlfwyNIHd2/f68SqmeHHmdzelU=
cn02,10.0.0.12 ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEAngZyLEl/RS+Rxo5tmGxT/bX13OjQlRWGOmzgMI0dOvANRxC8OwknURkm50yDU/cOkJf8JZc1g0AJCNUZs4dvXZWcmJlOzJO+j7VRv7Ei/R2XHur6pmyeCQcl0dgb4piL2HAd/cH8t9A4bP1RWzlfwyNIHd2/f68SqmeHHmdzelU=
cn03,10.0.0.13 ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEAngZyLEl/RS+Rxo5tmGxT/bX13OjQlRWGOmzgMI0dOvANRxC8OwknURkm50yDU/cOkJf8JZc1g0AJCNUZs4dvXZWcmJlOzJO+j7VRv7Ei/R2XHur6pmyeCQcl0dgb4piL2HAd/cH8t9A4bP1RWzlfwyNIHd2/f68SqmeHHmdzelU=
cn04,10.0.0.14 ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEAngZyLEl/RS+Rxo5tmGxT/bX13OjQlRWGOmzgMI0dOvANRxC8OwknURkm50yDU/cOkJf8JZc1g0AJCNUZs4dvXZWcmJlOzJO+j7VRv7Ei/R2XHur6pmyeCQcl0dgb4piL2HAd/cH8t9A4bP1RWzlfwyNIHd2/f68SqmeHHmdzelU=
cn05,10.0.0.15 ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEAngZyLEl/RS+Rxo5tmGxT/bX13OjQlRWGOmzgMI0dOvANRxC8OwknURkm50yDU/cOkJf8JZc1g0AJCNUZs4dvXZWcmJlOzJO+j7VRv7Ei/R2XHur6pmyeCQcl0dgb4piL2HAd/cH8t9A4bP1RWzlfwyNIHd2/f68SqmeHHmdzelU=
ib-cn01,10.10.0.11 ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEAngZyLEl/RS+Rxo5tmGxT/bX13OjQlRWGOmzgMI0dOvANRxC8OwknURkm50yDU/cOkJf8JZc1g0AJCNUZs4dvXZWcmJlOzJO+j7VRv7Ei/R2XHur6pmyeCQcl0dgb4piL2HAd/cH8t9A4bP1RWzlfwyNIHd2/f68SqmeHHmdzelU=
ib-cn02,10.10.0.12 ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEAngZyLEl/RS+Rxo5tmGxT/bX13OjQlRWGOmzgMI0dOvANRxC8OwknURkm50yDU/cOkJf8JZc1g0AJCNUZs4dvXZWcmJlOzJO+j7VRv7Ei/R2XHur6pmyeCQcl0dgb4piL2HAd/cH8t9A4bP1RWzlfwyNIHd2/f68SqmeHHmdzelU=
ib-cn03,10.10.0.13 ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEAngZyLEl/RS+Rxo5tmGxT/bX13OjQlRWGOmzgMI0dOvANRxC8OwknURkm50yDU/cOkJf8JZc1g0AJCNUZs4dvXZWcmJlOzJO+j7VRv7Ei/R2XHur6pmyeCQcl0dgb4piL2HAd/cH8t9A4bP1RWzlfwyNIHd2/f68SqmeHHmdzelU=
ib-cn04,10.10.0.14 ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEAngZyLEl/RS+Rxo5tmGxT/bX13OjQlRWGOmzgMI0dOvANRxC8OwknURkm50yDU/cOkJf8JZc1g0AJCNUZs4dvXZWcmJlOzJO+j7VRv7Ei/R2XHur6pmyeCQcl0dgb4piL2HAd/cH8t9A4bP1RWzlfwyNIHd2/f68SqmeHHmdzelU=
ib-cn05,10.10.0.15 ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEAngZyLEl/RS+Rxo5tmGxT/bX13OjQlRWGOmzgMI0dOvANRxC8OwknURkm50yDU/cOkJf8JZc1g0AJCNUZs4dvXZWcmJlOzJO+j7VRv7Ei/R2XHur6pmyeCQcl0dgb4piL2HAd/cH8t9A4bP1RWzlfwyNIHd2/f68SqmeHHmdzelU=
ib-mgr,10.10.0.1 ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEAulRy7M+gVL2+mvg7+QGzhEbW8Hk2H7AxtqEjmZ6iZkaxwdbVMEfxpsgsrJ9EcWQWiGJ4K3qfKz+9dpfq0AskZNOnI0cZdeolpSObgLiQva6g/69dYrzx1WLlf98bU1YMuZ5Cll2PTcHHpoTCC30hkDVeRcifKzR9FRSIr9MtF+s=
mgr,10.0.0.1 ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEAulRy7M+gVL2+mvg7+QGzhEbW8Hk2H7AxtqEjmZ6iZkaxwdbVMEfxpsgsrJ9EcWQWiGJ4K3qfKz+9dpfq0AskZNOnI0cZdeolpSObgLiQva6g/69dYrzx1WLlf98bU1YMuZ5Cll2PTcHHpoTCC30hkDVeRcifKzR9FRSIr9MtF+s=
10.10.190.10 ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEAulRy7M+gVL2+mvg7+QGzhEbW8Hk2H7AxtqEjmZ6iZkaxwdbVMEfxpsgsrJ9EcWQWiGJ4K3qfKz+9dpfq0AskZNOnI0cZdeolpSObgLiQva6g/69dYrzx1WLlf98bU1YMuZ5Cll2PTcHHpoTCC30hkDVeRcifKzR9FRSIr9MtF+s=
Hi altlogic09,
Well, since you have an account where this works, and an account where this doesn't, I would say compare the environments of the two and see how they differ.
For example, on our local clusters, my account has the authorized_keys file under the .ssh directory, not authorized_keys2. I'm not sure if the ssh settings require a specific name. Your known_hosts file looks good enough, assuming no corruption in the encryption lines.
Also, the Intel MPI Library creates some logfiles for the user in the /tmp directory on the mgr and cn01 nodes. Those would be good to look at.
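As a rough sketch of what to check (the mpd2.* file names are the usual mpd defaults and may differ on your install; the env dump file name is just an example):
ls -l /tmp/mpd2.console_* /tmp/mpd2.logfile_* 2>/dev/null    # mpd console and log files on mgr
ssh cn01 'ls -l /tmp/mpd2.console_* /tmp/mpd2.logfile_*'     # the same check on cn01
env | sort > /tmp/env_$USER.txt    # run once per account, then diff the two files to compare environments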
Regards,
~Gergana
qwer@mgr:/mnt/share/piex> mpdboot -r ssh -n 6 -f mpd.hosts
mpdboot_mgr (handle_mpd_output 828): Failed to establish a socket connection with ib-cn01:42335 : (111, 'Connection refused')
mpdboot_mgr (handle_mpd_output 845): failed to connect to mpd on ib-cn01
!!!
After killing all processes for user qwer on all nodes, I tried starting mpdboot manually again. Now mpdboot starts correctly! And it also starts correctly under PBS. WHY???
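For reference, the cleanup described above amounts to roughly the following (host names taken from the mpd.hosts shown earlier; the pkill pattern is only a guess at how the mpd daemons show up in the process list):
for h in ib-mgr ib-cn01 ib-cn02 ib-cn03 ib-cn04 ib-cn05; do
    ssh $h 'pkill -u qwer -f mpd'    # kill any leftover mpd-related processes for user qwer on each node
done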