Intel® MPI Library

mpd shut down

lpa
Beginner
Hello,

I am running a 32-CPU job using Intel MPI.
Tests have been done with versions 3.1, 3.2, and 4.0.
The cluster runs the OpenPBS queue manager on CentOS 4.8, alongside other simulation software.
The total number of CPUs is over 600.

If other jobs using many CPUs are already running, our submitted job fails with the message:
mpdboot_node67.beicip.co.fr (handle_mpd_output 883):Failed to establish a socket connection with node66.beicip.co.fr:54578 : (111, 'Connection refused')
mpdboot_node67.beicip.co.fr (handle_mpd_output 902): failed to connect to mpd on node66.beicp.co.fr

There is also an initial message saying that mpd did not answer.
I am posting the full debug output below, as it seems mpd was running fine beforehand on both nodes 66 and 67.

I did a test with a hostfile on 150 CPUs, using a small script, and everything worked fine; mpdtrace was OK on all nodes.
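
For reference, the check was essentially the following (the node count and hostfile path here are illustrative):

mpdboot -n 19 -f ./hostfile -r /usr/bin/ssh    # start one mpd per host listed in ./hostfile
mpdtrace                                       # every node in the ring should answer
mpirun -np 150 hostname                        # trivial 150-process job over the ring
mpdallexit                                     # tear the ring down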

If you have any idea, or any test I could try, let me know.

Thanks a lot
Laurent

+ mpirun --rsh=/usr/bin/ssh -np 1 /apps/puma/Frs400f/../Frs400f/bin/Linux/RHEL4/x86_64/ref-mpi-IntelMPI3.1-intel9.1/pre.ref
debug: starting
running mpdallexit on node67.beicip.co.fr
LAUNCHED mpd on node67.beicip.co.fr via
debug: launch cmd= env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 /apps/puma/Frs400f/IntelMPI_3.1/intel64/bin/mpd.py --ncpus=1 --myhost=node67.beicip.co.fr -e -d -s 32
debug: mpd on node67.beicip.co.fr on port 42492
RUNNING: mpd on node67.beicip.co.fr
debug: info for running mpd: {'ip': '192.168.2.67', 'ncpus': 1, 'list_port': 42492, 'entry_port': '', 'host': 'node67.beicip.co.fr', 'entry_host': '', 'ifhn': ''}
LAUNCHED mpd on node66.beicip.co.fr via node67.beicip.co.fr
debug: launch cmd= /usr/bin/ssh -n node66.beicip.co.fr env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME MPD_CON_EXT=6733.hpc1.beicip.co.fr_12462 I_MPI_JOB_CONTEXT= TMPDIR=/data/users/sja6401/UZ1A/Puma_UZ/PumaModel/pmn32390.dfirst12412 I_MPI_MPD_TMPDIR=/tmp /apps/puma/Frs400f/IntelMPI_3.1/intel64/bin/mpd.py -h node67.beicip.co.fr -p 42492 --ifhn=192.168.2.66 --ncpus=1 --myhost=node66.beicip.co.fr --myip=192.168.2.66 -e -d -s 32
LAUNCHED mpd on node65.beicip.co.fr via node67.beicip.co.fr
debug: launch cmd= /usr/bin/ssh -n node65.beicip.co.fr env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME MPD_CON_EXT=6733.hpc1.beicip.co.fr_12462 I_MPI_JOB_CONTEXT= TMPDIR=/data/users/sja6401/UZ1A/Puma_UZ/PumaModel/pmn32390.dfirst12412 I_MPI_MPD_TMPDIR=/tmp /apps/puma/Frs400f/IntelMPI_3.1/intel64/bin/mpd.py -h node67.beicip.co.fr -p 42492 --ifhn=192.168.2.65 --ncpus=1 --myhost=node65.beicip.co.fr --myip=192.168.2.65 -e -d -s 32
LAUNCHED mpd on node64.beicip.co.fr via node67.beicip.co.fr
debug: launch cmd= /usr/bin/ssh -n node64.beicip.co.fr env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME MPD_CON_EXT=6733.hpc1.beicip.co.fr_12462 I_MPI_JOB_CONTEXT= TMPDIR=/data/users/sja6401/UZ1A/Puma_UZ/PumaModel/pmn32390.dfirst12412 I_MPI_MPD_TMPDIR=/tmp /apps/puma/Frs400f/IntelMPI_3.1/intel64/bin/mpd.py -h node67.beicip.co.fr -p 42492 --ifhn=192.168.2.64 --ncpus=1 --myhost=node64.beicip.co.fr --myip=192.168.2.64 -e -d -s 32
LAUNCHED mpd on node59.beicip.co.fr via node67.beicip.co.fr
debug: launch cmd= /usr/bin/ssh -n node59.beicip.co.fr env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME MPD_CON_EXT=6733.hpc1.beicip.co.fr_12462 I_MPI_JOB_CONTEXT= TMPDIR=/data/users/sja6401/UZ1A/Puma_UZ/PumaModel/pmn32390.dfirst12412 I_MPI_MPD_TMPDIR=/tmp /apps/puma/Frs400f/IntelMPI_3.1/intel64/bin/mpd.py -h node67.beicip.co.fr -p 42492 --ifhn=192.168.2.59 --ncpus=1 --myhost=node59.beicip.co.fr --myip=192.168.2.59 -e -d -s 32
debug: mpd on node66.beicip.co.fr on port 37774
RUNNING: mpd on node66.beicip.co.fr
debug: info for running mpd: {'ip': '192.168.2.66', 'ncpus': 1, 'list_port': 37774, 'entry_port': 42492, 'host': 'node66.beicip.co.fr', 'entry_host': 'node67.beicip.co.fr', 'ifhn': '', 'pid': 12495}
debug: mpd on node65.beicip.co.fr on port 44250
RUNNING: mpd on node65.beicip.co.fr
debug: info for running mpd: {'ip': '192.168.2.65', 'ncpus': 1, 'list_port': 44250, 'entry_port': 42492, 'host': 'node65.beicip.co.fr', 'entry_host': 'node67.beicip.co.fr', 'ifhn': '', 'pid': 12496}
debug: mpd on node64.beicip.co.fr on port 36465
RUNNING: mpd on node64.beicip.co.fr
debug: info for running mpd: {'ip': '192.168.2.64', 'ncpus': 1, 'list_port': 36465, 'entry_port': 42492, 'host': 'node64.beicip.co.fr', 'entry_host': 'node67.beicip.co.fr', 'ifhn': '', 'pid': 12497}
debug: mpd on node59.beicip.co.fr on port 48882
RUNNING: mpd on node59.beicip.co.fr
debug: info for running mpd: {'ip': '192.168.2.59', 'ncpus': 1, 'list_port': 48882, 'entry_port': 42492, 'host': 'node59.beicip.co.fr', 'entry_host': 'node67.beicip.co.fr', 'ifhn': '', 'pid': 12498}
LAUNCHED mpd on node58.beicip.co.fr via node66.beicip.co.fr
debug: launch cmd= /usr/bin/ssh -n node58.beicip.co.fr env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME MPD_CON_EXT=6733.hpc1.beicip.co.fr_12462 I_MPI_JOB_CONTEXT= TMPDIR=/data/users/sja6401/UZ1A/Puma_UZ/PumaModel/pmn32390.dfirst12412 I_MPI_MPD_TMPDIR=/tmp /apps/puma/Frs400f/IntelMPI_3.1/intel64/bin/mpd.py -h node66.beicip.co.fr -p 37774 --ifhn=192.168.2.58 --ncpus=1 --myhost=node58.beicip.co.fr --myip=192.168.2.58 -e -d -s 32
LAUNCHED mpd on node57.beicip.co.fr via node66.beicip.co.fr
debug: launch cmd= /usr/bin/ssh -n node57.beicip.co.fr env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME MPD_CON_EXT=6733.hpc1.beicip.co.fr_12462 I_MPI_JOB_CONTEXT= TMPDIR=/data/users/sja6401/UZ1A/Puma_UZ/PumaModel/pmn32390.dfirst12412 I_MPI_MPD_TMPDIR=/tmp /apps/puma/Frs400f/IntelMPI_3.1/intel64/bin/mpd.py -h node66.beicip.co.fr -p 37774 --ifhn=192.168.2.57 --ncpus=1 --myhost=node57.beicip.co.fr --myip=192.168.2.57 -e -d -s 32
LAUNCHED mpd on node56.beicip.co.fr via node66.beicip.co.fr
debug: launch cmd= /usr/bin/ssh -n node56.beicip.co.fr env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME MPD_CON_EXT=6733.hpc1.beicip.co.fr_12462 I_MPI_JOB_CONTEXT= TMPDIR=/data/users/sja6401/UZ1A/Puma_UZ/PumaModel/pmn32390.dfirst12412 I_MPI_MPD_TMPDIR=/tmp /apps/puma/Frs400f/IntelMPI_3.1/intel64/bin/mpd.py -h node66.beicip.co.fr -p 37774 --ifhn=192.168.2.56 --ncpus=1 --myhost=node56.beicip.co.fr --myip=192.168.2.56 -e -d -s 32
LAUNCHED mpd on node51.beicip.co.fr via node66.beicip.co.fr
debug: launch cmd= /usr/bin/ssh -n node51.beicip.co.fr env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME MPD_CON_EXT=6733.hpc1.beicip.co.fr_12462 I_MPI_JOB_CONTEXT= TMPDIR=/data/users/sja6401/UZ1A/Puma_UZ/PumaModel/pmn32390.dfirst12412 I_MPI_MPD_TMPDIR=/tmp /apps/puma/Frs400f/IntelMPI_3.1/intel64/bin/mpd.py -h node66.beicip.co.fr -p 37774 --ifhn=192.168.2.51 --ncpus=1 --myhost=node51.beicip.co.fr --myip=192.168.2.51 -e -d -s 32
LAUNCHED mpd on node50.beicip.co.fr via node59.beicip.co.fr
debug: launch cmd= /usr/bin/ssh -n node50.beicip.co.fr env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME MPD_CON_EXT=6733.hpc1.beicip.co.fr_12462 I_MPI_JOB_CONTEXT= TMPDIR=/data/users/sja6401/UZ1A/Puma_UZ/PumaModel/pmn32390.dfirst12412 I_MPI_MPD_TMPDIR=/tmp /apps/puma/Frs400f/IntelMPI_3.1/intel64/bin/mpd.py -h node59.beicip.co.fr -p 48882 --ifhn=192.168.2.50 --ncpus=1 --myhost=node50.beicip.co.fr --myip=192.168.2.50 -e -d -s 32
LAUNCHED mpd on node49.beicip.co.fr via node59.beicip.co.fr
debug: launch cmd= /usr/bin/ssh -n node49.beicip.co.fr env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME MPD_CON_EXT=6733.hpc1.beicip.co.fr_12462 I_MPI_JOB_CONTEXT= TMPDIR=/data/users/sja6401/UZ1A/Puma_UZ/PumaModel/pmn32390.dfirst12412 I_MPI_MPD_TMPDIR=/tmp /apps/puma/Frs400f/IntelMPI_3.1/intel64/bin/mpd.py -h node59.beicip.co.fr -p 48882 --ifhn=192.168.2.49 --ncpus=1 --myhost=node49.beicip.co.fr --myip=192.168.2.49 -e -d -s 32
LAUNCHED mpd on node48.beicip.co.fr via node59.beicip.co.fr
debug: launch cmd= /usr/bin/ssh -n node48.beicip.co.fr env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME MPD_CON_EXT=6733.hpc1.beicip.co.fr_12462 I_MPI_JOB_CONTEXT= TMPDIR=/data/users/sja6401/UZ1A/Puma_UZ/PumaModel/pmn32390.dfirst12412 I_MPI_MPD_TMPDIR=/tmp /apps/puma/Frs400f/IntelMPI_3.1/intel64/bin/mpd.py -h node59.beicip.co.fr -p 48882 --ifhn=192.168.2.48 --ncpus=1 --myhost=node48.beicip.co.fr --myip=192.168.2.48 -e -d -s 32
LAUNCHED mpd on node47.beicip.co.fr via node59.beicip.co.fr
debug: launch cmd= /usr/bin/ssh -n node47.beicip.co.fr env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME MPD_CON_EXT=6733.hpc1.beicip.co.fr_12462 I_MPI_JOB_CONTEXT= TMPDIR=/data/users/sja6401/UZ1A/Puma_UZ/PumaModel/pmn32390.dfirst12412 I_MPI_MPD_TMPDIR=/tmp /apps/puma/Frs400f/IntelMPI_3.1/intel64/bin/mpd.py -h node59.beicip.co.fr -p 48882 --ifhn=192.168.2.47 --ncpus=1 --myhost=node47.beicip.co.fr --myip=192.168.2.47 -e -d -s 32
debug: mpd on node58.beicip.co.fr on port 34860
RUNNING: mpd on node58.beicip.co.fr
debug: info for running mpd: {'ip': '192.168.2.58', 'ncpus': 1, 'list_port': 34860, 'entry_port': 37774, 'host': 'node58.beicip.co.fr', 'entry_host': 'node66.beicip.co.fr', 'ifhn': '', 'pid': 12499}
debug: mpd on node57.beicip.co.fr on port 60335
RUNNING: mpd on node57.beicip.co.fr
debug: info for running mpd: {'ip': '192.168.2.57', 'ncpus': 1, 'list_port': 60335, 'entry_port': 37774, 'host': 'node57.beicip.co.fr', 'entry_host': 'node66.beicip.co.fr', 'ifhn': '', 'pid': 12500}
debug: mpd on node56.beicip.co.fr on port 47492
RUNNING: mpd on node56.beicip.co.fr
debug: info for running mpd: {'ip': '192.168.2.56', 'ncpus': 1, 'list_port': 47492, 'entry_port': 37774, 'host': 'node56.beicip.co.fr', 'entry_host': 'node66.beicip.co.fr', 'ifhn': '', 'pid': 12501}
debug: mpd on node51.beicip.co.fr on port 34620
RUNNING: mpd on node51.beicip.co.fr
debug: info for running mpd: {'ip': '192.168.2.51', 'ncpus': 1, 'list_port': 34620, 'entry_port': 37774, 'host': 'node51.beicip.co.fr', 'entry_host': 'node66.beicip.co.fr', 'ifhn': '', 'pid': 12502}
debug: mpd on node50.beicip.co.fr on port 56740
RUNNING: mpd on node50.beicip.co.fr
debug: info for running mpd: {'ip': '192.168.2.50', 'ncpus': 1, 'list_port': 56740, 'entry_port': 48882, 'host': 'node50.beicip.co.fr', 'entry_host': 'node59.beicip.co.fr', 'ifhn': '', 'pid': 12503}
debug: mpd on node49.beicip.co.fr on port 53873
RUNNING: mpd on node49.beicip.co.fr
debug: info for running mpd: {'ip': '192.168.2.49', 'ncpus': 1, 'list_port': 53873, 'entry_port': 48882, 'host': 'node49.beicip.co.fr', 'entry_host': 'node59.beicip.co.fr', 'ifhn': '', 'pid': 12504}
debug: mpd on node48.beicip.co.fr on port 42140
RUNNING: mpd on node48.beicip.co.fr
debug: info for running mpd: {'ip': '192.168.2.48', 'ncpus': 1, 'list_port': 42140, 'entry_port': 48882, 'host': 'node48.beicip.co.fr', 'entry_host': 'node59.beicip.co.fr', 'ifhn': '', 'pid': 12505}
LAUNCHED mpd on node46.beicip.co.fr via node58.beicip.co.fr
debug: launch cmd= /usr/bin/ssh -n node46.beicip.co.fr env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME MPD_CON_EXT=6733.hpc1.beicip.co.fr_12462 I_MPI_JOB_CONTEXT= TMPDIR=/data/users/sja6401/UZ1A/Puma_UZ/PumaModel/pmn32390.dfirst12412 I_MPI_MPD_TMPDIR=/tmp /apps/puma/Frs400f/IntelMPI_3.1/intel64/bin/mpd.py -h node58.beicip.co.fr -p 34860 --ifhn=192.168.2.46 --ncpus=1 --myhost=node46.beicip.co.fr --myip=192.168.2.46 -e -d -s 32
debug: mpd on node47.beicip.co.fr on port 47804
RUNNING: mpd on node47.beicip.co.fr
debug: info for running mpd: {'ip': '192.168.2.47', 'ncpus': 1, 'list_port': 47804, 'entry_port': 48882, 'host': 'node47.beicip.co.fr', 'entry_host': 'node59.beicip.co.fr', 'ifhn': '', 'pid': 12506}
LAUNCHED mpd on node45.beicip.co.fr via node58.beicip.co.fr
debug: launch cmd= /usr/bin/ssh -n node45.beicip.co.fr env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME MPD_CON_EXT=6733.hpc1.beicip.co.fr_12462 I_MPI_JOB_CONTEXT= TMPDIR=/data/users/sja6401/UZ1A/Puma_UZ/PumaModel/pmn32390.dfirst12412 I_MPI_MPD_TMPDIR=/tmp /apps/puma/Frs400f/IntelMPI_3.1/intel64/bin/mpd.py -h node58.beicip.co.fr -p 34860 --ifhn=192.168.2.45 --ncpus=1 --myhost=node45.beicip.co.fr --myip=192.168.2.45 -e -d -s 32
LAUNCHED mpd on node44.beicip.co.fr via node58.beicip.co.fr
debug: launch cmd= /usr/bin/ssh -n node44.beicip.co.fr env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME MPD_CON_EXT=6733.hpc1.beicip.co.fr_12462 I_MPI_JOB_CONTEXT= TMPDIR=/data/users/sja6401/UZ1A/Puma_UZ/PumaModel/pmn32390.dfirst12412 I_MPI_MPD_TMPDIR=/tmp /apps/puma/Frs400f/IntelMPI_3.1/intel64/bin/mpd.py -h node58.beicip.co.fr -p 34860 --ifhn=192.168.2.44 --ncpus=1 --myhost=node44.beicip.co.fr --myip=192.168.2.44 -e -d -s 32
LAUNCHED mpd on node43.beicip.co.fr via node58.beicip.co.fr
debug: launch cmd= /usr/bin/ssh -n node43.beicip.co.fr env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME MPD_CON_EXT=6733.hpc1.beicip.co.fr_12462 I_MPI_JOB_CONTEXT= TMPDIR=/data/users/sja6401/UZ1A/Puma_UZ/PumaModel/pmn32390.dfirst12412 I_MPI_MPD_TMPDIR=/tmp /apps/puma/Frs400f/IntelMPI_3.1/intel64/bin/mpd.py -h node58.beicip.co.fr -p 34860 --ifhn=192.168.2.43 --ncpus=1 --myhost=node43.beicip.co.fr --myip=192.168.2.43 -e -d -s 32
LAUNCHED mpd on node42.beicip.co.fr via node47.beicip.co.fr
debug: launch cmd= /usr/bin/ssh -n node42.beicip.co.fr env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME MPD_CON_EXT=6733.hpc1.beicip.co.fr_12462 I_MPI_JOB_CONTEXT= TMPDIR=/data/users/sja6401/UZ1A/Puma_UZ/PumaModel/pmn32390.dfirst12412 I_MPI_MPD_TMPDIR=/tmp /apps/puma/Frs400f/IntelMPI_3.1/intel64/bin/mpd.py -h node47.beicip.co.fr -p 47804 --ifhn=192.168.2.42 --ncpus=1 --myhost=node42.beicip.co.fr --myip=192.168.2.42 -e -d -s 32
LAUNCHED mpd on node41.beicip.co.fr via node47.beicip.co.fr
debug: launch cmd= /usr/bin/ssh -n node41.beicip.co.fr env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME MPD_CON_EXT=6733.hpc1.beicip.co.fr_12462 I_MPI_JOB_CONTEXT= TMPDIR=/data/users/sja6401/UZ1A/Puma_UZ/PumaModel/pmn32390.dfirst12412 I_MPI_MPD_TMPDIR=/tmp /apps/puma/Frs400f/IntelMPI_3.1/intel64/bin/mpd.py -h node47.beicip.co.fr -p 47804 --ifhn=192.168.2.41 --ncpus=1 --myhost=node41.beicip.co.fr --myip=192.168.2.41 -e -d -s 32
LAUNCHED mpd on node40.beicip.co.fr via node47.beicip.co.fr
debug: launch cmd= /usr/bin/ssh -n node40.beicip.co.fr env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME MPD_CON_EXT=6733.hpc1.beicip.co.fr_12462 I_MPI_JOB_CONTEXT= TMPDIR=/data/users/sja6401/UZ1A/Puma_UZ/PumaModel/pmn32390.dfirst12412 I_MPI_MPD_TMPDIR=/tmp /apps/puma/Frs400f/IntelMPI_3.1/intel64/bin/mpd.py -h node47.beicip.co.fr -p 47804 --ifhn=192.168.2.40 --ncpus=1 --myhost=node40.beicip.co.fr --myip=192.168.2.40 -e -d -s 32
LAUNCHED mpd on node35.beicip.co.fr via node47.beicip.co.fr
debug: launch cmd= /usr/bin/ssh -n node35.beicip.co.fr env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME MPD_CON_EXT=6733.hpc1.beicip.co.fr_12462 I_MPI_JOB_CONTEXT= TMPDIR=/data/users/sja6401/UZ1A/Puma_UZ/PumaModel/pmn32390.dfirst12412 I_MPI_MPD_TMPDIR=/tmp /apps/puma/Frs400f/IntelMPI_3.1/intel64/bin/mpd.py -h node47.beicip.co.fr -p 47804 --ifhn=192.168.2.35 --ncpus=1 --myhost=node35.beicip.co.fr --myip=192.168.2.35 -e -d -s 32
debug: mpd on node46.beicip.co.fr on port 44691
RUNNING: mpd on node46.beicip.co.fr
debug: info for running mpd: {'ip': '192.168.2.46', 'ncpus': 1, 'list_port': 44691, 'entry_port': 34860, 'host': 'node46.beicip.co.fr', 'entry_host': 'node58.beicip.co.fr', 'ifhn': '', 'pid': 12507}
LAUNCHED mpd on node34.beicip.co.fr via node48.beicip.co.fr
debug: launch cmd= /usr/bin/ssh -n node34.beicip.co.fr env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME MPD_CON_EXT=6733.hpc1.beicip.co.fr_12462 I_MPI_JOB_CONTEXT= TMPDIR=/data/users/sja6401/UZ1A/Puma_UZ/PumaModel/pmn32390.dfirst12412 I_MPI_MPD_TMPDIR=/tmp /apps/puma/Frs400f/IntelMPI_3.1/intel64/bin/mpd.py -h node48.beicip.co.fr -p 42140 --ifhn=192.168.2.34 --ncpus=1 --myhost=node34.beicip.co.fr --myip=192.168.2.34 -e -d -s 32
debug: mpd on node45.beicip.co.fr on port 33161
RUNNING: mpd on node45.beicip.co.fr
debug: info for running mpd: {'ip': '192.168.2.45', 'ncpus': 1, 'list_port': 33161, 'entry_port': 34860, 'host': 'node45.beicip.co.fr', 'entry_host': 'node58.beicip.co.fr', 'ifhn': '', 'pid': 12508}
debug: mpd on node44.beicip.co.fr on port 59527
RUNNING: mpd on node44.beicip.co.fr
debug: info for running mpd: {'ip': '192.168.2.44', 'ncpus': 1, 'list_port': 59527, 'entry_port': 34860, 'host': 'node44.beicip.co.fr', 'entry_host': 'node58.beicip.co.fr', 'ifhn': '', 'pid': 12509}
debug: mpd on node43.beicip.co.fr on port 34862
RUNNING: mpd on node43.beicip.co.fr
debug: info for running mpd: {'ip': '192.168.2.43', 'ncpus': 1, 'list_port': 34862, 'entry_port': 34860, 'host': 'node43.beicip.co.fr', 'entry_host': 'node58.beicip.co.fr', 'ifhn': '', 'pid': 12510}
debug: mpd on node42.beicip.co.fr on port 34885
RUNNING: mpd on node42.beicip.co.fr
debug: info for running mpd: {'ip': '192.168.2.42', 'ncpus': 1, 'list_port': 34885, 'entry_port': 47804, 'host': 'node42.beicip.co.fr', 'entry_host': 'node47.beicip.co.fr', 'ifhn': '', 'pid': 12511}
debug: mpd on node41.beicip.co.fr on port 35605
RUNNING: mpd on node41.beicip.co.fr
debug: info for running mpd: {'ip': '192.168.2.41', 'ncpus': 1, 'list_port': 35605, 'entry_port': 47804, 'host': 'node41.beicip.co.fr', 'entry_host': 'node47.beicip.co.fr', 'ifhn': '', 'pid': 12512}
debug: mpd on node40.beicip.co.fr on port 37542
RUNNING: mpd on node40.beicip.co.fr
debug: info for running mpd: {'ip': '192.168.2.40', 'ncpus': 1, 'list_port': 37542, 'entry_port': 47804, 'host': 'node40.beicip.co.fr', 'entry_host': 'node47.beicip.co.fr', 'ifhn': '', 'pid': 12513}
LAUNCHED mpd on node33.beicip.co.fr via node48.beicip.co.fr
debug: launch cmd= /usr/bin/ssh -n node33.beicip.co.fr env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME MPD_CON_EXT=6733.hpc1.beicip.co.fr_12462 I_MPI_JOB_CONTEXT= TMPDIR=/data/users/sja6401/UZ1A/Puma_UZ/PumaModel/pmn32390.dfirst12412 I_MPI_MPD_TMPDIR=/tmp /apps/puma/Frs400f/IntelMPI_3.1/intel64/bin/mpd.py -h node48.beicip.co.fr -p 42140 --ifhn=192.168.2.33 --ncpus=1 --myhost=node33.beicip.co.fr --myip=192.168.2.33 -e -d -s 32
LAUNCHED mpd on node32.beicip.co.fr via node48.beicip.co.fr
debug: launch cmd= /usr/bin/ssh -n node32.beicip.co.fr env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME MPD_CON_EXT=6733.hpc1.beicip.co.fr_12462 I_MPI_JOB_CONTEXT= TMPDIR=/data/users/sja6401/UZ1A/Puma_UZ/PumaModel/pmn32390.dfirst12412 I_MPI_MPD_TMPDIR=/tmp /apps/puma/Frs400f/IntelMPI_3.1/intel64/bin/mpd.py -h node48.beicip.co.fr -p 42140 --ifhn=192.168.2.32 --ncpus=1 --myhost=node32.beicip.co.fr --myip=192.168.2.32 -e -d -s 32
debug: mpd on node34.beicip.co.fr on port 41901
RUNNING: mpd on node34.beicip.co.fr
debug: info for running mpd: {'ip': '192.168.2.34', 'ncpus': 1, 'list_port': 41901, 'entry_port': 42140, 'host': 'node34.beicip.co.fr', 'entry_host': 'node48.beicip.co.fr', 'ifhn': '', 'pid': 12515}
LAUNCHED mpd on node31.beicip.co.fr via node48.beicip.co.fr
debug: launch cmd= /usr/bin/ssh -n node31.beicip.co.fr env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME MPD_CON_EXT=6733.hpc1.beicip.co.fr_12462 I_MPI_JOB_CONTEXT= TMPDIR=/data/users/sja6401/UZ1A/Puma_UZ/PumaModel/pmn32390.dfirst12412 I_MPI_MPD_TMPDIR=/tmp /apps/puma/Frs400f/IntelMPI_3.1/intel64/bin/mpd.py -h node48.beicip.co.fr -p 42140 --ifhn=192.168.2.31 --ncpus=1 --myhost=node31.beicip.co.fr --myip=192.168.2.31 -e -d -s 32
debug: mpd on node33.beicip.co.fr on port 51580
RUNNING: mpd on node33.beicip.co.fr
debug: info for running mpd: {'ip': '192.168.2.33', 'ncpus': 1, 'list_port': 51580, 'entry_port': 42140, 'host': 'node33.beicip.co.fr', 'entry_host': 'node48.beicip.co.fr', 'ifhn': '', 'pid': 12516}
debug: mpd on node32.beicip.co.fr on port 50123
RUNNING: mpd on node32.beicip.co.fr
debug: info for running mpd: {'ip': '192.168.2.32', 'ncpus': 1, 'list_port': 50123, 'entry_port': 42140, 'host': 'node32.beicip.co.fr', 'entry_host': 'node48.beicip.co.fr', 'ifhn': '', 'pid': 12517}
LAUNCHED mpd on node30.beicip.co.fr via node33.beicip.co.fr
debug: launch cmd= /usr/bin/ssh -n node30.beicip.co.fr env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME MPD_CON_EXT=6733.hpc1.beicip.co.fr_12462 I_MPI_JOB_CONTEXT= TMPDIR=/data/users/sja6401/UZ1A/Puma_UZ/PumaModel/pmn32390.dfirst12412 I_MPI_MPD_TMPDIR=/tmp /apps/puma/Frs400f/IntelMPI_3.1/intel64/bin/mpd.py -h node33.beicip.co.fr -p 51580 --ifhn=192.168.2.30 --ncpus=1 --myhost=node30.beicip.co.fr --myip=192.168.2.30 -e -d -s 32
debug: mpd on node31.beicip.co.fr on port 39621
RUNNING: mpd on node31.beicip.co.fr
debug: info for running mpd: {'ip': '192.168.2.31', 'ncpus': 1, 'list_port': 39621, 'entry_port': 42140, 'host': 'node31.beicip.co.fr', 'entry_host': 'node48.beicip.co.fr', 'ifhn': '', 'pid': 12518}
LAUNCHED mpd on node29.beicip.co.fr via node33.beicip.co.fr
debug: launch cmd= /usr/bin/ssh -n node29.beicip.co.fr env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME MPD_CON_EXT=6733.hpc1.beicip.co.fr_12462 I_MPI_JOB_CONTEXT= TMPDIR=/data/users/sja6401/UZ1A/Puma_UZ/PumaModel/pmn32390.dfirst12412 I_MPI_MPD_TMPDIR=/tmp /apps/puma/Frs400f/IntelMPI_3.1/intel64/bin/mpd.py -h node33.beicip.co.fr -p 51580 --ifhn=192.168.2.29 --ncpus=1 --myhost=node29.beicip.co.fr --myip=192.168.2.29 -e -d -s 32
LAUNCHED mpd on node28.beicip.co.fr via node33.beicip.co.fr
debug: launch cmd= /usr/bin/ssh -n node28.beicip.co.fr env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME MPD_CON_EXT=6733.hpc1.beicip.co.fr_12462 I_MPI_JOB_CONTEXT= TMPDIR=/data/users/sja6401/UZ1A/Puma_UZ/PumaModel/pmn32390.dfirst12412 I_MPI_MPD_TMPDIR=/tmp /apps/puma/Frs400f/IntelMPI_3.1/intel64/bin/mpd.py -h node33.beicip.co.fr -p 51580 --ifhn=192.168.2.28 --ncpus=1 --myhost=node28.beicip.co.fr --myip=192.168.2.28 -e -d -s 32
debug: mpd on node30.beicip.co.fr on port 58795
RUNNING: mpd on node30.beicip.co.fr
debug: info for running mpd: {'ip': '192.168.2.30', 'ncpus': 1, 'list_port': 58795, 'entry_port': 51580, 'host': 'node30.beicip.co.fr', 'entry_host': 'node33.beicip.co.fr', 'ifhn': '', 'pid': 12519}
LAUNCHED mpd on node27.beicip.co.fr via node33.beicip.co.fr
debug: launch cmd= /usr/bin/ssh -n node27.beicip.co.fr env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME MPD_CON_EXT=6733.hpc1.beicip.co.fr_12462 I_MPI_JOB_CONTEXT= TMPDIR=/data/users/sja6401/UZ1A/Puma_UZ/PumaModel/pmn32390.dfirst12412 I_MPI_MPD_TMPDIR=/tmp /apps/puma/Frs400f/IntelMPI_3.1/intel64/bin/mpd.py -h node33.beicip.co.fr -p 51580 --ifhn=192.168.2.27 --ncpus=1 --myhost=node27.beicip.co.fr --myip=192.168.2.27 -e -d -s 32
debug: mpd on node29.beicip.co.fr on port 58977
RUNNING: mpd on node29.beicip.co.fr
debug: info for running mpd: {'ip': '192.168.2.29', 'ncpus': 1, 'list_port': 58977, 'entry_port': 51580, 'host': 'node29.beicip.co.fr', 'entry_host': 'node33.beicip.co.fr', 'ifhn': '', 'pid': 12520}
debug: mpd on node28.beicip.co.fr on port 52638
RUNNING: mpd on node28.beicip.co.fr
debug: info for running mpd: {'ip': '192.168.2.28', 'ncpus': 1, 'list_port': 52638, 'entry_port': 51580, 'host': 'node28.beicip.co.fr', 'entry_host': 'node33.beicip.co.fr', 'ifhn': '', 'pid': 12521}
LAUNCHED mpd on node26.beicip.co.fr via node29.beicip.co.fr
debug: launch cmd= /usr/bin/ssh -n node26.beicip.co.fr env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME MPD_CON_EXT=6733.hpc1.beicip.co.fr_12462 I_MPI_JOB_CONTEXT= TMPDIR=/data/users/sja6401/UZ1A/Puma_UZ/PumaModel/pmn32390.dfirst12412 I_MPI_MPD_TMPDIR=/tmp /apps/puma/Frs400f/IntelMPI_3.1/intel64/bin/mpd.py -h node29.beicip.co.fr -p 58977 --ifhn=192.168.2.26 --ncpus=1 --myhost=node26.beicip.co.fr --myip=192.168.2.26 -e -d -s 32
debug: mpd on node27.beicip.co.fr on port 39645
RUNNING: mpd on node27.beicip.co.fr
debug: info for running mpd: {'ip': '192.168.2.27', 'ncpus': 1, 'list_port': 39645, 'entry_port': 51580, 'host': 'node27.beicip.co.fr', 'entry_host': 'node33.beicip.co.fr', 'ifhn': '', 'pid': 12522}
LAUNCHED mpd on node25.beicip.co.fr via node29.beicip.co.fr
debug: launch cmd= /usr/bin/ssh -n node25.beicip.co.fr env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME MPD_CON_EXT=6733.hpc1.beicip.co.fr_12462 I_MPI_JOB_CONTEXT= TMPDIR=/data/users/sja6401/UZ1A/Puma_UZ/PumaModel/pmn32390.dfirst12412 I_MPI_MPD_TMPDIR=/tmp /apps/puma/Frs400f/IntelMPI_3.1/intel64/bin/mpd.py -h node29.beicip.co.fr -p 58977 --ifhn=192.168.2.25 --ncpus=1 --myhost=node25.beicip.co.fr --myip=192.168.2.25 -e -d -s 32
LAUNCHED mpd on node24.beicip.co.fr via node29.beicip.co.fr
debug: launch cmd= /usr/bin/ssh -n node24.beicip.co.fr env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME MPD_CON_EXT=6733.hpc1.beicip.co.fr_12462 I_MPI_JOB_CONTEXT= TMPDIR=/data/users/sja6401/UZ1A/Puma_UZ/PumaModel/pmn32390.dfirst12412 I_MPI_MPD_TMPDIR=/tmp /apps/puma/Frs400f/IntelMPI_3.1/intel64/bin/mpd.py -h node29.beicip.co.fr -p 58977 --ifhn=192.168.2.24 --ncpus=1 --myhost=node24.beicip.co.fr --myip=192.168.2.24 -e -d -s 32
debug: mpd on node26.beicip.co.fr on port 55563
RUNNING: mpd on node26.beicip.co.fr
debug: info for running mpd: {'ip': '192.168.2.26', 'ncpus': 1, 'list_port': 55563, 'entry_port': 58977, 'host': 'node26.beicip.co.fr', 'entry_host': 'node29.beicip.co.fr', 'ifhn': '', 'pid': 12523}
debug: mpd on node25.beicip.co.fr on port 45295
RUNNING: mpd on node25.beicip.co.fr
debug: info for running mpd: {'ip': '192.168.2.25', 'ncpus': 1, 'list_port': 45295, 'entry_port': 58977, 'host': 'node25.beicip.co.fr', 'entry_host': 'node29.beicip.co.fr', 'ifhn': '', 'pid': 12524}
debug: mpd on node24.beicip.co.fr on port 39300
RUNNING: mpd on node24.beicip.co.fr
debug: info for running mpd: {'ip': '192.168.2.24', 'ncpus': 1, 'list_port': 39300, 'entry_port': 58977, 'host': 'node24.beicip.co.fr', 'entry_host': 'node29.beicip.co.fr', 'ifhn': '', 'pid': 12525}
debug: mpd on node35.beicip.co.fr on port 36465
RUNNING: mpd on node35.beicip.co.fr
debug: info for running mpd: {'ip': '192.168.2.35', 'ncpus': 1, 'list_port': 36465, 'entry_port': 47804, 'host': 'node35.beicip.co.fr', 'entry_host': 'node47.beicip.co.fr', 'ifhn': '', 'pid': 12514}
mpiexec_node67.beicip.co.fr (mpiexec 1034): no msg recvd from mpd when expecting ack of request. Please examine the /tmp/mpd2.logfile_sja6401 log file on each node of the ring.
sja6401
Traceback (most recent call last):
  File "/apps/puma/Frs400f/IntelMPI_3.1/intel64/bin/mpdcleanup", line 239, in ?
    mpdcleanup()
  File "/apps/puma/Frs400f/IntelMPI_3.1/intel64/bin/mpdcleanup", line 215, in mpdcleanup
    pid = re.split(r'\s+', first_string)[5]
IndexError: list index out of range
Dmitry_K_Intel2
Employee
Hi Laurent,

It looks like something is happening at the system level.

>Tests have been done with versions 3.1, 3.2, and 4.0.
Could you please explain how you ran and checked this?
Running 'mpirun --rsh=/usr/bin/ssh -np 1' starts just one MPI process.

Can I take a look at '/apps/puma/Frs400f/../Frs400f/bin/Linux/RHEL4/x86_64/ref-mpi-IntelMPI3.1-intel9.1/pre.ref'? Is it an executable or a script? Could you also provide the PBS script you run?

Have you tried to run a HelloWorld example on the same nodes?
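For example, something along these lines, assuming the test program shipped under the install directory (paths are illustrative):

. /apps/puma/Frs400f/IntelMPI_3.1/intel64/bin/mpivars.sh     # set up the Intel MPI environment
mpicc -o hello /apps/puma/Frs400f/IntelMPI_3.1/test/test.c   # compile the bundled MPI test
mpirun --rsh=/usr/bin/ssh -np 32 ./hello                     # run it on the nodes PBS allocated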

Regards!
Dmitry

lpa
Beginner
Hi Dmitry

Thanks a lot for your reply.

The simulation script we submit has 3 steps: pre, pro, and pos.
The pre and pos steps run on only 1 CPU, while the pro step runs on 256 CPUs in the above example. But the hostfile is generated for the 256 CPUs.

The Intel MPI libraries are provided with the simulator, so I was able to replace the 3.1 version with the 4.0 one (without recompiling). On other clusters this works fine, even if it's not ideal.

pre.ref, pro.ref and pos.ref are executables.

Yes, I ran a test with the `hostname` command on the nodes (not all of them, only 150 CPUs). It was fast and correct.

I don't yet have an OpenPBS script to show you; I'm asking for one and will provide it tomorrow.
We use all 8 CPUs on each node for the runs.

The PBS queue is used by 3 simulators. Do you think there could be a conflict?

Regards
Laurent
Dmitry_K_Intel2
Employee
Hi Laurent,

Do you see this problem with node66 only, or is it a different node each time?

From the log you can see that I_MPI_JOB_CONTEXT is not set. Probably this is because of your scheme of running: 'mpirun' should take the job ID from PBS, and in that case different mpds will not disturb each other.
If you run 3 different simulators and each starts mpd without I_MPI_JOB_CONTEXT, that may cause a conflict.
Could you try running only 1 simulator at a time? Do you see the same issue?

Regards!
Dmitry

lpa
Beginner
Hi Dmitry,

thanks again for your kind reply.

Perhaps you've spotted the problem with I_MPI_JOB_CONTEXT!
The script is automatically generated, but it was originally written for SGE.
So I have in the script: export I_MPI_JOB_CONTEXT=${JOB_ID}

This works fine with SGE, but judging from the log above it doesn't seem to work with OpenPBS.
I will look for the correct value then. Please let me know if you already have that information.

Concerning the first question: no, we see the problem on many nodes. It now appears that stale processes had been left on the nodes, and killing them lets the job progress. Sometimes it works now, which wasn't the case before. These nodes were last rebooted a very long time ago, and it's not easy to check all 80 nodes by hand... the CONTEXT variable could really help.
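
A loop of this kind helps hunt for leftovers (the node-name range is hypothetical, adapted from the log):

for n in $(seq 24 67); do
    host=node$n.beicip.co.fr                           # nodes named as in the debug output
    ssh $host 'pgrep -fl mpd.py && pkill -f mpd.py'    # list, then kill, any stale mpd daemons
done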

Regards
Laurent
Dmitry_K_Intel2
Employee
Laurent,

Could you try to use:
export I_MPI_JOB_CONTEXT=${PBS_JOBID}
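
PBS exports PBS_JOBID into the job's environment, so the line goes in the job script before mpirun. A sketch (resource request and binary name are illustrative):

#!/bin/sh
#PBS -l nodes=4:ppn=8                       # illustrative resource request
export I_MPI_JOB_CONTEXT=${PBS_JOBID}       # give this job's mpd ring a unique context
mpirun --rsh=/usr/bin/ssh -np 32 ./pro.ref  # mpds from other jobs will no longer collide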


Regards!
Dmitry
lpa
Beginner
Dmitry

Thanks a lot. I will let you know the results.

Regards
Laurent
lpa
Beginner
Hi Dmitry

After setting the context variable, and also cleaning up old processes on some nodes, the message no longer appears.

We have 3 runs of 128 CPUs each going at the same time.
On one of them we still get an error, but it is now very different (in MPI_Allreduce...).
That one must have another cause.

Thanks a lot for your help.
Laurent
Dmitry_K_Intel2
Employee
Hi Laurent,

Very good news. It means that different tasks don't affect each other.

Running different tasks on the same nodes may lead to performance degradation. In the Intel MPI Library, pinning is ON by default, which means MPI processes from different tasks may be placed on the same cores. You can switch pinning OFF with:
export I_MPI_PIN=0

To get more information from the Intel MPI Library (not only about pinning) you can use the I_MPI_DEBUG environment variable. Its value is a number in the range 2-1000; higher values produce more output.
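
For example (the binary name is illustrative):

export I_MPI_DEBUG=5                        # verbose library diagnostics, including pinning
mpirun --rsh=/usr/bin/ssh -np 32 ./pro.ref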

Feel free to ask about your problem with MPI_Allreduce.

Regards!
Dmitry