Re: I can't run program on slave node

camiyu917gmail_com · 05-20-2009

I can't run program on slave node.

execute "mpirun -f ./mpd.hosts -np 2 ./testcpp"
=====================================================
Hello world: rank 0 of 2 running on cluster-master
Hello world: rank 1 of 2 running on cluster-master
=====================================================
It runs only on the master.

execute "sshconnectivity.exp machines.LINUX"
=====================================================
Node count = 2
Secure shell connectivity was established on all nodes.
See the log output listing "/tmp/sshconnectivity.user.log" for details.
Version number: $Revision: 1.18 $
Version date: $Date: 2008/10/19 04:06:21 $
=====================================================

The contents of mpd.hosts and machines.LINUX are:
=====================================================
cluster-master
cluster-slave1
=====================================================
Both files are saved in /home/user on the master and on the slave.

But I get an error when I execute "mpdboot -f ./mpd.hosts -n 2":
=====================================================
mpdboot_cluster-master (handle_mpd_output 739): failed to ping mpd on cluster-slave1; received output={}
=====================================================

I can log in to the master from the slave over ssh without a password, and to the slave from the master without a password,
and I have already turned off the firewall.

Please help me. Thanks.

Dmitry_K_Intel2 · 05-21-2009

Quoting - camiyu917gmail.com

I can't run program on slave node.

execute "mpirun -f ./mpd.hosts -np 2 ./testcpp"
but have a problem happen when I execute "mpdboot -f ./mpd.hosts -n 2"

Hello Camiyu917,

It seems you have configured your cluster to use ssh, so you could try adding "-r ssh" to both commands.
If you need to start 1 process per host, you can set I_MPI_PERHOST to 1 and check it by launching "mpirun -n 2 ..."; it should start 2 processes on 2 different nodes.
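
To illustrate the placement that I_MPI_PERHOST is expected to produce, here is a small Python sketch. It only models the documented round-robin behaviour (consecutive blocks of `perhost` ranks go to each host in turn) and is not Intel MPI's actual code; the function name `place_ranks` is made up for this illustration.

```python
def place_ranks(hosts, nprocs, perhost):
    """Map each rank to a host: blocks of `perhost` consecutive ranks, round robin."""
    if perhost < 1:
        raise ValueError("perhost must be at least 1")
    if not hosts:
        raise ValueError("hosts must not be empty")
    placement = {}
    for rank in range(nprocs):
        block = rank // perhost                  # which block this rank belongs to
        placement[rank] = hosts[block % len(hosts)]  # blocks cycle over the hosts
    return placement


if __name__ == "__main__":
    hosts = ["cluster-master", "cluster-slave1"]  # host names taken from mpd.hosts above
    for rank, host in place_ranks(hosts, 2, perhost=1).items():
        print(f"rank {rank} -> {host}")
```

With `perhost=1` the two ranks land on two different hosts; with `perhost=2` both would stay on the first host, which matches the "rank 0 ... rank 1 ... running on cluster-master" symptom.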

Regards!
Dmitry

camiyu917gmail_com · 05-21-2009

Dmitry, thanks for your reply.

I ran into another problem.

I tried to execute "mpdboot -n 2 -f ./mpd.hosts -r ssh", but got this error:
=============================================================
mpdboot_cluster-master (handle_mpd_output 730): Failed to establish a socket connection with cluster-slave1:33736 : (111, 'Connection refused')
mpdboot_cluster-master (handle_mpd_output 747): failed to connect to mpd on cluster-slave1
=============================================================

After "export I_MPI_PERHOST=1", I executed "mpirun -n 2 -f ./mpd.hosts -r ssh ./testcpp"
and got this error:
=============================================================
mpiexec_cluster-master (mpiexec 841): no msg recvd from mpd when expecting ack of request. Please examine the /tmp/mpd2.logfile_user log file on each node of the ring.
=============================================================

Thanks for your help ~ ^__^

Dmitry_K_Intel2 · 05-21-2009

Camiyu917, could you send the /tmp/mpd2.logfile_user files? This is a very strange error.

Best wishes!
Dmitry

camiyu917gmail_com · 05-21-2009

I executed "export I_MPI_PERHOST=1" and "mpirun -n 2 -f ./mpd.hosts -r ssh ./testcpp".

============== mpd2.logfile_user_090522.055102_15438 ================
logfile for mpd with pid 15491
cluster-master_37718 (handle_rhs_input 2145): connection with the right neighboring mpd daemon was lost; attempting to re-enter the mpd ring
cluster-master_37718 (reenter_ring 691): reenter_ring returned 0 after 1 tries
cluster-master_37718 (handle_rhs_input 2152): the daemon successfully reentered the mpd ring
=========================================================

Thank you.

Dmitry_K_Intel2 · 05-25-2009

Hi Camiyu917,

Could you try to run `mpdboot -r ssh -f mpd.hosts -n 2 --chkuponly`?
If it doesn't work, it means that ssh isn't working properly.

Please also try the following commands:
`mpdboot -r ssh -f mpd.hosts -n 2`

`mpdtrace`

`mpiexec -genv I_MPI_PERHOST 1 -n 2 hostname`

`mpiexec -genv I_MPI_PERHOST 1 -n 2 ./testcpp`

`mpdallexit`

`mpirun -r ssh -f ./mpd.hosts -genv I_MPI_PERHOST 1 -n 2 ./testcpp`

Let me know the output you get.

And please send me /etc/hosts as well.

Best wishes.
Dmitry

camiyu917gmail_com · 05-25-2009

Hello Dmitry:

mpdboot -r ssh -f mpd.hosts -n 2 --chkuponly
====================================================
checking cluster-slave1
there are 2 hosts up (counting local)
====================================================

mpdboot -r ssh -f mpd.hosts -n 2
====================================================
mpdboot_cluster-master (handle_mpd_output 730): Failed to establish a socket connection with cluster-slave1:33674 : (111, 'Connection refused')
mpdboot_cluster-master (handle_mpd_output 747): failed to connect to mpd on cluster-slave1
====================================================

/etc/hosts
====================================================
127.0.0.1 localhost
192.168.2.150 cluster-master cluster-master
192.168.2.151 cluster-slave1 cluster-slave1
====================================================

thanks for your help.

Dmitry_K_Intel2 · 05-25-2009

Hi Camiyu917,

Please log in to both servers and check that no mpd.py is running. Execute 'killall -9 mpd.py' to be sure (existing mpd.py processes can prevent the ring from being created).

Then run 'mpdboot -r ssh -f mpd.hosts -n 2 --debug' and let me know the output.

The port on your machine (33674) was closed somehow. This might be the firewall or some other setting. Could you switch off your firewall for a short time, just to check the MPI commands?
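
A quick way to check whether a given TCP port on the other node is reachable, independently of MPI, is a plain socket connect. The sketch below is only an illustrative helper and not part of Intel MPI: `can_connect` is a hypothetical name, and the port has to be taken from the current `mpdboot --debug` output, because mpd listens on a different port on every start (33736 and 33674 in the errors above).

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Try a TCP connect; return (True, None) on success, (False, error) otherwise."""
    try:
        # create_connection resolves the host name and opens the TCP connection
        with socket.create_connection((host, port), timeout=timeout):
            return True, None
    except OSError as exc:
        # Includes "Connection refused" (errno 111 on Linux) and timeouts
        return False, exc


# Example call (host and port are placeholders; use the values from your own run):
#   ok, err = can_connect("cluster-slave1", 55976)
#   print("reachable" if ok else "not reachable: %s" % err)
```

As a rough guide, "Connection refused" usually means the host answered but nothing accepted the connection on that port (or a firewall actively rejected it), while a timeout more often points to packets being silently dropped.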

Best wishes!
Dmitry

camiyu917gmail_com · 05-26-2009

Hello Dmitry:

I have turned off the firewall and checked that no mpd.py process is running on master or slave1.

Then I executed "mpdboot -r ssh -f mpd.hosts -n 2 --debug" on master and on slave1.

=================== execute on cluster-master =========================
debug: starting
running mpdallexit on cluster-master
debug: launch cmd= env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 /opt/intel/impi/3.2.0.011/bin64/mpd.py --ncpus=1 --myhost=cluster-master -e -d -s 2
debug: mpd on cluster-master on port 45068
debug: info for running mpd: {'ip': '192.168.2.150', 'ncpus': 1, 'list_port': 45068, 'entry_port': '', 'host': 'cluster-master', 'entry_host': '', 'ifhn': ''}
debug: launch cmd= ssh -x -n cluster-slave1 'env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME HOSTTYPE=$HOSTTYPE MACHTYPE=$MACHTYPE OSTYPE=$OSTYPE /opt/intel/impi/3.2.0.011/bin64/mpd.py -h cluster-master -p 45068 --ifhn=192.168.2.151 --ncpus=1 --myhost=cluster-slave1 --myip=192.168.2.151 -e -d -s 2'
debug: mpd on cluster-slave1 on port 55976
debug: info for running mpd: {'ip': '192.168.2.151', 'ncpus': 1, 'list_port': 55976, 'entry_port': 45068, 'host': 'cluster-slave1', 'entry_host': 'cluster-master', 'ifhn': '', 'pid': 11783}
==============================================================

=================== execute on cluster-slave1 =========================
debug: starting
running mpdallexit on cluster-slave1
debug: launch cmd= env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 /opt/intel/impi/3.2.0.011/bin64/mpd.py --ncpus=1 --myhost=cluster-slave1 -e -d -s 2
debug: mpd on cluster-slave1 on port 60362
debug: info for running mpd: {'ip': '192.168.2.151', 'ncpus': 1, 'list_port': 60362, 'entry_port': '', 'host': 'cluster-slave1', 'entry_host': '', 'ifhn': ''}
debug: launch cmd= ssh -x -n cluster-master 'env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 HOSTNAME=$HOSTNAME HOSTTYPE=$HOSTTYPE MACHTYPE=$MACHTYPE OSTYPE=$OSTYPE /opt/intel/impi/3.2.0.011/bin64/mpd.py -h cluster-slave1 -p 60362 --ifhn=192.168.2.150 --ncpus=1 --myhost=cluster-master --myip=192.168.2.150 -e -d -s 2'
debug: mpd on cluster-master on port 42052
debug: info for running mpd: {'ip': '192.168.2.150', 'ncpus': 1, 'list_port': 42052, 'entry_port': 60362, 'host': 'cluster-master', 'entry_host': 'cluster-slave1', 'ifhn': '', 'pid': 30178}
==============================================================

Thank you.

Best regards,
John

Dmitry_K_Intel2 · 05-26-2009

Hi John,

It seems you were able to start the MPD ring with the firewall switched off. To be sure, you can run mpdtrace.

Could you try the same with the firewall switched on? We do NOT recommend using a firewall with MPI applications; if you do use one, configure it so that all ports are available for internal connections.

Best wishes,
Dmitry

smtp12357 · 05-28-2009

Hi

I have the same problem with the mpdboot command. After running it I got the following log:

[shell]Runing on the host n2114.nodes
This jobs runs on the following processors:
n2114.nodes n2114.nodes n2113.nodes n2113.nodes n2112.nodes n2112.nodes
running mpdallexit on n2114.nodes
LAUNCHED mpd on n2114.nodes via
RANNING: mpd on n2114.nodes
LAUNCHED mpd on n2113.nodes via n2114.nodes
LAUNCHED mpd on n2112.nodes via n2114.nodes
mpd_boot_n2114.nodes (handle_mpd_output 730): Failed to establish a socket connection with n2112.nodes:43606 : (111, 'Connection refused')
mpd_boot_n2114.nodes (handle_mpd_output 747): Failed to connect to mpd on n2112.nodes
[/shell]

Does anybody know how to fix this problem with only user-level access to the cluster?

Dmitry_K_Intel2 · 05-28-2009

Quoting - smtp12357

LAUNCHED mpd on n2114.nodes via
LAUNCHED mpd on n2112.nodes via n2114.nodes
mpd_boot_n2114.nodes (handle_mpd_output 730): Failed to establish a socket connection with n2112.nodes:43606 : (111, 'Connection refused')

Hi smtp12357,

Yes, it seems you have the same problem. You can start the mpds, but they cannot open a connection to each other; it looks like the ports are closed. This might be the firewall. Could you ask your sysadmin to open TCP ports for internal connections, or just to switch the firewall off?

Best wishes,
Dmitry

smtp12357 · 05-30-2009

Dmitry, thank you for the explanation.
I asked our sysadmin and he solved the problem. Now mpdboot works well.

camiyu917gmail_com · 05-31-2009

Hello Dmitry:

I switched the firewall on, and I made sure that master can log in to slave1 over ssh without a password and that slave1 can log in to master too.

I executed the following commands:

mpdboot -r ssh -f mpd.hosts -n 2 --chkuponly
====================================================
checking cluster-slave1
there are 2 hosts up (counting local)
====================================================

mpdboot -r ssh -f mpd.hosts -n 2
-- no message

mpdtrace
====================================================
cluster-master
cluster-slave1
====================================================

mpiexec -genv I_MPI_PERHOST 1 -n 2 hostname
====================================================
mpiexec_cluster-master (mpiexec 841): no msg recvd from mpd when expecting ack of request. Please examine the /tmp/mpd2.logfile_user log file on each node of the ring.
====================================================

mpiexec -genv I_MPI_PERHOST 1 -n 2 ./testcpp
====================================================
Hello world: rank 0 of 2 running on cluster-master
Hello world: rank 1 of 2 running on cluster-master
====================================================

mpdallexit
-- no message

mpirun -r ssh -f ./mpd.hosts -genv I_MPI_PERHOST 1 -n 2 ./testcpp
====================================================
(mpiexec 841): no msg recvd from mpd when expecting ack of request. Please examine the /tmp/mpd2.logfile_user log file on each node of the ring.
====================================================

=============== mpd2.logfile_user_090601.094125_3962 ======================
cluster-master_48124 (handle_rhs_input 2145): connection with the right neighboring mpd daemon was lost; attempting to re-enter the mpd ring
cluster-master_48124 (reenter_ring 691): reenter_ring returned 0 after 1 tries
cluster-master_48124 (handle_rhs_input 2152): the daemon successfully reentered the mpd ring
===========================================================================

I do not know how to solve this problem.
I then tried the following commands and got a very strange result.

mpdboot -r ssh -f mpd.hosts -n 2
-- no message

mpiexec -n 8 ./testcpp
====================================================
mpiexec_cluster-master (mpiexec 841): no msg recvd from mpd when expecting ack of request. Please examine the /tmp/mpd2.logfile_user log file on each node of the ring.
====================================================

mpiexec -n 2 ./testcpp
====================================================
Hello world: rank 0 of 2 running on cluster-master
Hello world: rank 1 of 2 running on cluster-master
====================================================

mpiexec -n 4 ./testcpp
====================================================
Hello world: rank 0 of 4 running on cluster-master
Hello world: rank 1 of 4 running on cluster-master
Hello world: rank 2 of 4 running on cluster-master
Hello world: rank 3 of 4 running on cluster-master
====================================================

mpiexec -n 8 ./testcpp
====================================================
Hello world: rank 0 of 8 running on cluster-master
Hello world: rank 1 of 8 running on cluster-master
Hello world: rank 2 of 8 running on cluster-master
Hello world: rank 3 of 8 running on cluster-master
Hello world: rank 4 of 8 running on cluster-master
Hello world: rank 5 of 8 running on cluster-master
Hello world: rank 6 of 8 running on cluster-master
Hello world: rank 7 of 8 running on cluster-master
====================================================

The cluster-master's CPU is an Intel Pentium D 925+; this CPU has just 1 core and 2 hyper-threads.
The first time I executed `mpiexec -n 8` I got an error message, but in the end the program ran on cluster-master using 8 cores. Is this a bug?

If you need, I can email our TeamViewer ID and password to you.
Thanks for your help.

Best regards,
John

Dmitry_K_Intel2 · 06-01-2009

Hi John,

From the first part of your question, it seems to me that the firewall doesn't allow a connection to be established between mpiexec and mpd. To get more info you can use the --verbose switch.
Could you send the log file from cluster-slave1? This is the most interesting file.

Second part: mpd itself is smart enough to rebuild an mpd ring. The message "connection with the right neighboring mpd daemon was lost; attempting to re-enter the mpd ring" says that a new ring will be formed. In your case only one node is left in the ring, so your task will be executed on that one node only.
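
One way to see that the ring has shrunk is to compare the hosts listed in mpd.hosts with what `mpdtrace` reports. The following Python sketch does that on plain text. It assumes `mpdtrace` prints one host name per line, as in the output posted earlier in this thread, and that a hosts entry looks like `host` or `host:ncpus`; `hosts_from_file` and `missing_from_ring` are hypothetical helper names, not Intel MPI tools.

```python
def hosts_from_file(text):
    """Return the host names from mpd.hosts-style text, one entry per line."""
    hosts = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if line:
            # Assumption: an entry is "host" or "host:ncpus", optionally followed by extras
            hosts.append(line.split()[0].split(":")[0])
    return hosts


def missing_from_ring(hosts_text, mpdtrace_text):
    """List hosts that are in the hosts file but absent from the mpdtrace output."""
    in_ring = {line.strip() for line in mpdtrace_text.splitlines() if line.strip()}
    return [host for host in hosts_from_file(hosts_text) if host not in in_ring]


if __name__ == "__main__":
    expected = "cluster-master\ncluster-slave1\n"  # contents of mpd.hosts from this thread
    traced = "cluster-master\n"                    # what mpdtrace might show after the ring shrinks
    print(missing_from_ring(expected, traced))
```

For the sample data this prints `['cluster-slave1']`. If `mpdtrace` reports short names while mpd.hosts uses fully qualified ones (or the reverse), the names would need to be normalized before comparing.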

You can start all processes on one node, no problem, but the performance will not be as good as when you run them in parallel across nodes.

You can write to me directly at dmitry.kuzmin (at) intel.com

Best wishes,
Dmitry