Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

Problem with mpiexec

rabbitsoft
Beginner
Hello,

I have 5 machines: `master` and `node1`-`node4`. I compiled Intel's test.cpp with mpicxx on `master`.
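For reference, the test.cpp shipped with the library is essentially an MPI "hello world"; the sketch below is only a rough equivalent that produces the same kind of output (the actual file may differ in details):

#include <mpi.h>
#include <cstdio>

int main(int argc, char *argv[])
{
    int rank = 0, size = 1, namelen = 0;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);                  /* start the MPI runtime */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* rank of this process */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total number of ranks */
    MPI_Get_processor_name(name, &namelen);  /* host this rank runs on */

    printf("Hello world: rank %d of %d running on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}

Compiling it with plain "mpicxx test.cpp" produces the ./a.out used below.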

MPD works fine, as shown below:

root@master:/opt/intel/impi/4.0.0.027/test# mpdboot -n 5 -v -d -f /etc/mpi/mpd.hosts
debug: starting
running mpdallexit on master
LAUNCHED mpd on master via
debug: launch cmd= env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 /opt/intel/impi/4.0.0.027/intel64/bin/mpd.py --ncpus=1 --myhost=master -e -d -s 5
debug: mpd on master on port 44368
RUNNING: mpd on master
debug: info for running mpd: {'ip': '127.0.0.1', 'ncpus': 1, 'list_port': 44368, 'entry_port': '', 'host': 'master', 'entry_host': '', 'ifhn': ''}
LAUNCHED mpd on node1 via master
debug: launch cmd= rsh -n node1 env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 /opt/intel/impi/4.0.0.027/intel64/bin/mpd.py -h master -p 44368 --ifhn=192.168.20.32 --ncpus=1 --myhost=node1 --myip=192.168.20.32 -e -d -s 5
LAUNCHED mpd on node2 via master
debug: launch cmd= rsh -n node2 env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 /opt/intel/impi/4.0.0.027/intel64/bin/mpd.py -h master -p 44368 --ifhn=192.168.20.34 --ncpus=1 --myhost=node2 --myip=192.168.20.34 -e -d -s 5
LAUNCHED mpd on node3 via master
debug: launch cmd= rsh -n node3 env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 /opt/intel/impi/4.0.0.027/intel64/bin/mpd.py -h master -p 44368 --ifhn=192.168.20.35 --ncpus=1 --myhost=node3 --myip=192.168.20.35 -e -d -s 5
LAUNCHED mpd on node4 via master
debug: launch cmd= rsh -n node4 env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 /opt/intel/impi/4.0.0.027/intel64/bin/mpd.py -h master -p 44368 --ifhn=192.168.20.36 --ncpus=1 --myhost=node4 --myip=192.168.20.36 -e -d -s 5
debug: mpd on node1 on port 33106
RUNNING: mpd on node1
debug: info for running mpd: {'ip': '192.168.20.32', 'ncpus': 1, 'list_port': 33106, 'entry_port': 44368, 'host': 'node1', 'entry_host': 'master', 'ifhn': '', 'pid': 10340}
debug: mpd on node2 on port 58926
RUNNING: mpd on node2
debug: info for running mpd: {'ip': '192.168.20.34', 'ncpus': 1, 'list_port': 58926, 'entry_port': 44368, 'host': 'node2', 'entry_host': 'master', 'ifhn': '', 'pid': 10342}
debug: mpd on node3 on port 42305
RUNNING: mpd on node3
debug: info for running mpd: {'ip': '192.168.20.35', 'ncpus': 1, 'list_port': 42305, 'entry_port': 44368, 'host': 'node3', 'entry_host': 'master', 'ifhn': '', 'pid': 10344}
debug: mpd on node4 on port 39297
RUNNING: mpd on node4
debug: info for running mpd: {'ip': '192.168.20.36', 'ncpus': 1, 'list_port': 39297, 'entry_port': 44368, 'host': 'node4', 'entry_host': 'master', 'ifhn': '', 'pid': 10346}
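(The /etc/mpi/mpd.hosts file used above wasn't posted; since mpdboot also starts an mpd on the local machine itself, for -n 5 it presumably just lists the four worker hosts, one per line, something like:

node1
node2
node3
node4

The actual contents may differ.)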


Running mpiexec on master alone also works:

root@master:/opt/intel/impi/4.0.0.027/test# mpiexec -n 1 ./a.out
Hello world: rank 0 of 1 running on master


However, I have a problem when I try to run it on the cluster:

root@master:/opt/intel/impi/4.0.0.027/test# mpiexec -n 5 ./a.out
problem with execution of ./a.out on node2: [Errno 2] No such file or directory
problem with execution of ./a.out on node4: [Errno 2] No such file or directory
problem with execution of ./a.out on node4: [Errno 2] No such file or directory


Could you please tell me what the reason for this problem might be? Should ./a.out perhaps be placed on each machine?

Thank you for your help, and best wishes,
Milosz

2 Replies
rabbitsoft
Beginner
For testing purposes, I copied a.out to all nodes and it started to work. Does that mean the executable should always be shared by all nodes (e.g. via NFS)?

The other problem I just noticed is that node1 and node3 do not respond:

root@master:/opt/intel/impi/4.0.0.027/test# mpiexec -n 5 ./a.out
Hello world: rank 0 of 5 running on master
Hello world: rank 1 of 5 running on master
Hello world: rank 2 of 5 running on node4
Hello world: rank 3 of 5 running on node4
Hello world: rank 4 of 5 running on node2


root@master:/opt/intel/impi/4.0.0.027/test# ping node3
PING node3 (192.168.20.35) 56(84) bytes of data.
64 bytes from node3 (192.168.20.35): icmp_seq=1 ttl=64 time=0.583 ms

root@master:/opt/intel/impi/4.0.0.027/test# ping node1
PING node1 (192.168.20.32) 56(84) bytes of data.
64 bytes from node1 (192.168.20.32): icmp_seq=1 ttl=64 time=0.617 ms


It is strange because, as you can see below, mpd is running on node1 and node3, and of course ssh works without a password:

root@node3:~# ps aux | grep mpd
root 20453 0.0 0.4 42644 8516 ? S 10:07 0:00 python /opt/intel/impi/4.0.0.027/intel64/bin/mpd.py -h master -p 44368 --ifhn=192.168.20.35 --ncpus=1 --myhost=node3 --myip=192.168.20.35 -e -d -s 5

root@node1:~# ps aux | grep mpd
root 19624 0.0 0.5 42640 8504 ? S 10:07 0:00 python /opt/intel/impi/4.0.0.027/intel64/bin/mpd.py -h master -p 44368 --ifhn=192.168.20.32 --ncpus=1 --myhost=node1 --myip=192.168.20.32 -e -d -s 5


Could you help me please?
Dmitry_K_Intel2
Employee (accepted solution)
Hi Milosz,

>Does it mean executable file should be always shared by all nodes? (e.g by nfs)
Yes, it does. Usually the application is kept on a shared drive (directory), so you don't have to worry about keeping versions in sync, recompiling it on each node, copying it to every node, and so on. It's simply more convenient.
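For example, one way to share that directory is NFS; the lines below are only an illustrative sketch (the path reuses the test directory from this thread, and the export options are just an example). On master, add the directory to /etc/exports:

/opt/intel/impi/4.0.0.027/test  192.168.20.0/24(ro,sync,no_root_squash)

then run 'exportfs -ra' on master and mount the export on each node:

root@node1:~# mount -t nfs master:/opt/intel/impi/4.0.0.027/test /opt/intel/impi/4.0.0.027/test

After that, ./a.out refers to the same file on every host.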

>The other problem I just saw is node3 and node1 does not respond
Note that you started 5 processes, but that doesn't mean one process is placed on each node. In your run, two processes ran on master, two on node4, and one on node2.
If you need to run only one process per node, use the '-ppn 1' option. Please try it out, as in the example below.
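For example, the run from your last post becomes:

root@master:/opt/intel/impi/4.0.0.027/test# mpiexec -ppn 1 -n 5 ./a.out

With '-ppn 1' each of the 5 ranks should be placed on a different host in the mpd ring, so node1 and node3 should each get one rank as well (the exact rank-to-host mapping may still vary).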

Regards!
Dmitry