Hello,
I have 5 machines: `master` and `node1`-`node4`. I compiled Intel's test.cpp file with mpicxx on `master`.
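For context, the test program is essentially a minimal MPI hello world along these lines (my own reconstruction; the test.cpp shipped with Intel MPI may differ slightly), built with something like `mpicxx test.cpp` so the binary ends up as ./a.out:

#include <mpi.h>
#include <cstdio>

// Rough reconstruction of a minimal MPI hello world; the actual Intel test.cpp may differ.
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0, size = 0, len = 0;
    char name[MPI_MAX_PROCESSOR_NAME];

    // Query this process's rank, the total number of ranks, and the host it runs on.
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);

    printf("Hello world: rank %d of %d running on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}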
MPD works fine, as shown below:
root@master:/opt/intel/impi/4.0.0.027/test# mpdboot -n 5 -v -d -f /etc/mpi/mpd.hosts
debug: starting
running mpdallexit on master
LAUNCHED mpd on master via
debug: launch cmd= env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 /opt/intel/impi/4.0.0.027/intel64/bin/mpd.py --ncpus=1 --myhost=master -e -d -s 5
debug: mpd on master on port 44368
RUNNING: mpd on master
debug: info for running mpd: {'ip': '127.0.0.1', 'ncpus': 1, 'list_port': 44368, 'entry_port': '', 'host': 'master', 'entry_host': '', 'ifhn': ''}
LAUNCHED mpd on node1 via master
debug: launch cmd= rsh -n node1 env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 /opt/intel/impi/4.0.0.027/intel64/bin/mpd.py -h master -p 44368 --ifhn=192.168.20.32 --ncpus=1 --myhost=node1 --myip=192.168.20.32 -e -d -s 5
LAUNCHED mpd on node2 via master
debug: launch cmd= rsh -n node2 env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 /opt/intel/impi/4.0.0.027/intel64/bin/mpd.py -h master -p 44368 --ifhn=192.168.20.34 --ncpus=1 --myhost=node2 --myip=192.168.20.34 -e -d -s 5
LAUNCHED mpd on node3 via master
debug: launch cmd= rsh -n node3 env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 /opt/intel/impi/4.0.0.027/intel64/bin/mpd.py -h master -p 44368 --ifhn=192.168.20.35 --ncpus=1 --myhost=node3 --myip=192.168.20.35 -e -d -s 5
LAUNCHED mpd on node4 via master
debug: launch cmd= rsh -n node4 env I_MPI_JOB_TAGGED_PORT_OUTPUT=1 /opt/intel/impi/4.0.0.027/intel64/bin/mpd.py -h master -p 44368 --ifhn=192.168.20.36 --ncpus=1 --myhost=node4 --myip=192.168.20.36 -e -d -s 5
debug: mpd on node1 on port 33106
RUNNING: mpd on node1
debug: info for running mpd: {'ip': '192.168.20.32', 'ncpus': 1, 'list_port': 33106, 'entry_port': 44368, 'host': 'node1', 'entry_host': 'master', 'ifhn': '', 'pid': 10340}
debug: mpd on node2 on port 58926
RUNNING: mpd on node2
debug: info for running mpd: {'ip': '192.168.20.34', 'ncpus': 1, 'list_port': 58926, 'entry_port': 44368, 'host': 'node2', 'entry_host': 'master', 'ifhn': '', 'pid': 10342}
debug: mpd on node3 on port 42305
RUNNING: mpd on node3
debug: info for running mpd: {'ip': '192.168.20.35', 'ncpus': 1, 'list_port': 42305, 'entry_port': 44368, 'host': 'node3', 'entry_host': 'master', 'ifhn': '', 'pid': 10344}
debug: mpd on node4 on port 39297
RUNNING: mpd on node4
debug: info for running mpd: {'ip': '192.168.20.36', 'ncpus': 1, 'list_port': 39297, 'entry_port': 44368, 'host': 'node4', 'entry_host': 'master', 'ifhn': '', 'pid': 10346}
Running mpiexec on master alone also works:
root@master:/opt/intel/impi/4.0.0.027/test# mpiexec -n 1 ./a.out
Hello world: rank 0 of 1 running on master
However, I have a problem when I try to run it on the cluster:
root@master:/opt/intel/impi/4.0.0.027/test# mpiexec -n 5 ./a.out
problem with execution of ./a.out on node2: [Errno 2] No such file or directory
problem with execution of ./a.out on node4: [Errno 2] No such file or directory
problem with execution of ./a.out on node4: [Errno 2] No such file or directory
Could you please tell me what might be causing this problem? Perhaps ./a.out needs to be placed on each machine?
Thank you for your help and best wishes,
Milosz
2 Replies
I copied a.out to all nodes for testing purposes, and it started to work. Does this mean the executable always has to be shared by (or present on) all nodes, e.g. via NFS?
The other problem I just noticed is that node3 and node1 do not respond:
root@master:/opt/intel/impi/4.0.0.027/test# mpiexec -n 5 ./a.out
Hello world: rank 0 of 5 running on master
Hello world: rank 1 of 5 running on master
Hello world: rank 2 of 5 running on node4
Hello world: rank 3 of 5 running on node4
Hello world: rank 4 of 5 running on node2
root@master:/opt/intel/impi/4.0.0.027/test# ping node3
PING node3 (192.168.20.35) 56(84) bytes of data.
64 bytes from node3 (192.168.20.35): icmp_seq=1 ttl=64 time=0.583 ms
root@master:/opt/intel/impi/4.0.0.027/test# ping node1
PING node1 (192.168.20.32) 56(84) bytes of data.
64 bytes from node1 (192.168.20.32): icmp_seq=1 ttl=64 time=0.617 ms
It is strange because, as you can see below, mpd is running on node3 and node1, and of course ssh works without a password:
root@node3:~# ps aux | grep mpd
root 20453 0.0 0.4 42644 8516 ? S 10:07 0:00 python /opt/intel/impi/4.0.0.027/intel64/bin/mpd.py -h master -p 44368 --ifhn=192.168.20.35 --ncpus=1 --myhost=node3 --myip=192.168.20.35 -e -d -s 5
root@node1:~# ps aux | grep mpd
root 19624 0.0 0.5 42640 8504 ? S 10:07 0:00 python /opt/intel/impi/4.0.0.027/intel64/bin/mpd.py -h master -p 44368 --ifhn=192.168.20.32 --ncpus=1 --myhost=node1 --myip=192.168.20.32 -e -d -s 5
Could you help me please?
Hi Milosz,
>Does this mean the executable always has to be shared by (or present on) all nodes, e.g. via NFS?
Yes, it does. Usually the application sits on a shared drive (directory), so you don't have to worry about keeping versions in sync, recompiling it on each node, copying it to all nodes, and so on. It's simply more convenient.
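As a rough sketch (the export options and the choice of sharing the existing test directory are just an illustration, not taken from your cluster), sharing the test directory from master over NFS could look like this:

# On master: export the test directory to the cluster subnet, then reload the exports
echo '/opt/intel/impi/4.0.0.027/test 192.168.20.0/24(rw,sync,no_root_squash)' >> /etc/exports
exportfs -ra

# On each node: mount it at the same path so ./a.out resolves identically everywhere
mount -t nfs master:/opt/intel/impi/4.0.0.027/test /opt/intel/impi/4.0.0.027/test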
>The other problem I just noticed is that node3 and node1 do not respond
Note that you created 5 processes, but that doesn't mean one process is placed on each node. In your run, 2 processes ran on master, 2 on node4, and 1 on node2.
If you need to run only 1 process per node, use the '-ppn 1' option. Please try it out.
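For example, the following should place exactly one rank on each of your 5 machines (the order in which ranks map to hosts may still depend on the mpd ring):

mpiexec -ppn 1 -n 5 ./a.out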
Regards!
Dmitry