Hello, we have a small cluster running Rocks Cluster Distribution.
The Intel Cluster Toolkit is installed on a shared filesystem, the paths are correct, and passwordless SSH access works.
I selected 3 nodes (1 head node and 2 compute nodes) and put 3 lines into the file mach: headnode, node1, node2.
On the head node I run:
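i.e. the contents of the mach file are just:

```
headnode
node1
node2
```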
mpirun -r ssh -machinefile mach -np 3 ./test.mpi
WARNING: Unable to read mpd.hosts or list of hosts isn't provided. MPI job will be run on the current machine only.
mpiexec: unable to start all procs; may have invalid machine names
remaining specified hosts:
Does anybody know what the problem could be?
Intel MPI Library for Linux, 64-bit applications, Version 3.2.2 Build 20090827
Copyright (C) 2003-2009 Intel Corporation. All rights reserved.
mpirun is a utility that runs mpdboot and then mpiexec. So the options for mpdboot come first, followed by the options for mpiexec. '-machinefile' is an option for mpiexec.
Could you try changing '-machinefile' to '-f'? Does it work?
I tried replacing -machinefile with -f:
mpirun -r ssh -f mach -np 3 ./test.mpi
there are not enough hosts on which to start all processes
I think the number of mpd processes to start is incremented automatically (a local mpd process is added). How can I avoid this?
You can use these commands:
1. mpdboot -r ssh -f mach -n 3 - will start an mpd ring on 3 nodes, including the local host.
2. mpiexec -nolocal -n 3 ./test.mpi - will start your application on node1 and node2.
Note that '-n' means different things for mpdboot and for mpiexec.
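Putting the two steps together, a full session might look like this sketch (mpdtrace and mpdallexit are the standard MPD ring utilities for inspecting and shutting down the ring; host names come from your mach file):

```shell
mpdboot -r ssh -f mach -n 3        # bring up the mpd ring on all 3 hosts
mpdtrace                           # optional: list the hosts that joined the ring
mpiexec -nolocal -n 3 ./test.mpi   # run the job on the compute nodes only
mpdallexit                         # shut the ring down when finished
```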
The command 'hostname -s' returns the same string as in the mach file on all nodes. However, the command 'hostname' returns node_name.domain_name on our head node and node_name.local on the other nodes - could this be the reason?
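As a quick diagnostic sketch (assuming passwordless ssh to every host listed in mach, as described above), the mismatch can be seen with:

```shell
# Print the short and full hostname for every host in the mach file.
for h in $(cat mach); do
    echo "$h: short=$(ssh $h hostname -s) full=$(ssh $h hostname)"
done
```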
We want to use only the mpirun command, because it can be integrated with our PBS (Torque) setup - mpirun understands the $PBS_NODEFILE variable. Also, the command 'mpirun -r ssh -np $proc ./test.mpi' (where $proc = `cat $PBS_NODEFILE | wc -l`) runs normally in a PBS script only if the requested resources don't include the head node - otherwise it hangs. We think this may be connected to the hostname problem. Could you also help us with this?
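For reference, a minimal sketch of the kind of job script we mean (the resource request, job name, and application name are illustrative):

```shell
#!/bin/bash
#PBS -l nodes=3:ppn=1
#PBS -N test_mpi
cd $PBS_O_WORKDIR
proc=$(wc -l < $PBS_NODEFILE)   # one process per line in the nodefile
mpirun -r ssh -np $proc ./test.mpi
```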
If you don't mind, you could upgrade your Intel MPI Library to version 4.0 update 1 and use mpiexec.hydra instead of mpirun. It should be optimal for your purposes. Just run:
mpiexec.hydra -rmk pbs ./your_application
and this new process manager will read the needed information from the PBS environment.
4.0 update 1 will be installed in a different directory, so you can use either 4.0.1 or 3.2.2.
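So under PBS the whole job script could shrink to something like this sketch (the resource request is illustrative):

```shell
#!/bin/bash
#PBS -l nodes=3:ppn=1
cd $PBS_O_WORKDIR
# Hydra queries the PBS resource manager itself for the host list,
# so no machinefile and no mpd ring are needed.
mpiexec.hydra -rmk pbs ./test.mpi
```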