Hello, we have a small cluster running the Rocks Cluster Distribution.
The Intel Cluster Toolkit is installed on a shared filesystem, the paths are OK, and passwordless SSH access works.
I selected 3 nodes, 1 head node and 2 compute nodes, and put 3 lines into a file named mach: headnode, node1, node2.
On the head node I run:
mpirun -r ssh -machinefile mach -np 3 ./test.mpi
WARNING: Unable to read mpd.hosts or list of hosts isn't provided. MPI job will be run on the current machine only.
mpiexec: unable to start all procs; may have invalid machine names
remaining specified hosts:
Does anybody know what the problem is?
mpiexec -V
Intel MPI Library for Linux, 64-bit applications, Version 3.2.2 Build 20090827
Copyright (C) 2003-2009 Intel Corporation. All rights reserved.
Thanks!
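For reference, the mach machinefile described above can be recreated like this; the host names are the ones from the post, and they must match what each node actually resolves to:

```shell
# Create the 'mach' machinefile with the three hosts from the post.
cat > mach <<'EOF'
headnode
node1
node2
EOF
```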
mpirun is a utility that runs mpdboot and then mpiexec, so the options for mpdboot come first, followed by the options for mpiexec. '-machinefile' is an option for mpiexec.
Could you try changing '-machinefile' to '-f'? Does that work?
Regards!
Dmitry
I tried replacing -machinefile with -f:
mpirun -r ssh -f mach -np 3 ./test.mpi
totalnum=4 numhosts=3
there are not enough hosts on which to start all processes
I think the count of mpd processes to start is incremented automatically (a local mpd process is added). How can I avoid this?
You can use these commands:
1. mpdboot -r ssh -f mach -n 3 - starts an mpd ring on 3 nodes, including the local host.
2. mpiexec -nolocal -n 3 ./test.mpi - starts your application on node1 and node2 only.
3. mpdcleanup - shuts the mpd ring down afterwards.
Note that '-n' has a different meaning for mpdboot and for mpiexec.
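The three steps above can be put into one small script. This is only a sketch: it assumes the mach file and ./test.mpi from earlier in the thread, and it will only do anything useful on the cluster itself:

```shell
#!/bin/sh
# Sketch of the three steps above (cluster-only; assumes 'mach' and
# './test.mpi' from this thread exist on the shared filesystem).

# 1. Start an mpd ring on 3 hosts. Here '-n' counts mpd daemons,
#    and the local host is always part of the ring.
mpdboot -r ssh -f mach -n 3

# 2. Launch 3 MPI ranks, skipping the local host, so the job runs
#    on node1 and node2 only. Here '-n' counts MPI processes.
mpiexec -nolocal -n 3 ./test.mpi

# 3. Tear the mpd ring down again.
mpdcleanup
```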
Regards!
Dmitry
The command 'hostname -s' returns the same string as in mach file on all nodes. However, the command 'hostname' returns node_name.domain_name on our head node, and node_name.local on other nodes - could this be the reason?
We want to use only the mpirun command, because it can be used from our PBS (Torque) scripts: mpirun understands the $PBS_NODEFILE variable. Also, the command 'mpirun -r ssh -np $proc ./test.mpi' (where $proc = `cat $PBS_NODEFILE | wc -l`) runs normally in the PBS script only if the requested resources don't include the head node; otherwise it hangs. We think this may be connected to the hostname problem. Could you also help us with this?
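The $proc computation can be sketched like this. Here sample_nodefile is a stand-in for the $PBS_NODEFILE that Torque would generate inside a real job, and the mpirun line is shown as a comment because it only makes sense on the cluster:

```shell
# Simulate the node file Torque would point $PBS_NODEFILE at
# (stand-in file; inside a real job, Torque writes this for you).
printf 'headnode\nnode1\nnode2\n' > sample_nodefile

# One MPI rank per node-file line, as in the post's $proc.
proc=$(wc -l < sample_nodefile)

# Inside a real PBS job script this would then be:
#   mpirun -r ssh -np "$proc" ./test.mpi
echo "$proc"
```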
If you don't mind, you could upgrade your Intel MPI Library to version 4.0 update 1 and use mpiexec.hydra instead of mpirun. It should be optimal for your purposes. Just run:
mpiexec.hydra -rmk pbs ./your_application
and this new process manager will read the needed information from the PBS environment.
4.0 update 1 is installed in a different directory, so you can use either 4.0.1 or 3.2.2.
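In a Torque job script this could look like the following sketch; the resource request and the test.mpi binary name are assumptions taken from earlier in the thread:

```shell
#!/bin/sh
#PBS -l nodes=3
# Sketch of a Torque job script using the Hydra process manager
# (Intel MPI 4.0 update 1 or later). With '-rmk pbs', Hydra reads
# the host list and process count from the PBS environment, so no
# -np or machinefile is needed.
cd "$PBS_O_WORKDIR"
mpiexec.hydra -rmk pbs ./test.mpi
```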
Regards!
Dmitry