The Intel MPI Library recognizes SLURM_JOBID, SLURM_NNODES, SLURM_NODELIST, and some other environment variables, but you need to use mpirun to start your application. Only in that case will the mpd ring be created.
You probably need to add the '-nolocal' option, because the node on which you start your application is added to the ring automatically.
Are you using rsh or ssh connection between nodes?
If you are using ssh you need to provide '-r ssh' option.
Please make sure that a passwordless connection is set up and that you can log in both from rm1203 to rm1024 and vice versa.
Passwordless (better: passphraseless) ssh needs to be set up for each user, and users might be tempted to copy these keypairs to other systems to ease their logins. It is advantageous to have a host-based login instead:
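A host-based setup might look roughly like the following. This is only a sketch, not the poster's configuration: the directives and file names are standard OpenSSH ones, but check them against your OpenSSH version, and note that distributing host keys to every node is assumed to be handled by your admin tooling.

```
# On every node, server side (/etc/ssh/sshd_config):
HostbasedAuthentication yes

# On every node, client side (/etc/ssh/ssh_config):
HostbasedAuthentication yes
EnableSSHKeysign yes

# /etc/shosts.equiv: one trusted cluster hostname per line
rm1203
rm1024

# /etc/ssh/ssh_known_hosts must contain the host keys of all nodes.
```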
--rsh specifies the name of the command used to start remote mpds; it defaults to rsh; an alternative is ssh
--shell says that the Bourne shell is your default for rsh
--verbose shows the ssh attempts as they occur; it does not provide confirmation that the sshs were successful
Intel MPI Library for Linux, Version 4.0, Build 20100422, Platform: Intel 64 (64-bit applications)
As Dmitry said, you should try stand-alone ssh in both directions between the offending nodes, from the relevant account, to guard against problems such as ~/.ssh/known_hosts containing stale information.
mpdboot (with the appropriate node list) followed by mpdtrace and mpdallexit can be used as a one-time check for this problem, without waiting for a chance to run the entire application.
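That one-time check can be scripted; here is a minimal sketch, assuming Intel MPI's mpd tools are on PATH and an mpd.hosts file lists the two nodes. The helper is only defined, not run, since it needs the actual cluster; the function name is mine.

```shell
# Hypothetical helper: one-time mpd ring check. Assumes mpdboot, mpdtrace,
# and mpdallexit (Intel MPI) are on PATH and mpd.hosts has one host per line.
mpd_ring_check() {
  mpdboot -n 2 -f mpd.hosts -r ssh   # start the ring on the listed nodes
  mpdtrace                           # should print the hostname of every node
  mpdallexit                         # tear the ring down again
}
```

If mpdtrace does not list all nodes, the ssh setup between them is the first thing to recheck.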
By default, mpdboot looks for mpd.hosts in the current directory to get information about the nodes. mpdboot doesn't recognize SLURM settings!
If you don't have an mpd.hosts file, use the '-f hosts_file.txt' option.
In your case it might look like: 'mpdboot -f $SLURM_NODELIST -n 2 -r ssh'
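One caveat: $SLURM_NODELIST is in SLURM's compressed form (e.g. 'rr[10,72]'), whereas mpdboot's -f option expects a file with one host per line. On a SLURM system, 'scontrol show hostnames' expands the list; as a portable sketch for the simple comma form only, something like the helper below works (the function name is mine, not part of any tool).

```shell
# Hypothetical helper: expand a simple comma-form SLURM node list such as
# "rr[10,72]" into one hostname per line. Ranges like rr[10-12] are NOT
# handled; on a real cluster prefer: scontrol show hostnames "$SLURM_NODELIST"
expand_nodelist() {
  echo "$1" | sed 's/^\([^[]*\)\[\(.*\)\]$/\1 \2/' | \
    awk '{ n = split($2, a, ","); for (i = 1; i <= n; i++) print $1 a[i] }'
}

expand_nodelist "rr[10,72]"   # prints rr10 and rr72, one per line
```

The expanded output can be redirected to a file and passed to mpdboot via '-f'.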
Just one thought:
Do you run your commands after salloc? Or do you use the sbatch or srun commands?
Could you provide details about all commands used to start an application?
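For reference, a minimal salloc-based flow might look like the sketch below. It assumes the application is ./a.out on two nodes, as elsewhere in this thread; the function is only defined, not run, since it needs a SLURM cluster, and its name is mine.

```shell
# Hypothetical salloc flow: allocate two nodes, then start the MPI job inside
# the allocation so mpirun can pick up the SLURM environment variables.
salloc_flow() {
  salloc -N 2 mpirun -r ssh -np 2 ./a.out
}
```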
In your script intel-rr.batch, please remove "mpdboot -f $SLURM_NODELIST -n 2 -v -r ssh" - you cannot use 'mpdboot'! You need to use mpirun instead.
Change your command line for mpirun:
mpirun -r ssh -nolocal -np 2 ./a.out
Check the log after "sbatch intel-rr.batch" - the node-list format ('rr[10,72]') should be parsed correctly.
Please provide the output if the problem persists.
mpirun was able to parse the host names correctly.
Please check that you are able to log in to node rr77 from rr76 and vice versa without entering a password:
(from rr77) ssh rr76
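That check can be scripted as well; a sketch, assuming it is run on one of the two nodes. BatchMode=yes makes ssh fail instead of prompting, so a missing key shows up as a non-zero exit; the helper name is mine.

```shell
# Hypothetical helper: verify passwordless ssh to a peer node, one direction
# per invocation. Run "check_ssh rr76" from rr77 and "check_ssh rr77" from
# rr76 to cover both directions.
check_ssh() {
  ssh -o BatchMode=yes "$1" true \
    && echo "passwordless ssh to $1: OK" \
    || echo "passwordless ssh to $1: FAILED"
}
```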
Could you try the following scenario?
I assume that you have a 'hello' application; it can be any simple MPI test.
1. Create a test.sh file. For instance:
$ cat test.sh
#!/bin/bash
srun hostname -s | sort -u > mpd.hosts
# Example 1
# Launch application using hydra process manager
mpiexec.hydra -f mpd.hosts -n $SLURM_NPROCS -env I_MPI_DEBUG 5 ./hello
# Example 2
# Launch application using MPD process manager
mpdboot -n $SLURM_NNODES -r ssh
mpiexec -n $SLURM_NPROCS -env I_MPI_DEBUG 5 ./hello
2. Submit the job using the sbatch command. For instance,
$ sbatch -n 4 test.sh
Please let me know whether the suggestion helps.