The -perhost option does not work as expected with Intel MPI v.5.0.1.035, though it does work with Intel MPI v.4.1.0.024:
$ qsub -I -lnodes=2:ppn=16:compute,walltime=0:15:00
qsub: waiting for job 5731.hpc-class.its.iastate.edu to start
qsub: job 5731.hpc-class.its.iastate.edu ready
$ mpirun -n 2 -perhost 1 uname -n
hpc-class-40.its.iastate.edu
hpc-class-40.its.iastate.edu
$ export I_MPI_ROOT=/shared/intel//impi/4.1.0.024
$ PATH="${I_MPI_ROOT}/intel64/bin:${PATH}"; export PATH
$ mpirun -n 2 -perhost 1 uname -n
hpc-class-40.its.iastate.edu
hpc-class-39.its.iastate.edu
I also ran the same commands with I_MPI_HYDRA_DEBUG set to 1 (see the attached files mpirun-perhost.txt and mpirun-perhost-4.1.0.024.txt). Note that the first two lines of the output in mpirun-perhost.txt suggest that -perhost works (two different hostnames are printed), but at the end the same hostname is still printed twice.
In mpirun-perhost.txt, I_MPI_PERHOST is reported to be allcores. In another run (see the attached file mpirun-perhost-PERHOST1.txt) I set I_MPI_PERHOST to 1, but at the end one hostname is still printed twice.
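(For anyone trying to reproduce those logs, the runs were along the lines of the sketch below; the tee target is just an example file name:)
$ export I_MPI_HYDRA_DEBUG=1
$ export I_MPI_PERHOST=1
$ mpirun -n 2 uname -n 2>&1 | tee mpirun-perhost-PERHOST1.txt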
To prove that both hostnames are available, I ran the command with 17 processes (there are 16 cores per node):
[grl@hpc-class-39 ~]$ mpirun -n 17 uname -n | uniq -c
16 hpc-class-39.its.iastate.edu
1 hpc-class-38.its.iastate.edu
Comparing mpirun-perhost.txt and mpirun-perhost-4.1.0.024.txt, one can see the following difference:
mpirun-perhost.txt :
Proxy information:
*********************
[1] proxy: hpc-class-40.its.iastate.edu (16 cores)
Exec list: uname (2 processes);
mpirun-perhost-4.1.0.024.txt :
Proxy information:
*********************
[1] proxy: hpc-class-39.its.iastate.edu (1 cores)
Exec list: uname (1 processes);
[2] proxy: hpc-class-38.its.iastate.edu (1 cores)
Exec list: uname (1 processes);
So, somehow the exec list in the Intel MPI v.5.0.1.035 run does not take the -perhost value into account.
Can anyone reproduce the problem?
Try running outside of PBS. It looks like the PBS environment is overriding the -perhost option.
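For example, from a login node you could try something along these lines (a sketch; the -hosts list reuses the node names from your session above):
$ mpirun -hosts hpc-class-39,hpc-class-40 -n 2 -perhost 1 uname -n
If that places one process on each node, the scheduler integration is the likely culprit.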
Hello James and Marina,
I have the same problem: parallel code running under SGE with mpirun (version 5.0.0) does not respect any of the -ppn, -perhost, or -rr options. This is a problem because every application on our cluster has to run under scheduler control, and some codes (for example, ones that use MPI between nodes and OpenMP within a node) must run with only so many processes per node. Rewriting a hostfile (for use with the -machinefile option) for each application is impractical and undesirable; we rely on the built-in ability of mpirun to dispatch to a node only as many processes as needed. Could you please offer a workaround that works within the Hydra-scheduler integration?
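For illustration, the per-job rewrite we are trying to avoid looks roughly like this (a sketch, assuming the usual $PE_HOSTFILE layout of one "hostname slots queue processor-range" line per node, and a hypothetical ./app binary; the :1 suffix limits each node to one process):
$ awk '{print $1":1"}' "$PE_HOSTFILE" > machines.$JOB_ID
$ mpirun -machinefile machines.$JOB_ID -n 2 ./app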
Thanks,
Raffaella.
RD,
Please provide the output (as an attached text file) with I_MPI_HYDRA_DEBUG=1. This will show the environment seen by the launcher and should help identify why it isn't detecting SGE.
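Something like this inside the job script should capture it (a sketch; ./app is a stand-in for your binary):
$ export I_MPI_HYDRA_DEBUG=1
$ mpirun -n 2 -ppn 1 ./app > hydra-debug.txt 2>&1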
