Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

Mpirun is treating -perhost, -ppn, -grr the same: always round-robin

Raghu_R_
Beginner

Our cluster has 2 Haswell sockets per node, each with 12 cores (24 cores/node).

Using: intel/15.1.133, impi/5.0.3.048

Irrespective of which of the options mentioned in the subject line is used, ranks are always being placed in round-robin fashion.  The commands are run inside a batch job, which produces a host file with lines like the following when the job is submitted with:

qsub -l nodes=2:ppn=1 ...

 

tfe02.% cat hostfile
t0728
t0731
tfe02.%
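
In case it helps, the batch script looks roughly like the following sketch (the walltime and file names here are illustrative, not the exact script we use):

----------

#PBS -l nodes=2:ppn=1
#PBS -l walltime=00:10:00
cd $PBS_O_WORKDIR
cat $PBS_NODEFILE > hostfile
mpirun -ordered-output -np 4 -perhost 2 ./hello_mpi_c-intel-impi

----------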

As an aside, it looks like "-ordered-output" is also being ignored.  I understand that is a little difficult to achieve, but I just wanted to use it for better readability.  So please note that the ranks are not printed out in order.

With "-perhost 2" I was expecting ranks 0 on 1 to be on the same node:

-------------

cat /var/spool/torque/aux//889322.bqs5
s0014
s0015
mpirun -ordered-output -np 4 -perhost 2 ./hello_mpi_c-intel-impi
Hello from rank 01 out of 4; procname = s0015, cpuid = 12
Hello from rank 03 out of 4; procname = s0015, cpuid = 24
Hello from rank 02 out of 4; procname = s0014, cpuid = 0
Hello from rank 00 out of 4; procname = s0014, cpuid = 12
---------

The help output from mpirun indicates "-perhost" and "-ppn" are equivalent:

----------

cat /var/spool/torque/aux//889321.bqs5
s0014
s0015
mpirun -ordered-output -np 4 -ppn 2 ./hello_mpi_c-intel-impi
Hello from rank 00 out of 4; procname = s0014, cpuid = 12
Hello from rank 02 out of 4; procname = s0014, cpuid = 0
Hello from rank 01 out of 4; procname = s0015, cpuid = 12
Hello from rank 03 out of 4; procname = s0015, cpuid = 24

--------

Again, "-grr" output is not what was expected:

----------------

cat /var/spool/torque/aux//889323.bqs5
s0014
s0015
mpirun -ordered-output -np 4 -grr 2 ./hello_mpi_c-intel-impi
Hello from rank 02 out of 4; procname = s0014, cpuid = 2
Hello from rank 00 out of 4; procname = s0014, cpuid = 12
Hello from rank 03 out of 4; procname = s0015, cpuid = 24
Hello from rank 01 out of 4; procname = s0015, cpuid = 12
 

I'm including the code below; it has not been cleaned up :-(

Please ignore the parts that are not relevant.

/* _GNU_SOURCE must be defined before any headers so that <sched.h>
   declares sched_getcpu().  See feature_test_macros(7). */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sched.h>
#include <mpi.h>

int main(int argc, char **argv)
{
   int ierr, myid, npes;
   int len, i;
   char name[MPI_MAX_PROCESSOR_NAME];

   ierr = MPI_Init(&argc, &argv);

/* If MACROTEST was defined on the compile line (e.g. -DMACROTEST),
   give it a concrete value for the test printout below. */
#ifdef MACROTEST
#undef MACROTEST
#define MACROTEST 10
#endif

   ierr = MPI_Comm_rank(MPI_COMM_WORLD, &myid);
   ierr = MPI_Comm_size(MPI_COMM_WORLD, &npes);
   ierr = MPI_Get_processor_name(name, &len);

#ifdef SLEEP
   /* Busy-wait loop to delay the output (the original bound of 1e1150
      overflows a double; use a large but representable bound instead). */
   for (i = 1; i < 2000000000; i++)
     ;
#endif

   printf("Hello from rank %2.2d out of %d; procname = %s, cpuid = %d\n",
          myid, npes, name, sched_getcpu());

#ifdef MACROTEST
   printf("Test Macro: %d\n", MACROTEST);
#endif

#ifdef BUG
   {
     int *x = (int *)malloc(10 * sizeof(int));
     x[10] = 0;                       /* problem 1: heap block overrun */
     printf("Print something %d\n", x[10]);
   }                                  /* problem 2: memory leak -- x not freed */
#endif

   ierr = MPI_Finalize();

   return 0;
}
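
For reference, the test program is built with the Intel MPI compiler wrapper, roughly like this (the source file name and the optional -D toggles are illustrative):

mpiicc -o hello_mpi_c-intel-impi hello_mpi_c.c
# enabling the optional test sections shown above:
mpiicc -DMACROTEST -DSLEEP -o hello_mpi_c-intel-impi hello_mpi_c.c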

 

Michael_Intel
Moderator

Hello,

By default, Intel MPI respects the job scheduler settings over those that you provide in the form of parameters or environment variables.

Therefore "qsub -l nodes=2:ppn=1 ..." leads to 1 rank per node until it runs out of nodes and starts over on the first node again (round-robin).

This behavior helps prevent users from accidentally using non-allocated resources, which might cause the job scheduler to kill the job.

To change this behavior and make Intel MPI prioritize your parameters over those of the batch system, you can explicitly set I_MPI_JOB_RESPECT_PROCESS_PLACEMENT to 0.
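
For example (the executable name below is just the one from your test; you can set the variable in your job script or pass it on the mpirun command line with -genv):

export I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=0
mpirun -np 4 -perhost 2 ./hello_mpi_c-intel-impi

# or, equivalently:
mpirun -genv I_MPI_JOB_RESPECT_PROCESS_PLACEMENT 0 -np 4 -perhost 2 ./hello_mpi_c-intel-impi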

Please let us know if that helps.

Best regards,

Michael

Raghu_R_
Beginner

Hi Michael,

Thank you very much for the prompt response!  Confirming that it worked exactly as you stated!

The first run below was done the way we always do it, and the second was run with the environment variable I_MPI_JOB_RESPECT_PROCESS_PLACEMENT set to 0.

tfe08.% grep ^Hello sifs-2-1-2-4.o2698582 | sort -n -k4
Hello from rank 00 out of 4; procname = t1010, cpuid = 24
Hello from rank 01 out of 4; procname = t1012, cpuid = 24
Hello from rank 02 out of 4; procname = t1010, cpuid = 14
Hello from rank 03 out of 4; procname = t1012, cpuid = 12
tfe08.%
tfe08.% grep ^Hello sifs-2-1-2-4.o2698605 | sort -n -k4
Hello from rank 00 out of 4; procname = t0920, cpuid = 24
Hello from rank 01 out of 4; procname = t0920, cpuid = 12
Hello from rank 02 out of 4; procname = t0928, cpuid = 24
Hello from rank 03 out of 4; procname = t0928, cpuid = 12
tfe08.%

And now I see that this is documented in the Intel MPI Reference Manual!

Thanks a bunch!
