All,
(Note: I'm also asking this on the slurm-dev list.)
I'm hoping you can help me with a question. Namely, I'm on a cluster that uses SLURM, and let's say I ask for two 28-core Haswell nodes to run interactively and I get them. Great, so my environment now has things like:
SLURM_NTASKS_PER_NODE=28
SLURM_TASKS_PER_NODE=28(x2)
SLURM_JOB_CPUS_PER_NODE=28(x2)
SLURM_CPUS_ON_NODE=28
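For reference, the allocation was requested with something along these lines (a sketch only; the exact flags and the "hasw" constraint name are assumptions, since every site configures this differently):

salloc --nodes=2 --ntasks-per-node=28 --constraint=hasw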
Now, let's run a simple HelloWorld on, say, 48 processors (and pipe through sort to see things a bit better):
(1047) $ mpirun -np 48 -print-rank-map ./helloWorld.exe | sort -k2 -g
srun.slurm: cluster configuration lacks support for cpu binding
(borgj102:0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27)
(borgj105:28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47)
Process 0 of 48 is on borgj102
Process 1 of 48 is on borgj102
...
Process 27 of 48 is on borgj102
Process 28 of 48 is on borgj105
Process 29 of 48 is on borgj105
...
Process 47 of 48 is on borgj105
As you can see, the first 28 processes are on node 1, and the last 20 are on node 2. Okay. Now, I want to do some load balancing, so I want 24 on each. In the past, I always used -perhost and it worked, but now:
(1048) $ mpirun -np 48 -perhost 24 -print-rank-map ./helloWorld.exe | sort -k2 -g
srun.slurm: cluster configuration lacks support for cpu binding
(borgj102:0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27)
(borgj105:28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47)
Process 0 of 48 is on borgj102
Process 1 of 48 is on borgj102
...
Process 27 of 48 is on borgj102
Process 28 of 48 is on borgj105
Process 29 of 48 is on borgj105
...
Process 47 of 48 is on borgj105
Huh. No change: still a 28/20 split. Do you know if there is a way to "override" what appears to be SLURM beating the -perhost flag? I suppose there is that srun.slurm warning being thrown, but that usually indicates "tasks-per-core"-style manipulations rather than placement.
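For what it's worth, one sanity check I can think of is asking srun for the placement directly, bypassing mpirun's generated machinefile (a sketch; this assumes the binary can be launched under srun at all, which depends on the MPI/PMI setup):

srun --ntasks=48 --ntasks-per-node=24 ./helloWorld.exe | sort -k2 -g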
Thanks,
Matt
Oh, and since I forgot to mention it: I'm running Intel MPI 5.0.3.048. Sorry!
Addendum: per an admin here at NASA on the SLURM list:
I'm pretty confident in saying this is entirely in Intel MPI land:

aknister@borgj157:~> I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=enable mpiexec.hydra -np 48 -ppn 24 -print-rank-map /bin/true
(borgj157:0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27)
(borgj164:28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47)

aknister@borgj157:~> I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=disable mpiexec.hydra -np 48 -ppn 24 -print-rank-map /bin/true
(borgj157:0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23)
(borgj164:24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47)

However, if a machinefile argument is passed to mpiexec.hydra (which mpirun does by default), the I_MPI_JOB_RESPECT_PROCESS_PLACEMENT variable isn't respected (see below). Maybe we need an I_MPI_JOB_RESPECT_I_MPI_JOB_RESPECT_PROCESS_PLACEMENT_VARIABLE variable.

aknister@borgj157:~> I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=enable mpiexec.hydra -machinefile $PBS_NODEFILE -np 48 -ppn 24 --print-rank-map true
(borgj157:0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27)
(borgj164:28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47)

aknister@borgj157:~> I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=disable mpiexec.hydra -machinefile $PBS_NODEFILE -np 48 -ppn 24 --print-rank-map true
(borgj157:0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27)
(borgj164:28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47)
Does anyone here at Intel know how to get mpirun to respect this so -ppn can work with SLURM?
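In the meantime, a possible workaround sketch is to hand Hydra an explicit machinefile with 24 slots per node, so the file itself encodes the placement we want (the host:count machinefile syntax and the host names here are taken from the runs above; this is unverified on our setup):

cat > mf.txt <<EOF
borgj157:24
borgj164:24
EOF
mpiexec.hydra -machinefile mf.txt -np 48 -print-rank-map /bin/true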
Overriding works with Intel MPI 5.1.3.181
I just tried this with Intel MPI 5.1.3.181. It seems "I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=disable" is no longer ignored. When this variable is set, SLURM process placement is overridden by "-ppn" or "-perhost".
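For anyone who lands here later, the working combination on 5.1.3.181 would look something like this (a sketch assembled from this thread, reusing the 48-rank helloWorld example from the original post):

export I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=disable
mpirun -np 48 -perhost 24 -print-rank-map ./helloWorld.exe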