
SLURM and I_MPI_JOB_RESPECT_PROCESS_PLACEMENT

Ronald_G_2
Beginner

I was having issues with Intel MPI 5.x (5.2.1 and older) not respecting -ppn or -perhost. Searching this forum, I found this post:

https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/topic/557016

So the original behavior is to ignore -ppn. I have 2 nodes, ml036 and ml311. My SLURM node list is:

SLURM_JOB_NODELIST=ml[036,311]

Without setting I_MPI_JOB_RESPECT_PROCESS_PLACEMENT, I see -ppn ignored:

[green@ml036 ~]$ mpirun -n 2 -ppn 1 ./hello_mpi
hello_parallel.f: Number of tasks=  2 My rank=  0 My name=ml036.localdomain
hello_parallel.f: Number of tasks=  2 My rank=  1 My name=ml036.localdomain


Following that previous post, I set

setenv I_MPI_JOB_RESPECT_PROCESS_PLACEMENT disable

and then -ppn works as expected:

[green@ml036 ~]$ setenv I_MPI_JOB_RESPECT_PROCESS_PLACEMENT disable
[green@ml036 ~]$ mpirun -n 2 -ppn 1 ./hello_mpi
hello_parallel.f: Number of tasks=  2 My rank=  0 My name=ml036.localdomain
hello_parallel.f: Number of tasks=  2 My rank=  1 My name=ml311.localdomain
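
For reference, the same workaround in a bash-based batch script would look roughly like this (a minimal sketch; the #SBATCH options and the script form are my assumptions, only the environment variable and the mpirun flags come from the runs above):

#!/bin/bash
#SBATCH --nodes=2                 # hypothetical: matches the 2-node case above
#SBATCH --ntasks-per-node=16      # hypothetical: mirrors SLURM_NTASKS_PER_NODE=16 from the env dump below
export I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=disable   # bash equivalent of the csh setenv above
mpirun -n 2 -ppn 1 ./hello_mpi    # -ppn is now honored: one rank on each node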

So is this a local configuration issue? It's easy enough to set the I_MPI_JOB_RESPECT_PROCESS_PLACEMENT env var, but I'm curious what it does and why I have to set it manually. Shouldn't Intel MPI figure out that I'm on a SLURM system and 'automatically' do the right thing without this env var?

With I_MPI_DEBUG=6 and without setting I_MPI_JOB_RESPECT_PROCESS_PLACEMENT, I got this:

$ mpirun -n 2 -ppn 1 ./hello_mpi
[0] MPI startup(): Intel(R) MPI Library, Version 2017 Update 1  Build 20161016 (id: 16418)
[0] MPI startup(): Copyright (C) 2003-2016 Intel Corporation.  All rights reserved.
[0] MPI startup(): Multi-threaded optimized library
[0] MPI startup(): shm data transfer mode
[1] MPI startup(): shm data transfer mode
[0] MPI startup(): Device_reset_idx=8
[0] MPI startup(): Allgather: 2: 0-0 & 0-2147483647
[0] MPI startup(): Allgather: 3: 1-256 & 0-2147483647
[0] MPI startup(): Allgather: 1: 257-2147483647 & 0-2147483647
[0] MPI startup(): Allgather: 3: 257-5851 & 0-2147483647
[0] MPI startup(): Allgather: 1: 5852-57344 & 0-2147483647
[0] MPI startup(): Allgather: 3: 57345-388846 & 0-2147483647
[0] MPI startup(): Allgather: 1: 388847-1453707 & 0-2147483647
[0] MPI startup(): Allgather: 3: 0-2147483647 & 0-2147483647
[0] MPI startup(): Allgatherv: 3: 0-2147483647 & 0-2147483647
[0] MPI startup(): Allreduce: 1: 0-1901 & 0-2147483647
[0] MPI startup(): Allreduce: 7: 1902-2071 & 0-2147483647
[0] MPI startup(): Allreduce: 1: 2072-32768 & 0-2147483647
[0] MPI startup(): Allreduce: 8: 32769-65536 & 0-2147483647
[0] MPI startup(): Allreduce: 1: 65537-131072 & 0-2147483647
[0] MPI startup(): Allreduce: 2: 131073-524288 & 0-2147483647
[0] MPI startup(): Allreduce: 7: 524289-1048576 & 0-2147483647
[0] MPI startup(): Allreduce: 2: 0-2147483647 & 0-2147483647
[0] MPI startup(): Alltoall: 3: 0-131072 & 0-2147483647
[0] MPI startup(): Alltoall: 4: 131073-529941 & 0-2147483647
[0] MPI startup(): Alltoall: 2: 529942-1756892 & 0-2147483647
[0] MPI startup(): Alltoall: 4: 1756893-2097152 & 0-2147483647
[0] MPI startup(): Alltoall: 3: 0-2147483647 & 0-2147483647
[0] MPI startup(): Alltoallv: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Alltoallw: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Barrier: 2: 0-2147483647 & 0-2147483647
[0] MPI startup(): Bcast: 1: 0-0 & 0-2147483647
[0] MPI startup(): Bcast: 8: 1-3938 & 0-2147483647
[0] MPI startup(): Bcast: 1: 3939-4274 & 0-2147483647
[0] MPI startup(): Bcast: 8: 4275-12288 & 0-2147483647
[0] MPI startup(): Bcast: 3: 12289-36805 & 0-2147483647
[0] MPI startup(): Bcast: 7: 36806-95325 & 0-2147483647
[0] MPI startup(): Bcast: 1: 95326-158190 & 0-2147483647
[0] MPI startup(): Bcast: 7: 158191-2393015 & 0-2147483647
[0] MPI startup(): Bcast: 1: 0-2147483647 & 0-2147483647
[0] MPI startup(): Exscan: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Gather: 3: 0-874 & 0-2147483647
[0] MPI startup(): Gather: 1: 875-2048 & 0-2147483647
[0] MPI startup(): Gather: 3: 2049-4096 & 0-2147483647
[0] MPI startup(): Gather: 1: 4097-65536 & 0-2147483647
[0] MPI startup(): Gather: 3: 65537-297096 & 0-2147483647
[0] MPI startup(): Gather: 1: 297097-524288 & 0-2147483647
[0] MPI startup(): Gather: 3: 0-2147483647 & 0-2147483647
[0] MPI startup(): Gatherv: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Reduce_scatter: 1: 0-6 & 0-2147483647
[0] MPI startup(): Reduce_scatter: 2: 0-2147483647 & 0-2147483647
[0] MPI startup(): Reduce: 1: 0-2147483647 & 0-2147483647
[0] MPI startup(): Scan: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Scatter: 3: 0-0 & 0-2147483647
[0] MPI startup(): Scatter: 1: 1-48 & 0-2147483647
[0] MPI startup(): Scatter: 3: 49-91 & 0-2147483647
[0] MPI startup(): Scatter: 0: 92-201 & 0-2147483647
[0] MPI startup(): Scatter: 3: 202-2048 & 0-2147483647
[0] MPI startup(): Scatter: 1: 2049-2147483647 & 0-2147483647
[0] MPI startup(): Scatter: 3: 2049-4751 & 0-2147483647
[0] MPI startup(): Scatter: 0: 4752-12719 & 0-2147483647
[0] MPI startup(): Scatter: 3: 12720-20604 & 0-2147483647
[0] MPI startup(): Scatter: 0: 20605-32768 & 0-2147483647
[0] MPI startup(): Scatter: 3: 0-2147483647 & 0-2147483647
[0] MPI startup(): Scatterv: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Rank    Pid      Node name          Pin cpu
[0] MPI startup(): 0       99166    ml036.localdomain  {0,1,2,3,4,5,6,7}
[0] MPI startup(): 1       99167    ml036.localdomain  {8,9,10,11,12,13,14,15}
[0] MPI startup(): Recognition=2 Platform(code=8 ippn=1 dev=1) Fabric(intra=1 inter=1 flags=0x0)
[1] MPI startup(): Recognition=2 Platform(code=8 ippn=1 dev=1) Fabric(intra=1 inter=1 flags=0x0)
[0] MPI startup(): I_MPI_DEBUG=6
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_MAP=qib0:0
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_NUM=2
[0] MPI startup(): I_MPI_PIN_MAPPING=2:0 0,1 8
hello_parallel.f: Number of tasks=  2 My rank=  1 My name=ml036.localdomain
hello_parallel.f: Number of tasks=  2 My rank=  0 My name=ml036.localdomain

 

SLURM vars

[green@ml015 ~]$ env | grep SLURM
SLURM_NTASKS_PER_NODE=16
SLURM_SUBMIT_DIR=/users/green
SLURM_JOB_ID=534349
SLURM_JOB_NUM_NODES=2
SLURM_JOB_NODELIST=ml[015,017]
SLURM_JOB_CPUS_PER_NODE=16(x2)
SLURM_JOBID=534349
SLURM_NNODES=2
SLURM_NODELIST=ml[015,017]
SLURM_TASKS_PER_NODE=16(x2)
SLURM_NTASKS=32
SLURM_NPROCS=32
SLURM_PRIO_PROCESS=0
SLURM_DISTRIBUTION=cyclic
SLURM_STEPID=0
SLURM_SRUN_COMM_PORT=41294
SLURM_PTY_PORT=43155
SLURM_PTY_WIN_COL=143
SLURM_PTY_WIN_ROW=33
SLURM_STEP_ID=0
SLURM_STEP_NODELIST=ml015
SLURM_STEP_NUM_NODES=1
SLURM_STEP_NUM_TASKS=1
SLURM_STEP_TASKS_PER_NODE=1
SLURM_STEP_LAUNCHER_PORT=41294
SLURM_SRUN_COMM_HOST=192.168.0.153
SLURM_TOPOLOGY_ADDR=ml015
SLURM_TOPOLOGY_ADDR_PATTERN=node
SLURM_TASK_PID=129118
SLURM_CPUS_ON_NODE=16
SLURM_NODEID=0
SLURM_PROCID=0
SLURM_LOCALID=0
SLURM_LAUNCH_NODE_IPADDR=192.168.0.153
SLURM_GTIDS=0
SLURM_CHECKPOINT_IMAGE_DIR=/users/green
SLURMD_NODENAME=ml015

Michael_Intel
Moderator

Hello,

The behavior of Intel MPI is as expected: it respects the job scheduler, which is SLURM in your case. In your SLURM job you requested 16 MPI ranks per node (SLURM_NTASKS_PER_NODE=16). Although you only launch 2 MPI ranks, both are executed on the first node, since that placement is consistent with what you requested from the job scheduler. Intel MPI therefore ignores the -ppn parameter and sticks with the SLURM configuration, unless you override that by setting I_MPI_JOB_RESPECT_PROCESS_PLACEMENT to 0 (or disable).
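
If the goal is one rank per node without disabling the placement respect, the allocation itself can request that layout, so that the SLURM placement and -ppn agree. A minimal sketch (the exact salloc invocation is an assumption, not taken from your session):

salloc --nodes=2 --ntasks-per-node=1     # hypothetical allocation: SLURM itself specifies 1 task per node
mpirun -n 2 ./hello_mpi                  # Intel MPI follows the scheduler placement: one rank on each node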

The reason for the change compared to older Intel MPI versions is that we observed issues where job schedulers terminated user jobs when users started claiming resources that had not been requested.

Best regards,

Michael

 

 
