Intel® MPI Library

mpirun corrupts SLURM_NNODES environment variable when run on more than 16 nodes

nickw1
Beginner

When you submit a job to run on more than 16 nodes of a Slurm cluster, the value of the SLURM_NNODES environment variable seen by the MPI processes becomes corrupted:

#!/bin/sh
#SBATCH --nodes=18 --ntasks-per-node=1
mpirun -prepend-rank /usr/bin/env | grep SLURM_NNODES

gives:

[17] SLURM_NNODES: 16
[8] SLURM_NNODES: 16
[9] SLURM_NNODES: 16
[6] SLURM_NNODES: 16
[13] SLURM_NNODES: 16
[7] SLURM_NNODES: 16
[15] SLURM_NNODES: 16
[12] SLURM_NNODES: 16
[16] SLURM_NNODES: 16
[0] SLURM_NNODES: 16
[1] SLURM_NNODES: 1
[4] SLURM_NNODES: 16
[14] SLURM_NNODES: 16
[10] SLURM_NNODES: 16
[11] SLURM_NNODES: 16
[3] SLURM_NNODES: 1
[5] SLURM_NNODES: 16
[2] SLURM_NNODES: 16


The SLURM_JOB_NUM_NODES environment variable still gives the correct value, and setting

export I_MPI_HYDRA_BRANCH_COUNT=0

works around the issue.
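
For anyone hitting the same problem, here is a minimal sketch of the job script with the workaround applied (assuming the same 18-node, one-task-per-node allocation as above; the grep pattern is only for illustration):

#!/bin/sh
#SBATCH --nodes=18 --ntasks-per-node=1

# Workaround reported above: set the Hydra branch count to 0 so that
# SLURM_NNODES is not overwritten on launches spanning more than 16 nodes.
export I_MPI_HYDRA_BRANCH_COUNT=0

# SLURM_JOB_NUM_NODES is unaffected and can be used as a fallback.
mpirun -prepend-rank /usr/bin/env | grep -E 'SLURM_NNODES|SLURM_JOB_NUM_NODES'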

TobiasK
Moderator

@nickw1 
Can you please give more information about your environment? Please also add the output of a run with I_MPI_DEBUG=10.
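
For example, a minimal sketch of how that output could be collected, assuming the same job script as in the original post:

#!/bin/sh
#SBATCH --nodes=18 --ntasks-per-node=1

# Enable verbose Intel MPI launch diagnostics for this run.
export I_MPI_DEBUG=10

mpirun -prepend-rank /usr/bin/env | grep SLURM_NNODES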
