Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.
2293 Discussions

mpirun corrupts SLURM_NNODES environment variable when run on more than 16 nodes

nickw1
Beginner
2,061 Views

When you submit a job to run on more than 16 nodes of a Slurm cluster, the value of the SLURM_NNODES environment variable seen by the MPI processes becomes corrupted:

#!/bin/sh
#SBATCH --nodes=18 --ntasks-per-node=1
mpirun -prepend-rank /usr/bin/env | grep SLURM_NNODES

gives:

[17] SLURM_NNODES: 16
[8] SLURM_NNODES: 16
[9] SLURM_NNODES: 16
[6] SLURM_NNODES: 16
[13] SLURM_NNODES: 16
[7] SLURM_NNODES: 16
[15] SLURM_NNODES: 16
[12] SLURM_NNODES: 16
[16] SLURM_NNODES: 16
[0] SLURM_NNODES: 16
[1] SLURM_NNODES: 1
[4] SLURM_NNODES: 16
[14] SLURM_NNODES: 16
[10] SLURM_NNODES: 16
[11] SLURM_NNODES: 16
[3] SLURM_NNODES: 1
[5] SLURM_NNODES: 16
[2] SLURM_NNODES: 16


The SLURM_JOB_NUM_NODES environment variable still gives the correct value, and setting:

export I_MPI_HYDRA_BRANCH_COUNT=0

works around the issue.
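For completeness, the workaround slots into the original job script like this (a sketch reusing the same flags as the report above; it requires an actual Slurm cluster with Intel MPI to run):

```shell
#!/bin/sh
#SBATCH --nodes=18 --ntasks-per-node=1
# Workaround from the report: disable hydra branching before launching,
# so SLURM_NNODES reaches every rank uncorrupted.
export I_MPI_HYDRA_BRANCH_COUNT=0
mpirun -prepend-rank /usr/bin/env | grep SLURM_NNODES
```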

0 Kudos
1 Solution
TobiasK
Moderator
94 Views

@nickw1

The root cause of this issue lies in srun. For the upcoming release, Intel MPI 2021.18 (included in oneAPI 2026.0), we decided to disable hydra branching when Slurm is used: I_MPI_HYDRA_BRANCH_COUNT=0 will always be set if Slurm is detected and the variable has not been set by the user. This will, in effect, fix the issue.
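The defaulting behavior described above can be sketched as follows (purely illustrative, not Intel MPI's actual implementation; the SLURM_JOB_ID value is a simulated stand-in for a real Slurm allocation):

```shell
# Illustrative sketch only -- not Intel MPI's actual code.
# Simulate a Slurm allocation for demonstration purposes.
SLURM_JOB_ID=12345

# Default hydra branching off when Slurm is detected, unless the user
# has already set I_MPI_HYDRA_BRANCH_COUNT explicitly.
if [ -n "${SLURM_JOB_ID:-}" ] && [ -z "${I_MPI_HYDRA_BRANCH_COUNT:-}" ]; then
    export I_MPI_HYDRA_BRANCH_COUNT=0
fi
echo "I_MPI_HYDRA_BRANCH_COUNT=${I_MPI_HYDRA_BRANCH_COUNT}"
```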


View solution in original post

2 Replies
TobiasK
Moderator
2,035 Views

@nickw1
Can you please give more information about your environment? Please also add the output of a run with I_MPI_DEBUG=10.
