When you submit a job to run on more than 16 nodes of a Slurm cluster, the value of the SLURM_NNODES environment variable in the MPI processes becomes corrupted:
#!/bin/sh
#SBATCH --nodes=18 --ntasks-per-node=1
mpirun -prepend-rank /usr/bin/env | grep SLURM_NNODES
gives:
[17] SLURM_NNODES: 16
[8] SLURM_NNODES: 16
[9] SLURM_NNODES: 16
[6] SLURM_NNODES: 16
[13] SLURM_NNODES: 16
[7] SLURM_NNODES: 16
[15] SLURM_NNODES: 16
[12] SLURM_NNODES: 16
[16] SLURM_NNODES: 16
[0] SLURM_NNODES: 16
[1] SLURM_NNODES: 1
[4] SLURM_NNODES: 16
[14] SLURM_NNODES: 16
[10] SLURM_NNODES: 16
[11] SLURM_NNODES: 16
[3] SLURM_NNODES: 1
[5] SLURM_NNODES: 16
[2] SLURM_NNODES: 16
The SLURM_JOB_NUM_NODES environment variable gives the correct value and setting:
export I_MPI_HYDRA_BRANCH_COUNT=0
works around the issue.
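If changing the Hydra setting is not an option, code that needs the node count can read SLURM_JOB_NUM_NODES first and only fall back to SLURM_NNODES. The following is a minimal sketch of that idea, not a tested fix: the function name job_num_nodes is hypothetical, and the default of 1 when neither variable is set is an assumption for runs outside Slurm.

```shell
#!/bin/sh
# Hypothetical helper: report the node count of the current job.
# Prefers SLURM_JOB_NUM_NODES, which showed the correct value in the runs above,
# then falls back to SLURM_NNODES, then to 1 (assumption: not running under Slurm).
job_num_nodes() {
    echo "${SLURM_JOB_NUM_NODES:-${SLURM_NNODES:-1}}"
}
```

A wrapper script started by mpirun could source this and call job_num_nodes instead of reading SLURM_NNODES directly.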
1 Reply
@nickw1
Can you please give more information about your environment? Please also add the output of a run with I_MPI_DEBUG=10 set.
