Intel® MPI Library

IntelMPI not following machinefile with slurm

mpiuser1

Hi Intel community,

I am using Intel MPI 2019.8 with Slurm. When I run with a machinefile, the processes are not placed on the assigned nodes. For example, all the processes assigned to node1 end up on node2, and all the processes assigned to node2 end up on another node. How can we make Intel MPI follow the machinefile exactly? I am attaching the sample program we use to test the machinefile, along with the Slurm script.

Thanks,

Erica

PrasanthD_intel
Moderator

Hi Erica,


Thanks for reporting this to us.

We have observed similar behaviour under SLURM. Process placement is correct under other job schedulers (we checked PBS).

So, we are transferring this to our internal team for better support.


Regards

Prasanth


James_T_Intel
Moderator

When I tested with 2019 Update 8 on an internal cluster, I saw the expected behavior. Could you please send the full output with I_MPI_DEBUG=16?


mpiuser1

Hi James,

Here is the output, along with the corresponding machinefile.

Thanks,

Erica

mpiuser1

Hi James, 

Do you know why it differs between your run internally and our run? Is there any setting we're missing for our run?

Thanks,

Erica

mpiuser1

Hi James,

Could you share your slurm job script with us so we can test it? Which version of slurm did you test it on?

Thanks,

Erica

James_T_Intel
Moderator

I apologize for dropping this. Here is the script I used for testing. I randomized the order of the hosts to confirm that the machinefile is being used rather than the SLURM nodelist. I tested on a customized version of SLURM 20.11.7, and the output matches the order in the machinefile.


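#!/bin/bash
#SBATCH -N 8

# Write the 8 allocated hosts twice, each time in random order (16 lines total).
scontrol show hostnames "$SLURM_JOB_NODELIST" | shuf > machinefile.txt
scontrol show hostnames "$SLURM_JOB_NODELIST" | shuf >> machinefile.txt

# Load the Intel oneAPI environment.
source /opt/intel/oneAPI/latest/setvars.sh

# Launch 16 ranks, placed according to the machinefile.
mpirun -n 16 -machinefile machinefile.txt -genv I_MPI_DEBUG 3 -bootstrap ssh ./a.out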
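For anyone who wants to check the result mechanically rather than by eye, here is a minimal sketch of one way to compare a machinefile against the placement actually observed. It is not part of the script above. It assumes machinefile lines of the form "host" or "host:count" (a missing count is treated as 1), and it assumes you already have a file with one hostname per line in rank order, for example collected from the output of your own test program. The function names expand_machinefile and check_placement are made up for this illustration.

```shell
#!/bin/bash
# Sketch only: helper names and the input file layout are assumptions.

# Print one hostname per rank, in rank order, for the given machinefile.
# Blank lines and lines starting with '#' are skipped.
expand_machinefile() {
    awk -F: 'NF && $1 !~ /^#/ {
        n = ($2 == "" ? 1 : $2)          # "host" alone counts as one rank
        for (i = 0; i < n; i++) print $1
    }' "$1"
}

# Compare the expected placement with the observed one.
# $1: machinefile
# $2: file with one observed hostname per line, in rank order
check_placement() {
    if expand_machinefile "$1" | diff - "$2" > /dev/null; then
        echo "placement matches machinefile"
    else
        echo "placement differs from machinefile"
        return 1
    fi
}
```

With the machinefile from the job script, a call would look like `check_placement machinefile.txt observed.txt`, where observed.txt is the hypothetical per-rank host list described above.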



James_T_Intel
Moderator

I am closing the Intel support case related to this thread. Everything appears to be functioning as expected in multiple test scenarios. Any further replies on this thread will be considered community only. If you require additional support assistance on this issue, please start a new thread with current details and logs.

