mpiuser1
Beginner
607 Views

IntelMPI not following machinefile with slurm

Hi Intel community,

I am using Intel MPI 2019.8 with Slurm. When I run with a machinefile, the processes are not placed on the nodes the machinefile assigns. For example, all the processes assigned to node1 end up on node2, and all the processes assigned to node2 end up on another node. How can we make it follow the machinefile exactly? I am attaching the sample program we use to test the machinefile, along with the Slurm script.

Thanks,

Erica

7 Replies
PrasanthD_intel
Moderator
594 Views

Hi Erica,


Thanks for reporting this to us.

We have observed similar behaviour in SLURM. The process placement is accurate for other job schedulers (we have checked for PBS).

So, we are transferring this to our internal team for better support.


Regards

Prasanth


James_T_Intel
Moderator
581 Views

When I tested with 2019 Update 8 on an internal cluster, I saw the expected behavior. Can you please send the full output with I_MPI_DEBUG=16?


mpiuser1
Beginner
573 Views

Hi James,

Here is the output with corresponding machinefile.

Thanks,

Erica

mpiuser1
Beginner
533 Views

Hi James, 

Do you know why it differs between your run internally and our run? Is there any setting we're missing for our run?

Thanks,

Erica

mpiuser1
Beginner
531 Views

Hi James,

Could you share your slurm job script with us so we can test it? Which version of slurm did you test it on?

Thanks,

Erica

James_T_Intel
Moderator
220 Views

I apologize for dropping this. Here is the script I used for testing. I randomized the order of hosts in order to ensure that the machinefile is being used rather than the SLURM nodelist. Tested on a customized version of SLURM 20.11.7. The output matches the order in the machinefile.


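#!/bin/bash
#SBATCH -N 8

# Write the allocated hosts in random order, twice, so the machinefile
# has 16 entries and its order differs from the SLURM nodelist.
scontrol show hostnames "$SLURM_JOB_NODELIST" | shuf > machinefile.txt
scontrol show hostnames "$SLURM_JOB_NODELIST" | shuf >> machinefile.txt

# Load the Intel oneAPI environment.
source /opt/intel/oneAPI/latest/setvars.sh

# Launch 16 ranks over ssh; I_MPI_DEBUG 3 prints the rank placement.
mpirun -n 16 -machinefile machinefile.txt -genv I_MPI_DEBUG 3 -bootstrap ssh ./a.out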
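
For anyone who wants to check the result mechanically rather than by eye, here is a minimal sketch of a comparison helper. It is not part of the script above, and it rests on assumptions: the function names expand_machinefile and check_placement are made up for illustration, the machinefile is assumed to contain lines of the form "host" or "host:count" (a bare host counted as one rank), and the observed placement is assumed to have been saved by you as plain "rank host" lines, for example from a test program that prints its rank and hostname.

```shell
# Expand a machinefile into one hostname per expected rank, in order.
# Assumption: each line is "host" or "host:count"; blank lines and
# lines starting with # are skipped.
expand_machinefile() {
    awk -F: '
        /^[ \t]*(#|$)/ { next }
        {
            gsub(/[ \t\r]/, "", $1)
            n = ($2 == "" ? 1 : $2 + 0)
            for (i = 0; i < n; i++) print $1
        }' "$1"
}

# Return 0 if the observed placement matches the machinefile, 1 otherwise.
# Assumption: the observed file holds "rank host" lines in any order
# (a hypothetical format, not Intel MPI debug output).
check_placement() {
    expected=$(expand_machinefile "$1")
    actual=$(sort -n -k1,1 "$2" | awk '{ print $2 }')
    [ "$expected" = "$actual" ]
}
```

A possible use would be `check_placement machinefile.txt observed.txt && echo "placement matches"`, where observed.txt is whatever rank-to-host listing you collected from the run.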



James_T_Intel
Moderator
135 Views

I am closing the Intel support case related to this thread. Everything appears to be functioning as expected in multiple test scenarios. Any further replies on this thread will be considered community only. If you require additional support assistance on this issue, please start a new thread with current details and logs.

