Hi Intel community,
I am using Intel MPI 2019.8 with SLURM. I have noticed that when running with a machinefile, process placement does not follow the assigned nodes exactly. For example, all the processes assigned to node1 end up on node2, and all the processes assigned to node2 end up on another node. How can we make mpirun follow the machinefile exactly? I am attaching the sample program we are running to test the machinefile, along with the SLURM script.
Thanks for reporting this to us.
We have observed similar behaviour with SLURM. The process placement is accurate with other job schedulers (we have checked with PBS).
So, we are transferring this to our internal team for better support.
I apologize for dropping this. Here is the script I used for testing. I randomized the order of hosts in order to ensure that the machinefile is being used rather than the SLURM nodelist. Tested on a customized version of SLURM 20.11.7. The output matches the order in the machinefile.
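#!/bin/bash
#SBATCH -N 8
# List every allocated host twice, each pass in shuffled order (16 lines for 16 ranks).
scontrol show hostnames "$SLURM_JOB_NODELIST" | shuf > machinefile.txt
scontrol show hostnames "$SLURM_JOB_NODELIST" | shuf >> machinefile.txt
# Use the ssh bootstrap; I_MPI_DEBUG 3 prints the rank-to-node mapping.
mpirun -n 16 -machinefile machinefile.txt -genv I_MPI_DEBUG 3 -bootstrap ssh ./a.out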
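For anyone who wants to check the placement mechanically rather than by eye, here is a rough sketch of how the comparison could be scripted. It rests on several assumptions: the function name check_placement is made up for illustration, the machinefile has one plain hostname per line (no host:count entries, no blank lines), rank i is expected on line i+1 of the machinefile, and the observed placement is available as a file of "rank hostname" lines (for example, written by a small test program). That input format is hypothetical and is not something Intel MPI emits by itself.

```shell
# check_placement MACHINEFILE PLACEMENT
# Hypothetical helper: compares observed rank placement with the machinefile.
#   MACHINEFILE: one hostname per line; line i+1 is the expected host of rank i.
#   PLACEMENT:   lines of "rank hostname" (assumed format, any order).
# Returns 0 if every rank is on its expected host, 1 otherwise.
check_placement() {
    awk '
        # First file (machinefile): remember the expected host for each rank.
        NR == FNR { expected[FNR - 1] = $1; n = FNR; next }
        # Second file (placement): compare each observed host with the expectation.
        {
            seen++
            if (expected[$1] != $2) {
                printf "rank %s: expected %s, got %s\n", $1, expected[$1], $2
                bad = 1
            }
        }
        END {
            # Every machinefile line should correspond to exactly one reported rank.
            if (seen != n) {
                printf "expected %d ranks, saw %d\n", n, seen
                bad = 1
            }
            exit bad + 0
        }
    ' "$1" "$2"
}
```

Something like `check_placement machinefile.txt placement.txt && echo "placement matches"` would then report any rank that is not on the host listed for it.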
I am closing the Intel support case related to this thread. Everything appears to be functioning as expected in multiple test scenarios. Any further replies on this thread will be considered community only. If you require additional support assistance on this issue, please start a new thread with current details and logs.