Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

Intel MPI incorrectly oversubscribing nodes with machinefile

mpiuser1
Beginner
2,363 Views

Hi community, 

I am using a machinefile to experiment with assigning processes to different nodes. It works as expected with 4 nodes (144 processes), but not with 8 nodes (288 processes).

In the 8-node case, I checked the rank assignment with I_MPI_DEBUG: some nodes are oversubscribed and others undersubscribed, and the placement does not follow the machinefile at all. For example, even though each node is listed only 36 times in the machinefile, the debug output shows one node being assigned 56 ranks. Running with I_MPI_JOB_RESPECT_PLACEMENT=off did not help. I am launching with mpirun under Slurm.

Does anyone know what might be the problem?
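For context, here is a rough sketch of how such a machinefile can be generated and sanity-checked; the node names node0 through node7 are placeholders for the cluster's actual hostnames:

```shell
# Generate a machinefile listing each of 8 placeholder nodes 36 times
# (288 lines total), i.e. 36 intended ranks per node.
for i in 0 1 2 3 4 5 6 7; do
  for j in $(seq 36); do
    echo "node$i"
  done
done > machinefile.txt

# Sanity checks: total line count and per-node counts.
wc -l < machinefile.txt          # 288
sort machinefile.txt | uniq -c   # each node listed 36 times
```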

0 Kudos
9 Replies
AbhishekD_Intel
Moderator
2,347 Views

Hi Erica,


Please check the environment variable you have used; the correct name is I_MPI_JOB_RESPECT_PROCESS_PLACEMENT. Set it to no/off, then try experimenting with your application again and let us know whether it works or still oversubscribes.
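A minimal sketch of the suggested setting (the application name below is a placeholder):

```shell
# Tell Intel MPI's mpirun not to defer to the job scheduler's process
# placement, so the machinefile ordering is honored.
export I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off
echo "$I_MPI_JOB_RESPECT_PROCESS_PLACEMENT"   # off

# Then relaunch, e.g.:
#   mpirun -machinefile machinefile.txt -n 288 ./your_app
```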


Warm Regards,

Abhishek


0 Kudos
mpiuser1
Beginner
2,342 Views

Hi Abhishek,

Thanks for your reply. I tried setting the following

export I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off

but the results are still the same (it is still oversubscribing). What else do you suggest?

Thanks,
Erica

0 Kudos
mpiuser1
Beginner
2,325 Views

Hi, just checking in to see if you have any other suggestions for fixing the oversubscription issue. Thank you!

0 Kudos
AbhishekD_Intel
Moderator
2,318 Views

Hi Erica,

 

Sorry for the delay. I tried running some examples with 8 nodes and 288 processes (using both a machinefile and a hostfile), but I did not see any oversubscription or undersubscription in my case; each node ran exactly 36 ranks. I tried this on the latest MPI version, i.e., 2019 Update 7.

 

So, can you please give us the MPI version you are using and a debug log showing the details of the oversubscription and undersubscription?

Also, try using a hostfile with -n 288 -ppn 36, check whether you get the same oversubscription, and update us with your findings.
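A quick sketch of that isolation test (hostnames are placeholders; the actual launch of course requires the cluster):

```shell
# Build a hostfile with one line per node; with this form, rank counts
# come from -n/-ppn rather than from repeated machinefile entries.
printf 'node%d\n' 0 1 2 3 4 5 6 7 > hostfile.txt
wc -l < hostfile.txt   # 8

# Launch on the cluster, expecting exactly 36 ranks per host:
#   mpirun -hostfile hostfile.txt -n 288 -ppn 36 hostname
```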

 

 

Warm Regards,

Abhishek

 

0 Kudos
mpiuser1
Beginner
2,300 Views

Hi Abhishek,

I am using Intel MPI 2019.7.217.

Running just mpirun -hostfile hostfile.txt -n 288 -ppn 36 hostname produced no oversubscription.

Running either mpirun -hostfile hostfile.txt -machinefile machinefile.txt -n 288 -ppn 36 hostname or mpirun -machinefile machinefile.txt -n 288 -ppn 36 hostname produced both over- and undersubscription.

I attached the machinefile below. It is randomly generated; I've tried different examples, but there is always strange behavior (except when I list node0 36 times in a row, then node1 36 times in a row, and so on up to node7).
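To illustrate the kind of file I mean (node names are placeholders, and shuf is from GNU coreutils), a randomly ordered machinefile with correct per-node counts can be produced like this:

```shell
# Each of 8 placeholder nodes appears exactly 36 times, but in shuffled
# order rather than contiguous per-node blocks.
for i in 0 1 2 3 4 5 6 7; do
  for j in $(seq 36); do echo "node$i"; done
done | shuf > machinefile.txt

# The ordering is random, but the per-node counts are still 36 each:
sort machinefile.txt | uniq -c
```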

When I run with the --verbose flag, this is what I see:

[proxy:0:3@node3] Warning - oversubscription detected: 44 processes will be placed on 36 cores
[proxy:0:2@node2] Warning - oversubscription detected: 60 processes will be placed on 36 cores
[proxy:0:4@node4] Warning - oversubscription detected: 37 processes will be placed on 36 cores
[proxy:0:1@node1] Warning - oversubscription detected: 60 processes will be placed on 36 cores

Any help would be much appreciated!

Thanks!

0 Kudos
AbhishekD_Intel
Moderator
2,257 Views

Hi Erica,


Please give us a debug log and the CPU details of your nodes so that we can get more insight into the problem.


Thank You.


0 Kudos
AbhishekD_Intel
Moderator
2,254 Views

Hi Erica,


We ran the MPI code with 288 ranks and your machinefile on Intel MPI 2019.7.217 and reproduced the oversubscription and undersubscription on some nodes.


We found that this issue is fixed in a newer release, Intel MPI 2019.8.254 (2019 Update 8). With Update 8 you will not see any oversubscription or undersubscription.


Please update to Intel MPI 2019 Update 8 to resolve this issue, and let us know if the problem persists.



Warm Regards,

Abhishek


0 Kudos
mpiuser1
Beginner
2,247 Views

Great, updating it to 2019.8 solved it! Thanks!

0 Kudos
AbhishekD_Intel
Moderator
2,241 Views

Hi Erica,


Thank you for the confirmation; glad to hear your issue is resolved. We won't be monitoring this thread anymore, so kindly raise a new thread if you need further assistance.



Warm Regards,

Abhishek


0 Kudos