Hi community,
I am using a machinefile to experiment with assigning processes to specific nodes, and I have noticed that while it works at 4 nodes (144 processes), it doesn't work as expected at 8 nodes (288 processes).
In the 8-node case, I checked the rank assignment with I_MPI_DEBUG, and it oversubscribes some nodes while undersubscribing others, not following the machine file at all. For example, even though each node is listed only 36 times in the machinefile, the resulting debug output shows one node being assigned 56 ranks. I've tried running it with I_MPI_JOB_RESPECT_PLACEMENT=off, but it still doesn't work. I am using Slurm and mpirun to run the test.
Does anyone know what might be the problem?
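For reference, this is roughly how I am launching it inside the Slurm allocation (the node names here are placeholders for my actual hosts):
# machinefile.txt: each of the 8 nodes appears 36 times, in a shuffled order, e.g.
#   node3
#   node0
#   node5
#   ...
mpirun -machinefile machinefile.txt -n 288 -ppn 36 hostname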
Hi Erica,
Please check the environment variable you have used: set I_MPI_JOB_RESPECT_PROCESS_PLACEMENT to no/off, then try experimenting with your application and let us know whether it works or is still oversubscribing.
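For example (a minimal sketch; ./app stands in for your own binary), you can either export the variable before launching or pass it per invocation with -genv:
export I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off
mpirun -machinefile machinefile.txt -n 288 -ppn 36 ./app
# or, equivalently, only for this run:
mpirun -genv I_MPI_JOB_RESPECT_PROCESS_PLACEMENT off -machinefile machinefile.txt -n 288 -ppn 36 ./app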
Warm Regards,
Abhishek
Hi Abhishek,
Thanks for your reply. I tried setting the following
export I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off
but the results are still the same (it is still oversubscribing). What else do you suggest?
Thanks,
Erica
Hi, just checking in to see if you have any other suggestions for how to fix the oversubscription issue. Thank you!
Hi Erica,
Sorry for the delay. I tried running some examples with 8 nodes and 288 processes (using both a machinefile and a hostfile), but I didn't see any oversubscription or under-subscription in my case; each node ran exactly 36 ranks. I tried this on the latest MPI version, i.e., 2019 Update 7.
So, can you please give us the details of the MPI version you are using and the debug log showing the oversubscription and under-subscription?
Also, try using a hostfile with -n 288 -ppn 36, check whether you still get the same oversubscription, and update us with your findings.
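Something along these lines (a sketch; the hostnames are placeholders and ./app stands in for your application):
# hostfile.txt: one line per node, no repetition
#   node0
#   node1
#   ...
#   node7
mpirun -hostfile hostfile.txt -n 288 -ppn 36 ./app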
Warm Regards,
Abhishek
Hi Abhishek,
I am using Intel MPI 2019.7.217.
I tried running with just mpirun -hostfile hostfile.txt -n 288 -ppn 36 hostname and there was no oversubscription.
I tried with both mpirun -hostfile hostfile.txt -machinefile machinefile.txt -n 288 -ppn 36 hostname and mpirun -machinefile machinefile.txt -n 288 -ppn 36 hostname and there was over- and under-subscription.
I attached the machine file below. It is randomly generated, and I've tried different examples, but there is always strange behavior (except when I list node0 36 times in a row, node1 36 times in a row, and so on up to node7).
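To be concrete, the two layouts look roughly like this (node names shortened; each node appears 36 times in total in both cases):
# blocked layout - the only one that behaves correctly for me
node0
node0
...   (36 lines of node0, then 36 lines of node1, and so on up to node7)
# randomly generated layout - triggers the over/under-subscription
node5
node2
node7
...   (the same 288 lines, shuffled)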
When I run with the --verbose flag, this is what I see:
[proxy:0:3@node3] Warning - oversubscription detected: 44 processes will be placed on 36 cores
[proxy:0:2@node2] Warning - oversubscription detected: 60 processes will be placed on 36 cores
[proxy:0:4@node4] Warning - oversubscription detected: 37 processes will be placed on 36 cores
[proxy:0:1@node1] Warning - oversubscription detected: 60 processes will be placed on 36 cores
Any help would be much appreciated!
Thanks!
Hi Erica,
Please give us a debug log and the CPU details of your nodes so that we can get more insight into the problem.
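For instance, something like this should capture both (a sketch; I_MPI_DEBUG=5 is just one reasonable verbosity level, and the file names are arbitrary):
export I_MPI_DEBUG=5    # prints the rank-to-node mapping at startup
mpirun -machinefile machinefile.txt -n 288 -ppn 36 hostname > debug.log 2>&1
lscpu > node_cpu.txt    # CPU details; run on one node of each type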
Thank You.
Hi Erica,
We tried running the MPI code with 288 ranks and your provided machine file on Intel MPI 2019.7.217 and reproduced the oversubscription and under-subscription on some nodes.
We found that this issue has been fixed in the latest MPI version, i.e., Intel MPI 2019.8.254 (2019 Update 8); with Update 8 there is no oversubscription or under-subscription.
Please update your MPI version to 2019 Update 8 to resolve this issue, and let us know if you still see the same behavior.
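One quick way to confirm that the job is picking up the new library after updating (a sketch; the install path is an assumption and will depend on your system):
source /opt/intel/impi/2019.8.254/intel64/bin/mpivars.sh   # assumed install location; adjust to yours
mpirun --version                                           # should now report 2019 Update 8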
Warm Regards,
Abhishek
Hi Erica,
Thank you for your confirmation. Glad to know that your issue is resolved. We won't be monitoring this thread anymore. Kindly raise a new thread if you need further assistance.
Warm Regards,
Abhishek