I would like to ask about the maximum number of MPI processes supported in Intel MPI 2019.0.5.
Case in point: we were conducting massively parallel tests on a Knights Landing cluster. The total number of cores was 4096 x 64 ~ 260K, at which point the test hung without updating the output file for an extended period of time. We suspect this may be due to an intrinsic limitation of Intel MPI.
Clarification is much appreciated.
By default, Intel MPI does not limit the number of processes you launch.
Please refer to section 2.3.1 of the Intel MPI Developer Reference.
Could you please provide more details about the error you were getting so that we can help you?
Thanks for the clarification. The following error messages were observed at 512 nodes, which is approximately 32K processes.
At the moment, we cannot push beyond this limit.
[proxy:0:0@node2948] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hydra_sock_intel.c:362): write error (Broken pipe)
[mpiexec@node2948] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hydra_sock_intel.c:362): write error (Bad file descriptor)
[mpiexec@node2948] wait_proxies_to_terminate (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:528): downstream from host node4001 exited with status 255
Can you provide details of the application that you are running?
Can you please tell us how long the Intel MPI job ran before giving this error?
In some cases, the job scheduler has a time limit after which it kills the job. So when do you get the errors: immediately after you give the launch command, or after a delay of, say, one hour after the program is launched?
You can go through this link for more details regarding this error:
Can you tell us which application you were using for this test?
Please check whether there are any limitations set in your system.
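One way to carry out the check suggested above is to print the per-process resource limits on a compute node. The following is only a minimal sketch using Python's standard `resource` module; which limits actually matter for a given Intel MPI run is an assumption here, and the selection below (open files, user processes, locked memory, stack size) is illustrative rather than a list taken from the Intel documentation.

```python
import resource

# Limits that often matter when one node hosts many MPI ranks.
# Which of these is relevant to a particular job is an assumption.
LIMITS = {
    "open files (RLIMIT_NOFILE)": resource.RLIMIT_NOFILE,
    "user processes (RLIMIT_NPROC)": resource.RLIMIT_NPROC,
    "locked memory (RLIMIT_MEMLOCK)": resource.RLIMIT_MEMLOCK,
    "stack size (RLIMIT_STACK)": resource.RLIMIT_STACK,
}


def fmt(value):
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def read_limits():
    """Return {label: (soft, hard)} for the current process."""
    return {label: resource.getrlimit(res) for label, res in LIMITS.items()}


if __name__ == "__main__":
    for label, (soft, hard) in read_limits().items():
        print(f"{label}: soft={fmt(soft)} hard={fmt(hard)}")
```

Running this inside a batch job rather than on a login node is more informative, because the scheduler may apply different limits to job processes than to interactive shells.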
Could you please let us know if your issue is resolved?
If not, do let us know so that we can help you further.
I apologize for my bad form.
Since the forum transition, I have been waiting for comments from Intel staff as well. The last notification email I received dates back to June 25th, as you can see in the attachments. Thus, I completely missed your replies until now.
The link provided by Goutham helped us narrow the issue down to the out-of-memory (OOM) manager, since VASP5 crashes during the initialization stage. There was no time limit imposed on the scheduler side.
We are monitoring the job. If no issue arises, I will accept Goutham's answer as the solution and close the thread.
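For anyone who wants to confirm an OOM diagnosis like the one above on their own nodes, here is a minimal sketch that filters kernel log text for OOM-killer messages. The patterns are assumptions based on the wording the Linux kernel typically uses ("invoked oom-killer", "Out of memory: Kill process" or "Killed process"); the exact text varies between kernel versions, and reading `dmesg` may require elevated privileges on some systems.

```python
import re
import subprocess

# Phrases the Linux kernel typically logs when the OOM killer acts.
# Exact wording differs between kernel versions, so this is an assumption.
OOM_PATTERN = re.compile(
    r"invoked oom-killer|Out of memory: Kill(ed)? process", re.IGNORECASE
)


def find_oom_lines(log_lines):
    """Return the log lines that look like OOM-killer messages."""
    return [line.rstrip("\n") for line in log_lines if OOM_PATTERN.search(line)]


if __name__ == "__main__":
    # Read the kernel ring buffer; the output is empty if access is denied.
    result = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=False
    )
    for line in find_oom_lines(result.stdout.splitlines()):
        print(line)
```

The check has to run on the node where the rank died (node4001 in the log above, for example), since each node keeps its own kernel log.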