I launch a parallel program with Intel MPI 4.1.3.047, running 240 processes on 10 compute nodes. Every time I launch the program, the executables start on every node after 2-3 seconds. However, the CPU usage of every process is 0% and the program appears to be waiting for something. The wait can last up to 10 minutes. To find out where the program is waiting, I added the following test code to my program:
    program test
      ! define variables
      write(*,*) 1
      call MPI_Init(ierr)
      write(*,*) 2
      comm = MPI_COMM_WORLD
      call MPI_COMM_SIZE(comm, mysize, ierr)
      write(*,*) 3
      call MPI_COMM_RANK(comm, myid, ierr)
      if (myid == 0) write(*,*) 4
      ...
    end
The number 1 is printed quickly (about 2-3 seconds after launching the program). However, it then takes about 10 minutes before the number 2 is printed. So my question is: what causes MPI_Init to take so long?
I am still trying to solve the problem. I forgot to mention that I have created a domain in my cluster that includes 10 nodes: N01, N02, ..., N10. The IP addresses of these nodes are:
I installed Windows 2012 HPC on N05 (10.0.0.1) and N06 (10.0.0.6), and the head node is N05. With further testing I found that if I launch the processes without N05, i.e., the head node, they start very quickly (about 3 seconds after I enter the command line). But if the head node is included, the wait time is more than 10 minutes. What could cause this problem?
You are using a pretty old version of the Intel MPI Library (4.1.3). Is it possible for you to switch to the latest one?
This delay may be caused by some specific network settings. Check the connections between the compute nodes (for example, with the ping utility).
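As a rough sketch of such a check (not an Intel tool, just a small Python script), the following times the name lookup for each node and then pings it. The node names N01 to N10 come from this thread; the helper names `ping_command`, `timed_resolve`, `check_node` and `report` are made up for illustration. A lookup that takes many seconds for one node (for example the head node) would be worth investigating, although this alone does not prove it is the cause of the MPI_Init delay.

```python
import platform
import socket
import subprocess
import time

# Node names as given in the thread (N01 ... N10); adjust to your cluster.
NODES = ["N%02d" % i for i in range(1, 11)]


def ping_command(host, count=2, system=None):
    """Build a ping command line: Windows takes -n for the count, most other systems take -c."""
    system = system or platform.system()
    flag = "-n" if system == "Windows" else "-c"
    return ["ping", flag, str(count), host]


def timed_resolve(host):
    """Return (addresses, seconds) for a name lookup; addresses is empty on failure."""
    start = time.perf_counter()
    try:
        infos = socket.getaddrinfo(host, None)
        addrs = sorted({info[4][0] for info in infos})
    except socket.gaierror:
        addrs = []
    return addrs, time.perf_counter() - start


def check_node(host, count=2, timeout=30):
    """Resolve one host, then ping it, and return a small report dictionary."""
    addrs, lookup_s = timed_resolve(host)
    reachable = False
    if addrs:
        try:
            result = subprocess.run(
                ping_command(host, count),
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout,
            )
            # The exit code is only a rough indicator; read the ping output
            # by hand if a result looks suspicious.
            reachable = result.returncode == 0
        except (subprocess.TimeoutExpired, OSError):
            reachable = False
    return {
        "host": host,
        "addresses": addrs,
        "lookup_seconds": round(lookup_s, 3),
        "reachable": reachable,
    }


def report(nodes=NODES):
    """Print one line per node; run this from the node where the job is launched."""
    for node in nodes:
        print(check_node(node))
```

Calling `report()` on the launch node, and again on the head node, shows whether every node name resolves to the expected address and how long each lookup takes. Several addresses listed for one name, or a slow lookup, would point at the network settings of that node.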
Thank you very much for your kind reply. I have tested many times and found that the latest version doesn't work. Only version 4.1.3.047 works for me. See here:
Could you please give more details on how to check the connections?