Hybrid MPI/OpenMP: program seems to stall in non-blocking communications
I have an MPI Fortran 90 CFD application, parallelized in X-Y (Cartesian 2D topology), that works well, and I have decided to parallelize it in Z using OpenMP. With the MPI 2D topology, each subdomain may have up to 8 neighbours; there is no periodicity. That is: NW NN NE, WW ME EE, SW SS SE, with the convention that NW is North-West, SE is South-East, and so on. ME is equal to my_MPI_Rank2d, the MPI rank of the current process. my_OMP_Thd contains the OpenMP rank of each thread in the thread team of each MPI process.
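For reference, here is a minimal sketch of how the 8 neighbour ranks can be obtained in such a non-periodic 2D Cartesian topology (the names comm2d, nbr, and find_neighbours are illustrative, not from the actual code):

```fortran
! Sketch: 8 neighbour ranks in a non-periodic 2D Cartesian communicator.
! Out-of-grid neighbours are mapped to MPI_PROC_NULL.
subroutine find_neighbours(comm2d, nbr)
  use mpi
  implicit none
  integer, intent(in)  :: comm2d
  integer, intent(out) :: nbr(8)          ! NW NN NE WW EE SW SS SE
  integer :: dims(2), coords(2), c(2), ierr, k
  logical :: periods(2)
  integer, parameter :: sh(2,8) = reshape( &
       [-1,-1, -1,0, -1,1,  0,-1,  0,1,  1,-1, 1,0, 1,1], [2,8])

  call MPI_Cart_get(comm2d, 2, dims, periods, coords, ierr)
  do k = 1, 8
     c = coords + sh(:,k)
     if (any(c < 0) .or. any(c >= dims)) then
        nbr(k) = MPI_PROC_NULL            ! outside the non-periodic grid
     else
        call MPI_Cart_rank(comm2d, c, nbr(k), ierr)
     end if
  end do
end subroutine find_neighbours
```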
A call to MPI_Init_thread gives me back MPI_THREAD_MULTIPLE as the provided level of thread support in MPI.
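A minimal sketch of this initialization check, verifying that the library really grants the requested level (program and variable names are illustrative):

```fortran
! Sketch: request MPI_THREAD_MULTIPLE and verify what is actually
! provided.  Making MPI calls from several threads when 'provided' is
! lower than requested is a classic source of stalls.
program init_check
  use mpi
  implicit none
  integer :: required, provided, ierr

  required = MPI_THREAD_MULTIPLE
  call MPI_Init_thread(required, provided, ierr)
  if (provided < required) then
     print *, 'provided thread support level is only ', provided
     call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
  end if
  call MPI_Finalize(ierr)
end program init_check
```

Note also that, if I remember correctly, Intel MPI needs the thread-safe library to be linked (e.g. the -mt_mpi option) for MPI_THREAD_MULTIPLE to be effective.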
MPI communications are non-blocking (MPI_Isend, MPI_Irecv) and are all placed in a SECTIONS ... END SECTIONS construct, with only one communication per SECTION. So for each MPI process, the communications with the 8 potential neighbours are distributed among the team of threads. A call to MPI_Waitall is made after them by the MASTER thread. Each thread keeps the information about its own requests in private storage. That is:
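The structure looks roughly like the following sketch (a hedged reconstruction, since the exact listing is not reproduced here; the subroutine, buffer, and tag names are illustrative, not from the actual code):

```fortran
! Sketch of the communication pattern described above: one neighbour per
! SECTION, request handles in thread-private storage, MPI_Waitall by the
! MASTER thread afterwards.
subroutine exchange_ghosts(comm2d, nbrWW, nbrEE, bufWW, bufEE, cnt)
  use mpi
  use omp_lib
  implicit none
  integer, intent(in)    :: comm2d, nbrWW, nbrEE, cnt
  real(8), intent(inout) :: bufWW(cnt), bufEE(cnt)
  integer :: req(8), nreq, ierr, unit

!$OMP PARALLEL PRIVATE(req, nreq, ierr, unit)
  nreq = 0
  unit = 400 + omp_get_thread_num()
!$OMP SECTIONS
!$OMP SECTION
  write(unit,*) 'Before IRecv WW'; flush(unit)
  nreq = nreq + 1
  call MPI_Irecv(bufWW, cnt, MPI_DOUBLE_PRECISION, nbrWW, 1, &
                 comm2d, req(nreq), ierr)
  write(unit,*) 'After IRecv WW'; flush(unit)
!$OMP SECTION
  write(unit,*) 'Before ISend EE'; flush(unit)
  nreq = nreq + 1
  call MPI_Isend(bufEE, cnt, MPI_DOUBLE_PRECISION, nbrEE, 1, &
                 comm2d, req(nreq), ierr)
  write(unit,*) 'After ISend EE'; flush(unit)
  ! ... one SECTION per remaining neighbour ...
!$OMP END SECTIONS
!$OMP MASTER
  call MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE, ierr)
!$OMP END MASTER
!$OMP END PARALLEL
end subroutine exchange_ghosts
```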
The write/flush calls are there for debugging purposes and will of course be removed afterwards, but here they help to show what is wrong. I run this code on an SGI Altix machine, using 2 nodes, each with 2 processors of 6 cores. I run it with 12 MPI processes, 6 on each node, and each MPI process creates a team of 2 threads.
What is strange is that the OpenMP threads seem to be blocked inside the non-blocking MPI calls. In the fort.4xx files, I get outputs like:

```
==> fort.400 <==
Before IRecv WW
After IRecv WW
Before ISend EE
After ISend EE
Before IRecv EE      <<<< end of this file

==> fort.401 <==
Before IRecv SW
After IRecv SW
Before ISend NE      <<<< end of this file
```
And all 24 threads behave like this: they enter the communication routine and make some MPI calls (with real neighbours, not only MPI_PROC_NULL ones), though the number of calls may differ from one thread to another. None of them reaches the write statement placed after the END SECTIONS directive.
The data exchanged between the MPI processes are the ghost cells of a 4D array (5,Nx,Ny,Nz), i.e. faces or 'corner columns' with a depth of at least 3 layers. Send buffers may overlap, but receive buffers do not. Typically, Nx=112, Ny=204, Nz=32.
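A minimal sketch of how one such face buffer could be packed (the array and buffer names are illustrative, not from the actual code):

```fortran
! Sketch: pack the west face of the 4D array, depth ng layers, into a
! contiguous send buffer destined for the WW neighbour.
program pack_face
  implicit none
  integer, parameter :: Nx = 112, Ny = 204, Nz = 32, ng = 3
  real(8) :: Q(5, Nx, Ny, Nz)
  real(8), allocatable :: sendWW(:,:,:,:)

  Q = 0.0d0
  allocate(sendWW(5, ng, Ny, Nz))
  sendWW = Q(:, 1:ng, :, :)        ! contiguous copy of the west face
  print *, 'send buffer elements:', size(sendWW)
end program pack_face
```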
I use ifort (IFORT) 12.1.0 20111011 and Intel MPI 4.0.0.028.
What I have already tried:

1. I checked the topology.
2. I checked the data scope attributes of the different variables.
3. I tried replacing the SECTIONS construct by a set of SINGLE / END SINGLE NOWAIT ones, but it behaves badly too.
4. I used ITAC with the -mpi_check option, but got nothing interesting.
5. I ran the code with 12 cores and only 1 thread per MPI process: it works like the pure MPI code.
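For attempt 3, the variant looked roughly like this sketch (names are illustrative; here the requests array is shared so that the MASTER thread can wait on all of them):

```fortran
! Sketch: each communication in its own SINGLE ... END SINGLE NOWAIT so
! that any idle thread picks it up; a BARRIER before the MASTER thread
! waits on the shared requests array.
subroutine exchange_singles(comm2d, nbrWW, nbrEE, bufWW, bufEE, cnt)
  use mpi
  implicit none
  integer, intent(in)    :: comm2d, nbrWW, nbrEE, cnt
  real(8), intent(inout) :: bufWW(cnt), bufEE(cnt)
  integer :: reqs(8), ierr

!$OMP PARALLEL PRIVATE(ierr) SHARED(reqs)
!$OMP SINGLE
  call MPI_Irecv(bufWW, cnt, MPI_DOUBLE_PRECISION, nbrWW, 1, &
                 comm2d, reqs(1), ierr)
!$OMP END SINGLE NOWAIT
!$OMP SINGLE
  call MPI_Isend(bufEE, cnt, MPI_DOUBLE_PRECISION, nbrEE, 1, &
                 comm2d, reqs(2), ierr)
!$OMP END SINGLE NOWAIT
  ! ... one SINGLE block per remaining neighbour ...
!$OMP BARRIER
!$OMP MASTER
  call MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE, ierr)
!$OMP END MASTER
!$OMP END PARALLEL
end subroutine exchange_singles
```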
But I don't understand why it freezes.
Any help will be appreciated.
If you need further information, please let me know.