John_Young
New Contributor I
466 Views

MPI: Too Many Communicators

I have a code that has a time-stepping algorithm in which a distributed matrix is solved at each time step.  After a number of time steps, Intel MPI crashes with the error message
 
MPIR_Get_contextid_sparse_group(1330): Too many communicators (0/16384 free on this process; ignore_id=0)
Fatal error in PMPI_Cart_create: Other MPI error, error stack:

I've attached a simple test case that exhibits the problem. I'm using Intel MPI 5.3.3, MKL 11.3.3, and ifort 6.0.3. I've seen the problem on both Windows and Linux.

Is this a bug, and is there a workaround? This is a real show-stopper for us: our simulations can run for a huge number of time steps, and for some problems it doesn't take many steps to exhaust all the communicators.

Thanks,
John
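
The "0/16384 free" figure in the error suggests a fixed pool of context IDs per process, so any communicator that is created at every time step and never released will eventually drain it. The toy model below is only a sketch of that arithmetic. The `ContextPool` class and the value of 50 communicators per step are made up for illustration and say nothing about how Intel MPI or MKL work internally.

```python
# Toy model of a bounded pool of communicator context IDs.
# POOL_SIZE comes from the "0/16384 free" text in the error message;
# everything else here is a hypothetical illustration, not Intel MPI internals.
POOL_SIZE = 16384


class ContextPool:
    def __init__(self, size=POOL_SIZE):
        self.free = size

    def create(self):
        # Mimics creating a communicator: consumes one context ID.
        if self.free == 0:
            raise RuntimeError("Too many communicators")
        self.free -= 1

    def release(self):
        # Mimics freeing a communicator: returns one context ID.
        self.free += 1


def run(steps, comms_per_step, release_after_step):
    """Return the step at which the pool is exhausted, or None if it never is."""
    pool = ContextPool()
    for step in range(1, steps + 1):
        try:
            for _ in range(comms_per_step):
                pool.create()
        except RuntimeError:
            return step
        if release_after_step:
            for _ in range(comms_per_step):
                pool.release()
    return None


if __name__ == "__main__":
    # comms_per_step=50 is an assumed leak rate, chosen only for the example.
    print("leaking:", run(1000, 50, release_after_step=False))
    print("freeing:", run(1000, 50, release_after_step=True))
```

With an assumed 50 leaked IDs per step, the pool runs out during step 328. When the same IDs are released at the end of each step, the 1000-step run completes without exhausting the pool.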

5 Replies
John_Young
New Contributor I

One more point: the test case I attached is set up for 16 MPI processes. You can use a different number of processes, but nprow/npcol in test.F90 must be adjusted accordingly. With fewer processes, you may have to run more than 400 steps.

 

John_Young
New Contributor I

After more testing, the problem is with the p*getrf call and not the p*getrs call.  I can replace the LU factorization with another factorization such as QR (p*geqrf) and no problem occurs.  It seems that p*getrf is possibly not freeing communicators when it is finished. 
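
For anyone wanting to try the same workaround, the idea is to solve A x = b through A = Q R instead of A = L U: factor, form y = Qᵀ b, then back-substitute R x = y. In ScaLAPACK terms this would presumably map to p*geqrf, p*ormqr, and p*trtrs, though that mapping is my assumption and not taken from the attached test case. The serial, pure-Python sketch below only shows the math. The function name `qr_solve` is made up, and it uses modified Gram-Schmidt, not the Householder reflections that LAPACK-style routines use.

```python
import math


def qr_solve(a, b):
    """Solve a x = b for a square, nonsingular matrix a (list of rows).

    Uses a modified Gram-Schmidt QR factorization. Illustrative only:
    this is a serial sketch, not the distributed p*geqrf algorithm.
    """
    n = len(a)
    # Work on copies of the columns so the input is left untouched.
    cols = [[a[i][j] for i in range(n)] for j in range(n)]
    q = []                                   # orthonormal columns of Q
    r = [[0.0] * n for _ in range(n)]        # upper-triangular factor R
    for j in range(n):
        v = cols[j][:]
        for k in range(j):
            # Remove the component of v along the already-built q[k].
            r[k][j] = sum(q[k][i] * v[i] for i in range(n))
            v = [v[i] - r[k][j] * q[k][i] for i in range(n)]
        r[j][j] = math.sqrt(sum(x * x for x in v))
        if r[j][j] == 0.0:
            raise ValueError("matrix is singular")
        q.append([x / r[j][j] for x in v])
    # y = Q^T b
    y = [sum(q[k][i] * b[i] for i in range(n)) for k in range(n)]
    # Back substitution: R x = y
    x = [0.0] * n
    for i in reversed(range(n)):
        s = y[i] - sum(r[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / r[i][i]
    return x


if __name__ == "__main__":
    # 4x + 3y = 10 and 6x + 3y = 12
    print(qr_solve([[4.0, 3.0], [6.0, 3.0]], [10.0, 12.0]))
```

QR costs roughly twice as many floating-point operations as LU for a square system, so this is a stopgap until the fix ships, not a drop-in replacement.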


Hi John,

The issue is presumably related to the Intel MKL bug reported at https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/topic/634884. The fix will be available in the next release. Many thanks for providing the test code.

Best regards
Klaus-Dieter

John_Young
New Contributor I

Hi Klaus-Dieter,

Thanks for letting me know the fix is coming.

 

tuo__abby
Beginner

Is there a way to compute a matrix inverse other than with p*getri (which requires an LU factorization)?
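
One possible route, assuming the matrix is symmetric positive definite, is a Cholesky factorization followed by triangular solves against the identity. ScaLAPACK has p*potrf and p*potri for this case. Whether those routines are affected by the communicator problem was not tested in this thread. For a general matrix, the same "solve A X = I" idea could be applied with a QR factorization instead. The serial sketch below only shows the Cholesky variant, and the names `cholesky` and `spd_inverse` are hypothetical helpers, not library calls.

```python
import math


def cholesky(a):
    """Return lower-triangular L with a = L L^T (a must be SPD)."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                l[i][j] = math.sqrt(s)
            else:
                l[i][j] = s / l[j][j]
    return l


def spd_inverse(a):
    """Invert an SPD matrix by solving L L^T x = e_c for each unit vector e_c."""
    n = len(a)
    l = cholesky(a)
    inv = [[0.0] * n for _ in range(n)]
    for c in range(n):
        # Forward substitution: L y = e_c
        y = [0.0] * n
        for i in range(n):
            rhs = 1.0 if i == c else 0.0
            y[i] = (rhs - sum(l[i][k] * y[k] for k in range(i))) / l[i][i]
        # Back substitution: L^T x = y
        x = [0.0] * n
        for i in reversed(range(n)):
            x[i] = (y[i] - sum(l[k][i] * x[k] for k in range(i + 1, n))) / l[i][i]
        for i in range(n):
            inv[i][c] = x[i]
    return inv


if __name__ == "__main__":
    print(spd_inverse([[4.0, 2.0], [2.0, 3.0]]))
```

If the inverse is only needed to apply it to a few vectors, solving the linear systems directly is usually cheaper and more accurate than forming the inverse at all.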
