I have a code with a time-stepping algorithm in which a distributed matrix is solved at each time step. After a number of time steps, Intel MPI crashes with the error message:
Fatal error in PMPI_Cart_create: Other MPI error, error stack:
MPIR_Get_contextid_sparse_group(1330): Too many communicators (0/16384 free on this process; ignore_id=0)
I've attached a simple test case that exhibits the problem. I'm using Intel MPI 5.3.3, MKL 11.3.3, and ifort 6.0.3. I've experienced the problem on both Windows and Linux.
Is this a bug, and/or is there a workaround? This is really a show-stopper for us, as our simulations can have a huge number of time steps, and for some problems it doesn't take many steps to exhaust all the communicators.
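To give a sense of the structure (an illustrative sketch only, with made-up names and sizes, not the attached test.F90): the code sets up a BLACS process grid once and then factors and solves the distributed system inside the time loop, roughly like this:

program timestep_lu
  implicit none
  integer, parameter :: n = 2000, nb = 64, nrhs = 1, nsteps = 400
  integer :: iam, nprocs, ictxt, nprow, npcol, myrow, mycol
  integer :: mloc, nloc, info, step
  integer :: desca(9), descb(9)
  integer, allocatable :: ipiv(:)
  double precision, allocatable :: a(:,:), b(:,:)
  integer, external :: numroc

  nprow = 4; npcol = 4                              ! expects 16 MPI ranks
  call blacs_pinfo(iam, nprocs)
  call blacs_get(-1, 0, ictxt)
  call blacs_gridinit(ictxt, 'Row-major', nprow, npcol)
  call blacs_gridinfo(ictxt, nprow, npcol, myrow, mycol)

  mloc = numroc(n, nb, myrow, 0, nprow)             ! local row count
  nloc = numroc(n, nb, mycol, 0, npcol)             ! local column count
  allocate(a(mloc, nloc), b(mloc, nrhs), ipiv(mloc + nb))
  call descinit(desca, n, n,    nb, nb, 0, 0, ictxt, max(1, mloc), info)
  call descinit(descb, n, nrhs, nb, nb, 0, 0, ictxt, max(1, mloc), info)

  do step = 1, nsteps
    call random_number(a)                           ! new system each step
    call random_number(b)
    ! Distributed LU factorization and triangular solves.
    call pdgetrf(n, n, a, 1, 1, desca, ipiv, info)
    call pdgetrs('N', n, nrhs, a, 1, 1, desca, ipiv, b, 1, 1, descb, info)
    if (iam == 0 .and. mod(step, 50) == 0) print *, 'completed step', step
  end do

  call blacs_gridexit(ictxt)
  call blacs_exit(0)
end program timestep_lu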
Thanks,
John
One more point: the test case I attached is for 16 MPI processes. You can use a different number of processes, but nprow/npcol in test.F90 must be adjusted accordingly (see the sketch below). For a smaller number of processes, you may have to run more than 400 steps.
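For reference, here is a small sketch (my own helper, not code taken from test.F90) of deriving nprow/npcol from however many ranks the job is launched with, instead of hard-coding the 4 x 4 grid:

program grid_from_ranks
  implicit none
  integer :: iam, nprocs, nprow, npcol, ictxt, myrow, mycol

  call blacs_pinfo(iam, nprocs)
  nprow = int(sqrt(dble(nprocs)))
  do while (mod(nprocs, nprow) /= 0)   ! walk down to a divisor of nprocs
    nprow = nprow - 1
  end do
  npcol = nprocs / nprow               ! so nprow * npcol == nprocs

  call blacs_get(-1, 0, ictxt)
  call blacs_gridinit(ictxt, 'Row-major', nprow, npcol)
  call blacs_gridinfo(ictxt, nprow, npcol, myrow, mycol)
  if (iam == 0) print *, 'process grid:', nprow, 'x', npcol
  call blacs_gridexit(ictxt)
  call blacs_exit(0)
end program grid_from_ranks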
After more testing, the problem is with the p*getrf call, not the p*getrs call. If I replace the LU factorization with another factorization such as QR (p*geqrf), no problem occurs. It seems that p*getrf may not be freeing its communicators when it is finished.
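To illustrate the swap (a sketch only, with hypothetical array names, using the standard ScaLAPACK calls rather than my actual code), the p*getrf/p*getrs pair can be replaced by a QR-based solve like this:

subroutine qr_solve(n, nrhs, a, desca, b, descb, info)
  ! Solve A*x = b for a distributed square A via QR:
  ! A = Q*R, then b := Q**T * b, then back-substitute with R.
  implicit none
  integer, intent(in)             :: n, nrhs, desca(9), descb(9)
  double precision, intent(inout) :: a(*), b(*)   ! local pieces, overwritten
  integer, intent(out)            :: info
  double precision, allocatable   :: tau(:), work(:)
  double precision                :: wq(1)
  integer                         :: lwork

  allocate(tau(n))                                ! >= LOCc(min(m,n)) is enough

  ! Factor A = Q*R (workspace query first).
  call pdgeqrf(n, n, a, 1, 1, desca, tau, wq, -1, info)
  lwork = int(wq(1))
  allocate(work(lwork))
  call pdgeqrf(n, n, a, 1, 1, desca, tau, work, lwork, info)
  deallocate(work)

  ! b := Q**T * b (workspace query first).
  call pdormqr('L', 'T', n, nrhs, n, a, 1, 1, desca, tau, &
               b, 1, 1, descb, wq, -1, info)
  lwork = int(wq(1))
  allocate(work(lwork))
  call pdormqr('L', 'T', n, nrhs, n, a, 1, 1, desca, tau, &
               b, 1, 1, descb, work, lwork, info)

  ! Back-substitute with the upper-triangular factor R.
  call pdtrtrs('U', 'N', 'N', n, nrhs, a, 1, 1, desca, &
               b, 1, 1, descb, info)
  deallocate(work, tau)
end subroutine qr_solve

QR costs roughly twice the flops of LU for a square system, but it sidesteps the communicator exhaustion in my runs.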
Hi John,
The issue is presumably related to the Intel MKL bug reported at https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/topic/634884. The fix will be available in the next release. Many thanks for providing the test code.
Best regards
Klaus-Dieter
Hi Klaus-Dieter,
Thanks for letting me know the fix is coming.
