I have a code with a time-stepping algorithm in which a distributed matrix is solved at each time step. After a number of time steps, Intel MPI crashes with the error message:
Fatal error in PMPI_Cart_create: Other MPI error, error stack:
MPIR_Get_contextid_sparse_group(1330): Too many communicators (0/16384 free on this process; ignore_id=0)
I've attached a simple test case that exhibits the problem. I'm using Intel MPI 5.3.3, MKL 11.3.3, and ifort 6.0.3. I've experienced the problem on both Windows and Linux.
Is this a bug, and/or is there a workaround? This is really a show-stopper for us, as our simulations can have a huge number of time steps, and for some problems it doesn't take many steps to exhaust all the communicators.
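To give a sense of the structure (an illustrative sketch only, with made-up names and sizes, not the attached test.F90): the code sets up a BLACS process grid once and then factors and solves the distributed system inside the time loop, roughly like this:

program timestep_lu
  implicit none
  integer, parameter :: n = 2000, nb = 64, nrhs = 1, nsteps = 400
  integer :: iam, nprocs, ictxt, nprow, npcol, myrow, mycol
  integer :: mloc, nloc, info, step
  integer :: desca(9), descb(9)
  integer, allocatable :: ipiv(:)
  double precision, allocatable :: a(:,:), b(:,:)
  integer, external :: numroc

  nprow = 4; npcol = 4                              ! expects 16 MPI ranks
  call blacs_pinfo(iam, nprocs)
  call blacs_get(-1, 0, ictxt)
  call blacs_gridinit(ictxt, 'Row-major', nprow, npcol)
  call blacs_gridinfo(ictxt, nprow, npcol, myrow, mycol)

  mloc = numroc(n, nb, myrow, 0, nprow)             ! local row count
  nloc = numroc(n, nb, mycol, 0, npcol)             ! local column count
  allocate(a(mloc, nloc), b(mloc, nrhs), ipiv(mloc + nb))
  call descinit(desca, n, n,    nb, nb, 0, 0, ictxt, max(1, mloc), info)
  call descinit(descb, n, nrhs, nb, nb, 0, 0, ictxt, max(1, mloc), info)

  do step = 1, nsteps
    call random_number(a)                           ! new system each step
    call random_number(b)
    ! Distributed LU factorization and triangular solves.
    call pdgetrf(n, n, a, 1, 1, desca, ipiv, info)
    call pdgetrs('N', n, nrhs, a, 1, 1, desca, ipiv, b, 1, 1, descb, info)
    if (iam == 0 .and. mod(step, 50) == 0) print *, 'completed step', step
  end do

  call blacs_gridexit(ictxt)
  call blacs_exit(0)
end program timestep_lu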
Thanks,
John
One more point: the test case I attached is for 16 MPI processes. You can use a different number of processes, but nprow/npcol in test.F90 must be adjusted accordingly (see the sketch below). For a smaller number of processes, you may have to run more than 400 steps.
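For reference, here is a small sketch (my own helper, not code taken from test.F90) of deriving nprow/npcol from however many ranks the job is launched with, instead of hard-coding the 4 x 4 grid:

program grid_from_ranks
  implicit none
  integer :: iam, nprocs, nprow, npcol, ictxt, myrow, mycol

  call blacs_pinfo(iam, nprocs)
  nprow = int(sqrt(dble(nprocs)))
  do while (mod(nprocs, nprow) /= 0)   ! walk down to a divisor of nprocs
    nprow = nprow - 1
  end do
  npcol = nprocs / nprow               ! so nprow * npcol == nprocs

  call blacs_get(-1, 0, ictxt)
  call blacs_gridinit(ictxt, 'Row-major', nprow, npcol)
  call blacs_gridinfo(ictxt, nprow, npcol, myrow, mycol)
  if (iam == 0) print *, 'process grid:', nprow, 'x', npcol
  call blacs_gridexit(ictxt)
  call blacs_exit(0)
end program grid_from_ranks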
After more testing, the problem is with the p*getrf call, not the p*getrs call. If I replace the LU factorization with another factorization such as QR (p*geqrf), no problem occurs. It seems that p*getrf may not be freeing its communicators when it is finished.
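To illustrate the swap (a sketch only, with hypothetical array names, using the standard ScaLAPACK calls rather than my actual code), the p*getrf/p*getrs pair can be replaced by a QR-based solve like this:

subroutine qr_solve(n, nrhs, a, desca, b, descb, info)
  ! Solve A*x = b for a distributed square A via QR:
  ! A = Q*R, then b := Q**T * b, then back-substitute with R.
  implicit none
  integer, intent(in)             :: n, nrhs, desca(9), descb(9)
  double precision, intent(inout) :: a(*), b(*)   ! local pieces, overwritten
  integer, intent(out)            :: info
  double precision, allocatable   :: tau(:), work(:)
  double precision                :: wq(1)
  integer                         :: lwork

  allocate(tau(n))                                ! >= LOCc(min(m,n)) is enough

  ! Factor A = Q*R (workspace query first).
  call pdgeqrf(n, n, a, 1, 1, desca, tau, wq, -1, info)
  lwork = int(wq(1))
  allocate(work(lwork))
  call pdgeqrf(n, n, a, 1, 1, desca, tau, work, lwork, info)
  deallocate(work)

  ! b := Q**T * b (workspace query first).
  call pdormqr('L', 'T', n, nrhs, n, a, 1, 1, desca, tau, &
               b, 1, 1, descb, wq, -1, info)
  lwork = int(wq(1))
  allocate(work(lwork))
  call pdormqr('L', 'T', n, nrhs, n, a, 1, 1, desca, tau, &
               b, 1, 1, descb, work, lwork, info)

  ! Back-substitute with the upper-triangular factor R.
  call pdtrtrs('U', 'N', 'N', n, nrhs, a, 1, 1, desca, &
               b, 1, 1, descb, info)
  deallocate(work, tau)
end subroutine qr_solve

QR costs roughly twice the flops of LU for a square system, but it sidesteps the communicator exhaustion in my runs.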
Hi John,
The issue is presumably related to the Intel MKL bug reported at https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/topic/634884. The fix will be available in the next release. Many thanks for providing the test code.
Best regards
Klaus-Dieter
Hi Klaus-Dieter,
Thanks for letting me know the fix is coming.
