I am having difficulty using Intel MPI from Parallel Studio Cluster Edition 2016 in conjunction with Quantum Espresso PWSCF v6.3.
I think the problems may be interrelated and have to do with MPI communicators. I compiled pw.x with the Intel compilers, Intel MPI, Intel ScaLAPACK and MKL, but without OpenMP.
I have been running pw.x with multiple processes quite successfully. However, when the number of processes is high enough that the space group has more than 7 processes, so that the subspace diagonalization no longer uses a serial algorithm, the program crashes abruptly at about the 10th iteration with the following errors:
Fatal error in PMPI_Cart_sub: Other MPI error, error stack:
PMPI_Cart_sub(242)...................: MPI_Cart_sub(comm=0xc400fcf3, remain_dims=0x7ffe0b27a6e8, comm_new=0x7ffe0b27a640) failed
PMPI_Cart_sub(178)...................:
MPIR_Comm_split_impl(270)............:
MPIR_Get_contextid_sparse_group(1330): Too many communicators (0/16384 free on this process; ignore_id=0)
Fatal error in PMPI_Cart_sub: Other MPI error, error stack:
PMPI_Cart_sub(242)...................: MPI_Cart_sub(comm=0xc400fcf3, remain_dims=0x7ffefaee7ce8, comm_new=0x7ffefaee7c40) failed
PMPI_Cart_sub(178)...................:
MPIR_Comm_split_impl(270)............:
On the pw forum, I got this response:
'a careful look at the error message reveals, that you are running out of space for MPI communicators for which a fixed maximum number (16k) seems to be allowed. this hints at a problem somewhere that communicators are generated with MPI_Comm_split() and not properly cleared afterwards.'
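To check my understanding of that explanation, here is a toy model in Python of what I think is being described. It is only a sketch under assumptions: it is not Intel MPI or Quantum Espresso code, the class and function names (ContextIdPool, run) are made up for illustration, and the only number taken from my run is the pool size of 16384 from the error message. It ignores the few communicators a real MPI library reserves for itself.

```python
# Toy model only: NOT Intel MPI or Quantum Espresso code.
# All names here are hypothetical; 16384 comes from the error message above.


class TooManyCommunicators(RuntimeError):
    """Raised when the modelled pool has no free context IDs left."""


class ContextIdPool:
    """Stand-in for a per-process pool of communicator context IDs."""

    def __init__(self, capacity=16384):
        self.capacity = capacity
        self.free_ids = list(range(capacity))

    def cart_sub(self):
        """Models a call such as MPI_Cart_sub: takes one ID from the pool."""
        if not self.free_ids:
            raise TooManyCommunicators(
                f"Too many communicators (0/{self.capacity} free on this process)"
            )
        return self.free_ids.pop()

    def comm_free(self, comm_id):
        """Models MPI_Comm_free: hands the ID back to the pool."""
        self.free_ids.append(comm_id)


def run(calls, free_after_use, capacity=16384):
    """Try to create `calls` sub-communicators; return how many succeeded."""
    pool = ContextIdPool(capacity)
    created = 0
    try:
        for _ in range(calls):
            comm = pool.cart_sub()
            created += 1
            if free_after_use:
                pool.comm_free(comm)
    except TooManyCommunicators:
        pass
    return created


if __name__ == "__main__":
    # Never freeing: creation stops once the pool is empty.
    print(run(20000, free_after_use=False))
    # Freeing after each use: every call succeeds.
    print(run(20000, free_after_use=True))
```

If the model is right, a crash after a fixed number of iterations would be consistent with some routine creating sub-communicators on every iteration without a matching MPI_Comm_free.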
But I don't know how to fix this.
Please kindly advise,
Apologies if this is a repost:
Hi James T,
I'm afraid I am not the administrator of this machine, so I am unable to install the newer software. On my own PC I have been running Quantum Espresso with the 2019 version without difficulty. I believe (without validation) that the errors occur because I built the software in a directory that may not be accessible from all of the Intel Xeon blades. I am trying to get the administrators to build it in an alternative location. Do the errors included in my original post back up my theory?