Iegor_P_
Beginner
651 Views

MPI error while running SIESTA code

Hello,

I have compiled the parallel version of the siesta-3.2 code with Intel Parallel Studio XE 2016 Cluster Edition. Compilation completed without any problems, but every siesta run ends with the following error:

Fatal error in PMPI_Cart_create: Other MPI error, error stack:
PMPI_Cart_create(332).........: MPI_Cart_create(comm=0x84000007, ndims=2, dims=0x7fffe76b6288, periods=0x7fffe76b62a0, reorder=0, comm_cart=0x7fffe76b61e0) failed
MPIR_Cart_create_impl(189)....:
MPIR_Cart_create(115).........:
MPIR_Comm_copy(1070)..........:
MPIR_Get_contextid(543).......:
MPIR_Get_contextid_sparse(608): Too many communicators

This message is displayed for every node (if I use X nodes, I see X identical messages in the output file).

I supposed I had used too many nodes, but decreasing the number of nodes did not solve the problem. This kind of error first appeared when I updated the release version to "Update 3" using the online installer. People on other forums advised me to reinstall the MPI environment. I did so, but the problem still occurs. Any ideas? My arch.make file is attached.
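For readers unfamiliar with this failure mode: MPI assigns each communicator a context id from a fixed per-process pool (the error messages in this thread report 16384 slots), so a library routine that creates a communicator on every call without freeing it will eventually exhaust the pool, no matter how few nodes you use. A minimal model of that behavior (plain Python, not real MPI; the pool size comes from the error text, the routine name and class are hypothetical):

```python
POOL_SIZE = 16384  # per-process context-id pool size reported by Intel MPI


class ContextIdPool:
    """Toy model of MPI's per-process context-id pool."""

    def __init__(self, size=POOL_SIZE):
        self.free = size

    def comm_create(self):
        """Model MPI_Cart_create / MPI_Comm_split: consume one context id."""
        if self.free == 0:
            raise RuntimeError(
                "Too many communicators (0/%d free on this process)" % POOL_SIZE
            )
        self.free -= 1

    def comm_free(self):
        """Model MPI_Comm_free: return a context id to the pool."""
        self.free += 1


pool = ContextIdPool()


def leaky_library_call():
    """A leaky routine: creates a communicator per call, never frees it."""
    pool.comm_create()


# 16384 calls succeed; the next one fails, just as a long SIESTA/QE run
# eventually does when a library leaks one communicator per iteration.
for _ in range(POOL_SIZE):
    leaky_library_call()

try:
    leaky_library_call()
except RuntimeError as err:
    print(err)  # Too many communicators (0/16384 free on this process)
```

This is why the failure depends on run length rather than node count: freeing each communicator after use (MPI_Comm_free) would return its context id to the pool, so the leak has to be fixed inside the library, not in the user's input.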

Thanks a lot for any help!

12 Replies
Ade_F_
Beginner

Hi

Just to back this up, I get the same with a run of Quantum Espresso. The compiled binary works fine with Intel MPI 5.1.1, but with 5.1.3 it fails:

Fatal error in PMPI_Cart_sub: Other MPI error, error stack:

PMPI_Cart_sub(242)...................: MPI_Cart_sub(comm=0xc400fcf4, remain_dims=0x7fffffffa7e8, comm_new=0x7fffffffa740) failed

PMPI_Cart_sub(178)...................:

MPIR_Comm_split_impl(270)............:

MPIR_Get_contextid_sparse_group(1330): Too many communicators (0/16384 free on this process; ignore_id=0)

Fatal error in PMPI_Cart_sub: Other MPI error, error stack:

Image              PC                Routine            Line        Source

pw.x               0000000000C94025  Unknown               Unknown  Unknown

libpthread.so.0    0000003568E0F790  Unknown               Unknown  Unknown

libmpi.so.12       00002AAAAF8E7B50  Unknown               Unknown  Unknown

 

This seems to be a bug.....?

~~
Ade


Hi Iegor and Ade,

Let me confirm that this is a known bug in Intel MKL, namely in the P?GEMM function, which is part of ScaLAPACK. The fix should be available in the next release of Intel MKL (we have already fixed the issue internally).

Best regards,
Konstantin
Miguel_Costa
Beginner

Hello,

Is this fixed in the 2017 release?

I don't see it mentioned at https://software.intel.com/en-us/articles/intel-math-kernel-library-intel-mkl-2017-bug-fixes-list

Cheers

Ron_Cohen
Beginner

This does not seem to be fixed even in 2017 beta3:

studio2017_beta3/compilers_and_libraries_2017.0.064

and it doesn't work in anything after 11.2. I have lost many hours to this bug; please fix it in the next release!

 

Gennady_F_Intel
Moderator

Yes, this issue has been fixed in MKL v.2017 (released Sep 6th, 2016). Please check and give us an update if the issue still exists. Thanks, Gennady

tuo__abby
Beginner

I just downloaded the latest version; the same issue still occurs.

Ronald_C_
Beginner

Yes, I am running 2017 Update 4 and I get:

 

Fatal error in PMPI_Comm_split: Other MPI error, error stack:
PMPI_Comm_split(532)................: MPI_Comm_split(comm=0xc4027cd4, color=0, key=1, new_comm=0x7ffe9a6df1f0) failed
PMPI_Comm_split(508)................: fail failed
MPIR_Comm_split_impl(260)...........: fail failed
MPIR_Get_contextid_sparse_group(676): Too many communicators (4/16384 free on this process; ignore_id=0)

This is with the latest ELPA version (elpa-2017.05.001.rc2), the latest Quantum Espresso version (6.1), and the latest Intel MKL.

 

Ron Cohen

 

ksmith54
Beginner

I am also experiencing the exact same problem with 2017 update 4.

 

All jobs continue to crash with this "Too many communicators" error.

None are usable!


//This does not seem to be fixed even in 2017 beta3

// I am also experiencing the exact same problem with 2017 update 4.

// Yes, I am running 2017 -4 and I get:

Can you please clarify which MKL version you used? MKL 2017.4 has not been released yet; the very latest available MKL is 2017.3. For example, can you please set MKL_VERBOSE=1 before running Siesta or QE and report the MKL version it prints?
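A minimal sketch of that check (the launcher, process count, and input/output file names below are placeholders; adapt them to your site):

```shell
# MKL_VERBOSE=1 makes every MKL routine log a line to stdout; the first
# such line identifies the exact MKL build, e.g.:
#   MKL_VERBOSE Intel(R) MKL 2017.0 Update 3 Product build ...
export MKL_VERBOSE=1

# Then launch the job as usual, for example:
#   mpirun -np 16 ./siesta < input.fdf > output.out
# and grep the output for the MKL_VERBOSE header:
#   grep -m1 MKL_VERBOSE output.out
```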

We are quite sure that the issue with too many communicators was fixed more than a year ago, and all official MKL 2017 releases (definitely not the Beta) should have the fix. Just recently I made some Siesta runs and did not see any issues. If you do see this problem with MKL 2017 or later, please let us know and provide a reproducer or instructions on how to reproduce it.

Regards,

Konstantin

 

Sandgren__Åke
Beginner

I can still see this using Intel MPI 2018.1.163 when running QuantumEspresso 6.1 built with

icc/ifort/imkl/impi 2018.1.163

MKL_VERBOSE Intel(R) MKL 2018.0 Update 1 Product build 20171007 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 2.60GHz lp64 intel_thread NMICDev:0

...

Fatal error in PMPI_Comm_split: Other MPI error, error stack:
PMPI_Comm_split(532)................: MPI_Comm_split(comm=0xc4000012, color=0, key=0, new_comm=0x7ffe71e4075c) failed
PMPI_Comm_split(508)................: fail failed
MPIR_Comm_split_impl(260)...........: fail failed
MPIR_Get_contextid_sparse_group(676): Too many communicators (16361/16384 free on this process; ignore_id=0)

Sandgren__Åke
Beginner

A bit more info on this: when compiling the same QuantumEspresso 6.1 with

GCC 6.4, ScaLAPACK from netlib, FFTW, OpenBLAS, and Intel MPI 2018.1.163

I get the exact same error, so it is not MKL related; rather, Intel MPI itself seems to have the problem.

Or at least that's the most likely culprit here.

Gennady_F_Intel
Moderator

Actually, the problem with p?gemm has been fixed in MKL 2018 Update 3. Here is the link to the MKL bug fix list: https://software.intel.com/en-us/articles/intel-math-kernel-library-intel-mkl-2018-bug-fixes-list

MKLD-3445: Fixed run-time failure of the P?GEMM routine for specific problem sizes.

 
