Intel® HPC Toolkit
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

Crash with custom reduction operator in Intel MPI

jackson__adrian
Beginner

We've encountered what we think is a bug in Intel MPI. We have some code that uses a custom reduction operator, but it crashes at certain process counts with MPI 2021.8.0 (384 processes) and at any process count with MPI 2021.10.0.

 

The code that crashes is:

 

    inline static void SPARKY_MPI_QUAD_SUM(void* pIn, void* pInOut, int* nCount, MPI_Datatype* mDatatype)
    {
        quad* qIn = static_cast<quad*>(pIn);
        quad* qInOut = static_cast<quad*>(pInOut);
        for (int i = 0; i < *nCount; ++i) *qInOut[i] = *qIn[i] + *qInOut[i]; // quiet NaNs assumed
    }

 

The operator is called from an MPI_Reduce. The specific line causing the fault is the for loop, although nCount is 1 for these calls.

 

It crashes with errors like this:

Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))

==== backtrace (tid: 87374) ====
0 0x0000000000414370 SPATRIX_MPI<quad>::SPARKY_MPI_QUAD_SUM() ???:0
1 0x00000000006af6f9 MPIR_Reduce_local() /build/impi/_buildspace/release/../../src/mpi/coll/reduce_local/reduce_local.c:164
2 0x000000000010cf45 MPIR_Allreduce_intra_rec_multiplying() /build/impi/_buildspace/release/../../src/mpi/coll/intel/allreduce/allreduce_intra_rec_multiplying.c:375
3 0x000000000018ed5e MPIDI_OFI_Allreduce_intra_rec_multiplying() /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/include/../ofi/intel/ofi_coll_impl_ext.h:16
4 0x000000000018ed5e MPIDI_NM_mpi_allreduce() /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/include/../ofi/intel/ofi_coll.h:98
5 0x000000000018ed5e MPIDI_Allreduce_intra_composition_zeta() /build/impi/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_impl.h:1069
6 0x000000000018ed5e MPID_Allreduce_invoke() /build/impi/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_select_utils.c:1788
7 0x000000000018ed5e MPIDI_coll_invoke() /build/impi/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_select_utils.c:3269
8 0x000000000016960a MPIDI_coll_select() /build/impi/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_globals_default.c:143
9 0x0000000000271d27 MPID_Allreduce() /build/impi/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll.h:77
10 0x0000000000114b90 PMPI_Allreduce() /build/impi/_buildspace/release/../../src/mpi/coll/allreduce/allreduce.c:373
11 0x00000000004197f4 SPATRIX_MPI<quad>::Fnorm() ???:0
12 0x00000000004113dc TestingRoutines<quad>() DrivrutinMPI.cpp:0
13 0x0000000000406e4a main() ???:0
14 0x0000000000022555 __libc_start_main() /usr/src/debug/glibc-2.17-c758a686/csu/../csu/libc-start.c:266
15 0x0000000000408cf5 _start() ???:0

 

The same code works with the OpenMPI and MVAPICH libraries, which suggests the problem is Intel MPI related.

I've attached a small(ish) reproducer of the crash.

RabiyaSK_Intel
Moderator

Hi,


Thanks for posting in Intel Communities.


Could you please provide the following details, so that we can reproduce your issue on our end:

1. CPU, Operating System and hardware details

2. The steps to reproduce for the provided reproducer


Thanks & Regards,

Shaik Rabiya


jackson__adrian
Beginner

Intel Cascade Lake, CentOS 7.9, Omnipath fabric with PSM2.

For the reproducer you need to compile using Intel MPI with GNU (we used GNU 11.2.0) as the compiler (the Intel compiler is also possible, but you'd need to modify the compiler flags in the makefile).

An example batch script is in the reproducer showing how to run it.

RabiyaSK_Intel
Moderator

Hi,


>>>Intel Cascade Lake, CentOS 7.9, Omnipath fabric with PSM2.

Could you please try on a supported operating system? CentOS is not supported by the Intel HPC Toolkit or the Intel MPI Library.


Please refer to the following links for system requirements:


Intel HPC Toolkit:

https://www.intel.com/content/www/us/en/developer/articles/system-requirements/intel-oneapi-hpc-toolkit-system-requirements.html


Intel MPI Library:

https://www.intel.com/content/www/us/en/developer/articles/system-requirements/mpi-library-system-requirements.html


If the issue still persists, could you please provide the makefile and the Slurm script that you used with the Intel MPI Library, so that we can reproduce your issue as effectively as possible?


Thanks & Regards,

Shaik Rabiya


jackson__adrian
Beginner

There is a makefile and a Slurm batch script within the archive originally provided with this test case.

RabiyaSK_Intel
Moderator

Hi,


>>>Intel Cascade Lake, CentOS 7.9, Omnipath fabric with PSM2

Could you please try on a supported operating system, if you haven't already done so?


>>>For the reproducer you need to compiler using Intel MPI and GNU (we used GNU 11.2.0) as the compiler (Intel is possible, you'd need to modify the compiler flags in the makefile).

I apologize for the misunderstanding. Did you mean that you have been using the Intel MPI Library with the GNU 11.2.0 compiler? If so, can you please try the Intel oneAPI compiler wrappers for MPI and send us the output log? We went through the Slurm script and the makefile and tried changing the compiler and flags to use the Intel compilers, but we received errors. Could you therefore share the reproducer modified for the Intel MPI Library with the Intel MPI wrappers/oneAPI compilers, so that we can reproduce your issue effectively at our end?


Thanks & Regards,

Shaik Rabiya 


jackson__adrian
Beginner

I've now tried this on a supported platform (a Cray system, albeit with a different network and processors) and it seems to work fine there. Moreover, if I use the Intel compiler rather than GNU 11.2.0 on the original system it works as well, so I think it must be the combination of GNU 11.2.0 + Intel MPI causing the problem.

 

As I have a workaround (using the Intel compiler), we can close this now, unless you want to try to fix it with the GNU compiler as well.

RabiyaSK_Intel
Moderator

Hi,


We have informed the concerned development team about the issue. We will get back to you soon.


Thanks & Regards,

Shaik Rabiya


RabiyaSK_Intel
Moderator

Hi,

 

Thank you for your patience.

 

>>>Moreover, if I use the Intel compiler rather than GNU 11.2.0 on the original system it works as well, so I think it must be the combination of GNU 11.2.0 + Intel MPI causing the problem.

Did you find the same behavior with older or newer versions of GNU compiler besides the version that you are currently using?

 

>>> As I have a workaround (using the Intel compiler) we can close this now unless you want to try and fix it with the GNU compiler as well

Could you please mention which Intel compiler you used: the Intel Classic C++ compiler (mpiicc) or the Intel oneAPI DPC++/C++ compiler (mpiicc -cc=icx)? If you haven't tried either of them, could you please try and confirm whether you see the same crash?

 

 

Thanks & Regards,

Shaik Rabiya

RabiyaSK_Intel
Moderator

Hi,


We have not heard back from you. Could you please respond to my previous reply?


Thanks & Regards,

Shaik Rabiya


RabiyaSK_Intel
Moderator

Hi,


We have not heard back from you. If you need any additional information, please post a new question in communities as this thread will no longer be monitored by Intel.


Thanks & Regards,

Shaik Rabiya

