Intel® HPC Toolkit

Internal error when using the Intel MPI Library

pilou

Hello,

I am having some issues when using reduction operations on MPI communicators.

I have a lot of different communicators, each created like this:

MPI_ERR_SONDAGE(MPI_Group_incl(world_group, comm_size, &(on_going_communicator[0]), &local_group));
MPI_ERR_SONDAGE(MPI_Comm_create_group(MPI_COMM_WORLD, local_group, tag, &communicator));
tag++;

When I call a reduction operation like so:

MPI_ERR_SONDAGE(MPI_Allreduce(&(temporary[0]), &(temporary_glo[0]), (int)lignes.size(), MPI_DOUBLE, MPI_MAX, communicator));

I get:

Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c at line 2266: comm->shm_numa_layout[my_numa_node].base_addr
/Cci/Admin/oneapi/2021.4.0/mpi/2021.4.0/lib/release/libmpi.so.12(MPL_backtrace_show+0x1c) [0x2ace34033c8c]
/Cci/Admin/oneapi/2021.4.0/mpi/2021.4.0/lib/release/libmpi.so.12(MPIR_Assert_fail+0x21) [0x2ace33aaffe1]
/Cci/Admin/oneapi/2021.4.0/mpi/2021.4.0/lib/release/libmpi.so.12(+0x24f609) [0x2ace337c6609]
/Cci/Admin/oneapi/2021.4.0/mpi/2021.4.0/lib/release/libmpi.so.12(+0x19b518) [0x2ace33712518]
/Cci/Admin/oneapi/2021.4.0/mpi/2021.4.0/lib/release/libmpi.so.12(+0x1686aa) [0x2ace336df6aa]
/Cci/Admin/oneapi/2021.4.0/mpi/2021.4.0/lib/release/libmpi.so.12(+0x251ac7) [0x2ace337c8ac7]
/Cci/Admin/oneapi/2021.4.0/mpi/2021.4.0/lib/release/libmpi.so.12(PMPI_Allreduce+0x562) [0x2ace33685712]

 

I only see this problem on large test cases, meaning many communicators with a reasonable amount of data to reduce, so I cannot provide an MCVE, sorry.

When I set the environment variables I_MPI_COLL_DIRECT=off and I_MPI_COLL_INTRANODE=pt2pt, the code works fine. My guess is that the problem is caused by the use of NUMA, and that forcing point-to-point communication prevents NUMA from being used.
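
As a minimal sketch, the workaround could be applied in a job script as shown here. The two variable names and values are the ones given above; the launch line, process count, and executable name `./my_app` are placeholders for illustration, not the actual command used.

```shell
# Workaround settings quoted in this post.
export I_MPI_COLL_DIRECT=off
export I_MPI_COLL_INTRANODE=pt2pt

# Print the settings so they show up in the job log.
echo "I_MPI_COLL_DIRECT=${I_MPI_COLL_DIRECT} I_MPI_COLL_INTRANODE=${I_MPI_COLL_INTRANODE}"

# Hypothetical launch line (process count and executable are assumptions):
# mpirun -n 64 ./my_app
```

Exporting the variables in the script, rather than in the login shell, keeps the workaround limited to the runs that need it, which makes it easier to compare timings with and without it.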

However, I am afraid that these options will degrade performance, so I would really like to understand the root cause.

I have tried with:

intel/2020.1.217

intel/2020.2.254

intel/2021.4.0

All three show essentially the same error.

Could you tell me, or give me a hint about, what is going on?

Thank you.
