Internal error when using the Intel MPI library




I am having some issues when using reduction operations on MPI communicators.


My algorithm creates lots of different communicators, like this:



    MPI_ERR_SONDAGE(MPI_Group_incl(world_group, comm_size, &(on_going_communicator[0]), &local_group));
    MPI_ERR_SONDAGE(MPI_Comm_create_group(MPI_COMM_WORLD, local_group, tag, &communicator));
    tag++;


When I call a reduction operation like this:


    MPI_ERR_SONDAGE(MPI_Allreduce(&(temporary[0]), &(temporary_glo[0]), (int)lignes.size(), MPI_DOUBLE, MPI_MAX, communicator));
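With `MPI_MAX`, this call should leave in `temporary_glo` on every rank of `communicator` the element-wise maximum of the `temporary` vectors of all member ranks. A small serial reference of that semantics can help check results on cases that do not crash, for example comparing runs with and without `I_MPI_COLL_DIRECT=off` / `I_MPI_COLL_INTRANODE=pt2pt`. This is a sketch that assumes the per-rank buffers have been collected into a vector of vectors; `reference_max` is a hypothetical helper.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Serial reference for what MPI_Allreduce(..., MPI_DOUBLE, MPI_MAX, comm)
// leaves in the receive buffer of every rank: the element-wise maximum of the
// send buffers of all ranks in the communicator.
// per_rank[r] plays the role of `temporary` on rank r (assumed layout).
std::vector<double> reference_max(const std::vector<std::vector<double>>& per_rank) {
    std::vector<double> result = per_rank.at(0);
    for (std::size_t r = 1; r < per_rank.size(); ++r)
        for (std::size_t i = 0; i < result.size(); ++i)
            result[i] = std::max(result[i], per_rank[r].at(i));
    return result;
}
```

For example, three ranks holding `{1, 5, -2}`, `{3, 4, -1}` and `{2, 6, -3}` should all end up with `{3, 6, -1}`.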


I get the following error:


    Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c at line 2266: comm->shm_numa_layout[my_numa_node].base_addr
    /Cci/Admin/oneapi/2021.4.0/mpi/2021.4.0/lib/release/ [0x2ace34033c8c]
    /Cci/Admin/oneapi/2021.4.0/mpi/2021.4.0/lib/release/ [0x2ace33aaffe1]
    /Cci/Admin/oneapi/2021.4.0/mpi/2021.4.0/lib/release/ [0x2ace337c6609]
    /Cci/Admin/oneapi/2021.4.0/mpi/2021.4.0/lib/release/ [0x2ace33712518]
    /Cci/Admin/oneapi/2021.4.0/mpi/2021.4.0/lib/release/ [0x2ace336df6aa]
    /Cci/Admin/oneapi/2021.4.0/mpi/2021.4.0/lib/release/ [0x2ace337c8ac7]
    /Cci/Admin/oneapi/2021.4.0/mpi/2021.4.0/lib/release/ [0x2ace33685712]


I only hit this problem on a big test case, meaning lots of communicators with a reasonable amount of data to reduce, so I cannot create an MCVE, sorry.


When I set the environment variables I_MPI_COLL_DIRECT=off and I_MPI_COLL_INTRANODE=pt2pt, the code works fine. I guess the problem comes from the NUMA-aware shared-memory collectives (the failing assertion is on comm->shm_numa_layout in ch4_shm_coll.c), and forcing point-to-point communication presumably prevents them from being used.


But I fear these options will degrade performance, so I would really like to understand the underlying problem.

I have tried with:





They all show essentially the same error.


Could you tell me what is going on, or at least give me a hint?


Thank you.

