Hello,
I am running into a problem with reduction operations on MPI communicators.
I have a large number of different communicators, each created this way:
MPI_ERR_SONDAGE(MPI_Group_incl(world_group, comm_size, &(on_going_communicator[0]), &local_group));
MPI_ERR_SONDAGE(MPI_Comm_create_group(MPI_COMM_WORLD, local_group, tag, &communicator)); tag++;
When I call a reduction operation such as:
MPI_ERR_SONDAGE(MPI_Allreduce(&(temporary[0]), &(temporary_glo[0]), (int)lignes.size(), MPI_DOUBLE, MPI_MAX, communicator));
I get the following assertion failure:
Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c at line 2266: comm->shm_numa_layout[my_numa_node].base_addr
/Cci/Admin/oneapi/2021.4.0/mpi/2021.4.0/lib/release/libmpi.so.12(MPL_backtrace_show+0x1c) [0x2ace34033c8c]
/Cci/Admin/oneapi/2021.4.0/mpi/2021.4.0/lib/release/libmpi.so.12(MPIR_Assert_fail+0x21) [0x2ace33aaffe1]
/Cci/Admin/oneapi/2021.4.0/mpi/2021.4.0/lib/release/libmpi.so.12(+0x24f609) [0x2ace337c6609]
/Cci/Admin/oneapi/2021.4.0/mpi/2021.4.0/lib/release/libmpi.so.12(+0x19b518) [0x2ace33712518]
/Cci/Admin/oneapi/2021.4.0/mpi/2021.4.0/lib/release/libmpi.so.12(+0x1686aa) [0x2ace336df6aa]
/Cci/Admin/oneapi/2021.4.0/mpi/2021.4.0/lib/release/libmpi.so.12(+0x251ac7) [0x2ace337c8ac7]
/Cci/Admin/oneapi/2021.4.0/mpi/2021.4.0/lib/release/libmpi.so.12(PMPI_Allreduce+0x562) [0x2ace33685712]
I only see this problem on large test cases, meaning many communicators with a reasonable amount of data to reduce, so I have not been able to produce an MCVE, sorry.
When I set the environment variables I_MPI_COLL_DIRECT=off and I_MPI_COLL_INTRANODE=pt2pt, the code works fine. My guess is that the problem comes from the NUMA-aware shared-memory collectives, and that forcing point-to-point communication bypasses that code path.
However, I am worried that these options will degrade performance, so I would really like to understand the root cause.
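For reference, the workaround amounts to setting the two variables in the shell before launching, roughly as in the sketch below. The launch line is only a commented placeholder (process count and executable name are hypothetical), not my actual command:

```shell
# Workaround only: these settings avoid the assertion, they do not fix its cause.
export I_MPI_COLL_DIRECT=off
export I_MPI_COLL_INTRANODE=pt2pt

# Then launch as usual, for example (placeholder command line):
# mpirun -n <nprocs> ./my_app
```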
I have tried with the following versions:
intel/2020.1.217
intel/2020.2.254
intel/2021.4.0
All of them show essentially the same error.
Could you tell me, or give me a hint about, what is going on?
Thank you.