Hello,
The Intel MPI library (version 2021.9.0) fails when creating an intercommunicator while the asynchronous progress thread is enabled. I have included a test program that reproduces the following error:
$ I_MPI_ASYNC_PROGRESS=1 mpirun -n 10 ./a.out
Abort(204053775) on node 4 (rank 4 in comm 0): Fatal error in PMPI_Intercomm_create: Other MPI error, error stack:
PMPI_Intercomm_create(317)...........: MPI_Intercomm_create(comm=0x84000002, local_leader=0, MPI_COMM_WORLD, remote_leader=0, tag=1, newintercomm=0x7fff23cc61e4) failed
MPIR_Intercomm_create_impl(49).......:
MPID_Intercomm_exchange_map(645).....:
MPIDIU_Intercomm_map_bcast_intra(112):
MPIR_Bcast_intra_auto(85)............:
MPIR_Bcast_intra_binomial(131).......: message sizes do not match across processes in the collective routine: Received 4100 but expected 16
The program runs fine without the asynchronous progress thread.
Note that this does not happen every time, and the probability increases with a higher number of MPI ranks. Also, the 'received' and 'expected' values both change between runs, so it looks like a race condition.
