Deadlock after many iterations (100-200k) of successful async point-to-point communication.

Spiros · 02-20-2024

Dear all,

I am testing my pure-MPI Fortran code, which performs calculations on partitioned mesh blocks at every iteration of the simulation. Rank-boundary data are therefore exchanged with neighboring ranks via asynchronous (non-blocking) send/recv, as shown below.

   reqs = MPI_REQUEST_NULL
   ir = 0
   is = 0
   ! ****************************************
   LOOP_RECV: do i=1,N

      nei => neighbours(i)
      nei_recv => neighbours_recv(i)

      tg = 1
      ir = ir + 1
      call MPI_IRECV(nei_recv%data(1), order*ndim*nei%number_of_elements, &
                     MPI_DOUBLE_PRECISION, nei%dest, tg, MPI_COMM_WORLD, reqr(ir), ierr)
   enddo LOOP_RECV
   ! ****************************************

   ! ****************************************
   LOOP_SEND: do i=1,N

      nei => neighbours(i)
      nei_send => neighbours_send(i)

      tg = 1
      is = is + 1
      call MPI_ISEND(nei_send%data(1), order*ndim*nei%number_of_elements, &
                     MPI_DOUBLE_PRECISION, nei%dest, tg, MPI_COMM_WORLD, reqs(is), ierr)
   enddo LOOP_SEND
   ! ****************************************
   call mpi_waitall(ir,reqr,statusr,ierr)

When I increase the size of the send/recv arrays (a higher order means I have to send arrays two or three times as large), a deadlock occurs at a random iteration far from the start of the simulation. For order=1 the simulation finishes successfully.
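
One thing that may be worth checking in the snippet above: only the receive requests (reqr) are passed to mpi_waitall, while the send requests (reqs) are never completed in the code shown. MPI requires every non-blocking request to be completed eventually, and a send buffer must not be modified until its send has completed. Larger messages are often handled with a different internal protocol than small ones, so this could plausibly explain why only order>1 is affected, though that is not confirmed here. Below is a minimal, self-contained sketch of an exchange that waits on both sets of requests. The ring topology, the buffer names (sendbuf, recvbuf), the message length and the iteration count are all hypothetical and are not taken from the original code.

```fortran
program halo_exchange_sketch
   use mpi
   implicit none

   ! All sizes below are hypothetical placeholders, not values from the original code.
   integer, parameter :: nnei  = 2      ! two neighbours in a periodic ring
   integer, parameter :: n     = 1000   ! elements per message
   integer, parameter :: niter = 100    ! number of exchange iterations

   integer :: ierr, rank, nprocs, i, it, tg
   integer :: nei(nnei)
   integer :: reqr(nnei), reqs(nnei)
   integer :: statusr(MPI_STATUS_SIZE, nnei), statuss(MPI_STATUS_SIZE, nnei)
   double precision :: sendbuf(n, nnei), recvbuf(n, nnei)

   call MPI_INIT(ierr)
   call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
   call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

   ! Left and right neighbour in a periodic ring (assumed topology).
   nei(1) = mod(rank - 1 + nprocs, nprocs)
   nei(2) = mod(rank + 1, nprocs)
   tg = 1

   do it = 1, niter
      reqr = MPI_REQUEST_NULL
      reqs = MPI_REQUEST_NULL

      ! Safe to overwrite: all sends of the previous iteration have completed.
      sendbuf = dble(rank + it)

      ! Post all receives first, then all sends.
      do i = 1, nnei
         call MPI_IRECV(recvbuf(1, i), n, MPI_DOUBLE_PRECISION, nei(i), tg, &
                        MPI_COMM_WORLD, reqr(i), ierr)
      end do
      do i = 1, nnei
         call MPI_ISEND(sendbuf(1, i), n, MPI_DOUBLE_PRECISION, nei(i), tg, &
                        MPI_COMM_WORLD, reqs(i), ierr)
      end do

      ! Complete BOTH the receive and the send requests before the next iteration.
      call MPI_WAITALL(nnei, reqr, statusr, ierr)
      call MPI_WAITALL(nnei, reqs, statuss, ierr)
   end do

   ! Sanity check: the last message from each neighbour holds (its rank + niter).
   do i = 1, nnei
      if (any(recvbuf(:, i) /= dble(nei(i) + niter))) then
         print *, 'rank', rank, ': unexpected data from neighbour', nei(i)
      end if
   end do
   if (rank == 0) print *, 'exchange finished'

   call MPI_FINALIZE(ierr)
end program halo_exchange_sketch
```

If reqs is already completed elsewhere in the real code (outside the part that was posted), this suggestion does not apply.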

This occurs both on my workstation (Intel i5-10400, 6 processes) and on the cluster (2x Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz with 20 cores each), with the number of processes ranging from 40 to 200.

In one of the cluster runs, the following messages were printed while the job was deadlocked.

dapl async_event CQ (0x22f6d70) ERR 0
dapl_evd_cq_async_error_callback (0x22c0d90, 0x22f6ed0, 0x7ff2de9f2bf0, 0x22f6d70)
dapl async_event QP (0x2275f00) Event 1

(On my PC I have Intel mpiifort 2021.11.1, and on the cluster version 19.1.0.)

Does anyone know what goes wrong and how to fix it?

Spiros

Spiros · 02-20-2024

This post is an accidental duplicate of https://community.intel.com/t5/Intel-HPC-Toolkit/Deadlock-after-many-iterations-100-200k-of-successful-async/m-p/1573832