Dear all,
I am testing my pure-MPI Fortran code, which performs calculations on partitioned mesh blocks in every iteration of the simulation. Rank-boundary data are therefore exchanged with the neighbouring ranks via non-blocking (asynchronous) send/recv, as shown below.
reqs = MPI_REQUEST_NULL
ir = 0
is = 0

! post non-blocking receives from all neighbouring ranks
LOOP_RECV: do i = 1, N
   nei      => neighbours(i)
   nei_recv => neighbours_recv(i)
   tg = 1
   ir = ir + 1
   call MPI_IRECV(nei_recv%data(1), order*ndim*nei%number_of_elements, &
                  MPI_DOUBLE_PRECISION, nei%dest, tg, MPI_COMM_WORLD, reqr(ir), ierr)
enddo LOOP_RECV

! post non-blocking sends to all neighbouring ranks
LOOP_SEND: do i = 1, N
   nei      => neighbours(i)
   nei_send => neighbours_send(i)
   tg = 1
   is = is + 1
   call MPI_ISEND(nei_send%data(1), order*ndim*nei%number_of_elements, &
                  MPI_DOUBLE_PRECISION, nei%dest, tg, MPI_COMM_WORLD, reqs(is), ierr)
enddo LOOP_SEND

! wait only for the receives to complete
call MPI_WAITALL(ir, reqr, statusr, ierr)
When I increase the size of the send/recv arrays (order is increased, so the arrays become two or three times larger), a deadlock occurs at a random iteration far from the start of the simulation; for order=1 the simulation finishes successfully.
This occurs both on my workstation (Intel i5-10400, 6 processes) and on the cluster (2x Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz, 20 cores each) with process counts ranging from 40 to 200.
In one of the cluster runs, the following messages were printed while the job was deadlocked:
dapl async_event CQ (0x22f6d70) ERR 0
dapl_evd_cq_async_error_callback (0x22c0d90, 0x22f6ed0, 0x7ff2de9f2bf0, 0x22f6d70)
dapl async_event QP (0x2275f00) Event 1
(On my PC I use Intel mpiifort 2021.11.1; on the cluster, version 19.1.0.)
Does anyone know what goes wrong and how to fix it?
Spiros
Got it. Thank you!
@Spiros sorry, that is a very old version of Intel MPI. Can you please try the latest version available?
I tried the 2024 mpiifx compiler overnight with 6 ranks, and unfortunately a deadlock still occurs after several hours.
However, I noticed from my system's process monitor that every process accumulates memory gradually: each starts at about 167 MB, and by the time the deadlock occurs every process is using 1.5-3 GB.
Also, the send/recv buffers are allocated before and deallocated after each data exchange.
I have found the solution to my problem.
For anyone interested: I was not calling the waitall subroutine for the send requests (I was only calling it for the recv requests), and this subroutine is what completes the operations, deallocates the requests, and sets the corresponding handles to MPI_REQUEST_NULL.
This created a memory leak which, after many iterations, resulted in some kind of MPI deadlock that I cannot interpret.
If anyone can give an insight into why a deadlock occurred instead of a crash from exceeding the per-rank memory limit (which is predefined in the batch script), it would be much appreciated.
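For completeness, this is roughly what the completion step looks like now (a minimal sketch; statuss is a second status array declared like statusr, and N is the number of neighbours, so any names beyond those in my snippet above are just illustrative):

use mpi
integer :: ierr, ir, is
integer :: reqr(N), reqs(N)
integer :: statusr(MPI_STATUS_SIZE, N), statuss(MPI_STATUS_SIZE, N)

! ... post the MPI_IRECVs into reqr(1:ir) and the MPI_ISENDs into reqs(1:is) as above ...

! complete the receives before reading the nei_recv%data buffers
call MPI_WAITALL(ir, reqr, statusr, ierr)
! complete the sends as well, so that every request is freed
! (this was the call missing from my original code)
call MPI_WAITALL(is, reqs, statuss, ierr)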
@Spiros glad that you found your problem. For such a simple code, I highly doubt that the error is related to the MPI implementation.
You can still use -check_mpi to check for such errors.
Another note: for performance reasons you don't want to deallocate/allocate the send/recv buffers every iteration; keeping them alive will give you performance benefits, as in the sketch below.
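Something along these lines, for example (just a sketch reusing the names from your snippet; iter and max_iterations are placeholders for your time loop):

! allocate the communication buffers once, before the time loop
do i = 1, N
   allocate(neighbours_send(i)%data(order*ndim*neighbours(i)%number_of_elements))
   allocate(neighbours_recv(i)%data(order*ndim*neighbours(i)%number_of_elements))
enddo

do iter = 1, max_iterations
   ! pack boundary data, post MPI_IRECV/MPI_ISEND,
   ! then MPI_WAITALL on both the recv and the send request arrays
enddo

! free the buffers only after the time loop has finished
do i = 1, N
   deallocate(neighbours_send(i)%data)
   deallocate(neighbours_recv(i)%data)
enddo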