Dear all,
I am testing my pure-MPI Fortran code, which performs calculations on partitioned mesh blocks in every iteration of the simulation. Rank-boundary data are therefore exchanged with neighboring ranks via non-blocking send/recv, as shown below.
reqs = MPI_REQUEST_NULL
ir = 0
is = 0
! ****************************************
LOOP_RECV: do i=1,N
    nei => neighbours(i)
    nei_recv => neighbours_recv(i)
    tg = 1
    ir = ir + 1
    call MPI_IRECV(nei_recv%data(1), order*ndim*nei%number_of_elements, &
                   MPI_DOUBLE_PRECISION, nei%dest, tg, MPI_COMM_WORLD, reqr(ir), ierr)
enddo LOOP_RECV
! ****************************************
! ****************************************
LOOP_SEND: do i=1,N
    nei => neighbours(i)
    nei_send => neighbours_send(i)
    tg = 1
    is = is + 1
    call MPI_ISEND(nei_send%data(1), order*ndim*nei%number_of_elements, &
                   MPI_DOUBLE_PRECISION, nei%dest, tg, MPI_COMM_WORLD, reqs(is), ierr)
enddo LOOP_SEND
! ****************************************
call mpi_waitall(ir,reqr,statusr,ierr)
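For anyone who wants to reproduce the pattern in isolation, here is a minimal standalone sketch of the same exchange. It is not the actual code: the program name `halo_exchange_sketch`, the ring topology, and the sizes `nneigh`, `n` and `niter` are all assumptions made up for illustration. The excerpt above only shows a wait on `reqr`, and whether the full code also completes `reqs` elsewhere is not visible here. The sketch waits on both arrays, because MPI requires every non-blocking request to be completed (or freed) before its buffer is reused.

```fortran
program halo_exchange_sketch
  use mpi
  implicit none

  ! All sizes below are hypothetical and chosen only for illustration.
  integer, parameter :: nneigh = 2     ! left and right neighbour on a ring
  integer, parameter :: n      = 1000  ! elements per message
  integer, parameter :: niter  = 100   ! number of exchange iterations

  integer :: ierr, rank, nprocs, i, it, tg
  integer :: neigh(nneigh)
  integer :: reqr(nneigh), reqs(nneigh)
  double precision :: sendbuf(n, nneigh), recvbuf(n, nneigh)

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

  ! Ring neighbours (assumed topology, not the mesh partitioning of the post).
  neigh(1) = mod(rank - 1 + nprocs, nprocs)
  neigh(2) = mod(rank + 1, nprocs)
  tg = 1

  do it = 1, niter
    ! Reset both request arrays, not only the send requests.
    reqr = MPI_REQUEST_NULL
    reqs = MPI_REQUEST_NULL
    sendbuf = dble(rank)
    recvbuf = -1.0d0

    ! Post all receives first, then all sends.
    do i = 1, nneigh
      call MPI_IRECV(recvbuf(1, i), n, MPI_DOUBLE_PRECISION, neigh(i), tg, &
                     MPI_COMM_WORLD, reqr(i), ierr)
    end do
    do i = 1, nneigh
      call MPI_ISEND(sendbuf(1, i), n, MPI_DOUBLE_PRECISION, neigh(i), tg, &
                     MPI_COMM_WORLD, reqs(i), ierr)
    end do

    ! Complete the receives AND the sends before the buffers are touched again.
    call MPI_WAITALL(nneigh, reqr, MPI_STATUSES_IGNORE, ierr)
    call MPI_WAITALL(nneigh, reqs, MPI_STATUSES_IGNORE, ierr)

    ! Each neighbour sends its own rank, so the received data can be checked.
    do i = 1, nneigh
      if (any(recvbuf(:, i) /= dble(neigh(i)))) then
        print *, 'rank', rank, ': unexpected data in iteration', it
        call MPI_ABORT(MPI_COMM_WORLD, 1, ierr)
      end if
    end do
  end do

  if (rank == 0) print *, 'exchange completed'
  call MPI_FINALIZE(ierr)
end program halo_exchange_sketch
```

It can be built with the MPI compiler wrapper (for example `mpiifort`) and run with a few processes. Raising `n` mimics the larger messages that a higher order produces.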
When I increase the size of the send/recv arrays (a higher order means sending arrays two or three times as large), a deadlock occurs at a random iteration far from the start of the simulation. For order=1 the simulation finishes successfully.
This occurs both on my workstation (Intel i5-10400, 6 processes) and on the cluster (2x Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz with 20 cores each), with the number of processes ranging from 40 to 200.
In one of the cluster runs, the following message was printed while the job was deadlocked:
dapl async_event CQ (0x22f6d70) ERR 0
dapl_evd_cq_async_error_callback (0x22c0d90, 0x22f6ed0, 0x7ff2de9f2bf0, 0x22f6d70)
dapl async_event QP (0x2275f00) Event 1
(On my PC I use Intel mpiifort 2021.11.1; the cluster has version 19.1.0.)
Does anyone know what is going wrong and how to fix it?
Spiros
This post is an accidental duplicate of https://community.intel.com/t5/Intel-HPC-Toolkit/Deadlock-after-many-iterations-100-200k-of-successful-async/m-p/1573832