Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

MPI Isend/Irecv & Waitall deadlock

youn__kihang
Novice

Hi All,

I would like to report a deadlock with nonblocking send/recv and waitall.

In the following code section, the model deadlocks (an infinite wait).

The number of MPI processes does not seem to matter, but we use about 8,000; when the run is repeated 50 times, the deadlock occurs randomly in about 1 or 2 of them. I don't see any particular hardware issues, and runtime options such as I_MPI_HYDRA_BRANCH_COUNT helped with a similar synchronization issue in MPI_FINALIZE, so I would like to know whether there are any useful runtime options for this case as well.

Please also let me know of any improvements for synchronization control in the code below.

  ! Post a nonblocking receive for every rank that sends to this one
  DO i = 0, nproc-1
    IF (n_recvfrom(i) > 0) THEN
      CALL mpl_irecv(recv_array(1,i), n_recvfrom(i), send_type_interpolation, &
                     i, 100, GlobalComm, recv_reqs(n_recv_reqs), info)
      n_recv_reqs = n_recv_reqs + 1
    END IF
  END DO

  ! Post a nonblocking send to every rank this one sends to
  DO i = 0, nproc-1
    IF (n_sendto(i) > 0) THEN
      CALL mpl_isend(send_array(1,i), n_sendto(i), send_type_interpolation, &
                     i, 100, GlobalComm, send_reqs(n_send_reqs), info)
      n_send_reqs = n_send_reqs + 1
    END IF
  END DO

  ! Wait on the receive requests only; the send requests are not waited on here
  IF (n_recv_reqs > 0) THEN
    CALL mpl_waitall(n_recv_reqs, recv_reqs, recv_istat, info)
  END IF
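One thing I am unsure about: the send requests (send_reqs) are never waited on in this code, although as far as I understand, MPI requires every nonblocking operation to be completed with a wait or test call, and outstanding send requests could exhaust internal resources over many repeats. Below is a minimal sketch of completing both request sets. It assumes the counters were initialized to 1 before the loops (so the number of posted requests is one less than the counter), and send_istat is a placeholder for a status array declared like recv_istat:

  ! Sketch only: complete BOTH request sets after the communication loops.
  ! Assumes n_recv_reqs and n_send_reqs started at 1, so (counter - 1)
  ! requests were actually posted; send_istat is a placeholder declaration.
  IF (n_recv_reqs > 1) THEN
    CALL mpl_waitall(n_recv_reqs-1, recv_reqs, recv_istat, info)
  END IF
  IF (n_send_reqs > 1) THEN
    CALL mpl_waitall(n_send_reqs-1, send_reqs, send_istat, info)
  END IF

If the counters instead start at 0, the original count arguments are already correct; either way, the count passed to mpl_waitall has to match the number of requests actually posted, and the send requests need a matching waitall of their own.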

 

Thank you in advance,

Kihang

VarshaS_Intel
Moderator

Hi,

Thanks for reaching out to us.

Could you please let us know the OS, CPU details, and Intel oneAPI version you are using?

>>The number of MPI processes does not seem to matter, but we use about 8,000; when the run is repeated 50 times, the deadlock occurs randomly in about 1 or 2 of them.

What do you mean by repeated 50 times? Could you please elaborate on this statement?

Could you please provide us with a sample reproducer code, along with the steps to reproduce the issue, so that we can investigate it further?

If you have any logs, please share them with us. If not, could you please include I_MPI_DEBUG=30 and FI_LOG_LEVEL=debug in the run command? Please find an example command below:

I_MPI_DEBUG=30 FI_LOG_LEVEL=debug mpirun -n <no-of-proc> -ppn <proc-per-node> ./a.out

Also, could you let us know which OFI provider you are using?

Thanks & Regards,

Varsha
youn__kihang
Novice

Hi Varsha,

Thank you for your response.

>>Could you please let us know the OS, CPU details, and Intel oneAPI version you are using?

OS: CentOS Linux 8.3.2011
CPU: Intel Xeon Platinum 8368Q
oneAPI: 2021.3.0

>>Could you please provide us with a sample reproducer code, along with the steps to reproduce the issue, so that we can investigate it further?

That is difficult right now because this code depends on several other source files. Let me see what I can put together.

>>Also, could you let us know which OFI provider you are using?

I am using the "mlx" provider.

Best Regards,

Kihang
youn__kihang
Novice

Hi All,

Could you recommend any suggestions?

Or is the information I have provided about my situation not enough to clarify what the problem is?

Thanks,

Kihang
VarshaS_Intel
Moderator

Hi,

>>That is difficult right now because this code depends on several other source files. Let me see what I can put together.

As you mentioned in your previous reply, we are waiting for the complete reproducer code. It would be a great help if you could provide a complete reproducer so that we can investigate your issue further.

Thanks & Regards,

Varsha
VarshaS_Intel
Moderator

Hi,

We have not heard back from you. Could you please provide us with the reproducer code so that we can investigate your issue further?

Thanks & Regards,

Varsha
VarshaS_Intel
Moderator

Hi,

We have not heard back from you. This thread will no longer be monitored by Intel. If you need further assistance, please post a new question.

Thanks & Regards,

Varsha