Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

MPIDI_OFI_handle_cq_error(1042): OFI poll failed

Lumos
Beginner

When I used Intel MPI to run CESM2_3 (ESCOMP/CESM: The Community Earth System Model, github.com), it ran fine on a single node, but multi-node runs failed with errors like this:

Abort(806995855) on node 28 (rank 28 in comm 0): Fatal error in PMPI_Recv: Other MPI error, error stack:
PMPI_Recv(173).................: MPI_Recv(buf=0x2b0d17cef010, count=8838096, MPI_DOUBLE, src=0, tag=9, comm=0xc400012d, status=0x7ffcae86c930) failed
MPID_Recv(590).................:
MPIDI_recv_unsafe(205).........:
MPIDI_OFI_handle_cq_error(1042): OFI poll failed (ofi_events.c:1042:MPIDI_OFI_handle_cq_error:Transport endpoint is not connected)

 

My mpirun version: Intel(R) MPI Library for Linux* OS, Version 2021.1 Build 20201112 (id: b9c9d2fc5)
Copyright 2003-2020, Intel Corporation.

 

Could anyone suggest how I can resolve this problem?
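The "Transport endpoint is not connected" text in the error stack usually points at the OFI (libfabric) transport layer rather than the application itself. A common first diagnostic, sketched below, is to turn on Intel MPI's debug output to see which provider is selected and then try forcing a different libfabric provider to rule out a broken default. The process count, hostfile name, and executable name are placeholders for your original run, and "tcp" is only one candidate provider, not a confirmed fix:

```shell
# Print Intel MPI startup details, including the OFI provider it picks
# (level 10 is verbose enough to show fabric/provider selection).
export I_MPI_DEBUG=10

# Force a specific libfabric provider to bypass a possibly broken default.
# "tcp" is slow but works on almost any network; providers such as
# "verbs" or "psm2" depend on the interconnect present on your nodes.
export FI_PROVIDER=tcp

# Re-run the multi-node job exactly as before; the -n count, hostfile,
# and executable below are placeholders for the original command line.
mpirun -n 64 -f hostfile ./cesm.exe
```

If the job succeeds with FI_PROVIDER=tcp, the problem is likely in the high-speed fabric setup (drivers, firmware, or provider selection) rather than in CESM or MPI itself.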

Lumos
Beginner

OK, thank you for your reply. I have no further questions for the time being.

Guoqi_Ma
Beginner

Dear Lumos, I also encounter a similar error when using MUMPS. May I ask whether you have managed to resolve it?

 

Abort(135904911) on node 29 (rank 29 in comm 0): Fatal error in internal_Iprobe: Other MPI error, error stack:
internal_Iprobe(14309).........: MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm=0x84000006, flag=0x7ffd67c9e120, status=0x7ffd67c9e540) failed
MPID_Iprobe(389)...............:
MPIDI_Progress_test(105).......:
MPIDI_OFI_handle_cq_error(1127): OFI poll failed (ofi_events.c:1127:MPIDI_OFI_handle_cq_error:Transport endpoint is not connected)
