Intel® MPI Library

MPIDI_OFI_handle_cq_error(1042): OFI poll failed

Lumos
Beginner

When I used Intel MPI to run CESM2_3 (ESCOMP/CESM: The Community Earth System Model, github.com), I could run it on a single node, but multi-node runs throw the following error:

Abort(806995855) on node 28 (rank 28 in comm 0): Fatal error in PMPI_Recv: Other MPI error, error stack:
PMPI_Recv(173).................: MPI_Recv(buf=0x2b0d17cef010, count=8838096, MPI_DOUBLE, src=0, tag=9, comm=0xc400012d, status=0x7ffcae86c930) failed
MPID_Recv(590).................:
MPIDI_recv_unsafe(205).........:
MPIDI_OFI_handle_cq_error(1042): OFI poll failed (ofi_events.c:1042:MPIDI_OFI_handle_cq_error:Transport endpoint is not connected)

 

My mpirun version: Intel(R) MPI Library for Linux* OS, Version 2021.1 Build 20201112 (id: b9c9d2fc5)
Copyright 2003-2020, Intel Corporation.

 

Could anyone suggest how I can resolve this problem?

VeenaJ_Intel
Moderator

Hi,

 

Thanks for posting in Intel communities!

 

We would like to point out that the MPI version you are using (2021.1) is quite old. We would appreciate it if you could test the same scenario with the latest MPI version, which you can download from the following link:

 

https://www.intel.com/content/www/us/en/developer/tools/oneapi/hpc-toolkit-download.html

 

For Standalone:

 

https://www.intel.com/content/www/us/en/developer/articles/tool/oneapi-standalone-components.html#mpi

 

Moreover, we would be grateful if you could run the Intel MPI Benchmarks (IMB) on both single and multiple nodes and share the results with us. Please execute the IMB using the following command:

 

mpirun -n 2 IMB-MPI1
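For the multi-node run, an invocation along the following lines could be used; this is only a sketch, and the host names, total rank count, and per-node rank count are placeholders to adapt to your cluster.

mpirun -n 56 -ppn 28 -hosts node1,node2 IMB-MPI1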

 

To assist you effectively, kindly provide the following details:

 

  1. Linux OS Flavor
  2. Output of the "lscpu" command
  3. Hardware details
  4. Detailed steps for recreating the scenario.

 

Your cooperation is greatly appreciated!

 

Regards,

Veena

 

Lumos
Beginner

Hi Veena,

 

Thank you for your reply.

 

Does this have anything to do with the MPI version? I tried to install the latest version, but since my system is CentOS 7 there were some warnings during installation. What impact do these have? For now, I have not updated and am still using MPI 2021.1.

 

Since it's on a cluster, I don't know much about the hardware details. Some operational details can be seen here: BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES | DiscussCESM Forums

 

Output of the "lscpu" command: 

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 56
On-line CPU(s) list: 0-55
Thread(s) per core: 2
Core(s) per socket: 14
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz
Stepping: 4
CPU MHz: 1000.000
CPU max MHz: 2601.0000
CPU min MHz: 1000.0000
BogoMIPS: 5200.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 19712K
NUMA node0 CPU(s): 0-13,28-41
NUMA node1 CPU(s): 14-27,42-55
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts

 

In addition, the results of mpirun -n 2 IMB-MPI1 are shown in the attachment. Thank you again for your reply.

 

 

Best regards,  
Lumos

 

Lumos
Beginner

Hi Veena,

 

I followed your suggestion and used the new MPI 2021.11 to run the same case, and got the following error:

 

[mpiexec@node20] Error: Unable to run bstrap_proxy on node20 (pid 112637, exit code 15)
[mpiexec@node20] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:157): check exit codes error
[mpiexec@node20] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:206): poll for event error
[mpiexec@node20] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:1063): error waiting for event
[mpiexec@node20] Error setting up the bootstrap proxies
[mpiexec@node20] Possible reasons:
[mpiexec@node20] 1. Host is unavailable. Please check that all hosts are available.
[mpiexec@node20] 2. Cannot launch hydra_bstrap_proxy or it crashed on one of the hosts.
[mpiexec@node20] Make sure hydra_bstrap_proxy is available on all hosts and it has right permissions.
[mpiexec@node20] 3. Firewall refused connection.
[mpiexec@node20] Check that enough ports are allowed in the firewall and specify them with the I_MPI_PORT_RANGE variable.
[mpiexec@node20] 4. pbs bootstrap cannot launch processes on remote host.
[mpiexec@node20] You may try using -bootstrap option to select alternative launcher.
pbsdsh(): error from tm_poll() 17002
pbsdsh(): error from tm_poll() 17002
pbsdsh(): error from tm_poll() 17002
pbsdsh(): error from tm_poll() 17002
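For reference, the workarounds that the launcher message points to (items 3 and 4) would look roughly like this; the port range and the ssh fallback are only illustrative assumptions that I have not yet verified on this cluster:

export I_MPI_PORT_RANGE=50000:50100
export I_MPI_HYDRA_BOOTSTRAP=ssh   # equivalent to passing -bootstrap ssh to mpirun
# then re-run the same mpirun command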

 

That might help? Thank you again for your reply.

 

Best regards,  
Lumos

 

VeenaJ_Intel
Moderator

Hi,

 

Thank you for trying the latest version and promptly sharing the results with us. To enable a more in-depth analysis with more detailed logs, could you kindly run the Intel MPI Benchmarks (IMB) again after setting the following environment variable:

 

export I_MPI_DEBUG=10

 

Kindly ensure that you use the same number of nodes and processes as in the case where the issue occurs. This step will be instrumental in pinpointing whether the problem lies on the MPI side or the application side.
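As a concrete sketch, the debug run could look like the following; the hostfile contents, rank counts, and log file name are placeholders to match the failing configuration.

export I_MPI_DEBUG=10
mpirun -n 56 -ppn 28 -f hostfile IMB-MPI1 > imb_debug.log 2>&1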

 

We deeply appreciate your cooperation in undertaking this task. Should you have any further inquiries or encounter any challenges during this process, please don't hesitate to reach out.

 

Regards,

Veena

 

Lumos
Beginner

Hi Veena,

 

I did as you suggested, but the error didn't change. Also, my netCDF and PnetCDF were compiled with MPI 2021.1.1; does this matter? Do I need to recompile them with the latest MPI?

 

Best regards,   
Lumos

VeenaJ_Intel
Moderator

Hi,

 

We suggested a software upgrade because we believe it may address some issues through the fixes implemented in the latest version. It appears that the problem persists. We recommended running the IMB (Intel MPI Benchmarks) to gain insights into whether the issue originates from the MPI framework or the application itself. 

 

To conduct a more in-depth analysis of the issue, it would be beneficial to obtain more detailed logs during the execution of IMB. Additionally, it would be helpful if you used the same number of nodes and processes as when running your code.

 

To facilitate the generation of comprehensive logs, we kindly request that you set the following environment variable:

 

export I_MPI_DEBUG=10

 

Kindly share these enhanced logs with us. Your collaboration in sharing this information is highly appreciated.

 

Regards,

Veena

 

Lumos
Beginner

Hi Veena,

 

Thank you for your reply.

 

Will updating really help? I may need a lot of time to recompile all the libraries, which is a big project, and for now the new version of MPI doesn't seem to solve the problem. Could you offer any other advice? In the meantime, I'll take the time to try out your suggestions.

 

Best regards,   
Lumos

Guoqi_Ma
Beginner

Dear Veena, I have also met a similar error:

 

Abort(135904911) on node 29 (rank 29 in comm 0): Fatal error in internal_Iprobe: Other MPI error, error stack:
internal_Iprobe(14309).........: MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm=0x84000006, flag=0x7ffd67c9e120, status=0x7ffd67c9e540) failed
MPID_Iprobe(389)...............:
MPIDI_Progress_test(105).......:
MPIDI_OFI_handle_cq_error(1127): OFI poll failed (ofi_events.c:1127:MPIDI_OFI_handle_cq_error:Transport endpoint is not connected)

 

When I run on fewer than 10 nodes of our HPC cluster there are no such errors, but when running a large model across more than 10 nodes I always hit this error. Before the upgrade of our HPC system there was no such error either. I have attached the error, details of the HPC system, and the benchmark log. Please have a look at them.

 

VeenaJ_Intel
Moderator

Hi,

 

If MPI is dynamically linked, there is no need to recompile the code; you can simply source the new environment and use it. To conduct a more in-depth analysis and identify the root cause of the issue you are encountering, we kindly request that you provide enhanced logs by running the IMB with the version you're using (2021.1). This will help us better understand the situation. Without finding the actual root cause, we cannot guarantee that the issue will be resolved by upgrading to the latest version. Your cooperation in providing the enhanced log information is highly valued.
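For example, sourcing the new environment and checking the linkage could look like this; the default oneAPI install path and the binary name cesm.exe are assumptions to adapt to your setup.

source /opt/intel/oneapi/setvars.sh
ldd ./cesm.exe | grep -i libmpi    # confirm which MPI library the dynamically linked binary resolves to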

 

Regards,

Veena

 

Lumos
Beginner

Hi Veena,

 

Thank you for your reply.

 

The results of mpirun -n 2 IMB-MPI1 are shown in the attachment.

 

Best regards,   
Lumos

VeenaJ_Intel
Moderator

Hi,

 

Thanks for sharing the logs. We are working on this internally. We will get back to you soon with an update.

 

Regards,

Veena

 

Lumos
Beginner

Hi,

Thank you for your reply. Any help would be most appreciated.

Best regards,   
Lumos

VeenaJ_Intel
Moderator

Hi,

 

We have analyzed the logs you provided. As mentioned before, could you please provide enhanced logs by running the IMB with the same number of nodes and ranks as in the reported issue? Your assistance in this matter is greatly appreciated.

 

Regards,

Veena

 

Lumos
Beginner

Hi,

Thank you for your reply. The results of mpirun -n 140 IMB-MPI1 are shown in the attachment.

Best regards,    
Lumos

Lumos
Beginner

By the way, some operational details can be seen here: BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES | DiscussCESM Forums

I recommend checking these details out.

TobiasK
Moderator

@Lumos 
In the thread I noticed that you can now run your simulation without problems; is that correct?
Just a note here: please use mpiifx, mpiicx, and mpiicpx when you use oneAPI 2024.0 or later.
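For reference, the mapping from the classic wrappers to the LLVM-based ones is roughly as follows; the source file names here are placeholders.

mpiifx  -o hello hello.f90    # Fortran, replaces mpiifort
mpiicx  -o hello hello.c      # C, replaces mpiicc
mpiicpx -o hello hello.cpp    # C++, replaces mpiicpc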

Lumos
Beginner

Sorry, I forgot to post an update here. I used oneAPI 2022.3, and the simulation ran successfully.

However, I found that icx generated errors when compiling an older version of jasper-1.900.1, and hdf5-1_14_3 failed its make check. Moreover, I could not use oneAPI 2024.0 in a Rocky 9.3 virtual machine, as it caused the system to fail to boot and run.

Do I have to update to the latest oneAPI?

Thank you for your attention and help all the time.

 

TobiasK
Moderator

Regarding the error with icx, we strongly recommend using the latest version. There was a patch release, 2024.0.1, that contains the latest patches; please check whether you still have the original release. Our system requirements do not list Rocky 9.3, so you might need to switch to a supported version. Which hypervisor / host OS are you trying to run on?

Lumos
Beginner

Okay, I'll try the latest version when I have time. Where can I see the supported versions? Also, we have a default Intel 2017 compiler on our HPC cluster; what impact will icc 2017 have? I am using VMware Workstation 17 Pro 17.0.2 build-21581411.

Thanks again.

TobiasK
Moderator

icc 2017 is quite old and is no longer supported; if there are errors, you will have to rely on your local HPC support. The support matrix can be found here:

https://www.intel.com/content/www/us/en/developer/articles/system-requirements/intel-oneapi-base-toolkit-system-requirements.html

 
