Intel® MPI Library

MPIDI_OFI_handle_cq_error(1042): OFI poll failed

Lumos
Beginner

When I used Intel MPI to run CESM2_3 (ESCOMP/CESM: The Community Earth System Model, github.com), I could run it on a single node, but multi-node runs throw the following error:

Abort(806995855) on node 28 (rank 28 in comm 0): Fatal error in PMPI_Recv: Other MPI error, error stack:
PMPI_Recv(173).................: MPI_Recv(buf=0x2b0d17cef010, count=8838096, MPI_DOUBLE, src=0, tag=9, comm=0xc400012d, status=0x7ffcae86c930) failed
MPID_Recv(590).................:
MPIDI_recv_unsafe(205).........:
MPIDI_OFI_handle_cq_error(1042): OFI poll failed (ofi_events.c:1042:MPIDI_OFI_handle_cq_error:Transport endpoint is not connected)

 

My mpirun version: Intel(R) MPI Library for Linux* OS, Version 2021.1 Build 20201112 (id: b9c9d2fc5)
Copyright 2003-2020, Intel Corporation.

 

Could anyone suggest how I can resolve this problem?

VeenaJ_Intel
Moderator

Hi,

 

Thanks for posting in Intel communities!

 

We would like to point out that the MPI version you are using (2021.1) is quite old. We would appreciate it if you could test the same scenario with the latest MPI version, which you can download from the following link:

 

https://www.intel.com/content/www/us/en/developer/tools/oneapi/hpc-toolkit-download.html

 

For Standalone:

 

https://www.intel.com/content/www/us/en/developer/articles/tool/oneapi-standalone-components.html#mpi

 

Moreover, we would be grateful if you could run the Intel MPI Benchmarks (IMB) on both single and multiple nodes and share the results with us. Please execute the IMB using the following command:

 

mpirun -n 2 IMB-MPI1
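For the multi-node run, an invocation along the following lines could be used; this is only a sketch, and the host names, total rank count, and per-node rank count are placeholders to adapt to your cluster.

mpirun -n 56 -ppn 28 -hosts node1,node2 IMB-MPI1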

 

To assist you effectively, kindly provide the following details:

 

  1. Linux OS Flavor
  2. Output of the "lscpu" command
  3. Hardware details
  4. Detailed steps for recreating the scenario.

 

Your cooperation is greatly appreciated!

 

Regards,

Veena

 

Lumos
Beginner

Hi Veena,

 

Thank you for your reply.

 

Does this have anything to do with the MPI version? I tried to install the latest version, but since my system is CentOS 7 there were some warnings during installation. What impact do these have? For now, I have not updated and am still using MPI 2021.1.

 

Since it's on a cluster, I don't know much about the hardware details. Some operational details can be seen here: BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES | DiscussCESM Forums

 

Output of the "lscpu" command: 

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 56
On-line CPU(s) list: 0-55
Thread(s) per core: 2
Core(s) per socket: 14
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz
Stepping: 4
CPU MHz: 1000.000
CPU max MHz: 2601.0000
CPU min MHz: 1000.0000
BogoMIPS: 5200.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 19712K
NUMA node0 CPU(s): 0-13,28-41
NUMA node1 CPU(s): 14-27,42-55
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts

 

In addition, the results of mpirun -n 2 IMB-MPI1 are shown in the attachment. Thank you again for your reply.

 

 

Best regards,  
Lumos

 

Lumos
Beginner

Hi Veena,

 

I followed your suggestion and used the new MPI 2021.11 to run the same case, and got the following error:

 

[mpiexec@node20] Error: Unable to run bstrap_proxy on node20 (pid 112637, exit code 15)
[mpiexec@node20] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:157): check exit codes error
[mpiexec@node20] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:206): poll for event error
[mpiexec@node20] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:1063): error waiting for event
[mpiexec@node20] Error setting up the bootstrap proxies
[mpiexec@node20] Possible reasons:
[mpiexec@node20] 1. Host is unavailable. Please check that all hosts are available.
[mpiexec@node20] 2. Cannot launch hydra_bstrap_proxy or it crashed on one of the hosts.
[mpiexec@node20] Make sure hydra_bstrap_proxy is available on all hosts and it has right permissions.
[mpiexec@node20] 3. Firewall refused connection.
[mpiexec@node20] Check that enough ports are allowed in the firewall and specify them with the I_MPI_PORT_RANGE variable.
[mpiexec@node20] 4. pbs bootstrap cannot launch processes on remote host.
[mpiexec@node20] You may try using -bootstrap option to select alternative launcher.
pbsdsh(): error from tm_poll() 17002
pbsdsh(): error from tm_poll() 17002
pbsdsh(): error from tm_poll() 17002
pbsdsh(): error from tm_poll() 17002
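For reference, the workarounds that the launcher message points to (items 3 and 4) would look roughly like this; the port range and the ssh fallback are only illustrative assumptions that I have not yet verified on this cluster:

export I_MPI_PORT_RANGE=50000:50100
export I_MPI_HYDRA_BOOTSTRAP=ssh   # equivalent to passing -bootstrap ssh to mpirun
# then re-run the same mpirun command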

 

That might help? Thank you again for your reply.

 

Best regards,  
Lumos

 

VeenaJ_Intel
Moderator

Hi,

 

Thank you for trying the latest version and promptly sharing the results with us. To enable a more in-depth analysis with more detailed logs, could you kindly run the Intel MPI Benchmarks (IMB) again after setting the following environment variable:

 

export I_MPI_DEBUG=10

 

Kindly ensure that you use the same number of nodes and processes as in the case where the issue occurs. This step will be instrumental in pinpointing whether the problem lies on the MPI side or the application side.
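As a concrete sketch, the debug run could look like the following; the hostfile contents, rank counts, and log file name are placeholders to match the failing configuration.

export I_MPI_DEBUG=10
mpirun -n 56 -ppn 28 -f hostfile IMB-MPI1 > imb_debug.log 2>&1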

 

We deeply appreciate your cooperation in undertaking this task. Should you have any further inquiries or encounter any challenges during this process, please don't hesitate to reach out.

 

Regards,

Veena

 

Lumos
Beginner

Hi Veena,

 

I did as you suggested, but the error didn't change. Also, my netCDF and PnetCDF were compiled with MPI 2021.1.1; does this matter? Do I need to recompile them with the latest MPI?

 

Best regards,   
Lumos

VeenaJ_Intel
Moderator

Hi,

 

We suggested a software upgrade because we believe it may address some issues through the fixes implemented in the latest version. It appears that the problem persists. We recommended running the IMB (Intel MPI Benchmarks) to gain insights into whether the issue originates from the MPI framework or the application itself. 

 

To conduct a more in-depth analysis of the issue, it would be beneficial to obtain more detailed logs during the execution of IMB. Additionally, it would be helpful if you used the same number of nodes and processes as when running your code.

 

To facilitate the generation of comprehensive logs, we kindly request that you set the following environment variable:

 

export I_MPI_DEBUG=10

 

Kindly share these enhanced logs with us. Your collaboration in sharing this information is highly appreciated.

 

Regards,

Veena

 

Lumos
Beginner

Hi Veena,

 

Thank you for your reply.

 

Will updating really help? I may need a lot of time to recompile all the libraries, which is a big project, and for now the new version of MPI doesn't seem to solve the problem. Could you offer any other advice? In the meantime, I'll take the time to try out your suggestions.

 

Best regards,   
Lumos

Guoqi_Ma
Beginner

Dear Veena, I have also met a similar error:

 

Abort(135904911) on node 29 (rank 29 in comm 0): Fatal error in internal_Iprobe: Other MPI error, error stack:
internal_Iprobe(14309).........: MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm=0x84000006, flag=0x7ffd67c9e120, status=0x7ffd67c9e540) failed
MPID_Iprobe(389)...............:
MPIDI_Progress_test(105).......:
MPIDI_OFI_handle_cq_error(1127): OFI poll failed (ofi_events.c:1127:MPIDI_OFI_handle_cq_error:Transport endpoint is not connected)

 

When I run on fewer than 10 nodes of our HPC cluster there are no such errors, but when running a large model across more than 10 nodes I always hit this error. Before the upgrade of our HPC system there was no such error either. I have attached the error, details of the HPC system, and the benchmark log. Please have a look at them.

 

VeenaJ_Intel
Moderator

Hi,

 

If MPI is dynamically linked, there is no need to recompile the code; you can simply source the new environment and use it. To conduct a more in-depth analysis and identify the root cause of the issue you are encountering, we kindly request that you provide enhanced logs by running the IMB with the version you're using (2021.1). This will help us better understand the situation. Without finding the actual root cause, we cannot guarantee that the issue will be resolved by upgrading to the latest version. Your cooperation in providing the enhanced log information is highly valued.
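For example, sourcing the new environment and checking the linkage could look like this; the default oneAPI install path and the binary name cesm.exe are assumptions to adapt to your setup.

source /opt/intel/oneapi/setvars.sh
ldd ./cesm.exe | grep -i libmpi    # confirm which MPI library the dynamically linked binary resolves to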

 

Regards,

Veena

 

Lumos
Beginner

Hi Veena,

 

Thank you for your reply.

 

The results of mpirun -n 2 IMB-MPI1 are shown in the attachment.

 

Best regards,   
Lumos

VeenaJ_Intel
Moderator

Hi,

 

Thanks for sharing the logs. We are working on this internally. We will get back to you soon with an update.

 

Regards,

Veena

 

Lumos
Beginner

Hi,

Thank you for your reply. Any help would be most appreciated.

Best regards,   
Lumos

VeenaJ_Intel
Moderator

Hi,

 

We have analyzed the logs you provided. As mentioned before, could you please provide enhanced logs by running the IMB with the same number of nodes and ranks as in the reported issue? Your assistance in this matter is greatly appreciated.

 

Regards,

Veena

 

Lumos
Beginner

Hi,

Thank you for your reply. The results of mpirun -n 140 IMB-MPI1 are shown in the attachment.

Best regards,    
Lumos

Lumos
Beginner

By the way, some operational details can be seen here: BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES | DiscussCESM Forums

I recommend checking these details out.

TobiasK
Moderator

@Lumos 
In the thread I noticed that you can now run your simulation without problems; is that correct?
Just a note here: please use mpiifx, mpiicx, and mpiicpx when you use oneAPI 2024.0 or later.
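For reference, the mapping from the classic wrappers to the LLVM-based ones is roughly as follows; the source file names here are placeholders.

mpiifx  -o hello hello.f90    # Fortran, replaces mpiifort
mpiicx  -o hello hello.c      # C, replaces mpiicc
mpiicpx -o hello hello.cpp    # C++, replaces mpiicpc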

Lumos
Beginner

Sorry, I forgot to post an update here. I used oneAPI 2022.3, and the simulation ran successfully.

However, I found that icx generated errors when compiling an older version of jasper-1.900.1, and hdf5-1_14_3 failed its make check. Moreover, I could not use oneAPI 2024.0 in a Rocky 9.3 virtual machine, as it caused the system to fail to boot and run.

Do I have to update to the latest oneAPI?

Thank you for your attention and help all the time.

 

TobiasK
Moderator

Regarding the error with icx, we strongly recommend using the latest version. There was a patch release, 2024.0.1, that contains the latest patches; please check whether you still have the original release. Our system requirements do not list Rocky 9.3, so you might need to switch to a supported version. Which hypervisor / host OS are you trying to run on?

Lumos
Beginner

Okay, I'll try the latest version when I have time. Where can I see the supported versions? Also, we have a default Intel 2017 compiler on our HPC cluster; what impact will icc 2017 have? I am using VMware Workstation 17 Pro 17.0.2 build-21581411.

Thanks again.

TobiasK
Moderator

icc 2017 is quite old and is no longer supported; if there are errors, you will have to rely on your local HPC support. The support matrix can be found here:

https://www.intel.com/content/www/us/en/developer/articles/system-requirements/intel-oneapi-base-toolkit-system-requirements.html

 
