Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

New MPI error with Intel 2019.3, unable to run MPIRUN

周__浩
Beginner

Hello everyone

The error is as follows:

Abort(1094543) on node 63 (rank 63 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(649)......:
MPID_Init(863).............:
MPIDI_NM_mpi_init_hook(705): OFI addrinfo() failed (ofi_init.h:705:MPIDI_NM_mpi_init_hook:No data available)
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
cp2k.popt 000000000C8CF8FB Unknown Unknown Unknown
libpthread-2.17.s 00002AC15DB4E5D0 Unknown Unknown Unknown
libpthread-2.17.s 00002AC15DB4D680 write Unknown Unknown
libibverbs.so.1.0 00002AC26A741E29 ibv_exp_cmd_creat Unknown Unknown
libmlx4-rdmav2.so 00002AC26B255355 Unknown Unknown Unknown
libmlx4-rdmav2.so 00002AC26B254199 Unknown Unknown Unknown
libibverbs.so.1.0 00002AC26A7401C3 ibv_create_qp Unknown Unknown
libverbs-fi.so 00002AC26A2E2017 Unknown Unknown Unknown
libverbs-fi.so 00002AC26A2E2CA0 Unknown Unknown Unknown
libverbs-fi.so 00002AC26A2D8FD1 fi_prov_ini Unknown Unknown
libfabric.so.1 00002AC1698A1189 Unknown Unknown Unknown
libfabric.so.1 00002AC1698A1610 Unknown Unknown Unknown
libfabric.so.1 00002AC1698A22AB fi_getinfo Unknown Unknown
libfabric.so.1 00002AC1698A6766 fi_getinfo Unknown Unknown
libmpi.so.12.0.0 00002AC15EF18EB6 Unknown Unknown Unknown
libmpi.so.12.0.0 00002AC15EF0DD1C MPI_Init Unknown Unknown
libmpifort.so.12. 00002AC15E65ECFB MPI_INIT Unknown Unknown
cp2k.popt 000000000366B30E message_passing_m 747 message_passing.F
cp2k.popt 000000000139B7BA f77_interface_mp_ 234 f77_interface.F
cp2k.popt 000000000043CCA5 MAIN__ 198 cp2k.F


The attachment has the debug details.

PrasanthD_intel
Moderator

Hi,


It looks like you were using the sockets provider; could you please check with TCP once? You can set FI_PROVIDER=tcp to use TCP as the provider.

eg: export FI_PROVIDER=tcp
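For example, a minimal sketch of a TCP-provider run (the executable name and rank count below are placeholders, not taken from your job):

# select the libfabric TCP provider before launching
export FI_PROVIDER=tcp
# placeholder launch line; substitute your own executable and rank count
mpirun -n 4 ./your_mpi_app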


Please provide us with a sample reproducer code so we can test it at our end.

Also, please provide us with the mpirun command line you used and your environment details (interconnect, total nodes, OS).


Regards

Prasanth


周__浩
Beginner

Thanks a lot for the reply.
My system is CentOS 7.6. The cluster has 64 compute nodes in total, and I run in parallel across 2 of them. When using the CP2K program, the command is: mpirun -n 128 cp2k.popt -i cp2k.inp 1>cp2k.out 2>cp2k.err

This is the result of the fi_info query:

[root@k0203 ~]# fi_info
provider: verbs;ofi_rxm
fabric: IB-0x18338657682652659712
domain: mlx4_0
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXM
provider: verbs
fabric: IB-0x18338657682652659712
domain: mlx4_0
version: 1.0
type: FI_EP_MSG
protocol: FI_PROTO_RDMA_CM_IB_RC
provider: verbs
fabric: IB-0x18338657682652659712
domain: mlx4_0-rdm
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_IB_RDM
provider: verbs
fabric: IB-0x18338657682652659712
domain: mlx4_0-dgram
version: 1.0
type: FI_EP_DGRAM
protocol: FI_PROTO_IB_UD
provider: UDP
fabric: UDP-IP
domain: udp
version: 1.1
type: FI_EP_DGRAM
protocol: FI_PROTO_UDP
provider: UDP
fabric: UDP-IP
domain: udp
version: 1.1
type: FI_EP_DGRAM
protocol: FI_PROTO_UDP
provider: UDP
fabric: UDP-IP
domain: udp
version: 1.1
type: FI_EP_DGRAM
protocol: FI_PROTO_UDP
provider: UDP
fabric: UDP-IP
domain: udp
version: 1.1
type: FI_EP_DGRAM
protocol: FI_PROTO_UDP
provider: UDP
fabric: UDP-IP
domain: udp
version: 1.1
type: FI_EP_DGRAM
protocol: FI_PROTO_UDP
provider: UDP
fabric: UDP-IP
domain: udp
version: 1.1
type: FI_EP_DGRAM
protocol: FI_PROTO_UDP
provider: UDP
fabric: UDP-IP
domain: udp
version: 1.1
type: FI_EP_DGRAM
protocol: FI_PROTO_UDP
provider: sockets
fabric: 172.25.0.0/16
domain: eno1
version: 2.0
type: FI_EP_MSG
protocol: FI_PROTO_SOCK_TCP
provider: sockets
fabric: 172.25.0.0/16
domain: eno1
version: 2.0
type: FI_EP_DGRAM
protocol: FI_PROTO_SOCK_TCP
provider: sockets
fabric: 172.25.0.0/16
domain: eno1
version: 2.0
type: FI_EP_RDM
protocol: FI_PROTO_SOCK_TCP
provider: sockets
fabric: 10.25.0.0/16
domain: ib0
version: 2.0
type: FI_EP_MSG
protocol: FI_PROTO_SOCK_TCP
provider: sockets
fabric: 10.25.0.0/16
domain: ib0
version: 2.0
type: FI_EP_DGRAM
protocol: FI_PROTO_SOCK_TCP
provider: sockets
fabric: 10.25.0.0/16
domain: ib0
version: 2.0
type: FI_EP_RDM
protocol: FI_PROTO_SOCK_TCP
provider: sockets
fabric: 192.168.122.0/24
domain: virbr0
version: 2.0
type: FI_EP_MSG
protocol: FI_PROTO_SOCK_TCP
provider: sockets
fabric: 192.168.122.0/24
domain: virbr0
version: 2.0
type: FI_EP_DGRAM
protocol: FI_PROTO_SOCK_TCP
provider: sockets
fabric: 192.168.122.0/24
domain: virbr0
version: 2.0
type: FI_EP_RDM
protocol: FI_PROTO_SOCK_TCP
provider: sockets
fabric: 127.0.0.0/8
domain: lo
version: 2.0
type: FI_EP_MSG
protocol: FI_PROTO_SOCK_TCP
provider: sockets
fabric: 127.0.0.0/8
domain: lo
version: 2.0
type: FI_EP_DGRAM
protocol: FI_PROTO_SOCK_TCP
provider: sockets
fabric: 127.0.0.0/8
domain: lo
version: 2.0
type: FI_EP_RDM
protocol: FI_PROTO_SOCK_TCP
provider: shm
fabric: shm
domain: shm
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_SHM

周__浩
Beginner

Hi

I can run normally with export FI_PROVIDER=tcp, but the impact on speed is significant. I want to use the RDMA protocol, so I tried export I_MPI_DEVICE=rdma:ofa-v2-ib0, and the error above occurred.
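For reference, this is roughly what I set, using the same launch line as above (I realize I_MPI_DEVICE may be a legacy, pre-2019 selector that Intel MPI 2019 ignores):

# legacy-style device selection; Intel MPI 2019 uses FI_PROVIDER / I_MPI_FABRICS instead
export I_MPI_DEVICE=rdma:ofa-v2-ib0
mpirun -n 128 cp2k.popt -i cp2k.inp 1>cp2k.out 2>cp2k.err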

PrasanthD_intel
Moderator

Hi,


From the output of fi_info, we can see that you have the verbs provider available, which means you are likely running over InfiniBand (where the mlx provider should also work).

Please set FI_PROVIDER=mlx and I_MPI_FABRICS=shm:ofi for better performance over RDMA.
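For example, a minimal sketch using the same launch line you posted earlier:

# use the UCX-based mlx provider between nodes, with shared memory inside a node
export FI_PROVIDER=mlx
export I_MPI_FABRICS=shm:ofi
mpirun -n 128 cp2k.popt -i cp2k.inp 1>cp2k.out 2>cp2k.err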

Let us know if you face any issues.


Regards

Prasanth



周__浩
Beginner

Hi, and happy New Year!

Following the method you suggested, I added export FI_PROVIDER=mlx and export I_MPI_FABRICS=shm:ofi, but a new error occurred:

"Abort(1094543) on node 45 (rank 45 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(649)......:
MPID_Init(863).............:
MPIDI_NM_mpi_init_hook(705): OFI addrinfo() failed (ofi_init.h:705:MPIDI_NM_mpi_init_hook:No data available)
"

PrasanthD_intel
Moderator

Hi,


Could you please set FI_LOG_LEVEL=debug and I_MPI_DEBUG=10 and provide us with the debug logs?

eg: export FI_LOG_LEVEL=debug

export I_MPI_DEBUG=10
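For example, a minimal sketch with the same launch line as before (the libfabric and Intel MPI diagnostics will end up in the cp2k.out / cp2k.err files given these redirections):

# enable verbose libfabric and Intel MPI startup diagnostics
export FI_LOG_LEVEL=debug
export I_MPI_DEBUG=10
mpirun -n 128 cp2k.popt -i cp2k.inp 1>cp2k.out 2>cp2k.err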


Regards

Prasanth


周__浩
Beginner

Hi

This is the log with debug mode enabled

PrasanthD_intel
Moderator

Hi,


You might be using older InfiniBand hardware. Please check whether you have all the transports required for mlx to work.

Please refer to this article (Improve Performance and Stability with Intel® MPI Library on...) for the transports required for mlx to work. Let us know which transports you have, using the command ucx_info -d | grep Transport.

The minimum required UCX framework version is 1.4+. Please check your UCX version by using the command ucx_info -v.
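For example:

# list the UCX transports available on this node (look for dc, rc, ud, ...)
ucx_info -d | grep Transport
# check the installed UCX version (1.4 or newer is required for mlx)
ucx_info -v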

Also, if possible, please update to the latest version of IMPI (2019u9).


Regards

Prasanth


周__浩
Beginner

Hi

I'm very sorry for the late reply; I was busy with work.

This is the result I found.

周__浩_0-1610103681814.png


PrasanthD_intel
Moderator

Hi,


Looking at the available transports on your system, we can see that the dc transport is missing.

You can follow the steps mentioned in this article - Improve Performance and Stability with Intel® MPI Library on.... Let us know if you face any errors.

Are you facing the same error with the verbs provider? (Check with FI_PROVIDER=verbs.)
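As a rough sketch of both checks (the UCX_TLS transport list below is an assumption on my part; the exact setting for older hardware without dc is in the linked article):

# attempt the mlx provider while restricting UCX to transports the hardware supports
export UCX_TLS=rc,ud,sm,self
export FI_PROVIDER=mlx
export I_MPI_FABRICS=shm:ofi

# or, as a fallback, try the plain verbs provider
export FI_PROVIDER=verbs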


Regards

Prasanth


周__浩
Beginner

Hi

Thank you very much. I will run a more detailed test and let you know if there is any progress.

PrasanthD_intel
Moderator

Hi,


Let us know if the workaround given in the above-mentioned article works for you.


Regards

Prasanth


周__浩
Beginner

Hi

No error is reported when using FI_PROVIDER=verbs, thank you very much.

However, the cross-node efficiency is very low, and I am still investigating the problem.

PrasanthD_intel
Moderator

Hi,


Have you tried the mlx provider with the given workaround? mlx would be recommended over verbs.


Regards

Prasanth


PrasanthD_intel
Moderator

Hi,


We haven't heard back from you. Let us know if your issue isn't resolved.


Regards

Prasanth


PrasanthD_intel
Moderator

Hi,


I am closing this thread assuming your issue has been resolved. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.


Regards

Prasanth

