Intel® oneAPI HPC Toolkit
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

Segmentation fault according to #Node

youn__kihang
Novice

 

Hi All,


A segmentation fault occurs during model execution. Could you advise how the I_MPI_FABRICS and FI_PROVIDER options should be set?

 

1. Versions
intel/19.4
impi/2019.9.304

2. Options
export I_MPI_HYDRA_BOOTSTRAP=ssh
export I_MPI_FABRICS=ofi
export FI_PROVIDER=mlx
export I_MPI_DEBUG=9
export I_MPI_PIN_PROCESSOR_LIST=0-37,38-75
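
When comparing providers, it can help to keep the fabric settings in one small helper so that only the provider name changes between runs. The following is a minimal sketch under that assumption, not a recommended configuration: `use_provider` is a hypothetical helper, and the accepted names are just the providers that appear in this thread.

```shell
#!/bin/sh
# Hypothetical helper: select the libfabric provider for the next run.
# Only I_MPI_FABRICS and FI_PROVIDER are set; both appear in the options above.
use_provider() {
    case "$1" in
        mlx|tcp|verbs)
            export I_MPI_FABRICS=ofi     # same fabric setting as in the post
            export FI_PROVIDER="$1"      # provider under test
            ;;
        *)
            echo "unknown provider: $1" >&2
            return 1
            ;;
    esac
}

use_provider mlx
echo "I_MPI_FABRICS=$I_MPI_FABRICS FI_PROVIDER=$FI_PROVIDER"
```

Calling `use_provider tcp` before the next launch switches the provider without touching the remaining settings.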

3. Errors
==== backtrace (tid: 81532) ====

0 0x0000000000056e59 ucs_debug_print_backtrace() ???:0
1 0x0000000000012b20 .annobin_sigaction.c() sigaction.c:0
2 0x000000000000ba0a mlx_send_callback() osd.c:0
3 0x000000000004c742 ucp_tag_offload_unexp_eager() ???:0
4 0x00000000000502c4 uct_ud_ep_do_pending() ???:0
5 0x0000000000050084 ucs_arbiter_dispatch_nonempty() ???:0
6 0x0000000000056938 uct_ud_mlx5_ep_t_delete() ???:0
7 0x000000000002f54a ucp_worker_progress() ???:0
8 0x0000000000009b3c mlx_tagged_inject() mlx_tagged.c:0
9 0x00000000003382be fi_tinject() /usr/include/rdma/fi_tagged.h:136
10 0x00000000003382be MPIDI_OFI_inject_handler_vci() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/netmod/include/../ofi/ofi_impl.h:670
11 0x00000000003382be MPIDI_OFI_send_lightweight_request() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/netmod/include/../ofi/ofi_send.h:81
12 0x00000000003382be MPIDI_OFI_send() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/netmod/include/../ofi/ofi_send.h:606
13 0x00000000003382be MPIDI_NM_mpi_isend() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/netmod/include/../ofi/ofi_send.h:804
14 0x00000000003382be MPIDI_isend_unsafe() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/ch4_send.h:322
15 0x00000000003382be MPIDI_isend_safe() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/ch4_send.h:609
16 0x00000000003382be MPID_Isend() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/ch4_send.h:828
17 0x00000000003382be MPID_Isend_coll() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/ch4_send.h:847
18 0x00000000003382be MPIC_Isend() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpi/coll/helper_fns.c:504
19 0x0000000000136ca3 MPIR_Bcast_intra_tree_generic() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpi/coll/intel/bcast/bcast_intra_tree.c:148
20 0x000000000013609e MPIR_Bcast_intra_tree() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpi/coll/intel/bcast/bcast_intra_tree.c:202
21 0x0000000000174165 MPIDI_NM_mpi_bcast() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/netmod/include/../ofi/intel/ofi_coll.h:127
22 0x0000000000174165 MPIDI_Bcast_intra_composition_alpha() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_impl.h:297
23 0x0000000000174165 MPID_Bcast_invoke() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_select_utils.c:1726
24 0x0000000000174165 MPIDI_coll_invoke() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_select_utils.c:3356
25 0x0000000000154f1e MPIDI_coll_select() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_globals_default.c:129
26 0x000000000021d12d MPID_Bcast() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll.h:51
27 0x0000000000137ad9 PMPI_Bcast() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpi/coll/bcast/bcast.c:416
28 0x00000000000e8924 pmpi_bcast_() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/binding/fortran/mpif_h/bcastf.c:270
29 0x00000000008d05a8 controls_mp_ini_() ???:0
30 0x0000000000410816 MAIN__() ???:0
31 0x00000000004107a2 main() ???:0
32 0x00000000000237b3 __libc_start_main() ???:0
33 0x00000000004106ae _start() ???:0
=================================
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
KIM 00000000012F11CA for__signal_handl Unknown Unknown
libpthread-2.28.s 000014FA2B3CEB20 Unknown Unknown Unknown
libmlx-fi.so 000014F8F383CA0A Unknown Unknown Unknown
libucp.so.0.0.0 000014F8F2AAC742 Unknown Unknown Unknown
libuct_ib.so.0.0. 000014F8F227F2C4 uct_ud_ep_do_pend Unknown Unknown
libucs.so.0.0.0 000014F8F24F1084 ucs_arbiter_dispa Unknown Unknown
libuct_ib.so.0.0. 000014F8F2285938 Unknown Unknown Unknown
libucp.so.0.0.0 000014F8F2A8F54A ucp_worker_progre Unknown Unknown
libmlx-fi.so 000014F8F383AB3C Unknown Unknown Unknown
libmpi.so.12.0.0 000014FA2BD202BE Unknown Unknown Unknown
libmpi.so.12.0.0 000014FA2BB1ECA3 Unknown Unknown Unknown
libmpi.so.12.0.0 000014FA2BB1E09E Unknown Unknown Unknown
libmpi.so.12.0.0 000014FA2BB5C165 Unknown Unknown Unknown
libmpi.so.12.0.0 000014FA2BB3CF1E Unknown Unknown Unknown
libmpi.so.12.0.0 000014FA2BC0512D Unknown Unknown Unknown
libmpi.so.12.0.0 000014FA2BB1FAD9 MPI_Bcast Unknown Unknown
libmpifort.so.12. 000014FA2CE30924 pmpi_bcast Unknown Unknown
KIM 00000000008D05A8 Unknown Unknown Unknown
KIM 0000000000410816 Unknown Unknown Unknown
KIM 00000000004107A2 Unknown Unknown Unknown
libc-2.28.so 000014FA2AC9A7B3 __libc_start_main Unknown Unknown
KIM 00000000004106AE Unknown Unknown Unknown
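
Because the source paths make the backtrace hard to scan, a small filter can pull out the MPI entry point that the application called. This is only a sketch: `first_mpi_call` is a hypothetical helper, and it assumes the entry point shows up as a `PMPI_*` frame, as it does in the dump above.

```shell
#!/bin/sh
# Print the first PMPI_* frame found in a backtrace read from stdin,
# i.e. the MPI routine the application entered before the crash.
first_mpi_call() {
    grep -Eo 'PMPI_[A-Za-z_]+' | head -n 1
}

# Two frames copied from the backtrace above (source paths omitted).
printf '%s\n' \
    '19 0x0000000000136ca3 MPIR_Bcast_intra_tree_generic()' \
    '27 0x0000000000137ad9 PMPI_Bcast()' |
    first_mpi_call
```

In practice the whole log can be piped in, for example `first_mpi_call < run.log`, where `run.log` is a placeholder file name.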

 

Do you need the fi_info and ucx_info output?
Any ideas would be appreciated.
Thank you.

PrasanthD_intel
Moderator

Hi Kihang,


We can infer from the dump that the error is from the broadcast function in your Fortran program.

Are all the nodes reachable in your network?

Check the correctness of the program using ITAC (Intel Trace Analyzer and Collector). Source the ITAC environment script and then run the command given below.

command : mpirun -check_mpi -n <> ./foo
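
As a rough illustration of that step, the command line can be assembled by a small wrapper so the rank count and binary are filled in explicitly. This is a sketch only: `check_mpi_cmd` and `run_check` are hypothetical helpers, the rank count `4` is a made-up value, and the location of the ITAC environment script depends on the installation and is not shown.

```shell
#!/bin/sh
# Build the correctness-check command line; nothing is executed here.
# The rank count and binary are placeholders supplied by the caller.
check_mpi_cmd() {
    np="$1"
    binary="$2"
    echo "mpirun -check_mpi -n $np $binary"
}

# Run the check only when mpirun is on PATH, i.e. when the MPI and ITAC
# environment scripts have already been sourced in this shell.
run_check() {
    cmd=$(check_mpi_cmd "$1" "$2")
    if command -v mpirun >/dev/null 2>&1; then
        $cmd
    else
        echo "mpirun not found; would run: $cmd"
    fi
}

check_mpi_cmd 4 ./foo
```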


>> Do you need fi_info and ucx_info information?  

  Yes, please provide the above details along with the command line that you used to encounter this segmentation fault.


Also, if possible, please provide a sample reproducer; that would help us debug the issue.
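
The requested details could be gathered in one pass with something like the sketch below. It is an assumption about workflow rather than an official procedure: `collect_fabric_info` and the output file name are hypothetical, and the script only records that a tool is missing instead of failing.

```shell
#!/bin/sh
# Collect fi_info and ucx_info output into a single file.
# Tools that are not installed are noted in the file instead of aborting.
collect_fabric_info() {
    out="$1"
    : > "$out"                          # start with an empty file
    for cmd in "fi_info -l" "ucx_info -v" "ucx_info -d"; do
        tool=${cmd%% *}                 # first word is the executable name
        echo "===== $cmd =====" >> "$out"
        if command -v "$tool" >/dev/null 2>&1; then
            $cmd >> "$out" 2>&1 || echo "($cmd failed)" >> "$out"
        else
            echo "($tool not found on PATH)" >> "$out"
        fi
    done
}

collect_fabric_info "${TMPDIR:-/tmp}/fabric_info.txt"
```

The resulting file can be attached to the thread together with the exact launch command.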


Regards

Prasanth


Kyle_Kyu-Young_Choi

 

Here is some additional information.
I fully understand that the information I have given so far isn't enough.

Symptoms are as follows:
With the mlx provider, the run sometimes hangs in a collective communication call or fails with an error, and the errors take many different forms. I searched for and tried several options, but none of them worked. With the tcp provider the run completes, but performance is poor (even with FI_TCP_IFACE=ib0).
We suspect a compatibility issue with the newer UCX and MOFED versions. Please look at the details below and let us know whether any of this can be resolved with options or by other means.

- Version

MPI Library 2019 Update 9
ifort version 19.1.3.304
MLNX_OFED_LINUX-5.2-1.0.4.0
UCX version=1.10.0 revision a212a09

- fi_info (excluding sockets and shm)

*provider: verbs
fabric: IB-0xfe80000000000000
domain: mlx5_1
version: 110.10
type: FI_EP_MSG
protocol: FI_PROTO_RDMA_CM_IB_RC
*provider: verbs
fabric: IB-0xfe80000000000000
domain: mlx5_1-xrc
version: 110.10
type: FI_EP_MSG
protocol: FI_PROTO_RDMA_CM_IB_XRC
*provider: verbs
fabric: IB-0xfe80000000000000
domain: mlx5_1-dgram
version: 110.10
type: FI_EP_DGRAM
protocol: FI_PROTO_IB_UD
*provider: tcp
fabric: *******
domain: ib0
version: 110.10
type: FI_EP_MSG
protocol: FI_PROTO_SOCK_TCP
*provider: tcp
fabric: 10.110.200.0/21
domain: bond0
version: 110.10
type: FI_EP_MSG
protocol: FI_PROTO_SOCK_TCP
*provider: tcp
fabric: *******
domain: bond0.2000
version: 110.10
type: FI_EP_MSG
protocol: FI_PROTO_SOCK_TCP
*provider: tcp
fabric: *******
domain: bond0.2100
version: 110.10
type: FI_EP_MSG
protocol: FI_PROTO_SOCK_TCP
*provider: tcp
fabric: *******
domain: enp0s20f0u1u6
version: 110.10
type: FI_EP_MSG
protocol: FI_PROTO_SOCK_TCP
*provider: tcp
fabric: *******
domain: lo
version: 110.10
type: FI_EP_MSG
protocol: FI_PROTO_SOCK_TCP
*provider: tcp
fabric: ::1/128
domain: lo
version: 110.10
type: FI_EP_MSG
protocol: FI_PROTO_SOCK_TCP
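
A listing this long is easier to compare against `FI_PROVIDER` once it is reduced to the distinct provider names. The sketch below assumes the output format shown above (a `provider:` label followed by the name); `list_providers` and `has_provider` are hypothetical helpers.

```shell
#!/bin/sh
# Reduce fi_info output on stdin to the distinct provider names.
list_providers() {
    awk '/provider:/ { print $2 }' | sort -u
}

# Succeed if the named provider appears in fi_info output on stdin.
has_provider() {
    list_providers | grep -qx "$1"
}

# A few lines copied from the listing above.
sample='*provider: verbs
fabric: IB-0xfe80000000000000
*provider: tcp
domain: ib0
*provider: tcp'

echo "$sample" | list_providers
echo "$sample" | has_provider mlx || echo "mlx is not in this sample"
```

On a live system the same helpers can read from `fi_info` directly, for example `fi_info | has_provider mlx`.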

- UCX info
[root@maru0001 ~]# ucx_info -d
# Memory domain: tcp
# Component: tcp
# register: unlimited, cost: 0 nsec
# remote key: 0 bytes
#
# Transport: tcp
# Device: ib0
# System device: <unknown>
#
# capabilities:
# bandwidth: 23045.61/ppn + 0.00 MB/sec
# latency: 5203 nsec
# overhead: 50000 nsec
# put_zcopy: <= 18446744073709551590, up to 6 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 0
# am_short: <= 8K
# am_bcopy: <= 8K
# am_zcopy: <= 64K, up to 6 iov
# am_opt_zcopy_align: <= 1
# am_align_mtu: <= 0
# am header: <= 8037
# connection: to ep, to iface
# device priority: 1
# device num paths: 1
# max eps: 256
# device address: 16 bytes
# iface address: 2 bytes
# ep address: 10 bytes
# error handling: peer failure
#
# Memory domain: mlx5_0
# Component: ib
# register: unlimited, cost: 180 nsec
# remote key: 8 bytes
# local memory handle is required for zcopy
#
# Transport: rc_verbs
# Device: mlx5_0:1
# System device: 0000:17:00.0 (0)
#
# capabilities:
# bandwidth: 13923.72/ppn + 0.00 MB/sec
# latency: 600 + 1.000 * N nsec
# overhead: 75 nsec
# put_short: <= 124
# put_bcopy: <= 8256
# put_zcopy: <= 1G, up to 3 iov
# put_opt_zcopy_align: <= 512
# put_align_mtu: <= 4K
# get_bcopy: <= 8256
# get_zcopy: 65..1G, up to 3 iov
# get_opt_zcopy_align: <= 512
# get_align_mtu: <= 4K
# am_short: <= 123
# am_bcopy: <= 8255
# am_zcopy: <= 8255, up to 2 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 4K
# am header: <= 127
# domain: device
# atomic_add: 64 bit
# atomic_fadd: 64 bit
# atomic_cswap: 64 bit
# connection: to ep
# device priority: 50
# device num paths: 1
# max eps: 256
# device address: 3 bytes
# ep address: 17 bytes
# error handling: peer failure
#
#
# Transport: rc_mlx5
# Device: mlx5_0:1
# System device: 0000:17:00.0 (0)
#
# capabilities:
# bandwidth: 13923.72/ppn + 0.00 MB/sec
# latency: 600 + 1.000 * N nsec
# overhead: 40 nsec
# put_short: <= 2K
# put_bcopy: <= 8256
# put_zcopy: <= 1G, up to 14 iov
# put_opt_zcopy_align: <= 512
# put_align_mtu: <= 4K
# get_bcopy: <= 8256
# get_zcopy: 65..1G, up to 14 iov
# get_opt_zcopy_align: <= 512
# get_align_mtu: <= 4K
# am_short: <= 2046
# am_bcopy: <= 8254
# am_zcopy: <= 8254, up to 3 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 4K
# am header: <= 186
# domain: device
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to ep
# device priority: 50
# device num paths: 1
# max eps: 256
# device address: 3 bytes
# ep address: 7 bytes
# error handling: buffer (zcopy), remote access, peer failure
#
#
# Transport: dc_mlx5
# Device: mlx5_0:1
# System device: 0000:17:00.0 (0)
#
# capabilities:
# bandwidth: 13923.72/ppn + 0.00 MB/sec
# latency: 660 nsec
# overhead: 40 nsec
# put_short: <= 2K
# put_bcopy: <= 8256
# put_zcopy: <= 1G, up to 11 iov
# put_opt_zcopy_align: <= 512
# put_align_mtu: <= 4K
# get_bcopy: <= 8256
# get_zcopy: 65..1G, up to 11 iov
# get_opt_zcopy_align: <= 512
# get_align_mtu: <= 4K
# am_short: <= 2046
# am_bcopy: <= 8254
# am_zcopy: <= 8254, up to 3 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 4K
# am header: <= 138
# domain: device
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# device priority: 50
# device num paths: 1
# max eps: inf
# device address: 3 bytes
# iface address: 5 bytes
# error handling: buffer (zcopy), remote access, peer failure
#
#
# Transport: ud_verbs
# Device: mlx5_0:1
# System device: 0000:17:00.0 (0)
#
# capabilities:
# bandwidth: 13923.72/ppn + 0.00 MB/sec
# latency: 630 nsec
# overhead: 105 nsec
# am_short: <= 116
# am_bcopy: <= 4088
# am_zcopy: <= 4088, up to 1 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 4K
# am header: <= 3952
# connection: to ep, to iface
# device priority: 50
# device num paths: 1
# max eps: inf
# device address: 3 bytes
# iface address: 3 bytes
# ep address: 6 bytes
# error handling: peer failure
#
#
# Transport: ud_mlx5
# Device: mlx5_0:1
# System device: 0000:17:00.0 (0)
#
# capabilities:
# bandwidth: 13923.72/ppn + 0.00 MB/sec
# latency: 630 nsec
# overhead: 80 nsec
# am_short: <= 180
# am_bcopy: <= 4088
# am_zcopy: <= 4088, up to 3 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 4K
# am header: <= 132
# connection: to ep, to iface
# device priority: 50
# device num paths: 1
# max eps: inf
# device address: 3 bytes
# iface address: 3 bytes
# ep address: 6 bytes
# error handling: peer failure
#
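
To cross-check a comma-separated transport list (such as a UCX_TLS value) against what `ucx_info -d` reports, the output can be reduced to transport names first. This is a sketch under the assumption that each transport appears on a `# Transport: <name>` line as above; `list_transports` and `missing_transports` are hypothetical helpers.

```shell
#!/bin/sh
# Print the distinct transports found in `ucx_info -d` output on stdin.
list_transports() {
    awk '$2 == "Transport:" { print $3 }' | sort -u
}

# Print every entry of a comma-separated list ($1) that is not present
# in the newline-separated list of available transports ($2).
missing_transports() {
    tls="$1"
    available="$2"
    for t in $(echo "$tls" | tr ',' ' '); do
        echo "$available" | grep -qx "$t" || echo "$t"
    done
}

# Transport lines as they appear in the listing above.
available=$(printf '%s\n' \
    '# Transport: tcp' '# Transport: rc_verbs' '# Transport: rc_mlx5' \
    '# Transport: dc_mlx5' '# Transport: ud_verbs' '# Transport: ud_mlx5' |
    list_transports)

# Prints nothing when every requested transport is available.
missing_transports "rc_mlx5,dc_mlx5,ud_verbs,ud_mlx5" "$available"
```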

- ifconfig

eno2np1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet ******* netmask 255.255.248.0 broadcast 10.110.7.255
ether 7c:8a:e1:d1:ac:41 txqueuelen 1000 (Ethernet)
RX packets 1279605 bytes 1440742570 (1.3 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 263427 bytes 67716993 (64.5 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

enp0s20f0u1u6: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
ether 7e:8a:e1:d1:ac:45 txqueuelen 1000 (Ethernet)
RX packets 61406 bytes 6306958 (6.0 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 4092
inet ******* netmask 255.255.0.0 broadcast *******
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
infiniband 00:00:11:29:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 txqueuelen 4096 (InfiniBand)
RX packets 2536176 bytes 1019406258 (972.1 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 1566415 bytes 445570510 (424.9 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet ******* netmask 255.0.0.0
loop txqueuelen 1000 (Local Loopback)
RX packets 44921 bytes 28924325 (27.5 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 44921 bytes 28924325 (27.5 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

 

And finally,

- MPI Options

### MPI TUNING:
export I_MPI_PIN=1
export I_MPI_HYDRA_BOOTSTRAP=ssh
export I_MPI_FABRICS=shm:ofi
export FI_PROVIDER=mlx
export FI_VERBS_IFACE=ib0
export FI_MLX_IFACE=ib0
export FI_TCP_IFACE=ib0
export I_HYDRA_IFACE=ib0
#export FI_LOG_LEVEL=debug
export I_MPI_DEBUG=10
export I_MPI_PIN_PROCESSOR_LIST=0-37,38-75
export UCX_TLS=rc_mlx5,dc_mlx5,ud_verbs,ud_mlx5
export I_MPI_OFI_EXPERIMENTAL=1
export I_MPI_HYDRA_TOPOLIB=hwloc
export UCX_NET_DEVICES=mlx5_0:1

{ time mpiexec.hydra -genvall -bootstrap ssh -machinefile ${PWD}/hostfile -n 15808 -ppn 76 ${PWD}/KIM ; } >> ${OUTFILE} 2>&1
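
As a quick consistency check on these numbers, the node count implied by `-n` and `-ppn` and the number of CPUs named in the pin list can be computed before launching. The sketch below is illustrative only: `nodes_needed` and `cpus_in_list` are hypothetical helpers, and `cpus_in_list` handles only plain numbers and `a-b` ranges, not every syntax that I_MPI_PIN_PROCESSOR_LIST may accept.

```shell
#!/bin/sh
# Number of nodes implied by a total rank count and ranks per node.
# Fails when the total is not a multiple of ranks-per-node.
nodes_needed() {
    total="$1"
    ppn="$2"
    if [ $((total % ppn)) -ne 0 ]; then
        echo "rank count $total is not a multiple of ppn $ppn" >&2
        return 1
    fi
    echo $((total / ppn))
}

# Count the CPUs named in a list such as "0-37,38-75".
cpus_in_list() {
    count=0
    for range in $(echo "$1" | tr ',' ' '); do
        first=${range%-*}               # text before the dash (or the number)
        last=${range#*-}                # text after the dash (or the number)
        count=$((count + last - first + 1))
    done
    echo "$count"
}

nodes_needed 15808 76        # prints 208
cpus_in_list 0-37,38-75      # prints 76
```

With the values from the command line above, 15808 ranks at 76 per node correspond to 208 nodes, and the pin list names 76 CPUs, which matches `-ppn 76`.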


- Errors
CASE#1

==== backtrace (tid: 77751) ====
0 0x0000000000056e59 ucs_debug_print_backtrace() ???:0
1 0x0000000000012b20 .annobin_sigaction.c() sigaction.c:0
2 0x000000000000c4aa mlx_send_callback() osd.c:0
3 0x000000000004c5a2 ucp_tag_offload_unexp_eager() ???:0
4 0x0000000000043925 uct_dc_mlx5_iface_dci_do_common_pending_tx() ???:0
5 0x00000000000439c8 uct_dc_mlx5_iface_dci_do_dcs_pending_tx() ???:0
6 0x0000000000050084 ucs_arbiter_dispatch_nonempty() ???:0
7 0x0000000000044f46 uct_dc_mlx5_ep_check() ???:0
8 0x000000000002f54a ucp_worker_progress() ???:0
9 0x0000000000009ab1 mlx_ep_progress() mlx_ep.c:0
10 0x000000000001e8dd ofi_cq_progress() osd.c:0
11 0x000000000001f59b ofi_cq_readfrom() osd.c:0
12 0x00000000006594c6 fi_cq_read() /usr/include/rdma/fi_eq.h:385
13 0x00000000001ab58b MPIDI_Progress_test() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/ch4_progress.c:181
14 0x00000000001ab58b MPID_Progress_test() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/ch4_progress.c:236
15 0x00000000001ab58b MPID_Progress_wait() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/ch4_progress.c:297
16 0x00000000007f3726 MPIR_Waitall_impl() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpi/request/waitall.c:74
17 0x0000000000136d5a MPIR_Coll_waitall() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpi/coll/intel/include/mpir_coll_tree_utils.h:128
18 0x0000000000136d5a MPIR_Bcast_intra_tree_generic() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpi/coll/intel/bcast/bcast_intra_tree.c:159
19 0x000000000013609e MPIR_Bcast_intra_tree() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpi/coll/intel/bcast/bcast_intra_tree.c:202
20 0x0000000000174165 MPIDI_NM_mpi_bcast() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/netmod/include/../ofi/intel/ofi_coll.h:127
21 0x0000000000174165 MPIDI_Bcast_intra_composition_alpha() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_impl.h:297
22 0x0000000000174165 MPID_Bcast_invoke() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_select_utils.c:1726
23 0x0000000000174165 MPIDI_coll_invoke() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_select_utils.c:3356
24 0x0000000000154f1e MPIDI_coll_select() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_globals_default.c:129
25 0x000000000021d12d MPID_Bcast() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll.h:51
26 0x0000000000137ad9 PMPI_Bcast() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpi/coll/bcast/bcast.c:416
27 0x00000000000e8924 pmpi_bcast_() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/binding/fortran/mpif_h/bcastf.c:270
28 0x00000000008d057b controls_mp_ini_() ???:0
29 0x0000000000410816 MAIN__() ???:0
30 0x00000000004107a2 main() ???:0
31 0x00000000000237b3 __libc_start_main() ???:0
32 0x00000000004106ae _start() ???:0
=================================

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
KIM 00000000012F11CA for__signal_handl Unknown Unknown
libpthread-2.28.s 000014F9A558EB20 Unknown Unknown Unknown
libmlx-fi.so 000014F6542724AA Unknown Unknown Unknown
libucp.so.0.0.0 000014F65402C5A2 Unknown Unknown Unknown
libuct_ib.so.0.0. 000014F6535DA925 uct_dc_mlx5_iface Unknown Unknown
libuct_ib.so.0.0. 000014F6535DA9C8 uct_dc_mlx5_iface Unknown Unknown
libucs.so.0.0.0 000014F653A71084 ucs_arbiter_dispa Unknown Unknown
libuct_ib.so.0.0. 000014F6535DBF46 Unknown Unknown Unknown
libucp.so.0.0.0 000014F65400F54A ucp_worker_progre Unknown Unknown
libmlx-fi.so 000014F65426FAB1 Unknown Unknown Unknown
libmlx-fi.so 000014F6542848DD Unknown Unknown Unknown
libmlx-fi.so 000014F65428559B Unknown Unknown Unknown
libmpi.so.12.0.0 000014F9A62014C6 Unknown Unknown Unknown
libmpi.so.12.0.0 000014F9A5D5358B Unknown Unknown Unknown
libmpi.so.12.0.0 000014F9A639B726 Unknown Unknown Unknown
libmpi.so.12.0.0 000014F9A5CDED5A Unknown Unknown Unknown
libmpi.so.12.0.0 000014F9A5CDE09E Unknown Unknown Unknown
libmpi.so.12.0.0 000014F9A5D1C165 Unknown Unknown Unknown
libmpi.so.12.0.0 000014F9A5CFCF1E Unknown Unknown Unknown
libmpi.so.12.0.0 000014F9A5DC512D Unknown Unknown Unknown
libmpi.so.12.0.0 000014F9A5CDFAD9 MPI_Bcast Unknown Unknown
libmpifort.so.12. 000014F9A6FF0924 pmpi_bcast Unknown Unknown
KIM 00000000008D057B Unknown Unknown Unknown
KIM 0000000000410816 Unknown Unknown Unknown
KIM 00000000004107A2 Unknown Unknown Unknown
libc-2.28.so 000014F9A4E5A7B3 __libc_start_main Unknown Unknown
KIM 00000000004106AE Unknown Unknown Unknown
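
Since the cases differ mainly in which UCX transport appears in the crashing path, it can be useful to extract that from each backtrace. The following is a sketch, not a diagnosis: `crash_transports` is a hypothetical helper, it relies on the `uct_<transport>_...` naming seen in these frames, and symbol names resolved without debug information (the `???:0` frames) may only be approximate.

```shell
#!/bin/sh
# List the UCX transports whose uct_* frames appear in a backtrace on stdin.
crash_transports() {
    grep -Eo 'uct_(rc|dc|ud)_(mlx5|verbs)' | sed 's/^uct_//' | sort -u
}

# Frames copied from CASE#1 above.
printf '%s\n' \
    '4 0x0000000000043925 uct_dc_mlx5_iface_dci_do_common_pending_tx() ???:0' \
    '6 0x0000000000050084 ucs_arbiter_dispatch_nonempty() ???:0' \
    '7 0x0000000000044f46 uct_dc_mlx5_ep_check() ???:0' |
    crash_transports
```

For the CASE#1 frames this prints `dc_mlx5`.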

CASE#2
[maru0471:94652:0:94652] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x50)
[maru0472:98915:0:98915] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x50)
[maru0473:90088:0:90088] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x50)
==== backtrace (tid: 90088) ====
0 0x0000000000056e59 ucs_debug_print_backtrace() ???:0
1 0x0000000000012b20 .annobin_sigaction.c() sigaction.c:0
2 0x000000000000c4aa mlx_send_callback() osd.c:0
3 0x000000000004c742 ucp_tag_offload_unexp_eager() ???:0
4 0x00000000000502c4 uct_ud_ep_do_pending() ???:0
5 0x0000000000050084 ucs_arbiter_dispatch_nonempty() ???:0
6 0x00000000000533e2 uct_ud_verbs_ep_t_delete() ???:0
7 0x000000000002f54a ucp_worker_progress() ???:0
8 0x0000000000009ab1 mlx_ep_progress() mlx_ep.c:0
9 0x000000000001e8dd ofi_cq_progress() osd.c:0
10 0x000000000001f59b ofi_cq_readfrom() osd.c:0
11 0x00000000006594c6 fi_cq_read() /usr/include/rdma/fi_eq.h:385
12 0x00000000001ab05f MPIDI_Progress_test() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/ch4_progress.c:181
13 0x00000000001ab05f MPID_Progress_test() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/ch4_progress.c:236
14 0x0000000000136969 MPIR_Coll_try_progress() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpi/coll/intel/include/mpir_coll_tree_utils.h:84
15 0x0000000000136969 MPIR_Bcast_intra_tree_generic() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpi/coll/intel/bcast/bcast_intra_tree.c:140
16 0x000000000013609e MPIR_Bcast_intra_tree() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpi/coll/intel/bcast/bcast_intra_tree.c:202
17 0x0000000000174165 MPIDI_NM_mpi_bcast() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/netmod/include/../ofi/intel/ofi_coll.h:127
18 0x0000000000174165 MPIDI_Bcast_intra_composition_alpha() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_impl.h:297
19 0x0000000000174165 MPID_Bcast_invoke() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_select_utils.c:1726
20 0x0000000000174165 MPIDI_coll_invoke() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_select_utils.c:3356
21 0x0000000000154f1e MPIDI_coll_select() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_globals_default.c:129
22 0x0000000000182a74 MPID_Bcast() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll.h:51
23 0x0000000000182a74 MPIDI_Allgather_intra_composition_gamma() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_extra_compositions.h:903
24 0x0000000000182a74 MPID_Allgather_invoke() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_select_utils.c:1881
25 0x0000000000182a74 MPIDI_coll_invoke() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_select_utils.c:3287
26 0x0000000000154f1e MPIDI_coll_select() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_globals_default.c:129
27 0x000000000021c2b5 MPID_Allgather() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll.h:95
28 0x00000000000fe558 PMPI_Allgather() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpi/coll/allgather/allgather.c:384
29 0x00000000000e7b4e pmpi_allgather_() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/binding/fortran/mpif_h/allgatherf.c:276
30 0x0000000000691574 parallel_mp_gen_node_group_() ???:0
31 0x000000000041e71c atm_main_program_mp_set_() ???:0
32 0x0000000000415016 atm_comp_driver_mp_set_() ???:0
33 0x0000000000410add MAIN__() ???:0
34 0x00000000004107a2 main() ???:0
35 0x00000000000237b3 __libc_start_main() ???:0
36 0x00000000004106ae _start() ???:0
=================================
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
KIM 00000000012F11CA for__signal_handl Unknown Unknown
libpthread-2.28.s 00001543C21B2B20 Unknown Unknown Unknown
libmlx-fi.so 0000154070E964AA Unknown Unknown Unknown
libucp.so.0.0.0 0000154070C50742 Unknown Unknown Unknown
libuct_ib.so.0.0. 000015407020B2C4 uct_ud_ep_do_pend Unknown Unknown
libucs.so.0.0.0 0000154070695084 ucs_arbiter_dispa Unknown Unknown
libuct_ib.so.0.0. 000015407020E3E2 Unknown Unknown Unknown
libucp.so.0.0.0 0000154070C3354A ucp_worker_progre Unknown Unknown
libmlx-fi.so 0000154070E93AB1 Unknown Unknown Unknown
libmlx-fi.so 0000154070EA88DD Unknown Unknown Unknown
libmlx-fi.so 0000154070EA959B Unknown Unknown Unknown
libmpi.so.12.0.0 00001543C2E254C6 Unknown Unknown Unknown
libmpi.so.12.0.0 00001543C297705F Unknown Unknown Unknown
libmpi.so.12.0.0 00001543C2902969 Unknown Unknown Unknown
libmpi.so.12.0.0 00001543C290209E Unknown Unknown Unknown
libmpi.so.12.0.0 00001543C2940165 Unknown Unknown Unknown
libmpi.so.12.0.0 00001543C2920F1E Unknown Unknown Unknown
libmpi.so.12.0.0 00001543C294EA74 Unknown Unknown Unknown
libmpi.so.12.0.0 00001543C2920F1E Unknown Unknown Unknown
libmpi.so.12.0.0 00001543C29E82B5 Unknown Unknown Unknown
libmpi.so.12.0.0 00001543C28CA558 MPI_Allgather Unknown Unknown
libmpifort.so.12. 00001543C3C13B4E mpi_allgather Unknown Unknown
KIM 0000000000691574 Unknown Unknown Unknown
KIM 000000000041E71C Unknown Unknown Unknown
KIM 0000000000415016 Unknown Unknown Unknown
KIM 0000000000410ADD Unknown Unknown Unknown
KIM 00000000004107A2 Unknown Unknown Unknown
libc-2.28.so 00001543C1A7E7B3 __libc_start_main Unknown Unknown
KIM 00000000004106AE Unknown Unknown Unknown

CASE#3
==== backtrace (tid: 70682) ====
0 0x0000000000056e59 ucs_debug_print_backtrace() ???:0
1 0x0000000000012b20 .annobin_sigaction.c() sigaction.c:0
2 0x000000000000c4aa mlx_send_callback() osd.c:0
3 0x000000000003f9ca ucp_rndv_reg_send_buffer() ???:0
4 0x0000000000043909 ucp_rndv_ats_handler() ???:0
5 0x0000000000044a8c uct_dc_mlx5_ep_check() ???:0
6 0x000000000002f54a ucp_worker_progress() ???:0
7 0x0000000000009ab1 mlx_ep_progress() mlx_ep.c:0
8 0x000000000001e8dd ofi_cq_progress() osd.c:0
9 0x000000000001f59b ofi_cq_readfrom() osd.c:0
10 0x00000000006594c6 fi_cq_read() /usr/include/rdma/fi_eq.h:385
11 0x00000000001ab58b MPIDI_Progress_test() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/ch4_progress.c:181
12 0x00000000001ab58b MPID_Progress_test() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/ch4_progress.c:236
13 0x00000000001ab58b MPID_Progress_wait() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/ch4_progress.c:297
14 0x00000000007f6cab MPIR_Wait_impl() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpi/request/wait.c:40
15 0x0000000000330663 MPID_Wait() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/include/mpidpost.h:188
16 0x0000000000330663 MPIC_Wait() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpi/coll/helper_fns.c:66
17 0x0000000000330663 MPIC_Sendrecv() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpi/coll/helper_fns.c:334
18 0x000000000010907b MPIR_Allreduce_intra_rbz_redscat_allgather_pof2() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpi/coll/intel/allreduce/allreduce_intra_rabenseifner.c:128
19 0x000000000010907b MPIR_Allreduce_intra_rabenseifner() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpi/coll/intel/allreduce/allreduce_intra_rabenseifner.c:229
20 0x000000000017f049 MPIDI_NM_mpi_allreduce() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/netmod/include/../ofi/intel/ofi_coll.h:197
21 0x000000000017f049 MPIDI_Allreduce_intra_composition_beta() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_impl.h:594
22 0x000000000017f049 MPID_Allreduce_invoke() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_select_utils.c:1791
23 0x000000000017f049 MPIDI_coll_invoke() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_select_utils.c:3343
24 0x0000000000154f1e MPIDI_coll_select() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_globals_default.c:129
25 0x000000000021ce47 MPID_Allreduce() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll.h:75
26 0x0000000000111472 PMPI_Allreduce() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/mpi/coll/allreduce/allreduce.c:417
27 0x00000000000e7f90 pmpi_allreduce_() /localdisk/jenkins/workspace/workspace/ch4-build-linux-2019/impi-ch4-build-linux_build/CONF/impi-ch4-build-linux-release/label/impi-ch4-build-linux-intel64/_buildspace/release/../../src/binding/fortran/mpif_h/allreducef.c:276
28 0x0000000000695127 par_mpi_mp_par_allreduce_int_() ???:0
29 0x000000000086f4b4 dof_mp_setelemoffset_() ???:0
30 0x00000000007c0314 grid_mp_grid_initialize_() ???:0
31 0x000000000041e8d2 atm_main_program_mp_set_() ???:0
32 0x0000000000415016 atm_comp_driver_mp_set_() ???:0
33 0x0000000000410add MAIN__() ???:0
34 0x00000000004107a2 main() ???:0
35 0x00000000000237b3 __libc_start_main() ???:0
36 0x00000000004106ae _start() ???:0
=================================
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
KIM 00000000012F11CA for__signal_handl Unknown Unknown
libpthread-2.28.s 0000147E0D2C1B20 Unknown Unknown Unknown
libmlx-fi.so 0000147CD42CF4AA Unknown Unknown Unknown
libucp.so.0.0.0 0000147CD407C9CA Unknown Unknown Unknown
libucp.so.0.0.0 0000147CD4080909 ucp_rndv_ats_hand Unknown Unknown
libuct_ib.so.0.0. 0000147CD3638A8C Unknown Unknown Unknown
libucp.so.0.0.0 0000147CD406C54A ucp_worker_progre Unknown Unknown
libmlx-fi.so 0000147CD42CCAB1 Unknown Unknown Unknown
libmlx-fi.so 0000147CD42E18DD Unknown Unknown Unknown
libmlx-fi.so 0000147CD42E259B Unknown Unknown Unknown
libmpi.so.12.0.0 0000147E0DF344C6 Unknown Unknown Unknown
libmpi.so.12.0.0 0000147E0DA8658B Unknown Unknown Unknown
libmpi.so.12.0.0 0000147E0E0D1CAB Unknown Unknown Unknown
libmpi.so.12.0.0 0000147E0DC0B614 Unknown Unknown Unknown
libmpi.so.12.0.0 0000147E0D9E407B Unknown Unknown Unknown
libmpi.so.12.0.0 0000147E0DA5A049 Unknown Unknown Unknown
libmpi.so.12.0.0 0000147E0DA2FF1E Unknown Unknown Unknown
libmpi.so.12.0.0 0000147E0DAF7E47 Unknown Unknown Unknown
libmpi.so.12.0.0 0000147E0D9EC472 PMPI_Allreduce Unknown Unknown
libmpifort.so.12. 0000147E0ED22F90 mpi_allreduce_ Unknown Unknown
KIM 0000000000695127 Unknown Unknown Unknown
KIM 000000000086F4B4 Unknown Unknown Unknown
KIM 00000000007C0314 Unknown Unknown Unknown
KIM 000000000041E8D2 Unknown Unknown Unknown
KIM 0000000000415016 Unknown Unknown Unknown
KIM 0000000000410ADD Unknown Unknown Unknown
KIM 00000000004107A2 Unknown Unknown Unknown
libc-2.28.so 0000147E0CB8D7B3 __libc_start_main Unknown Unknown
KIM 00000000004106AE Unknown Unknown Unknown

youn__kihang
Novice

Hello All,

It turned out to be a compatibility issue with the UCX version.
After I downgraded UCX from version 1.10 (2021.04.01) to 1.9, the model runs normally with the mlx provider.
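
For anyone hitting the same backtrace, a quick first step is to check which UCX version each node actually loads. The snippet below is only a sketch: `ucx_version_status` is a hypothetical helper, its categories reflect nothing more than what was observed in this thread (1.10 crashed and 1.9 worked with impi/2019.9.304), and the `sed` pattern assumes that `ucx_info -v` prints a `version=X.Y.Z` field, which may differ between UCX releases.

```shell
#!/bin/sh
# Hypothetical helper: classify a UCX version string using only the
# observations from this thread (not an official compatibility matrix).
ucx_version_status() {
    case "$1" in
        1.10|1.10.*) echo "reported-failing" ;;   # segfault seen with the mlx provider
        1.9|1.9.*)   echo "reported-working" ;;   # worked after the downgrade
        *)           echo "untested" ;;           # nothing reported in this thread
    esac
}

# Assumption: ucx_info -v emits a line containing "version=X.Y.Z".
if command -v ucx_info >/dev/null 2>&1; then
    ver=$(ucx_info -v | sed -n 's/.*version=\([0-9][0-9.]*\).*/\1/p' | head -n 1)
    echo "UCX ${ver:-unknown}: $(ucx_version_status "$ver")"
else
    echo "ucx_info not found on PATH"
fi
```

Running this on every node in the job (for example over the same ssh bootstrap) can help spot nodes that picked up a different UCX installation.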

Thank you.

PrasanthD_intel
Moderator

Hi Kihang,


Thanks for reporting this to us.

I will let the internal team know about this.

Let us know if you have any further queries; otherwise, we can close this thread.


Regards

Prasanth


PrasanthD_intel
Moderator

Hi Kihang,


As your issue has been resolved, we are closing this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.


Regards

Prasanth

