Hello, I have successfully compiled my application, but it sometimes crashes and produces:
Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
==== backtrace (tid: 40561) ====
0 /usr/lib64/libucs.so.0(ucs_handle_error+0x104) [0x7fa02aade4d4]
1 /usr/lib64/libucs.so.0(+0x1e8fc) [0x7fa02aade8fc]
2 /usr/lib64/libucs.so.0(+0x1eab2) [0x7fa02aadeab2]
3 /opt/intel/oneapi/mkl/latest/lib/intel64/libmkl_avx2.so.1(mkl_blas_avx2_xzdotc+0x2a3) [0x7fa02780d9a3]
=================================
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
prep-mpi.x 0000000000A997CA Unknown Unknown Unknown
libpthread-2.26.s 00007FA02D8F12D0 Unknown Unknown Unknown
libmkl_avx2.so.1 00007FA02780D9A3 mkl_blas_avx2_xzd Unknown Unknown
I have MLNX_OFED_LINUX-5.0-2.1.8.0 (because of the FDR cards) and Intel oneAPI with these component versions:
C 2021.2.0.118
Fortran 2021.2.0.136
MPI 2021.2.0.215
MKL 2021.2.0.296
ucx_info -v
# UCT version=1.8.0 revision c0a9704
# configured with: --host=x86_64-suse-linux-gnu --build=x86_64-suse-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/lib --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-dependency-tracking --disable-optimizations --disable-logging --disable-debug --disable-assertions --enable-mt --disable-params-check --enable-cma --without-cuda --without-gdrcopy --with-verbs --without-cm --with-knem --with-rdmacm --without-rocm --without-xpmem --without-ugni --without-java --disable-numa
Where could the problem be?
Hi,
Thanks for reaching out to us.
Could you please provide the hardware details of the system on which you are running the application?
If possible, could you also provide the exact link to the application?
Thanks & Regards
Shivani
Thanks for the reply.
83:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3] MCX353A-FCBT
Supermicro X10DAI 2xIntel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz 256GB
So far it has failed in two codes: the private TurboRVB (https://people.sissa.it/~sorella/TurboRVB_Manual/build/html/index.html) and cp2k.
I have also tried a newer version of UCX:
# UCT version=1.11.0 revision 6031c98
# configured with: --host=x86_64-suse-linux-gnu --build=x86_64-suse-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/lib --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-dependency-tracking --disable-optimizations --disable-logging --disable-debug --disable-assertions --enable-mt --disable-params-check --without-java --enable-cma --without-cuda --without-gdrcopy --with-verbs --with-knem --with-rdmacm --without-rocm --without-xpmem --without-fuse3 --without-ugni --disable-numa
Output of ucx_info -d with the new version:
#
# Memory domain: posix
# Component: posix
# allocate: unlimited
# remote key: 24 bytes
# rkey_ptr is supported
#
# Transport: posix
# Device: memory
# System device: <unknown>
#
# capabilities:
# bandwidth: 0.00/ppn + 12179.00 MB/sec
# latency: 80 nsec
# overhead: 10 nsec
# put_short: <= 4294967295
# put_bcopy: unlimited
# get_bcopy: unlimited
# am_short: <= 100
# am_bcopy: <= 8256
# domain: cpu
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 8 bytes
# iface address: 8 bytes
# error handling: ep_check
#
#
# Memory domain: sysv
# Component: sysv
# allocate: unlimited
# remote key: 12 bytes
# rkey_ptr is supported
#
# Transport: sysv
# Device: memory
# System device: <unknown>
#
# capabilities:
# bandwidth: 0.00/ppn + 12179.00 MB/sec
# latency: 80 nsec
# overhead: 10 nsec
# put_short: <= 4294967295
# put_bcopy: unlimited
# get_bcopy: unlimited
# am_short: <= 100
# am_bcopy: <= 8256
# domain: cpu
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 8 bytes
# iface address: 8 bytes
# error handling: ep_check
#
#
# Memory domain: self
# Component: self
# register: unlimited, cost: 0 nsec
# remote key: 0 bytes
#
# Transport: self
# Device: memory0
# System device: <unknown>
#
# capabilities:
# bandwidth: 0.00/ppn + 6911.00 MB/sec
# latency: 0 nsec
# overhead: 10 nsec
# put_short: <= 4294967295
# put_bcopy: unlimited
# get_bcopy: unlimited
# am_short: <= 8K
# am_bcopy: <= 8K
# domain: cpu
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 0 bytes
# iface address: 8 bytes
# error handling: ep_check
#
#
# Memory domain: tcp
# Component: tcp
# register: unlimited, cost: 0 nsec
# remote key: 0 bytes
#
# Transport: tcp
# Device: eth1
# System device: <unknown>
#
# capabilities:
# bandwidth: 113.16/ppn + 0.00 MB/sec
# latency: 5776 nsec
# overhead: 50000 nsec
# put_zcopy: <= 18446744073709551590, up to 6 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 0
# am_short: <= 8K
# am_bcopy: <= 8K
# am_zcopy: <= 64K, up to 6 iov
# am_opt_zcopy_align: <= 1
# am_align_mtu: <= 0
# am header: <= 8037
# connection: to ep, to iface
# device priority: 0
# device num paths: 1
# max eps: 256
# device address: 6 bytes
# iface address: 2 bytes
# ep address: 10 bytes
# error handling: peer failure, ep_check, keepalive
#
# Transport: tcp
# Device: ib0
# System device: <unknown>
#
# capabilities:
# bandwidth: 6239.81/ppn + 0.00 MB/sec
# latency: 5210 nsec
# overhead: 50000 nsec
# put_zcopy: <= 18446744073709551590, up to 6 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 0
# am_short: <= 8K
# am_bcopy: <= 8K
# am_zcopy: <= 64K, up to 6 iov
# am_opt_zcopy_align: <= 1
# am_align_mtu: <= 0
# am header: <= 8037
# connection: to ep, to iface
# device priority: 1
# device num paths: 1
# max eps: 256
# device address: 6 bytes
# iface address: 2 bytes
# ep address: 10 bytes
# error handling: peer failure, ep_check, keepalive
#
# Transport: tcp
# Device: lo
# System device: <unknown>
#
# capabilities:
# bandwidth: 11.91/ppn + 0.00 MB/sec
# latency: 10960 nsec
# overhead: 50000 nsec
# put_zcopy: <= 18446744073709551590, up to 6 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 0
# am_short: <= 8K
# am_bcopy: <= 8K
# am_zcopy: <= 64K, up to 6 iov
# am_opt_zcopy_align: <= 1
# am_align_mtu: <= 0
# am header: <= 8037
# connection: to ep, to iface
# device priority: 1
# device num paths: 1
# max eps: 256
# device address: 18 bytes
# iface address: 2 bytes
# ep address: 10 bytes
# error handling: peer failure, ep_check, keepalive
#
#
# Connection manager: tcp
# max_conn_priv: 2064 bytes
#
# Memory domain: mlx4_0
# Component: ib
# register: unlimited, cost: 180 nsec
# remote key: 8 bytes
# local memory handle is required for zcopy
#
# Transport: rc_verbs
# Device: mlx4_0:1
# System device: 0000:83:00.0 (0)
#
# capabilities:
# bandwidth: 6433.22/ppn + 0.00 MB/sec
# latency: 900 + 1.000 * N nsec
# overhead: 75 nsec
# put_short: <= 88
# put_bcopy: <= 8256
# put_zcopy: <= 1G, up to 6 iov
# put_opt_zcopy_align: <= 512
# put_align_mtu: <= 2K
# get_bcopy: <= 8256
# get_zcopy: 65..1G, up to 6 iov
# get_opt_zcopy_align: <= 512
# get_align_mtu: <= 2K
# am_short: <= 87
# am_bcopy: <= 8255
# am_zcopy: <= 8255, up to 5 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 2K
# am header: <= 127
# domain: device
# atomic_add: 64 bit
# atomic_fadd: 64 bit
# atomic_cswap: 64 bit
# connection: to ep
# device priority: 10
# device num paths: 1
# max eps: 256
# device address: 4 bytes
# ep address: 4 bytes
# error handling: peer failure, ep_check
#
#
# Transport: ud_verbs
# Device: mlx4_0:1
# System device: 0000:83:00.0 (0)
#
# capabilities:
# bandwidth: 6433.22/ppn + 0.00 MB/sec
# latency: 930 nsec
# overhead: 105 nsec
# am_short: <= 172
# am_bcopy: <= 4088
# am_zcopy: <= 4088, up to 8 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 4K
# am header: <= 3952
# connection: to ep, to iface
# device priority: 10
# device num paths: 1
# max eps: inf
# device address: 4 bytes
# iface address: 3 bytes
# ep address: 6 bytes
# error handling: peer failure, ep_check
#
#
# Connection manager: rdmacm
# max_conn_priv: 54 bytes
#
# Memory domain: cma
# Component: cma
# register: unlimited, cost: 9 nsec
#
# Transport: cma
# Device: memory
# System device: <unknown>
#
# capabilities:
# bandwidth: 0.00/ppn + 11145.00 MB/sec
# latency: 80 nsec
# overhead: 400 nsec
# put_zcopy: unlimited, up to 16 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 1
# get_zcopy: unlimited, up to 16 iov
# get_opt_zcopy_align: <= 1
# get_align_mtu: <= 1
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 8 bytes
# iface address: 4 bytes
# error handling: peer failure, ep_check
#
but I got the same segmentation fault.
I'm unable to reply. I have sent it 10 times and still nothing here.
83:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3] mcx353a-fcbt
X10DAI 2xIntel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz 256GB
codes: https://people.sissa.it/~sorella/TurboRVB_Manual/build/html/index.html (private)
cp2k https://www.cp2k.org
I have also tried a newer version of UCX, but I get the same crash. I also tried MKL_VERBOSE=1, but I am unable to figure out where the problem could be.
ucx_info -v
[1625600433.356047] [localhost:47742:0] debug.c:1199 UCX DEBUG using signal stack 0x7f1572f67000 size 141824
[1625600433.379989] [localhost:47742:0] init.c:114 UCX DEBUG /home/bxm/Downloads/ucx-rpms/install/lib64/libucs.so.0 loaded at 0x7f1572670000
[1625600433.380038] [localhost:47742:0] init.c:115 UCX DEBUG cmd line: ucx_info -v
[1625600433.380059] [localhost:47742:0] module.c:69 UCX DEBUG ucs library path: /home/bxm/Downloads/ucx-rpms/install/lib64/libucs.so.0
[1625600433.380072] [localhost:47742:0] module.c:251 UCX DEBUG loading modules for ucs
# UCT version=1.11.0 revision 6031c98
# configured with: --host=x86_64-suse-linux-gnu --build=x86_64-suse-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/lib --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-dependency-tracking --disable-optimizations --disable-logging --disable-debug --disable-assertions --enable-mt --disable-params-check --without-java --enable-cma --without-cuda --without-gdrcopy --with-verbs --with-knem --with-rdmacm --without-rocm --without-xpmem --without-fuse3 --without-ugni --disable-numa
Hi,
Please answer the questions below.
1. Could you please provide the command line you have been using?
2. Could you also provide the output of the commands below?
which mpirun
ldd <executable file>
3. Could you please set I_MPI_DEBUG=10 and provide the complete error log?
4. Are you able to run benchmarks or other applications in this environment, or do they show the same issue?
5. Do you have exclusive access to all the nodes?
Thanks & Regards
Shivani
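The details requested above could be gathered into a single log with a small script like the one below. This is only a sketch: the executable name prep-mpi.x is taken from the backtrace, while the process count NP and the log file name are placeholders you would adjust. It skips the MPI run if mpirun or the executable is not found.

```shell
#!/bin/sh
# Sketch: collect "which mpirun", "ldd" and an I_MPI_DEBUG=10 run in one log.
# EXE, NP and LOG are placeholders (assumptions); override them as needed.
EXE="${EXE:-./prep-mpi.x}"
NP="${NP:-2}"
LOG="${LOG:-mpi_diagnostics.log}"

{
    echo "== which mpirun =="
    command -v mpirun || echo "mpirun not found in PATH"

    echo "== ldd $EXE =="
    if [ -x "$EXE" ]; then
        ldd "$EXE"
    else
        echo "executable $EXE not found"
    fi

    echo "== run with I_MPI_DEBUG=10 =="
    if command -v mpirun >/dev/null 2>&1 && [ -x "$EXE" ]; then
        # The variable is set only for this one command.
        I_MPI_DEBUG=10 mpirun -np "$NP" "$EXE"
    else
        echo "skipped: mpirun or $EXE missing"
    fi
} > "$LOG" 2>&1

echo "wrote $LOG"
```

The resulting file can then be attached to the reply in one piece instead of pasting each output separately.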
Thanks for the reply.
I think I have found the source of the problems, and I think I was too quick to post this here. First, I do not think it is a UCX-related bug as I wrote. The backtrace only shows that the UCX library is loaded, which must be because libmlx-fi.so, part of libfabric.so, loaded it for my Mellanox card.
In cp2k, I found that after increasing the stack size the SIGSEGV disappeared and the code works well.
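For anyone hitting the same crash, one common way to raise the stack size is with ulimit in the shell that launches the job. The sketch below assumes a POSIX shell and raises the soft limit only as far as the hard limit allows. The post does not say which method or value was actually used, so treat this as an illustration.

```shell
#!/bin/sh
# Show the current soft and hard stack limits (in kB, or "unlimited").
echo "soft stack limit before: $(ulimit -S -s)"
echo "hard stack limit:        $(ulimit -H -s)"

# Raise the soft limit to the hard limit; this never exceeds what the
# system allows, so it does not need root privileges.
ulimit -S -s "$(ulimit -H -s)"
echo "soft stack limit after:  $(ulimit -S -s)"

# Assumption: if the build uses OpenMP threads, the per-thread stack is
# controlled separately by the standard OMP_STACKSIZE variable.
# The value here is only an example.
# export OMP_STACKSIZE=64M
```

The new limit applies only to this shell and the processes started from it, so the application has to be launched from the same shell or job script.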
For TurboRVB, after compiling with debug features enabled and without optimizations, I was able to run some test calculations without the SIGSEGV, but some checks return infinite numbers. So it looks like a bug in the code, such as a bad pointer or misaligned memory in the arguments passed to MKL. So I think we should close this thread as solved.
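The post does not list the exact debug options that were used. As an assumption, a debug build with the classic Intel Fortran compiler might use flags like the ones below, which disable optimization, add symbols and tracebacks, enable run-time checks and trap floating-point exceptions. The source file name example.f90 is hypothetical.

```shell
#!/bin/sh
# Sketch of a debug build; FC and SRC are placeholders (assumptions).
FC="${FC:-ifort}"
SRC="${SRC:-example.f90}"   # hypothetical source file

# -O0        no optimization
# -g         debug symbols
# -traceback source file and line in run-time error messages
# -check all run-time checks (array bounds, uninitialized variables, ...)
# -fpe0      abort on invalid operation, divide by zero and overflow
DEBUG_FLAGS="-O0 -g -traceback -check all -fpe0"

if command -v "$FC" >/dev/null 2>&1 && [ -f "$SRC" ]; then
    # DEBUG_FLAGS is left unquoted on purpose so it splits into separate options.
    "$FC" $DEBUG_FLAGS -o example_debug.x "$SRC"
else
    echo "would run: $FC $DEBUG_FLAGS -o example_debug.x $SRC"
fi
```

With -fpe0 the program stops at the first operation that produces an infinity or NaN, which may help locate where the infinite numbers mentioned above first appear.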
Hi,
Glad to know that your issue is resolved. If you need any additional information, please post a new question, as this thread will no longer be monitored by Intel.
Thanks & Regards
Shivani
