Intel® oneAPI HPC Toolkit
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

[7] [Node0:4997 :0:5094] Caught signal 11 (Segmentation fault: address not mapped to object at

tamilalagan
Novice

Getting a segmentation fault, with a stack trace pointing into libucs.

Setup:

Intel MPI Version: Version 2019 Update 7 Build 20200312

OFED Version: MLNX_OFED_LINUX-4.7-3.2.9.0

UCX Version: UCT version=1.7.0 revision b02bab9

Launch Parameter: mpirun -l -print-all-exitcodes -cleanup -genv I_MPI_FABRICS shm:ofi -genv I_MPI_WAIT_MODE 1 -genv FI_PROVIDER_PATH <basePath>/Intel/mpi/lib-IB/prov -genv UCX_NET_DEVICES mlx5_0:1 <Process>

Call Stack:

[7] [IMCNode000:4997 :0:5094] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xd8092c2f)
[7] ==== backtrace ====
[7] 0 /usr/lib64/libucs.so.0(+0x1de0c) [0x7fc4c0720e0c]
[7] 1 /usr/lib64/libucs.so.0(+0x1dfc2) [0x7fc4c0720fc2]
[7] 2 /lib64/libpthread.so.0(+0x10c70) [0x7fc4c4fb4c70]
[7] 3 /home/klac/mount_path_Tool/Binux/Tigris_IMC.exe() [0x8f4457]
[7] 4 /home/klac/mount_path_Tool/Binux/Tigris_IMC.exe() [0x8fb0ad]
[7] 5 /home/klac/mount_path_Tool/Binux/libOpenThreads.so(_ZN11OpenThreads20ThreadPrivateActions11StartThreadEPv+0x59) [0x7fc4e2bb4fe9]
[7] 6 /lib64/libpthread.so.0(+0x874a) [0x7fc4c4fac74a]
[7] 7 /lib64/libc.so.6(clone+0x6d) [0x7fc4c3beef6d]
[7] ===================
[7] Segmentation Fault Exception

 

I found a similar issue reported for OpenMPI at the link below:

https://github.com/openucx/ucx/wiki/OpenMPI-and-OpenSHMEM-installation-with-UCX#running-open-mpi-wit...

$ mpirun -np 2 -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 ./app

IMPORTANT NOTE: Recent OpenMPI versions contain a BTL component called 'uct', which can cause data corruption when enabled, due to conflict on malloc hooks between OPAL and UCM. In order to work-around this, use one of the following alternatives:

Alternative 1: Disable btl/uct in OpenMPI build configuration:

$ ./configure ... --enable-mca-no-build=btl-uct ...

Alternative 2: Disable btl/uct at runtime

$ mpirun -np 2 -mca pml ucx -mca btl ^uct -x UCX_NET_DEVICES=mlx5_0:1 ./app

Could this be the same issue as the OpenMPI one?

Is there any fix available?
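In the meantime, one isolation test that could be tried (a sketch; it assumes the tcp provider is available in Intel MPI's bundled libfabric, and that the 'mlx' provider is what routes traffic through UCX here):

```shell
# Hypothetical isolation test: if the crash originates inside UCX (libucs in
# the backtrace), forcing a non-UCX libfabric provider should avoid that code
# path entirely, at the cost of InfiniBand performance.
export I_MPI_FABRICS=shm:ofi
export FI_PROVIDER=tcp   # instead of the UCX-backed 'mlx' provider
echo "FI_PROVIDER=$FI_PROVIDER"
# then relaunch the job with the usual mpirun command
```

If the segfault disappears with a non-UCX provider, that would point at the UCX layer rather than the application code.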

13 Replies
PrasanthD_intel
Moderator

Hi,


I am not sure whether this is the same issue as the OpenMPI one, since I tested the same command line with a sample program in our environment.


Could you please tell us the transports available on your system, the provider you were using, and the interconnect?

To list the transports, use the following command: ucx_info -d | grep Transport


Regards

Prasanth


tamilalagan
Novice

Available Transport & Device:

ucx_info -d | grep 'Transport\|Device'

# Transport: posix
# Device: memory
# Transport: sysv
# Device: memory
# Transport: self
# Device: memory
# Transport: tcp
# Device: eth2
# Transport: tcp
# Device: ib0
# Transport: tcp
# Device: eth0
# Transport: tcp
# Device: eth3
# Transport: tcp
# Device: eth1
# Transport: rc
# Device: mlx5_0:1
# Transport: rc_mlx5
# Device: mlx5_0:1
# Transport: dc_mlx5
# Device: mlx5_0:1
# Transport: ud
# Device: mlx5_0:1
# Transport: ud_mlx5
# Device: mlx5_0:1
# Transport: cm
# Device: mlx5_0:1
# Transport: cma
# Device: memory
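To read the listing above more easily, the Transport/Device lines can be paired up. A small sketch, with sample lines inlined so it runs standalone; in practice, pipe the real ucx_info -d output through the same awk:

```shell
# Pair each UCX transport with the device it runs on.
printf '%s\n' \
  '# Transport: rc_mlx5' \
  '# Device: mlx5_0:1' \
  '# Transport: tcp' \
  '# Device: eth0' |
awk -F': ' '/Transport/ {t=$2} /Device/ {print t, "->", $2}'
# prints:
#   rc_mlx5 -> mlx5_0:1
#   tcp -> eth0
```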

 

The provider you were using: OFI Provider

Interconnect:

Infiniband device 'mlx5_0' port 1 status:
default gid: fe80:0000:0000:0000:9803:9b03:0033:f782
base lid: 0x1
sm lid: 0x1
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 100 Gb/sec (4X EDR)
link_layer: InfiniBand

 

PrasanthD_intel
Moderator

Hi,


We cannot infer much from the stack trace that you have provided.

Could you please try/provide the following for us:

1) If you can, please try to check with the latest version 2019u9 and let us know if the error persists.

2) Please provide the full command line you are using other than mpirun

3) Export I_MPI_DEBUG=10 and provide the logs.

4) Please provide a sample reproducer for us to test.


Regards

Prasanth


tamilalagan
Novice

Hi Prasanth,

This issue is not easy to reproduce in my setup, and there are no definite steps either.

1) If you can, please try to check with the latest version 2019u9 and let us know if the error persists.

Tamil >> This is a bit difficult to integrate, and this test will take some time.

2) Please provide the full command line you are using other than mpirun

Tamil >> No other commands related to mpirun are used. Could you please be more specific about which commands you mean? (I am not using any setenv or MPI parameters.)

3)export I_MPI_DEBUG=10 and provide the logs

Tamil >> Started collecting with I_MPI_DEBUG set to 5 and OFI debug logging set to 10. Will update with the new log once the issue is reproduced.

4) Please provide a sample reproducer for us to test.

Tamil >> I still have to find clear reproduction steps :(

tamilalagan
Novice

Please also find the new launch command.

mpirun -l -print-all-exitcodes -cleanup -genv I_MPI_FABRICS shm:ofi -genv I_MPI_WAIT_MODE 1 -genv I_MPI_DEBUG 5 -genv FI_LOG_LEVEL debug -genv FI_PROVIDER_PATH <BasePath>/Intel/mpi/lib-IB/prov -genv UCX_NET_DEVICES mlx5_0:1

PrasanthD_intel
Moderator

Hi,


We were not able to reproduce the error, even with the new launch command.

As you said, could you please provide the logs we asked for? That will help us identify the exact issue.

Also, are you getting errors with every program or only with certain code?


Regards

Prasanth


tamilalagan
Novice

Thanks Prasanth,

I will trim the logs and upload the same.

tamilalagan
Novice

Hello Prasanth,

 

Please also consider the use case below and confirm:

1. Suppose an MPI task in the cluster crashes (segmentation fault) while other tasks have RDMA pull logic targeting the crashed task.

2. Could MPI then report a "segmentation fault: address not mapped to object" error?

PrasanthD_intel
Moderator

Hi,

1. Suppose an MPI task in the cluster crashes (segmentation fault) while other tasks have RDMA pull logic targeting the crashed task.

I haven't fully understood the question, but from what I understand you are asking whether Intel MPI aborts when one of the processes crashes. The answer is yes: currently no fault tolerance is enabled by default, so if one of the tasks fails, Intel MPI will abort. You can change this behavior by enabling fault tolerance.

2. The MPI could reports with "segmentation fault address not mapped to object" error?

Could you please elaborate on what you are trying to ask?

Regards

Prasanth

PrasanthD_intel
Moderator

Hi,


We have been waiting for the debug logs you said you would provide. Let us know when you can provide them. Also, we haven't understood some of the questions you asked; could you please elaborate on them?


Regards

Prasanth


tamilalagan
Novice

Please allow a day or two; I will share more data for analysis.

PrasanthD_intel
Moderator

Hi,


We are expecting the logs from you. If it takes more time than expected, please let us know; we can close this issue for now, and you can start a new thread when they are available.


Regards

Prasanth


PrasanthD_intel
Moderator

Hi,


It seems you need more time to provide the logs, and we need to close this thread as there has been no interaction for an extended period.

Please raise a new thread when you have all the necessary information.

Any further interaction in this thread will be considered community-only.


Regards

Prasanth

