Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

[7] [Node0:4997 :0:5094] Caught signal 11 (Segmentation fault: address not mapped to object at

tamilalagan
Novice

Getting a segmentation fault error with a stack trace pointing into libucs.

Setup:

Intel MPI Version: 2019 Update 7 Build 20200312

OFED Version: MLNX_OFED_LINUX-4.7-3.2.9.0

UCX Version: UCT version=1.7.0 revision b02bab9

Launch Parameter: mpirun -l -print-all-exitcodes -cleanup -genv I_MPI_FABRICS shm:ofi -genv I_MPI_WAIT_MODE 1 -genv FI_PROVIDER_PATH <basePath>/Intel/mpi/lib-IB/prov -genv UCX_NET_DEVICES mlx5_0:1 <Process>

Call Stack:

[7] [IMCNode000:4997 :0:5094] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xd8092c2f)
[7] ==== backtrace ====
[7] 0 /usr/lib64/libucs.so.0(+0x1de0c) [0x7fc4c0720e0c]
[7] 1 /usr/lib64/libucs.so.0(+0x1dfc2) [0x7fc4c0720fc2]
[7] 2 /lib64/libpthread.so.0(+0x10c70) [0x7fc4c4fb4c70]
[7] 3 /home/klac/mount_path_Tool/Binux/Tigris_IMC.exe() [0x8f4457]
[7] 4 /home/klac/mount_path_Tool/Binux/Tigris_IMC.exe() [0x8fb0ad]
[7] 5 /home/klac/mount_path_Tool/Binux/libOpenThreads.so(_ZN11OpenThreads20ThreadPrivateActions11StartThreadEPv+0x59) [0x7fc4e2bb4fe9]
[7] 6 /lib64/libpthread.so.0(+0x874a) [0x7fc4c4fac74a]
[7] 7 /lib64/libc.so.6(clone+0x6d) [0x7fc4c3beef6d]
[7] ===================
[7] Segmentation Fault Exception

 

I have found a similar issue handled for OpenMPI at the link below:

https://github.com/openucx/ucx/wiki/OpenMPI-and-OpenSHMEM-installation-with-UCX#running-open-mpi-with-ucx

$ mpirun -np 2 -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 ./app

IMPORTANT NOTE: Recent OpenMPI versions contain a BTL component called 'uct', which can cause data corruption when enabled, due to conflict on malloc hooks between OPAL and UCM. In order to work-around this, use one of the following alternatives:

Alternative 1: Disable btl/uct in OpenMPI build configuration:

$ ./configure ... --enable-mca-no-build=btl-uct ...

Alternative 2: Disable btl/uct at runtime

$ mpirun -np 2 -mca pml ucx -mca btl ^uct -x UCX_NET_DEVICES=mlx5_0:1 ./app

Could this issue be the same as the OpenMPI one?

Is there any fix available?
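
In case it helps, this is the kind of test I am planning in order to narrow it down (the FI_PROVIDER value and the UCX memory-hook variable below are my own assumptions, not something confirmed for this issue; ucx_info -c lists which UCX_* variables my build actually supports):

# Test 1: take the UCX/InfiniBand path out of the picture by forcing a different libfabric provider
mpirun -l -genv I_MPI_FABRICS shm:ofi -genv FI_PROVIDER tcp <Process>

# Test 2: keep InfiniBand but ask UCX not to install its malloc/mmap hooks, if the build supports UCX_MEM_EVENTS
ucx_info -c | grep MEM
mpirun -l -genv I_MPI_FABRICS shm:ofi -genv UCX_NET_DEVICES mlx5_0:1 -genv UCX_MEM_EVENTS no <Process>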

PrasanthD_intel
Moderator

Hi,


I am not sure whether this is the same issue as the OpenMPI one, as I have tested the same command line in our environment with a sample program.


Could you please tell us the transports available on your system, the provider you are using, and the interconnect?

To list the transports, use the following command: ucx_info -d | grep Transport


Regards

Prasanth


tamilalagan
Novice

Available transports & devices:

ucx_info -d | grep 'Transport\|Device'

# Transport: posix
# Device: memory
# Transport: sysv
# Device: memory
# Transport: self
# Device: memory
# Transport: tcp
# Device: eth2
# Transport: tcp
# Device: ib0
# Transport: tcp
# Device: eth0
# Transport: tcp
# Device: eth3
# Transport: tcp
# Device: eth1
# Transport: rc
# Device: mlx5_0:1
# Transport: rc_mlx5
# Device: mlx5_0:1
# Transport: dc_mlx5
# Device: mlx5_0:1
# Transport: ud
# Device: mlx5_0:1
# Transport: ud_mlx5
# Device: mlx5_0:1
# Transport: cm
# Device: mlx5_0:1
# Transport: cma
# Device: memory

 

The provider we are using: OFI

Interconnect:

Infiniband device 'mlx5_0' port 1 status:
default gid: fe80:0000:0000:0000:9803:9b03:0033:f782
base lid: 0x1
sm lid: 0x1
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 100 Gb/sec (4X EDR)
link_layer: InfiniBand

 

PrasanthD_intel
Moderator

Hi,


We cannot infer much from the stack that you have provided.

Could you please try/provide the following for us:

1) If you can, please try to check with the latest version 2019u9 and let us know if the error persists.

2) Please provide the full command line you are using other than mpirun

3) Export I_MPI_DEBUG=10 and provide the logs (see the example after this list).

4) Please provide a sample reproducer for us to test.
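
For item 3, capturing the output could look roughly like this (run.log is only a placeholder name, and <your existing options> stands for the rest of your current launch line):

mpirun -genv I_MPI_DEBUG 10 -genv FI_LOG_LEVEL debug <your existing options> <Process> 2>&1 | tee run.log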


Regards

Prasanth


tamilalagan
Novice

Hi Prasanth,

This issue is not easy to reproduce in my setup, and there are no definite reproduction steps either.

1) If you can, please try to check with the latest version 2019u9 and let us know if the error persists.

Tamil >> This is a bit difficult to integrate, and it will take some time to run this test.

2) Please provide the full command line you are using other than mpirun

Tamil >> No other commands related to mpirun are used. Could you please be more specific about which commands you mean? (I am not setting any environment variables or MPI parameters.)

3) Export I_MPI_DEBUG=10 and provide the logs

Tamil >> Started collecting logs with I_MPI_DEBUG=5 and OFI (libfabric) debug logging. I will update with a new log once the issue is reproduced.

4) Please provide a sample reproducer for us to test.

Tamil >> I still have to find clear reproduction steps :(.

tamilalagan
Novice

Please also find the new launch command.

mpirun -l -print-all-exitcodes -cleanup -genv I_MPI_FABRICS shm:ofi -genv I_MPI_WAIT_MODE 1 -genv I_MPI_DEBUG 5 -genv FI_LOG_LEVEL debug -genv FI_PROVIDER_PATH <BasePath>/Intel/mpi/lib-IB/prov -genv UCX_NET_DEVICES mlx5_0:1

PrasanthD_intel
Moderator

Hi,


We were not able to reproduce the error even with the new launch command.

As you said, could you please provide the logs we asked for? That will help us identify the exact issue.

Also, are you getting errors with every program or only with certain code?


Regards

Prasanth


tamilalagan
Novice

Thanks Prasanth,

I will trim the logs and upload them.

tamilalagan
Novice

Hello Prasanth,

 

Please also consider the use case below and confirm:

1. An MPI task in the cluster crashes (segmentation fault) while other tasks still have RDMA pull logic targeting the crashed task.

2. Could the MPI library then report the "segmentation fault: address not mapped to object" error?

PrasanthD_intel
Moderator

Hi,

1. An MPI task in the cluster crashes (segmentation fault) while other tasks still have RDMA pull logic targeting the crashed task.

I haven't fully understood the question, but from what I understand you are asking whether Intel MPI aborts when one of the processes crashes. The answer is yes: currently there is no fault tolerance enabled by default, so if one of the tasks fails, Intel MPI will abort the whole job. You can change this behavior by enabling fault tolerance.
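
As a rough sketch of that direction only (assuming your Intel MPI version supports the -disable-auto-cleanup Hydra option; please verify against the Developer Reference, and note the application must also set the MPI_ERRORS_RETURN error handler instead of the default MPI_ERRORS_ARE_FATAL):

# keep the remaining ranks alive when one rank terminates abnormally
mpirun -disable-auto-cleanup -n <N> <Process>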

2. Could the MPI library then report the "segmentation fault: address not mapped to object" error?

Could you please elaborate on what you are trying to ask?

Regards

Prasanth

PrasanthD_intel
Moderator

Hi,


We have been waiting for the debug logs you said you would provide. Let us know when you can share them. Also, we haven't understood some of the questions you asked; could you please elaborate on them?


Regards

Prasanth


tamilalagan
Novice

Please allow a day or two; I will share more data for analysis.

PrasanthD_intel
Moderator

Hi,


We are still expecting the logs from you. If it will take more time than expected, please let us know; we can close this issue for now, and you can start a new thread when they are available.


Regards

Prasanth


PrasanthD_intel
Moderator

Hi,


It seems you need more time to provide the logs, and we need to close this thread as there has been no interaction for an extended period.

Please raise a new thread when you have all the necessary information.

Any further interaction in this thread will be considered community only.


Regards

Prasanth


chenweiguang
Beginner

Hi,

 

We have also encountered similar problems. The attachment contains the job script and the output of our submitted job.

 

Regards

ChenWeiguang

 

 
