Intel® MPI Library

[7] [Node0:4997 :0:5094] Caught signal 11 (Segmentation fault: address not mapped to object at

tamilalagan
Novice
22,957 Views

I am getting a segmentation fault, and the stack trace shows frames in libucs.

Setup:

Intel MPI Version: Version 2019 Update 7 Build 20200312

OFED Version: MLNX_OFED_LINUX-4.7-3.2.9.0

UCX Version: UCT version=1.7.0 revision b02bab9

Launch Parameter: mpirun -l -print-all-exitcodes -cleanup -genv I_MPI_FABRICS shm:ofi -genv I_MPI_WAIT_MODE 1 -genv FI_PROVIDER_PATH <basePath>/Intel/mpi/lib-IB/prov -genv UCX_NET_DEVICES mlx5_0:1 <Process>

Call Stack:

[7] [IMCNode000:4997 :0:5094] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xd8092c2f)
[7] ==== backtrace ====
[7] 0 /usr/lib64/libucs.so.0(+0x1de0c) [0x7fc4c0720e0c]
[7] 1 /usr/lib64/libucs.so.0(+0x1dfc2) [0x7fc4c0720fc2]
[7] 2 /lib64/libpthread.so.0(+0x10c70) [0x7fc4c4fb4c70]
[7] 3 /home/klac/mount_path_Tool/Binux/Tigris_IMC.exe() [0x8f4457]
[7] 4 /home/klac/mount_path_Tool/Binux/Tigris_IMC.exe() [0x8fb0ad]
[7] 5 /home/klac/mount_path_Tool/Binux/libOpenThreads.so(_ZN11OpenThreads20ThreadPrivateActions11StartThreadEPv+0x59) [0x7fc4e2bb4fe9]
[7] 6 /lib64/libpthread.so.0(+0x874a) [0x7fc4c4fac74a]
[7] 7 /lib64/libc.so.6(clone+0x6d) [0x7fc4c3beef6d]
[7] ===================
[7] Segmentation Fault Exception
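
A note on reading this trace: frames 0 and 1 in libucs.so.0, followed by the libpthread frame 2, most likely belong to the UCX error handler that prints the backtrace and to the signal trampoline, not to the code that faulted. The first frames of interest are then 3 and 4 inside Tigris_IMC.exe. As a minimal sketch, assuming the executable is not position-independent and was built with symbols (neither is confirmed in this thread), the Python below pulls those addresses out of the log and builds an addr2line command for them. The helper names parse_frames and addr2line_command are made up for illustration.

```python
import re

# Matches backtrace lines such as:
#   [7] 3 /path/to/app.exe() [0x8f4457]
FRAME_RE = re.compile(
    r"^\[(?P<rank>\d+)\]\s+(?P<idx>\d+)\s+(?P<module>\S+?)"
    r"\((?P<symbol>[^)]*)\)\s+\[(?P<addr>0x[0-9a-fA-F]+)\]$"
)

# Shortened copy of the trace posted above.
SAMPLE = """\
[7] ==== backtrace ====
[7] 0 /usr/lib64/libucs.so.0(+0x1de0c) [0x7fc4c0720e0c]
[7] 1 /usr/lib64/libucs.so.0(+0x1dfc2) [0x7fc4c0720fc2]
[7] 2 /lib64/libpthread.so.0(+0x10c70) [0x7fc4c4fb4c70]
[7] 3 /home/klac/mount_path_Tool/Binux/Tigris_IMC.exe() [0x8f4457]
[7] 4 /home/klac/mount_path_Tool/Binux/Tigris_IMC.exe() [0x8fb0ad]
"""


def parse_frames(log_text):
    """Return one dict per backtrace frame found in the log text."""
    frames = []
    for line in log_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append(match.groupdict())
    return frames


def addr2line_command(frames, executable):
    """Build an addr2line call for the frames that lie in `executable`.

    Assumption: the addresses are absolute and usable as-is, which only
    holds for a non-PIE executable.
    """
    addrs = [f["addr"] for f in frames if f["module"] == executable]
    return ["addr2line", "-f", "-C", "-e", executable] + addrs


frames = parse_frames(SAMPLE)
cmd = addr2line_command(
    frames, "/home/klac/mount_path_Tool/Binux/Tigris_IMC.exe"
)
print(" ".join(cmd))
```

Running the printed command on a node that holds the binary should name the functions for frames 3 and 4, and the source lines as well if debug information is present.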

 

I found a similar issue documented for OpenMPI at the link below:

https://github.com/openucx/ucx/wiki/OpenMPI-and-OpenSHMEM-installation-with-UCX#running-open-mpi-with-ucx

$ mpirun -np 2 -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 ./app

IMPORTANT NOTE: Recent OpenMPI versions contain a BTL component called 'uct', which can cause data corruption when enabled, due to conflict on malloc hooks between OPAL and UCM. In order to work-around this, use one of the following alternatives:

Alternative 1: Disable btl/uct in OpenMPI build configuration:

$ ./configure ... --enable-mca-no-build=btl-uct ...

Alternative 2: Disable btl/uct at runtime

$ mpirun -np 2 -mca pml ucx -mca btl ^uct -x UCX_NET_DEVICES=mlx5_0:1 ./app

Could this be the same issue as the OpenMPI one?

Is a fix available?

0 Kudos
14 Replies
PrasanthD_intel
Moderator
22,936 Views

Hi,


I am not sure whether this is the same issue as the OpenMPI one. I have tested the same command line in our environment with a sample program.


Could you please tell us which transports are available on your system, which provider you are using, and which interconnect you have?

To list the transports, use the following command: ucx_info -d | grep Transport


Regards

Prasanth


0 Kudos
tamilalagan
Novice
22,933 Views

Available Transport & Device:

ucx_info -d | grep 'Transport\|Device'

# Transport: posix
# Device: memory
# Transport: sysv
# Device: memory
# Transport: self
# Device: memory
# Transport: tcp
# Device: eth2
# Transport: tcp
# Device: ib0
# Transport: tcp
# Device: eth0
# Transport: tcp
# Device: eth3
# Transport: tcp
# Device: eth1
# Transport: rc
# Device: mlx5_0:1
# Transport: rc_mlx5
# Device: mlx5_0:1
# Transport: dc_mlx5
# Device: mlx5_0:1
# Transport: ud
# Device: mlx5_0:1
# Transport: ud_mlx5
# Device: mlx5_0:1
# Transport: cm
# Device: mlx5_0:1
# Transport: cma
# Device: memory
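
The listing above can be regrouped by device to see which transports UCX offers on the InfiniBand port. This is a minimal sketch in Python, assuming the text comes from ucx_info -d filtered as shown; transports_by_device is a hypothetical helper name, and the sample is a shortened copy of the output above.

```python
# Shortened copy of the filtered ucx_info -d output above.
SAMPLE = """\
# Transport: tcp
# Device: ib0
# Transport: rc
# Device: mlx5_0:1
# Transport: rc_mlx5
# Device: mlx5_0:1
# Transport: cma
# Device: memory
"""


def transports_by_device(ucx_info_text):
    """Map each device name to the list of transports reported for it."""
    mapping = {}
    current_transport = None
    for raw in ucx_info_text.splitlines():
        # Drop the leading '#' and padding that ucx_info prints.
        line = raw.lstrip("# \t").strip()
        if line.startswith("Transport:"):
            current_transport = line.split(":", 1)[1].strip()
        elif line.startswith("Device:") and current_transport is not None:
            # Split only once so a device such as 'mlx5_0:1' stays intact.
            device = line.split(":", 1)[1].strip()
            mapping.setdefault(device, []).append(current_transport)
    return mapping


mapping = transports_by_device(SAMPLE)
print(mapping["mlx5_0:1"])
```

Applied to the full listing, the entry for mlx5_0:1 collects every transport that names that device.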

 

The provider you were using: OFI Provider

interconnect: 

Infiniband device 'mlx5_0' port 1 status:
default gid: fe80:0000:0000:0000:9803:9b03:0033:f782
base lid: 0x1
sm lid: 0x1
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 100 Gb/sec (4X EDR)
link_layer: InfiniBand

 

0 Kudos
PrasanthD_intel
Moderator
22,863 Views

Hi,


We cannot infer much from the stack that you have provided.

Could you please try or provide the following:

1) If you can, please try to check with the latest version 2019u9 and let us know if the error persists.

2) Please provide the full command line you are using other than mpirun

3) Run export I_MPI_DEBUG=10 and provide the logs.

4) Please provide a sample reproducer for us to test.
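
The debug-log request in this list can be scripted so that the output lands in one file that is easy to attach. This is a minimal sketch in Python, assuming mpirun is on PATH; build_debug_launch and run_and_capture are hypothetical helper names, and ./app stands in for the real executable. It passes the variable with -genv, as in the launch line earlier in this thread, instead of using export.

```python
import shlex
import subprocess


def build_debug_launch(app_args, debug_level=10, extra_genv=None):
    """Return an mpirun argument list with I_MPI_DEBUG set via -genv."""
    cmd = ["mpirun", "-l", "-genv", "I_MPI_DEBUG", str(debug_level)]
    for name, value in (extra_genv or {}).items():
        cmd += ["-genv", name, str(value)]
    return cmd + list(app_args)


def run_and_capture(cmd, log_path):
    """Run the command and write stdout and stderr to a single log file."""
    with open(log_path, "w") as log:
        result = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
    return result.returncode


# './app' is a placeholder, not the executable from this thread.
cmd = build_debug_launch(["./app"], extra_genv={"I_MPI_FABRICS": "shm:ofi"})
print(shlex.join(cmd))
```

Calling run_and_capture(cmd, "mpi_debug.log") would then start the job and keep the full debug output in mpi_debug.log; the file name is only an example.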


Regards

Prasanth


0 Kudos
tamilalagan
Novice
22,840 Views

Hi Prasanth,

This issue is not easy to reproduce in my setup, and I have no definite steps to trigger it.

1) If you can, please try to check with the latest version 2019u9 and let us know if the error persists.

Tamil >> The new version is a bit difficult to integrate, so this test will take some time.

2) Please provide the full command line you are using other than mpirun

Tamil >> No other commands related to mpirun are used. Could you please be more specific about which commands you mean? (I am not using any setenv or additional MPI parameters.)

3) Run export I_MPI_DEBUG=10 and provide the logs.

Tamil >> I have started collecting logs with MPI_DEBUG 5 and OFI_DEBUG 10. I will post the new log once the issue reproduces.

4) Please provide a sample reproducer for us to test.

Tamil >> I still have to find clear reproduction steps :(

0 Kudos
tamilalagan
Novice
22,838 Views

Here is the new launch command:

mpirun -l -print-all-exitcodes -cleanup -genv I_MPI_FABRICS shm:ofi -genv I_MPI_WAIT_MODE 1 -genv I_MPI_DEBUG 5 -genv FI_LOG_LEVEL debug -genv FI_PROVIDER_PATH <BasePath>/Intel/mpi/lib-IB/prov -genv UCX_NET_DEVICES mlx5_0:1

0 Kudos
PrasanthD_intel
Moderator
22,787 Views

Hi,


We were not able to reproduce the error even with the new launch command.

As you mentioned, could you please provide the logs we asked for? They will help us identify the exact issue.

Also, do you get the error with every program or only with certain code?


Regards

Prasanth


0 Kudos
tamilalagan
Novice
22,782 Views

Thanks Prasanth,

I will trim the logs and upload them.

0 Kudos
tamilalagan
Novice
22,781 Views

Hello Prasanth,

 

Please also consider the below use case and confirm:

1. If any MPI task in cluster crashes (segmentation faults) and other tasks have RDMA pull logic with crashed tasks.

2. Could MPI then report a "segmentation fault address not mapped to object" error?

0 Kudos
PrasanthD_intel
Moderator
22,770 Views

Hi,

1. If any MPI task in cluster crashes (segmentation faults) and other tasks have RDMA pull logic with crashed tasks.

I have not fully understood your question, but I take it you are asking whether Intel MPI aborts when one of the processes crashes. The answer is yes: fault tolerance is not enabled by default, so if one of the tasks fails, Intel MPI aborts. You can change this behavior by enabling fault tolerance.

2. Could MPI then report a "segmentation fault address not mapped to object" error?

Could you please elaborate on what you are trying to ask?

Regards

Prasanth

0 Kudos
PrasanthD_intel
Moderator
22,713 Views

Hi,


We have been waiting for the debug logs you said you would provide. Please let us know when you can share them. Also, we did not understand some of the questions you asked. Could you please elaborate on them?


Regards

Prasanth


0 Kudos
tamilalagan
Novice
22,707 Views

Please give me a day or two; I will share more data for analysis.

0 Kudos
PrasanthD_intel
Moderator
22,671 Views

Hi,


We are still waiting for the logs from you. If you need more time than expected, please let us know. We can close this issue for now, and you can start a new thread when the logs are available.


Regards

Prasanth


0 Kudos
PrasanthD_intel
Moderator
22,627 Views

Hi,


It seems you need more time to provide the logs, and we need to close this thread because there has been no interaction for an extended period.

Please open a new thread when you have all the necessary information.

Any further interaction in this thread will be considered community only.


Regards

Prasanth


0 Kudos
chenweiguang
Beginner
18,245 Views

Hi,

 

We have encountered a similar problem. Attached are the job script and the output of our submitted job.

 

Regards

ChenWeiguang

 

 

0 Kudos