Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

SEGFAULT on MPI_Finalize

menno
Beginner

I'm using the oneAPI 2021.4 MPI library to run simulations that use MPI_Comm_spawn and MPI_Comm_connect. To make this work, I set the following environment variables:

export I_MPI_SPAWN=on
export FI_MLX_NS_ENABLE=1
export I_MPI_SPAWN_EXPERIMENTAL=1

The third one comes from a previous post on this forum (see here).
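For context, the connect/accept side of what I'm doing looks roughly like this (a simplified sketch, not my actual simulation code; the port string is exchanged out of band):

#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm inter;
    MPI_Init(&argc, &argv);

    if (argc > 1 && strcmp(argv[1], "server") == 0) {
        /* Server: open a port and accept one client connection. */
        MPI_Open_port(MPI_INFO_NULL, port);
        printf("port: %s\n", port);  /* hand this string to the client */
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
        MPI_Comm_disconnect(&inter);
        MPI_Close_port(port);
    } else if (argc > 2 && strcmp(argv[1], "client") == 0) {
        /* Client: connect using the port string printed by the server. */
        MPI_Comm_connect(argv[2], MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
        MPI_Comm_disconnect(&inter);
    }

    MPI_Finalize();  /* this is where the crash below occurs */
    return 0;
}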

When the simulation reaches its end and calls MPI_Finalize, I get the following error:

[node070:922191:0:922191] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x8)

==== backtrace (tid: 922191) ====
 0 0x0000000000012b20 .annobin_sigaction.c()  sigaction.c:0
 1 0x00000000001fa9e1 MPIDIU_get_avt_size()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4r_proc.c:90
 2 0x00000000005fc11b MPIDI_OFI_free_avt_hook()  /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/include/../ofi/ofi_proc.h:61
 3 0x00000000005fc11b graceful_disconnect()  /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/ofi/ofi_init.c:465
 4 0x00000000005fc11b MPIDI_OFI_mpi_finalize_hook()  /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/ofi/ofi_init.c:2192
 5 0x00000000001da929 MPID_Finalize()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_init.c:1334
 6 0x00000000003004c5 PMPI_Finalize()  /build/impi/_buildspace/release/../../src/mpi/init/finalize.c:158
 7 0x000000000042fe7e MAIN__()  ???:0
 8 0x000000000040d8e2 main()  ???:0
 9 0x00000000000237b3 __libc_start_main()  ???:0
10 0x000000000040d7ee _start()  ???:0

Is there something I can do to fix this?

HemanthCH_Intel
Moderator

Hi,

Thanks for reaching out to us.

We tried to reproduce this on our end with Intel oneAPI 2021.4 on a Rocky Linux machine, following the steps below:

source /opt/intel/oneapi/setvars.sh
export FI_PROVIDER=mlx
export I_MPI_SPAWN=on
export FI_MLX_NS_ENABLE=1
export I_MPI_SPAWN_EXPERIMENTAL=1
mpiicc example.c
mpirun -n 3 ./a.out

 

We did not encounter any segmentation fault; it worked fine on our end, as shown in the attached screenshots. Please also find example.c attached.
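For reference, a minimal spawn test of this kind looks roughly as follows (a sketch along the lines of the attachment, not necessarily identical to it):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm parent, intercomm;
    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);

    if (parent == MPI_COMM_NULL) {
        /* Parent processes: collectively spawn 2 copies of this binary. */
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 2, MPI_INFO_NULL, 0,
                       MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);
        MPI_Comm_disconnect(&intercomm);
    } else {
        /* Spawned children: disconnect from the parent and exit. */
        MPI_Comm_disconnect(&parent);
    }

    MPI_Finalize();
    return 0;
}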

 

Since we could not reproduce the issue, could you please provide us with the complete debug log file "LOG" produced by the commands below?

source /opt/intel/oneapi/setvars.sh
export I_MPI_DEBUG=30
export FI_LOG_LEVEL=debug
export FI_PROVIDER=mlx
export I_MPI_SPAWN=on
export FI_MLX_NS_ENABLE=1
export I_MPI_SPAWN_EXPERIMENTAL=1
mpiicc example.c
mpirun -n 3 ./a.out &> LOG

 

 

Thanks & Regards,

Hemanth.

 

menno
Beginner

Thank you for taking a look. I am not experiencing any segfault with the example you provided, and I'm afraid I haven't been able to build a simple reproducer of the problem. But when I run the actual simulation, I get the attached log file; the segfault appears at the end of it.

menno
Beginner

Furthermore, I was able to attach a debugger and break on MPIDIU_get_avt_size:

Thread 1 "palm_main" hit Breakpoint 1, MPIDIU_get_avt_size (avtid=2) at ../../src/mpid/ch4/src/ch4r_proc.c:90
90 ../../src/mpid/ch4/src/ch4r_proc.c: No such file or directory.
(gdb) p MPIDI_global.avt_mgr
$1 = {mmapped_size = 32768, max_n_avts = 4, n_avts = 3, next_avtid = 2, free_avtid = 0x7f20000f7b80}

 

So, it looks like it enters MPIDIU_get_avt_size with avtid=2.

 

Now, when I do the following, you can see that the entry at index 2 cannot be dereferenced:

 

(gdb) p MPIDI_av_table[0]->size
$8 = 1
(gdb) p MPIDI_av_table[1]->size
$9 = 8
(gdb) p MPIDI_av_table[2]->size
Cannot access memory at address 0x8

 

NOTE: I have referred to the MPICH-3.4.2 source for the details on the MPIDIU_get_avt_size function.
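From that source, the function is essentially a one-line table lookup (paraphrased here; the Intel MPI build may differ slightly):

int MPIDIU_get_avt_size(int avtid)
{
    /* With avtid == 2 and MPIDI_av_table[2] == NULL, reading the
     * 'size' member (at offset 0x8 in the struct) faults at address
     * 0x8 -- matching the "address not mapped" error above. */
    return MPIDI_av_table[avtid]->size;
}

So the crash is consistent with the address-vector table entry for the spawned/connected communicator having been freed (or never set) before the finalize hook reads it.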

HemanthCH_Intel
Moderator

Hi,


Could you please provide the below details?

1) Are you running on a single node or on multiple nodes? If you are using a cluster, please provide its details.

2) What command did you use to run the program?

3) Which fabric provider/interconnect are you using?

4) Is sugar++parallel compiled with MPICH? And are you trying to run it with Intel MPI?

5) Is the debug log you provided from a single run of the application, or are several logs from several runs combined into one file?

6) Could you please compile with the -g option to get a more detailed trace in the debug log? (An illustrative command follows this list.)

7) Also, please provide us with the OS details and CPU information.
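For example (an illustrative command only; substitute your actual source files and compiler wrapper; the MAIN__ frame in your backtrace suggests a Fortran main):

# Illustrative only: add -g (and optionally -O0) when compiling and linking;
# "simulation.f90" is a placeholder for your actual sources.
mpiifort -g -O0 simulation.f90 -o palm_main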


Thanks & Regards,

Hemanth.


HemanthCH_Intel
Moderator

Hi,


We have not heard back from you. Could you please provide the above-mentioned details?


Thanks & Regards,

Hemanth.


HemanthCH_Intel
Moderator

Hi,


We have not heard back from you. This thread will no longer be monitored by Intel. If you need further assistance, please post a new question.


Thanks & Regards,

Hemanth.

