I'm using the MPI library from oneAPI 2021.4 to run simulations that use MPI_Comm_spawn and MPI_Comm_connect. To make this work, I set the following environment variables:
export I_MPI_SPAWN=on
export FI_MLX_NS_ENABLE=1
export I_MPI_SPAWN_EXPERIMENTAL=1
The third one comes from a previous post on this forum (see here).
When the simulations reach the end, I get the following error:
[node070:922191:0:922191] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x8)
==== backtrace (tid: 922191) ====
0 0x0000000000012b20 .annobin_sigaction.c() sigaction.c:0
1 0x00000000001fa9e1 MPIDIU_get_avt_size() /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4r_proc.c:90
2 0x00000000005fc11b MPIDI_OFI_free_avt_hook() /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/include/../ofi/ofi_proc.h:61
3 0x00000000005fc11b graceful_disconnect() /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/ofi/ofi_init.c:465
4 0x00000000005fc11b MPIDI_OFI_mpi_finalize_hook() /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/ofi/ofi_init.c:2192
5 0x00000000001da929 MPID_Finalize() /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_init.c:1334
6 0x00000000003004c5 PMPI_Finalize() /build/impi/_buildspace/release/../../src/mpi/init/finalize.c:158
7 0x000000000042fe7e MAIN__() ???:0
8 0x000000000040d8e2 main() ???:0
9 0x00000000000237b3 __libc_start_main() ???:0
10 0x000000000040d7ee _start() ???:0
Is there something I can do to fix this?
Hi,
Thanks for reaching out to us.
We tried to reproduce this on our end using Intel oneAPI 2021.4 on a Rocky Linux machine, following the steps below:
source /opt/intel/oneapi/setvars.sh
export FI_PROVIDER=mlx
export I_MPI_SPAWN=on
export FI_MLX_NS_ENABLE=1
export I_MPI_SPAWN_EXPERIMENTAL=1
mpiicc example.c
mpirun -n 3 ./a.out
We did not encounter any segmentation fault; it worked fine on our end, as shown in the attached screenshots. Please also find example.c attached.
Could you please provide us with the complete debug log (the file "LOG") produced by the commands below?
source /opt/intel/oneapi/setvars.sh
export I_MPI_DEBUG=30
export FI_LOG_LEVEL=debug
export FI_PROVIDER=mlx
export I_MPI_SPAWN=on
export FI_MLX_NS_ENABLE=1
export I_MPI_SPAWN_EXPERIMENTAL=1
mpiicc example.c
mpirun -n 3 ./a.out &> LOG
Thanks & Regards,
Hemanth.
Furthermore, I was able to attach a debugger and break on MPIDIU_get_avt_size:
Thread 1 "palm_main" hit Breakpoint 1, MPIDIU_get_avt_size (avtid=2) at ../../src/mpid/ch4/src/ch4r_proc.c:90
90 ../../src/mpid/ch4/src/ch4r_proc.c: No such file or directory.
(gdb) p MPIDI_global.avt_mgr
$1 = {mmapped_size = 32768, max_n_avts = 4, n_avts = 3, next_avtid = 2, free_avtid = 0x7f20000f7b80}
So it appears that MPIDIU_get_avt_size is entered with avtid=2.
When I then inspect the table, the entry at index 2 cannot be dereferenced:
(gdb) p MPIDI_av_table[0]->size
$8 = 1
(gdb) p MPIDI_av_table[1]->size
$9 = 8
(gdb) p MPIDI_av_table[2]->size
Cannot access memory at address 0x8
NOTE: I referred to the MPICH-3.4.2 source for the details of the MPIDIU_get_avt_size function.
Hi,
Could you please provide the details below?
1) Are you running on a single node or on multiple nodes? If you are using a cluster, please provide details.
2) Which command did you use to run the program?
3) Which fabric provider/interconnect are you using?
4) Is sugar++parallel compiled with MPICH? And are you trying to run it with Intel MPI?
5) Is the debug log you provided from a single run, or is it several logs from several runs of the application combined into one file?
6) Could you please compile with the -g option to get a more detailed trace in the debug log?
7) Please also provide the OS details and CPU information.
Thanks & Regards,
Hemanth.
Hi,
We have not heard back from you. Could you please provide the details requested above?
Thanks & Regards,
Hemanth.
Hi,
We have not heard back from you. This thread will no longer be monitored by Intel. If you need further assistance, please post a new question.
Thanks & Regards,
Hemanth.