Hi,
I am trying to run the WRF application (real.exe) on a single server (2x 8380 processors) running RHEL 8.3 (4.18.0-240.22.1.el8_3.x86_64), and I get the following error message:
metgrid input_wrf.F first_date_nml = 2021-11-09_00:00:00
Assertion failed in file ../../src/mpid/ch4/shm/posix/eager/include/intel_transport_recv.h at line 1334: cma_read_nbytes == size
/opt/impi/2019.12.320/intel64/lib/release/libmpi.so.12(MPL_backtrace_show+0x1c) [0x155551c5991c]
/opt/impi/2019.12.320/intel64/lib/release/libmpi.so.12(MPIR_Assert_fail+0x21) [0x1555516242d1]
/opt/impi/2019.12.320/intel64/lib/release/libmpi.so.12(+0xaa6974) [0x155551bde974]
/opt/impi/2019.12.320/intel64/lib/release/libmpi.so.12(+0x1bb1a5) [0x1555512f31a5]
/opt/impi/2019.12.320/intel64/lib/release/libmpi.so.12(+0x749e6d) [0x155551881e6d]
/opt/impi/2019.12.320/intel64/lib/release/libmpi.so.12(+0x3583ba) [0x1555514903ba]
/opt/impi/2019.12.320/intel64/lib/release/libmpi.so.12(+0x68c613) [0x1555517c4613]
/opt/impi/2019.12.320/intel64/lib/release/libmpi.so.12(+0x1916ad) [0x1555512c96ad]
/opt/impi/2019.12.320/intel64/lib/release/libmpi.so.12(+0x16055c) [0x15555129855c]
/opt/impi/2019.12.320/intel64/lib/release/libmpi.so.12(+0x22fadd) [0x155551367add]
/opt/impi/2019.12.320/intel64/lib/release/libmpi.so.12(MPI_Scatterv+0x2e4) [0x1555517c56a4]
/home/user1/WRFV4/main/real.exe() [0xab2216]
/home/user1/WRFV4/main/real.exe() [0x9342b9]
/home/user1/WRFV4/main/real.exe() [0x8a2623]
/home/user1/WRFV4/main/real.exe() [0x89d80f]
/home/user1/WRFV4/main/real.exe() [0x89364b]
/home/user1/WRFV4/main/real.exe() [0x892bc9]
/home/user1/WRFV4/main/real.exe() [0x89263a]
/home/user1/WRFV4/main/real.exe() [0x1a0e45e]
/home/user1/WRFV4/main/real.exe() [0x14abbb2]
/home/user1/WRFV4/main/real.exe() [0x415df6]
/home/user1/WRFV4/main/real.exe() [0x414b62]
/lib64/libc.so.6(__libc_start_main+0xf3) [0x1555503667b3]
/home/user1/WRFV4/main/real.exe() [0x414a6e]
Abort(1) on node 8: Internal error
The application was launched using:
mpirun -np 10 -ppn 10 $WRF_PATH/main/real.exe
Is there any quick or temporary fix for this issue?
Please advise.
- puneet
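Most frames in the backtrace above are bare offsets, and only a few carry a symbol name. A minimal sketch for pulling those names out of a saved copy of the output, assuming the frame format shown in this post (the helper name `named_frames` and the log file name are hypothetical):

```shell
# Print the symbol names that appear in backtrace frames read from stdin.
# Frames look like: /path/libmpi.so.12(MPI_Scatterv+0x2e4) [0x1555517c56a4]
# Frames without a symbol, e.g. "(+0xaa6974)" or "real.exe()", are skipped.
named_frames() {
    sed -nE 's/.*\(([A-Za-z_][A-Za-z0-9_]*)\+0x[0-9a-f]+\).*/\1/p'
}
```

Usage would be something like `named_frames < real_error.log`. For the trace in this post, that leaves MPL_backtrace_show, MPIR_Assert_fail, MPI_Scatterv and __libc_start_main, which makes it easier to see that the assertion fires underneath MPI_Scatterv.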
I also tried the latest oneAPI release (2021.4.0) and get the same issue.
Hi,
Thanks for reaching out to us.
Could you please let us know which version of Intel Parallel Studio you are using?
Could you please provide us with the complete debug log using the below command with the latest Intel oneAPI 2021.4?
I_MPI_DEBUG=30 FI_LOG_LEVEL=debug mpirun -v -np 10 -ppn 10 $WRF_PATH/main/real.exe
Also, could you please confirm whether you are able to run the sample IMB-MPI1 benchmark using the below command?
mpirun -np 2 -ppn 2 IMB-MPI1
Thanks & Regards,
Santosh
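The debug output requested above can be long, so it is usually easier to write it to a file and attach that. A minimal sketch, assuming a POSIX shell; `capture_debug_log` is a hypothetical helper name, and only the two environment variables come from this thread:

```shell
# Run a command with the debug settings from this thread and write both
# stdout and stderr to a log file. The exit status of the command is kept.
# usage: capture_debug_log LOGFILE command [args...]
capture_debug_log() {
    log=$1
    shift
    env I_MPI_DEBUG=30 FI_LOG_LEVEL=debug "$@" > "$log" 2>&1
}
```

For the run in this thread that would be `capture_debug_log real_debug.log mpirun -v -np 10 -ppn 10 "$WRF_PATH/main/real.exe"`, where the log file name is only an example.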
Hi,
Thanks for providing the debug log.
Could you please try disabling CMA (cross-memory attach) using the command below and then run your application (real.exe) again?
export I_MPI_SHM_CMA=0
If this resolves your issue, make sure to accept this as a solution. This would help others with a similar issue.
Thanks & Regards,
Santosh
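Note that `export` keeps the variable set for the rest of the shell session, which can make later comparisons confusing. A small sketch of setting it for one command only, assuming a POSIX shell (`with_cma_disabled` is a hypothetical helper name, not part of Intel MPI):

```shell
# Run a command with CMA disabled for that command only.
# The calling shell's environment is left untouched.
with_cma_disabled() {
    env I_MPI_SHM_CMA=0 "$@"
}
```

For example, `with_cma_disabled mpirun -np 10 -ppn 10 "$WRF_PATH/main/real.exe"` runs the same launch line as in the original post with the variable applied to that run alone.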
After setting that environment variable, the error message changed to this:
Yes, this special data is acceptable to use: OUTPUT FROM METGRID V4.3
Input data is acceptable to use: met_em.d03.2021-11-09_00:00:00.nc
metgrid input_wrf.F first_date_input = 2021-11-09_00:00:00
metgrid input_wrf.F first_date_nml = 2021-11-09_00:00:00
[z7-33:273213:0:273213] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1547fbbf9640)
==== backtrace (tid: 273213) ====
0 0x0000000000012b20 .annobin_sigaction.c() sigaction.c:0
1 0x00000000003f9b80 memcpy_loop_regular_load_nontemporal_store_avx2() /build/impi/_buildspace/release/../../src/mpid/ch4/shm/posix/eager/include/i_mpi_memcpy_avx2.c:4904
2 0x00000000003f9b80 memcpy_loop_regular_load_nontemporal_store_avx2() /build/impi/_buildspace/release/../../src/mpid/ch4/shm/posix/eager/include/i_mpi_memcpy_avx2.c:4905
3 0x00000000003f9b80 I_MPI_memcpy_nontemporal_avx2() /build/impi/_buildspace/release/../../src/mpid/ch4/shm/posix/eager/include/i_mpi_memcpy_avx2.c:5721
4 0x00000000009ebb6e icx_write_to_frame() /build/impi/_buildspace/release/../../src/mpid/ch4/shm/posix/eager/include/intel_transport_icx.h:620
5 0x00000000009ebb6e write_to_frame() /build/impi/_buildspace/release/../../src/mpid/ch4/shm/posix/eager/include/intel_transport_memcpy.h:609
6 0x00000000009ebb6e progress_serialization_frame() /build/impi/_buildspace/release/../../src/mpid/ch4/shm/posix/eager/include/intel_transport_send.h:1230
7 0x00000000009ed246 isend() /build/impi/_buildspace/release/../../src/mpid/ch4/shm/posix/eager/include/intel_transport_send.h:1797
8 0x00000000009ed246 impi_shm_isend() /build/impi/_buildspace/release/../../src/mpid/ch4/shm/posix/eager/include/intel_transport_send.h:1906
9 0x000000000035d570 MPIDI_POSIX_am_isend() /build/impi/_buildspace/release/../../src/mpid/ch4/shm/src/../src/../posix/posix_am.h:99
10 0x000000000035d570 MPIDI_SHM_am_isend() /build/impi/_buildspace/release/../../src/mpid/ch4/shm/src/../src/shm_am.h:49
11 0x000000000035d570 MPIDIG_isend_impl() /build/impi/_buildspace/release/../../src/mpid/ch4/generic/mpidig_send.h:120
12 0x0000000000375489 MPIDIG_am_isend() /build/impi/_buildspace/release/../../src/mpid/ch4/generic/mpidig_send.h:176
13 0x0000000000375489 MPIDIG_mpi_isend() /build/impi/_buildspace/release/../../src/mpid/ch4/generic/mpidig_send.h:237
14 0x0000000000375489 MPIDI_POSIX_mpi_isend() /build/impi/_buildspace/release/../../src/mpid/ch4/shm/src/../src/../posix/posix_send.h:59
15 0x0000000000375489 MPIDI_SHM_mpi_isend() /build/impi/_buildspace/release/../../src/mpid/ch4/shm/src/../src/shm_p2p.h:165
16 0x0000000000375489 MPIDI_isend_unsafe() /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_send.h:215
17 0x0000000000375489 MPIDI_isend_safe() /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_send.h:351
18 0x0000000000375489 MPID_Isend() /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_send.h:935
19 0x0000000000375489 MPID_Isend_coll() /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_send.h:954
20 0x0000000000375489 MPIC_Isend() /build/impi/_buildspace/release/../../src/mpi/coll/helper_fns.c:531
21 0x00000000006e0804 MPIR_Scatterv_allcomm_linear() /build/impi/_buildspace/release/../../src/mpi/coll/scatterv/scatterv_allcomm_linear.c:67
22 0x000000000019eaea MPIDI_POSIX_mpi_scatterv() /build/impi/_buildspace/release/../../src/mpid/ch4/shm/src/../src/../posix/intel/posix_coll.h:438
23 0x000000000017b0d2 MPIDI_SHM_mpi_scatterv() /build/impi/_buildspace/release/../../src/mpid/ch4/shm/src/../src/shm_coll.h:131
24 0x000000000017b0d2 MPIDI_Scatterv_intra_composition_gamma() /build/impi/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_extra_compositions.h:605
25 0x000000000017b0d2 MPID_Scatterv_invoke() /build/impi/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_select_utils.c:1834
26 0x000000000017b0d2 MPIDI_coll_invoke() /build/impi/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_select_utils.c:3213
27 0x00000000001686aa MPIDI_coll_select() /build/impi/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_globals_default.c:130
28 0x00000000002523bd MPID_Scatterv() /build/impi/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll.h:158
29 0x00000000006e2697 PMPI_Scatterv() /build/impi/_buildspace/release/../../src/mpi/coll/scatterv/scatterv.c:380
30 0x0000000000ab2216 dist_on_comm0_() ???:0
31 0x00000000009342b9 wrf_global_to_patch_real_() ???:0
32 0x00000000008a2623 call_pkg_and_dist_generic_() ???:0
33 0x000000000089d80f call_pkg_and_dist_real_() ???:0
34 0x000000000089364b call_pkg_and_dist_() ???:0
35 0x0000000000892bc9 wrf_read_field1_() ???:0
36 0x000000000089263a wrf_read_field_() ???:0
37 0x0000000001a0e45e wrf_ext_read_field_() ???:0
38 0x00000000014abbb2 input_wrf_() ???:0
39 0x0000000000415df6 MAIN__() ???:0
40 0x0000000000414b62 main() ???:0
41 0x00000000000237b3 __libc_start_main() ???:0
42 0x0000000000414a6e _start() ???:0
=================================
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
Hi,
Do you have any suggestions for getting past the error message I shared in my previous comment?
Hi,
To investigate your issue further, could you please provide the complete debug log using the commands below, with CMA disabled?
export I_MPI_SHM_CMA=0
I_MPI_DEBUG=30 FI_LOG_LEVEL=debug mpirun -v -np 10 -ppn 10 $WRF_PATH/main/real.exe
Thanks & Regards,
Santosh
Hi,
Thanks for providing the complete debug log.
We can see that the invalid memory access occurs inside the MPI_Scatterv call.
So, could you please try the below steps?
export I_MPI_ADJUST_SCATTERV=0
I_MPI_DEBUG=30 FI_LOG_LEVEL=debug mpirun -v -np 10 -ppn 10 $WRF_PATH/main/real.exe
Could you please let us know whether it works? If you still face the issue, then please provide us with the complete debug log.
Thanks & Regards,
Santosh
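If one setting does not help, it can be useful to try several values in a row and keep a log for each. A rough sketch, assuming a POSIX shell; `try_values` and `LOG_DIR` are hypothetical names, and which values of I_MPI_ADJUST_SCATTERV are valid depends on the Intel MPI version, so the list passed in is the caller's assumption:

```shell
# Run a command once per candidate value of an environment variable,
# save the output of each attempt, and report its exit status.
# usage: try_values VAR "v1 v2 ..." command [args...]
try_values() {
    var=$1
    values=$2
    shift 2
    for v in $values; do
        status=0
        # One log per attempt, written to LOG_DIR (current directory by default).
        env "$var=$v" "$@" > "${LOG_DIR:-.}/run_${var}_${v}.log" 2>&1 || status=$?
        echo "$var=$v exit=$status"
    done
}
```

For example, `try_values I_MPI_ADJUST_SCATTERV "0 1" mpirun -np 10 -ppn 10 "$WRF_PATH/main/real.exe"` would run real.exe twice and print one status line per value. The value 1 here is only an illustration and should be checked against the developer reference for the installed release.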
Hi,
We haven't heard back from you. Could you please provide an update on your issue?
Also, please get back to us if your issue still persists.
Thanks & Regards,
Santosh
Hi,
We have not heard back from you. This thread will no longer be monitored by Intel. If you need further assistance, please post a new question.
Thanks & Regards,
Santosh