Intel® MPI Library

Intel MPI error - line 1334: cma_read_nbytes == size

psing51
New Contributor I

Hi,
I am trying to run the WRF application (real.exe), and I get the following error message on RHEL 8.3 (kernel 4.18.0-240.22.1.el8_3.x86_64) on a single server (with two 8380 processors):

metgrid input_wrf.F first_date_nml = 2021-11-09_00:00:00
Assertion failed in file ../../src/mpid/ch4/shm/posix/eager/include/intel_transport_recv.h at line 1334: cma_read_nbytes == size
/opt/impi/2019.12.320/intel64/lib/release/libmpi.so.12(MPL_backtrace_show+0x1c) [0x155551c5991c]
/opt/impi/2019.12.320/intel64/lib/release/libmpi.so.12(MPIR_Assert_fail+0x21) [0x1555516242d1]
/opt/impi/2019.12.320/intel64/lib/release/libmpi.so.12(+0xaa6974) [0x155551bde974]
/opt/impi/2019.12.320/intel64/lib/release/libmpi.so.12(+0x1bb1a5) [0x1555512f31a5]
/opt/impi/2019.12.320/intel64/lib/release/libmpi.so.12(+0x749e6d) [0x155551881e6d]
/opt/impi/2019.12.320/intel64/lib/release/libmpi.so.12(+0x3583ba) [0x1555514903ba]
/opt/impi/2019.12.320/intel64/lib/release/libmpi.so.12(+0x68c613) [0x1555517c4613]
/opt/impi/2019.12.320/intel64/lib/release/libmpi.so.12(+0x1916ad) [0x1555512c96ad]
/opt/impi/2019.12.320/intel64/lib/release/libmpi.so.12(+0x16055c) [0x15555129855c]
/opt/impi/2019.12.320/intel64/lib/release/libmpi.so.12(+0x22fadd) [0x155551367add]
/opt/impi/2019.12.320/intel64/lib/release/libmpi.so.12(MPI_Scatterv+0x2e4) [0x1555517c56a4]
/home/user1/WRFV4/main/real.exe() [0xab2216]
/home/user1/WRFV4/main/real.exe() [0x9342b9]
/home/user1/WRFV4/main/real.exe() [0x8a2623]
/home/user1/WRFV4/main/real.exe() [0x89d80f]
/home/user1/WRFV4/main/real.exe() [0x89364b]
/home/user1/WRFV4/main/real.exe() [0x892bc9]
/home/user1/WRFV4/main/real.exe() [0x89263a]
/home/user1/WRFV4/main/real.exe() [0x1a0e45e]
/home/user1/WRFV4/main/real.exe() [0x14abbb2]
/home/user1/WRFV4/main/real.exe() [0x415df6]
/home/user1/WRFV4/main/real.exe() [0x414b62]
/lib64/libc.so.6(__libc_start_main+0xf3) [0x1555503667b3]
/home/user1/WRFV4/main/real.exe() [0x414a6e]
Abort(1) on node 8: Internal error


The application was launched with:
mpirun -np 10 -ppn 10 $WRF_PATH/main/real.exe

Is there any quick/temporary fix for this issue?
Please advise.
- Puneet

psing51
New Contributor I

I attempted this with the latest oneAPI release (2021.4.0) and get the same issue.


SantoshY_Intel
Moderator

Hi,

 

Thanks for reaching out to us.

 

Could you please let us know which version of Intel Parallel Studio you have been using?

 

Could you please provide us with the complete debug log from the command below, using the latest Intel oneAPI 2021.4?

 

I_MPI_DEBUG=30 FI_LOG_LEVEL=debug mpirun -v -np 10 -ppn 10 $WRF_PATH/main/real.exe
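
To capture the complete output to a file for attaching, you could redirect it as follows (a sketch assuming a bash shell; the log file name is arbitrary):

I_MPI_DEBUG=30 FI_LOG_LEVEL=debug mpirun -v -np 10 -ppn 10 $WRF_PATH/main/real.exe 2>&1 | tee debug.log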

 

 

Also, could you please confirm whether you are able to run the IMB-MPI1 benchmark using the command below?

 

mpirun -np 2 -ppn 2 IMB-MPI1
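
Since the backtrace ends in MPI_Scatterv, it may also be worth exercising just that collective at the failing process count. IMB-MPI1 accepts benchmark names as arguments, so one optional variant is:

mpirun -np 10 -ppn 10 IMB-MPI1 Scatterv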

 

 

Thanks & Regards,

Santosh
psing51
New Contributor I

The IMB tests worked fine.
The real.exe logs are attached.

SantoshY_Intel
Moderator

Hi,

 

Thanks for providing the debug log.

 

Could you please try disabling CMA by setting the environment variable below, and then rerun your application (real.exe)?

export I_MPI_SHM_CMA=0 
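
For background: CMA here is the Linux cross-memory-attach mechanism (the process_vm_readv/process_vm_writev system calls), which the shared-memory transport can use to copy large intra-node messages directly between process address spaces. On some systems these calls are restricted by Yama ptrace hardening, so it may be worth checking that setting too (an assumption worth verifying; the file exists on most RHEL 8 kernels):

cat /proc/sys/kernel/yama/ptrace_scope   # 0 = classic ptrace permissions, which allow CMA between processes of the same user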

 

If this resolves your issue, make sure to accept this as a solution. This would help others with a similar issue.

 

Thanks & Regards,

Santosh
psing51
New Contributor I

After setting the aforementioned environment variable, the error message changed to this:

Yes, this special data is acceptable to use: OUTPUT FROM METGRID V4.3
Input data is acceptable to use: met_em.d03.2021-11-09_00:00:00.nc
metgrid input_wrf.F first_date_input = 2021-11-09_00:00:00
metgrid input_wrf.F first_date_nml = 2021-11-09_00:00:00
[z7-33:273213:0:273213] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1547fbbf9640)
==== backtrace (tid: 273213) ====
0 0x0000000000012b20 .annobin_sigaction.c() sigaction.c:0
1 0x00000000003f9b80 memcpy_loop_regular_load_nontemporal_store_avx2() /build/impi/_buildspace/release/../../src/mpid/ch4/shm/posix/eager/include/i_mpi_memcpy_avx2.c:4904
2 0x00000000003f9b80 memcpy_loop_regular_load_nontemporal_store_avx2() /build/impi/_buildspace/release/../../src/mpid/ch4/shm/posix/eager/include/i_mpi_memcpy_avx2.c:4905
3 0x00000000003f9b80 I_MPI_memcpy_nontemporal_avx2() /build/impi/_buildspace/release/../../src/mpid/ch4/shm/posix/eager/include/i_mpi_memcpy_avx2.c:5721
4 0x00000000009ebb6e icx_write_to_frame() /build/impi/_buildspace/release/../../src/mpid/ch4/shm/posix/eager/include/intel_transport_icx.h:620
5 0x00000000009ebb6e write_to_frame() /build/impi/_buildspace/release/../../src/mpid/ch4/shm/posix/eager/include/intel_transport_memcpy.h:609
6 0x00000000009ebb6e progress_serialization_frame() /build/impi/_buildspace/release/../../src/mpid/ch4/shm/posix/eager/include/intel_transport_send.h:1230
7 0x00000000009ed246 isend() /build/impi/_buildspace/release/../../src/mpid/ch4/shm/posix/eager/include/intel_transport_send.h:1797
8 0x00000000009ed246 impi_shm_isend() /build/impi/_buildspace/release/../../src/mpid/ch4/shm/posix/eager/include/intel_transport_send.h:1906
9 0x000000000035d570 MPIDI_POSIX_am_isend() /build/impi/_buildspace/release/../../src/mpid/ch4/shm/src/../src/../posix/posix_am.h:99
10 0x000000000035d570 MPIDI_SHM_am_isend() /build/impi/_buildspace/release/../../src/mpid/ch4/shm/src/../src/shm_am.h:49
11 0x000000000035d570 MPIDIG_isend_impl() /build/impi/_buildspace/release/../../src/mpid/ch4/generic/mpidig_send.h:120
12 0x0000000000375489 MPIDIG_am_isend() /build/impi/_buildspace/release/../../src/mpid/ch4/generic/mpidig_send.h:176
13 0x0000000000375489 MPIDIG_mpi_isend() /build/impi/_buildspace/release/../../src/mpid/ch4/generic/mpidig_send.h:237
14 0x0000000000375489 MPIDI_POSIX_mpi_isend() /build/impi/_buildspace/release/../../src/mpid/ch4/shm/src/../src/../posix/posix_send.h:59
15 0x0000000000375489 MPIDI_SHM_mpi_isend() /build/impi/_buildspace/release/../../src/mpid/ch4/shm/src/../src/shm_p2p.h:165
16 0x0000000000375489 MPIDI_isend_unsafe() /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_send.h:215
17 0x0000000000375489 MPIDI_isend_safe() /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_send.h:351
18 0x0000000000375489 MPID_Isend() /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_send.h:935
19 0x0000000000375489 MPID_Isend_coll() /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_send.h:954
20 0x0000000000375489 MPIC_Isend() /build/impi/_buildspace/release/../../src/mpi/coll/helper_fns.c:531
21 0x00000000006e0804 MPIR_Scatterv_allcomm_linear() /build/impi/_buildspace/release/../../src/mpi/coll/scatterv/scatterv_allcomm_linear.c:67
22 0x000000000019eaea MPIDI_POSIX_mpi_scatterv() /build/impi/_buildspace/release/../../src/mpid/ch4/shm/src/../src/../posix/intel/posix_coll.h:438
23 0x000000000017b0d2 MPIDI_SHM_mpi_scatterv() /build/impi/_buildspace/release/../../src/mpid/ch4/shm/src/../src/shm_coll.h:131
24 0x000000000017b0d2 MPIDI_Scatterv_intra_composition_gamma() /build/impi/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_extra_compositions.h:605
25 0x000000000017b0d2 MPID_Scatterv_invoke() /build/impi/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_select_utils.c:1834
26 0x000000000017b0d2 MPIDI_coll_invoke() /build/impi/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_select_utils.c:3213
27 0x00000000001686aa MPIDI_coll_select() /build/impi/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_globals_default.c:130
28 0x00000000002523bd MPID_Scatterv() /build/impi/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll.h:158
29 0x00000000006e2697 PMPI_Scatterv() /build/impi/_buildspace/release/../../src/mpi/coll/scatterv/scatterv.c:380
30 0x0000000000ab2216 dist_on_comm0_() ???:0
31 0x00000000009342b9 wrf_global_to_patch_real_() ???:0
32 0x00000000008a2623 call_pkg_and_dist_generic_() ???:0
33 0x000000000089d80f call_pkg_and_dist_real_() ???:0
34 0x000000000089364b call_pkg_and_dist_() ???:0
35 0x0000000000892bc9 wrf_read_field1_() ???:0
36 0x000000000089263a wrf_read_field_() ???:0
37 0x0000000001a0e45e wrf_ext_read_field_() ???:0
38 0x00000000014abbb2 input_wrf_() ???:0
39 0x0000000000415df6 MAIN__() ???:0
40 0x0000000000414b62 main() ???:0
41 0x00000000000237b3 __libc_start_main() ???:0
42 0x0000000000414a6e _start() ???:0
=================================
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
psing51
New Contributor I

Hi,
Do you have any suggestions for getting past the error message I shared in the previous comment?

SantoshY_Intel
Moderator

Hi,

 

To investigate your issue further, could you please provide us with the complete debug log using the commands below, after disabling CMA?

 

export I_MPI_SHM_CMA=0
I_MPI_DEBUG=30 FI_LOG_LEVEL=debug mpirun -v -np 10 -ppn 10 $WRF_PATH/main/real.exe

 

 

Thanks & Regards,

Santosh

0 Kudos
psing51
New Contributor I
3,809 Views

stdout logs are attached.
please advice.

0 Kudos
SantoshY_Intel
Moderator
3,759 Views

Hi,

 

Thanks for providing the complete debug log.

 

We can see from the log that the invalid address access is generated in MPI_Scatterv.

So, could you please try the steps below?

export I_MPI_ADJUST_SCATTERV=0
I_MPI_DEBUG=30 FI_LOG_LEVEL=debug mpirun -v -np 10 -ppn 10 $WRF_PATH/main/real.exe
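
For context: I_MPI_ADJUST_SCATTERV overrides the library's choice of scatterv algorithm. If the run still fails, sweeping the algorithm IDs can help isolate whether one specific implementation triggers the crash. A minimal sketch, assuming a bash shell and the algorithm IDs documented for your Intel MPI release:

for alg in 0 1 2 3; do
  echo "=== I_MPI_ADJUST_SCATTERV=$alg ==="
  I_MPI_ADJUST_SCATTERV=$alg mpirun -np 10 -ppn 10 $WRF_PATH/main/real.exe
done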

 

Could you please let us know whether it works? If you still face the issue, then please provide us with the complete debug log.

 

Thanks & Regards,

Santosh
SantoshY_Intel
Moderator

Hi,


We haven't heard back from you. Could you please provide an update on your issue?

Also, please get back to us if your issue still persists.


Thanks & Regards,

Santosh
SantoshY_Intel
Moderator

Hi,


We have not heard back from you. This thread will no longer be monitored by Intel. If you need further assistance, please post a new question.


Thanks & Regards,

Santosh
psing51
New Contributor I

Hi,
Apologies for the delayed reply.
The run failed again with the suggested settings.
The stdout and rsl.error log files are attached.
