Intel® oneAPI HPC Toolkit

MPI program aborts with an "Assertion failed in file ch4_shm_coll.c" message

ombrophile
Beginner

Hi,

I have written a Fortran code that uses MPI to solve some differential equations. Initially, the code worked fine with a simple 1D decomposition of the input arrays. Recently, to enable computation on larger arrays, I modified the code to use a 2D decomposition via the MPI topology feature. However, the code now sometimes exits with the following error:

 

Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c at line 2266: comm->shm_numa_layout[my_numa_node].base_addr
Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c at line 2266: comm->shm_numa_layout[my_numa_node].base_addr
Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c at line 2266: comm->shm_numa_layout[my_numa_node].base_addr
Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c at line 2266: comm->shm_numa_layout[my_numa_node].base_addr
Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c at line 2266: comm->shm_numa_layout[my_numa_node].base_addr
Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c at line 2266: comm->shm_numa_layout[my_numa_node].base_addr
Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c at line 2266: comm->shm_numa_layout[my_numa_node].base_addr
Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c at line 2266: comm->shm_numa_layout[my_numa_node].base_addr
/opt/intel/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12(MPL_backtrace_show+0x1c) [0x7f595a5c7bcc]
/opt/intel/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12(MPIR_Assert_fail+0x21) [0x7f5959fa1df1]
/opt/intel/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12(+0x2b1eb9) [0x7f5959c70eb9]
/opt/intel/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12(+0x1a8c18) [0x7f5959b67c18]
/opt/intel/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12(+0x1717ec) [0x7f5959b307ec]
/opt/intel/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12(+0x2b4387) [0x7f5959c73387]
/opt/intel/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12(PMPI_Allreduce+0x561) [0x7f5959ace6e1]
/opt/intel/oneapi/mpi/2021.5.1//lib/libmpifort.so.12(+0xdd95a) [0x7f595b2d195a]
/opt/intel/oneapi/mpi/2021.5.1//lib/libmpifort.so.12(mpi_allreduce_f08ts_+0x208) [0x7f595b23c758]
./cav_2d.exe() [0x45a512]
./cav_2d.exe() [0x408612]
./cav_2d.exe() [0x404ae2]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7f59596640b3]
./cav_2d.exe() [0x4049ee]
Abort(1) on node 6: Internal error
/opt/intel/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12(MPL_backtrace_show+0x1c) [0x7feb1f8a5bcc]
/opt/intel/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12(MPIR_Assert_fail+0x21) [0x7feb1f27fdf1]
/opt/intel/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12(+0x2b1eb9) [0x7feb1ef4eeb9]
/opt/intel/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12(+0x1a8c18) [0x7feb1ee45c18]
/opt/intel/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12(+0x1717ec) [0x7feb1ee0e7ec]
/opt/intel/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12(+0x2b4387) [0x7feb1ef51387]
/opt/intel/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12(PMPI_Allreduce+0x561) [0x7feb1edac6e1]
/opt/intel/oneapi/mpi/2021.5.1//lib/libmpifort.so.12(+0xdd95a) [0x7feb205af95a]
/opt/intel/oneapi/mpi/2021.5.1//lib/libmpifort.so.12(mpi_allreduce_f08ts_+0x208) [0x7feb2051a758]
./cav_2d.exe() [0x45a512]
./cav_2d.exe() [0x408612]
./cav_2d.exe() [0x404ae2]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7feb1e9420b3]
./cav_2d.exe() [0x4049ee]
Abort(1) on node 12: Internal error
/opt/intel/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12(MPL_backtrace_show+0x1c) [0x7ff95a227bcc]
/opt/intel/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12(MPIR_Assert_fail+0x21) [0x7ff959c01df1]
/opt/intel/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12(+0x2b1eb9) [0x7ff9598d0eb9]
/opt/intel/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12(+0x1a8c18) [0x7ff9597c7c18]
/opt/intel/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12(+0x1717ec) [0x7ff9597907ec]
/opt/intel/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12(+0x2b4387) [0x7ff9598d3387]
/opt/intel/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12(PMPI_Allreduce+0x561) [0x7ff95972e6e1]
/opt/intel/oneapi/mpi/2021.5.1//lib/libmpifort.so.12(+0xdd95a) [0x7ff95af3195a]
/opt/intel/oneapi/mpi/2021.5.1//lib/libmpifort.so.12(mpi_allreduce_f08ts_+0x208) [0x7ff95ae9c758]
./cav_2d.exe() [0x45a512]
./cav_2d.exe() [0x408612]
./cav_2d.exe() [0x404ae2]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7ff9592c40b3]
./cav_2d.exe() [0x4049ee]
Abort(1) on node 18: Internal error
/opt/intel/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12(MPL_backtrace_show+0x1c) [0x7f8f90d31bcc]
/opt/intel/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12(MPIR_Assert_fail+0x21) [0x7f8f9070bdf1]
/opt/intel/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12(+0x2b1eb9) [0x7f8f903daeb9]
/opt/intel/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12(+0x1a8c18) [0x7f8f902d1c18]
/opt/intel/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12(+0x1717ec) [0x7f8f9029a7ec]
/opt/intel/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12(+0x2b4387) [0x7f8f903dd387]
/opt/intel/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12(PMPI_Allreduce+0x561) [0x7f8f902386e1]
/opt/intel/oneapi/mpi/2021.5.1//lib/libmpifort.so.12(+0xdd95a) [0x7f8f91a3b95a]
/opt/intel/oneapi/mpi/2021.5.1//lib/libmpifort.so.12(mpi_allreduce_f08ts_+0x208) [0x7f8f919a6758]
./cav_2d.exe() [0x45a512]
./cav_2d.exe() [0x408612]
./cav_2d.exe() [0x404ae2]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7f8f8fdce0b3]
./cav_2d.exe() [0x4049ee]
Abort(1) on node 24: Internal error
/opt/intel/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12(MPL_backtrace_show+0x1c) [0x7f0ae71b4bcc]
/opt/intel/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12(MPL_backtrace_show+0x1c) [0x7ff9725e3bcc]
/opt/intel/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12(MPIR_Assert_fail+0x21) [0x7ff971fbddf1]
/opt/intel/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12(+0x2b1eb9) [0x7ff971c8ceb9]
/opt/intel/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12(+0x1a8c18) [0x7ff971b83c18]
/opt/intel/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12(+0x1717ec) [0x7ff971b4c7ec]
/opt/intel/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12(+0x2b4387) [0x7ff971c8f387]
/opt/intel/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12(PMPI_Allreduce+0x561) [0x7ff971aea6e1]
/opt/intel/oneapi/mpi/2021.5.1//lib/libmpifort.so.12(+0xdd95a) [0x7ff9732ed95a]
/opt/intel/oneapi/mpi/2021.5.1//lib/libmpifort.so.12(mpi_allreduce_f08ts_+0x208) [0x7ff973258758]
./cav_2d.exe() [0x45a512]
./cav_2d.exe() [0x408612]
./cav_2d.exe() [0x404ae2]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7ff9716800b3]
./cav_2d.exe() [0x4049ee]
Abort(1) on node 36: Internal error
/opt/intel/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12(MPL_backtrace_show+0x1c) [0x7f86f181cbcc]
/opt/intel/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12(MPIR_Assert_fail+0x21) [0x7f86f11f6df1]
/opt/intel/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12(+0x2b1eb9) [0x7f86f0ec5eb9]
/opt/intel/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12(+0x1a8c18) [0x7f86f0dbcc18]
/opt/intel/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12(+0x1717ec) [0x7f86f0d857ec]
/opt/intel/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12(+0x2b4387) [0x7f86f0ec8387]
/opt/intel/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12(PMPI_Allreduce+0x561) [0x7f86f0d236e1]
/opt/intel/oneapi/mpi/2021.5.1//lib/libmpifort.so.12(+0xdd95a) [0x7f86f252695a]
/opt/intel/oneapi/mpi/2021.5.1//lib/libmpifort.so.12(mpi_allreduce_f08ts_+0x208) [0x7f86f2491758]
./cav_2d.exe() [0x45a512]
./cav_2d.exe() [0x408612]
./cav_2d.exe() [0x404ae2]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7f86f08b90b3]
./cav_2d.exe() [0x4049ee]
Abort(1) on node 42: Internal error
/opt/intel/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12(MPL_backtrace_show+0x1c) [0x7f3e968dfbcc]
/opt/intel/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12(MPIR_Assert_fail+0x21) [0x7f3e962b9df1]
/opt/intel/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12(+0x2b1eb9) [0x7f3e95f88eb9]
/opt/intel/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12(+0x1a8c18) [0x7f3e95e7fc18]
/opt/intel/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12(+0x1717ec) [0x7f3e95e487ec]
/opt/intel/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12(+0x2b4387) [0x7f3e95f8b387]
/opt/intel/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12(PMPI_Allreduce+0x561) [0x7f3e95de66e1]
/opt/intel/oneapi/mpi/2021.5.1//lib/libmpifort.so.12(+0xdd95a) [0x7f3e975e995a]
/opt/intel/oneapi/mpi/2021.5.1//lib/libmpifort.so.12(mpi_allreduce_f08ts_+0x208) [0x7f3e97554758]
./cav_2d.exe() [0x45a512]
./cav_2d.exe() [0x408612]
./cav_2d.exe() [0x404ae2]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7f3e9597c0b3]
./cav_2d.exe() [0x4049ee]
Abort(1) on node 0: Internal error
/opt/intel/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12(MPIR_Assert_fail+0x21) [0x7f0ae6b8edf1]
/opt/intel/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12(+0x2b1eb9) [0x7f0ae685deb9]
/opt/intel/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12(+0x1a8c18) [0x7f0ae6754c18]
/opt/intel/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12(+0x1717ec) [0x7f0ae671d7ec]
/opt/intel/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12(+0x2b4387) [0x7f0ae6860387]
/opt/intel/oneapi/mpi/2021.5.1//lib/release/libmpi.so.12(PMPI_Allreduce+0x561) [0x7f0ae66bb6e1]
/opt/intel/oneapi/mpi/2021.5.1//lib/libmpifort.so.12(+0xdd95a) [0x7f0ae7ebe95a]
/opt/intel/oneapi/mpi/2021.5.1//lib/libmpifort.so.12(mpi_allreduce_f08ts_+0x208) [0x7f0ae7e29758]
./cav_2d.exe() [0x45a512]
./cav_2d.exe() [0x408612]
./cav_2d.exe() [0x404ae2]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7f0ae62510b3]
./cav_2d.exe() [0x4049ee]
Abort(1) on node 30: Internal error

 

Please note that the above error is encountered only occasionally, and particularly on hardware with newer 2nd-generation Xeon processors.

 

I understand that the above information may not be sufficient to know what might be wrong. Please let me know what additional information I must provide in order to help me debug this.

 

Thanks in advance.

HemanthCH_Intel
Moderator

Hi,

 

Thank you for posting in Intel Communities.

 

Could you please provide us with the below details?

  1. The operating system and its version.
  2. A sample reproducer code and steps to reproduce your issue.
  3. The debug log by using the below command:
I_MPI_DEBUG=30 FI_LOG_LEVEL=debug mpirun -n 4 -ppn 4 ./a.out

 

Thanks & Regards,

Hemanth.

 

ombrophile
Beginner

Hi Hemanth,

Thank you for reaching out. A point-by-point response to your queries is as follows:

 

1. I am encountering this issue on two different systems. One is a cluster on which I am trying to run the code on 240 cores. All nodes of the cluster have Intel(R) Xeon(R) Gold 6248 CPUs and run CentOS 7. I am also facing this issue while running the code on a 48-core server with Intel(R) Xeon(R) Gold 5220R CPUs; the OS on this server is Ubuntu 20.04.4. Both systems have the same version of oneAPI installed, i.e. ifort (IFORT) 2021.5.0 20211109 and Intel(R) MPI Library for Linux* OS, Version 2021.5 Build 20211102.

 

2. The code is very big, and it might be difficult for me to create a sample reproducer from it. Instead, I could share the entire code if you wish. Do let me know.

 

3. The output of the code, run on the 48-core server with the suggested flags, is attached to this reply. Please have a look.

 

Do let me know if further details are required.

 

 

Additional observation: The code apparently runs successfully on the 48-core server when the environment variable I_MPI_ADJUST_ALLREDUCE=4 is set while running the executable. However, this seems to have no effect when running on the 5-node, 240-core cluster.
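
The observation above could be explored a little more systematically by trying several values of `I_MPI_ADJUST_ALLREDUCE`. The following is a minimal sketch that only prints the commands rather than running them. The helper name `print_allreduce_trials` is hypothetical, and the rank count, the executable name, and the highest algorithm number to try are assumptions to adjust (the valid range depends on the Intel MPI version).

```shell
# Dry-run helper: print one mpirun command per Allreduce algorithm setting.
# Nothing is executed here; copy the lines you want to try into a shell.
# Assumption: algorithm numbers 1..max_alg are worth trying on your install.
print_allreduce_trials() {
    local nranks="$1"    # number of MPI ranks, e.g. 48
    local exe="$2"       # path to the executable, e.g. ./cav_2d.exe
    local max_alg="$3"   # highest algorithm number to try (assumed)
    local alg=1
    while [ "$alg" -le "$max_alg" ]; do
        echo "I_MPI_ADJUST_ALLREDUCE=${alg} mpirun -n ${nranks} ${exe}"
        alg=$((alg + 1))
    done
}
```

For example, `print_allreduce_trials 48 ./cav_2d.exe 4` prints four command lines, the last of which matches the setting reported above.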

HemanthCH_Intel
Moderator

Hi,

 

Could you please mention the message size used in your application?

 

Could you please share the APS report of your application? Please use the below commands to generate an APS report:

 

 

export MPS_STAT_LEVEL=5
mpirun -n 2 aps -c mpi ./myapp
aps-report <generated report file name> -m

 

 

For more information, refer to this link: https://www.intel.com/content/www/us/en/develop/documentation/application-snapshot-user-guide/top/an...

 

Thanks & Regards,

Hemanth

 

HemanthCH_Intel
Moderator

Hi,


We haven't heard back from you. Could you please provide any updates on your issue?


Thanks & Regards,

Hemanth


ombrophile
Beginner

Hi,

 

Apologies for the late reply. I followed the steps you suggested. However, as aps generates multiple dump files, I was unable to work out which file to provide in the last step. Please allow me some more time before I get back to you. Meanwhile, if you have any further suggestions on aps-report, do let me know.

 

Thanks.

 

HemanthCH_Intel
Moderator

Hi,

 

Please find the below commands and screenshot to find the message size for the Intel MPI benchmark:

1) export MPS_STAT_LEVEL=5

2) Run the MPI benchmark using the below command:

mpirun -n 2 aps -c mpi IMB-MPI1

The above command will generate an aps report with this message: "Intel(R) VTune(TM) Profiler 2022.1.0 collection completed successfully. Use the "aps --report /home/intel/hemanth/mpi_threads/aps_result_20220422" command to generate textual and HTML reports for the profiling session."

3) Use the below command to generate the message size summary:

aps-report /home/intel/hemanth/mpi_threads/aps_result_20220422 -m

You can find the message size in the below screenshot for the IMB-MPI1 benchmarking.

[Screenshot: aps-report message size summary for the IMB-MPI1 benchmark]
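
Since `aps` writes its output into a result directory (such as `aps_result_20220422` above), one way to avoid guessing the name is to pick the most recently modified directory that matches this pattern. This is a minimal sketch, assuming the `aps_result_*` naming seen in this thread; the helper name `latest_aps_result` is hypothetical.

```shell
# Print the most recently modified aps_result_* directory under a base
# directory (default: current directory). Prints nothing and returns 1
# if no such directory exists.
latest_aps_result() {
    local base="${1:-.}"
    local newest
    # ls -td lists the matching directories themselves, newest first.
    newest="$(ls -td "$base"/aps_result_*/ 2>/dev/null | head -n 1)"
    [ -n "$newest" ] || return 1
    # Strip the trailing slash that came from the glob pattern.
    printf '%s\n' "${newest%/}"
}

# Example use (assumes aps-report is on PATH):
#   aps-report "$(latest_aps_result)" -m
```

The result directory as a whole, not an individual dump file inside it, is what gets passed to `aps-report`.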

 

 

Thanks & Regards,

Hemanth

 

ombrophile
Beginner

Hi,

Thank you for the suggestion. After using the commands

 

export MPS_STAT_LEVEL=5
mpirun -n 2 aps -c mpi IMB-MPI1
aps-report aps_result_20220426 -m

 

I get the following output:

 

Loading 100.00%
| Message Sizes summary for all Ranks
|-----------------------------------------------------------------------------------------------------
| Message size(B)       Volume(MB)        Volume(%)        Transfers        Time(sec)          Time(%)
|-----------------------------------------------------------------------------------------------------
                0             0.00             0.00          1205566             1.07            12.88
          4194304          3208.64            12.02             1094             0.74             8.84
          2097152          3145.73            11.79             2144             0.73             8.78
          1048576          3114.27            11.67             4244             0.72             8.61
           524288          3098.54            11.61             8444             0.67             8.11
           131072          3086.75            11.57            33644             0.67             8.09
            65536          2892.10            10.84            63716             0.64             7.74
           262144          3090.68            11.58            16844             0.60             7.16
            32768          2065.37             7.74            92444             0.52             6.23
            16384          1032.68             3.87            92444             0.34             4.12
             8192           516.34             1.93            92444             0.23             2.80
             4096           258.17             0.97            92444             0.16             1.96
          8388608           922.75             3.46              132             0.15             1.77
             2048           129.09             0.48            92444             0.14             1.65
             1024            64.54             0.24            92444             0.12             1.40
              512            32.27             0.12            92444             0.11             1.28
              128             8.07             0.03            92444             0.09             1.09
              256            16.14             0.06            92444             0.09             1.09
               64             4.03             0.02            92444             0.09             1.05
                8             0.52             0.00            94982             0.09             1.04
                4             0.24             0.00            90780             0.08             1.00
               16             1.01             0.00            92444             0.08             0.98
               32             2.02             0.01            92444             0.08             0.98
                2             0.10             0.00            75636             0.07             0.82
                1             0.04             0.00            67232             0.04             0.52
|=====================================================================================================
| TOTAL                   26690.09           100.00          2773786             8.32           100.00
|

 

However, when I try to enable aps with my executable using the following commands

 

export MPS_STAT_LEVEL=5
mpirun -n 2 aps -c mpi ./cav_2d.exe

 

I get the following error:

 

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
cav_2d.exe         00000000005F31BA  Unknown               Unknown  Unknown
libpthread-2.31.s  00007FAF738E53C0  Unknown               Unknown  Unknown
libmpifort.so.12.  00007FAF75225D4B  mpi_comm_rank_        Unknown  Unknown
libmps.so          00007FAF7E30F904  mpi_comm_rank_        Unknown  Unknown
cav_2d.exe         0000000000406145  MAIN__                     68  cav_2d.f90
cav_2d.exe         0000000000405CA2  Unknown               Unknown  Unknown
libc-2.31.so       00007FAF735B40B3  __libc_start_main     Unknown  Unknown
cav_2d.exe         0000000000405BA9  Unknown               Unknown  Unknown
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
cav_2d.exe         00000000005F31BA  Unknown               Unknown  Unknown
libpthread-2.31.s  00007F25023083C0  Unknown               Unknown  Unknown
libmpifort.so.12.  00007F2503C48D4B  mpi_comm_rank_        Unknown  Unknown
libmps.so          00007F250CD32904  mpi_comm_rank_        Unknown  Unknown
cav_2d.exe         0000000000406145  MAIN__                     68  cav_2d.f90
cav_2d.exe         0000000000405CA2  Unknown               Unknown  Unknown
libc-2.31.so       00007F2501FD70B3  __libc_start_main     Unknown  Unknown
cav_2d.exe         0000000000405BA9  Unknown               Unknown  Unknown
aps Error: Cannot run: ./cav_2d.exe
aps Error: Cannot run: ./cav_2d.exe

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 0 PID 9219 RUNNING AT materials14
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

 

Line 68 in the source file (shown in the traceback) contains the following entry:

 

call MPI_Comm_rank( MPI_COMM_WORLD, myrank )

 

 

In addition, I get the same error when repeating the above step with an executable compiled from another source file. There too, the backtrace points to a line containing the same call as shown above.

 

Without the aps option in mpirun, the executable at least starts to run fine. Hence, I do not understand why this kind of an error should occur. Please let me know possible workarounds.

 

Thanks.

 

HemanthCH_Intel
Moderator

Hi,


>>"The code is very big, and it might be difficult for me create a sample reproducer code from that. Instead, I could share the entire code if you wish. Do let me know."

Could you please provide the sample reproducer code and steps that you have followed, so that we can reproduce the issue from our end?


Thanks & Regards,

Hemanth


ombrophile
Beginner

Let me try to write a short reproducible version of the program. I will let you know in a few days.

HemanthCH_Intel
Moderator

Hi,


We haven't heard back from you. Could you please provide any updates on your issue?


Thanks & Regards,

Hemanth


ombrophile
Beginner

Hi Hemanth,

 

Thank you for getting back. However, I am afraid I might need a few more days before I can get back to you. Please allow me some more time.

 

Thanks.

HemanthCH_Intel
Moderator

Hi,


>>" However, I am afraid I might need a few more days before I can get back to you. Please allow me some more time."

Sure, take your time. Can we go ahead and close this thread for the time being? You may open a new thread with the reproducer and refer to this one.


Thanks & Regards,

Hemanth


ombrophile
Beginner

Hi Hemanth,

 

Thank you for waiting. As per your suggestion, I have been trying to generate a minimum working example. However, the code still is relatively big. Would it be all right if I do some final checks on it and send it to you by tomorrow?

ombrophile
Beginner

Hi Hemanth,

 

Weirdly, the shorter version of the code that I created to upload here runs successfully and does not show the above-mentioned error. This suggests that the error may not be related to MPI. Should I still upload the code for you to check?

HemanthCH_Intel
Moderator

Hi,


Unfortunately, we can't help you without reproducing the error. Since the sample reproducer code does not reproduce your issue, there may not be a problem with the Intel MPI Library. So can we go ahead and close the thread?


Thanks & Regards,

Hemanth


ombrophile
Beginner

Hi Hemanth, 

 

Apologies for the delayed reply. I have been able to figure out the parameter in the code that triggers the error. This, in fact, also occurs in the sample reproducer code. Hence, I have attached the sample reproducer code with this reply for your reference.

 

Brief description of the files in the MWE and relevant parameter(s)

In it, you will find multiple Fortran files. The file named `cav_2d.f90` is the main source file, whereas the rest are modules containing the necessary subroutines. For example, the module `mod_mpi.f90` contains all the MPI-related subroutines. A `makefile` is also provided; it shows the compilation flags that I am using and can be used to compile the code. Compilation generates an executable called `cav_2d.exe`, which can be run directly. But since it generates too many dump files, it is recommended to move the executable to a temporary folder before running it.

If, after compilation, the executable is run with the default parameters, it terminates with the above-mentioned error after some time. However, if the code is recompiled by setting the value of the variable `TSMOOTH` to 0 (in line 45 of the main code), it can be observed that the code runs successfully.

The code contains two solver loops. The first loop (lines 178--202) is supposed to be a very short one and runs when the value of `TSMOOTH` is greater than zero. Technically, this should not cause such an error especially since the two loops are independent of each other. Still, I encounter such an error.

It must be mentioned that the code, in its present form, takes some time to run. For example, on a system with two 24-core Xeon Gold 5220R CPUs, it takes around 4 hours to run. Reducing the variable `TSTEPS` (and `SAVET` accordingly) could make the code run faster; however, it may then not show the above error. Another fact to note is that, for the given conditions, the executable cannot be run on more than 100 (i.e. 10x10) cores, as that will cause the results to be incorrect. Even though it will run with more than 100 cores, it will definitely abort if the core count is more than 225 (i.e. 15x15), as the domain decomposition will then fail. (The reason for mentioning this is that these errors, if they were to occur, are not relevant to the issue being discussed here. Hence, it might be apt to keep the core count below 100.)
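
The core-count limits described above can be checked before launching a run. This is a minimal sketch in which the helper name `classify_core_count` and the three labels it prints are hypothetical; the default thresholds of 100 and 225 are taken from the paragraph above and apply only to the posted parameters.

```shell
# Classify a core count against the limits quoted in this post.
# Defaults (100 and 225) come from the text and can be overridden.
classify_core_count() {
    local n="$1"
    local max_correct="${2:-100}"    # 10x10: above this, results are incorrect
    local max_runnable="${3:-225}"   # 15x15: above this, decomposition fails
    if [ "$n" -le "$max_correct" ]; then
        echo "ok"
    elif [ "$n" -le "$max_runnable" ]; then
        echo "runs-but-incorrect"
    else
        echo "decomposition-fails"
    fi
}
```

With the defaults, `classify_core_count 48` prints `ok`, while `classify_core_count 240` prints `decomposition-fails`.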

 

Summary

1. `cav_2d.f90` is the main source file.

2. Compile the code using `make`. This will generate `cav_2d.exe` which can be directly run.

3. Please run the executable in a separate temporary folder as it generates too many dump files.

4. Code runs fine when the value of `TSMOOTH` is 0.

5. Code aborts with the above error otherwise. (Value of `TSMOOTH` must be a multiple of 100.)

 

Please let me know if you need any further information.

 

Thanks.

ombrophile
Beginner

In addition to the above, I observed that the outcome of the executable depends on the number of processors being used. A summary of this for a few trials is listed in the table below.

 

Number of cores    Outcome
20                 Successful
36                 Failed
48                 Failed

 

The above error occurs only when both the solver loops in the code are allowed to run. It is not clear why this should occur. Any help in resolving it will be appreciated. Please let me know if further clarification is required.

 

Thanks.

ombrophile
Beginner

Hi Hemanth,

 

Please ignore the above two posts and consider the code attached with this post instead.

 

I have been able to figure out the conditions under which the code throws the mentioned error. This, in fact, also occurs with the sample reproducer code attached with this post.

 

Upon running the compiled executable, it can be observed that the code runs fine when using 20 cores or fewer. In contrast, it throws the above error when trying to run on 36 or 48 cores. (I have not tried to run it with other core combinations.) Moreover, contradicting my earlier statement, I can now confirm that this error is independent of the type of processor. Based on the provided parameters in the code, this should not be the case. Still, I encounter such an error. Any help in resolving this will be deeply appreciated. Please let me know if further clarification is required.

 

Thanks.

 

 

Brief description of the files in the MWE and relevant parameter(s)

Upon extracting the compressed file attached to this post, you will find multiple Fortran files. The file named `cav_2d.f90` is the main source file, whereas the rest are modules containing the necessary subroutines. For example, the module `mod_mpi.f90` contains all the MPI-related subroutines. A `makefile` is also provided; it shows the compilation flags that I am using and can be used to compile the code. Compilation generates an executable called `cav_2d.exe`, which can then be run.

Please note that the code takes some time to run either successfully or before throwing the error. For example, it takes around 3.5 hours while running on 36 cores before throwing the error.
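
The core-count comparison described in this post could be scripted roughly as follows. This is a minimal sketch: the helper name `sweep_core_counts` and the `run_<n>.log` file names are hypothetical, and it assumes the launcher returns a non-zero exit status when a rank aborts. Given the run times quoted above, a full sweep may take many hours.

```shell
# Run a command once per core count and report whether each run succeeded.
# Usage: sweep_core_counts "20 36 48" mpirun ./cav_2d.exe
# The launcher is invoked as: <launcher> -n <count> <remaining arguments>
sweep_core_counts() {
    local counts="$1"; shift
    local launcher="$1"; shift
    local n
    # $counts is intentionally unquoted so it splits into separate numbers.
    for n in $counts; do
        # Keep each run's output in its own log file for later inspection.
        if "$launcher" -n "$n" "$@" >"run_${n}.log" 2>&1; then
            echo "$n Successful"
        else
            echo "$n Failed"
        fi
    done
}
```

Running it from a separate scratch directory keeps the log files and the program's own output files away from the sources.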

 

HemanthCH_Intel
Moderator

Hi,

 

We have reported your issue to the relevant development team. They are looking into it.

 

Thanks & Regards,

Hemanth

 

ombrophile
Beginner

I see. Thank you for letting me know.
