Intel® Fortran Compiler

Heap memory corruption for OneAPI?

Carlygao

Hi,

 

I just realized I can't post questions in my previous thread. My code package is huge and is a mix of C++ and Fortran. Recently we got stuck on a floating-point error: the code stops at lines that definitely contain no errors. I enabled heap checking, but that crashed Visual Studio, so I used Valgrind to look for memory issues instead. The summary from Valgrind is below. Does this mean oneAPI has a memory leak? I don't know how to interpret the results, but it seems that none of the leaks are related to my code. Thanks.

 

Unhandled exception at 0x00007FF7FAA1E907 in bsam20_2022_10_21_7f4fea3.exe: 0xC0000090: Floating-point invalid operation (parameters: 0x0000000000000000, 0x0000000000009961).

 

Exception thrown at 0x00007FF7FAA1E907 in bsam20_2022_10_21_7f4fea3.exe: 0xC0000090: Floating-point invalid operation (parameters: 0x0000000000000000, 0x0000000000009961).

 

From Valgrind:

 

==22803== HEAP SUMMARY:
==22803== in use at exit: 5,249 bytes in 20 blocks
==22803== total heap usage: 9,716 allocs, 9,696 frees, 14,371,548 bytes allocated
==22803==
==22803== Searching for pointers to 20 not-freed blocks
==22803== Checked 2,357,980,840 bytes
==22803==
==22803== 8 bytes in 1 blocks are still reachable in loss record 1 of 18
==22803== at 0x5487738: operator new(unsigned long) (vg_replace_malloc.c:417)
==22803== by 0xBBF83F: CryptoPP::NewObject<CryptoPP::OAEP<CryptoPP::SHA1, CryptoPP::P1363_MGF1> >::operator()() const (misc.h:258)
==22803== by 0xBBF954: CryptoPP::Singleton<CryptoPP::OAEP<CryptoPP::SHA1, CryptoPP::P1363_MGF1>, CryptoPP::NewObject<CryptoPP::OAEP<CryptoPP::SHA1, CryptoPP::P1363_MGF1> >, 0>::Ref() const (misc.h:346)
==22803== by 0xBBF0AC: CryptoPP::TF_ObjectImplBase<CryptoPP::TF_DecryptorBase, CryptoPP::TF_CryptoSchemeOptions<CryptoPP::TF_ES<CryptoPP::RSA, CryptoPP::OAEP<CryptoPP::SHA1, CryptoPP::P1363_MGF1>, int>, CryptoPP::RSA, CryptoPP::OAEP<CryptoPP::SHA1, CryptoPP::P1363_MGF1> >, CryptoPP::InvertibleRSAFunction>::GetMessageEncodingInterface() const (pubkey.h:594)
==22803== by 0xBBC103: CryptoPP::TF_CryptoSystemBase<CryptoPP::PK_Decryptor, CryptoPP::TF_Base<CryptoPP::TrapdoorFunctionInverse, CryptoPP::PK_EncryptionMessageEncodingMethod> >::FixedMaxPlaintextLength() const (pubkey.h:273)
==22803== by 0xBB61CF: license::decrypt(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (Encrypt.cpp:83)
==22803== by 0xBAEB6D: license::is_valid(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, int const&, double const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) (License.cpp:140)
==22803== by 0x8ACA0B: license::check_license(double) (License.cpp:120)
==22803== by 0x891712: SHE_license_check_license (wrapsheff_license.cpp:17)
==22803== by 0x6CCFF6: varnam_ (varnam.f90:108)
==22803== by 0x410079: MAIN__ (mainf1.f:47)
==22803== by 0x410011: main (in /home/ga/gaoz/bin/bsam20_2022_10)
==22803==
==22803== 13 bytes in 1 blocks are definitely lost in loss record 2 of 18
==22803== at 0x5487017: malloc (vg_replace_malloc.c:380)
==22803== by 0x116950C9: strdup (in /usr/lib64/libc-2.17.so)
==22803== by 0x12DD9197: ???
==22803== by 0x486B8F2: _dl_init (in /usr/lib64/ld-2.17.so)
==22803== by 0x48704CD: dl_open_worker (in /usr/lib64/ld-2.17.so)
==22803== by 0x486B703: _dl_catch_error (in /usr/lib64/ld-2.17.so)
==22803== by 0x486FABA: _dl_open (in /usr/lib64/ld-2.17.so)
==22803== by 0xEF31EEA: dlopen_doit (in /usr/lib64/libdl-2.17.so)
==22803== by 0x486B703: _dl_catch_error (in /usr/lib64/ld-2.17.so)
==22803== by 0xEF324EC: _dlerror_run (in /usr/lib64/libdl-2.17.so)
==22803== by 0xEF31F80: dlopen@@GLIBC_2.2.5 (in /usr/lib64/libdl-2.17.so)
==22803== by 0x11E6E051: ofi_reg_dl_prov (in /home/ba/ballardmk/bin/intel/oneapi/mpi/2021.1.1/libfabric/lib/libfabric.so.1)

 

==22803== 31 bytes in 1 blocks are definitely lost in loss record 5 of 18
==22803== at 0x5487017: malloc (vg_replace_malloc.c:380)
==22803== by 0x12DD96F5: ???
==22803== by 0x486B8F2: _dl_init (in /usr/lib64/ld-2.17.so)
==22803== by 0x48704CD: dl_open_worker (in /usr/lib64/ld-2.17.so)
==22803== by 0x486B703: _dl_catch_error (in /usr/lib64/ld-2.17.so)
==22803== by 0x486FABA: _dl_open (in /usr/lib64/ld-2.17.so)
==22803== by 0xEF31EEA: dlopen_doit (in /usr/lib64/libdl-2.17.so)
==22803== by 0x486B703: _dl_catch_error (in /usr/lib64/ld-2.17.so)
==22803== by 0xEF324EC: _dlerror_run (in /usr/lib64/libdl-2.17.so)
==22803== by 0xEF31F80: dlopen@@GLIBC_2.2.5 (in /usr/lib64/libdl-2.17.so)
==22803== by 0x11E6E051: ofi_reg_dl_prov (in /home/ba/ballardmk/bin/intel/oneapi/mpi/2021.1.1/libfabric/lib/libfabric.so.1)
==22803== by 0x11E6E6B8: fi_ini (in /home/ba/ballardmk/bin/intel/oneapi/mpi/2021.1.1/libfabric/lib/libfabric.so.1)
==22803==

 

==22803==
==22803== 32 bytes in 1 blocks are definitely lost in loss record 8 of 18
==22803== at 0x548B778: calloc (vg_replace_malloc.c:1117)
==22803== by 0x12B7905A: ???
==22803== by 0x12B6017A: ???
==22803== by 0x11E6E06F: ofi_reg_dl_prov (in /home/ba/ballardmk/bin/intel/oneapi/mpi/2021.1.1/libfabric/lib/libfabric.so.1)
==22803== by 0x11E6E6B8: fi_ini (in /home/ba/ballardmk/bin/intel/oneapi/mpi/2021.1.1/libfabric/lib/libfabric.so.1)
==22803== by 0x11E6F460: fi_getinfo@@FABRIC_1.3 (in /home/ba/ballardmk/bin/intel/oneapi/mpi/2021.1.1/libfabric/lib/libfabric.so.1)
==22803== by 0x11E73E65: fi_getinfo@FABRIC_1.1 (in /home/ba/ballardmk/bin/intel/oneapi/mpi/2021.1.1/libfabric/lib/libfabric.so.1)
==22803== by 0x106E1BB2: MPIDI_OFI_mpi_init_hook (ofi_init.c:1167)
==22803== by 0x1022FCE7: MPID_Init (ch4_init.c:1138)
==22803== by 0x104C624F: MPIR_Init_thread (initthread.c:137)
==22803== by 0x104C624F: PMPI_Init_thread (initthread.c:269)
==22803== by 0xFDC0F4B: MPI_INIT_THREAD (initthreadf.c:270)
==22803== by 0x6391A8: mpi_util_mp_mpi_util_start_ (mpi_util.f90:52)
==22803==

 

==22803== LEAK SUMMARY:
==22803== definitely lost: 76 bytes in 3 blocks
==22803== indirectly lost: 0 bytes in 0 blocks
==22803== possibly lost: 0 bytes in 0 blocks
==22803== still reachable: 5,173 bytes in 17 blocks
==22803== suppressed: 0 bytes in 0 blocks

jimdempseyatthecove

The "usual suspect" is not necessarily a memory leak but, more often, memory corruption. In other words, errant code is "generating invalid addresses" .OR. ".not. generating correct addresses" for use in a variable reference. When that is the case, Valgrind, like any other heap check, will not detect the error until well after the problem (the errant memory write) has occurred.
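To make the "detected well after the write" point concrete, here is a minimal C++ sketch. `GuardedBlock` is a hypothetical toy, not anything from oneAPI or Valgrind: it keeps guard bytes after the payload and inspects them only when asked, which is roughly how guard-based heap checks behave. The errant write stays inside the backing vector, so the demo itself has defined behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative constants (assumptions, not taken from any real heap checker).
constexpr std::size_t kGuardSize = 8;
constexpr std::uint8_t kGuardByte = 0xAB;

// Toy "heap block": payload bytes followed by guard bytes.
class GuardedBlock {
public:
    explicit GuardedBlock(std::size_t payload_size)
        : payload_size_(payload_size),
          storage_(payload_size + kGuardSize, 0) {
        for (std::size_t i = 0; i < kGuardSize; ++i) {
            storage_[payload_size_ + i] = kGuardByte;
        }
    }

    // Deliberately NOT checked against payload_size_, to mimic an errant
    // address. at() still keeps the write inside storage_.
    void raw_write(std::size_t index, std::uint8_t value) {
        storage_.at(index) = value;
    }

    // The corruption becomes visible only here, when the guard is inspected.
    bool guard_intact() const {
        for (std::size_t i = 0; i < kGuardSize; ++i) {
            if (storage_[payload_size_ + i] != kGuardByte) return false;
        }
        return true;
    }

private:
    std::size_t payload_size_;
    std::vector<std::uint8_t> storage_;
};

// Off-by-one fill: writes 17 bytes into a 16-byte payload. The loop runs to
// completion without complaint; only the later guard check reveals the damage.
bool off_by_one_fill_corrupts_guard() {
    GuardedBlock block(16);
    for (std::size_t i = 0; i <= 16; ++i) {  // bug: should be i < 16
        block.raw_write(i, 0);
    }
    return !block.guard_intact();
}
```

The faulty loop and the place where the damage is noticed can be arbitrarily far apart, which is why the crash site often "definitely has no errors".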

 

The first step is to build in Debug mode with full compile-time diagnostics .AND. full runtime checks enabled. Fix all compile-time warnings and errors, then run with the full runtime checks in place.
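On the Fortran side, runtime checks are normally switched on with compiler options (for Intel Fortran, typically `-check all -traceback -fpe0` on Linux, or `/check:all /traceback /fpe:0` on Windows). For the C++ half of a mixed code base, a comparable hand-written check for the reported "floating-point invalid operation" can be sketched with the standard `<cfenv>` status flags. The helper name `raises_invalid` is my own, and the sketch assumes the code is not built with aggressive floating-point options that ignore the flags.

```cpp
#include <cassert>
#include <cfenv>

// Runs f() and reports whether it raised the IEEE "invalid operation" flag
// (for example 0.0/0.0 or sqrt of a negative number).
template <typename F>
bool raises_invalid(F&& f) {
    std::feclearexcept(FE_ALL_EXCEPT);   // start from a clean flag state
    volatile double result = f();        // volatile keeps the evaluation alive
    (void)result;
    return std::fetestexcept(FE_INVALID) != 0;
}
```

Wrapping successive stages of a computation in such a check narrows down where the first invalid value is produced, which may be well before the line where the exception is finally reported. The operands should come from runtime values (or `volatile` variables in a small test), since a compiler may fold constant expressions at compile time without touching the flags.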

Note that it is not unusual for coding errors (either yours or compiler bugs) to stop misbehaving while you are looking for them. These are the "Heisenbugs" (after the Heisenberg uncertainty principle). In this case, you may have to do a binary search, so to speak: partition and then sub-partition your code into "optimized" and "unoptimized" parts, permuted with and without runtime checks.
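The binary search over source files can be sketched as follows. This is only an outline under two assumptions: a single file is responsible, and the failure reproduces reliably for a given build. The predicate `fails` is hypothetical; in practice it would rebuild with only the listed files optimized (everything else unoptimized), run the test case, and return true if the failure appears.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical predicate: true if the failure reproduces when exactly these
// files are compiled with optimization and all others without.
using FailsWhenOptimized =
    std::function<bool(const std::vector<std::string>&)>;

// Halve the candidate set until one file remains.
std::string bisect_culprit(std::vector<std::string> files,
                           const FailsWhenOptimized& fails) {
    while (files.size() > 1) {
        const auto half = static_cast<std::ptrdiff_t>(files.size() / 2);
        std::vector<std::string> first(files.begin(), files.begin() + half);
        std::vector<std::string> second(files.begin() + half, files.end());
        // If optimizing the first half reproduces the failure, the culprit is
        // there; otherwise (single-culprit assumption) it is in the second.
        files = fails(first) ? first : second;
    }
    return files.empty() ? std::string() : files.front();
}
```

With N source files this needs about log2(N) rebuilds. The same loop can be repeated with "runtime checks on/off" as the property being toggled instead of the optimization level.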

 

In the event of your coding error... well you fix it. And move on.

In the event of a compiler bug, report it (with a simplified reproducer if possible). You may be able to bypass the bug by compiling the problematic source file(s) at a lower optimization level.

An additional note: because there are two compilers (ifort and ifx), the error may present itself in only one of them. In that case, you can compile the problematic file(s) with the other compiler. Barring that, you may need to rearrange your code to remove the sensitivity, assuming you can isolate the code.

On occasion (a few years back), I experienced errors in a few loops that were correctable by excising the loop and placing it into a CONTAINS procedure. This required very little work once the problem section of code was identified.

 

Good luck hunting.

 

Jim Dempsey
