We don't have a short code sample that reproduces the issue. However, after we upgraded the CPU type from AVX to AVX2, a segfault was reported in around 100 RTS.
Valgrind trace:
==230116== Invalid read of size 4
==230116== at 0x968BE69: l9_commit (in /nfs/orto/proj/tapeout/cit_dev104/mkgumbel/avx2/IT/redist/gcc-release_sles11-align256/bin/sles11/mazama)
==230116== by 0xBDE5D4B: static_dfti_commit (in /nfs/orto/proj/tapeout/cit_dev104/mkgumbel/avx2/IT/redist/gcc-release_sles11-align256/bin/sles11/mazama)
==230116== by 0xA646551: commit (in /nfs/orto/proj/tapeout/cit_dev104/mkgumbel/avx2/IT/redist/gcc-release_sles11-align256/bin/sles11/mazama)
==230116== by 0xBDE5D4B: static_dfti_commit (in /nfs/orto/proj/tapeout/cit_dev104/mkgumbel/avx2/IT/redist/gcc-release_sles11-align256/bin/sles11/mazama)
==230116== by 0xA3B1DCF: DftiCommitDescriptor (in /nfs/orto/proj/tapeout/cit_dev104/mkgumbel/avx2/IT/redist/gcc-release_sles11-align256/bin/sles11/mazama)
==230116== by 0x4F4D575: FFTMatrix::GetPlan(FFTDirection) (FFTMatrix.cpp:191)
==230116== by 0x4F4DB74: FFTMatrix::Transform(FFTDirection) (FFTMatrix.cpp:328)
==230116== by 0x4C5234A: IplDensityConv::AccumulateDensity_FFT(int, int, int, int, int, int, float*, float, float) (IplDensityConv.cpp:567)
==230116== by 0x4D5088D: IplModelAPI::RenderDensityImage(int, int, int, int, int, int, float*, float*) (densityImage.cpp:176)
==230116== by 0x48018F5: ModelBaseImpl::CalculateRenderedDensityImage(int) (ModelBase_RenderedImpl.cpp:1498)
==230116== by 0x479E819: ModelBaseImpl::RenderDensityImage(int) (ModelBaseImpl.cpp:2192)
==230116== by 0x479EA95: ModelBaseImpl::ComputeDensityImage(int, int, int) (ModelBaseImpl.cpp:2233)
==230116== Address 0x18 is not stack'd, malloc'd or (recently) free'd
==230116==
Error: caught signal(11)
MKL versions tested:
- 2020.1.217 (2020.0.20200208)
- 2021.1.1 (2021.0.20201104)
- 2021.2.0 (2021.0.20210312)
CPU model: Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz
Compiler version: gcc 10.2
Please check the MKL 2021 system requirements at the following link: https://software.intel.com/content/www/us/en/develop/articles/oneapi-math-kernel-library-system-requirements.html
- SUSE Linux Enterprise Server* (SLES) 12, 15
We reran on an SLES12 machine, and the same segfault was flagged.
I can provide a bit more information:
1. During program initialization, we set the ISA as such:
::mkl_cbwr_set(MKL_CBWR_AVX2);
::mkl_enable_instructions(MKL_ENABLE_AVX2);
If we change those to the _AVX variants (instead of _AVX2), both the crash and the invalid read go away.
2. The issue seems to be triggered by small changes in program memory layout. For example, changing small unrelated things in the program and recompiling can make the issue go away. This makes it very difficult for us to extract a small test case: when we extract just the MKL calls and input data, the issue does not occur.
3. Our initial suspicion was an alignment issue, but the crash still occurs even with a very large 256-byte alignment of the arrays.
We fixed a similar, though not identical, FFT issue in the latest version, MKL 2021.3, so I recommend trying your case with that version and letting us know the result. MKL 2021.3 should already be released and available for download.
Tested on MKL 2021.3.0, and the same segfault was flagged.
Error: caught signal(11)
Run on SLES12:
MKL: 2021.0.20210617 (Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors)
In that case we need a reproducer that we can use to investigate the cause of the issue in depth.
Gennady,
We have tried our best but are unable to extract a small reproducer that could be shared here. However, we have analyzed the assembly and believe that the problem originates in the MKL library:
- At the bottom of the valgrind output, we see the real problem is an invalid address: Address 0x18 is not stack'd, malloc'd or (recently) free'd
- Using gdb to analyze:
- gdb shows a segfault in the MKL l9_commit function at instruction address 0x000000000962d1a9
- use "disassemble 0x000000000962d1a9" to get the assembly code where the bug happened
- the instruction where the segfault occurs is "cmpl $0x1,0x18(%rax)" --> "compare 0x1 with the 4-byte value at the address 0x18 bytes past the address in the RAX register"
- a few instructions earlier we see RAX set as such: "mov 0x20(%r13),%rax" --> "load the value stored 0x20 bytes past the address in the R13 register into RAX"
- use "info registers" to get register contents
- RAX is 0x0, which explains why we tried to access address 0x18
- R13 is 0x185d1c40 (for my run, yours may differ)
- 0x20 bytes past 0x185d1c40 is 0x185d1c60
- Where does the value at 0x185d1c60 get set? We can use watchpoints.
- set location watchpoint with "watch -l *0x185d1c60"
- also set read watchpoint with "rwatch -l *0x185d1c60"
- use "info watch" to get watchpoint numbers and disable them --> use "disable n"
- create a conditional breakpoint to stop in FFTMatrix::GetPlan when we hit this FFTMatrix again
- start run from beginning with just "run" (it will reuse args)
- Once you hit the breakpoint, re-enable the watchpoints and disable the conditional breakpoint
- continue again, you'll break for a read watchpoint in "mkl_serv_calloc"
- The heap buffer containing this address is allocated by this mkl_serv_calloc call, which happens after we have finished initializing our DFTI plan and asked MKL to do its thing. --> The important point is that this buffer is allocated by MKL; we have no control over it and provide no input to it. It is set to zero at this point.
- continue again; our read watchpoint is triggered once more, this time in "mkl_dft_avx2_dfti_create_node". We don't need to know exactly why; the point is that it reads the 0x0 still stored at that address.
- A few more continues follow, and every watchpoint hit is a read. Eventually you'll land on a read in l9_commit, and right after that you'll get the segfault.
- Once again, if you inspect that address at the point of failure, you will find that its contents are 0x0
Hi,
Thank you for your patience.
The issue you raised has been fixed in version 2021.4. Please download it and let us know whether it resolves your issue.
Regards,
Vidya.
Hi,
We are closing this thread assuming that your issue has been resolved. Please post a new question if you need any additional information from Intel as this thread will no longer be monitored.
Regards,
Vidya.
