Segfault when upgrading from AVX to AVX2

Shy · ‎06-28-2021

We don't have a short reproducer for this issue. When we upgraded the CPU type from AVX to AVX2, a segfault was reported in around 100 RTS.

Valgrind trace:

==230116== Invalid read of size 4

==230116== at 0x968BE69: l9_commit (in /nfs/orto/proj/tapeout/cit_dev104/mkgumbel/avx2/IT/redist/gcc-release_sles11-align256/bin/sles11/mazama)

==230116== by 0xBDE5D4B: static_dfti_commit (in /nfs/orto/proj/tapeout/cit_dev104/mkgumbel/avx2/IT/redist/gcc-release_sles11-align256/bin/sles11/mazama)

==230116== by 0xA646551: commit (in /nfs/orto/proj/tapeout/cit_dev104/mkgumbel/avx2/IT/redist/gcc-release_sles11-align256/bin/sles11/mazama)

==230116== by 0xBDE5D4B: static_dfti_commit (in /nfs/orto/proj/tapeout/cit_dev104/mkgumbel/avx2/IT/redist/gcc-release_sles11-align256/bin/sles11/mazama)

==230116== by 0xA3B1DCF: DftiCommitDescriptor (in /nfs/orto/proj/tapeout/cit_dev104/mkgumbel/avx2/IT/redist/gcc-release_sles11-align256/bin/sles11/mazama)

==230116== by 0x4F4D575: FFTMatrix::GetPlan(FFTDirection) (FFTMatrix.cpp:191)

==230116== by 0x4F4DB74: FFTMatrix::Transform(FFTDirection) (FFTMatrix.cpp:328)

==230116== by 0x4C5234A: IplDensityConv::AccumulateDensity_FFT(int, int, int, int, int, int, float*, float, float) (IplDensityConv.cpp:567)

==230116== by 0x4D5088D: IplModelAPI::RenderDensityImage(int, int, int, int, int, int, float*, float*) (densityImage.cpp:176)

==230116== by 0x48018F5: ModelBaseImpl::CalculateRenderedDensityImage(int) (ModelBase_RenderedImpl.cpp:1498)

==230116== by 0x479E819: ModelBaseImpl::RenderDensityImage(int) (ModelBaseImpl.cpp:2192)

==230116== by 0x479EA95: ModelBaseImpl::ComputeDensityImage(int, int, int) (ModelBaseImpl.cpp:2233)

==230116== Address 0x18 is not stack'd, malloc'd or (recently) free'd

==230116==

Error: caught signal(11)

MKL versions tested:

- 2020.1.217 (2020.0.20200208)

- 2021.1.1 (2021.0.20201104)

- 2021.2.0 (2021.0.20210312)

CPU model: Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz

Compiler version: gcc 10.2

Gennady_F_Intel · ‎06-28-2021

Please check the MKL 2021 system requirements at the following link: https://software.intel.com/content/www/us/en/develop/articles/oneapi-math-kernel-library-system-requirements.html

SUSE Linux Enterprise Server* (SLES) 12, 15

Shy · ‎06-29-2021

We reran on a SLES12 machine; the same segfault was flagged.

Matthew_G_Intel1 · ‎06-29-2021

I can provide a bit more information:

1. During program initialization, we set the ISA as such:

::mkl_cbwr_set(MKL_CBWR_AVX2);
::mkl_enable_instructions(MKL_ENABLE_AVX2);

If we change those to _AVX (instead of _AVX2), then the crash and the invalid read go away.

2. This issue seems to be triggered by small changes in program memory layout. For example, changing small unrelated things in the program and recompiling can make the issue go away. As such, it is very difficult for us to extract a small test case. When we extract just the MKL calls and input data, the issue does not occur.

3. Our initial suspicion was an alignment issue, but even with a very large 256-byte alignment of the arrays the crash still occurs.

Gennady_F_Intel · ‎06-30-2021

We fixed a similar, though not identical, FFT issue in the latest version, MKL 2021.3, so I would recommend trying your case with that version and letting us know the result. MKL 2021.3 should already be released and available for download.

Shy · ‎06-30-2021

Tested on MKL 2021.3.0; the same segfault was flagged.

Error: caught signal(11)

Run on SLES12:

MKL: 2021.0.20210617 (Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors)

Gennady_F_Intel · ‎06-30-2021

Then we need a reproducer that we can use to dig into the cause of the issue.

Matthew_G_Intel1 · ‎07-01-2021

Gennady,

We have tried our best, but are unable to extract a small reproducer that could be shared here. However, we have analyzed the assembly, and believe that the problem originates in the MKL library:

At the bottom of the valgrind output, we see the real problem is an invalid address: Address 0x18 is not stack'd, malloc'd or (recently) free'd
Using gdb to analyze:
1. gdb shows a segfault in the MKL l9_commit function at instruction address 0x000000000962d1a9
2. use "disassemble 0x000000000962d1a9" to get assembly code where bug happened
3. the instruction where the segfault occurs is "cmpl $0x1,0x18(%rax)" --> "compare 0x1 to the contents of the address 0x18 bytes past address in RAX register"
4. a few instructions earlier we see RAX set as such "mov 0x20(%r13),%rax" --> "load the value stored 0x20 bytes past the address in the R13 register into RAX"
5. use "info registers" to get register contents
  1. RAX is 0x0, that explains how we tried to access address 0x18
  2. R13 is 0x185d1c40 (for my run, yours may differ)
  3. 0x20 bytes past 0x185d1c40 is 0x185d1c60
Where is the value at "0x185d1c60" being set? We can use watchpoints to find out.
1. set location watchpoint with "watch -l *0x185d1c60"
2. also set read watchpoint with "rwatch -l *0x185d1c60"
3. use "info watch" to get watchpoint numbers and disable them --> use "disable n"
4. create a conditional breakpoint to stop in FFTMatrix::GetPlan when we hit this FFTMatrix again
5. start run from beginning with just "run" (it will reuse args)
6. Once you hit the breakpoint re-enable the watchpoints and disable the conditional breakpoint
7. continue again, you'll break for a read watchpoint in "mkl_serv_calloc"
8. The heap buffer containing this address is allocated by this mkl_serv_calloc call, which happens after we have finished initializing our DFTI plan and handed control to MKL. --> The important point is that this buffer is allocated by MKL; we have no control over it and provide no input to it. It is set to zero at this point.
9. continue again; the read watchpoint triggers once more, this time in "mkl_dft_avx2_dfti_create_node". We don't need to know exactly why; the point is that it reads the 0x0 still stored at that address.
10. continue a few more times; every watchpoint hit is a read. Eventually you land on a read in l9_commit, and the segfault follows right after.
11. If you check again at the point of failure, you will find that the contents of that address are still 0x0.

VidyalathaB_Intel · ‎10-07-2021

Hi,

Thank you for your patience.

The issue you raised has been fixed in version 2021.4. Please download it and let us know whether this resolves your issue.

Regards,

Vidya.

VidyalathaB_Intel · ‎10-13-2021

Hi,

We are closing this thread assuming that your issue has been resolved. Please post a new question if you need any additional information from Intel as this thread will no longer be monitored.

Regards,

Vidya.