Intel® oneAPI Math Kernel Library

zgetrf error handling regression

Paul_C_2
Beginner

Hello,

We are upgrading MKL from 11.2.2.1 to 2017 Update 2 and noticed a change in how zgetrf handles a singular matrix. In 11.2.2.1, zgetrf returned a positive number (via info) indicating the problem pivot. In 2017.2 it instead raises a floating-point division-by-zero exception, so we no longer learn which pivot is the problem. Was this an intentional change? If so, how do we find the problem pivot number?
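For reference, the standard LAPACK convention for the info value returned by zgetrf can be summarized in a few lines. This is only an illustrative sketch: the helper name `describe_getrf_info` is made up for this note and is not part of MKL or LAPACK.

```python
def describe_getrf_info(info):
    """Interpret the `info` result of a LAPACK *getrf call (hypothetical helper)."""
    if info == 0:
        return "factorization completed, no zero pivot"
    if info < 0:
        # LAPACK reports a bad input argument as a negative index.
        return f"argument {-info} had an illegal value"
    # Positive info: U(info, info) is exactly zero (1-based index). The
    # factorization still completes, but U is singular.
    return f"U({info},{info}) is exactly zero"


for value in (0, -4, 2):
    print(value, "->", describe_getrf_info(value))
```

With this convention, a positive info is the "problem pivot number" asked about above.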

A simple test case is a 4x4 complex matrix represented by:

[0]     {d_re=0.00000000000000000 d_im=1000000.0000000000 }    complex
[1]     {d_re=0.00000000000000000 d_im=1000000.0000000000 }    complex
[2]     {d_re=0.00000000000000000 d_im=0.00000000000000000 }    complex
[3]     {d_re=0.00000000000000000 d_im=0.00000000000000000 }    complex
[4]     {d_re=0.00000000000000000 d_im=1000000.0000000000 }    complex
[5]     {d_re=0.00000000000000000 d_im=1000000.0000000000 }    complex
[6]     {d_re=0.00000000000000000 d_im=0.00000000000000000 }    complex
[7]     {d_re=0.00000000000000000 d_im=0.00000000000000000 }    complex
[8]     {d_re=0.00000000000000000 d_im=0.00000000000000000 }    complex
[9]     {d_re=0.00000000000000000 d_im=0.00000000000000000 }    complex
[10]    {d_re=1000000.0005243536 d_im=-16.191775445209515 }    complex
[11]    {d_re=-1000000.0000000000 d_im=0.00000000000000000 }    complex
[12]    {d_re=0.00000000000000000 d_im=0.00000000000000000 }    complex
[13]    {d_re=0.00000000000000000 d_im=0.00000000000000000 }    complex
[14]    {d_re=-1000000.0000000000 d_im=0.00000000000000000 }    complex
[15]    {d_re=1000000.0005243536 d_im=-16.191775445209515 }    complex

Exception info:

First-chance exception at 0x00007FFBFAFEC926 (mkl_avx2.dll) in blah.exe: 0xC000008E: Floating-point division by zero (parameters: 0x0000000000000000).

In our exe, we translate select structured exceptions like this one to C++ exceptions so we can deal with computation errors at a higher level.

 

Thanks,

Paul

 

Gennady_F_Intel
Moderator

Hello Paul, at first glance this might be caused by the inexactness of floating-point arithmetic and the FMA instruction set. We need to check this more carefully.

Gennady_F_Intel
Moderator

Paul, I see no exceptions from the LU routine with the data you gave. The example code is attached. Here is the output I see on my side with the latest MKL 2017 u2:

..\mkl_Forums\u731589>2017.exe

 ZGESVD Example Program Results
Major version:           2017
Minor version:           0
Update version:          2
Product status:          Product
Build:                   20170126
Platform:                Intel(R) 64 architecture
Processor optimization:  Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors
================================================================

 info, zgetrf = 2
 ipiv:
 [0] = 1
, [1] = 2
, [2] = 3
, [3] = 4

Paul_C_2
Beginner

We finally have a test case. I have a zip file of a VS solution with three projects: a command-line exe that loads a Win32 DLL, which in turn depends on a FORTRAN project. It seems that loading the FORTRAN runtime libs triggers the problem. The zip file is 92 MB with the MKL libs included, and it fails to upload. Should I remove the MKL libs and try again?

Gennady_F_Intel
Moderator

Yes, you may remove the MKL libs and upload the project.

You may also check the problem with the latest MKL 2017 u3, which we released one week ago. The announcement is at the top of the forum.

Paul_C_2
Beginner

Ok, version without bin libs is attached. We'll try update 3 later this week.

Gennady_F_Intel
Moderator

I checked with MKL 2017 u3 on three different CPUs. I only added the mkl_version routine, and I see the same behavior and no exceptions.

Windows 8.1, 64 bit.

 ZGESVD Example Program Results
Major version:           2017
Minor version:           0
Update version:          3
Product status:          Product
Build:                   20170413
Platform:                Intel(R) 64 architecture
Processor optimization:  Intel(R) Streaming SIMD Extensions 4.2 (Intel(R) SSE4.2) enabled processors
================================================================

 info, zgetrf = 2
Press any key to continue . . .

 ZGESVD Example Program Results
Major version:           2017
Minor version:           0
Update version:          3
Product status:          Product
Build:                   20170413
Platform:                Intel(R) 64 architecture
Processor optimization:  Intel(R) Advanced Vector Extensions (Intel(R) AVX) enabled processors
================================================================

 info, zgetrf = 2
Press any key to continue . . .

 ZGESVD Example Program Results
Major version:           2017
Minor version:           0
Update version:          3
Product status:          Product
Build:                   20170413
Platform:                Intel(R) 64 architecture
Processor optimization:  Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors
================================================================

 info, zgetrf = 2
Press any key to continue . . .

Paul_C_2
Beginner

We just verified that 2017 u3 seems to work. Can you do us a favor and try it on u2 just to confirm our findings and maybe let us know what changed? It seemed like the FORTRAN runtime libs were putting the exception mask into a bad state.

Paul_C_2
Beginner

Any update on what changed between update 2 and update 3 that fixed this issue?

 

Eugene_C_Intel1
Employee

Hi Paul,

You are right, it was a bug in the LU implementation for small sizes (2x2, 3x3 and 4x4). It was introduced in MKL 2017 Update 1 and fixed in MKL 2017 Update 3. A column was scaled even when the pivot was zero, which caused division by zero and NaNs in the matrix.
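
To illustrate the kind of guard described above, here is a minimal Python sketch of a textbook unblocked LU with partial pivoting. It is not MKL's implementation; the function name `lu_partial_pivot`, the `guard_zero_pivot` flag, and the assumption that the posted 16 values are stored column-major are all made up for this illustration. With the guard, the matrix from the original post yields info = 2. Without it, the column is divided by the zero pivot, which Python surfaces as a ZeroDivisionError.

```python
def lu_partial_pivot(a, n, guard_zero_pivot=True):
    """Unblocked LU with partial pivoting on an n x n complex matrix.

    `a` is a flat column-major list (element (i, j) is a[i + j*n]) and is
    overwritten in place. Returns (info, ipiv) with LAPACK-style 1-based
    numbering: info == k > 0 means U(k, k) is exactly zero.
    """
    info = 0
    ipiv = [0] * n
    for j in range(n):
        # Pivot row: largest |re| + |im| in column j, rows j..n-1 (first on ties).
        p = max(range(j, n),
                key=lambda i: abs(a[i + j * n].real) + abs(a[i + j * n].imag))
        ipiv[j] = p + 1
        pivot = a[p + j * n]
        if pivot == 0 and guard_zero_pivot:
            # Every candidate in this column is zero: record the first such
            # column and skip the scaling (the update would be a no-op).
            if info == 0:
                info = j + 1
            continue
        if p != j:
            for k in range(n):  # swap rows j and p across all columns
                a[j + k * n], a[p + k * n] = a[p + k * n], a[j + k * n]
        for i in range(j + 1, n):
            a[i + j * n] /= pivot  # raises ZeroDivisionError if pivot == 0
        for k in range(j + 1, n):  # rank-1 update of the trailing block
            for i in range(j + 1, n):
                a[i + k * n] -= a[i + j * n] * a[j + k * n]
    return info, ipiv


# The 4x4 matrix from the original post (assumed column-major).
c = complex(1000000.0005243536, -16.191775445209515)
PAUL_MATRIX = [
    1e6j, 1e6j, 0j, 0j,
    1e6j, 1e6j, 0j, 0j,
    0j, 0j, c, complex(-1e6, 0.0),
    0j, 0j, complex(-1e6, 0.0), c,
]

info, ipiv = lu_partial_pivot(list(PAUL_MATRIX), 4)
print("info =", info, "ipiv =", ipiv)
```

The first two columns of the matrix are identical, so the second pivot is exactly zero after the first elimination step, which is why info = 2 is the expected result here.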
