A bug in zgelsd in MKL 15.0

Yue_W_ · ‎03-23-2015

Hi there,

Thank you for reading this post.

I get the following error messages when calling zgelsd in MKL 15.0 to solve a least-squares problem with a fairly large matrix:

Intel MKL INTERNAL ERROR: Condition 1 detected in function DLASD4.

Intel MKL INTERNAL ERROR: Condition 1 detected in function DLASD8.

I searched online and found the same issue reported at https://software.intel.com/en-us/forums/topic/373673, where it was said that the bug had been fixed in MKL 11 update 5.

The matrix is 23066 * 23068, which is more than 500 million complex numbers. At first I thought it might be an overflow issue, because I only encounter the error with matrices of this size or larger. However, 500 million is still far less than 2^31-1. I also have a case, in which the coefficients are computed in a slightly different way, where the solver works and gives the correct result. (I am developing a code used in our group, and the coefficients computed in the two ways are mostly the same, with slight differences in a few places.)
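The size arithmetic above can be checked directly. The following is a minimal sketch in C; the helper names are illustrative and are not part of MKL. It assumes 16 bytes per double-precision complex entry. It shows that the entry count fits in a signed 32-bit integer, while the storage size in bytes does not. This does not by itself say anything about the cause of the errors.

```c
#include <stdint.h>

/* Number of entries in an m-by-n matrix, computed in 64-bit arithmetic. */
static int64_t element_count(int64_t m, int64_t n)
{
    return m * n;
}

/* Storage in bytes for m*n double-precision complex entries
   (assumed to be 16 bytes each: two 8-byte doubles). */
static int64_t storage_bytes(int64_t m, int64_t n)
{
    return element_count(m, n) * 16;
}

/* Returns 1 if the value can be held in a signed 32-bit integer, else 0. */
static int fits_in_int32(int64_t value)
{
    return value >= INT32_MIN && value <= INT32_MAX;
}
```

For the 23066 * 23068 case, element_count gives 532086488 entries and storage_bytes gives 8513383808 bytes (roughly 8.5 GB) for the matrix alone, before any solver workspace.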

Thank you,

Yue

Yue_W_ · ‎03-23-2015

Sorry, I am not sure about the MKL version. I am using the one shipped in composer_xe_2015.2.164.

Ying_H_Intel · ‎03-23-2015

Hi Yue,

Could you please tell us which platform you are working on? It would be helpful if you could provide the test code and the test matrix.

I tried the code and test matrix from https://software.intel.com/en-us/forums/topic/373673 on a 64-bit Linux machine, dynamically linked, with composer_xe_2015.2.164, and the run completed with info = 0:

[yhu5@prc-mic01 ~]$ source /opt/intel/composer_xe_2015.2.164/mkl/bin/mklvars.sh intel64

[yhu5@prc-mic01 F373673_zgelsd]$ gcc -Wall -g -O0 -o zgelsd_bug zgelsd-bug.c -fno-strict-aliasing -L $LD_LIBRARY_PATH -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5 -lm

[yhu5@prc-mic01 F373673_zgelsd]$ ./zgelsd_bug
Reading the input matrix...
Reading the input RHS...
Done
info = 0

[yhu5@prc-mic01 F373673_zgelsd]$ gcc --version
gcc (GCC) 4.4.6 20110731 (Red Hat 4.4.6-3)

Best Regards,
Ying

Yue_W_ · ‎03-23-2015

Hi Ying,

Thank you for the prompt reply.

I'm working on Linux. Yes, I can send you the solver part of the test code. However, the matrix saved as a formatted file is around 30 GB. I'll convert it to unformatted form and see whether I can compress it further.

Thank you,

Yue

Yue_W_ · ‎03-24-2015

Hi Ying,

I made two tarballs (~13 GB together) containing two sets of data and the test code. One set of data works, while the other fails with these error messages. Is there any way I can send you the data?

Thank you,

Yue

Ying_H_Intel · ‎03-29-2015

Hi Yue,

Thanks for the test package. We are trying it now. The run seems to take a long time, and I will keep you updated on any results.

Thanks

Ying

Yue_W_ · ‎03-29-2015

Hi Ying,

Thank you for the update. Yes, the run takes a while (should be around 10 to 15 hours). Thanks!

Regards,

Yue

Ying_H_Intel · ‎03-31-2015

Hi Yue,

We are able to reproduce the errors. There is an optimization-related issue inside MKL that causes some loss of precision, so the algorithm cannot converge on that particular matrix. I have recorded the problem in our bug list, and our developers will fix it later.

As a temporary workaround, could you please try zgelss? (It works fine for this matrix.)

We have two SVD based algorithms for solving Least Squares problems:

1. ?gelsd – using SVD based on Divide and Conquer (D&C),

2. ?gelss – using SVD based on QR.

D&C algorithms are faster and use fewer flops, but they are less stable, and there are some matrices they are unable to solve. QR-based algorithms are more robust but slower. Basically, we try solving with the D&C algorithm first; if it reports an error, we rerun the same task with QR.
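The try-D&C-then-QR pattern described above can be sketched as follows. This is only an illustration under stated assumptions, not MKL code. The callback type lsq_solver, the function solve_with_fallback, the status LSQ_NO_MEMORY, and the two demo solvers are all hypothetical names. In real code the callbacks would wrap calls such as LAPACKE_zgelsd and LAPACKE_zgelss. Storage is assumed to be column-major, with a of size lda-by-n and b of size ldb-by-nrhs. Because the LAPACK least-squares drivers overwrite both A and B, the sketch keeps copies of the inputs so that the second attempt starts from the original data. It retries only on a positive info code, which for these drivers signals a convergence failure.

```c
#include <complex.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical status returned when the backup copies cannot be allocated. */
#define LSQ_NO_MEMORY (-1000)

/* Hypothetical callback type: a least-squares driver that overwrites a and b
   and returns a LAPACK-style info code (0 on success, > 0 if the SVD failed
   to converge, < 0 for an illegal argument). */
typedef int (*lsq_solver)(int m, int n, int nrhs,
                          double complex *a, int lda,
                          double complex *b, int ldb);

/* Run `primary` (for example a zgelsd wrapper). If it reports a convergence
   failure, restore a and b from the saved copies and run `fallback`
   (for example a zgelss wrapper). Returns the info code of the last solver
   that ran; *used_fallback is set to 1 if the fallback was used. */
static int solve_with_fallback(lsq_solver primary, lsq_solver fallback,
                               int m, int n, int nrhs,
                               double complex *a, int lda,
                               double complex *b, int ldb,
                               int *used_fallback)
{
    size_t a_bytes = (size_t)lda * (size_t)n * sizeof *a;
    size_t b_bytes = (size_t)ldb * (size_t)nrhs * sizeof *b;
    double complex *a_copy = malloc(a_bytes);
    double complex *b_copy = malloc(b_bytes);
    int info;

    *used_fallback = 0;
    if (a_copy == NULL || b_copy == NULL) {
        free(a_copy);
        free(b_copy);
        return LSQ_NO_MEMORY;
    }
    memcpy(a_copy, a, a_bytes);
    memcpy(b_copy, b, b_bytes);

    info = primary(m, n, nrhs, a, lda, b, ldb);
    if (info > 0) {
        /* The first attempt destroyed its inputs, so restore them. */
        memcpy(a, a_copy, a_bytes);
        memcpy(b, b_copy, b_bytes);
        *used_fallback = 1;
        info = fallback(m, n, nrhs, a, lda, b, ldb);
    }

    free(a_copy);
    free(b_copy);
    return info;
}

/* Demo stand-in that mimics a failed solve: clobbers its inputs and
   returns info = 1. Not a real solver. */
static int demo_failing_solver(int m, int n, int nrhs,
                               double complex *a, int lda,
                               double complex *b, int ldb)
{
    (void)m; (void)n; (void)nrhs; (void)lda; (void)ldb;
    a[0] = 0.0;
    b[0] = 0.0;
    return 1;
}

/* Demo stand-in that solves only the 1-by-1 system a[0] * x = b[0]. */
static int demo_working_solver(int m, int n, int nrhs,
                               double complex *a, int lda,
                               double complex *b, int ldb)
{
    (void)m; (void)n; (void)nrhs; (void)lda; (void)ldb;
    b[0] = b[0] / a[0];
    return 0;
}
```

The trade-off is memory: keeping a copy of A roughly doubles the footprint of the matrix, which matters for a problem of this size. If memory is tight, the inputs can instead be reloaded from disk before the second attempt.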

Best regards,

Ying

Yue_W_ · ‎03-31-2015

Hi Ying,

Thank you for the update. I'll try zgelss and do some tests.

Thanks,

Yue