Bug in zgesdd

marko_l_ · ‎08-12-2016

I am using the zgesdd routine to find the singular value decomposition for a matrix, however, for certain sizes of the matrix the routine will cause an access violation (segmentation fault) and the program will be forced to stop by the OS (Win10 64bit).

Regardless of the compilation options (whether or not I allow AVX2 extensions or not) the call stack points to

mkl_blas_avx2_dgemm_kernel_0_b0()

To the best of my knowledge my CPU (i7-4720HQ) should support AVX2 routines (although I wouldn't expect them to be called if I specify that they should not be used).

It seems that zgesvd does not share the same issue and appears to be working fine for me.

Some sizes that cause a crash:

191 by 354 (354x191 would be the output of the code below, not really the standard when it comes to matrices, I know)
282 by 498 (498x282)
242 by 420 (242x420)

The above were found with the following code, you may also directly input those values into the test_matrix constructor and the crash will (or it did for me) always happen.

#include <iostream>
#include <random>
#include <ctime>
#include <new>
#include <tuple>

#include <mkl/mkl.h>
#include <mkl/mkl_lapack.h>
#include <mkl/mkl_blas.h>

class matrix {
public:
	ll x, y;
	MKL_Complex16* mat;
	matrix() {
		x = y = 0;
		mat = NULL;
	}
	matrix(ll X, ll Y) {
		x = X;
		y = Y;
		mat = new MKL_Complex16[x * y]();
	}
	~matrix() {
		x = y = 0;
		if (mat != NULL) delete[] mat;
	}
	void set(ll X, ll Y) {
		x = X;
		y = Y;
		if (mat != NULL) delete[] mat;
		mat = new MKL_Complex16[x * y]();
	}
	void clear() {
		x = y = 0;
		if (mat != NULL) {
			delete[] mat;
			mat = NULL;
		}
	}
}

class test_matrix {
public:
	matrix data;
	test_matrix(int x, int y) {
		data.set(x, y);
		for (int i = 0; i < x * y; i++) {
			data.mat.real = i + 1;
		}
	}
	~test_matrix() {
		data.clear(); //Don't really need this...
	}
};

tuple<matrix*, double*, matrix*, ll> matrixSVD(matrix* in) {
	ll ssize = in->y < in->x ? in->y : in->x;
	matrix* U = new matrix(in->y, in->y);
	matrix* VT = new matrix(in->x, in->x);
	double* S = new double[ssize]();
#ifndef _TEST_SVD
	LAPACKE_zgesdd(LAPACK_ROW_MAJOR, 'A', in->y, in->x, in->mat, in->x, S, U->mat, U->x, VT->mat, VT->x);
#endif
#ifdef _TEST_SVD
	double* sp = new double[5 * ssize]();
	LAPACKE_zgesvd(LAPACK_ROW_MAJOR, 'A', 'A', in->y, in->x, in->mat, in->x, S, U->mat, in->y, VT->mat, in->x, sp);
	delete[] sp;
#endif
	return make_tuple(U, S, VT, ssize);
}

int main(void) {
	mt19937_64 gen(clock());
	uniform_int_distribution<int> dist(1, 500);
	for (int i = 0; i < 10000; i++) {
		int i1 = dist(gen), i2 = dist(gen);
		cout << i << ": " << i1 << 'x' << i2 << '\n';
		test_matrix tmt(i1, i2);
		matrix* U;
		double* S;
		matrix* VT;
		ll s;
		tie(U, S, VT, s) = matrixSVD(&tmt.data);
		delete U;
		delete VT;
		delete[] S;
	}
	return 0;
}

Pardon the messy code, it was just a test. It might still be that the fault is mine, but since it works when I change to zgesvd, that is unlikely. Any and all suggestions are of course welcome.

Thank you for your time.

Ying_H_Intel · ‎08-15-2016

Hi Marko,

How do you define the ll type? we can't build your code in MSVC 2012, what command line do you use for build the program?

as the two function use different algrithm, it is possible that one is not converge, so hang

Return Values
This function returns a value info.
If info=0, the execution is successful.
If info = -i, the i-th parameter had an illegal value.
If info = i, then ?bdsdc did not converge, updating process failed.

Best Regard,

Ying

marko_l_ · ‎08-15-2016

Hi Ying,

Thank you for your reply!

I knew I'd forget to put something in that code, if anything else is missing, please let me know.

The ll type is simply long long:

typedef long long ll;

Regarding the return values, zgesdd appears to corrupt the heap, so I don't really get to that part, but for smaller test cases, where this either didn't happen or was on a scale too small to be noticed, the algorithm did appear to converge to the correct values.

One thing I did notice now is that I simply allocate the memory with no regard to alignment, but assuming I turn off all extensions (SSE, AVX...), that should not cause any issues? (Although, as mentioned, the routine uses some avx2 subroutines regardless of this setting).

The command line is somewhat long (and I use VS2015, update 3 and Intel Parallel Studio 2016 update 3)

/MP /GS- /Qopenmp-offload:gfx /Qopenmp /GA /W3 /QxCORE-AVX2 /Gy /Zc:wchar_t /Zi /O2 /Fd"x64\Release\vc140.pdb" /Quse-intel-optimized-headers /Qgpu-arch=haswell /D "__builtin_huge_val()=HUGE_VAL" /D "__builtin_huge_valf()=HUGE_VALF" /D "__builtin_nan=nan" /D "__builtin_nanf=nanf" /D "__builtin_nans=nan" /D "__builtin_nansf=nanf" /D "_MBCS" /Qipo /Qoffload-arch:haswell:visa3.1 /Zc:forScope /Qopt-matmul /Oi /MT /QaxCORE-AVX2 /Fa"x64\Release\" /EHsc /nologo /Qparallel /Fo"x64\Release\" /Qprof-dir "x64\Release\" /Ot /Fp"x64\Release\DMRG.pch"

I imagine that the latest release makes the __builtin obsolete? (It caused a problem previously)

Regardless of whether /QaxCORE-AVX2 and /QxCORE-AVX2 are enabled or not, the same avx2 routine described in the original post is called and subsequently causes a crash.

Best regards,
Marko

marko_l_ · ‎08-15-2016

Ahh, I also forgot to add:

using namespace std;

to the snippet above.

Unfortunately I can't edit my posts, so I have to resort to posting a few more. :)

Ying_H_Intel · ‎08-17-2016

Hi Marko,

Thanks for the add. I can reproduce the problem at my sides. we will investigate it further and get back to you later.

Thanks

Ying

marko_l_ · ‎08-18-2016

Hi Ying,

Thank you for your reply.

There is no rush, since I can use zgesvd to solve the same problem. However, I do have a question, which of the singular value routines is generally considered to be the fastest (zgesvd, zgesdd, zgesvj, not sure if there are any others I'm forgetting)?

Best regards,
Marko

Ying_H_Intel · ‎08-21-2016

Hi Marko,

?GESDD is usually the preferable option.

You may see the description at

http://www.netlib.org/lapack/lug/node32.html

There are two types of driver routines for the SVD. Originally LAPACK had just the simple driver described below, and the other one was added after an improved algorithm was discovered.

a simple driver xGESVD computes all the singular values and (optionally) left and/or right singular vectors.

a divide and conquer driver xGESDD solves the same problem as the simple driver. It is much faster than the simple driver for large matrices, but uses more workspace. The name divide-and-conquer refers to the underlying algorithm (see sections 2.4.4 and 3.4.3).

Best Regards,
Ying

marko_l_ · ‎08-22-2016

Hi Ying!

Thank you for the reply. I did expect that divide and conquer would be faster, but I never knew where to check. Out of curiosity though, how large is "large", the matrices I'm dealing with are 200x200 at most, I imagine that's large enough for zgesdd to be faster than zgesvd, unless the overhead is massive?

Best Regards,
Marko

Ying_H_Intel · ‎08-22-2016

Hi Marko,

The 200x200 should be large enough for the SDD. We can try it out anyway, ( on one of my lab machine, sdd : real 0m0.269s vs. svd: real 0m0.490s)

Best Regards,

Ying

yhu5@ivb01-ub:~$ time ./a.out
sizeof(long long)4
0: 200x347
1: 346x414
2: 200x200
3: 200x200
4: 282x473
5: 406x311
6: 200x200
7: 475x200
8: 200x200
9: 200x415

real    0m0.269s
user    0m1.033s
sys     0m0.020s
yhu5@ivb01-ub:~$ icc test_zgesvd.cpp -mkl -std=c++11 -D_TEST_SVD
yhu5@ivb01-ub:~$ time ./a.out
sizeof(long long)4
0: 200x347
1: 346x414
2: 200x200
3: 200x200
4: 282x473
5: 406x311
6: 200x200
7: 475x200
8: 200x200
9: 200x415

real    0m0.490s
user    0m1.894s
sys     0m0.044s
yhu5@ivb01-ub:~$ time ./a.out
sizeof(long long)4
0: 200x347
1: 346x414
2: 200x200
3: 200x200
4: 282x473
5: 406x311
6: 200x200
7: 475x200
8: 200x200
9: 200x415

real    0m0.404s
user    0m1.545s
sys     0m0.048s

marko_l_ · ‎08-22-2016

Hi Ying!

Thank you for all your help.

Best regards,
Marko

marko_l_ · ‎10-04-2016

I was wondering, when can a fix be expected?
I find no mention of this being fixed in the latest release (11.3 Update 4).

Best regards,
Marko

Ying_H_Intel · ‎10-10-2016

Hi Marko,

I have only new version installed on several Linux machine. and try it, seems the bug was fixed in MKL 2017 version. Hopefully it was fixed on windows too.

If possible, please try either the release MKL 2017 (early of Sep) or MKL 11.3.4 (Oct) on windows?

Best Regards,

Ying

marko_l_ · ‎10-10-2016

Hi Ying!

Linux is fine, most of my computations are run on Linux machines. That said, I will update my Windows machine as well and check to see if it works there (no reason to believe it wouldn't).

Best regards,
Marko

marko_l_ · ‎10-11-2016

Hello again!

I have tested the above code (and appended it as a file to this post, without the missing lines in the first post) on both Windows and Linux. The only relevant change is that the array allocation is now aligned (if for whatever reason the compiler assumes this, though it shouldn't, but you never know), but that seems to make no difference.

Windows: access violation, last output: "790: 458x249"
Call stack still points to mkl_blas_avx2_dgemm_kernel_0_b0()

Linux: similar behavior (the difference in number is purely due to different implementations of the random generator by MSVC and GCC)
581: 118x223
*** Error in "./test": corrupted double-linked list: 0x0000000002fc2980 ***
Aborted

Removing the random sizes and running 458x249 will cause an immediate crash on Windows.
Interestingly enough, the same can not be said for Linux, neither 458x249 nor 118x223 will cause the program to crash (if we only run repetitively). HOWEVER, the crash at 581: 118x223 happened 8/10 times, 2/10 it crashed eariler at 455: 287x264. This indicates that some heap corruption still takes place.

In other words, it seems the issue is not yet fixed with the latest release (2017).

Using zgesvd results in a considerable slowdown, but it does work.

Best regards,
Marko

Ying_H_Intel · ‎10-11-2016

Hi Marko,

Interesting result!.

I tried the original code i had it works fine. but I just tried the test.cpp you provide seems i can reproduce the coredump again. .

Could you please try small loop and with the
>export MKL_VERBOSE=1

and let me know the output?

Mine is

MKL_VERBOSE Intel(R) MKL 2017.0 Product build 20160801 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 2.30GHz lp64 intel_thread NMICDev:0.

the main diff are the random size. i'm using mt19937_64 gen((77777)); and your mt19937_64 gen(12345).

You are right, it should be still a issue. i will investigate it later.

Best Regards,
Ying

icc test_zgesvd.cpp -o test_zgesvd -mkl -std=c++11

[yhu5_new@hsw-ep01 ~]$ diff test_zgesvd.cpp test.cpp
1,7d0
< // ConsoleApplication1.cpp : Defines the entry point for the console application.
< //
<
< //#include "stdafx.h"
<
< // TODO: reference any additional headers you need in STDAFX.H
< // and not in this file
10,11d2
< #include <ctime>
< #include <new>
17c8,10
< typedef MKL_INT ll;
---
>
> typedef long long ll;
>
31c24
< mat = new MKL_Complex16[x * y]();
---
> mat = (MKL_Complex16*)mkl_calloc(x * y, sizeof(MKL_Complex16), 64);
35c28
< if (mat != NULL) delete[] mat;
---
> if (mat != NULL) mkl_free(mat);
40,41c33,34
< if (mat != NULL) delete[] mat;
< mat = new MKL_Complex16[x * y]();
---
> if (mat != NULL) mkl_free(mat);
> mat = (MKL_Complex16*)mkl_calloc(x * y, sizeof(MKL_Complex16), 64);
46c39
< delete[] mat;
---
> mkl_free(mat);
83c76
< mt19937_64 gen((77777));//clock()));//77777);
---
> mt19937_64 gen(12345);
85,87d77
<
< cout << "sizeof(long long)" << sizeof(ll) << '\n';
<
89c79
< int i1 = max(dist(gen),200), i2 = max(dist(gen),200);
---
> int i1 = dist(gen), i2 = dist(gen);
102c92
< }
---
> }
\ No newline at end of file

3638: 200x438
3639: 200x200
3640: 200x484
3641: 238x200
3642: 425x200
3643: 295x200
3644: 382x392
3645: 200x392
3646: 200x200
3647: 370x200
3648: 478x203
3649: 200x209
3650: 200x314
3651: 352x442
3652: 272x209
3653: 200x435
3654: 230x297
3655: 200x200

9989: 200x200
9990: 200x298
9991: 428x483
9992: 458x386
9993: 200x289
9994: 200x291
9995: 323x251
9996: 200x440
9997: 454x200
9998: 200x200
9999: 425x200
[yhu5_new@hsw-ep01 ~]$

marko_l_ · ‎10-11-2016

Hi Ying!

Thanks for the fast reply.

Windows: MKL_VERBOSE Intel(R) MKL 2017.0 Product build 20160801 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Win 2.60GHz cdecl intel_thread NMICDev:0

Linux: MKL_VERBOSE Intel(R) MKL 2017.0 Product build 20160801 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 2.60GHz lp64 intel_thread NMICDev:0

For reference, these are different machines.

As for the random seed, that just changes when it crashes.

Best regards,
Marko

Ying_H_Intel · ‎11-20-2016

Hi Marko and ~

Just got news from our developers, the issue should be targeted to be fixed in next version. ( supposed 2017 update 2) , so let's wait and check it when the version is ready.

Best Regards,
Ying

marko_l_ · ‎02-28-2017

Hi Ying!

While the release notes don't indicate anything has been changed it does seem to work properly now (2017.2 on Windows - not yet installed on the Linux machines).

That being said, I'd like to thank you for all the time you've put into this.

Best regards,
Marko

Ying_H_Intel · ‎03-14-2017

Hi Marko,

Thanks you very much for the confirmation. Right, from our internal data base, it shows the issue fixed in 2017.2.

The information is supposed be in MKL bug fix list

https://software.intel.com/en-us/articles/intel-math-kernel-library-intel-mkl-2017-bug-fixes-list

Best Regards,

Ying

DPD200415105

Fixed LAPACKE_zgesdd failure with some specific input problems