Solved: Re: Segmentation fault at LAPACK zgeev for large matrices (N=60k and larger).

AlexKor · ‎08-03-2020

Dear All,

I have to solve very large eigenvalue problem (complex, no symmetries, so zgeev is used) and encountered the following issue: for very large matrices (size of the matrix N=60000 and higher) if you ask to find right eigenvectors segfault happens. For sizes N=32768 everything is OK. If I ask only to find eigenvalues no segfault as well. I use LAPACK subroutine zgeev through Fortran interface in C (so really I call in my C code zgeev_ ).

Segfault doesn't happen right away, which is very irritating. It looks like zgeev does something wrong only when it starts some operation which is not associated with eigenvalues alone, but needed for eigenvectors.

VERY IMPORTANT! It looks like it can be some general LAPACK/BLAS problem, as I get the same issue using OpenBLAS version of LAPACK/BLAS. May be some old type of "int" overflow in FORTRAN part of code.

Operation system is Gentoo Linux, kernel (uname -a output):
Linux Thor 5.1.12-gentoo #1 SMP Sat Jun 22 22:14:11 MDT 2019 x86_64 Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz GenuineIntel GNU/Linux

I have 2CPU system, which results in 24 real cores and 48 HT cores. 128GiB of RAM. So N=2*32768
should almost fit into the memory (although it shouldn't be problem as well, as I have around 300GiB of swap), but N=2*32000 and N=2*30000 have to fit in real RAM. Matrix of complex doubles for N=2*32768=65536 requires 64GiB, eigenvectors require the same amount of memory + negligible workspace amount.

Intel MKL is the newest one, was downloaded from Intel web-site just yesterday. I use gcc-9.1.0 for compilation and linking, but it shouldn't be the issue, as in the moment of sigsegv program is in zgeev subroutine which is from binary MKL.

Please, let me know if I need to provide more information or perform more tests.

With my best regards,

Alexander Korotkevich.

Here is the dmesg output:
[35085992.907111] in eigenfuncRV_k_30000_MKL[401000+b84000]
[35085992.907113] eigenfuncRV_k_3[7790]: segfault at 62596bec0 ip 00000000006e1cb7 sp 00007f25a32f1c10 error 6
[35085992.907115] Code: 00 00 00 48 39 df 7f 3e 4c 89 ca 4c 89 c0 48 89 f9 48 c1 e2 04 48 c1 e0 04 48 03 55 10 48 03 45 18 66 90 48 8b 32 48 83 c1 01 <48> 89 30 48 8b 72 08 48 83 c2 10 48 89 70 08 48 8d 71 ff 48 83 c0
[35085992.907119] in eigenfuncRV_k_30000_MKL[401000+b84000]
[35085992.907120] Code: 00 00 00 48 39 df 7f 3e 4c 89 ca 4c 89 c0 48 89 f9 48 c1 e2 04 48 c1 e0 04 48 03 55 10 48 03 45 18 66 90 48 8b 32 48 83 c1 01 <48> 89 30 48 8b 72 08 48 83 c2 10 48 89 70 08 48 8d 71 ff 48 83 c0

AlexKor · ‎08-04-2020

Gennady, hello,

I think I figured out the source of the problem. zgeev is not the source. In order to call FORTRAN interface of LAPACK one has to provide all number constants as "int" type. These are definitions which I have in the code:

#define N 60000

.......

int lapack_N, lapack_LDA, lapack_LDVL, lapack_LDVR, lapack_LWORK, lapack_INFO=0;

lapack_N = (N);
lapack_LDA = lapack_N;
lapack_LDVL = 1;
lapack_LDVR = lapack_N;
lapack_LWORK = 17*4*(2*lapack_N); /* Minimum 2*lapack_N, for good performance usually bigger */
lapack_WORK = (complex double *) malloc (lapack_LWORK*sizeof(complex double));
lapack_W = (complex double *) malloc (lapack_N*sizeof(complex double)); /* Array for eigenvalues */
lapack_VR = (complex double *) malloc (lapack_LDVR*lapack_N*sizeof(complex double)); /*Array for eigenvectors */

And this is where the issue most probably is located. I was trying to follow FORTRAN LAPACK documentation from Netlib word to word. This is what is written for the size of VR array:

VR is COMPLEX*16 array, dimension (LDVR,N)
          If JOBVR = 'V', the right eigenvectors v(j) are stored one
          after another in the columns of VR, in the same order
          as their eigenvalues.
          If JOBVR = 'N', VR is not referenced.
          v(j) = VR(:,j), the j-th column of VR.

This is what is written about LDVR:

LDVR is INTEGER
          The leading dimension of the array VR.  LDVR >= 1; if
          JOBVR = 'V', LDVR >= N.

Of course, the same thing is written about another parameter supplied to zgeev:

N is INTEGER
          The order of the matrix A. N >= 0.

As soon as N in #define exceeds 46340 even by 1, N*N exceeds upper limit for int-type. So now it is negative. Here is a simple illustration:
N=46340: NO Overflow in int for lapack_LDVR*lapack_N=2147395600
N=46341: Overflow in int for lapack_LDVR*lapack_N=-2147479015
When this negative number cast into size_t-type in malloc() it results in a HUGE number, which is enough to get SIGSEGV on any realistic machine.

And what is important, until zgeev starts generating eigenvectors, array is not touched, which means it is not really allocated in RAM. This is why it doesn't happen right away, but only after few hours.

I believe it is quite reasonable explanation. I attach the program which can reproduce this behavior, but it is definitely not some problem with LAPACK, just with my code. At the same time as it is pretty standard to use int-type variables to submit parameters into LAPACK when FORTRAN interface is used (normally all my integers are "unsigned long int" type exactly to avoid such situations), I believe it was instructive and may help somebody in the future.

Anyway, thank you for your time! If I wouldn't start producing an example may be I wouldn't find this issue so soon. In the attached example problematic instruction is on line 56. But the same behavior can be obtained even easier if one would comment out line 29 and comment in line 30. Then with -Wall option for compiler it is possible to get a warning:
zgeev_test.c:30:43: warning: integer overflow in expression of type ‘int’ results in ‘-2147479015’ [-Woverflow]
A_matrix = (complex double *) malloc ((N)*(N)*sizeof(complex double));
and an error at link time.
I decided to try to run a corrected version of the real computational code with N=64000 (for matrix and eigenvectors to fit in 128GiB of RAM for sure), instead of the attached example, and hopefully will get my eigenvectors tomorrow afternoon. Will confirm if everything will be OK.

With my best regards,

Alex.

View solution in original post

Gennady_F_Intel · ‎08-03-2020

Alexander, did you try to link with ILP64 API of MKL?

AlexKor · ‎08-03-2020

Gennady, here are my LDFLAGS:
LDFLAGS=-L ../../fftw3lib/lib -flto -Ofast -lfftw3 -lquadmath -lgfortran -Wl,--start-group /home/kao/MKL/mkl/lib/intel64/libmkl_intel_lp64.a /home/kao/MKL/mkl/lib/intel64/libmkl_gnu_thread.a /home/kao/MKL/mkl/lib/intel64/libmkl_core.a -Wl,--end-group -lgomp -lpthread -lm -ldl -static

So it looks like yes, I linked with libmkl_intel_lp64.a. And of course I haven't generated this line by myself, I used a form here:
https://software.intel.com/content/www/us/en/develop/articles/intel-mkl-link-line-advisor.html

I have to say I was strongly against using MKL even when I was able to get academic license and never considered it for my applications, as I have had bad experience spending time figuring out how to use it and when license expired I had understood that I wasted my time. Now, when it is available for free (as a beer) I am perfectly fine using it. But I figured it out only yesterday.

Another thing. Linker gives a warning (on my local machine):
(mkl_semaphore.o): in function `mkl_serv_inspector_suppress':
mkl_semaphore.c:(.text+0xb9): warning: Using 'dlopen' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking

At least on my laptop program always end with a segfault (only when it exits), although it works well and does the job. If I don't link with MKL everything is OK. Should I create a separate topic?

As I said, it looks like it is some problem with FORTRAN-C interaction as I have the same behavior with OpenBLAS. Should I try to use LAPACKE C-interface to check? Runs are about 24 hours each, so it will take some time.

With my best regards,

Alex.

Gennady_F_Intel · ‎08-03-2020

yes, I realized that my first reply was incorrect - it seems the problem is beyond of ILP64 usage mode.

Could you give us the reproducer which we could build and run on our side?

regarding the linker warnings - yes, it would be better to have a separate thread.

AlexKor · ‎08-03-2020

Will try to give one tomorrow when the code finishes.

AlexKor · ‎08-04-2020

Gennady, hello,

I think I figured out the source of the problem. zgeev is not the source. In order to call FORTRAN interface of LAPACK one has to provide all number constants as "int" type. These are definitions which I have in the code:

#define N 60000

.......

int lapack_N, lapack_LDA, lapack_LDVL, lapack_LDVR, lapack_LWORK, lapack_INFO=0;

lapack_N = (N);
lapack_LDA = lapack_N;
lapack_LDVL = 1;
lapack_LDVR = lapack_N;
lapack_LWORK = 17*4*(2*lapack_N); /* Minimum 2*lapack_N, for good performance usually bigger */
lapack_WORK = (complex double *) malloc (lapack_LWORK*sizeof(complex double));
lapack_W = (complex double *) malloc (lapack_N*sizeof(complex double)); /* Array for eigenvalues */
lapack_VR = (complex double *) malloc (lapack_LDVR*lapack_N*sizeof(complex double)); /*Array for eigenvectors */

And this is where the issue most probably is located. I was trying to follow FORTRAN LAPACK documentation from Netlib word to word. This is what is written for the size of VR array:

VR is COMPLEX*16 array, dimension (LDVR,N)
          If JOBVR = 'V', the right eigenvectors v(j) are stored one
          after another in the columns of VR, in the same order
          as their eigenvalues.
          If JOBVR = 'N', VR is not referenced.
          v(j) = VR(:,j), the j-th column of VR.

This is what is written about LDVR:

LDVR is INTEGER
          The leading dimension of the array VR.  LDVR >= 1; if
          JOBVR = 'V', LDVR >= N.

Of course, the same thing is written about another parameter supplied to zgeev:

N is INTEGER
          The order of the matrix A. N >= 0.

As soon as N in #define exceeds 46340 even by 1, N*N exceeds upper limit for int-type. So now it is negative. Here is a simple illustration:
N=46340: NO Overflow in int for lapack_LDVR*lapack_N=2147395600
N=46341: Overflow in int for lapack_LDVR*lapack_N=-2147479015
When this negative number cast into size_t-type in malloc() it results in a HUGE number, which is enough to get SIGSEGV on any realistic machine.

And what is important, until zgeev starts generating eigenvectors, array is not touched, which means it is not really allocated in RAM. This is why it doesn't happen right away, but only after few hours.

I believe it is quite reasonable explanation. I attach the program which can reproduce this behavior, but it is definitely not some problem with LAPACK, just with my code. At the same time as it is pretty standard to use int-type variables to submit parameters into LAPACK when FORTRAN interface is used (normally all my integers are "unsigned long int" type exactly to avoid such situations), I believe it was instructive and may help somebody in the future.

Anyway, thank you for your time! If I wouldn't start producing an example may be I wouldn't find this issue so soon. In the attached example problematic instruction is on line 56. But the same behavior can be obtained even easier if one would comment out line 29 and comment in line 30. Then with -Wall option for compiler it is possible to get a warning:
zgeev_test.c:30:43: warning: integer overflow in expression of type ‘int’ results in ‘-2147479015’ [-Woverflow]
A_matrix = (complex double *) malloc ((N)*(N)*sizeof(complex double));
and an error at link time.
I decided to try to run a corrected version of the real computational code with N=64000 (for matrix and eigenvectors to fit in 128GiB of RAM for sure), instead of the attached example, and hopefully will get my eigenvectors tomorrow afternoon. Will confirm if everything will be OK.

With my best regards,

Alex.

AlexKor · ‎08-05-2020

Yes, I was right. The issue I mentioned in the previous message (was mistakenly marked as spam by algorithm, but now approved through) was really the source of problem. Was able to successfully finish a problem with N=64000. I hope somebody will learn from my experience.

With my best regards,

Alex.

Gennady_F_Intel · ‎08-06-2020

Alex thanks for letting us know.

Gennady_F_Intel · ‎08-06-2020

This issue has been resolved and we will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only