dstegr function causes segmentation fault

____4 · ‎04-26-2011

My colleague and I tested a very simple program that calls 'dstegr' from MKL, and the function does not seem to work correctly. The code is below.

[cpp]#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <mkl.h>

#define N 2000

int main(int argc, char *argv[])
{
 char jobz='V', range='I';
 MKL_INT n=N;
 double d[N], e[N];   /* diagonal and off-diagonal of the tridiagonal matrix */
 double vl=0.0, vu=0.1, abstol=0.000001;
 MKL_INT il=1, iu=10, m=0, ldz=N;
 double w[N], z[N*N], work[18*N];   /* automatic (stack) storage; z alone holds N*N doubles */
 MKL_INT isuppz[2*N], iwork[10*N];
 MKL_INT liwork=10*N, lwork=18*N;
 MKL_INT info=0;

 double duration;
 MKL_INT i;
 clock_t start, finish;

 /* constant diagonal 4.0, constant off-diagonal 1.0 */
 for(i=0;i<N;i++) {
 d[i]=4.0;
 e[i]=1.0;
 }

 start = clock();

 dstegr_(&jobz,&range,&n,d,e,&vl,&vu,&il,&iu,&abstol,&m,w,z,&ldz,isuppz,work,&lwork,iwork,&liwork,&info);

 finish = clock();
 duration = (double)(finish - start) / CLOCKS_PER_SEC;
 printf( "%f seconds\n", duration );

 /* print the first ten entries of d, e and the computed eigenvalues w */
 for(i=0;i<10;i++)
 printf("%.8f:%f:%.8f\n",d[i],e[i],w[i]);

 return(0);
}

[/cpp]

With dynamic linking, using either the ilp64 or the lp64 interface, the program always ends in a segmentation fault.
With static linking and the ilp64 interface, it seems to be OK.
With static linking and the lp64 interface, the behavior differs between platforms:

On a Xeon 5450 machine (32 GB, RHEL 5.1) with the Intel 11.0.083 C++ compiler and MKL 10.1.1.019, the code runs with the lp64 interface, but a segmentation fault occurs on return from the function that called "dstegr".

Gdb shows:
Program received signal SIGSEGV, Segmentation fault.
0x0000000000000001 in ?? ()

and valgrind shows:
==3255== Invalid write of size 8
==3255== at 0x40198A: main (tri.c:9)
==3255== Address 0x7FEFFF0F8 is on thread 1's stack
==3255==
==3255== Invalid write of size 8
==3255== at 0x4019A5: main (tri.c:13)
==3255== Address 0x7FEFFF0E0 is on thread 1's stack
==3255==
==3255== Invalid write of size 8
==3255== at 0x4019B3: main (tri.c:13)
==3255== Address 0x7FEFFF0E8 is on thread 1's stack
==3255==
==3255== Invalid write of size 8
==3255== at 0x4019BF: main (tri.c:13)
==3255== Address 0x7FEFFF0F0 is on thread 1's stack
==3255==
==3255== Invalid write of size 8
==3255== at 0x401A05: main (tri.c:26)
==3255== Address 0x7FEFFB260 is on thread 1's stack
==3255== Stack overflow in thread 1: can't grow stack to 0x7FEFFB260
==3255==
==3255== Process terminating with default action of signal 11 (SIGSEGV)
==3255== Access not within mapped region at address 0x7FEFFB260
==3255== at 0x401A05: main (tri.c:26)
==3255==
==3255== Invalid write of size 8
==3255== at 0x48022D8: _vgnU_freeres (vg_preloaded.c:56)
==3255== Address 0x7FD111158 is on thread 1's stack

On two other machines, one a Xeon 5504 with RHEL 5.3, the Intel 11.1 compiler and MKL 10.2, the other a Xeon E7-4870 with RHEL 6 and Intel Parallel Studio XE 2011, using lp64 also causes a segmentation fault, and the function does not seem to run at all. Gdb gives the following:

Program received signal SIGSEGV, Segmentation fault.
0x0000000000411b57 in mkl_lapack_dlar1v ()

The code runs fine with the netlib LAPACK library, so we think the MKL implementation of this function is incorrect. Could this issue be confirmed and fixed soon?

Gennady_F_Intel · ‎04-26-2011

I don't think this is the MKL's problem.

You can try allocating all the working arrays dynamically (for example,

double* w = (double*) malloc( N * sizeof(double) ); instead of double w[N], z[N*N]) and check how that works.

--Gennady

mecej4 · ‎04-27-2011

You are running out of stack. With N=2000, you require a stack of about 40 Mbytes.

If you compile with the /F40000000 option, your code will run without errors on Windows. Find out how to adjust the maximum run-time stack for your OS.

If you change N, you will have to recompute how much stack is required, and specify that to the compiler.

Gennady's suggestion is better in the long run.

____4 · ‎04-27-2011

In fact, we used dynamic memory allocation at first. When we hit the segmentation fault, we switched to static arrays. So we don't think changing the allocation mode will help.

Gennady_F_Intel · ‎04-27-2011

In this case, which version of MKL do you use? And could you give us your link line?

We will try to reproduce the problem with dynamically allocated arrays.

--Gennady

____4 · ‎04-27-2011

Quoting mecej4

You are running out of stack. With N=2000, you require a stack of about 40 Mbytes.

If you compile with the /F40000000 option, your code will run without errors on Windows. Find out how to adjust the maximum run-time stack for your OS.

If you change N, you will have to recompute how much stack is required, and specify that to the compiler.

Gennady's suggestion is better in the long run.

So I have changed my code back to dynamic allocation, like this:

[cpp]#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <mkl.h>

#define N 2000

int main(int argc, char *argv[])
{
    char jobz='V', range='I';
    double *d, *e, *w, *z, *work;
    double vl=0.0, vu=0.1, abstol=0.000001;
    MKL_INT n=N, il=1, iu=10, m=0, ldz=N, info=0;
    MKL_INT *isuppz, *iwork, liwork=10*N, lwork=18*N;

    double duration;
    MKL_INT i;
    clock_t start, finish;

    /* all arrays live on the heap, so the stack limit is no longer a factor */
    d=(double*)malloc(N*sizeof(double));
    e=(double*)malloc(N*sizeof(double));
    w=(double*)malloc(N*sizeof(double));
    z=(double*)malloc(N*N*sizeof(double));
    work=(double*)malloc(18*N*sizeof(double));
    iwork=(MKL_INT *)malloc(10*N*sizeof(MKL_INT));
    isuppz=(MKL_INT *)malloc(2*N*sizeof(MKL_INT));

    /* constant diagonal 4.0, constant off-diagonal 1.0 */
    for(i=0;i<N;i++){
        d[i]=4.0;
        e[i]=1.0;
    }

    start = clock();

    dstegr_(&jobz,&range,&n,d,e,&vl,&vu,&il,&iu,&abstol,&m,w,z,&ldz,isuppz,work,&lwork,iwork,&liwork,&info);

    finish = clock();
    duration = (double)(finish - start) / CLOCKS_PER_SEC;
    printf( "%f seconds\n", duration );

    /* print the first ten entries of d, e and the computed eigenvalues w */
    for(i=0;i<10;i++)
    printf("%.8f:%f:%.8f\n",d[i],e[i],w[i]);

    free(isuppz);
    free(iwork);
    free(work);
    free(z);
    free(w);
    free(d);
    free(e);
    return(0);
}

[/cpp]

compile the code:
icc -g -o tri tri.c -I$MKLROOT/include -L$MKLROOT/lib/em64t -Wl,-Bstatic -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread

then run again:
0.010000 seconds
1.99999754:0.500001:2.00000246
1.49999692:0.666668:2.00000986
1.33332950:0.750002:2.00002218
1.24999538:0.800003:2.00003944
1.19999458:0.833337:2.00006162
1.16666044:0.857147:2.00008874
1.14285010:0.875005:2.00012078
1.12499215:0.888895:2.00015775
1.11110244:0.900007:2.00019966
1.09999051:0.909099:2.00024649
Segmentation fault

Even when I add a printf just before the program's return statement, its output IS printed correctly.
My guess is that the function modifies something on the stack, so the caller cannot return normally.

____4 · ‎04-27-2011

Quoting Gennady Fedorov (Intel)

In this case, which version of MKL do you use? and Could you give us your linking line?
we will try to reproduce the problem with dynamically allocated arrays.
--Gennady

We have 3 platforms.

Xeon 5450 / 32G / RHEL5.1 / Intel 11.0.083 compiler / 10.1.1.019 MKL
On this machine, the code produces a computation result when linked against the static library.

Xeon 5504 / 4G / RHEL 5.3 / Intel 11.1 compiler / 10.2 MKL
Xeon E7-4870 / 128G / RHEL 6 / Intel parallel studio XE 2011
On these two machines, the code produces no result and reports a segmentation fault immediately.

Before we ran the program, we had already set the memory limit and the stack limit to "unlimited".

barragan_villanueva_ · ‎04-27-2011

Hi,

I've reproduced your problem even with N=200 (the same problem for static and dynamic MKL libraries)

Program received signal SIGSEGV, Segmentation fault.
0x0000000000411e57 in mkl_lapack_dlar1v ()
(gdb) bt
#0 0x0000000000411e57 in mkl_lapack_dlar1v ()
#1 0x0000000000407883 in mkl_lapack_dlarrv ()
#2 0x0000000000404ab0 in mkl_lapack_dstemr ()
#3 0x00000000004029b9 in mkl_lapack_dstegr ()
#4 0x000000000040252c in dstegr_ ()
#5 0x0000000000402176 in main (argc=1, argv=0x7fff985e4068) at dsteqr+.c:35

But your linking line is not fully correct

icc -g -o tri tri.c -I$MKLROOT/include -L$MKLROOT/lib/em64t -Wl,-Bstatic -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread

Please try the MKL Link Line Advisor.

____4 · ‎04-27-2011

Quoting Victor Pasko (Intel)

Hi,

I've reproduced your problem even with N=200 (the same problem for static and dynamic MKL libraries)

Program received signal SIGSEGV, Segmentation fault.
0x0000000000411e57 in mkl_lapack_dlar1v ()
(gdb) bt
#0 0x0000000000411e57 in mkl_lapack_dlar1v ()
#1 0x0000000000407883 in mkl_lapack_dlarrv ()
#2 0x0000000000404ab0 in mkl_lapack_dstemr ()
#3 0x00000000004029b9 in mkl_lapack_dstegr ()
#4 0x000000000040252c in dstegr_ ()
#5 0x0000000000402176 in main (argc=1, argv=0x7fff985e4068) at dsteqr+.c:35

But your linking line is not fully correct

icc -g -o tri tri.c -I$MKLROOT/include -L$MKLROOT/lib/em64t -Wl,-Bstatic -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread

Please try the MKL Link Line Advisor.

Yes. It's the same problem we hit on the latter two machines I described in my last post. It occurs in MKL 10.2 and newer.

In 10.1, everything looks fine until the calling function returns.

The MKL Link Line Advisor is a good and interesting tool, thank you!

yuriisig · ‎04-27-2011

I do not recommend using the dstegr function. See http://software.intel.com/en-us/forums/showthread.php?t=73653&o=d&s=lr

Gennady_F_Intel · ‎04-27-2011

quote "It occurs in 10.2 or newer MKL...."

I checked the problem with the latest 10.3 Update 3 version and couldn't reproduce the segmentation fault.

--Gennady

yuriisig · ‎04-27-2011

Quoting ?? ?

#define N 2000
......................
double ...., z[N*N], ....

mecej4 offered the correct solution: ".../F40000000 ..."
Intel MKL is not at fault.

____4 · ‎04-27-2011

Quoting yuriisig

Quoting ?? ?
#define N 2000
......................
double ...., z[N*N], ....
mecej4 offered the correct solution: ".../F40000000 ..."
Intel MKL is not at fault.

1. The /Fn option does not exist in icc on Linux.

2. Stack size is not the issue. The version with dynamically allocated memory also segfaults.

____4 · ‎04-27-2011

Quoting Gennady Fedorov (Intel)

quote "It occurs in 10.2 or newer MKL...."
I checked the problem with the latest 10.3.Update3 version and couldn't reproduce the segmentation .
--Gennady

I updated MKL to 10.3 Update 3 last night (on the E7-4870 machine with RHEL 6 x64), and the segfault is still there.

Could the problem be related to the compiler, glibc, the kernel, the Linux distribution, or other libraries?

____4 · ‎04-27-2011

Quoting ?? ?

Quoting Gennady Fedorov (Intel)

quote "It occurs in 10.2 or newer MKL...."
I checked the problem with the latest 10.3.Update3 version and couldn't reproduce the segmentation .
--Gennady
I updated the MKL to 10.3 update 3 last night (the E7-4870 machine with RHEL 6 x64), and seg fault is still there.
Is it possible that the problem has relationship with compiler/glibc/kernel/linux distribution/other libs?

I have just tried installing MKL 10.3 Update 3 on the Xeon 5450 RHEL 5.1 machine, and it didn't work there either...

Gennady_F_Intel · ‎04-27-2011

Hello, we have finally reproduced the problem and have already found its cause. The problem has been escalated and will be fixed soon. I will let you know when the fix is available.

--Gennady

yuriisig · ‎04-28-2011

Quoting ?? ?

1. /Fn parameter does not exist under Linux icc
2. Stack size is not the key point. Dynamic allocation of memory also made seg fault.

Okay.

But why do you use this routine? It is unreliable, and I have already written about that.

____4 · ‎04-28-2011

Quoting yuriisig

Quoting ?? ?
1. /Fn parameter does not exist under Linux icc
2. Stack size is not the key point. Dynamic allocation of memory also made seg fault.
Okay.

But why do you use this routine? It is unreliable, and I have already written about that.

It is used in a solver package named PSEPS (Parallel Symmetric Eigenvalue Package of Solver), which is developed by my colleague. I know nothing about the scientific background or how and why he uses the function. My job is to maintain the machines and the system environment. In any case, I have passed your suggestion on to him. Thank you!

____4 · ‎04-28-2011

Quoting Gennady Fedorov (Intel)

Hello, we have finally reproduced the problem and have already found its cause. The problem has been escalated and will be fixed soon. I will let you know when the fix is available.
--Gennady

That's really good news for me. I'll wait for the solution.

Thanks all!