floating-point assist fault in mkl_lapack_dtrtri and dcsrilu0 on Itanium2

eckardt4 · 11-12-2007

I'm using MKL (10.0.011, ia64) in finite element simulations on an Itanium2 system. Several times during a simulation I get the following message:

floating-point assist fault at ip 2000000000780561, isr 0000020000000008

This leads to a significant slow-down of my simulations. Using

prctl --fpemu=signal

the program stops in mkl_lapack_dtrtri or dcsrilu0. I have prepared a small example that reproduces the failure in mkl_lapack_dtrtri:

#include <cstring>   // memcpy
#include <iostream>  // std::cout
#include <mkl.h>     // dgetrf_, dgetri_
int matrix_inverse (double *mat, double *inv, int dim)
{

    memcpy (inv,mat,dim*dim *sizeof(double));
    int* ipiv = new int[dim];
    double* work = new double[dim *dim];

    int info;
    dgetrf_ (&dim, &dim, inv, &dim, ipiv, &info);
    dgetri_ (&dim, inv, &dim, ipiv, work, &dim, &info);

    delete[] work;
    delete[] ipiv;
    return 0;
}


int main()
{
    for(int i = 0; i < 1000; i++)
    {
        std::cout << i << std::endl;
        double test1[9] = {1,0,0,0,1,0,0,0,1};
        double inv[9];
        matrix_inverse(test1,inv,3);
    }
    return 0;
}

The program is compiled with (Compiler Version: 10.0.026):

icpc -O0 -g -ftz -o test test.cpp -I/ahome/ism/eckardt4/intel/mkl/include -L/ahome/ism/eckardt4/intel/mkl/lib/64/lib -lmkl_lapack -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lguide -lpthread

Running in idb:

> idb ./test
Intel Debugger for applications running on IA-64, Version 10.0-29 , Build 20070405
------------------
object file name: ./test
Reading symbols from /tmp/eckardt4/test...done.
(idb) run
[New Thread 2305843009251186048 (LWP 13474)]
[New Thread 2305843009251186048 (LWP 13474)]
Starting program: /tmp/eckardt4/test
0
1
Program received signal SIGFPE
mkl_lapack_dtrtri () in /ahome/ism/eckardt4/intel/mkl_10.0.011/lib/64/libmkl_lapack.so
(idb) where
#0  0x2000000000780562 in mkl_lapack_dtrtri () in /ahome/ism/eckardt4/intel/mkl_10.0.011/lib/64/libmkl_lapack.so
#1  0x200000000046d8d0 in mkl_lapack_dgetri () in /ahome/ism/eckardt4/intel/mkl_10.0.011/lib/64/libmkl_lapack.so
#2  0x200000000120f920 in DGETRI () in /ahome/ism/eckardt4/intel/mkl_10.0.011/lib/64/libmkl_intel_lp64.so
#3  0x4000000000001440 in matrix_inverse (mat=0x607fffffff4963b0, inv=0x607fffffff496400, dim=3) at test.cpp:13
#4  0x4000000000001820 in main () at test.cpp:28
#5  0x200000000217bc20 in __libc_start_main () in /lib/libc-2.4.so
#6  0x4000000000000ec0 in _start () in /tmp/eckardt4/test

dmesg gives the following message:

test(13474): floating-point assist fault at ip 2000000000780561, isr 0000020000000008

When compiled without debug information (without -g), the example program runs without the fault.

The original finite element code is also compiled without debug information (-O2 -ftz), yet it still shows this message, not only in mkl_lapack_dtrtri but also in dcsrilu0.
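
One thing that may be worth ruling out: on Itanium, floating-point assist faults are typically associated with denormal (subnormal) operands or results that the hardware hands over to software. The following is a minimal sketch, not part of MKL, for checking whether a buffer passed to a routine already contains such values. The helper name count_subnormals is hypothetical and was introduced only for illustration.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical diagnostic helper (not an MKL function): counts how many of
// the n doubles in data are subnormal (denormal) according to
// std::fpclassify.
std::size_t count_subnormals(const double* data, std::size_t n)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (std::fpclassify(data[i]) == FP_SUBNORMAL) {
            ++count;
        }
    }
    return count;
}
```

In the example above it could be called as count_subnormals(inv, dim*dim) between dgetrf_ and dgetri_. A non-zero count would point at the data; a zero count would suggest the denormals, if any, arise inside the library routine.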

Thank you,

Stefan Eckardt

TimP · 11-12-2007

I would suggest you submit your problem report on your premier.intel.com support account.
We have seen somewhat similar situations where software not under our control sent console
messages at an unacceptable rate, so it was necessary to suppress the message. Evidently, it
would be preferable to have the source of the problem investigated.
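
If suppressing the message is acceptable as a stopgap, it can also be requested from inside the program instead of through the prctl command. The sketch below assumes a Linux system whose <sys/prctl.h> provides PR_SET_FPEMU and PR_FPEMU_NOPRINT; the kernel honours this option only on ia64 and rejects it elsewhere. The function name silence_fp_assist_messages is hypothetical.

```cpp
#include <sys/prctl.h>
#include <cerrno>

// Hypothetical helper: asks the kernel to emulate floating-point assists
// silently for this process. Returns true on success. On failure (for
// example on a non-ia64 kernel) it returns false and errno is set by prctl.
bool silence_fp_assist_messages()
{
    return prctl(PR_SET_FPEMU, PR_FPEMU_NOPRINT, 0, 0, 0) == 0;
}
```

This only hides the kernel log message. The assists themselves still happen, so the slow-down would be expected to remain. Passing PR_FPEMU_SIGFPE instead corresponds to the prctl --fpemu=signal setting used above to locate the faulting routine.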