floating invalid calling DGEQRF with IFORT and MKL

Zapata__Felipe · ‎01-18-2019

I have implemented an orthogonalization method using the DGEQRF subroutine to compute a QR factorization. The compilation and subsequently execution works well using gfortran/MKL but when I compile and run it with Ifort/MKL I get the following error:

forrtl: error (65): floating invalid

I have checked for uninitialized variables or NaN values, and the problem seems to be an arithmetic operation inside DGEQRF.

Issue description:

Ifort version: ifort (IFORT) 19.0.1.144 20181018

MKL version: parallel_studio_xe_2019_update1_cluster_edition

The source code can be found at: https://github.com/NLESC-JCER/Fortran_Davidson

Line that cause the error: https://github.com/NLESC-JCER/Fortran_Davidson/blob/master/src/davidson.f90#L241

I compile the code using cmake like:

cmake -H. -Bbuild -DCMAKE_Fortran_COMPILER=ifort DCMAKE_BUILD_TYPE=Debug

cmake --build build

I run the resulting binary like:

./bin/main

Finally, The orthogonalization subroutine is

  subroutine lapack_qr(basis)
    !> Orthoghonalize the basis using the QR factorization.
    !> QR factorization of the M-by-N (M>N) matrx A=Q*R in the form where
    !> Q is square M-by-M matrix and R is an upper triangular M-by-N matrix.
    !> The equality A=Q*R can be re-written also as a product Q1*R1 where Q1
    !> is a rectangular M-by-N submatrix of the matrix Q and R1 is M-by-M
    !> submatrix of the R. Let us note that columns of Q1 are orthonormal
    !> (they are orthogonal to each other and have norms equal to 1).
    !> The equality A=Q1*R1 can be treated as every column of A is a linear
    !> combination of Q1 columns, i.e. they span the same linear space.
    !> In other words, columns of Q1 is the result of ortogonalization of columns A.
    !> DGEQRF does not not compute Q directly, DORGQR must be call subsequently.
    
    !> \param basis
    !> \return orthogonal basis    

    implicit none
    real(dp), dimension(:, :), intent(inout) :: basis
    real(dp), dimension(:), allocatable :: work ! workspace, see lapack documentation
    real(dp), dimension(size(basis, 2)) :: tau ! see DGEQRF documentation
    integer :: info, lwork, m, n

    ! Matrix shape
    m = size(basis, 1)
    n = size(basis, 2)

    ! 1. Call the QR decomposition
    ! 1.1 Query size of the workspace (Check lapack documentation)
    allocate(work(1))
    call DGEQRF(m, n, basis, m, tau, work, -1, info)

    ! 1.2 Allocate memory for the workspace
    lwork = max(1, int(work(1)))
    deallocate(work)
    allocate(work(lwork))

    ! 1.3 Call QR factorization
    call DGEQRF(m, n, basis, m, tau, work, lwork, info)
    deallocate(work)
    
    ! 2. Generates an orthonormal matrix
    ! 2.1 Query size of the workspace (Check lapack documentation)
    allocate(work(1))
    call DORGQR(m, n, min(m, n), basis, m, tau, work, -1, info)

    ! 2.2 Allocate memory fo the workspace
    lwork = max(1, int(work(1)))
    deallocate(work)
    allocate(work(lwork))

    ! 2.3 compute the matrix Q
    call DORGQR(m, n, min(m, n), basis, m, tau, work, lwork, info)
    
    ! release memory
    deallocate(work)
    
  end subroutine lapack_qr

I really appreciate any help you can provide.

Best,

Felipe Z.

Gennady_F_Intel · ‎01-20-2019

Could you set MKL_VERBOSE=1 and give us the log file around qr function?

mecej4 · ‎01-20-2019

I think that in situations such as this, it is largely up to the original poster to isolate one call (or a small number of calls) to MKL routines and provide a description of why it is thought that the MKL routine(s) are not working correctly. In particular, it should be checked whether the arguments to the MKL routine have reasonable and meaningful values.

I printed the profile of the 50 X 8 matrix BASIS being passed to DGEQRF in file davidson.f90. Here is the pattern of zero/non-zero values:

Row 1            X000XXXX
Row 2            0X00XXXX
Row 3            00X0XXXX
Row 4            000XXXXX
Rows 5 to 50     0000XXXX

Is this correct?

Zapata__Felipe · ‎01-20-2019

The following is the output of MKL_VERBOSE when calling the qr subroutine:

MKL_VERBOSE DPOSV(U,50,1,0x7ffce57b0290,50,0x7ffce57b50b0,50,1) 428ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE DGEQRF(50,32,0x2e18640,50,0x7ffce57c60f0,0x2de6e40,-1,0) 432ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
forrtl: error (65): floating invalid

For some reason that I need to investigate further, the linear system in the previous step calling the "DPOSV" subroutine failed. I need to figure up why it works when the program is compiled with gfortran.

Also, the pattern of zeros/non-zeros value that is passed to the DGEQRF should correspond roughly to a basis for a given subspace used to compute the eigenvalue/eigenvector pair of a given matrix. In the example, I am trying to compute the lowest 3 eigenvalues of a 50 x 50 matrix using an initial subspace of dimension 4, using the Davidson method (http://www.netlib.org/utk/people/JackDongarra/etemplates/node402.html)

Thank you for your time

mecej4 · ‎01-20-2019

The code in davidson.f90 in your Github repository ignores the values of the status variable info after calls to Lapack routines such as DGEQRF, DPOSV, etc. This is acceptable only if (based on experience or previous tests) you are confident that info = 0. For the test case that is first run, I find that after the first call to DPOSV we get info=4 and for the next few subsequent calls, info=1. This (i.e., info > 0) implies that the matrix is not positive definite, because of which the factorization/solution failed.

When such a situation is encountered, the calculation should be stopped or some appropriate correction made before continuing. To continue the execution of the program despite the failed solution may invite further errors of an unpredictable nature, and no conclusions should be drawn as to which compiler+library "works" and which does not.

Gennady_F_Intel · ‎01-20-2019

+ our two cents : in the case if info == 0 and you continue to see the runtime failure, please give us the reproducer which will greatly reduce the investigation time from our side. thanks.

Zapata__Felipe · ‎01-21-2019

I have wrongly assumed that the matrix that I passed to the qr subroutine is positive define.

Thank you for your kind advice!