Compiler or LAPACK bug.

John_D_12 · ‎08-29-2016

Hello!

I can't decide whether this is a compiler bug or a LAPACK bug.

I wrote the following code for diagonalizing a complex hermitian matrix:

program test
implicit none

integer, parameter :: n=27
integer i,j
integer liwork,lrwork,lwork,info
integer, allocatable :: iwork(:)
real(8), allocatable :: w(:),rwork(:)
complex(8), allocatable :: a(:,:),work(:)
liwork=5*n+3
lrwork=2*n**2+5*n+1
lwork=n**2+2*n
allocate(w(n),a(n,n))
allocate(iwork(liwork),rwork(lrwork),work(lwork))

a(:,:)=0.d0
do i=1,n
  a(i,i)=1.d0
  do j=i+1,n
    a(i,j)=cos(dble(i))
  end do
end do

call zheevd('V','U',n,a,n,w,work,lwork,rwork,lrwork,iwork,liwork,info)

print *,'info',info

end program

If I compile it with ifort and MKL BLAS/LAPACK it works fine. However, if I compile my own copy of LAPACK with ifort, the diagonalization fails with info>0. The problem can be traced to a division by zero in DLAED4.

This only happens with the Intel fortran compiler, gfortran works fine with natively compiled BLAS/LAPACK.

TimP · ‎08-29-2016

This may require a full reproducer. Perhaps that might be a useful test case for the strange ways in which -f[no-]inline-functions, -[no-]prec-div, and gradual underflow options work with the various releases of ifort. Remember that -no-ftz (presumably implied by -fp-model source) can work only when compiled in the main program, and it may be necessary to USE ieee_arithmetic .... call ieee_set_underflow_mode(gradual). The suspected situations include those where the zero you divided by was produced by abrupt underflow, and the prec-div replacements of divide by inversion and vice versa (possibly in combination).

Note there are opportunities for ifort to do the wrong thing if you don't set -assume protect-parens, so start with that, if you haven't already.

On linux, gfortran builds default to gradual underflow (but -ffast-math can switch to abrupt). gfortran will not invoke -freciprocal-math nor parenthesis violation implicitly unless -ffast-math is set. For starters, we need to know the build options you use with ifort and gfortran, and the compiler versions.

John_D_12 · ‎08-29-2016

Tim P. wrote:

This may require a full reproducer. Perhaps that might be a useful test case for the strange ways in which -f[no-]inline-functions, -[no-]prec-div, and gradual underflow options work with the various releases of ifort. Remember that -no-ftz (presumably implied by -fp-model source) can work only when compiled in the main program, and it may be necessary to USE ieee_arithmetic .... call ieee_set_underflow_mode(gradual). The suspected situations include those where the zero you divided by was produced by abrupt underflow, and the prec-div replacements of divide by inversion and vice versa (possibly in combination).

Note there are opportunities for ifort to do the wrong thing if you don't set -assume protect-parens, so start with that, if you haven't already.

On linux, gfortran builds default to gradual underflow (but -ffast-math can switch to abrupt). gfortran will not invoke -freciprocal-math nor parenthesis violation implicitly unless -ffast-math is set. For starters, we need to know the build options you use with ifort and gfortran, and the compiler versions.

I should have mentioned: compiling with -O0 or -O1 works, while -O2 or -O3 both fail. You're correct that it's some optimisation which results in a loss of precision. Thus it should probably be classified as a LAPACK bug and not a compiler bug.

Surprising that this hasn't been noticed before...

mecej4 · ‎08-29-2016

You mentioned that you used your own compiled Lapack. Which Lapack source distribution did you use? The current version is 3.6.1, see http://www.netlib.org/lapack/#_lapack_version_3_6_1_2 .

John_D_12 · ‎08-29-2016

mecej4 wrote:

You mentioned that you used your own compiled Lapack. Which Lapack source distribution did you use? The current version is 3.6.1, see http://www.netlib.org/lapack/#_lapack_version_3_6_1_2 .

I tried 3.6.1, 3.6.0, 3.5.0 and 3.4.2 and they all have the problem.

jimdempseyatthecove · ‎08-29-2016

You may want to look at your code where you have something like

IF (A .EQ. B) ...

Where MKL uses "approximately equal", and your code is using "exactly equal".

Jim Dempsey

John_D_12 · ‎08-29-2016

jimdempseyatthecove wrote:

You may want to look at your code where you have something like

IF (A .EQ. B) ...

Where MKL uses "approximately equal", and your code is using "exactly equal".

Jim Dempsey

It's actually not my code; it's LAPACK (http://www.netlib.org/lapack/).

The problem occurs at line 285 of DLAED4 :

DELTA( J ) = ( D( J )-D( I ) ) - TAU

D(I) and D(J) are large and almost equal so the difference is almost zero. With optimization -O2 or -O3, the difference is exactly zero which results in a division by zero later on.

mecej4 · ‎08-29-2016

Compiling all of Lapack is no trivial matter, so I looked for a short cut. Luckily for us, DLAED4 has few dependencies, so the following diagnostic test can be performed to see if, indeed, the error occurs in the specific place that you described.

1. I supplied a dummy subroutine DLAED4() as follows, and used a different compiler than Intel's to compile your test program with the dummy subroutine and the Lapack static library that came with that compiler . The result is an unformatted file that contains the input arguments to DLAED4.

subroutine dlaed4(N,I,D,Z,DELTA,RHO,DLAM,INFO)
integer N,I,INFO
double precision D(N),Z(N),DELTA(N),RHO,DLAM
open(11,file='dlaed4.dat',form='unformatted',status='replace')
write(11)n,i,d,z,rho
close(11)
stop
end subroutine dlaed4

2. I downloaded the source and dependencies of DLAED4 from Netlib. I added the following line after line-286 of dlaed4.f:

         if(any(delta(1:N) == 0d0))stop 'Delta() contains zero'

I used the following program to read the file written in Step-1 above and call DLAED4:

program xdlaed4
implicit none
integer, parameter :: NN = 27
integer N,I,INFO,J
double precision D(NN),Z(NN),DELTA(NN),RHO,DLAM
open(11,file='dlaed4.dat',form='unformatted',status='old')
read(11)n,i,d,z,rho
close(11)
call dlaed4(N,I,D,Z,DELTA,RHO,DLAM,INFO)
write(*,*)' INFO = ',info,'  DLAM = ',dlam
write(*,'(1x,i3,2x,ES15.7)')(j,delta(j),j=1,N)
end program

I compiled this program with the Netlib sources with IFort 16.0.2 on Windows (note: MKL or any other full Lapack library is not needed). No divisions by zero were trapped, with /O2 or /O3. Perhaps, the zeros that you saw were symptoms of other problems that may have happened elsewhere before DLAED4 was called?

You may object to my using a different compiler than Intel's in Step-1. To answer this, I ran another test, in which I used IFort and the sequential version of the MKL libraries, with the following directive in the dummy subroutine:

!dir$ ATTRIBUTES ALIAS:'_mkl_lapack_dlaed4' :: dlaed4

The data file produced was not byte-for-byte identical with that produced by the other compiler, but the end results were the same: no errors, no divisions by zero.

John_D_12 · ‎08-29-2016

Thanks for checking this. I tested ifort version 16.0.1, 14.0.0 and 13.0.1 and they all fail.

(Also, as I mentioned above, MKL works fine, so there is no need to test with MKL.)

I'm not sure why you wrote all the additional code. Just compile BLAS/LAPACK with 'ifort -O3' and the code I attached and you should be able to reproduce the error.

TimP · ‎08-30-2016

It's still entirely expected that ifort option -assume protect_parens will be among the requirements for building lapack. The relatively recent introduction of the alternatives such as -standard-semantics seems to be about as far as Intel would go in changing policy.

If the problem you assert varies with ifort version it would be much more interesting if it could be demonstrated with a current version. As I said, I have observed instances where ifort appears to behave different from documentation, and it would be good to have a clear reproducer which the originator is willing to submit as a test case. I was thinking of trying to build your case with and without source code for dlaed4 to see what happens. Even if it reproduces a problem, there is the question of whether Intel policy would allow dealing with a problem easily avoided by using the supplied library.

John_D_12 · ‎08-31-2016

I added '-assume protect_parens' and that fixes the problem.

This seems to be a problem with LAPACK. The divide-and-conquer routine should not depend on which part of an expression the compiler evaluates first.

Thanks for your efforts,

John.

Steven_L_Intel1 · ‎08-31-2016

It's reasonable for LAPACK to assume that the compiler follows the evaluation rules in the standard.

TimP · ‎09-01-2016

This dependence on protect_parens continues in the 2017u1 beta compiler. It looks like non-standard compilation in places such as

DO 10 J = 1, N
DELTA( J ) = ( D( J )-D( I ) ) - MIDPT

....

           DO 130 J = 1, N
               DELTA( J ) = ( D( J )-D( I ) ) - TAU
130       CONTINUE
         ELSE
            DO 140 J = 1, N
               DELTA( J ) = ( D( J )-D( IP1 ) ) - TAU

(dlaed4.f) may be at fault. But my testing indicates the problem may be elsewhere in the procedures supplied as dependencies of zheevd.

It appears possible to cut the number of subtractions in those loops nearly in half by violating parentheses, but this seems unlikely to speed up that procedure significantly. In order to justify this shortcut for the Pentium III and earlier CPUs, one would have had to assure that the CPU was running in 64-bit precision mode (ifort option pc80). This possibility became unattractive with Pentium4.

Presence of such parentheses in professionally developed source is usually an indication that such numerical requirements have been recognized.

gfortran developers chose to treat this situation more carefully than gcc does, by introducing the option -fno-protect-parens which is not set implicitly by any other compiler option. If and when gcc drops parentheses, it may do so more consistently than Intel compilers do. So this is an area where it has been seemingly difficult to make Intel compilers consistent with others.

Here is a more complete presentation on the subject of accurate summation, with cases which may break with legacy K&R C style re-association:

http://www.phys.uconn.edu/~rozman/Courses/P2200_11F/downloads/sum-howto.pdf

Compilers which shortcut the guards in Kahan's source in order to vectorize may be able to achieve the less accurate (but not necessarily totally wrong, as with Intel compilers) result expected without the precautions.