Solved: Suspected optimizer bug

mecej4 · ‎04-10-2023

Here is a very short program that uses two routines from Lapack:

program tdlamc2
   implicit none
   integer beta, it, imin, imax, irnd
   double precision eps, rmin, rmax
   logical lrnd

   call dlamc2( beta, it, lrnd, eps, imin, rmin, imax, rmax )
   if( lrnd ) then
      irnd = 1
      eps = ( beta**( 1d0-it ) ) / 2
   else
      irnd = 0
      eps = beta**( 1d0-it )
   end if
   print 77, beta,it,irnd,imin,imax
77 format('base,it,irnd,  imin,imax: ',5I8)
   print 78,eps,rmin,rmax
78 format('eps,rmin,rmax: ',3ES20.12)
end program

The two Lapack files needed are: dlamch.f and lsame.f . Building and running with optimization disabled produces the correct result (with ifx as well as ifort):

ifx /MD /Od /traceback tdlamc2.f90 dlamch.f lsame.f
>tdlamc2
base,it,irnd,  imin,imax:        2      53       1   -1021    1024
eps,rmin,rmax:   1.110223024625E-16  2.225073858507-308  1.797693134862+308

Changing the optimization level to /O2 produces EXEs that hang (with ifx as well as ifort).

I used the recently released version of the compiler (20230320) on Windows 11 64-bit.

[NOTE: The routine DLAMC2 is available in MKL, but using that gives incorrect results, as I have reported in a separate post on the MKL forum.]

mecej4 · ‎04-11-2023

@Ron_Green, it's my turn to apologize! While searching for more information about DLAMCH and its siblings, I found the following warning in the Lapack FAQ:

The routine dlamch.f (and its dependent subroutines dlamc1, dlamc2, dlamc3, dlamc4, dlamc5) MUST be compiled without optimization. If you downloaded the entire lapack distribution this will be taken care of by the LAPACK/SRC/Makefile. However, if you downloaded a specific LAPACK routine plus dependencies, you need to take care that slamch.f (if you downloaded a single precision real or single precision complex routine) or dlamch.f (if you downloaded a double precision real or double precision complex routine) has been included.

Perhaps this is the same caution that, not having been placed in the source file in plain sight, was not recognized when MKL was built, and that lapse caused the DLAMCH in MKL to give incorrect results. My attempt to mend matters by downloading individual Lapack source files and compiling them with optimizations turned on probably made things worse.

I did file a bug report in the MKL forum, before I started this thread, but it has not been answered yet. Some coordination between the Intel compiler and MKL teams seems to be required to make the necessary corrections to the DLAMCH within MKL.

View solution in original post

Ron_Green · ‎04-11-2023

add

/warn:interfaces /standard-semantics /fp-model:precise

and try again. Also, so I don't have to waste 10 minutes finding dlamch.f and lsame.f, if the above does not help please attach those 2 source files.

mecej4 · ‎04-11-2023

I provided links to the two Netlib source files concerned (need to copy/paste, but that is how they provide the Lapack sources of individual routines), and I have attached a Zip file containing all three files below.

The bug (program hangs) does not occur if the /fp:precise option is used. Thanks.

Ron_Green · ‎04-11-2023

@mecej4 I apologize, I didn't read your first post carefully. I'll try to unpeel the PRECISE options to find the specific options needed. I suspect flush to zero, maybe precise-divide. Usually one of those 2. NO-FTZ if denormals are needed to converge. precise divide because non-precise is sloppy IMHO. Just my first 2 guesses.

Ron_Green · ‎04-11-2023

The plot thickens - I didn't expect floating overflow. Yes, this is going to need a deeper look.

$ ifort -g -O2 -xhost -fpe0 -traceback tdlamc2.f90 dlamchf77.f lsame.f -o tdlamc2

rwgreen@orcsle153:~/quad/tnew$ ./tdlamc2

forrtl: error (72): floating overflow

Image PC Routine Line Source

libc.so.6 00007F6CFB23EA30 Unknown Unknown Unknown

tdlamc2 0000000000404870 dlamc2_ 329 dlamchf77.f

tdlamc2 0000000000404258 MAIN__ 7 tdlamc2.f90

tdlamc2 00000000004041BD Unknown Unknown Unknown

libc.so.6 00007F6CFB229510 Unknown Unknown Unknown

libc.so.6 00007F6CFB2295C9 __libc_start_main Unknown Unknown

tdlamc2 00000000004040D5 Unknown Unknown Unknown

Aborted (core dumped)

mecej4 · ‎04-11-2023

The line number reported as the location of the FP overflow is not plausible, since the code on line 330 just increments an integer variable, "lt".

mecej4 · ‎04-11-2023

No problem Ron, and I wish to point out that the reason that I chose to report the issue as a suspected optimizer bug is that it took a considerable effort to isolate the cause of the program hangs.

The actual code where the issue occurred is a numerical nonlinear optimization program with about 18,000 lines of code plus Lapack and BLAS. I tried other compilers, and there were no hangs, but their runs were rather slow (about 30 seconds instead of less than 10 seconds with Ifort when it did not cause a program hang). One expects the possibility of incorrect/inaccurate results when optimization options are on, but hangs are tough to overcome -- no hang occurred if /Zi or /Zd options were used in an attempt to narrow down the area of code causing the hang.

Ron_Green · ‎04-11-2023

OK, so you can use

-fp-model source

along with optimizations. I would not call this an optimizer bug. Rather, looks like a recurring issue we see with legacy code. This loop, see what you think:

326 *+       WHILE( C.EQ.ONE )LOOP
327    30    CONTINUE
328          IF( c.EQ.one ) THEN
329             lt = lt + 1
330             a = a*lbeta
331             c = dlamc3( a, one )
332             c = dlamc3( c, -a )
333             GO TO 30
334          END IF
335 *+       END WHILE

I think you may see the problem with comparing C to ONE with .EQ.. Probably should be a check if they are close near epsilon. Also the initializations of these double precision vars. Right above this loop, look how REAL(8) C, A, LT are initialized:

322          lt = 0
323          a = 1
324          c = 1
325 *        
326 *+       WHILE( C.EQ.ONE )LOOP
327    30    CONTINUE
328          IF( c.EQ.one ) THEN
329             lt = lt + 1
330             a = a*lbeta
331             c = dlamc3( a, one )
332             c = dlamc3( c, -a )
333             GO TO 30
334          END IF
335 *+       END WHILE

You can backtrace ONE to see it's initialization also. a = 1 as an example:
With -fpe0 I saw LT overflow. obviously the c.eq.one never happened, hence the "hang"

I think the -fp-model source is a good fix without rewriting this routine to be numerically safe to modern standards.

jimdempseyatthecove · ‎04-11-2023

With the original code (using WHILE), compiled with optimizations .and. debug info...

When the loop hangs, issue a Pause (break into the program).

Then open a disassembly window, focus to that window, and step through the loop.

This should expose the issue as to if it is within function dlamc3 or if the C in the WHILE test is different from the C=.

Now I know the source code shows that they ought to be the same, but we are looking for (eliminating) potential optimization bug.

Jim Dempsey

mecej4 · ‎04-11-2023

@Ron wrote:

obviously the c.eq.one never happened, hence the "hang"

??? From the Fortran code, I have to conclude that the double precision variable C was initialized to 1.0D0. For the loop to run for ever, C would have to stay unchanged, at 1.0D0, i.e. c.eq.one remained .TRUE. all the time.

JohnNichols · ‎04-12-2023

I have never knowingly used LAPACK, but looking at the code for this routine, the use of FIRST shows the writers made a clever use of the save function, it was not obvious on the first 20 reads, but I got it, but then used C eq to one as a true statement.

The first was not obvious to a casual user and the second could have used logicals and not relied on one being one in Fortran. I would suggest it was not written by one person.

mecej4 · ‎04-12-2023

In Fortran 77, given the limitations of the DATA statement, the use of a saved logical or integer variable to control an IF..ENDIF block was a standard (the only?) way to initialize a variable with the value of an expression with an exact and unchanging value and avoid calculating that expression more than once. Here is an example that you are likely to find even in modern codes (F95, F2003, etc.)

logical, save :: begin = .true.
double precision, save :: pi
..
if(begin)then
   begin = .false.
   pi = 4d0*atan(1d0)
endif

More recent revisions of the Fortran standard have allowed more intrinsic functions to be used in initialization expressions, so you are now allowed to write

double precision, save :: pi = 4d0*atan(1d0)

There are also newer intrinsic modules that provide definitions for constants such as $π$ , so all that is needed is a USE statement.

Ron_Green · ‎04-11-2023

Ah, another mistake on my part. I'll revisit this later in the week when I have less interrupts. For now -fp-model precise or source is a reasonable work around. I'll need time to try gfortran with max optimizations and see what it does with this code. Maybe something in our optimizer. Maybe. just may be overly aggressive. I can try some older iforts as well.

Ron_Green · ‎04-11-2023

Interesting , gfortran with -ffast-math also hangs

gfortran -O2 tdlamc2.f90 dlamchf77.f lsame.f -o tdlamc2 -ffast-math
~/quad/tnew$ ./tdlamc2
^C
~/quad/tnew$ gfortran --version
GNU Fortran (GCC) 12.0.1 20220413 (Red Hat 12.0.1-0)
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Ron_Green · ‎04-11-2023

FMA is at play here as well, at least for the hang. Note that RMAX is infinity however. So other things at play in the optimizations.

ifort -g -O2 -xhost -traceback tdlamc2.f90 dlamchf77.f lsame.f -o tdlamc2 -no-fma
rwgreen@orcsle153:~/quad/tnew$ ./tdlamc2 
base,it,irnd,  imin,imax:        2      53       0   -1021    1025
eps,rmin,rmax:   2.220446049250E-16  2.225073858507-308            Infinity

mecej4 · ‎04-11-2023

@Ron_Green, it's my turn to apologize! While searching for more information about DLAMCH and its siblings, I found the following warning in the Lapack FAQ:

The routine dlamch.f (and its dependent subroutines dlamc1, dlamc2, dlamc3, dlamc4, dlamc5) MUST be compiled without optimization. If you downloaded the entire lapack distribution this will be taken care of by the LAPACK/SRC/Makefile. However, if you downloaded a specific LAPACK routine plus dependencies, you need to take care that slamch.f (if you downloaded a single precision real or single precision complex routine) or dlamch.f (if you downloaded a double precision real or double precision complex routine) has been included.

Perhaps this is the same caution that, not having been placed in the source file in plain sight, was not recognized when MKL was built, and that lapse caused the DLAMCH in MKL to give incorrect results. My attempt to mend matters by downloading individual Lapack source files and compiling them with optimizations turned on probably made things worse.

I did file a bug report in the MKL forum, before I started this thread, but it has not been answered yet. Some coordination between the Intel compiler and MKL teams seems to be required to make the necessary corrections to the DLAMCH within MKL.

Ron_Green · ‎04-11-2023

Like you, it would not occur to me to check the LAPACK faqs. Indeed that should be in the source header.

I will check in with a colleague on MKL to follow up on your post. Their forum is not as well monitored as ours.

Ron_Green · ‎04-11-2023

The MKL team is actually investigating already - good! Bug ID with them is MKLD-15217

I added your find in the FAQs to the bug report. Thank you for finding this, it's a huge help!

David_Billinghurst · ‎04-11-2023

This brings back memories. I recall issues with the optimization of DLAMCH with early releases (c1993) of LAPACK on VAXen and DECstation 3100 workstations.

mecej4 · ‎04-11-2023

The version of dlamch.f included in Lapack 3.11 gets all the information it needs using the standard intrinsic functions EPSILON(), TINY(), DIGITS(), etc., instead of tricky code that does not work well with optimization .

David_Billinghurst · ‎04-11-2023

Yes. The version using intrinsic functions version is much better - eliminating recurring issues. I had forgotten I had switched to that code eons ago.