I believe you have uncovered

ereisch · ‎03-29-2017

I've stumbled across a very interesting side-effect of the optimizer when compiling between -O0 and -O2 (there are some other flags, which I'll get to below). First, my code:

REAL*4 X1, X2, Y1, Y2
REAL*8 COORD
X1 = -321.90371704101562500
X2 = -225.23478698730468750
Y1 = -559.80267333984375000
Y2 = -714.07904052734375000

COORD = SQRT((X1-X2)**2+(Y1-Y2)**2)

The interesting result is that you get higher precision in the optimized version of code. The optimized version gets 182.06065368652344, and the non-optimized version gets 182.06063842773438 (a difference of 1.5e-5). A closer examination of the generated assembler shows that in the optimized version, the operands are subtracted, squared, and added all in the floating point registers, and then the "fsqrt st0" instruction is executed directly. However, in the non-optimized version, the intermediate values are computed, but then the resulting intermediate value is stored in a 32-bit register with "fstp dword ptr [esp], st0", and the "__bwr_sqrtf" function is called to perform the actual SQRT (apparently, after first checking to ensure the operand is positive). So with the optimized version you get the full precision of the FP registers for the intermediate, but in the non-optimized version the intermediate is cast to a single precision float prior to the fsqrt instruction. We're seeing it as a difference in our case because it appears as though the intermediate is large enough (33146.0795...) to lose enough precision in the cast. It appears as though the result of the SQRT(X) is cast to a single-precision float prior to being stored in double-precision COORD in both cases.

I understand the Fortran manual states that the result of SQRT(X) is the same type as X, but the generated assembler seems to be contrary to the -fltconsistency argument in that the intermediate is being [inconsistently] cast to a single precision value prior to calling fsqrt.

Our optimized compile flags are:

-extend-source 132 -assume nobscc -align dcommons -sox -fpe0 -fp-stack-check -march=corei7 -fimf-arch-consistency=true -fimf-precision=high -no-fma -prec-div -fp-port -fp-speculation=strict -prec-sqrt -fltconsistency -O2 -m32

Our non-optimized compile flags are:

-extend-source 132 -assume nobscc -align dcommons -sox -fpe0 -fp-stack-check -march=corei7 -fimf-arch-consistency=true -fimf-precision=high -no-fma -prec-div -fp-port -fp-speculation=strict -prec-sqrt -fltconsistency -g -debug full -debug-parameters -check bounds -gdwarf-2 -O0 -m32

TimP · ‎03-29-2017

You are probably running with x87 precision mask set to 53 bits so that the effect of register optimization in ia32 mode is the same as you would get by explicit promotion of expressions to double precision e.g. (dble(x1)-x2)**2 + (dble(y1)-y2)**2 . X87 code has always worked this way but the subject has become esoteric and little discussed in over a decade since the mode became obsolete.

As explained below, the deprecated option fltconsistency implies -mia32 (x87 code), apparently over-riding your -march setting.

ereisch · ‎03-29-2017

$fctrl is 0x262, so yes, the precision is 53 bits. But the problem here is that the call to SQRT in the non-optimized mode appears to be a library call (__bwr_sqrtf) instead of a direct CPU instruction. The sequence is:

fmulp st1, st0
faddp st1, st0
fstp qword ptr [esp], st0        <-- Casts st0 from x87 register precision to single precision
call <__bwr_sqrt>

__bwr_sqrt:
 push ebp
 mov evp, esp
 and esp, 0xfffffff0
 sub esp, 0x10
 fld st0, qword ptr [ebp+0x8]    <-- st0 re-populated, but with different (single-precision) value
 <some limit checking>
 fsqrt st0
 mov esp, ebp
 pop ebp
 ret

So, by using the esp register to store the intermediate result, the value is being explicitly cast to single precision. This cast doesn't happen on the optimized builds, because a subroutine isn't used in this case; it instead directly calls the fsqrt opcode which accesses the st0 register that has remained untouched since the faddp instruction (which is still in double-precision resolution). The result is two different values.

mecej4 · ‎03-29-2017

The simplest explanation, I think, is that regardless of the specific properties of the processor (X87, SSE2, etc.), the argument of the SQRT in your program is a single-precision expression. The following example is constructed to highlight that aspect.

program xsqrt
implicit none
real x
double precision d
integer i
!
i=1048577*65
x=i
d=x*x
print *,i,sqrt(d)
end

The second output number will vary with different CPU families and the compiler options that you use.

Martyn_C_Intel · ‎03-29-2017

Hi,

You don't tell us which compiler version you are using. I don't see the behavior you describe in recent compilers, only in version 15 and older.

You are building a 32 bit application, though I don't believe that has any impact here. The compiler will not normally generate the old x87 floating-point instructions, except under special circumstances, or if you are targeting a very old IA-32 processor that doesn't support SSE2 instructions (essentially, Pentium III or older) using -mia32. SSE and later instructions perform intermediate calculations in source precision and do not have the complication of x87 arithmetic where intermediate results may be stored at higher precision in the 80 bit x87 registers and may or may not be truncated to lower or source precision.

In your case, you are using the option -fltconsistency. This is an old switch, with a somewhat unfortunate name, which is kept for command line consistency with GCC. Amongst other things, it causes x87 instructions and registers to be used for floating-point computations. It also disables some (but not all) optimizations and transformations that might impact floating-point reproducibility.

Since about the version 9.1 compiler, the recommended way to achieve floating-point reproducibility is with the -fp-model switch. (It actually says this at the end of the section on -fltconsistency in the User Guide). For your code above, -fp-model precise is sufficient to get the same results at different optimization levels, even with older compilers. The impact on performance is potentially a lot less than -fltconsistency. For the best reproducibility, including between different processor types and between processors that do or do not support FMAs, use -no-fma -fimf-arch-consistency=true as you were doing, along with -fp-model precise. If you use the version 17 compiler, you can do this with a single switch, -fp-model consistent. This is much more complete than -fltconsistency; if you see any variations (within the same architecture) when using -fp-model consistent, please report them as a bug.

So I would remove -fp-port and -fltconsistency and replace by -fp-model precise. If you don't want to do this, moving to the version 17 compiler should avoid the variations that you reported. Note that neither result is more "correct" than the other, they are consistent within the accuracy of single precision. If you need more accurate results, the best way is to go to double precision.

There's an online article about floating-point reproducibility attached at https://software.intel.com/en-us/articles/consistency-of-floating-point-results-using-the-intel-compiler/ However, it doesn't cover -fltconsistency and doesn't talk much about x87 any more, you'd have to find an older version. Much better to avoid x87.

mecej4 · ‎03-29-2017

Ereisch: In your comments on the assembler code in #3, you are misinterpreting a couple of things. The ESP register is used to produce a memory address. The operand of the instruction is a qword, i.e., 64-bit (double precision) real.

fstp qword ptr [esp], st0 <-- ~~Casts st0 from x87 register precision to single precision~~

Similarly, in the subsequent FLD instruction in the subroutine, the value of ebp+8 is the same as the value of esp prior to the call. This is the standard pattern in the x86 architecture. When the subroutine is entered, the return address (+4 bytes) and the saved frame pointer (+4bytes) are pushed on to the stack, and they account for the displacement +8 that is applied to the new frame pointer.

ereisch · ‎04-10-2017

Assuming my code and arguments are correct, I believe this is still an issue in ifort 17.0.2 on Linux (please excuse the hastily-coded reproducer....my Fortran format-fu sucks).

test_a.for:

  PROGRAM OPTIMIZER_TEST

  IMPLICIT NONE
  REAL*8  R8_X
  REAL*4  Y, Z_1, Z_2

  DATA  R8_X /X'3FF913E540000000'/, Y /X'3F132DD0'/

  CALL F_SQRT_NOOPT( R8_X, Y, Z_1 )
  CALL F_SQRT_OPT( R8_X, Y, Z_2 )

  WRITE( *, '(F20.17, A, F20.17)' ) R8_X, ' ', Y
  WRITE( *, '(F20.17, A, F20.17)' ) Z_1, ' ', Z_2
  WRITE( *, '(Z, A, Z)' ) Z_1, ' ', Z_2
  END

test_b.for:

!DIR$ IF ( DBG == 1 )
  SUBROUTINE F_SQRT_NOOPT( X, Y, Z )
!DIR$ ELSE
  SUBROUTINE F_SQRT_OPT( X, Y, Z )
!DIR$ END IF
  IMPLICIT NONE
  REAL*8 X
  REAL*4 Y, Z
  INTENT(IN) X, Y
  INTENT(OUT) Z

  Z = SQRT( 1.0 / X ) * EXP10(Y)
  RETURN
  END

Compile commands:
ifort -extend-source 132 -assume nounderscore -assume nobscc -align dcommons -sox -assume protect-parens -m32 -c -ccdefault fortran -traceback -no-global-hoist -march=corei7 -auto -fimf-arch-consistency=true -fimf-precision=high -no-fma -prec-div -fp-speculation=strict -prec-sqrt -fp-model source -O3 -DOPT -fpp -c -o test_b_opt.o test_b.for
ifort -extend-source 132 -assume nounderscore -assume nobscc -align dcommons -sox -assume protect-parens -m32 -c -ccdefault fortran -traceback -no-global-hoist -march=corei7 -auto -fimf-arch-consistency=true -fimf-precision=high -no-fma -prec-div -fp-speculation=strict -prec-sqrt -fp-model source -O0 -DDBG -fpp -c -o test_b_noopt.o test_b.for
ifort -extend-source 132 -assume nounderscore -assume nobscc -align dcommons -sox -assume protect-parens -m32 -c -ccdefault fortran -traceback -no-global-hoist -march=corei7 -auto -fimf-arch-consistency=true -fimf-precision=high -no-fma -prec-div -fp-speculation=strict -prec-sqrt -fp-model source -O0 -c -o test_a.o test_a.for
ifort -m32 -o test test_a.o test_b_opt.o test_b_dbg.o

Output:

1.56735730171203613 0.57491779327392578
3.00146865844726563 3.00146842002868652
40401810 4040180F

ereisch · ‎04-10-2017

Additional observations:

Changing the "1.0/X" to "X" has no effect on accuracy
Removing the SQRT call makes the discrepancy go away
Removing the EXP10 call makes the discrepancy go away
Changing the EXP10 call's argument to X/(X+1.0) makes the discrepancy go away

jimdempseyatthecove · ‎04-11-2017

I think this is an expectation on your part about promotion of the precision of a mixed mode expression.

Try:

Z = SQRT( 1.0D0 / X ) * EXP10(Y)

or

Z = SQRT( 1.0D0 / X ) * EXP10(DREAL(Y))

Jim Dempsey

Martyn_C_Intel · ‎04-14-2017

I believe you have uncovered a real, if somewhat rare and subtle, difference between -O0 and -O2 on IA-32, even when all the consistency switches are set. I believe the conversion from single to double precision (for the result of the single precision exponential) is being done slightly differently in the two cases; at -O0 there is a double (two step) conversion, x87 to single precision in memory followed by SSE single to double precision with a cvtss2sd instruction, whereas at -O2 the conversion to double precision is being done in a single step in an x87 instruction and then copied via memory to SSE double precision directly. This issue does not arise on Intel64 where x87 instructions do not get generated.

I will need to check with a real expert, but I believe this is a known effect.

Martyn_C_Intel · ‎04-14-2017

(the IA32 and Intel64 ABIs are different; in the first case, function results are returned in x87 registers, in the second, in SSE registers).

Martyn_C_Intel · ‎04-28-2017

The experts have concluded that the compiler behavior at -O2 with -fp-model precise -fimf-arch-consistency=true on IA-32 is incorrect. I have escalated this internally as a bug. We'll post when a compiler containing a fix becomes available.

Thanks for the report!

Interesting SQRT behavior in optimizer