Intel® Fortran Compiler

Commutativity of Floating Point Arithmetic

gongo
Beginner
That FP arithmetic can be tricky in a multi-threaded environment is well known. However, I am getting strange results depending on whether I use -O2 optimization or not, even though the code itself is serial. In particular, in the enclosed sample code, four complex arrays, z1, z2, z3, and z4, are initialized with random numbers so that their value ranges are pairwise at different scales. Then the operation z1*z2 + z3*z4 is performed elementwise and stored in an array z. Finally, sum(z) is computed. If I compile with the -O2 flag, the result depends on whether the operands appear in the above order or commuted, as z2*z1 + z4*z3 (controlled at runtime by a command-line argument). If I compile with -g, or with -fp-model precise, the two results are identical. Can anyone explain why this could be happening?
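For reference, a minimal sketch of the kind of code described (the actual attachment is not reproduced here; the array size, the scale factors, and the 'swap' argument are illustrative assumptions):

[bash]! compare_order.f90 -- sketch of the experiment: z1*z2 + z3*z4 vs z2*z1 + z4*z3
program compare_order
    implicit none
    integer, parameter :: N = 1000000
    complex, allocatable :: z1(:), z2(:), z3(:), z4(:), z(:)
    real, allocatable :: re(:), im(:)
    character(len=8) :: arg
    allocate(z1(N), z2(N), z3(N), z4(N), z(N), re(N), im(N))
    call random_number(re); call random_number(im)
    z1 = cmplx(re, im) * 1.0e+6          ! pairwise different scales
    call random_number(re); call random_number(im)
    z2 = cmplx(re, im) * 1.0e-6
    call random_number(re); call random_number(im)
    z3 = cmplx(re, im) * 1.0e+3
    call random_number(re); call random_number(im)
    z4 = cmplx(re, im) * 1.0e-3
    arg = ' '
    if (command_argument_count() > 0) call get_command_argument(1, arg)
    if (arg == 'swap') then
        z = z2*z1 + z4*z3                ! operands commuted
    else
        z = z1*z2 + z3*z4
    end if
    write(*,'(1x,2E20.12)') sum(z)       ! real and imaginary parts of the total
end program compare_order
[/bash]

Built with ifort -O2, the two runs can print different low-order digits; with -g or -fp-model precise they match, as described above.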

Thanks,
George
25 Replies
jimdempseyatthecove
Honored Contributor III
George,

When accumulating (adding or subtracting) FP numbers whose result requires more mantissa (non-exponent) bits than the system's FP mantissa holds, rounding occurs. If you change the order of the accumulations, you also change the points at which the rounding occurs. This can change the result (usually by a small amount).

While working the code (options or statements) such that the same sequence and the same precisions are used will produce reproducibility in your runs, it will not necessarily assure the highest precision. Higher precision (of accumulations) will occur if you accumulate the smallest values first and the largest last.
That is, adding a very small number to a very large number, where the difference in magnitudes exceeds the precision of the mantissa, is equivalent to accumulating 0's. Whereas adding all the small numbers together first might produce a partial sum whose difference in magnitude from the larger numbers is within the precision of the mantissa. Coding this way runs slower, since it may require a sort first.
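A two-statement illustration of the rounding point (a minimal sketch; it assumes default REAL is IEEE single precision with a 24-bit significand):

[bash]! roundoff.f90 -- order-dependent rounding in single precision
program roundoff
    implicit none
    real :: big
    big = 16777216.0            ! 2**24; above this, consecutive REALs are 2 apart
    print *, (big + 1.0) + 1.0  ! each add rounds the 1.0 away: prints 16777216.0
    print *, big + (1.0 + 1.0)  ! small values combined first survive: prints 16777218.0
end program roundoff
[/bash]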

Jim Dempsey
jimdempseyatthecove
Honored Contributor III
George,

To illustrate the point:

[bash]! rantest.f90
program rantest
    implicit none
    intrinsic random_number
    real, allocatable :: array(:)
    real :: sum1, sum2, clipLow, clipHigh
    real(8) :: dsum
    integer, parameter :: N = 100000
    integer :: I
    allocate(array(N))
    call random_number(array)            ! uniform values in [0,1)
    sum1 = 0.0
    dsum = 0.0_8
    do I=1,N
      sum1 = sum1 + array(I)             ! single-precision sum, in array order
      dsum = dsum + dble(array(I))       ! double-precision reference sum
    end do
    ! second pass: accumulate by bins of roughly doubling magnitude,
    ! smallest bin first, so small values combine before meeting large ones
    sum2 = 0.0
    clipLow = 0.0
    clipHigh = 1.0E-30
    do while(clipLow .lt. 1.0)
        do I=1,N
          if((array(I) .le. clipHigh) .and. (array(I) .gt. clipLow)) sum2 = sum2 + array(I)
        end do
        clipLow = clipHigh
        clipHigh = clipHigh * 2.0        ! advance to the next magnitude bin
        if(clipHigh .gt. 1.0) clipHigh = 1.0
    end do
    write(*,100) sum1, sum2, dsum
100 format(1x,Z8,1x,Z8,1x,Z16)           ! hex dump of the bit patterns (Z editing of reals is an extension)
    write(*,200) sum1, sum2, dsum
200 format(1x,E15.10,1x,E15.10,1x,D15.10)
end program rantest
[/bash]

Output:

[bash] 4743AF59 4743AF77 40E875F115F2F480
 .5009534766E+05 .5009546484E+05 .5009553393D+05
[/bash]
The second line of output shows a difference that illustrates how the accumulation of numbers of disparate magnitudes behaves like truncation.

The first float is the summation without regard to size. This is less than...
The second float is the summation in approximate order of size. This produces a slightly larger number that is closer to the true result.
The third number (double) is larger yet and, except for formatting/printing issues, represents the correct (or near-correct) result.

Jim Dempsey
jimdempseyatthecove
Honored Contributor III
Follow-up

I extended the above program to tally dsum1 and dsum2 alongside sum1 and sum2. The results:

[bash] 4743AF59 4743AF77 40E875F115F2F480 40E875F115F2F480
 .5009534766E+05 .5009546484E+05 .5009553393D+05 .5009553393D+05
[/bash]

Note that the Z formatting of the two doubles (dsum1 and dsum2) is identical. This should be the case (for this program), since array was single precision: conversion from single to double is exact, and for this program the summation of the converted reals did not saturate the mantissa of the doubles (dsum1, dsum2).
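For completeness, a sketch of the extended program as described (the names dsum1 and dsum2 are from the post above; where the double tallies sit inside the loops is an assumption):

[bash]! rantest2.f90 -- rantest extended with double tallies dsum1 and dsum2
program rantest2
    implicit none
    intrinsic random_number
    real, allocatable :: array(:)
    real :: sum1, sum2, clipLow, clipHigh
    real(8) :: dsum1, dsum2
    integer, parameter :: N = 100000
    integer :: I
    allocate(array(N))
    call random_number(array)
    sum1 = 0.0
    dsum1 = 0.0_8
    do I=1,N
      sum1 = sum1 + array(I)
      dsum1 = dsum1 + dble(array(I))       ! double tally, array order
    end do
    sum2 = 0.0
    dsum2 = 0.0_8
    clipLow = 0.0
    clipHigh = 1.0E-30
    do while(clipLow .lt. 1.0)
        do I=1,N
          if((array(I) .le. clipHigh) .and. (array(I) .gt. clipLow)) then
            sum2 = sum2 + array(I)
            dsum2 = dsum2 + dble(array(I))  ! double tally, binned order
          end if
        end do
        clipLow = clipHigh
        clipHigh = clipHigh * 2.0
        if(clipHigh .gt. 1.0) clipHigh = 1.0
    end do
    write(*,100) sum1, sum2, dsum1, dsum2
100 format(1x,Z8,1x,Z8,1x,Z16,1x,Z16)
    write(*,200) sum1, sum2, dsum1, dsum2
200 format(1x,E15.10,1x,E15.10,1x,D15.10,1x,D15.10)
end program rantest2
[/bash]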

George has to answer for himself: do you want consistent results, accurate results, or accurate and consistent results?

Jim Dempsey

TimP
Honored Contributor III
The original examples don't raise the well-known issues with sum accumulation; as no "fused multiply-add" was under discussion, the expectation of commutativity was reasonable, aside from possibilities related to inconsistent use of x87 extra precision.
jimdempseyatthecove
Honored Contributor III
>> the expectation of commutativity was reasonable

The expectation is reasonable, but in practice it cannot always be met. "Fused multiply-add" is not at issue here, though it may be one of the players when used by the compiler. The root cause is the sequence in which the accumulations are performed on values whose sum does not fit within the precision of the calculator (and in this case the FPU registers, the FPU temporaries, and the SSE registers all differ). Additionally, when the precision is insufficient, round-off rules apply, and the sequence of operations will affect the outcome.

The expectation rests on the requirement that evaluations be made with exact precision. Compiler optimizations, in particular common sub-expression elimination, combined with instruction/statement reordering to improve common sub-expression elimination, can affect the sequence in which the accumulations occur. When the numbers cannot maintain exact precision, commutativity cannot be assured.
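To see the reordering effect in source form, here is a minimal sketch that imitates what a vectorizing compiler may do to a reduction at -O2: accumulate several interleaved partial sums instead of one strict left-to-right sum (the 4-wide stride and the data scaling are illustrative assumptions):

[bash]! partial_sums.f90 -- sketch: reassociated (SIMD-style) vs strict summation
program partial_sums
    implicit none
    integer, parameter :: N = 100000     ! divisible by 4 for the strided loop
    real :: a(N), serial, lanes(4)
    integer :: i
    call random_number(a)
    a = (a - 0.5) * 1.0e6                ! mix of magnitudes and signs
    serial = 0.0
    do i = 1, N                          ! strict order: what -fp-model precise preserves
        serial = serial + a(i)
    end do
    lanes = 0.0
    do i = 1, N, 4                       ! four interleaved partial sums, like a 4-lane SIMD loop
        lanes = lanes + a(i:i+3)
    end do
    write(*,*) 'strict  :', serial
    write(*,*) 'partial :', sum(lanes)   ! typically differs in the low-order bits
end program partial_sums
[/bash]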

Jim Dempsey


