I'm running the latest ifort 2017.1 and found this unwanted behavior. It's illustrated by a short program and the result after two different compilations.
> cat test_sum.f
INTEGER(8),PARAMETER :: N = 1024 * 1024 * 1024
INTEGER(8) :: SEED(1) = (/12345678/)
REAL(4) :: A(N)
REAL(8) :: B(N)
! Fill the array A with random numbers in the range [0 - 1].
CALL RANDOM_NUMBER( A )
! Copy the values to the double precision array B.
B(:) = A(:)
! Expected sum = N * 0.5 = 536870912 = 5.3687091E+08
! The sum may differ a bit due to the random numbers drawn.
PRINT *, 'Sum(A):', REAL(SUM(A))
PRINT *, 'Sum(B):', REAL(SUM(B))
END PROGRAM TEST_SUM
> ifort -i8 test_sum.f
> ifort -i8 -fp-model precise test_sum.f
So the real*8 array was summed up correctly, but not the real*4 array when compiled with the -fp-model precise option. The result was the same if the arrays A and B were allocated arrays.
>>Expected sum = N * 0.5 = 536870912 = 5.3687091E+08
Worst case difference could be N lsb's as you accumulate each partial sum. Average may be N*0.5 lsb's
Float/Single has 7.22 digits of precision. 10^7.22 = 16595869 ~= 1.66E+7
When the partial sum of the float approaches and exceeds 1.66E7, the additional additions effectively add 0.0.
The summation of the float array is off by (5.3687091E+08 - 1.66E+7) ~= 5.2E+8. The actual difference will depend on when the partial sum approaches and exceeds 8.3E+6. The aove is thumbnail sketch, a mathematician can provide a proper statistical analysis.
The results appear realistic.
Try merging this into your program and report back the results:
REAL(4) :: T(N/2) ! N must be even (it is) REAL(4) :: SUMA INTEGER(8) :: I ... T = A(1:N/2) + A(N/2+1:N) I = N / 2 DO WHILE(I .gt. 1) T(1:I/2) = T(1:I/2) + T(I/2+1:I) I = I / 2 END DO SUMA = T(1)
In case someone wonders why -fp-precise might be responsible for such a difference, that option requires SUM to be implemented as a single chain of serial addition. Without that option, ifort can choose to optimize performance [ by splitting the sum into several partial sums (typically 8 individual 4- or 8- wide parallel sums, depending on target ISA), and this would delay the effect Jim described.
After thinking about my post #3, it would have worked as intended if N were a power of 2. The following removes any restrictions on N (other than it be positive non-zero and fit in memory).
REAL(4) :: T(N/2+1) ! No restrictions on N REAL(4) :: SUMA INTEGER(8) :: I,J ... J = N I = J / 2 T = A(1:I) + A(I+1:I*2) ! In event J is odd, sum all but last IF(I*2 .NE J) THEN ! (if N (now J) were odd) I = I+1 T(I) = T(J) ! copy last to after partial sum ENDIF DO WHILE(I .gt. 1) J = I I = J / 2 T(1:I/2) = T(1:I/2) + T(I/2+1:I*2) IF(I*2 .NE J) THEN I = I+1 T(I) = T(J) ENDIF END DO SUMA = T(1)
Performing the same thing on the double arrays would yield a better result.