Jim,

Shin__Dongbin · ‎03-06-2020

When i tried to test some cases for unrolling loop with below code, i found the difference of the results.

as i compiled the code with ifort, the results showed difference between results of two loops.

>> 16814100 for 1st loop and 16814098 for second loop

but with smaller number of it's iteration, the error does not occur...!

i cannot understand it... what's the problem..?TT

used compiler version

ifort (IFORT) 17.0.1 20161005

      PROGRAM test

        implicit none
        integer :: i, a, idxl, idxl2
        integer, parameter :: sz=4100
        real, dimension(sz) :: numbers, results
        real :: start_time, end_time

        a = 1
        numbers = (/((i*a),i=1,sz)/)
!#########################################################################
        idxl = 0
        start_time = 0
        end_time =0

        CALL cpu_time(start_time)
        do i = 1,sz
           idxl = idxl + numbers(i) + i
        enddo
        CALL cpu_time(end_time)


        print *, 'first loop took ', end_time-start_time, ' secs.'
        print *, 'first results: ', idxl
        print *, 'last value in 1st loop: ', numbers(i-1), i-1
        print *, '==================================================='
!##########################################################################

        idxl2 = 0
        start_time = 0
        end_time =0

        CALL cpu_time(start_time)
        DO i = 1,sz,4
           idxl2 = idxl2 + numbers(i) + i
           idxl2 = idxl2 + numbers(i+1) + i+1
           idxl2 = idxl2 + numbers(i+2) + i+2
           idxl2 = idxl2 + numbers(i+3) + i+3
        ENDDO
        CALL cpu_time(end_time)

        print *, 'second loop took ', end_time-start_time, ' secs.'
        print *, 'second results: ', idxl2
      END

==== ifort results ====

$ ifort test.f90 -o test.exe

$ ./test.exe

first loop took 0.0000000E+00 secs.
first results: 16814100
last value in 1st loop: 4100.000 4100
===================================================
second loop took 0.0000000E+00 secs.
second results: 16814098

jimdempseyatthecove · ‎03-07-2020

Version 19.1 Win32 Debug and Release gives same results
Version 19.1 x64 Debug gives same results
Version 19.1 x64 Release gives different results ***

I suspect the issue is the expression you provided "idxl2 + numbers(i) + i" (and its unrolled version) is a mixed integer and real expression. In the unrolled version the accumulator idxl2 is not updated 4 times. Instead, the expression is evaluated and held in temporary register as REAL as opposed to as INTEGER. The side effect of this is the precision of REAL has fewer bits (24 bits) than that of integer (32 bits). This is not a problem with relatively smaller REAL integer values (< 2**24 = 8,388,608) but your result is twice that, and as a result is one bit larger and is subject to drop the odd bit when sum is > 2**24. Modifying your code slightly gives the ~~correct~~ desired result:

program RoundOffError
        implicit none
        integer :: i, a, idxl, idxl2
        integer, parameter :: sz=4100
        real, dimension(sz) :: numbers, results
        real :: start_time, end_time

        a = 1
        numbers = (/((i*a),i=1,sz)/)
!#########################################################################
        idxl = 0
        start_time = 0
        end_time =0

        CALL cpu_time(start_time)
        do i = 1,sz
           idxl = idxl + int(numbers(i)) + i
        enddo
        CALL cpu_time(end_time)


        print *, 'first loop took ', end_time-start_time, ' secs.'
        print *, 'first results: ', idxl
        print *, 'last value in 1st loop: ', numbers(i-1), i-1
        print *, '==================================================='
!##########################################################################

        idxl2 = 0
        start_time = 0
        end_time =0

        CALL cpu_time(start_time)
        DO i = 1,sz,4
           idxl2 = idxl2 + int(numbers(i)) + i
           idxl2 = idxl2 + int(numbers(i+1)) + i+1
           idxl2 = idxl2 + int(numbers(i+2)) + i+2
           idxl2 = idxl2 + int(numbers(i+3)) + i+3
        ENDDO
        CALL cpu_time(end_time)

        print *, 'second loop took ', end_time-start_time, ' secs.'
        print *, 'second results: ', idxl2
end program RoundOffError

Not, as was written, the answer with the round off error was as correct as the one without round off errors. It is the responsibility of the programmer to formulate the code to provide the results within the desired precision.

Jim Dempsey

JohnNichols · ‎03-07-2020

Lisp teaches you that the side effect may not be as required -- you always check.

A real has a known error - every single time. if you add them enough you will get a statistical difference. Of course you use real8 etc... which only delays the problem.

mixing Integers and reals is a really bad idea. Your original code is poor, that is the main problem. I suggest a really good book by Abelson and Sussman Structure and Interpretation of Computer Programs - 2nd Edition (MIT Electrical Engineering and Computer Science) - of course Winston and Horne is more practical and Knuth should be on your desk. Although he might object if you grab the body and not the books.

Abelson is brilliant, of course after that you should read Alonzo Church's Lambda Calculus - after that you will not write line 9 10 or 18 in that form or probably use print, although that is personal preference - albeit a poor one.

Shin__Dongbin · ‎03-08-2020

Jim,

thanks for detail explanation.

But, when i tried compiling my code with gfortran, the result shows different value from the one of ifort.

Then, do the compilers have different rounding digits?

as far as i understood what you said, i think two results will shows same round-off-error based on the size of the bits of real.

John,

thanks for recommending the books and more details. i'll check all of those books.

Thanks again

jimdempseyatthecove · ‎03-09-2020

>>Then, do the compilers have different rounding digits?

You'd have to look at the disassembly code to know the difference. If I were to guess, one is converting the operands in the expression(s) from REAL to INTEGER and the other is converting the operands in the expression(s) from INTEGER to REAL (then performing the expression before converting back to INTEGER for the result). This is somewhat similar to an ambiguous expression written without sufficient ()'s and being instructed to observe the ()'s.

Jim Dempsey

Steve_Lionel · ‎03-09-2020

Dave Eklund's The Perils of Real Numbers series (three articles) may be worth a read here, especially his discussion of "Bankers' Rounding".

JohnNichols · ‎03-09-2020

Conte and de Boore wrote a standard text book on numerical analysis back in the 70s. I used this book in my education and it has stood the test of time.

The book also has a lot of sample Fortran code. Of course it is not 2008 Fortran but it is pretty good for learning.

The book is uploaded to the web, but I have place a copy here.

Rounding error is a curse -- but a necessary curse, we all have to deal with it. It pops up in unexpected places but one always has to look out for it.

I enclose a sample for setting up a module for base numbers. This comes from others on this list server, I only copied the idea. The use of base.f90 saves a lot of retyping and looking for constants.

John_Campbell · ‎03-09-2020

I did some changes which might show a way forward. The third option is faster in my tests, but might not be for you.

I was trying to keep the result in AVX registers. Also CPU_TIME (processor usage) is not an accurate timer, while SYSTEM_CLOCK (elapsed time) can be better.

      PROGRAM test

        implicit none
        integer :: i, a, idxl, idxl2, idxl3(4)
        integer, parameter :: sz=4100
        real, dimension(sz) :: numbers, results
        real*8 :: start_time, end_time

        a = 1
        numbers = (/((i*a),i=1,sz)/)
!#########################################################################
        idxl = 0
        start_time = 0
        end_time =0

        CALL elapse_time (start_time)
        do i = 1,sz
           idxl = idxl + numbers(i) + i
        enddo
        CALL elapse_time (end_time)


        print *, 'first loop took ', end_time-start_time, ' secs.'
        print *, 'first results: ', idxl
        print *, 'last value in 1st loop: ', numbers(i-1), i-1
        print *, '==================================================='
!##########################################################################

        idxl2 = 0
        start_time = 0
        end_time =0

        CALL elapse_time (start_time)
        DO i = 1,sz-3,4
           idxl2 = idxl2 + numbers(i)   + i
           idxl2 = idxl2 + numbers(i+1) + i+1
           idxl2 = idxl2 + numbers(i+2) + i+2
           idxl2 = idxl2 + numbers(i+3) + i+3
        ENDDO
        CALL elapse_time (end_time)

        print *, 'second loop took ', end_time-start_time, ' secs.'
        print *, 'second results: ', idxl2
!##########################################################################

        idxl3 = 0
        start_time = 0
        end_time =0

        CALL elapse_time(start_time)
        DO i = 1,sz-3,4
           idxl3(1) = idxl3(1) + numbers(i)   + i
           idxl3(2) = idxl3(2) + numbers(i+1) + i+1
           idxl3(3) = idxl3(3) + numbers(i+2) + i+2
           idxl3(4) = idxl3(4) + numbers(i+3) + i+3
        ENDDO
        idxl2 = sum(idxl3)
        CALL elapse_time (end_time)

        print *, 'third loop took ', end_time-start_time, ' secs.'
        print *, 'third results: ', idxl2
      END
      
      subroutine elapse_time (time)
      real*8 time
      integer*8 tick,rate
      call system_clock (tick, rate)
      time = dble(tick) / dble(rate)
      end subroutine elapse_time

jimdempseyatthecove · ‎03-10-2020

John,

Your code does not address the issue the accumulating sum of integer values within the REAL array numbers exceeding the 24-bit mantissa precision of REAL (32-bit) floating point numbers. While for the current implementation using idxl3 array for sz=4100 will not exhibit a problem (each cell will approximate 16814100 / 4 = 4,203,525 or about 1/2 the precision capacity when all bits required), use of a larger sz (~5000) may exhibit the problem.

If it is known that the array numbers will hold only integer values then the proper route may be to declare the array as INTEGER not REAL. Though this may postpone the point of error until the sum exceeds 2**31 (and in which case the summation should be placed into an INTEGER*8).

Jim Dempsey

John_Campbell · ‎03-11-2020

Jim,

I was looking at why the loop unrolling did not provide any performance benefit and not paying more attention to the main problem !

The most practical solution to round-off is to convert to REAL(8) accumulators. (high precision float that does not overflow and minimises round-off)

The use of 4 accumulators in the DO loop can have a marginal improvement with more varied values for real numbers, similar to what can be identified in multi-threaded examples.

jimdempseyatthecove · ‎03-13-2020

When the values in array numbers are always integers (and result and intermediaries integer) the best method is to have the array as INTEGER as opposed to REAL. IOW try not to mix expressions between integer and real. Simplifying this in this manner can provide sufficient information to the compiler to properly optimize (vectorize and unroll) the single statement loop.

Note, in the following source+assembly listing the single statement loop with int(numbers) is both vectorized (~equivalent to an unroll factor of 4) .AND. the vector version is unrolled an additional 8 (together ~equivalent to an unroll factor of 32):

;;;         do i = 1,sz

        movdqu    xmm1, XMMWORD PTR [_2il0floatpacket.0]        ;16.9
        xor       eax, eax                                      ;16.9
        pxor      xmm0, xmm0                                    ;11.9
        lea       rdx, QWORD PTR [__ImageBase]                  ;16.9
                                ; LOE rax rdx rbx rsi rdi r12 r13 r15d xmm0 xmm1 xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15
.B1.8::                         ; Preds .B1.8 .B1.7
                                ; Execution count [4.10e+003]

;;;            idxl = idxl + int(numbers(i)) + i

        cvttps2dq xmm2, XMMWORD PTR [imagerel(ROUNDOFFERROR$NUMBERS.0.1)+rdx+rax*4] ;17.26
        cvttps2dq xmm3, XMMWORD PTR [imagerel(ROUNDOFFERROR$NUMBERS.0.1)+16+rdx+rax*4] ;17.26
        cvttps2dq xmm5, XMMWORD PTR [imagerel(ROUNDOFFERROR$NUMBERS.0.1)+32+rdx+rax*4] ;17.26
        cvttps2dq xmm4, XMMWORD PTR [imagerel(ROUNDOFFERROR$NUMBERS.0.1)+48+rdx+rax*4] ;17.26
        paddd     xmm2, xmm12                                   ;17.12
        paddd     xmm12, xmm1                                   ;17.12
        paddd     xmm0, xmm2                                    ;17.24
        paddd     xmm3, xmm12                                   ;17.12
        paddd     xmm12, xmm1                                   ;17.12
        paddd     xmm0, xmm3                                    ;17.24
        cvttps2dq xmm2, XMMWORD PTR [imagerel(ROUNDOFFERROR$NUMBERS.0.1)+64+rdx+rax*4] ;17.26
        cvttps2dq xmm3, XMMWORD PTR [imagerel(ROUNDOFFERROR$NUMBERS.0.1)+112+rdx+rax*4] ;17.26
        paddd     xmm5, xmm12                                   ;17.12
        paddd     xmm12, xmm1                                   ;17.12
        paddd     xmm5, xmm0                                    ;17.24
        paddd     xmm4, xmm12                                   ;17.12
        cvttps2dq xmm0, XMMWORD PTR [imagerel(ROUNDOFFERROR$NUMBERS.0.1)+80+rdx+rax*4] ;17.26
        paddd     xmm12, xmm1                                   ;17.12
        paddd     xmm5, xmm4                                    ;17.24
        paddd     xmm2, xmm12                                   ;17.12
        paddd     xmm12, xmm1                                   ;17.12
        paddd     xmm2, xmm5                                    ;17.24
        paddd     xmm0, xmm12                                   ;17.12
        paddd     xmm2, xmm0                                    ;17.24
        paddd     xmm12, xmm1                                   ;17.12
        cvttps2dq xmm0, XMMWORD PTR [imagerel(ROUNDOFFERROR$NUMBERS.0.1)+96+rdx+rax*4] ;17.26
        paddd     xmm0, xmm12                                   ;17.12
        paddd     xmm12, xmm1                                   ;17.12
        add       rax, 32                                       ;16.9
        paddd     xmm0, xmm2                                    ;17.24
        paddd     xmm3, xmm12                                   ;17.12
        paddd     xmm12, xmm1                                   ;17.12
        paddd     xmm0, xmm3                                    ;17.24
        cmp       rax, 4096                                     ;16.9
        jb        .B1.8         ; Prob 99%                      ;16.9
                                ; LOE rax rdx rbx rsi rdi r12 r13 r15d xmm0 xmm1 xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15
.B1.9::                         ; Preds .B1.8
                                ; Execution count [1.00e+000]
        movdqa    xmm1, xmm0                                    ;11.9

;;;         enddo

Also note, that the convert from REAL to INTEGER is occurring on the memory fetch, thus the performance impact is minimal.

Using the self unrolled loop (with int(numbers(i)) confuses the optimization and performs the loop using scalars:
(then hand unrolled x4 is compiler unrolled x2 yielding ~x8 unroll ***scalar*** code)

;;;         DO i = 1,sz,4
;;;            idxl2 = idxl2 + int(numbers(i)) + i
;;;            idxl2 = idxl2 + int(numbers(i+1)) + i+1
;;;            idxl2 = idxl2 + int(numbers(i+2)) + i+2
;;;            idxl2 = idxl2 + int(numbers(i+3)) + i+3

        mov       r9d, r15d                                     ;37.12
        xor       r8d, r8d                                      ;33.9
        lea       rcx, QWORD PTR [__ImageBase]                  ;33.9
                                ; LOE rcx rbx rsi rdi r8 r12 r13 r9d r14d r15d xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm13 xmm14 xmm15
.B1.21::                        ; Preds .B1.21 .B1.20
                                ; Execution count [5.12e+002]
        cvttss2si r10d, DWORD PTR [imagerel(ROUNDOFFERROR$NUMBERS.0.1)+rcx+r8] ;34.28
        cvttss2si eax, DWORD PTR [imagerel(ROUNDOFFERROR$NUMBERS.0.1)+8+rcx+r8] ;36.28
        cvttss2si r11d, DWORD PTR [imagerel(ROUNDOFFERROR$NUMBERS.0.1)+4+rcx+r8] ;35.28
        lea       edx, DWORD PTR [r10+r15*8]                    ;34.12
        cvttss2si r10d, DWORD PTR [imagerel(ROUNDOFFERROR$NUMBERS.0.1)+12+rcx+r8] ;37.28
        add       edx, r11d                                     ;37.12
        lea       eax, DWORD PTR [rax+r15*8]                    ;35.12
        lea       r11d, DWORD PTR [r10+r15*8]                   ;35.46
        add       eax, r11d                                     ;35.26
        add       edx, eax                                      ;37.12
        cvttss2si eax, DWORD PTR [imagerel(ROUNDOFFERROR$NUMBERS.0.1)+16+rcx+r8] ;34.28
        cvttss2si r10d, DWORD PTR [imagerel(ROUNDOFFERROR$NUMBERS.0.1)+24+rcx+r8] ;36.28
        cvttss2si r11d, DWORD PTR [imagerel(ROUNDOFFERROR$NUMBERS.0.1)+28+rcx+r8] ;37.28
        lea       edx, DWORD PTR [10+rdx+r15*8]                 ;37.12
        add       r14d, edx                                     ;34.26
        lea       eax, DWORD PTR [rax+r15*8]                    ;34.12
        cvttss2si edx, DWORD PTR [imagerel(ROUNDOFFERROR$NUMBERS.0.1)+20+rcx+r8] ;35.28
        add       eax, edx                                      ;37.12
        lea       edx, DWORD PTR [r10+r15*8]                    ;35.12
        add       r8, 32                                        ;33.9
        lea       r11d, DWORD PTR [r11+r15*8]                   ;35.46
        add       edx, r11d                                     ;35.26
        add       eax, edx                                      ;37.12
        lea       eax, DWORD PTR [26+rax+r15*8]                 ;37.12
        inc       r15d                                          ;33.9
        add       r9d, eax                                      ;34.26
        cmp       r15d, 512                                     ;33.9
        jb        .B1.21        ; Prob 99%                      ;33.9
                                ; LOE rcx rbx rsi rdi r8 r12 r13 r9d r14d r15d xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm13 xmm14 xmm15
.B1.22::                        ; Preds .B1.21
                                ; Execution count [1.00e+000]
        cvttss2si edx, DWORD PTR [ROUNDOFFERROR$NUMBERS.0.1+16384] ;34.28
        cvttss2si eax, DWORD PTR [ROUNDOFFERROR$NUMBERS.0.1+16388] ;35.28
        cvttss2si r8d, DWORD PTR [ROUNDOFFERROR$NUMBERS.0.1+16396] ;37.28
        add       r14d, r9d                                     ;37.12
        lea       r10d, DWORD PTR [4096+rax+rdx]                ;37.12
        cvttss2si r9d, DWORD PTR [ROUNDOFFERROR$NUMBERS.0.1+16392] ;36.28
        lea       r11d, DWORD PTR [8192+r8+r9]                  ;35.26
        lea       r15d, DWORD PTR [4106+r10+r11]                ;37.12
        add       r14d, r15d                                    ;34.26

;;;         ENDDO

Conclusion: trying to improve the code through hand unrolling is counter-productive in this case.

Jim Dempsey

loop error occurs when increasing it's iteration