- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
When i tried to test some cases for unrolling loop with below code, i found the difference of the results.
as i compiled the code with ifort, the results showed difference between results of two loops.
>> 16814100 for 1st loop and 16814098 for second loop
but with smaller number of it's iteration, the error does not occur...!
i cannot understand it... what's the problem..?TT
used compiler version
ifort (IFORT) 17.0.1 20161005
PROGRAM test implicit none integer :: i, a, idxl, idxl2 integer, parameter :: sz=4100 real, dimension(sz) :: numbers, results real :: start_time, end_time a = 1 numbers = (/((i*a),i=1,sz)/) !######################################################################### idxl = 0 start_time = 0 end_time =0 CALL cpu_time(start_time) do i = 1,sz idxl = idxl + numbers(i) + i enddo CALL cpu_time(end_time) print *, 'first loop took ', end_time-start_time, ' secs.' print *, 'first results: ', idxl print *, 'last value in 1st loop: ', numbers(i-1), i-1 print *, '===================================================' !########################################################################## idxl2 = 0 start_time = 0 end_time =0 CALL cpu_time(start_time) DO i = 1,sz,4 idxl2 = idxl2 + numbers(i) + i idxl2 = idxl2 + numbers(i+1) + i+1 idxl2 = idxl2 + numbers(i+2) + i+2 idxl2 = idxl2 + numbers(i+3) + i+3 ENDDO CALL cpu_time(end_time) print *, 'second loop took ', end_time-start_time, ' secs.' print *, 'second results: ', idxl2 END
==== ifort results ====
$ ifort test.f90 -o test.exe
$ ./test.exe
first loop took 0.0000000E+00 secs.
first results: 16814100
last value in 1st loop: 4100.000 4100
===================================================
second loop took 0.0000000E+00 secs.
second results: 16814098
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Version 19.1 Win32 Debug and Release gives same results
Version 19.1 x64 Debug gives same results
Version 19.1 x64 Release gives different results ***
I suspect the issue is the expression you provided "idxl2 + numbers(i) + i" (and its unrolled version) is a mixed integer and real expression. In the unrolled version the accumulator idxl2 is not updated 4 times. Instead, the expression is evaluated and held in temporary register as REAL as opposed to as INTEGER. The side effect of this is the precision of REAL has fewer bits (24 bits) than that of integer (32 bits). This is not a problem with relatively smaller REAL integer values (< 2**24 = 8,388,608) but your result is twice that, and as a result is one bit larger and is subject to drop the odd bit when sum is > 2**24. Modifying your code slightly gives the correct desired result:
program RoundOffError implicit none integer :: i, a, idxl, idxl2 integer, parameter :: sz=4100 real, dimension(sz) :: numbers, results real :: start_time, end_time a = 1 numbers = (/((i*a),i=1,sz)/) !######################################################################### idxl = 0 start_time = 0 end_time =0 CALL cpu_time(start_time) do i = 1,sz idxl = idxl + int(numbers(i)) + i enddo CALL cpu_time(end_time) print *, 'first loop took ', end_time-start_time, ' secs.' print *, 'first results: ', idxl print *, 'last value in 1st loop: ', numbers(i-1), i-1 print *, '===================================================' !########################################################################## idxl2 = 0 start_time = 0 end_time =0 CALL cpu_time(start_time) DO i = 1,sz,4 idxl2 = idxl2 + int(numbers(i)) + i idxl2 = idxl2 + int(numbers(i+1)) + i+1 idxl2 = idxl2 + int(numbers(i+2)) + i+2 idxl2 = idxl2 + int(numbers(i+3)) + i+3 ENDDO CALL cpu_time(end_time) print *, 'second loop took ', end_time-start_time, ' secs.' print *, 'second results: ', idxl2 end program RoundOffError
Not, as was written, the answer with the round off error was as correct as the one without round off errors. It is the responsibility of the programmer to formulate the code to provide the results within the desired precision.
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Lisp teaches you that the side effect may not be as required -- you always check.
A real has a known error - every single time. if you add them enough you will get a statistical difference. Of course you use real8 etc... which only delays the problem.
mixing Integers and reals is a really bad idea. Your original code is poor, that is the main problem. I suggest a really good book by Abelson and Sussman Structure and Interpretation of Computer Programs - 2nd Edition (MIT Electrical Engineering and Computer Science) - of course Winston and Horne is more practical and Knuth should be on your desk. Although he might object if you grab the body and not the books.
Abelson is brilliant, of course after that you should read Alonzo Church's Lambda Calculus - after that you will not write line 9 10 or 18 in that form or probably use print, although that is personal preference - albeit a poor one.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Jim,
thanks for detail explanation.
But, when i tried compiling my code with gfortran, the result shows different value from the one of ifort.
Then, do the compilers have different rounding digits?
as far as i understood what you said, i think two results will shows same round-off-error based on the size of the bits of real.
John,
thanks for recommending the books and more details. i'll check all of those books.
Thanks again
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
>>Then, do the compilers have different rounding digits?
You'd have to look at the disassembly code to know the difference. If I were to guess, one is converting the operands in the expression(s) from REAL to INTEGER and the other is converting the operands in the expression(s) from INTEGER to REAL (then performing the expression before converting back to INTEGER for the result). This is somewhat similar to an ambiguous expression written without sufficient ()'s and being instructed to observe the ()'s.
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Dave Eklund's The Perils of Real Numbers series (three articles) may be worth a read here, especially his discussion of "Bankers' Rounding".
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Conte and de Boore wrote a standard text book on numerical analysis back in the 70s. I used this book in my education and it has stood the test of time.
The book also has a lot of sample Fortran code. Of course it is not 2008 Fortran but it is pretty good for learning.
The book is uploaded to the web, but I have place a copy here.
Rounding error is a curse -- but a necessary curse, we all have to deal with it. It pops up in unexpected places but one always has to look out for it.
I enclose a sample for setting up a module for base numbers. This comes from others on this list server, I only copied the idea. The use of base.f90 saves a lot of retyping and looking for constants.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I did some changes which might show a way forward. The third option is faster in my tests, but might not be for you.
I was trying to keep the result in AVX registers. Also CPU_TIME (processor usage) is not an accurate timer, while SYSTEM_CLOCK (elapsed time) can be better.
PROGRAM test implicit none integer :: i, a, idxl, idxl2, idxl3(4) integer, parameter :: sz=4100 real, dimension(sz) :: numbers, results real*8 :: start_time, end_time a = 1 numbers = (/((i*a),i=1,sz)/) !######################################################################### idxl = 0 start_time = 0 end_time =0 CALL elapse_time (start_time) do i = 1,sz idxl = idxl + numbers(i) + i enddo CALL elapse_time (end_time) print *, 'first loop took ', end_time-start_time, ' secs.' print *, 'first results: ', idxl print *, 'last value in 1st loop: ', numbers(i-1), i-1 print *, '===================================================' !########################################################################## idxl2 = 0 start_time = 0 end_time =0 CALL elapse_time (start_time) DO i = 1,sz-3,4 idxl2 = idxl2 + numbers(i) + i idxl2 = idxl2 + numbers(i+1) + i+1 idxl2 = idxl2 + numbers(i+2) + i+2 idxl2 = idxl2 + numbers(i+3) + i+3 ENDDO CALL elapse_time (end_time) print *, 'second loop took ', end_time-start_time, ' secs.' print *, 'second results: ', idxl2 !########################################################################## idxl3 = 0 start_time = 0 end_time =0 CALL elapse_time(start_time) DO i = 1,sz-3,4 idxl3(1) = idxl3(1) + numbers(i) + i idxl3(2) = idxl3(2) + numbers(i+1) + i+1 idxl3(3) = idxl3(3) + numbers(i+2) + i+2 idxl3(4) = idxl3(4) + numbers(i+3) + i+3 ENDDO idxl2 = sum(idxl3) CALL elapse_time (end_time) print *, 'third loop took ', end_time-start_time, ' secs.' print *, 'third results: ', idxl2 END subroutine elapse_time (time) real*8 time integer*8 tick,rate call system_clock (tick, rate) time = dble(tick) / dble(rate) end subroutine elapse_time
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
John,
Your code does not address the issue the accumulating sum of integer values within the REAL array numbers exceeding the 24-bit mantissa precision of REAL (32-bit) floating point numbers. While for the current implementation using idxl3 array for sz=4100 will not exhibit a problem (each cell will approximate 16814100 / 4 = 4,203,525 or about 1/2 the precision capacity when all bits required), use of a larger sz (~5000) may exhibit the problem.
If it is known that the array numbers will hold only integer values then the proper route may be to declare the array as INTEGER not REAL. Though this may postpone the point of error until the sum exceeds 2**31 (and in which case the summation should be placed into an INTEGER*8).
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Jim,
I was looking at why the loop unrolling did not provide any performance benefit and not paying more attention to the main problem !
The most practical solution to round-off is to convert to REAL(8) accumulators. (high precision float that does not overflow and minimises round-off)
The use of 4 accumulators in the DO loop can have a marginal improvement with more varied values for real numbers, similar to what can be identified in multi-threaded examples.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
When the values in array numbers are always integers (and result and intermediaries integer) the best method is to have the array as INTEGER as opposed to REAL. IOW try not to mix expressions between integer and real. Simplifying this in this manner can provide sufficient information to the compiler to properly optimize (vectorize and unroll) the single statement loop.
Note, in the following source+assembly listing the single statement loop with int(numbers) is both vectorized (~equivalent to an unroll factor of 4) .AND. the vector version is unrolled an additional 8 (together ~equivalent to an unroll factor of 32):
;;; do i = 1,sz movdqu xmm1, XMMWORD PTR [_2il0floatpacket.0] ;16.9 xor eax, eax ;16.9 pxor xmm0, xmm0 ;11.9 lea rdx, QWORD PTR [__ImageBase] ;16.9 ; LOE rax rdx rbx rsi rdi r12 r13 r15d xmm0 xmm1 xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15 .B1.8:: ; Preds .B1.8 .B1.7 ; Execution count [4.10e+003] ;;; idxl = idxl + int(numbers(i)) + i cvttps2dq xmm2, XMMWORD PTR [imagerel(ROUNDOFFERROR$NUMBERS.0.1)+rdx+rax*4] ;17.26 cvttps2dq xmm3, XMMWORD PTR [imagerel(ROUNDOFFERROR$NUMBERS.0.1)+16+rdx+rax*4] ;17.26 cvttps2dq xmm5, XMMWORD PTR [imagerel(ROUNDOFFERROR$NUMBERS.0.1)+32+rdx+rax*4] ;17.26 cvttps2dq xmm4, XMMWORD PTR [imagerel(ROUNDOFFERROR$NUMBERS.0.1)+48+rdx+rax*4] ;17.26 paddd xmm2, xmm12 ;17.12 paddd xmm12, xmm1 ;17.12 paddd xmm0, xmm2 ;17.24 paddd xmm3, xmm12 ;17.12 paddd xmm12, xmm1 ;17.12 paddd xmm0, xmm3 ;17.24 cvttps2dq xmm2, XMMWORD PTR [imagerel(ROUNDOFFERROR$NUMBERS.0.1)+64+rdx+rax*4] ;17.26 cvttps2dq xmm3, XMMWORD PTR [imagerel(ROUNDOFFERROR$NUMBERS.0.1)+112+rdx+rax*4] ;17.26 paddd xmm5, xmm12 ;17.12 paddd xmm12, xmm1 ;17.12 paddd xmm5, xmm0 ;17.24 paddd xmm4, xmm12 ;17.12 cvttps2dq xmm0, XMMWORD PTR [imagerel(ROUNDOFFERROR$NUMBERS.0.1)+80+rdx+rax*4] ;17.26 paddd xmm12, xmm1 ;17.12 paddd xmm5, xmm4 ;17.24 paddd xmm2, xmm12 ;17.12 paddd xmm12, xmm1 ;17.12 paddd xmm2, xmm5 ;17.24 paddd xmm0, xmm12 ;17.12 paddd xmm2, xmm0 ;17.24 paddd xmm12, xmm1 ;17.12 cvttps2dq xmm0, XMMWORD PTR [imagerel(ROUNDOFFERROR$NUMBERS.0.1)+96+rdx+rax*4] ;17.26 paddd xmm0, xmm12 ;17.12 paddd xmm12, xmm1 ;17.12 add rax, 32 ;16.9 paddd xmm0, xmm2 ;17.24 paddd xmm3, xmm12 ;17.12 paddd xmm12, xmm1 ;17.12 paddd xmm0, xmm3 ;17.24 cmp rax, 4096 ;16.9 jb .B1.8 ; Prob 99% ;16.9 ; LOE rax rdx rbx rsi rdi r12 r13 r15d xmm0 xmm1 xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15 .B1.9:: ; Preds .B1.8 ; Execution count [1.00e+000] movdqa xmm1, xmm0 ;11.9 ;;; enddo
Also note, that the convert from REAL to INTEGER is occurring on the memory fetch, thus the performance impact is minimal.
Using the self unrolled loop (with int(numbers(i)) confuses the optimization and performs the loop using scalars:
(then hand unrolled x4 is compiler unrolled x2 yielding ~x8 unroll ***scalar*** code)
;;; DO i = 1,sz,4 ;;; idxl2 = idxl2 + int(numbers(i)) + i ;;; idxl2 = idxl2 + int(numbers(i+1)) + i+1 ;;; idxl2 = idxl2 + int(numbers(i+2)) + i+2 ;;; idxl2 = idxl2 + int(numbers(i+3)) + i+3 mov r9d, r15d ;37.12 xor r8d, r8d ;33.9 lea rcx, QWORD PTR [__ImageBase] ;33.9 ; LOE rcx rbx rsi rdi r8 r12 r13 r9d r14d r15d xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm13 xmm14 xmm15 .B1.21:: ; Preds .B1.21 .B1.20 ; Execution count [5.12e+002] cvttss2si r10d, DWORD PTR [imagerel(ROUNDOFFERROR$NUMBERS.0.1)+rcx+r8] ;34.28 cvttss2si eax, DWORD PTR [imagerel(ROUNDOFFERROR$NUMBERS.0.1)+8+rcx+r8] ;36.28 cvttss2si r11d, DWORD PTR [imagerel(ROUNDOFFERROR$NUMBERS.0.1)+4+rcx+r8] ;35.28 lea edx, DWORD PTR [r10+r15*8] ;34.12 cvttss2si r10d, DWORD PTR [imagerel(ROUNDOFFERROR$NUMBERS.0.1)+12+rcx+r8] ;37.28 add edx, r11d ;37.12 lea eax, DWORD PTR [rax+r15*8] ;35.12 lea r11d, DWORD PTR [r10+r15*8] ;35.46 add eax, r11d ;35.26 add edx, eax ;37.12 cvttss2si eax, DWORD PTR [imagerel(ROUNDOFFERROR$NUMBERS.0.1)+16+rcx+r8] ;34.28 cvttss2si r10d, DWORD PTR [imagerel(ROUNDOFFERROR$NUMBERS.0.1)+24+rcx+r8] ;36.28 cvttss2si r11d, DWORD PTR [imagerel(ROUNDOFFERROR$NUMBERS.0.1)+28+rcx+r8] ;37.28 lea edx, DWORD PTR [10+rdx+r15*8] ;37.12 add r14d, edx ;34.26 lea eax, DWORD PTR [rax+r15*8] ;34.12 cvttss2si edx, DWORD PTR [imagerel(ROUNDOFFERROR$NUMBERS.0.1)+20+rcx+r8] ;35.28 add eax, edx ;37.12 lea edx, DWORD PTR [r10+r15*8] ;35.12 add r8, 32 ;33.9 lea r11d, DWORD PTR [r11+r15*8] ;35.46 add edx, r11d ;35.26 add eax, edx ;37.12 lea eax, DWORD PTR [26+rax+r15*8] ;37.12 inc r15d ;33.9 add r9d, eax ;34.26 cmp r15d, 512 ;33.9 jb .B1.21 ; Prob 99% ;33.9 ; LOE rcx rbx rsi rdi r8 r12 r13 r9d r14d r15d xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm13 xmm14 xmm15 .B1.22:: ; Preds .B1.21 ; Execution count [1.00e+000] cvttss2si edx, DWORD PTR [ROUNDOFFERROR$NUMBERS.0.1+16384] ;34.28 cvttss2si eax, DWORD PTR [ROUNDOFFERROR$NUMBERS.0.1+16388] ;35.28 cvttss2si r8d, DWORD PTR [ROUNDOFFERROR$NUMBERS.0.1+16396] ;37.28 add r14d, r9d ;37.12 lea r10d, DWORD PTR [4096+rax+rdx] ;37.12 cvttss2si r9d, DWORD PTR [ROUNDOFFERROR$NUMBERS.0.1+16392] ;36.28 lea r11d, DWORD PTR [8192+r8+r9] ;35.26 lea r15d, DWORD PTR [4106+r10+r11] ;37.12 add r14d, r15d ;34.26 ;;; ENDDO
Conclusion: trying to improve the code through hand unrolling is counter-productive in this case.
Jim Dempsey
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page