Hello!
I am running a simple do loop to change the values of a 21x21 double-precision real matrix in RELEASE mode. This loop takes 58 seconds (I call cpu_time() at the very beginning of the code as well as at the end). However, when I comment out the call to cpu_time, the running time becomes 87 seconds.
I am surprised to see such a large difference in running time (the only difference in the code is the call to cpu_time). Does anyone have a clue?
It seems that the difference disappears under DEBUG.
The code that I ran is as follows:
-------------------------------------------------------------------------------------------------------------------------------------
real(kind=8), dimension(21,21) :: OCC

call system_clock(time0)
call CPU_TIME(time0_CPU)

OCC = 1
do ic = 1, 4
   do i1 = 1, 90
      do i2 = 1, 2000000
         OCC = 9.D0*9.D0
      end do
   end do
end do

call CPU_TIME(time1_CPU)
call system_clock(time1, time_rate)
write(*,*) time1_CPU - time0_CPU
write(*,*) real((time1 - time0), 8) / real(time_rate, 8)
read(*,*)
stop
This is what Andrew is getting at:
First measure of optimization is to reduce compile-time sub-expressions:
real(kind=8), dimension(21,21) :: OCC
call system_clock(time0)
call CPU_TIME(time0_CPU)
OCC = 1
do ic = 1, 4
   do i1 = 1, 90
      do i2 = 1, 2000000
         OCC = 81.D0 ! was 9.D0*9.D0
      end do
   end do
end do
call CPU_TIME(time1_CPU)
call system_clock(time1, time_rate)
write(*,*) time1_CPU - time0_CPU
write(*,*) real((time1 - time0), 8) / real(time_rate, 8)
read(*,*)
stop
Second measure is to lift loop-invariant code out of the loop(s):
real(kind=8), dimension(21,21) :: OCC
call system_clock(time0)
call CPU_TIME(time0_CPU)
OCC = 1
OCC = 9.D0*9.D0
do ic = 1, 4
   do i1 = 1, 90
      do i2 = 1, 2000000
      end do
   end do
end do
call CPU_TIME(time1_CPU)
call system_clock(time1, time_rate)
write(*,*) time1_CPU - time0_CPU
write(*,*) real((time1 - time0), 8) / real(time_rate, 8)
read(*,*)
stop
Third measure is to remove "useless code" — in this case, the loops that do nothing and also do not reference the last value of the loop control variables:
real(kind=8), dimension(21,21) :: OCC
call system_clock(time0)
call CPU_TIME(time0_CPU)
OCC = 9.D0*9.D0
call CPU_TIME(time1_CPU)
call system_clock(time1, time_rate)
write(*,*) time1_CPU - time0_CPU
write(*,*) real((time1 - time0), 8) / real(time_rate, 8)
read(*,*)
stop
Note: if you did use the resultant loop control variables (ic and/or i1 and/or i2), any good optimizing compiler would simply set each variable to its exit value and not execute the loop(s) at all.
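As a sketch of that last point, this is roughly what the optimizer may effectively emit in place of the emptied loop nest above (not code you would write yourself — just the final-value assignments):

```
! do ic=1,4       exits with ic = 5
! do i1=1,90      exits with i1 = 91
! do i2=1,2000000 exits with i2 = 2000001
ic = 5
i1 = 91
i2 = 2000001
```

The loop bodies are gone, but any later reference to ic, i1, or i2 still sees the values the loops would have left behind.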
Fourth measure is to remove results that are generated but never used (remove the array OCC and the code that initializes it):
program
call system_clock(time0)
call CPU_TIME(time0_CPU)
call CPU_TIME(time1_CPU)
call system_clock(time1, time_rate)
write(*,*) time1_CPU - time0_CPU
write(*,*) real((time1 - time0), 8) / real(time_rate, 8)
read(*,*)
stop
end program
When writing sample code for performance-testing purposes, you have to ensure that compiler optimizations do not do something unexpected:
real(kind=8), dimension(21,21) :: OCC
call system_clock(time0)
call CPU_TIME(time0_CPU)
OCC = time0_CPU ! a value indeterminable at compile time
do ic = 1, 4
   do i1 = 1, 90
      do i2 = 1, 2000000
         OCC = OCC * 1.00000009D0 * 1.0000009D0 ! avoid overflow
      end do
   end do
end do
call CPU_TIME(time1_CPU)
call system_clock(time1, time_rate)
write(*,*) time1_CPU - time0_CPU
write(*,*) real((time1 - time0), 8) / real(time_rate, 8)
read(*,*)
! now use OCC in a manner that the compiler cannot predetermine
if (sum(OCC) .lt. 0.0) print *, "this shouldn't print"
stop
Also, consider using the OpenMP omp_get_wtime() function. The reason is that CPU_TIME returns the sum of the CPU time of all threads. If, for example, you enabled auto-parallelization, 4 threads were employed, and the actual runtime (wall clock) was 1/4 that of the single-thread version, CPU_TIME would return ~4x the wall-clock time (~the single-thread runtime).
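A minimal sketch of timing with both clocks, assuming the program is built with the OpenMP option so that omp_lib is available (the program and variable names here are illustrative, not from the original post):

```
program wtime_demo
   use omp_lib                          ! provides omp_get_wtime()
   implicit none
   real(kind=8) :: wall0, wall1, cpu0, cpu1
   real(kind=8), dimension(21,21) :: OCC
   integer :: i2

   call cpu_time(cpu0)
   wall0 = omp_get_wtime()              ! wall-clock seconds

   OCC = 1.0D0
   do i2 = 1, 2000000
      OCC = OCC * 1.00000009D0          ! work the optimizer cannot discard
   end do

   wall1 = omp_get_wtime()
   call cpu_time(cpu1)

   ! with threading, (cpu1-cpu0) can be ~Nthreads times (wall1-wall0)
   write(*,*) 'wall:', wall1 - wall0, '  cpu:', cpu1 - cpu0
   if (sum(OCC) < 0.0D0) print *, "this shouldn't print"
end program wtime_demo
```

For a single-threaded build the two numbers should roughly agree; any large ratio between them is a hint that multiple threads (or an oversubscribed machine) are involved.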
Jim Dempsey
Perhaps you do not see the humor in your description:
This loop takes me 58 seconds. (I call cpu_time() at the very beginning of the codes as well as at the end). However, when I comment out the call cpu_time, the running time becomes 87 seconds.
If that were really true, adding a few more calls to CPU_TIME() should reduce your run time to zero (well, we know that negative run times violate the principle of causality, so we won't go that far).
I certainly understand that calling cpu_time does nothing to reduce the running time of the code. What I would like to ask (not as a joke) is why, under RELEASE mode, commenting out the call to CPU_TIME can change the running time so much.
After posting this question here, I tried some more, and it seems that the issue disappears if I give the matrix initial values when I declare it.
mecej4 wrote:
Perhaps you do not see the humor in your description:
This loop takes me 58 seconds. (I call cpu_time() at the very beginning of the codes as well as at the end). However, when I comment out the call cpu_time, the running time becomes 87 seconds.
If that were really true, adding a few more calls to CPU_TIME() should reduce your run time to zero (well, we know that negative run times violate the principle of causality, so we won't go that far).
Given that the loop does nothing (the loop indices are not used), the optimiser can have great fun reducing your code to very little.
Try the test case with optimisation off (you get that in debug mode).
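For reference, with Intel Fortran you can control the optimization level explicitly rather than relying on the configuration name (these are the standard Intel compiler switches; the source filename is just an example):

```
ifort /Od test.f90      (Windows: optimization disabled — the DEBUG default)
ifort /O2 test.f90      (Windows: full optimization — the RELEASE default)
ifort -O0 test.f90      (Linux equivalent of /Od)
ifort -O2 test.f90      (Linux equivalent of /O2)
```

Comparing the same source built with the two levels makes it obvious how much of the loop nest the optimizer is eliminating.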
I am a beginner, and I speculated that some configuration difference between RELEASE and DEBUG leads to the issue I described.
I was guessing it came from some checks (e.g. on the initial values of the array) being skipped. Thank you very much for letting me know that it comes from the optimization options!
andrew_4619 wrote:
Given that the loop does nothing (the loop indices are not used), the optimiser can have great fun reducing your code to very little.
Try the test case with optimisation off (you get that in debug mode).
Thank you so much for your detailed explanation of how the optimization works. I am a beginner. I learned from Andrew's answer that optimization plays a big role in the example I tried. From your answer, I now better understand why there is such a large difference in running time with optimization on/off.
With the help of your answer, I also understand why, even after I added billions of loop iterations to my example, the running time still showed no big difference under RELEASE mode.
Thanks a lot!