Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

Call cpu_time makes program significantly slower?

Jiayi_W_
Beginner
739 Views

Hello!

I am running a simply do loop to change the values of a 21*21 double precision real matrix under the RELEASE mode. This loop takes me 58 seconds. (I call cpu_time() at the very beginning of the codes as well as at the end). However, when I comment out the call cpu_time, the running time becomes 87 seconds. 

I am surprised to see this large difference in running time (the only difference in the codes is call cpu_time). Anyone has the clue?

It seems that the difference disappears under DEBUG.

The codes that I ran is as follow:

-------------------------------------------------------------------------------------------------------------------------------------

real(kind=8), dimension(21,21) :: OCC


call system_clock(time0)
call CPU_TIME(time0_CPU)
OCC=1

do ic=1,4
do i1=1,90
do i2=1,2000000
      OCC=9.D0*9.D0
end do
end do
end do

call CPU_TIME(time1_CPU)
call system_clock(time1, time_rate)
write(*,*) time1_CPU-time0_CPU
write(*,*) real((time1 - time0),8) / real(time_rate,8)
read(*,*)
stop

0 Kudos
1 Solution
jimdempseyatthecove
Honored Contributor III
739 Views

This is what Andrew is getting at:

First measure of optimization is to reduce compile time sub-expressions:

real(kind=8), dimension(21,21) :: OCC

call system_clock(time0)
 call CPU_TIME(time0_CPU)
 OCC=1

do ic=1,4
 do i1=1,90
  do i2=1,2000000
       OCC=81.D0 ! was 9.D0*9.D0
  end do
 end do
end do

call CPU_TIME(time1_CPU)
 call system_clock(time1, time_rate)
 write(*,*) time1_CPU-time0_CPU
 write(*,*) real((time1 - time0),8) / real(time_rate,8)
 read(*,*)
 stop

Second measure is to lift loop invariant code out of the loop(s)

real(kind=8), dimension(21,21) :: OCC

call system_clock(time0)
 call CPU_TIME(time0_CPU)
 OCC=1

OCC=9.D0*9.D0
do ic=1,4
 do i1=1,90
 do i2=1,2000000
 end do
 end do
 end do

call CPU_TIME(time1_CPU)
 call system_clock(time1, time_rate)
 write(*,*) time1_CPU-time0_CPU
 write(*,*) real((time1 - time0),8) / real(time_rate,8)
 read(*,*)
 stop

Third measure is to remove "useless code", in this case the loops that do nothing and also do not reference the last value of the loop control variables:

real(kind=8), dimension(21,21) :: OCC

call system_clock(time0)
 call CPU_TIME(time0_CPU)

OCC=9.D0*9.D0

call CPU_TIME(time1_CPU)
 call system_clock(time1, time_rate)
 write(*,*) time1_CPU-time0_CPU
 write(*,*) real((time1 - time0),8) / real(time_rate,8)
 read(*,*)
 stop

Note, if you did use the resultant loop control variables: ic and/or i1 and/or i2, any good compiler optimization would simply set the variable to the exit value and not execute the loop(s) at all.

Forth measure is to remove results generated but not used (remove the array OCC and code to initialize it)

program
call system_clock(time0)
 call CPU_TIME(time0_CPU)

call CPU_TIME(time1_CPU)
 call system_clock(time1, time_rate)
 write(*,*) time1_CPU-time0_CPU
 write(*,*) real((time1 - time0),8) / real(time_rate,8)
 read(*,*)
 stop
end program

When writing sample code for performance test purposes you have to assure that the compiler optimizations do not do something that is not expected:

real(kind=8), dimension(21,21) :: OCC

call system_clock(time0)
 call CPU_TIME(time0_CPU)
 OCC=time0CPU ! a value indeterminable at compile time

do ic=1,4
 do i1=1,90
 do i2=1,2000000
       OCC=OCC*1.00000009D0*1.0000009D0 !Avoid overflow
 end do
 end do
 end do

call CPU_TIME(time1_CPU)
 call system_clock(time1, time_rate)
 write(*,*) time1_CPU-time0_CPU
 write(*,*) real((time1 - time0),8) / real(time_rate,8)
 read(*,*)
 ! now use OCC in a manner that the compiler cannot predetermine
 if(sum(OCC) .lt. 0.0) print *,"this shouldn't print"
 stop

Also, consider using the OpenMP omp_get_wtime() function. Reason being is CPU_TIME returns the sum of all threads CPU runtime. If for example you enabled auto-parallelization, and 4 threads were employed and the actual runtime (wall clock) was 1/4th that of the single thread version CPU_TIME will return ~4x the wall clock time (~the single thread runtime).

Jim Dempsey

View solution in original post

0 Kudos
6 Replies
mecej4
Honored Contributor III
739 Views

Perhaps you do not see the humor in your description:

This loop takes me 58 seconds. (I call cpu_time() at the very beginning of the codes as well as at the end). However, when I comment out the call cpu_time, the running time becomes 87 seconds. 

It that were really true, adding a few more calls to CPU_TIME() should reduce your run time to zero (well, we know that negative run times violate the principle of causality, so we won't go that far).

0 Kudos
Jiayi_W_
Beginner
739 Views

I surely understand that call cpu_time has no function to reduce the running time of the codes. I would just like to ask (not making jokes) is why under release mode, by commenting out the call CPU_time, the running time of the codes can vary so much? 

 After posting this question here, I've tried more and it seems that if I give the matrix initial values when I declare it, there is no this issue. 

 

mecej4 wrote:

Perhaps you do not see the humor in your description:

This loop takes me 58 seconds. (I call cpu_time() at the very beginning of the codes as well as at the end). However, when I comment out the call cpu_time, the running time becomes 87 seconds. 

It that were really true, adding a few more calls to CPU_TIME() should reduce your run time to zero (well, we know that negative run times violate the principle of causality, so we won't go that far).

0 Kudos
andrew_4619
Honored Contributor II
739 Views

Given the loop  does nothing (the loop indicices are not used) the optimiser can have great fun reducing your code to very little. 

Try the test case with optimisation off (you get that in debug mode)

0 Kudos
jimdempseyatthecove
Honored Contributor III
740 Views

This is what Andrew is getting at:

First measure of optimization is to reduce compile time sub-expressions:

real(kind=8), dimension(21,21) :: OCC

call system_clock(time0)
 call CPU_TIME(time0_CPU)
 OCC=1

do ic=1,4
 do i1=1,90
  do i2=1,2000000
       OCC=81.D0 ! was 9.D0*9.D0
  end do
 end do
end do

call CPU_TIME(time1_CPU)
 call system_clock(time1, time_rate)
 write(*,*) time1_CPU-time0_CPU
 write(*,*) real((time1 - time0),8) / real(time_rate,8)
 read(*,*)
 stop

Second measure is to lift loop invariant code out of the loop(s)

real(kind=8), dimension(21,21) :: OCC

call system_clock(time0)
 call CPU_TIME(time0_CPU)
 OCC=1

OCC=9.D0*9.D0
do ic=1,4
 do i1=1,90
 do i2=1,2000000
 end do
 end do
 end do

call CPU_TIME(time1_CPU)
 call system_clock(time1, time_rate)
 write(*,*) time1_CPU-time0_CPU
 write(*,*) real((time1 - time0),8) / real(time_rate,8)
 read(*,*)
 stop

Third measure is to remove "useless code", in this case the loops that do nothing and also do not reference the last value of the loop control variables:

real(kind=8), dimension(21,21) :: OCC

call system_clock(time0)
 call CPU_TIME(time0_CPU)

OCC=9.D0*9.D0

call CPU_TIME(time1_CPU)
 call system_clock(time1, time_rate)
 write(*,*) time1_CPU-time0_CPU
 write(*,*) real((time1 - time0),8) / real(time_rate,8)
 read(*,*)
 stop

Note, if you did use the resultant loop control variables: ic and/or i1 and/or i2, any good compiler optimization would simply set the variable to the exit value and not execute the loop(s) at all.

Forth measure is to remove results generated but not used (remove the array OCC and code to initialize it)

program
call system_clock(time0)
 call CPU_TIME(time0_CPU)

call CPU_TIME(time1_CPU)
 call system_clock(time1, time_rate)
 write(*,*) time1_CPU-time0_CPU
 write(*,*) real((time1 - time0),8) / real(time_rate,8)
 read(*,*)
 stop
end program

When writing sample code for performance test purposes you have to assure that the compiler optimizations do not do something that is not expected:

real(kind=8), dimension(21,21) :: OCC

call system_clock(time0)
 call CPU_TIME(time0_CPU)
 OCC=time0CPU ! a value indeterminable at compile time

do ic=1,4
 do i1=1,90
 do i2=1,2000000
       OCC=OCC*1.00000009D0*1.0000009D0 !Avoid overflow
 end do
 end do
 end do

call CPU_TIME(time1_CPU)
 call system_clock(time1, time_rate)
 write(*,*) time1_CPU-time0_CPU
 write(*,*) real((time1 - time0),8) / real(time_rate,8)
 read(*,*)
 ! now use OCC in a manner that the compiler cannot predetermine
 if(sum(OCC) .lt. 0.0) print *,"this shouldn't print"
 stop

Also, consider using the OpenMP omp_get_wtime() function. Reason being is CPU_TIME returns the sum of all threads CPU runtime. If for example you enabled auto-parallelization, and 4 threads were employed and the actual runtime (wall clock) was 1/4th that of the single thread version CPU_TIME will return ~4x the wall clock time (~the single thread runtime).

Jim Dempsey

0 Kudos
Jiayi_W_
Beginner
739 Views

I am a beginner and I speculated some configuration difference between release and debug leads to the issue I described.

But I was guessing it comes from some checks (e.g. the initial values of the array) are skipped. Thank you very much for letting me know that it comes from the optimization option! 

  

andrew_4619 wrote:

Given the loop  does nothing (the loop indicices are not used) the optimiser can have great fun reducing your code to very little. 

Try the test case with optimisation off (you get that in debug mode)

0 Kudos
Jiayi_W_
Beginner
739 Views

Thank you so much for your detailed explanation for how the optimization works. I am a beginner. I learned from Andrew's answer that the optimization plays a big role in the example I tried. From your answer, I've better understood why there is such a large difference in the running time with optimization on/ off.

With the help of your answer, I also understand why even after I added billions of billions loops in my example, the running time was still of no big difference under the release mode.  

Thanks a lot!

jimdempseyatthecove wrote:

This is what Andrew is getting at:

First measure of optimization is to reduce compile time sub-expressions:

real(kind=8), dimension(21,21) :: OCC

call system_clock(time0)
 call CPU_TIME(time0_CPU)
 OCC=1

do ic=1,4
 do i1=1,90
  do i2=1,2000000
       OCC=81.D0 ! was 9.D0*9.D0
  end do
 end do
end do

call CPU_TIME(time1_CPU)
 call system_clock(time1, time_rate)
 write(*,*) time1_CPU-time0_CPU
 write(*,*) real((time1 - time0),8) / real(time_rate,8)
 read(*,*)
 stop

Second measure is to lift loop invariant code out of the loop(s)

real(kind=8), dimension(21,21) :: OCC

call system_clock(time0)
 call CPU_TIME(time0_CPU)
 OCC=1

OCC=9.D0*9.D0
do ic=1,4
 do i1=1,90
 do i2=1,2000000
 end do
 end do
 end do

call CPU_TIME(time1_CPU)
 call system_clock(time1, time_rate)
 write(*,*) time1_CPU-time0_CPU
 write(*,*) real((time1 - time0),8) / real(time_rate,8)
 read(*,*)
 stop

Third measure is to remove "useless code", in this case the loops that do nothing and also do not reference the last value of the loop control variables:

real(kind=8), dimension(21,21) :: OCC

call system_clock(time0)
 call CPU_TIME(time0_CPU)

OCC=9.D0*9.D0

call CPU_TIME(time1_CPU)
 call system_clock(time1, time_rate)
 write(*,*) time1_CPU-time0_CPU
 write(*,*) real((time1 - time0),8) / real(time_rate,8)
 read(*,*)
 stop

Note, if you did use the resultant loop control variables: ic and/or i1 and/or i2, any good compiler optimization would simply set the variable to the exit value and not execute the loop(s) at all.

Forth measure is to remove results generated but not used (remove the array OCC and code to initialize it)

program
call system_clock(time0)
 call CPU_TIME(time0_CPU)

call CPU_TIME(time1_CPU)
 call system_clock(time1, time_rate)
 write(*,*) time1_CPU-time0_CPU
 write(*,*) real((time1 - time0),8) / real(time_rate,8)
 read(*,*)
 stop
end program

When writing sample code for performance test purposes you have to assure that the compiler optimizations do not do something that is not expected:

real(kind=8), dimension(21,21) :: OCC

call system_clock(time0)
 call CPU_TIME(time0_CPU)
 OCC=time0CPU ! a value indeterminable at compile time

do ic=1,4
 do i1=1,90
 do i2=1,2000000
       OCC=OCC*1.00000009D0*1.0000009D0 !Avoid overflow
 end do
 end do
 end do

call CPU_TIME(time1_CPU)
 call system_clock(time1, time_rate)
 write(*,*) time1_CPU-time0_CPU
 write(*,*) real((time1 - time0),8) / real(time_rate,8)
 read(*,*)
 ! now use OCC in a manner that the compiler cannot predetermine
 if(sum(OCC) .lt. 0.0) print *,"this shouldn't print"
 stop

Also, consider using the OpenMP omp_get_wtime() function. Reason being is CPU_TIME returns the sum of all threads CPU runtime. If for example you enabled auto-parallelization, and 4 threads were employed and the actual runtime (wall clock) was 1/4th that of the single thread version CPU_TIME will return ~4x the wall clock time (~the single thread runtime).

Jim Dempsey

0 Kudos
Reply