O3 optimizations make the code slower

Sijie_P_ · ‎09-14-2017

I'm trying different optimization levels to compare computation times. It seems that O3 greatly degrades the performance of the following code compared to O2. It takes 43 s with O3 (~34 s with O0) but only 19 s with O2.

DO k = 1,50
  DO i = 1,n
    DO j = 1,n
      a(i,j) = (a(i,j)*b(i,j) + c(i,j)*d(i,j) + g(i,j)*h(i,j))
    ENDDO
  ENDDO
ENDDO

$ ifort --version
ifort (IFORT) 16.0.3 20160415
Copyright (C) 1985-2016 Intel Corporation. All rights reserved.

I know that changing the order of the loops can improve the efficiency of memory access and ultimately lead to a faster execution time. But I am curious about what causes this problem. With ifort (IFORT) 14.0.1 20131008, this problem never occurred.

Thanks, Jason

mecej4 · ‎09-15-2017

While it is possible for code compiled with /O3 to run slower than with /O2, the information that you have provided is not enough to draw any conclusions. In particular, what is the value of n, and how and what did you time? Did you time just the loops that you showed above, or did you time the whole program?

The loops in the example program below took 2.1 s with /O2 and 1.5 s with /O3 when compiled with 2016U4-64 bit on Windows. These timings are for just the loops. The whole program took about three times as long.

program sijie
implicit none
integer :: i,j,k
integer, parameter :: n=9000
real, dimension(n,n) :: a,b,c,d,g,h
real :: t1,t2
!
call random_number(a)
call random_number(b)
call random_number(c)
call random_number(d)
call random_number(g)
call random_number(h)

call cpu_time(t1)
DO k= 1,50

  DO i = 1,n
    DO j = 1,n
      a(i,j) = (a(i,j)*b(i,j) + c(i,j)*d(i,j) +g(i,j)*h(i,j))
    ENDDO
  ENDDO

ENDDO
call cpu_time(t2)
print *,t2-t1
end program

I obtained similar results with other versions of the compiler.

Sijie_P_ · ‎09-15-2017

Hi mecej4,

Thanks for your reply. The value of n in my case is even smaller than yours: it is 5000, and the program is posted below. The reported time is for just the loops shown in the code. The code is compiled with Intel Fortran and runs on a distributed system with Linux version 3.10.0-514.10.2.el7.x86_64. Any idea about this problem?

PROGRAM HW1

  IMPLICIT NONE
  INTEGER :: n
  PARAMETER(n=5000)
  REAL :: a(n,n),b(n,n),c(n,n),d(n,n),e(n,n),f(n,n),g(n,n),h(n,n),m(n,n)
  REAL :: t1,t2,cputime,w1,w2,walltime
  REAL :: tmp(2)
  REAL :: f_cputime,f_walltime
  INTEGER :: omp_get_max_threads

  INTEGER :: i,j,k

  a = 1.0
  b = 2.0
  c = 2.0
  d = 4.0
  g = 5.0
  h = 6.0

  call init_walltime

!------------------------------------------------------------------------------
! Section 1a
!------------------------------------------------------------------------------

  print*,'Number of OpenMP threads to be used =', omp_get_max_threads()

  t1 = f_cputime()
  w1 = f_walltime()
  DO k= 1,50

  DO i = 1,n
    DO j = 1,n
      a(i,j) = (a(i,j)*b(i,j) + c(i,j)*d(i,j) +g(i,j)*h(i,j))
    ENDDO
  ENDDO

  ENDDO
  t2 = f_cputime()
  w2 = f_walltime()

  PRINT*,'CPU and wall times used by section 1a was ',t2-t1,',',w2-w1,'(s)'

END PROGRAM

mecej4 · ‎09-15-2017

Using IFort 16U4-64 on Windows 10, and using your code with the intrinsic CPU_TIME() instead of your non-standard timing routines, with 4 cores/8 threads on a laptop with an i7-2720 CPU, I obtain the following results:

With /O2 /MT /Qopenmp /link /stack:1000000000: 0.67 s.

Same, but /O3 : 0.42 s.

Surely, the times that you reported in #1 (43 s and 19 s) were not for the program of #3?
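
For readers who want to try the standard timers mentioned above, here is a minimal sketch that uses only the intrinsics CPU_TIME and SYSTEM_CLOCK. The program name timing_sketch, the size n = 1000 and the dummy update in the loop are assumptions made for illustration; they are not taken from the thread. CPU_TIME is processor-dependent and, in threaded runs, often reports time summed over all threads, so the wall-clock time is printed next to it.

```fortran
program timing_sketch
  use, intrinsic :: iso_fortran_env, only: int64
  implicit none
  integer, parameter :: n = 1000      ! hypothetical size, kept small on purpose
  real, allocatable :: a(:,:)
  real :: c1, c2                      ! CPU time stamps in seconds
  integer(int64) :: w1, w2, rate      ! wall-clock ticks and ticks per second
  integer :: i, j

  allocate(a(n,n))                    ! heap allocation avoids large stack arrays
  call random_number(a)

  call cpu_time(c1)
  call system_clock(w1, rate)

  ! Dummy work standing in for the loops being timed
  do j = 1, n
    do i = 1, n
      a(i,j) = 0.5*a(i,j) + 1.0
    end do
  end do

  call cpu_time(c2)
  call system_clock(w2)

  print *, 'CPU time  (s):', c2 - c1
  print *, 'wall time (s):', real(w2 - w1) / real(rate)
  ! Using the result keeps the compiler from discarding the timed loop
  print *, 'checksum     :', sum(a)
end program timing_sketch
```

Printing a value derived from the array after the timed region is a small precaution: without it, an optimizer is free to remove work whose result is never used.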

Sijie_P_ · ‎09-15-2017

Yes, they are. I'll try the intrinsic Fortran function.

mecej4 · ‎09-15-2017

Is the Linux system you use a cluster? Is your program utilizing more than one node? It is unlikely that the system would run 50 times slower than a laptop unless it has processors that are over a decade old or the buses and fabric are slow.

Information on the processor model, etc., which can be obtained from the virtual file /proc/cpuinfo , would be helpful in placing these timing results in perspective.

Sijie_P_ · ‎09-15-2017

Yes, it is a cluster. The code is compiled with the O3 option only and runs on a single processor. The processor info is included below. I think the much longer running time is due to memory/cache behavior. It takes only 2 s (similar to your result) when I interchange the loops, i.e.,

DO k=1,50
  DO j=1,n
    DO i=1,n
      a(i,j)=...
    END DO
  END DO
END DO
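
The effect of the interchange can be checked with a small self-contained program along the following lines. This is only a sketch: the names loop_order, nrep, a0 and a_ji, and the reduced sizes n = 2000 and nrep = 10, are assumptions chosen to keep the run short. Fortran stores arrays in column-major order, so with the inner loop over j consecutive accesses to a(i,j) are n elements apart, while with the inner loop over i they are adjacent in memory. An optimizing compiler may interchange the loops on its own, so the two timings can come out similar at some optimization levels.

```fortran
program loop_order
  implicit none
  integer, parameter :: n = 2000     ! hypothetical size, smaller than in the thread
  integer, parameter :: nrep = 10    ! hypothetical repeat count
  real, allocatable :: a0(:,:), a(:,:), a_ji(:,:)
  real, allocatable :: b(:,:), c(:,:), d(:,:), g(:,:), h(:,:)
  real :: t1, t2, t3, t4
  integer :: i, j, k

  ! Allocatable arrays live on the heap, so no enlarged stack is needed
  allocate(a0(n,n), a(n,n), a_ji(n,n))
  allocate(b(n,n), c(n,n), d(n,n), g(n,n), h(n,n))
  call random_number(a0)
  call random_number(b)
  call random_number(c)
  call random_number(d)
  call random_number(g)
  call random_number(h)

  ! Order from the original post: inner loop over j (second index, stride n)
  a = a0
  call cpu_time(t1)
  do k = 1, nrep
    do i = 1, n
      do j = 1, n
        a(i,j) = a(i,j)*b(i,j) + c(i,j)*d(i,j) + g(i,j)*h(i,j)
      end do
    end do
  end do
  call cpu_time(t2)
  a_ji = a                           ! keep the result for comparison

  ! Interchanged order: inner loop over i (first index, stride 1)
  a = a0
  call cpu_time(t3)
  do k = 1, nrep
    do j = 1, n
      do i = 1, n
        a(i,j) = a(i,j)*b(i,j) + c(i,j)*d(i,j) + g(i,j)*h(i,j)
      end do
    end do
  end do
  call cpu_time(t4)

  print *, 'inner loop over j (s):', t2 - t1
  print *, 'inner loop over i (s):', t4 - t3
  ! Both orders perform the same element-wise update; print how far they differ
  print *, 'max abs difference   :', maxval(abs(a - a_ji))
end program loop_order
```

Building the same source at several optimization levels and comparing the two printed times is one way to see whether the compiler has already reordered the loops.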

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 63
model name : Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz
stepping : 2
microcode : 0x38
cpu MHz : 1418.453
cache size : 25600 KB
physical id : 0
siblings : 20
core id : 0
cpu cores : 10
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 15
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc
bogomips : 4594.99
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management: