Intel® Fortran Compiler

O3 optimizations make the code slower

Beginner
210 Views

I'm trying to use different optimization levels to compare computation times. It seems that O3 optimization significantly decreases the performance of the following code compared with O2. It takes 43 s with O3 (~34 s with O0) but only 19 s with the O2 option.

```
DO k = 1,50
  DO i = 1,n
    DO j = 1,n
      a(i,j) = (a(i,j)*b(i,j) + c(i,j)*d(i,j) + g(i,j)*h(i,j))
    ENDDO
  ENDDO
ENDDO
```

$ ifort --version
ifort (IFORT) 16.0.3 20160415

I know that changing the order of the loops can improve the efficiency of memory access and lead to a faster execution time. But I am curious about the reason for this behavior. With ifort (IFORT) 14.0.1 20131008, this problem never happened.
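(By "changing the order of loops" I mean making the first array index the innermost loop, since Fortran stores arrays in column-major order; a minimal sketch of the interchanged version of the loops above:)

```
DO k = 1,50
  DO j = 1,n        ! second index outermost
    DO i = 1,n      ! first index innermost: unit-stride, cache-friendly
      a(i,j) = (a(i,j)*b(i,j) + c(i,j)*d(i,j) + g(i,j)*h(i,j))
    ENDDO
  ENDDO
ENDDO
```

With this ordering, each inner iteration touches consecutive memory locations, so cache lines and the hardware prefetcher are used effectively.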

Thanks, Jason

6 Replies
Black Belt

While it is possible for code compiled with /O3 to run slower than with /O2, the information that you have provided is not enough to draw any conclusions. In particular, what is the value of n, and how and what did you time? Did you time just the loops that you showed above, or did you time the whole program?

The loops in the example program below took 2.1 s with /O2 and 1.5 s with /O3 when compiled with 2016U4-64 bit on Windows. These timings are for just the loops. The whole program took about 3 X longer.

```
program sijie
implicit none
integer :: i, j, k
integer, parameter :: n = 9000
real, dimension(n,n) :: a, b, c, d, g, h
real :: t1, t2
!
call random_number(a)
call random_number(b)
call random_number(c)
call random_number(d)
call random_number(g)
call random_number(h)

call cpu_time(t1)
DO k = 1,50
  DO i = 1,n
    DO j = 1,n
      a(i,j) = (a(i,j)*b(i,j) + c(i,j)*d(i,j) + g(i,j)*h(i,j))
    ENDDO
  ENDDO
ENDDO
call cpu_time(t2)
print *, t2 - t1
end program
```

I obtained similar results with other versions of the compiler.

Beginner

mecej4 wrote:

While it is possible for code compiled with /O3 to run slower than with /O2, the information that you have provided is not enough to draw any conclusions. In particular, what is the value of n, and how and what did you time? Did you time just the loops that you showed above, or did you time the whole program?


Hi mecej4,

Thanks for your reply. The value of n in my case is even smaller than yours: it is 5000 here, and the program is posted below. The time reported is just for the loops shown. The code is compiled with Intel Fortran and runs on a distributed system with Linux version 3.10.0-514.10.2.el7.x86_64. Any idea about this problem?

```
PROGRAM HW1

IMPLICIT NONE
INTEGER :: n
PARAMETER(n=5000)
REAL :: a(n,n),b(n,n),c(n,n),d(n,n),e(n,n),f(n,n),g(n,n),h(n,n),m(n,n)
REAL :: t1,t2,cputime,w1,w2,walltime
REAL :: tmp(2)
REAL :: f_cputime,f_walltime   ! external, non-standard timing routines

INTEGER :: i,j,k

a = 1.0
b = 2.0
c = 2.0
d = 4.0
g = 5.0
h = 6.0

call init_walltime

!------------------------------------------------------------------------------
! Section 1a
!------------------------------------------------------------------------------

t1 = f_cputime()
w1 = f_walltime()
DO k = 1,50
  DO i = 1,n
    DO j = 1,n
      a(i,j) = (a(i,j)*b(i,j) + c(i,j)*d(i,j) + g(i,j)*h(i,j))
    ENDDO
  ENDDO
ENDDO
t2 = f_cputime()
w2 = f_walltime()

PRINT*,'CPU and wall times used by section 1a were ',t2-t1,',',w2-w1,'(s)'

END PROGRAM
```

Black Belt

Using IFort 16U4-64 on Windows 10, and using your code with the intrinsic CPU_TIME() instead of your non-standard timing routines, with 4 cores/8 threads on a laptop with an i7-2720 CPU, I obtain the following results:

With /O2 /MT /Qopenmp /link /stack:1000000000: 0.67 s.

Same, but /O3 : 0.42 s.

Surely, the times that you reported in #1 (43 s and 19 s) were not for the program of #3?
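(For reference, a minimal sketch of the timing pattern, using the standard intrinsics CPU_TIME and SYSTEM_CLOCK in place of the non-standard f_cputime/f_walltime; the timed section here is just a placeholder:)

```
program timing_demo
implicit none
real :: t1, t2
integer(8) :: c1, c2, rate

call system_clock(count_rate=rate)   ! wall-clock ticks per second
call cpu_time(t1)                    ! process CPU time, in seconds
call system_clock(c1)                ! wall-clock tick count
! ... timed loops go here ...
call cpu_time(t2)
call system_clock(c2)
print *, 'CPU: ', t2 - t1, ' s, wall: ', real(c2 - c1)/real(rate), ' s'
end program timing_demo
```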

Beginner

Yes, they are. I'll try the intrinsic Fortran functions.

mecej4 wrote:

Surely, the times that you reported in #1 (43 s and 19 s) were not for the program of #3?

Black Belt

Is the Linux system you use a cluster? Is your program utilizing more than one node? It is unlikely that the system would run 50 times slower than a laptop unless it has processors that are over a decade old or its buses and fabric are slow.

Information on the processor model, etc., which can be obtained from the virtual file /proc/cpuinfo , would be helpful in placing these timing results in perspective.

Beginner

mecej4 wrote:

Is the Linux system you use a cluster? Is your program utilizing more than one node?

Yes, it is a cluster. The code is compiled with the O3 option only and runs on a single processor. The processor info is attached. I think the much longer running time is due to memory/cache behavior. It takes only about 2 s (similar to your result) when I do a loop interchange, i.e.,

```
DO k = 1,50
  DO j = 1,n
    DO i = 1,n
      a(i,j) = (a(i,j)*b(i,j) + c(i,j)*d(i,j) + g(i,j)*h(i,j))
    END DO
  END DO
END DO
```
```
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 63
model name      : Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz
stepping        : 2
microcode       : 0x38
cpu MHz         : 1418.453
cache size      : 25600 KB
physical id     : 0
siblings        : 20
core id         : 0
cpu cores       : 10
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 15
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc
bogomips        : 4594.99
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:
```