I'm comparing computation times under different optimization levels. O3 optimization seems to substantially degrade the performance of the following code compared with O2: it takes 43 s with O3 (~34 s with O0) but only 19 s with O2.
DO k= 1,50
  DO i = 1,n
    DO j = 1,n
      a(i,j) = (a(i,j)*b(i,j) + c(i,j)*d(i,j) +g(i,j)*h(i,j))
    ENDDO
  ENDDO
ENDDO
$ ifort --version
ifort (IFORT) 16.0.3 20160415
Copyright (C) 1985-2016 Intel Corporation. All rights reserved.
I know that changing the order of the loops can improve the efficiency of memory access and lead to a faster execution time, but I am curious about what causes this problem. It never occurred with ifort (IFORT) 14.0.1 20131008.
Thanks, Jason
While it is possible for code compiled with /O3 to run slower than with /O2, the information that you have provided is not enough to draw any conclusions. In particular, what is the value of n, and how and what did you time? Did you time just the loops that you showed above, or did you time the whole program?
The loops in the example program below took 2.1 s with /O2 and 1.5 s with /O3 when compiled with 2016U4-64 bit on Windows. These timings are for just the loops. The whole program took about 3 times longer.
program sijie
  implicit none
  integer :: i,j,k
  integer, parameter :: n=9000
  real, dimension(n,n) :: a,b,c,d,g,h
  real :: t1,t2
  !
  call random_number(a)
  call random_number(b)
  call random_number(c)
  call random_number(d)
  call random_number(g)
  call random_number(h)
  call cpu_time(t1)
  DO k= 1,50
    DO i = 1,n
      DO j = 1,n
        a(i,j) = (a(i,j)*b(i,j) + c(i,j)*d(i,j) +g(i,j)*h(i,j))
      ENDDO
    ENDDO
  ENDDO
  call cpu_time(t2)
  print *,t2-t1
end program
I obtained similar results with other versions of the compiler.
Hi mecej4,
Thanks for your reply. The value of n in my case is even smaller than yours: it is 5000, and the program is posted below. The reported time covers only the loops in the code shown. The code is compiled with Intel Fortran and runs on a distributed system with Linux version 3.10.0-514.10.2.el7.x86_64. Any idea about this problem?
PROGRAM HW1
  IMPLICIT NONE
  INTEGER :: n
  PARAMETER(n=5000)
  REAL :: a(n,n),b(n,n),c(n,n),d(n,n),e(n,n),f(n,n),g(n,n),h(n,n),m(n,n)
  REAL :: t1,t2,cputime,w1,w2,walltime
  REAL :: tmp(2)
  REAL :: f_cputime,f_walltime
  INTEGER :: omp_get_max_threads
  INTEGER :: i,j,k
  a = 1.0
  b = 2.0
  c = 2.0
  d = 4.0
  g = 5.0
  h = 6.0
  call init_walltime
  !------------------------------------------------------------------------------
  ! Section 1a
  !------------------------------------------------------------------------------
  print*,'Number of OpenMP threads to be used =', omp_get_max_threads()
  t1 = f_cputime()
  w1 = f_walltime()
  DO k= 1,50
    DO i = 1,n
      DO j = 1,n
        a(i,j) = (a(i,j)*b(i,j) + c(i,j)*d(i,j) +g(i,j)*h(i,j))
      ENDDO
    ENDDO
  ENDDO
  t2 = f_cputime()
  w2 = f_walltime()
  PRINT*,'CPU and wall times used by section 1a was ',t2-t1,',',w2-w1,'(s)'
END PROGRAM
Using IFort 16U4-64 on Windows 10, and using your code with the intrinsic CPU_TIME() instead of your non-standard timing routines, with 4 cores/8 threads on a laptop with an i7-2720 CPU, I obtain the following results:
With /O2 /MT /Qopenmp /link /stack:1000000000: 0.67 s.
Same, but /O3 : 0.42 s.
Surely, the times that you reported in #1 (43 s and 19 s) were not for the program of #3?
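For reference, timing a loop with only standard intrinsics might look like the sketch below. It is an illustration, not the code used for the timings above: the program name, the array size n = 2000 and the simplified loop body are all assumptions. It records both CPU time (CPU_TIME) and wall-clock time (SYSTEM_CLOCK), and prints a checksum so that the timed loop's result is actually used.

```fortran
program timing_sketch
  use, intrinsic :: iso_fortran_env, only: int64
  implicit none
  integer, parameter :: n = 2000        ! assumed size, smaller than in the thread
  real, allocatable :: a(:,:), b(:,:)
  real :: t1, t2                        ! CPU time in seconds
  integer(int64) :: c1, c2, rate        ! wall-clock ticks and ticks per second
  integer :: i, j

  allocate(a(n,n), b(n,n))
  call random_number(a)
  call random_number(b)

  ! Start both timers immediately before the section of interest.
  call cpu_time(t1)
  call system_clock(c1, rate)

  do j = 1, n
    do i = 1, n
      a(i,j) = a(i,j)*b(i,j) + 1.0      ! simplified stand-in for the loop body
    end do
  end do

  call cpu_time(t2)
  call system_clock(c2)

  ! Use the result so an optimizer has no reason to treat the loop as dead code.
  print *, 'checksum  =', sum(a)
  print *, 'CPU time  =', t2 - t1, 's'
  print *, 'wall time =', real(c2 - c1) / real(rate), 's'
end program timing_sketch
```

Reporting both numbers makes it easier to see whether a run was multithreaded or spent time waiting, since CPU time and wall time then diverge.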
Yes, they are. I'll try the intrinsic Fortran function.
Is the Linux system you use a cluster? Is your program utilizing more than one node? It is unlikely that the system would run 50 times slower than a laptop unless it has processors that are over a decade old or the buses and fabric are slow.
Information on the processor model, etc., which can be obtained from the virtual file /proc/cpuinfo , would be helpful in placing these timing results in perspective.
Yes, it is a cluster. The code is compiled with the O3 option only and runs on only 1 processor. The processor info is attached. I think the much longer running time is due to memory/cache management. It takes only 2 s (similar to your result) when I interchange the loops, i.e.,
DO k=1,50
  DO j=1,n
    DO i=1,n
      a(i,j)=...
    END DO
  END DO
END DO
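The effect of the interchange can be checked in isolation with a small test program such as the sketch below. Fortran stores arrays in column-major order, so the first index is the one that is contiguous in memory. The program name, the sizes n = 2000 and nrep = 10, and the simplified loop body are assumptions chosen for illustration; they are not taken from the code in this thread.

```fortran
program loop_order_sketch
  implicit none
  integer, parameter :: n = 2000, nrep = 10   ! assumed sizes, for illustration only
  real, allocatable :: a1(:,:), a2(:,:), b(:,:)
  real :: t0, t1, t2
  integer :: i, j, k

  allocate(a1(n,n), a2(n,n), b(n,n))
  call random_number(b)
  a1 = 1.0
  a2 = 1.0

  call cpu_time(t0)
  ! i outermost: the inner loop over j strides n elements through memory.
  do k = 1, nrep
    do i = 1, n
      do j = 1, n
        a1(i,j) = a1(i,j)*b(i,j) + 1.0
      end do
    end do
  end do
  call cpu_time(t1)
  ! Interchanged: the inner loop over i walks contiguous memory.
  do k = 1, nrep
    do j = 1, n
      do i = 1, n
        a2(i,j) = a2(i,j)*b(i,j) + 1.0
      end do
    end do
  end do
  call cpu_time(t2)

  ! Both orders perform the same arithmetic; allow for rounding differences.
  if (maxval(abs(a1 - a2)) > 1.0e-4) error stop 'loop orders disagree'
  print *, 'i outer, j inner:', t1 - t0, 's'
  print *, 'j outer, i inner:', t2 - t1, 's'
end program loop_order_sketch
```

An optimizing compiler may perform this interchange on its own, so the measured gap between the two versions depends on the compiler version and the optimization flags used.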