Intel® Fortran Compiler

Fortran low performance with allocatable arrays

Hahaha
Beginner

When I use allocatable arrays, the program is much slower than the version using static memory allocation. My real program is too long to post here, so I tried a few small test codes, and the results are a little strange. I tried three cases:

 

Method 1: takes 28.98s

module module_size_is_defined
  implicit none
  integer(4) :: n
end module

program main
  use module_size_is_defined
  implicit none
  
  integer(4) :: i
  real(8) :: y(50,50),z(50,50),t
  
  n = 50
  z = 0.0D0   ! note: z must be initialized before the accumulation loop
  do i =1,50000
    t=dble(i) * 2.0D0
    call A(y,t)
    z = z + y
  end do
  write(*,*) z(1,1)
end
  
subroutine A(y,t)
  use module_size_is_defined
  implicit none
  real(8),intent(out):: y(n,n)
  real(8),intent(in) :: t
  integer(4) :: j
  real(8) :: x(1,50)
  
  y=0.0D0
  do j = 1, 200
    call getX(x,t,j)
    y = y + matmul( transpose(x) + dble(j)**2, x )
  end do
endsubroutine A
  
  
subroutine getX(x,t,j)
  use module_size_is_defined
  implicit none
  real(8),intent(out) :: x(1,n)
  real(8),intent(in) :: t
  integer(4),intent(in) :: j
  integer(4) :: i
  
  do i =1, n
    x(1,i)  = dble(i+j) * t ** (1.5D00) 
  end do
endsubroutine getX

Method 2: takes 30.56s

module module_size_is_defined
  implicit none
  integer(4) :: n
end module

program main
  use module_size_is_defined
  implicit none
  
  integer(4) :: i
  real(8) :: y(50,50),z(50,50),t
  
  n = 50
  z = 0.0D0   ! note: z must be initialized before the accumulation loop
  do i =1,50000
    t=dble(i) * 2.0D0
    call A(y,t)
    z = z + y
  end do
  write(*,*) z(1,1)
end
  
subroutine A(y,t)
  use module_size_is_defined
  implicit none
  real(8),intent(out):: y(n,n)
  real(8),intent(in) :: t
  integer(4) :: j
  real(8),allocatable :: x(:,:)
  allocate(x(1,n))
  
  y=0.0D0
  do j = 1, 200
    call getX(x,t,j)
    y = y + matmul( transpose(x) + dble(j)**2, x )
  end do
endsubroutine A
  
  
subroutine getX(x,t,j)
  use module_size_is_defined
  implicit none
  real(8),intent(out) :: x(1,n)
  real(8),intent(in) :: t
  integer(4),intent(in) :: j
  integer(4) :: i
  
  do i =1, n
    x(1,i)  = dble(i+j) * t ** (1.5D00) 
  end do
endsubroutine getX

Method 3: takes 78.72s

module module_size_is_defined
  implicit none
  integer(4) :: n
endmodule

module module_array_is_allocated
  use module_size_is_defined
  implicit none
  real(8), allocatable,save :: x(:,:)

  contains
  subroutine init
    implicit none
    allocate(x(1,n))
  endsubroutine
endmodule module_array_is_allocated

program main
  use module_size_is_defined
  use module_array_is_allocated
  implicit none
  
  integer(4) :: i
  real(8) :: y(50,50),z(50,50),t
  
  n = 50
  z = 0.0D0   ! note: z must be initialized before the accumulation loop
  call init
  do i =1,50000
    t=dble(i) * 2.0D0
    call A(y,t)
    z = z + y
  end do
  write(*,*) z(1,1)
end
  
subroutine A(y,t)
  use module_size_is_defined
  use module_array_is_allocated
  implicit none
  real(8),intent(out):: y(n,n)
  real(8),intent(in) :: t
  integer(4) :: j
  
  y=0.0D0
  do j = 1, 200
    call getX(x,t,j)
    y = y + matmul( transpose(x) + dble(j)**2, x )
  end do
endsubroutine A
  
  
subroutine getX(x,t,j)
  use module_size_is_defined
  implicit none
  real(8),intent(out) :: x(1,n)
  real(8),intent(in) :: t
  integer(4),intent(in) :: j
  integer(4) :: i
  
  do i =1, n
    x(1,i)  = dble(i+j) * t ** (1.5D00) 
  end do
endsubroutine getX

For this simple problem, Method 1 and Method 2 take almost the same time, while Method 3 is much slower. But Method 3 should be better than Method 2, since it allocates x(1,n) only once, right? Yet it is much slower. In my previous program, however, Method 2 took almost the same time as Method 3. With the same compile options, the speed is different. All codes are compiled with the -O2 option.

 

Here are my questions:

1. Why is Method 2 even faster than Method 3?

2. Any ideas or suggestions on allocating the arrays in a more efficient manner, to reduce the performance penalty? I want to allocate the arrays dynamically.
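As one possible direction for question 2 (a sketch only, not tested against the real program): allocate the workspace once in the main program and pass it down through the call chain, so that no allocation happens inside the hot loop. The restructured subroutine A would look roughly like this:

```fortran
! Sketch: workspace x is allocated once by the caller and passed in
! (hypothetical restructuring of the Method 2 code above; untested).
subroutine A(y, t, x)
  use module_size_is_defined
  implicit none
  real(8), intent(out)   :: y(n,n)
  real(8), intent(in)    :: t
  real(8), intent(inout) :: x(1,n)   ! workspace owned by the caller
  integer(4) :: j

  y = 0.0D0
  do j = 1, 200
    call getX(x, t, j)
    y = y + matmul(transpose(x) + dble(j)**2, x)
  end do
end subroutine A
```

The main program would then do `allocate(x(1,n))` once after setting n, and `call A(y, t, x)` inside the loop.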

 

Thanks

7 Replies
mecej4
Black Belt

It is tricky and unreliable to time the execution of a short piece of code (which does nothing particularly useful) by repeating it thousands of times. Nor should one jump to simple explanations for why one version runs faster or slower than another (such as the change from static to dynamic array allocation). Usually, one has to work with the full program and try various improvements.

I reduced the loop counts in your three programs from 50000 to 5000, and you can see below the timings that I obtained (on an i7-10710U). We could run a profiler and we could look at the assembly code, but those are more useful when performed on the real code than this toy example. To me, the results do not appear useful.

Run times for 5000 iterations (seconds)

                 Prog-1   Prog-2   Prog-3
ifort /Qxhost    1.441    1.458    1.292
gfortran -O2     1.790    3.323    3.788
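One way to make such measurements less noisy (a minimal sketch, not part of the original post) is to bracket only the region of interest with the standard SYSTEM_CLOCK intrinsic, so that program startup and I/O are excluded from the measurement:

```fortran
! Sketch: time only the region of interest with SYSTEM_CLOCK
! (illustrative only; insert the code under test where marked).
program time_region
  implicit none
  integer(8) :: c0, c1, rate
  real(8)    :: s

  call system_clock(c0, rate)   ! count and ticks-per-second
  ! ... code under test ...
  call system_clock(c1)
  s = dble(c1 - c0) / dble(rate)
  write(*,'(A,F8.3,A)') 'elapsed: ', s, ' s'
end program time_region
```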
Hahaha
Beginner

Thanks. I just wanted to use this short piece of code to illustrate the problem. I did notice that different optimization options can result in very different performance. However, it is still strange that allocating once can be much slower than allocating the array inside the do loop.

I tried the /Qxhost option today. But on my computer, for this short example code, 1 is still the fastest while 2 and 3 are still almost the same. For my real code, /Qxhost is even slower than /O2. Are there any documents available that describe these optimization options in detail? I am doing scientific calculations, and the real code is too long. Sometimes each case is run only once, so it is better to know why one option is better than another instead of just picking one.

andrew_4619
Honored Contributor I

Analyse the code with VTune for the three cases to see where the time is spent. But you are better off doing such things on a real application: what you learn from these toy tests will probably not apply in the same way and will thus be of little benefit.
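For reference, a minimal VTune command line looks roughly like this (the binary name and result-directory name here are placeholders):

```shell
# Sketch: collect a hotspots profile with the VTune command-line
# collector, then print the hottest functions.
vtune -collect hotspots -result-dir r_method2 -- ./method2
vtune -report hotspots -result-dir r_method2
```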

 

mecej4
Black Belt

I think that you are drawing an incorrect conclusion in attributing a significant portion of the run time to allocation. It is likely that 95 percent of the time is spent in subroutine A.

You can see the expansion of the code-generation and optimization options by using the additional option /#, or by requesting a compiler listing.
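For example, on Linux the compiler's optimization report can be requested like this (a sketch; on Windows the corresponding forms are /Qopt-report and /Qopt-report-phase):

```shell
# Sketch: ask ifort for an optimization report covering loop and
# vectorization decisions; the report is written to main.optrpt.
ifort -O2 -qopt-report=2 -qopt-report-phase=loop,vec main.f90 -o main
```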

JohnNichols
Valued Contributor II

I have a program that runs millions of times on a Core i3. We record the loop time, which is about 8 seconds, using a timer and SQL Server, and the loop time can vary quite a lot for exactly the same code. Your results are not surprising; you would need to play with the code to find out what is causing the delays, and that can be time-consuming and interesting. At six million replicates we have a good idea of the average and standard deviation. If you worry about this sort of thing, buy a faster computer.

Hahaha
Beginner

Thanks, JohnNichols. You are right. It really is interesting to figure out why, but for now I will just let it go and live with it. A faster computer is the simplest way to solve it.

Hahaha
Beginner

Thanks, mecej4. matmul takes most of the time. It seems that matmul takes a different amount of time with an allocated array than with a static array.
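One experiment worth trying (a sketch under the assumption that array temporaries matter here; untested): the expression `y = y + matmul(transpose(x) + dble(j)**2, x)` builds a temporary for the transpose and another for the product on every iteration. Because x has shape (1,n), the product is just a rank-one (outer-product) update, which can be written with explicit loops:

```fortran
! Sketch: the same update as explicit loops, avoiding the temporaries
! created by transpose() and matmul(). i and k are local integer loop
! indices; j is the loop counter from subroutine A.
do k = 1, n
  do i = 1, n
    y(i,k) = y(i,k) + (x(1,i) + dble(j)**2) * x(1,k)
  end do
end do
```

This removes the allocated-vs-static distinction from the hot path entirely, which would help confirm whether matmul's handling of the allocatable array is the real cost.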
