Re: Fortran low performance with allocatable arrays

Hahaha · 12-08-2020

With allocatable arrays, my program runs much slower than the version that uses static memory allocation. The program is too long to post here, so I tried a few small test codes, and the results are a little strange. I tried three cases:

Method 1: takes 28.98s

module module_size_is_defined
  implicit none
  integer(4) :: n
end module

program main
  use module_size_is_defined
  implicit none
  
  integer(4) :: i
  real(8) :: y(50,50),z(50,50),t
  
  n = 50
  z = 0.0D0   ! initialize the accumulator before summing into it
  do i =1,50000
    t=dble(i) * 2.0D0
    call A(y,t)
    z = z + y
  end do
  write(*,*) z(1,1)
end
  
subroutine A(y,t)
  use module_size_is_defined
  implicit none
  real(8),intent(out):: y(n,n)
  real(8),intent(in) :: t
  integer(4) :: j
  real(8) :: x(1,50)
  
  y=0.0D0
  do j = 1, 200
    call getX(x,t,j)
    y = y + matmul( transpose(x) + dble(j)**2, x )
  end do
endsubroutine A
  
  
subroutine getX(x,t,j)
  use module_size_is_defined
  implicit none
  real(8),intent(out) :: x(1,n)
  real(8),intent(in) :: t
  integer(4),intent(in) :: j
  integer(4) :: i
  
  do i =1, n
    x(1,i)  = dble(i+j) * t ** (1.5D00) 
  end do
endsubroutine getX

Method 2: takes 30.56s

module module_size_is_defined
  implicit none
  integer(4) :: n
end module

program main
  use module_size_is_defined
  implicit none
  
  integer(4) :: i
  real(8) :: y(50,50),z(50,50),t
  
  n = 50
  z = 0.0D0   ! initialize the accumulator before summing into it
  do i =1,50000
    t=dble(i) * 2.0D0
    call A(y,t)
    z = z + y
  end do
  write(*,*) z(1,1)
end
  
subroutine A(y,t)
  use module_size_is_defined
  implicit none
  real(8),intent(out):: y(n,n)
  real(8),intent(in) :: t
  integer(4) :: j
  real(8),allocatable :: x(:,:)
  allocate(x(1,n))
  
  y=0.0D0
  do j = 1, 200
    call getX(x,t,j)
    y = y + matmul( transpose(x) + dble(j)**2, x )
  end do
endsubroutine A
  
  
subroutine getX(x,t,j)
  use module_size_is_defined
  implicit none
  real(8),intent(out) :: x(1,n)
  real(8),intent(in) :: t
  integer(4),intent(in) :: j
  integer(4) :: i
  
  do i =1, n
    x(1,i)  = dble(i+j) * t ** (1.5D00) 
  end do
endsubroutine getX

Method 3: takes 78.72s

module module_size_is_defined
  implicit none
  integer(4) :: n
endmodule

module module_array_is_allocated
  use module_size_is_defined
  implicit none
  real(8), allocatable,save :: x(:,:)

  contains
  subroutine init
    implicit none
    allocate(x(1,n))
  endsubroutine
endmodule module_array_is_allocated

program main
  use module_size_is_defined
  use module_array_is_allocated
  implicit none
  
  integer(4) :: i
  real(8) :: y(50,50),z(50,50),t
  
  n = 50
  z = 0.0D0   ! initialize the accumulator before summing into it
  call init
  do i =1,50000
    t=dble(i) * 2.0D0
    call A(y,t)
    z = z + y
  end do
  write(*,*) z(1,1)
end
  
subroutine A(y,t)
  use module_size_is_defined
  use module_array_is_allocated
  implicit none
  real(8),intent(out):: y(n,n)
  real(8),intent(in) :: t
  integer(4) :: j
  
  y=0.0D0
  do j = 1, 200
    call getX(x,t,j)
    y = y + matmul( transpose(x) + dble(j)**2, x )
  end do
endsubroutine A
  
  
subroutine getX(x,t,j)
  use module_size_is_defined
  implicit none
  real(8),intent(out) :: x(1,n)
  real(8),intent(in) :: t
  integer(4),intent(in) :: j
  integer(4) :: i
  
  do i =1, n
    x(1,i)  = dble(i+j) * t ** (1.5D00) 
  end do
endsubroutine getX

For this simple problem, Methods 1 and 2 take almost the same time, and Method 3 is much slower. I expected Method 3 to be better than Method 2, since it allocates x(1,n) only once, but it is much slower. In my previous program, however, Method 2 gives almost the same time as Method 3, so the relative speed differs even with the same compile options. All codes are compiled with the -O2 option.

Here are my questions:

1. Why is Method 2 even faster than Method 3?

2. Are there any ideas or suggestions for allocating the arrays more efficiently, to reduce the performance penalty? I want to allocate the arrays dynamically.

Thanks

mecej4 · 12-09-2020

It is tricky and unreliable to time the execution of a short piece of code (which does nothing particularly useful) by repeating it thousands of times. Nor should one jump to simple explanations (such as a change from static to dynamic array allocation) for why one version runs faster or slower than another. Usually, one has to work with the full program and try various improvements.

I reduced the loop counts in your three programs from 50000 to 5000, and you can see below the timings that I obtained (on an i7-10710U). We could run a profiler and we could look at the assembly code, but those are more useful when performed on the real code than this toy example. To me, the results do not appear useful.

Run times for 5000 iterations (seconds)

                 Prog-1   Prog-2   Prog-3
ifort /Qxhost    1.441    1.458    1.292
gfortran -O2     1.790    3.323    3.788
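
One way to make timings of a small test less fragile is to repeat the measurement several times and look at the spread rather than at a single number. The following is only a minimal sketch in standard Fortran: the subroutine `work`, the repeat count `nrep`, and the loop length inside `work` are hypothetical placeholders, not part of the posted programs.

```fortran
program time_repeats
  use, intrinsic :: iso_fortran_env, only: int64, real64
  implicit none
  integer, parameter :: nrep = 7          ! number of repeated measurements (arbitrary choice)
  integer(int64) :: c0, c1, rate
  real(real64) :: secs(nrep), mean, sdev, sink
  integer :: r

  call system_clock(count_rate=rate)
  sink = 0.0_real64

  do r = 1, nrep
    call system_clock(c0)
    call work(sink)
    call system_clock(c1)
    secs(r) = real(c1 - c0, real64) / real(rate, real64)
  end do

  mean = sum(secs) / nrep
  sdev = sqrt(sum((secs - mean)**2) / (nrep - 1))   ! sample standard deviation

  write(*,'(a,3(1x,es10.3))') 'min / mean / std dev (s):', minval(secs), mean, sdev
  ! Print the checksum so the compiler cannot discard the timed work.
  write(*,*) 'checksum:', sink

contains

  subroutine work(acc)
    ! Hypothetical stand-in for the code under test.
    real(real64), intent(inout) :: acc
    integer :: i
    do i = 1, 2000000
      acc = acc + sqrt(real(i, real64))
    end do
  end subroutine work

end program time_repeats
```

If the standard deviation is comparable to the difference between two variants, that difference probably should not be interpreted.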

Hahaha · 12-09-2020

Thanks. I just wanted to use this short piece of code to illustrate the problem. I did notice that different optimization options can result in very different performance. However, it is still strange that allocating once can be much slower than allocating the array in the do-loop.

I tried the /Qxhost option today. But on my computer, for this short example code, Method 1 is still the fastest, while Methods 2 and 3 are still almost the same. For my real code, /Qxhost is even slower than /O2. Are there any documents that describe these optimization options in detail? I am doing scientific calculations, and the real code is too long. Sometimes each case is run only once. Thus, it is better to know why one option is better than another than to just pick one.

andrew_4619 · 12-09-2020

Analyse the code with VTune for the three cases to see where the time is spent. But you are better off doing such things on a real application, as what you learn from your tests will probably not apply in the same way and will thus be of little benefit.
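
When no profiler is at hand, a coarse first look at where the time goes can be had by timing the pieces separately. The sketch below is not a substitute for VTune. It reuses the loop structure of the posted test with static arrays, runs one pass that only calls `getX` and a second pass that also performs the `matmul` update, and prints both times. The reduced count `niter`, the internal-procedure layout, and the `sink` checksum are assumptions made for this illustration.

```fortran
program where_is_the_time
  use, intrinsic :: iso_fortran_env, only: int64, real64
  implicit none
  integer, parameter :: n = 50
  integer, parameter :: niter = 500     ! reduced outer loop count (assumption, for a quick run)
  real(real64) :: x(1,n), y(n,n), t, sink, t_getx, t_both
  integer(int64) :: c0, c1, rate
  integer :: i, j

  call system_clock(count_rate=rate)
  sink = 0.0_real64

  ! Pass 1: only fill x, no matrix update.
  call system_clock(c0)
  do i = 1, niter
    t = real(i, real64) * 2.0_real64
    do j = 1, 200
      call getX(x, t, j)
      sink = sink + x(1,n)
    end do
  end do
  call system_clock(c1)
  t_getx = real(c1 - c0, real64) / real(rate, real64)

  ! Pass 2: fill x and accumulate the matmul term, as in subroutine A.
  call system_clock(c0)
  do i = 1, niter
    t = real(i, real64) * 2.0_real64
    y = 0.0_real64
    do j = 1, 200
      call getX(x, t, j)
      y = y + matmul(transpose(x) + real(j, real64)**2, x)
    end do
    sink = sink + y(1,1)
  end do
  call system_clock(c1)
  t_both = real(c1 - c0, real64) / real(rate, real64)

  write(*,'(a,f8.3)') 'getX only (s):            ', t_getx
  write(*,'(a,f8.3)') 'getX + matmul update (s): ', t_both
  write(*,*) 'checksum:', sink          ! keeps the timed work from being optimized away

contains

  subroutine getX(x, t, j)
    real(real64), intent(out) :: x(1,n)
    real(real64), intent(in)  :: t
    integer, intent(in)       :: j
    integer :: i
    do i = 1, n
      x(1,i) = real(i + j, real64) * t**1.5_real64
    end do
  end subroutine getX

end program where_is_the_time
```

The difference between the two times gives a rough estimate of the cost of the `matmul` update. As with the toy programs themselves, the numbers may not carry over to a real application.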

mecej4 · 12-09-2020

I think that you are drawing an incorrect conclusion in assuming that allocation consumes a significant portion of the run time. It is likely that 95 percent of the time is consumed in the subroutine A.
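
The cost of the allocation itself can be checked in isolation. The sketch below times 50000 allocate/deallocate pairs of an x(1,50) array, the number of allocations Method 2 performs (one per call of A). The program name and the `sink` checksum are assumptions made for this illustration, and the result will depend on the compiler and runtime library.

```fortran
program allocation_cost
  use, intrinsic :: iso_fortran_env, only: int64, real64
  implicit none
  integer, parameter :: ncalls = 50000   ! one allocation per call of A in Method 2
  integer :: n, i
  real(real64), allocatable :: x(:,:)
  real(real64) :: sink
  integer(int64) :: c0, c1, rate

  n = 50
  sink = 0.0_real64

  call system_clock(count_rate=rate)
  call system_clock(c0)
  do i = 1, ncalls
    allocate(x(1,n))
    x = real(i, real64)        ! touch the memory so the allocation is really used
    sink = sink + x(1,n)
    deallocate(x)
  end do
  call system_clock(c1)

  write(*,'(a,es10.3)') 'time for all allocate/deallocate pairs (s): ', &
       real(c1 - c0, real64) / real(rate, real64)
  write(*,*) 'checksum:', sink
end program allocation_cost
```

Comparing this time with the total run time of Method 2 shows directly what share of the run can be attributed to allocation.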

You can see the expansion of the code generation and optimization options by using the additional option /#, or by requesting a compiler listing.

JohnNichols · 12-09-2020

I have a program that runs millions of times on a Core i3. We record the loop time, which is about 8 seconds, using a timer and SQL Server, and the loop time can vary quite a lot for exactly the same code. Your results are not surprising. You would need to play with the code to find out what is causing the delays, and that can be time-consuming and interesting. At six million replicates we have a good idea of the average and standard deviation. If you worry about this sort of stuff, buy a faster computer.

Hahaha · 12-14-2020

Thanks, JohnNichols. You are right, although it would be really interesting to figure out why. For now, I will just let it go and live with it. A faster computer is the simplest way to solve it.

Hahaha · 12-14-2020

Thanks, mecej4. matmul takes most of the time. It seems that matmul takes a different amount of time with an allocated array than with a static array.
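
Since the inner dimension of the product is 1, the expression matmul( transpose(x) + dble(j)**2, x ) is an outer product, and element (a,b) of the result is (x(1,a) + j**2) * x(1,b). It can therefore be written as two plain loops without array temporaries. The sketch below only checks that the two forms agree. The program name, the test values in `x`, and the choice `j = 3` are assumptions made for this illustration. Whether the loop form is actually faster depends on the compiler and options, so it would have to be timed in the real code.

```fortran
program outer_product_update
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
  integer, parameter :: n = 50
  real(real64) :: x(1,n), y_matmul(n,n), y_loops(n,n), c
  integer :: a, b, j

  j = 3                                   ! arbitrary test value
  do a = 1, n
    x(1,a) = real(a + j, real64) * 0.5_real64
  end do
  c = real(j, real64)**2

  ! Form used in the posted subroutine A.
  y_matmul = matmul(transpose(x) + c, x)

  ! Same result written element by element; the inner loop runs over the
  ! first index so that memory is traversed in column-major order.
  do b = 1, n
    do a = 1, n
      y_loops(a,b) = (x(1,a) + c) * x(1,b)
    end do
  end do

  write(*,*) 'max abs difference:', maxval(abs(y_matmul - y_loops))
end program outer_product_update
```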