When using allocatable arrays, my program is much slower than the version using static memory allocation. The program itself is too long to post here, so I tried a few small test codes, and the results are a little strange. I tried three cases:
Method 1: takes 28.98s
    module module_size_is_defined
        implicit none
        integer(4) :: n
    end module

    program main
        use module_size_is_defined
        implicit none
        integer(4) :: i
        real(8) :: y(50,50), z(50,50), t
        n = 50
        z = 0.0D0                 ! initialize the accumulator
        do i = 1, 50000
            t = dble(i) * 2.0D0
            call A(y, t)
            z = z + y
        end do
        write(*,*) z(1,1)
    end

    subroutine A(y, t)
        use module_size_is_defined
        implicit none
        real(8), intent(out) :: y(n,n)
        real(8), intent(in)  :: t
        integer(4) :: j
        real(8) :: x(1,50)        ! statically sized work array
        y = 0.0D0
        do j = 1, 200
            call getX(x, t, j)
            y = y + matmul(transpose(x) + dble(j)**2, x)
        end do
    end subroutine A

    subroutine getX(x, t, j)
        use module_size_is_defined
        implicit none
        real(8), intent(out)   :: x(1,n)
        real(8), intent(in)    :: t
        integer(4), intent(in) :: j
        integer(4) :: i
        do i = 1, n
            x(1,i) = dble(i+j) * t**1.5D0
        end do
    end subroutine getX
Method 2: takes 30.56s
    module module_size_is_defined
        implicit none
        integer(4) :: n
    end module

    program main
        use module_size_is_defined
        implicit none
        integer(4) :: i
        real(8) :: y(50,50), z(50,50), t
        n = 50
        z = 0.0D0                 ! initialize the accumulator
        do i = 1, 50000
            t = dble(i) * 2.0D0
            call A(y, t)
            z = z + y
        end do
        write(*,*) z(1,1)
    end

    subroutine A(y, t)
        use module_size_is_defined
        implicit none
        real(8), intent(out) :: y(n,n)
        real(8), intent(in)  :: t
        integer(4) :: j
        real(8), allocatable :: x(:,:)
        allocate(x(1,n))          ! allocated on every call, freed on return
        y = 0.0D0
        do j = 1, 200
            call getX(x, t, j)
            y = y + matmul(transpose(x) + dble(j)**2, x)
        end do
    end subroutine A

    subroutine getX(x, t, j)
        use module_size_is_defined
        implicit none
        real(8), intent(out)   :: x(1,n)
        real(8), intent(in)    :: t
        integer(4), intent(in) :: j
        integer(4) :: i
        do i = 1, n
            x(1,i) = dble(i+j) * t**1.5D0
        end do
    end subroutine getX
Method 3: takes 78.72s
    module module_size_is_defined
        implicit none
        integer(4) :: n
    end module

    module module_array_is_allocated
        use module_size_is_defined
        implicit none
        real(8), allocatable, save :: x(:,:)
    contains
        subroutine init
            allocate(x(1,n))      ! allocated once, lives for the whole run
        end subroutine
    end module module_array_is_allocated

    program main
        use module_size_is_defined
        use module_array_is_allocated
        implicit none
        integer(4) :: i
        real(8) :: y(50,50), z(50,50), t
        n = 50
        call init
        z = 0.0D0                 ! initialize the accumulator
        do i = 1, 50000
            t = dble(i) * 2.0D0
            call A(y, t)
            z = z + y
        end do
        write(*,*) z(1,1)
    end

    subroutine A(y, t)
        use module_size_is_defined
        use module_array_is_allocated
        implicit none
        real(8), intent(out) :: y(n,n)
        real(8), intent(in)  :: t
        integer(4) :: j
        y = 0.0D0
        do j = 1, 200
            call getX(x, t, j)    ! x comes from the module
            y = y + matmul(transpose(x) + dble(j)**2, x)
        end do
    end subroutine A

    subroutine getX(x, t, j)
        use module_size_is_defined
        implicit none
        real(8), intent(out)   :: x(1,n)
        real(8), intent(in)    :: t
        integer(4), intent(in) :: j
        integer(4) :: i
        do i = 1, n
            x(1,i) = dble(i+j) * t**1.5D0
        end do
    end subroutine getX
For this simple problem, Methods 1 and 2 take almost the same time, while Method 3 is much slower. But Method 3 should be better than Method 2, since it allocates x(1,n) only once, right? Yet it is much slower. In my earlier program, however, Method 2 took almost the same time as Method 3, even with the same compile options. All codes are compiled with the -O2 option.
Here are my questions:
1. Why is Method 2 even faster than Method 3?
2. Any ideas or suggestions for allocating the arrays in a more efficient manner to reduce the performance penalty? I want to allocate the arrays dynamically.
Thanks
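As a side note, a minimal sketch of how the allocation cost could be measured separately from the matmul cost (loop counts and array shapes here are illustrative, not taken from the real program):

```fortran
! Micro-benchmark sketch: time 50000 allocate/deallocate pairs against
! 50000 matmul calls of the same shape as in the test codes above.
program time_alloc
    implicit none
    integer, parameter :: n = 50
    real(8), allocatable :: x(:,:)
    real(8) :: a(n,1), b(1,n), c(n,n)
    real(8) :: t0, t1, t2
    integer :: i

    call random_number(a)
    call random_number(b)

    call cpu_time(t0)
    do i = 1, 50000
        allocate(x(1,n))
        x(1,1) = dble(i)        ! touch the memory so the loop is not removed
        deallocate(x)
    end do
    call cpu_time(t1)

    c = 0.0D0
    do i = 1, 50000
        c = c + matmul(a, b)    ! (n,1) x (1,n) -> (n,n), as in subroutine A
    end do
    call cpu_time(t2)

    write(*,*) 'allocate/deallocate:', t1 - t0, 's'
    write(*,*) 'matmul:             ', t2 - t1, 's'
end program time_alloc
```

If the first number is tiny compared with the second, the allocation itself is not where the time goes.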
It is tricky and unreliable to time the execution of a short piece of code (which does nothing particularly useful) by repeating it thousands of times. Nor should one jump to simple explanations for why one version runs faster or slower than another (such as changing from static to dynamic array allocation). Usually, one has to work with the full program and try various improvements.
I reduced the loop counts in your three programs from 50000 to 5000, and you can see below the timings I obtained (on an i7-10710U). We could run a profiler, and we could look at the assembly code, but those are more useful when applied to the real code than to this toy example. To me, the results do not appear informative.
Run times for 5000 iterations (seconds):

|               | Prog-1 | Prog-2 | Prog-3 |
|---------------|--------|--------|--------|
| ifort /Qxhost | 1.441  | 1.458  | 1.292  |
| gfortran -O2  | 1.790  | 3.323  | 3.788  |
Thanks. I just wanted to use this short piece of code to illustrate the problem. I have noticed that different optimization options can result in very different performance. However, it is still strange that allocating once can be much slower than allocating the array inside the do loop.
I tried the /Qxhost option today, but on my computer, for this short example code, Method 1 is still the fastest while Methods 2 and 3 are still almost the same. For my real code, /Qxhost is even slower than /O2. Are there any documents that describe these optimization options in detail? I am doing scientific calculations, and the real code is too long; sometimes each case is run only once. Thus, it is better to know why one option is better than another instead of just picking one.
Analyse the code with VTune for the three cases to see where the time is spent. But you are better off doing such things on a real application, as what you might learn from these toy tests will probably not apply in the same way and will thus be of little benefit.
I think that you are drawing an incorrect conclusion in attributing a significant portion of the run time to allocation. It is likely that 95 percent of the time is spent in subroutine A.
You can see the expansion of the code generation and optimization options by using the additional option /#, or by requesting a compiler listing.
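For example (illustrative command lines; the /# option is as mentioned above, and the exact spelling of the report flags should be checked against your compiler version's documentation):

```shell
# Show how the driver expands the options and invokes the compiler stages:
ifort /# /O2 prog1.f90

# Request an optimization report so you can see what /O2 actually did:
ifort /O2 /Qopt-report:2 prog1.f90
```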
I have a program that runs millions of times on a Core i3. We record the loop time (about 8 seconds) using a timer and SQL Server, and the loop time can vary quite a lot for exactly the same code. Your results are not surprising; you would need to play with the code to find out what is causing the delays, which can be time-consuming and interesting. At six million replicates we have a good idea of the average and standard deviation. If you worry about this sort of thing, buy a faster computer.
Thanks, JohnNichols. You are right, but it would really be interesting to figure out why. For now, I will just let it go and live with it. A faster computer is the simplest way to solve it.
Thanks, mecej4. matmul takes most of the time. It seems that matmul takes a different amount of time when operating on an allocated array than on a static array.
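One pattern worth trying (a sketch only, reusing the names from the test codes above and not verified against the real program): allocate the workspace once in the main program and pass it down as an argument, so that inside A the compiler sees an explicit-shape dummy rather than a module-level allocatable.

```fortran
! Sketch: one allocation for the whole run; x is a caller-owned workspace
! passed down as an explicit-shape dummy, which compilers can often treat
! like a static array when optimizing the matmul.
program main
    implicit none
    integer, parameter :: n = 50
    real(8), allocatable :: x(:,:)
    real(8) :: y(n,n), z(n,n), t
    integer :: i
    allocate(x(1,n))              ! allocated once, before the loop
    z = 0.0D0
    do i = 1, 5000
        t = dble(i) * 2.0D0
        call A(y, x, t, n)
        z = z + y
    end do
    write(*,*) z(1,1)
end program main

subroutine A(y, x, t, n)
    implicit none
    integer, intent(in)    :: n
    real(8), intent(out)   :: y(n,n)
    real(8), intent(inout) :: x(1,n)   ! caller-owned workspace
    real(8), intent(in)    :: t
    integer :: j
    y = 0.0D0
    do j = 1, 200
        call getX(x, t, j, n)
        y = y + matmul(transpose(x) + dble(j)**2, x)
    end do
end subroutine A

subroutine getX(x, t, j, n)
    implicit none
    integer, intent(in)    :: n, j
    real(8), intent(out)   :: x(1,n)
    real(8), intent(in)    :: t
    integer :: i
    do i = 1, n
        x(1,i) = dble(i+j) * t**1.5D0
    end do
end subroutine getX
```

Whether this helps will depend on the compiler; timing it against Methods 2 and 3 on the real code would tell.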