- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
The following code works very slow when the arrays are declared allocatable
> real , dimension (:,:), allocatable :: p,pp
> allocate ( p(x,y) , pp(x,y) )
- allocatable test
ifx tttt.f90
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2024.2.0 Build 20240602
Copyright (C) 1985-2024 Intel Corporation. All rights reserved.
Microsoft (R) Incremental Linker Version 14.32.31332.0
Copyright (C) Microsoft Corporation. All rights reserved.
-out:tttt.exe
-subsystem:console
tttt.obj
>tttt
Steps done = 1000
Steps done = 2000
Steps done = 3000
Steps done = 4000
Steps done = 5000
Steps done = 6000
Steps done = 7000
Steps done = 8000
Steps done = 9000
Steps done = 10000
Code time is = 89.00400 seconds
The code
module testing_pure
implicit none
contains
pure function pure_function ( r,i, j, pi, pj, mi, mj,x,y )
implicit none
integer , intent ( in ) :: x,y
real , dimension ( x, y), intent ( in ) :: r
real , dimension ( x, y ) :: pure_function
integer ,intent ( in ) :: i, j, pj, mj, pi,mi
pure_function(i,j) = r(pi,j) + r(mi,j) + r(i,mj) + r(i,pj) - 4.0*r(i,j)
end function pure_function
end module testing_pure
program test_pure_function
use testing_pure
implicit none
integer , parameter :: x = 100
integer , parameter :: y = 100
integer :: counter,i,j,pi,mi,pj,mj
integer :: rate, stop_count, start_count
real :: computed_time
!real, dimension (x,y) :: p,pp
real , dimension (:,:), allocatable :: p,pp
allocate ( p(x,y) , pp(x,y) )
call random_number (p)
! ==================
call system_clock (count=start_count , count_rate=rate)
do counter = 1, 10000
do j = 1, y
do i =1, x
pj = j + 1
mj = j - 1
pi = i + 1
mi = i - 1
if ( mi == 0 ) mi = x
if ( pi == ( x + 1 ) ) pi = 1
if ( mj == 0 ) mj = y
if ( pj == ( y + 1 ) ) pj = 1
!some computation
pp = pure_function ( p , i , j, pi, pj, mi, mj, x, y )
end do
end do
if ( mod( counter, 1000 ) .eq. 0 ) print *, ' Steps done = ', counter
end do
call system_clock (count=stop_count)
computed_time = real(max(stop_count - start_count , 1_8 )) /real(rate)
!=========
print*, ' Code time is = ', computed_time ,' seconds'
end program test_pure_function
- without allocatable
> real, dimension (x,y) :: p,pp
it runs very fast
ifx tttt.f90
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2024.2.0 Build 20240602
Copyright (C) 1985-2024 Intel Corporation. All rights reserved.
Microsoft (R) Incremental Linker Version 14.32.31332.0
Copyright (C) Microsoft Corporation. All rights reserved.
-out:tttt.exe
-subsystem:console
tttt.obj
>tttt
Steps done = 1000
Steps done = 2000
Steps done = 3000
Steps done = 4000
Steps done = 5000
Steps done = 6000
Steps done = 7000
Steps done = 8000
Steps done = 9000
Steps done = 10000
Code time is = 6.1999999E-02 seconds
with gfortran compiler
there is no change in performance i.e.
- with allocatable
>gfortran tttt.f90 -o t
>t
Steps done = 1000
Steps done = 2000
Steps done = 3000
Steps done = 4000
Steps done = 5000
Steps done = 6000
Steps done = 7000
Steps done = 8000
Steps done = 9000
Steps done = 10000
Code time is = 0.968999982 seconds
- without allocatable
>gfortran tttt.f90 -o t
>t
Steps done = 1000
Steps done = 2000
Steps done = 3000
Steps done = 4000
Steps done = 5000
Steps done = 6000
Steps done = 7000
Steps done = 8000
Steps done = 9000
Steps done = 10000
Code time is = 1.07799995 seconds
So what makes it slow with allocatable arrays ?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I suggest the following:
- If your work has any $$ attached to it (profit, funding, school research, etc.) and it's reliant on Intel Fortran in any way, procure Intel Premier Support and follow up on this case via Intel Online Service Center for detailed engagement on various aspects around this issue.
- Look into using Compiler Explorer (godbolt.org) to evaluate using various compilers and to analyze compiler response by studying the generated Assembly output,
- Note excessive white space and indentation can be extremely harmful, especially to any colleagues who have to work on your code,
- Note much of the Fortran compilers now rely on C/C++ - frontend, lowering to backend, etc. - and compiler developers tend to be proficient in these other languages but perhaps not as much as in Fortran. Yet the Fortran standard is somewhat weak when it comes for FUNCTION subprograms, specifically around the subject of Copy Elision and Return Value Optimization when it comes to function results. This can have a huge impact with function results that involve a lot of data. This can be critical in poor performance with Fortran compilers, especially due to special semantics in Fortran (RHS evaluation, allocation on assignment, etc.) that can come across as quite "foreign" to Fortran compiler writers and where the standard offers no guidance that can help with performance. Suboptimal copying can come into play.
- Given #4, consider using SUBROUTINE subprograms when objects with a lot of data are involved, meaning be careful with the data copy burden ,
- Be careful with instrumentation of unit tests in Fortran to do "profiling", avoid the risk of the compiler optimizing away everything, and avoid measuring any IO. This can all prove misleading. Note a really smart compiler would have optimized away your entire code and not done any computations and shown a time of 0 seconds in all cases because it would have recognized the computations do not affect any subsequent code instructions. In the variant below, a subsequent PRINT statement toward pp(42,43) can help prevent a processor from doing so.
In the meantime, consider reviewing a more simple-minded variant of your code and whether there are any ideas in it:
module m
contains
pure function update( r_pi_j, r_mi_j, r_i_mj, r_i_pj, r_i_j ) result( r )
! Argument list
real, intent(in) :: r_pi_j
real, intent(in) :: r_mi_j
real, intent(in) :: r_i_mj
real, intent(in) :: r_i_pj
real, intent(in) :: r_i_j
! Function result
real :: r
r = r_pi_j + r_mi_j + r_i_mj + r_i_pj - 4.0*r_i_j
end function
end module
program test
use, intrinsic :: iso_fortran_env, only : I8 => int64
use m
integer, parameter :: R8 = selected_real_kind( p=12 )
integer , parameter :: x = 100
integer , parameter :: y = 100
real(R8) :: t1, t2, t
integer :: counter, i, j, pi, mi, pj, mj
real , dimension (:,:), allocatable :: p, pp
allocate ( p(x,y) , pp(x,y) )
! ==================
t = 0.0
do counter = 1, 10000
call random_number( p )
call cpu_t( t1 )
do j = 1, y
do i = 1, x
pj = j + 1
mj = j - 1
pi = i + 1
mi = i - 1
if ( mi == 0 ) mi = x
if ( pi == ( x + 1 ) ) pi = 1
if ( mj == 0 ) mj = y
if ( pj == ( y + 1 ) ) pj = 1
!some computation
pp(i,j) = update( p(pi,j), p(mi,j), p(i,mj), p(i,pj), p(i,i) )
end do
end do
call cpu_t( t2 )
t = t + (t2 - t1)
if ( mod( counter, 1000 ) == 0 ) print *, ' Steps done = ', counter, '; pp(42,43): ', pp(42,43)
end do
! ==================
print *, ' Code time is = ', t ,' seconds'
contains
subroutine cpu_t( time )
! Argument list
real(R8), intent(inout) :: time
! Local variables
integer(I8) :: tick
integer(I8) :: rate
call system_clock (tick, rate)
time = real(tick, kind=kind(time) ) / real(rate, kind=kind(time) )
return
end subroutine
end program
C:\temp>ifx /free /standard-semantics p.f
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2024.2.0 Build 20240602
Copyright (C) 1985-2024 Intel Corporation. All rights reserved.
Microsoft (R) Incremental Linker Version 14.36.32537.0
Copyright (C) Microsoft Corporation. All rights reserved.
-out:p.exe
-subsystem:console
p.obj
C:\temp>p.exe
Steps done = 1000 ; pp(42,43): -0.3779147
Steps done = 2000 ; pp(42,43): 1.6536832E-02
Steps done = 3000 ; pp(42,43): 1.551260
Steps done = 4000 ; pp(42,43): 0.2452765
Steps done = 5000 ; pp(42,43): 0.6278389
Steps done = 6000 ; pp(42,43): -1.001162
Steps done = 7000 ; pp(42,43): 1.457952
Steps done = 8000 ; pp(42,43): -0.3233254
Steps done = 9000 ; pp(42,43): -0.7149973
Steps done = 10000 ; pp(42,43): 1.124510
Code time is = 8.800101280212402E-002 seconds
C:\temp>
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Are the allocates withing the timed loop? Repeat allocate/deallocate will always take some time. The load times may be different but you are not timing that. The saving with allocate is that you can grab the memory you need not your best case guess of max memory that you think might be needed.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
The arrays are already allocated before going to the time loop. I do not know if the allocatable attribute remains active for the whole code duration.
Also does that mean intel compiler (ifx) handles allocatable arrays differently than the gfortran ? I do not see any performance issues with gfortran. If that is the case how to make a portable code with allocatables ?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Do you use heap arrays option? Allocates are always on the heap. Stack is usually faster but you are limited by stack size.
Something clearly goes wrong with the first test case! However I will note that having now looked at you code a smart optimiser could ditch your inner test loop as it does nothing other than set counter to the loop exit value if I have read that correctly.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I have a newer version of the previous code.
Modification
I added pure subroutine and it was called from the time loop too.
The code (version 2)
! ifx allocatable array test
!
!
! version 2 :
! pure subroutine is used here.
module testing_pure
implicit none
contains
function pure_function ( r,i, j, pi, pj, mi, mj,x,y )
implicit none
integer , intent ( in ) :: x,y
real , dimension ( x, y), intent ( in ) :: r
real , dimension ( x, y ) :: pure_function
integer ,intent ( in ) :: i, j, pj, mj, pi,mi
pure_function(i,j) = r(pi,j) + r(mi,j) + r(i,mj) + r(i,pj) - 4.0*r(i,j)
end function pure_function
pure subroutine pure_test ( rr, r,i, j, pi, pj, mi, mj,x,y )
implicit none
integer , intent ( in ) :: x,y
real , dimension ( x, y), intent ( in ) :: r
real , dimension ( x, y ), intent (out) :: rr
integer ,intent ( in ) :: i, j, pj, mj, pi,mi
rr(i,j) = r(pi,j) + r(mi,j) + r(i,mj) + r(i,pj) - 4.0*r(i,j)
end subroutine pure_test
end module testing_pure
program test_pure_function
use testing_pure
implicit none
integer , parameter :: x = 100
integer , parameter :: y = 100
integer :: counter,i,j,pi,mi,pj,mj
integer :: rate, stop_count, start_count
real :: computed_time
! real, dimension (x,y) :: p,pp
real , dimension (:,:), allocatable :: p,pp
allocate ( p(x,y) , pp(x,y) )
call random_number (p)
! ==================
call system_clock (count=start_count , count_rate=rate)
do counter = 1, 10000
do j = 1, y
do i =1, x
pj = j + 1
mj = j - 1
pi = i + 1
mi = i - 1
if ( mi == 0 ) mi = x
if ( pi == ( x + 1 ) ) pi = 1
if ( mj == 0 ) mj = y
if ( pj == ( y + 1 ) ) pj = 1
!some computation
! pp = pure_function ( p , i , j, pi, pj, mi, mj, x, y )
call pure_test ( pp, p,i, j, pi, pj, mi, mj,x,y )
end do
end do
if ( mod( counter, 1000 ) .eq. 0 ) print *, ' Steps done = ', counter
end do
call system_clock (count=stop_count)
computed_time = real(max(stop_count - start_count , 1_8 )) /real(rate)
!=========
print*, ' Code time is = ', computed_time ,' seconds'
end program test_pure_function
test 1
I run the test as before
>ifx tttt1.f90
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2024.2.0 Build 20240602
Copyright (C) 1985-2024 Intel Corporation. All rights reserved.
Microsoft (R) Incremental Linker Version 14.32.31332.0
Copyright (C) Microsoft Corporation. All rights reserved.
-out:tttt1.exe
-subsystem:console
tttt1.obj
>tttt1
Steps done = 1000
Steps done = 2000
Steps done = 3000
Steps done = 4000
Steps done = 5000
Steps done = 6000
Steps done = 7000
Steps done = 8000
Steps done = 9000
Steps done = 10000
Code time is = 0.1090000 seconds
It seems like the calling of pure routine has no performance loss with the same allocatable arrays. So it seems like it has something to do with function.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
OK I did your second test and indeed the function and subroutine versions have run times that are 1000x different!!
Just to be sure I added a couple of lines at the end (outside the timer) to print a random element of the pp array and that made no difference so there is not some code elimination optimisation anomaly.
Dear compiler team there is indeed something very BAD happening!
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I used VTune on the allocatable version:
Does look like massive amounts of unnecessary memory copying. Its worrying that an allocatable array can cause this much performance loss. I expect many of us will be watching for a resolution.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
hmmm that last post made me look harder at the example. The function is returning a 100x100 real array but only making assignment to a single element within it, that doesn't seem to be a good thing to do. None the less it should be making 10000 assignments on function return. The loc of the array in the function is not the same loc as the pp array in the caller as one would expect.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
The pure function coding style, while compiling the function, the compiler has no way of determining if the result of the function is possibly an alias of an input argument. Therefore, it copies the/an array. In the call coding style, the programmer's requirement is to not use argument aliases. I do not know if the Fortran Specification states anything about function output aliasing an input argument and therefore presumes it may be a possibility, thus requiring the copying of the array (use of array temporary).
This is presumption on my part.
Note, in your first code example (using function), add ", intent(out)"
real , dimension ( x, y ), intent (out) :: pure_function
See if this makes a difference.
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I tried and received error message
intel
ifx tttt.f90
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2024.2.0 Build 20240602
Copyright (C) 1985-2024 Intel Corporation. All rights reserved.
tttt.f90(21): error #6413: This global name is invalid in this context. [PURE_FUNCTION]
real , dimension ( x, y ), intent ( out) :: pure_function
----------------------------------------------------^
compilation aborted for tttt.f90 (code 1)
gfortran
gfortran tttt.f90
tttt.f90:16:4:
16 | function pure_function ( r,i, j, pi, pj, mi, mj,x,y )
| 1
Error: Symbol at (1) is not a DUMMY variable
tttt.f90:38:9:
38 | use testing_pure
| 1
Fatal Error: Cannot open module file 'testing_pure.mod' for reading at (1): No such file or directory
compilation terminated.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I suggest the following:
- If your work has any $$ attached to it (profit, funding, school research, etc.) and it's reliant on Intel Fortran in any way, procure Intel Premier Support and follow up on this case via Intel Online Service Center for detailed engagement on various aspects around this issue.
- Look into using Compiler Explorer (godbolt.org) to evaluate using various compilers and to analyze compiler response by studying the generated Assembly output,
- Note excessive white space and indentation can be extremely harmful, especially to any colleagues who have to work on your code,
- Note much of the Fortran compilers now rely on C/C++ - frontend, lowering to backend, etc. - and compiler developers tend to be proficient in these other languages but perhaps not as much as in Fortran. Yet the Fortran standard is somewhat weak when it comes for FUNCTION subprograms, specifically around the subject of Copy Elision and Return Value Optimization when it comes to function results. This can have a huge impact with function results that involve a lot of data. This can be critical in poor performance with Fortran compilers, especially due to special semantics in Fortran (RHS evaluation, allocation on assignment, etc.) that can come across as quite "foreign" to Fortran compiler writers and where the standard offers no guidance that can help with performance. Suboptimal copying can come into play.
- Given #4, consider using SUBROUTINE subprograms when objects with a lot of data are involved, meaning be careful with the data copy burden ,
- Be careful with instrumentation of unit tests in Fortran to do "profiling", avoid the risk of the compiler optimizing away everything, and avoid measuring any IO. This can all prove misleading. Note a really smart compiler would have optimized away your entire code and not done any computations and shown a time of 0 seconds in all cases because it would have recognized the computations do not affect any subsequent code instructions. In the variant below, a subsequent PRINT statement toward pp(42,43) can help prevent a processor from doing so.
In the meantime, consider reviewing a more simple-minded variant of your code and whether there are any ideas in it:
module m
contains
pure function update( r_pi_j, r_mi_j, r_i_mj, r_i_pj, r_i_j ) result( r )
! Argument list
real, intent(in) :: r_pi_j
real, intent(in) :: r_mi_j
real, intent(in) :: r_i_mj
real, intent(in) :: r_i_pj
real, intent(in) :: r_i_j
! Function result
real :: r
r = r_pi_j + r_mi_j + r_i_mj + r_i_pj - 4.0*r_i_j
end function
end module
program test
use, intrinsic :: iso_fortran_env, only : I8 => int64
use m
integer, parameter :: R8 = selected_real_kind( p=12 )
integer , parameter :: x = 100
integer , parameter :: y = 100
real(R8) :: t1, t2, t
integer :: counter, i, j, pi, mi, pj, mj
real , dimension (:,:), allocatable :: p, pp
allocate ( p(x,y) , pp(x,y) )
! ==================
t = 0.0
do counter = 1, 10000
call random_number( p )
call cpu_t( t1 )
do j = 1, y
do i = 1, x
pj = j + 1
mj = j - 1
pi = i + 1
mi = i - 1
if ( mi == 0 ) mi = x
if ( pi == ( x + 1 ) ) pi = 1
if ( mj == 0 ) mj = y
if ( pj == ( y + 1 ) ) pj = 1
!some computation
pp(i,j) = update( p(pi,j), p(mi,j), p(i,mj), p(i,pj), p(i,i) )
end do
end do
call cpu_t( t2 )
t = t + (t2 - t1)
if ( mod( counter, 1000 ) == 0 ) print *, ' Steps done = ', counter, '; pp(42,43): ', pp(42,43)
end do
! ==================
print *, ' Code time is = ', t ,' seconds'
contains
subroutine cpu_t( time )
! Argument list
real(R8), intent(inout) :: time
! Local variables
integer(I8) :: tick
integer(I8) :: rate
call system_clock (tick, rate)
time = real(tick, kind=kind(time) ) / real(rate, kind=kind(time) )
return
end subroutine
end program
C:\temp>ifx /free /standard-semantics p.f
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2024.2.0 Build 20240602
Copyright (C) 1985-2024 Intel Corporation. All rights reserved.
Microsoft (R) Incremental Linker Version 14.36.32537.0
Copyright (C) Microsoft Corporation. All rights reserved.
-out:p.exe
-subsystem:console
p.obj
C:\temp>p.exe
Steps done = 1000 ; pp(42,43): -0.3779147
Steps done = 2000 ; pp(42,43): 1.6536832E-02
Steps done = 3000 ; pp(42,43): 1.551260
Steps done = 4000 ; pp(42,43): 0.2452765
Steps done = 5000 ; pp(42,43): 0.6278389
Steps done = 6000 ; pp(42,43): -1.001162
Steps done = 7000 ; pp(42,43): 1.457952
Steps done = 8000 ; pp(42,43): -0.3233254
Steps done = 9000 ; pp(42,43): -0.7149973
Steps done = 10000 ; pp(42,43): 1.124510
Code time is = 8.800101280212402E-002 seconds
C:\temp>
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- There is no $$ associated at this moment. Of course, that would be the best situation if there is!
- Thanks for sharing the link. This looks interesting, never heard of it before.
- This depends on the editor I think. Some editors show the space quite narrow, some fonts make it very widely spaced. Also with this markup, the preview is different, I feel. So it is still an evolving design architecture for me. Can't decide which way I should keep writing my codes.
- I can't say much about it. It is something I have no knowledge about.
- This is more relevant since Fortran is supposed to translate mathematical Formula. Hence quite natural to evaluate a mathematical function. The intuitive approach is to use Function subprograms. However, my experience (less than five years) is showing using function is not always a right way to solve problem. It may be solvable but at the expense of performance ( another concern and perhaps the main in some situations).
- This I/O insight is nice.
The code variant is useful. This is very tricky when dealing with whole array operations, scalar - array or with index array (my original code). Working with mixed-arrays (whole array and index array) requires some good experience otherwise I feel it is hard to debug.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
@Fortran10 wrote:..
The code variant is useful. This is very tricky when dealing with whole array operations, scalar - array or with index array (my original code). Working with mixed-arrays (whole array and index array) requires some good experience otherwise I feel it is hard to debug.
First, in case you have not done so, look also into ELEMENTAL subclause of Fortran subprograms and the facilities provided by this:
Element-wise Operations on Arrays — Fortran Programming Language (fortran-lang.org)
I suggest you follow up further at this site, Fortran-lang.org:
Fortran Discourse - Fortran open source community (fortran-lang.discourse.group)
Learn — Fortran Programming Language (fortran-lang.org)
Re: "This is very tricky when dealing with whole array operations, scalar - array or with index array (my original code). Working with mixed-arrays (whole array and index array) requires some good experience otherwise I feel it is hard to debug," you will find good guidance from many other Fortran practitioners at the above two links.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
"Note excessive white space and indentation can be extremely harmful "
Back in the day I used to like a few blank lines because it made code more readable, nowadays with editors that have syntax highlighting etc I hate blank lines, I can see less code and have to scroll more. I find it adds pretty much nothing to readability. I suspect other opinions exist.......
As for indentation that is great but if you get beyond the 4th level maybe the code structure needs some thought. Readability and clarity is very important for support and maintenance of a code.

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page