Solved: performance of passing allocatable array to pure function

Fortran10 · ‎07-03-2024

The following code works very slow when the arrays are declared allocatable

> real , dimension (:,:), allocatable :: p,pp
> allocate ( p(x,y) , pp(x,y) )

allocatable test

ifx tttt.f90
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2024.2.0 Build 20240602
Copyright (C) 1985-2024 Intel Corporation. All rights reserved.

Microsoft (R) Incremental Linker Version 14.32.31332.0
Copyright (C) Microsoft Corporation.  All rights reserved.

-out:tttt.exe
-subsystem:console
tttt.obj

>tttt
    Steps done  =          1000
    Steps done  =          2000
    Steps done  =          3000
    Steps done  =          4000
    Steps done  =          5000
    Steps done  =          6000
    Steps done  =          7000
    Steps done  =          8000
    Steps done  =          9000
    Steps done  =         10000
  Code time is     =    89.00400      seconds

The code

module testing_pure
    implicit none
    
    contains
    
    pure function pure_function ( r,i, j, pi, pj, mi, mj,x,y )
        implicit none
        
        integer , intent ( in )  :: x,y
        real , dimension ( x, y), intent ( in ) :: r
        real , dimension ( x, y )               :: pure_function
        integer ,intent ( in ) :: i, j, pj, mj, pi,mi
        
        
        
        pure_function(i,j) =  r(pi,j) + r(mi,j) + r(i,mj) +  r(i,pj) - 4.0*r(i,j) 
        
        
    end function pure_function
    
    
    
end module testing_pure



program test_pure_function
    use testing_pure
    implicit none
    
    integer , parameter :: x = 100
    integer , parameter :: y = 100 
    integer             :: counter,i,j,pi,mi,pj,mj  
    integer             :: rate, stop_count, start_count
    real                :: computed_time
    
    
    !real, dimension (x,y) :: p,pp    
    real , dimension (:,:), allocatable :: p,pp 
    allocate ( p(x,y) , pp(x,y) )
    
    call random_number (p)
    
    ! ==================
    
    
    
    
    call system_clock (count=start_count , count_rate=rate)
    
    
    do counter = 1, 10000
        
        
        do j = 1, y
            do i =1, x 
                
                
                pj = j + 1
                mj = j - 1
                pi = i + 1
                mi = i - 1
                
                if ( mi == 0 ) mi = x
                if ( pi == ( x + 1 ) ) pi = 1
                if ( mj == 0 ) mj = y
                if ( pj == ( y + 1 ) ) pj = 1
                
                
                
                !some computation
                
                
                pp   = pure_function ( p , i , j, pi, pj, mi, mj, x, y )
                
                
                
            end do
        end do
        

        if ( mod( counter, 1000 ) .eq. 0 ) print *, '   Steps done  =  ', counter
        
        
    end do 
    
    
    call system_clock (count=stop_count)
    
    computed_time = real(max(stop_count - start_count , 1_8 )) /real(rate)
    
    
    
    !=========
    print*, ' Code time is     = ',  computed_time ,' seconds'
    
    
    
    
end program test_pure_function

without allocatable

> real, dimension (x,y) :: p,pp

it runs very fast

ifx tttt.f90
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2024.2.0 Build 20240602
Copyright (C) 1985-2024 Intel Corporation. All rights reserved.

Microsoft (R) Incremental Linker Version 14.32.31332.0
Copyright (C) Microsoft Corporation.  All rights reserved.

-out:tttt.exe
-subsystem:console
tttt.obj

>tttt
    Steps done  =          1000
    Steps done  =          2000
    Steps done  =          3000
    Steps done  =          4000
    Steps done  =          5000
    Steps done  =          6000
    Steps done  =          7000
    Steps done  =          8000
    Steps done  =          9000
    Steps done  =         10000
  Code time is     =   6.1999999E-02  seconds

with gfortran compiler

there is no change in performance i.e.

with allocatable

>gfortran tttt.f90 -o t

>t
    Steps done  =          1000
    Steps done  =          2000
    Steps done  =          3000
    Steps done  =          4000
    Steps done  =          5000
    Steps done  =          6000
    Steps done  =          7000
    Steps done  =          8000
    Steps done  =          9000
    Steps done  =         10000
  Code time is     =   0.968999982      seconds

without allocatable

>gfortran tttt.f90 -o t

>t
    Steps done  =          1000
    Steps done  =          2000
    Steps done  =          3000
    Steps done  =          4000
    Steps done  =          5000
    Steps done  =          6000
    Steps done  =          7000
    Steps done  =          8000
    Steps done  =          9000
    Steps done  =         10000
  Code time is     =    1.07799995      seconds

So what makes it slow with allocatable arrays ?

FortranFan · ‎07-04-2024

@Fortran10 ,

I suggest the following:

If your work has any $$ attached to it (profit, funding, school research, etc.) and it's reliant on Intel Fortran in any way, procure Intel Premier Support and follow up on this case via Intel Online Service Center for detailed engagement on various aspects around this issue.
Look into using Compiler Explorer (godbolt.org) to evaluate using various compilers and to analyze compiler response by studying the generated Assembly output,
Note excessive white space and indentation can be extremely harmful, especially to any colleagues who have to work on your code,
Note much of the Fortran compilers now rely on C/C++ - frontend, lowering to backend, etc. - and compiler developers tend to be proficient in these other languages but perhaps not as much as in Fortran. Yet the Fortran standard is somewhat weak when it comes for FUNCTION subprograms, specifically around the subject of Copy Elision and Return Value Optimization when it comes to function results. This can have a huge impact with function results that involve a lot of data. This can be critical in poor performance with Fortran compilers, especially due to special semantics in Fortran (RHS evaluation, allocation on assignment, etc.) that can come across as quite "foreign" to Fortran compiler writers and where the standard offers no guidance that can help with performance. Suboptimal copying can come into play.
Given #4, consider using SUBROUTINE subprograms when objects with a lot of data are involved, meaning be careful with the data copy burden ,
Be careful with instrumentation of unit tests in Fortran to do "profiling", avoid the risk of the compiler optimizing away everything, and avoid measuring any IO. This can all prove misleading. Note a really smart compiler would have optimized away your entire code and not done any computations and shown a time of 0 seconds in all cases because it would have recognized the computations do not affect any subsequent code instructions. In the variant below, a subsequent PRINT statement toward pp(42,43) can help prevent a processor from doing so.

In the meantime, consider reviewing a more simple-minded variant of your code and whether there are any ideas in it:

module m

contains
    
   pure function update( r_pi_j, r_mi_j, r_i_mj, r_i_pj, r_i_j ) result( r )

      ! Argument list
      real, intent(in) :: r_pi_j
      real, intent(in) :: r_mi_j
      real, intent(in) :: r_i_mj
      real, intent(in) :: r_i_pj
      real, intent(in) :: r_i_j
      ! Function result
      real :: r

      r =  r_pi_j + r_mi_j + r_i_mj +  r_i_pj - 4.0*r_i_j 
       
   end function

end module

program test

   use, intrinsic :: iso_fortran_env, only : I8 => int64
   use m
    
   integer, parameter :: R8 = selected_real_kind( p=12 )
   integer , parameter :: x = 100
   integer , parameter :: y = 100 
   real(R8) :: t1, t2, t
   integer :: counter, i, j, pi, mi, pj, mj  
    
   real , dimension (:,:), allocatable :: p, pp
   allocate ( p(x,y) , pp(x,y) )
    
   ! ==================
   t = 0.0
   do counter = 1, 10000
        
      call random_number( p )
      call cpu_t( t1 ) 

      do j = 1, y
         do i = 1, x 
                
            pj = j + 1
            mj = j - 1
            pi = i + 1
            mi = i - 1
            
            if ( mi == 0 ) mi = x
            if ( pi == ( x + 1 ) ) pi = 1
            if ( mj == 0 ) mj = y
            if ( pj == ( y + 1 ) ) pj = 1
            
            !some computation
            pp(i,j) = update( p(pi,j), p(mi,j), p(i,mj), p(i,pj), p(i,i) )
                
            end do
        end do
        call cpu_t( t2 ) 
        t = t + (t2 - t1)

        if ( mod( counter, 1000 ) == 0 ) print *, '   Steps done  =  ', counter, '; pp(42,43): ', pp(42,43)
        
    end do 
    
   ! ==================
   print *, ' Code time is     = ', t ,' seconds'

contains

   subroutine cpu_t( time )

      ! Argument list
      real(R8), intent(inout) :: time

      ! Local variables
      integer(I8) :: tick
      integer(I8) :: rate

      call system_clock (tick, rate)

      time = real(tick, kind=kind(time) ) / real(rate, kind=kind(time) )

      return

   end subroutine
   
end program

C:\temp>ifx /free /standard-semantics p.f
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2024.2.0 Build 20240602
Copyright (C) 1985-2024 Intel Corporation. All rights reserved.

Microsoft (R) Incremental Linker Version 14.36.32537.0
Copyright (C) Microsoft Corporation.  All rights reserved.

-out:p.exe
-subsystem:console
p.obj

C:\temp>p.exe
    Steps done  =   1000 ; pp(42,43):  -0.3779147
    Steps done  =   2000 ; pp(42,43):  1.6536832E-02
    Steps done  =   3000 ; pp(42,43):  1.551260
    Steps done  =   4000 ; pp(42,43):  0.2452765
    Steps done  =   5000 ; pp(42,43):  0.6278389
    Steps done  =   6000 ; pp(42,43):  -1.001162
    Steps done  =   7000 ; pp(42,43):  1.457952
    Steps done  =   8000 ; pp(42,43):  -0.3233254
    Steps done  =   9000 ; pp(42,43):  -0.7149973
    Steps done  =   10000 ; pp(42,43):  1.124510
  Code time is     =  8.800101280212402E-002  seconds

C:\temp>

View solution in original post

andrew_4619 · ‎07-04-2024

Are the allocates withing the timed loop? Repeat allocate/deallocate will always take some time. The load times may be different but you are not timing that. The saving with allocate is that you can grab the memory you need not your best case guess of max memory that you think might be needed.

Fortran10 · ‎07-04-2024

The arrays are already allocated before going to the time loop. I do not know if the allocatable attribute remains active for the whole code duration.

Also does that mean intel compiler (ifx) handles allocatable arrays differently than the gfortran ? I do not see any performance issues with gfortran. If that is the case how to make a portable code with allocatables ?

andrew_4619 · ‎07-04-2024

Do you use heap arrays option? Allocates are always on the heap. Stack is usually faster but you are limited by stack size.

Something clearly goes wrong with the first test case! However I will note that having now looked at you code a smart optimiser could ditch your inner test loop as it does nothing other than set counter to the loop exit value if I have read that correctly.

Fortran10 · ‎07-04-2024

I have a newer version of the previous code.

Modification

I added pure subroutine and it was called from the time loop too.

The code (version 2)


!     ifx allocatable array test
!
!
!    version 2 :
!                pure subroutine is used here.


module testing_pure
    implicit none
    
    contains
    
    function pure_function ( r,i, j, pi, pj, mi, mj,x,y )
        implicit none
        
        integer , intent ( in )  :: x,y
        real , dimension ( x, y), intent ( in ) :: r
        real , dimension ( x, y )               :: pure_function
        integer ,intent ( in ) :: i, j, pj, mj, pi,mi
        
        
        
        pure_function(i,j) =  r(pi,j) + r(mi,j) + r(i,mj) +  r(i,pj) - 4.0*r(i,j) 
        
        
    end function pure_function
    
    
    
    pure subroutine pure_test ( rr, r,i, j, pi, pj, mi, mj,x,y )
        implicit none
        
        integer , intent ( in )  :: x,y
        real , dimension ( x, y), intent ( in ) :: r
        real , dimension ( x, y ), intent (out) :: rr
        integer ,intent ( in ) :: i, j, pj, mj, pi,mi
        
        
        rr(i,j) =  r(pi,j) + r(mi,j) + r(i,mj) +  r(i,pj) - 4.0*r(i,j) 
        
        
    end  subroutine pure_test
    
end module testing_pure



program test_pure_function
    use testing_pure
    implicit none
    
    integer , parameter :: x = 100
    integer , parameter :: y = 100 
    integer             :: counter,i,j,pi,mi,pj,mj  
    integer             :: rate, stop_count, start_count
    real                :: computed_time
    
    
  !     real, dimension (x,y) :: p,pp    
    real , dimension (:,:), allocatable :: p,pp 
    allocate ( p(x,y) , pp(x,y) )
    
    call random_number (p)
    
    
    
    ! ==================
    
    
    
    
    call system_clock (count=start_count , count_rate=rate)
    
    
    do counter = 1, 10000
        
        
        do j = 1, y
            do i =1, x 
                
                
                pj = j + 1
                mj = j - 1
                pi = i + 1
                mi = i - 1
                
                if ( mi == 0 ) mi = x
                if ( pi == ( x + 1 ) ) pi = 1
                if ( mj == 0 ) mj = y
                if ( pj == ( y + 1 ) ) pj = 1
                
                
                
                !some computation
                
                
!                pp   = pure_function ( p , i , j, pi, pj, mi, mj, x, y )
                
                call pure_test ( pp, p,i, j, pi, pj, mi, mj,x,y )
                
                
                
            end do
        end do
        
        
        if ( mod( counter, 1000 ) .eq. 0 ) print *, '   Steps done  =  ', counter
        
        
    end do 
    
    
    call system_clock (count=stop_count)
    
    computed_time = real(max(stop_count - start_count , 1_8 )) /real(rate)
    
    
    
    !=========
    print*, ' Code time is     = ',  computed_time ,' seconds'
    
    
    

end program test_pure_function

test 1

I run the test as before

>ifx tttt1.f90
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2024.2.0 Build 20240602
Copyright (C) 1985-2024 Intel Corporation. All rights reserved.

Microsoft (R) Incremental Linker Version 14.32.31332.0
Copyright (C) Microsoft Corporation.  All rights reserved.

-out:tttt1.exe
-subsystem:console
tttt1.obj

>tttt1
    Steps done  =          1000
    Steps done  =          2000
    Steps done  =          3000
    Steps done  =          4000
    Steps done  =          5000
    Steps done  =          6000
    Steps done  =          7000
    Steps done  =          8000
    Steps done  =          9000
    Steps done  =         10000
  Code time is     =   0.1090000      seconds

It seems like the calling of pure routine has no performance loss with the same allocatable arrays. So it seems like it has something to do with function.

andrew_4619 · ‎07-04-2024

OK I did your second test and indeed the function and subroutine versions have run times that are 1000x different!!

Just to be sure I added a couple of lines at the end (outside the timer) to print a random element of the pp array and that made no difference so there is not some code elimination optimisation anomaly.

Dear compiler team there is indeed something very BAD happening!

Andrew_Smith · ‎07-04-2024

I used VTune on the allocatable version:

Does look like massive amounts of unnecessary memory copying. Its worrying that an allocatable array can cause this much performance loss. I expect many of us will be watching for a resolution.

andrew_4619 · ‎07-04-2024

hmmm that last post made me look harder at the example. The function is returning a 100x100 real array but only making assignment to a single element within it, that doesn't seem to be a good thing to do. None the less it should be making 10000 assignments on function return. The loc of the array in the function is not the same loc as the pp array in the caller as one would expect.

jimdempseyatthecove · ‎07-04-2024

The pure function coding style, while compiling the function, the compiler has no way of determining if the result of the function is possibly an alias of an input argument. Therefore, it copies the/an array. In the call coding style, the programmer's requirement is to not use argument aliases. I do not know if the Fortran Specification states anything about function output aliasing an input argument and therefore presumes it may be a possibility, thus requiring the copying of the array (use of array temporary).

This is presumption on my part.

Note, in your first code example (using function), add ", intent(out)"

real , dimension ( x, y ), intent (out) :: pure_function

See if this makes a difference.

Jim Dempsey

Fortran10 · ‎07-04-2024

I tried and received error message

intel

ifx tttt.f90
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2024.2.0 Build 20240602
Copyright (C) 1985-2024 Intel Corporation. All rights reserved.

tttt.f90(21): error #6413: This global name is invalid in this context.   [PURE_FUNCTION]
        real , dimension ( x, y ), intent ( out) :: pure_function
----------------------------------------------------^
compilation aborted for tttt.f90 (code 1)

gfortran

gfortran tttt.f90
tttt.f90:16:4:

   16 |     function pure_function ( r,i, j, pi, pj, mi, mj,x,y )
      |    1
Error: Symbol at (1) is not a DUMMY variable
tttt.f90:38:9:

   38 |     use testing_pure
      |         1
Fatal Error: Cannot open module file 'testing_pure.mod' for reading at (1): No such file or directory
compilation terminated.

FortranFan · ‎07-04-2024

@Fortran10 ,

I suggest the following:

If your work has any $$ attached to it (profit, funding, school research, etc.) and it's reliant on Intel Fortran in any way, procure Intel Premier Support and follow up on this case via Intel Online Service Center for detailed engagement on various aspects around this issue.
Look into using Compiler Explorer (godbolt.org) to evaluate using various compilers and to analyze compiler response by studying the generated Assembly output,
Note excessive white space and indentation can be extremely harmful, especially to any colleagues who have to work on your code,
Note much of the Fortran compilers now rely on C/C++ - frontend, lowering to backend, etc. - and compiler developers tend to be proficient in these other languages but perhaps not as much as in Fortran. Yet the Fortran standard is somewhat weak when it comes for FUNCTION subprograms, specifically around the subject of Copy Elision and Return Value Optimization when it comes to function results. This can have a huge impact with function results that involve a lot of data. This can be critical in poor performance with Fortran compilers, especially due to special semantics in Fortran (RHS evaluation, allocation on assignment, etc.) that can come across as quite "foreign" to Fortran compiler writers and where the standard offers no guidance that can help with performance. Suboptimal copying can come into play.
Given #4, consider using SUBROUTINE subprograms when objects with a lot of data are involved, meaning be careful with the data copy burden ,
Be careful with instrumentation of unit tests in Fortran to do "profiling", avoid the risk of the compiler optimizing away everything, and avoid measuring any IO. This can all prove misleading. Note a really smart compiler would have optimized away your entire code and not done any computations and shown a time of 0 seconds in all cases because it would have recognized the computations do not affect any subsequent code instructions. In the variant below, a subsequent PRINT statement toward pp(42,43) can help prevent a processor from doing so.

In the meantime, consider reviewing a more simple-minded variant of your code and whether there are any ideas in it:

module m

contains
    
   pure function update( r_pi_j, r_mi_j, r_i_mj, r_i_pj, r_i_j ) result( r )

      ! Argument list
      real, intent(in) :: r_pi_j
      real, intent(in) :: r_mi_j
      real, intent(in) :: r_i_mj
      real, intent(in) :: r_i_pj
      real, intent(in) :: r_i_j
      ! Function result
      real :: r

      r =  r_pi_j + r_mi_j + r_i_mj +  r_i_pj - 4.0*r_i_j 
       
   end function

end module

program test

   use, intrinsic :: iso_fortran_env, only : I8 => int64
   use m
    
   integer, parameter :: R8 = selected_real_kind( p=12 )
   integer , parameter :: x = 100
   integer , parameter :: y = 100 
   real(R8) :: t1, t2, t
   integer :: counter, i, j, pi, mi, pj, mj  
    
   real , dimension (:,:), allocatable :: p, pp
   allocate ( p(x,y) , pp(x,y) )
    
   ! ==================
   t = 0.0
   do counter = 1, 10000
        
      call random_number( p )
      call cpu_t( t1 ) 

      do j = 1, y
         do i = 1, x 
                
            pj = j + 1
            mj = j - 1
            pi = i + 1
            mi = i - 1
            
            if ( mi == 0 ) mi = x
            if ( pi == ( x + 1 ) ) pi = 1
            if ( mj == 0 ) mj = y
            if ( pj == ( y + 1 ) ) pj = 1
            
            !some computation
            pp(i,j) = update( p(pi,j), p(mi,j), p(i,mj), p(i,pj), p(i,i) )
                
            end do
        end do
        call cpu_t( t2 ) 
        t = t + (t2 - t1)

        if ( mod( counter, 1000 ) == 0 ) print *, '   Steps done  =  ', counter, '; pp(42,43): ', pp(42,43)
        
    end do 
    
   ! ==================
   print *, ' Code time is     = ', t ,' seconds'

contains

   subroutine cpu_t( time )

      ! Argument list
      real(R8), intent(inout) :: time

      ! Local variables
      integer(I8) :: tick
      integer(I8) :: rate

      call system_clock (tick, rate)

      time = real(tick, kind=kind(time) ) / real(rate, kind=kind(time) )

      return

   end subroutine
   
end program

C:\temp>ifx /free /standard-semantics p.f
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2024.2.0 Build 20240602
Copyright (C) 1985-2024 Intel Corporation. All rights reserved.

Microsoft (R) Incremental Linker Version 14.36.32537.0
Copyright (C) Microsoft Corporation.  All rights reserved.

-out:p.exe
-subsystem:console
p.obj

C:\temp>p.exe
    Steps done  =   1000 ; pp(42,43):  -0.3779147
    Steps done  =   2000 ; pp(42,43):  1.6536832E-02
    Steps done  =   3000 ; pp(42,43):  1.551260
    Steps done  =   4000 ; pp(42,43):  0.2452765
    Steps done  =   5000 ; pp(42,43):  0.6278389
    Steps done  =   6000 ; pp(42,43):  -1.001162
    Steps done  =   7000 ; pp(42,43):  1.457952
    Steps done  =   8000 ; pp(42,43):  -0.3233254
    Steps done  =   9000 ; pp(42,43):  -0.7149973
    Steps done  =   10000 ; pp(42,43):  1.124510
  Code time is     =  8.800101280212402E-002  seconds

C:\temp>

Fortran10 · ‎07-07-2024

There is no $$ associated at this moment. Of course, that would be the best situation if there is!
Thanks for sharing the link. This looks interesting, never heard of it before.
This depends on the editor I think. Some editors show the space quite narrow, some fonts make it very widely spaced. Also with this markup, the preview is different, I feel. So it is still an evolving design architecture for me. Can't decide which way I should keep writing my codes.
I can't say much about it. It is something I have no knowledge about.
This is more relevant since Fortran is supposed to translate mathematical Formula. Hence quite natural to evaluate a mathematical function. The intuitive approach is to use Function subprograms. However, my experience (less than five years) is showing using function is not always a right way to solve problem. It may be solvable but at the expense of performance ( another concern and perhaps the main in some situations).
This I/O insight is nice.

The code variant is useful. This is very tricky when dealing with whole array operations, scalar - array or with index array (my original code). Working with mixed-arrays (whole array and index array) requires some good experience otherwise I feel it is hard to debug.

FortranFan · ‎07-08-2024

@Fortran10 wrote:
..
The code variant is useful. This is very tricky when dealing with whole array operations, scalar - array or with index array (my original code). Working with mixed-arrays (whole array and index array) requires some good experience otherwise I feel it is hard to debug.

@Fortran10 ,

First, in case you have not done so, look also into ELEMENTAL subclause of Fortran subprograms and the facilities provided by this:

Element-wise Operations on Arrays — Fortran Programming Language (fortran-lang.org)

I suggest you follow up further at this site, Fortran-lang.org:

Fortran Discourse - Fortran open source community (fortran-lang.discourse.group)

Learn — Fortran Programming Language (fortran-lang.org)

Re: "This is very tricky when dealing with whole array operations, scalar - array or with index array (my original code). Working with mixed-arrays (whole array and index array) requires some good experience otherwise I feel it is hard to debug," you will find good guidance from many other Fortran practitioners at the above two links.

andrew_4619 · ‎07-05-2024

"Note excessive white space and indentation can be extremely harmful "

Back in the day I used to like a few blank lines because it made code more readable, nowadays with editors that have syntax highlighting etc I hate blank lines, I can see less code and have to scroll more. I find it adds pretty much nothing to readability. I suspect other opinions exist.......

As for indentation that is great but if you get beyond the 4th level maybe the code structure needs some thought. Readability and clarity is very important for support and maintenance of a code.

performance of passing allocatable array to pure function

Performance