performance of passing allocatable array to pure function

Fortran10 · ‎07-03-2024

The following code works very slow when the arrays are declared allocatable

> real , dimension (:,:), allocatable :: p,pp
> allocate ( p(x,y) , pp(x,y) )

allocatable test

ifx tttt.f90
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2024.2.0 Build 20240602
Copyright (C) 1985-2024 Intel Corporation. All rights reserved.

Microsoft (R) Incremental Linker Version 14.32.31332.0
Copyright (C) Microsoft Corporation.  All rights reserved.

-out:tttt.exe
-subsystem:console
tttt.obj

>tttt
    Steps done  =          1000
    Steps done  =          2000
    Steps done  =          3000
    Steps done  =          4000
    Steps done  =          5000
    Steps done  =          6000
    Steps done  =          7000
    Steps done  =          8000
    Steps done  =          9000
    Steps done  =         10000
  Code time is     =    89.00400      seconds

The code

module testing_pure
    implicit none
    
    contains
    
    pure function pure_function ( r,i, j, pi, pj, mi, mj,x,y )
        implicit none
        
        integer , intent ( in )  :: x,y
        real , dimension ( x, y), intent ( in ) :: r
        real , dimension ( x, y )               :: pure_function
        integer ,intent ( in ) :: i, j, pj, mj, pi,mi
        
        
        
        pure_function(i,j) =  r(pi,j) + r(mi,j) + r(i,mj) +  r(i,pj) - 4.0*r(i,j) 
        
        
    end function pure_function
    
    
    
end module testing_pure



program test_pure_function
    use testing_pure
    implicit none
    
    integer , parameter :: x = 100
    integer , parameter :: y = 100 
    integer             :: counter,i,j,pi,mi,pj,mj  
    integer             :: rate, stop_count, start_count
    real                :: computed_time
    
    
    !real, dimension (x,y) :: p,pp    
    real , dimension (:,:), allocatable :: p,pp 
    allocate ( p(x,y) , pp(x,y) )
    
    call random_number (p)
    
    ! ==================
    
    
    
    
    call system_clock (count=start_count , count_rate=rate)
    
    
    do counter = 1, 10000
        
        
        do j = 1, y
            do i =1, x 
                
                
                pj = j + 1
                mj = j - 1
                pi = i + 1
                mi = i - 1
                
                if ( mi == 0 ) mi = x
                if ( pi == ( x + 1 ) ) pi = 1
                if ( mj == 0 ) mj = y
                if ( pj == ( y + 1 ) ) pj = 1
                
                
                
                !some computation
                
                
                pp   = pure_function ( p , i , j, pi, pj, mi, mj, x, y )
                
                
                
            end do
        end do
        

        if ( mod( counter, 1000 ) .eq. 0 ) print *, '   Steps done  =  ', counter
        
        
    end do 
    
    
    call system_clock (count=stop_count)
    
    computed_time = real(max(stop_count - start_count , 1_8 )) /real(rate)
    
    
    
    !=========
    print*, ' Code time is     = ',  computed_time ,' seconds'
    
    
    
    
end program test_pure_function

without allocatable

> real, dimension (x,y) :: p,pp

it runs very fast

ifx tttt.f90
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2024.2.0 Build 20240602
Copyright (C) 1985-2024 Intel Corporation. All rights reserved.

Microsoft (R) Incremental Linker Version 14.32.31332.0
Copyright (C) Microsoft Corporation.  All rights reserved.

-out:tttt.exe
-subsystem:console
tttt.obj

>tttt
    Steps done  =          1000
    Steps done  =          2000
    Steps done  =          3000
    Steps done  =          4000
    Steps done  =          5000
    Steps done  =          6000
    Steps done  =          7000
    Steps done  =          8000
    Steps done  =          9000
    Steps done  =         10000
  Code time is     =   6.1999999E-02  seconds

with gfortran compiler

there is no change in performance i.e.

with allocatable

>gfortran tttt.f90 -o t

>t
    Steps done  =          1000
    Steps done  =          2000
    Steps done  =          3000
    Steps done  =          4000
    Steps done  =          5000
    Steps done  =          6000
    Steps done  =          7000
    Steps done  =          8000
    Steps done  =          9000
    Steps done  =         10000
  Code time is     =   0.968999982      seconds

without allocatable

>gfortran tttt.f90 -o t

>t
    Steps done  =          1000
    Steps done  =          2000
    Steps done  =          3000
    Steps done  =          4000
    Steps done  =          5000
    Steps done  =          6000
    Steps done  =          7000
    Steps done  =          8000
    Steps done  =          9000
    Steps done  =         10000
  Code time is     =    1.07799995      seconds

So what makes it slow with allocatable arrays ?

andrew_4619 · ‎07-04-2024

Are the allocates withing the timed loop? Repeat allocate/deallocate will always take some time. The load times may be different but you are not timing that. The saving with allocate is that you can grab the memory you need not your best case guess of max memory that you think might be needed.

Fortran10 · ‎07-04-2024

The arrays are already allocated before going to the time loop. I do not know if the allocatable attribute remains active for the whole code duration.

Also does that mean intel compiler (ifx) handles allocatable arrays differently than the gfortran ? I do not see any performance issues with gfortran. If that is the case how to make a portable code with allocatables ?

andrew_4619 · ‎07-04-2024

Do you use heap arrays option? Allocates are always on the heap. Stack is usually faster but you are limited by stack size.

Something clearly goes wrong with the first test case! However I will note that having now looked at you code a smart optimiser could ditch your inner test loop as it does nothing other than set counter to the loop exit value if I have read that correctly.

Fortran10 · ‎07-04-2024

I have a newer version of the previous code.

Modification

I added pure subroutine and it was called from the time loop too.

The code (version 2)


!     ifx allocatable array test
!
!
!    version 2 :
!                pure subroutine is used here.


module testing_pure
    implicit none
    
    contains
    
    function pure_function ( r,i, j, pi, pj, mi, mj,x,y )
        implicit none
        
        integer , intent ( in )  :: x,y
        real , dimension ( x, y), intent ( in ) :: r
        real , dimension ( x, y )               :: pure_function
        integer ,intent ( in ) :: i, j, pj, mj, pi,mi
        
        
        
        pure_function(i,j) =  r(pi,j) + r(mi,j) + r(i,mj) +  r(i,pj) - 4.0*r(i,j) 
        
        
    end function pure_function
    
    
    
    pure subroutine pure_test ( rr, r,i, j, pi, pj, mi, mj,x,y )
        implicit none
        
        integer , intent ( in )  :: x,y
        real , dimension ( x, y), intent ( in ) :: r
        real , dimension ( x, y ), intent (out) :: rr
        integer ,intent ( in ) :: i, j, pj, mj, pi,mi
        
        
        rr(i,j) =  r(pi,j) + r(mi,j) + r(i,mj) +  r(i,pj) - 4.0*r(i,j) 
        
        
    end  subroutine pure_test
    
end module testing_pure



program test_pure_function
    use testing_pure
    implicit none
    
    integer , parameter :: x = 100
    integer , parameter :: y = 100 
    integer             :: counter,i,j,pi,mi,pj,mj  
    integer             :: rate, stop_count, start_count
    real                :: computed_time
    
    
  !     real, dimension (x,y) :: p,pp    
    real , dimension (:,:), allocatable :: p,pp 
    allocate ( p(x,y) , pp(x,y) )
    
    call random_number (p)
    
    
    
    ! ==================
    
    
    
    
    call system_clock (count=start_count , count_rate=rate)
    
    
    do counter = 1, 10000
        
        
        do j = 1, y
            do i =1, x 
                
                
                pj = j + 1
                mj = j - 1
                pi = i + 1
                mi = i - 1
                
                if ( mi == 0 ) mi = x
                if ( pi == ( x + 1 ) ) pi = 1
                if ( mj == 0 ) mj = y
                if ( pj == ( y + 1 ) ) pj = 1
                
                
                
                !some computation
                
                
!                pp   = pure_function ( p , i , j, pi, pj, mi, mj, x, y )
                
                call pure_test ( pp, p,i, j, pi, pj, mi, mj,x,y )
                
                
                
            end do
        end do
        
        
        if ( mod( counter, 1000 ) .eq. 0 ) print *, '   Steps done  =  ', counter
        
        
    end do 
    
    
    call system_clock (count=stop_count)
    
    computed_time = real(max(stop_count - start_count , 1_8 )) /real(rate)
    
    
    
    !=========
    print*, ' Code time is     = ',  computed_time ,' seconds'
    
    
    

end program test_pure_function

test 1

I run the test as before

>ifx tttt1.f90
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2024.2.0 Build 20240602
Copyright (C) 1985-2024 Intel Corporation. All rights reserved.

Microsoft (R) Incremental Linker Version 14.32.31332.0
Copyright (C) Microsoft Corporation.  All rights reserved.

-out:tttt1.exe
-subsystem:console
tttt1.obj

>tttt1
    Steps done  =          1000
    Steps done  =          2000
    Steps done  =          3000
    Steps done  =          4000
    Steps done  =          5000
    Steps done  =          6000
    Steps done  =          7000
    Steps done  =          8000
    Steps done  =          9000
    Steps done  =         10000
  Code time is     =   0.1090000      seconds

It seems like the calling of pure routine has no performance loss with the same allocatable arrays. So it seems like it has something to do with function.

andrew_4619 · ‎07-04-2024

OK I did your second test and indeed the function and subroutine versions have run times that are 1000x different!!

Just to be sure I added a couple of lines at the end (outside the timer) to print a random element of the pp array and that made no difference so there is not some code elimination optimisation anomaly.

Dear compiler team there is indeed something very BAD happening!

Andrew_Smith · ‎07-04-2024

I used VTune on the allocatable version:

Does look like massive amounts of unnecessary memory copying. Its worrying that an allocatable array can cause this much performance loss. I expect many of us will be watching for a resolution.

andrew_4619 · ‎07-04-2024

hmmm that last post made me look harder at the example. The function is returning a 100x100 real array but only making assignment to a single element within it, that doesn't seem to be a good thing to do. None the less it should be making 10000 assignments on function return. The loc of the array in the function is not the same loc as the pp array in the caller as one would expect.

jimdempseyatthecove · ‎07-04-2024

The pure function coding style, while compiling the function, the compiler has no way of determining if the result of the function is possibly an alias of an input argument. Therefore, it copies the/an array. In the call coding style, the programmer's requirement is to not use argument aliases. I do not know if the Fortran Specification states anything about function output aliasing an input argument and therefore presumes it may be a possibility, thus requiring the copying of the array (use of array temporary).

This is presumption on my part.

Note, in your first code example (using function), add ", intent(out)"

real , dimension ( x, y ), intent (out) :: pure_function

See if this makes a difference.

Jim Dempsey

performance of passing allocatable array to pure function

Performance