Intel® Fortran Compiler

Performance penalty associated with derived types with allocatable components

OP1
New Contributor III
File under "customer requests" :)


The small program below illustrates the performance penalty associated with the use of derived-type variables having allocatable components. In debug mode, using fixed-length components is about 5 times faster than using allocatable components. In release mode, the ratio drops to about 1.5, which is obviously still significant.

This issue has come up in this forum before; Steve mentioned that it is caused by the creation of an argument temporary during subroutine calls.

Having derived types with allocatable components is such a neat feature though!!!! My question to Intel is whether there are plans in the near term to enhance the behavior of the compiler...

Thanks!

Olivier

[fortran]PROGRAM MY_PROG
USE IFPORT
IMPLICIT NONE
! T_1 holds an allocatable component; T_2 holds a fixed-size component.
TYPE T_1
    REAL(8),ALLOCATABLE :: X(:)
END TYPE T_1
TYPE T_2
    REAL(8) :: X(10000)
END TYPE T_2
INTEGER(8) :: I,J
TYPE(T_1) :: T1
TYPE(T_2) :: T2
REAL :: TSTART,TEND,TA(2),TT1,TT2
ALLOCATE(T1%X(10000))
T1%X = 1.0D+0
T2%X = 1.0D+0
DO J=1,10
    ! Time 100000 calls that update the allocatable component.
    TSTART = ETIME(TA)
    DO I=1,100000
        CALL MY_SUB_1(T1)
    ENDDO
    TEND = ETIME(TA)
    TT1 = TEND-TSTART
    ! Time 100000 calls that update the fixed-size component.
    TSTART = ETIME(TA)
    DO I=1,100000
        CALL MY_SUB_2(T2)
    ENDDO
    TEND = ETIME(TA)
    TT2 = TEND-TSTART
    WRITE(*,*) TT1,TT2,TT1/TT2
ENDDO
CONTAINS
SUBROUTINE MY_SUB_1(T1)
IMPLICIT NONE
TYPE(T_1),INTENT(INOUT) :: T1
T1%X = T1%X+1.0D+0
END SUBROUTINE MY_SUB_1
SUBROUTINE MY_SUB_2(T2)
IMPLICIT NONE
TYPE(T_2),INTENT(INOUT) :: T2
T2%X = T2%X+1.0D+0
END SUBROUTINE MY_SUB_2
END PROGRAM MY_PROG[/fortran]
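
For readers who want to experiment, here is a hypothetical variant (the routine name MY_SUB_3 is made up and is not part of the original program): it passes the allocatable component itself together with its size, so the callee has an explicit-shape dummy argument with a fixed extent. Whether this helps depends on the compiler version and options.

[fortran]! Hypothetical variant, not from the original post: pass the component
! array and its size so the dummy argument has an explicit shape.
SUBROUTINE MY_SUB_3(X,N)
IMPLICIT NONE
INTEGER(8),INTENT(IN) :: N
REAL(8),INTENT(INOUT) :: X(N)
X = X+1.0D+0
END SUBROUTINE MY_SUB_3
! Called from the timing loop as:
!     CALL MY_SUB_3(T1%X,SIZE(T1%X,KIND=8))[/fortran]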
Martyn_C_Intel
Employee
There have actually been improvements in recent compiler versions to the performance (especially the vectorization) of allocatable arrays that are components of derived types. I do not think what you are seeing here is an effect of temporary copies. In release mode, each subroutine in isolation executes at the same speed.
(To see this, build with /Ob0 or with inlining disabled under Optimization/Inline Function Expansion).
What happens after inlining is that the compiler is able to invert the order of the two loops in the case where the array has a fixed dimension, adding 1 to the first element 100000 times without having to reload it from cache, and only then going on to the second element, etc. (You can see this from the reports.) For me, this gave about a 1.7x speedup. One could argue about whether the compiler should be able to do the same thing for an allocatable component whose dimension is not explicitly known at compile time - I don't see any fundamental reason why not. I have seen a similar example before, and developers are looking at it. But this seems like relatively fine tuning, and not the original intent of your program.
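
To make the interchange concrete, here is a hand-written sketch of the two loop orderings (illustrative only, not actual compiler output; the program name and constants simply mirror the example above):

[fortran]PROGRAM LOOP_INTERCHANGE_SKETCH
IMPLICIT NONE
INTEGER,PARAMETER :: N = 10000, NREP = 100000
REAL(8) :: X(N)
INTEGER :: I,K
X = 1.0D+0
! Original structure: sweep the whole array on every repetition,
! so each element is reloaded on every pass.
DO K=1,NREP
    X = X+1.0D+0
ENDDO
! Interchanged structure (what the compiler can do when the extent is
! a compile-time constant): keep one element in a register and add to
! it NREP times before moving on to the next element.
DO I=1,N
    DO K=1,NREP
        X(I) = X(I)+1.0D+0
    ENDDO
ENDDO
WRITE(*,*) X(1)    ! keep the result live so the loops are not removed
END PROGRAM LOOP_INTERCHANGE_SKETCH[/fortran]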

I see about the same 1.7x performance difference at /Od as I do at the release default of /O2. I believe you may be seeing a bigger difference because array bounds checking is enabled by default for debug builds, but not for release builds. It's plausible that checking against variable bounds might be somewhat slower than checking against fixed bounds. Again, it looks like something that could be improved, but it doesn't seem as fundamental as the performance of the basic subroutines in the release build.