Performance of derived type function call

Patrice_l_ · ‎05-21-2015

Hi,

In the following simple code, where the operator .lt. is used a lot of times, i found that ifort is slower than gfortran by a factor of 2. First i thought it was the handling of the class(*) but in this version there is no class(*) , then i suspected a slow select type, but it just appears that accessing the component from the original class is the problem. what are your thoughts on this ?

gfortran -O3 -fno-inline-functions  tata.f90

T
 b1 extends btype to intb F T
 T
 T
 F
 time operator .lt.  0.40333300000000000     
 time selecttype .lt.  0.31666599999999995     
 time .lt. intb  0.31000000000000016     
 time .lt. btype   7.3333999999999788E-002

ifort -O3 -fno-inline-functions tata.f90

 T
 b1 extends btype to intb T T
 T
 T
 F
 time operator .lt.  0.723333000000000     
 time selecttype .lt.  0.670000000000000     
 time .lt. intb  0.670000000000000     
 time .lt. btype  0.110000000000000

module tata
implicit none

type :: b_type
        integer,allocatable :: b
        contains
                procedure,private :: lower_than
                generic,public :: operator(.lt.) => lower_than
end type

        type,extends(b_type) :: intb 
        contains
                procedure,private :: lower_than=>lowerthan_intb
                !generic,public :: operator(.lt.) => lowerthan_intb
        end type
        interface intb
                module procedure :: intb_constructor
        end interface
contains
function intb_constructor(i) result(a)
        type(intb) :: a
        integer :: i
        allocate(a%b,source=i)
end function        
function lowerthan_intb_intb(a,b) result(r)
        class(intb),intent(in) :: a!,b
        class(intb),intent(in) :: b
        logical :: r
        r=a%b<b%b
end function
function lowerthan_intb(a,b) result(r)
        class(intb),intent(in) :: a!,b
        class(b_type),intent(in) :: b
        logical :: r
        !print *,'lowerthan intb'
        !select type(i1 => a%b)
        !type is(integer)
                select type(b1 => b)
                class is (intb)
                !select type(i2 => b1%b)
                !type is(integer)
                        r=a%b<b1%b
                !end select
        end select
        !end select
end function
function lower_than(a,b)
        class(b_type),intent(in) :: a
        class(b_type),intent(in) :: b
        logical :: lower_than
        lower_than=.false.
        print *, 'operator lower than class* no implementation' 
end function
function lowerthan_bt_bt(a,b) result(r)
        class(b_type),intent(in) :: a!,b
        class(b_type),intent(in) :: b
        logical :: r
        r=a%b<b%b
end function


end module

program foo
use tata
implicit none
type(intb) :: d,d2
class(intb),allocatable :: cd,cd2
class(b_type),allocatable :: b1,b2
type(b_type) :: e1,e2
integer :: j
logical :: res
double precision :: t0,t1
j=1
allocate(e1%b,source=j)
allocate(e2%b,source=j)
        
d=intb(1)
d2=intb(3)
allocate(cd,source=d)        
allocate(cd2,source=d2)       
print *,d<d2 
allocate(b1,source=cd)
allocate(b2,source=cd2)
print *,'b1 extends btype to intb',extends_type_of(b1,d),same_type_as(b1,d)
print *,b1.lt.b2 
print *,b1<b2 
print *, cd<b1
deallocate(cd,cd2,b1,b2)

call cpu_time(t0)
do j=1,2000000
       allocate(cd,source=d)        
       allocate(cd2,source=d2)       
       res=cd<cd2 
       deallocate(cd,cd2)
end do
call cpu_time(t1)
print *,'time operator .lt.',t1-t0

call cpu_time(t0)
do j=1,2000000
       allocate(cd,source=d)        
       allocate(cd2,source=d2)       
       res=lowerthan_intb(cd,cd2) 
       deallocate(cd,cd2)
end do
call cpu_time(t1)
print *,'time selecttype .lt.',t1-t0

call cpu_time(t0)
do j=1,2000000
       allocate(cd,source=d)        
       allocate(cd2,source=d2)       
       res=lowerthan_intb_intb(cd,cd2) 
       deallocate(cd,cd2)
end do
call cpu_time(t1)
print *,'time .lt. intb',t1-t0
deallocate(e1%b,e2%b)

call cpu_time(t0)
do j=1,2000000
       allocate(e1%b,source=j)
       allocate(e2%b,source=j)
       res=lowerthan_bt_bt(e1,e2) 
       deallocate(e1%b,e2%b)
end do
call cpu_time(t1)
print *,'time .lt. btype',t1-t0

end program

Patrice_l_ · ‎05-21-2015

I did more testing, and most of the time comes from the allocation of the class with source= . but yet even without that I still found the gfortran version to be faster. The new code does some comparison between class and type and accessing the parent variable. I do not see it being an effect , so I narrowed it down to allocate class with source and call the lowerthan function .

Thanks for any hint on that.

module tata
implicit none

type :: b_type
        integer,allocatable :: b
        contains
                procedure,private :: lower_than
                generic,public :: operator(.lt.) => lower_than
end type

        type,extends(b_type) :: intb 
        integer,allocatable :: be
        contains
                procedure,private :: lower_than=>lowerthan_intb
                !generic,public :: operator(.lt.) => lowerthan_intb
        end type
        interface intb
                module procedure :: intb_constructor
        end interface
contains
function intb_constructor(i) result(a)
        type(intb) :: a
        integer :: i
        allocate(a%b,source=i)
end function        
function lowerthan_intb_intb(a,b) result(r)
        class(intb),intent(in) :: a!,b
        class(intb),intent(in) :: b
        logical :: r
        r=a%b<b%b
end function
function lowerthan_intb(a,b) result(r)
        class(intb),intent(in) :: a!,b
        class(b_type),intent(in) :: b
        logical :: r
        !print *,'lowerthan intb'
        !select type(i1 => a%b)
        !type is(integer)
                select type(b1 => b)
                class is (intb)
                !select type(i2 => b1%b)
                !type is(integer)
                        r=a%b<b1%b
                !end select
        end select
        !end select
end function
function lower_than(a,b)
        class(b_type),intent(in) :: a
        class(b_type),intent(in) :: b
        logical :: lower_than
        lower_than=.false.
        print *, 'operator lower than class* no implementation' 
end function
function lowerthan_bt_bt(a,b) result(r)
        class(b_type),intent(in) :: a!,b
        class(b_type),intent(in) :: b
        logical :: r
        r=a%b<b%b
end function
function lowerthan_be_be(a,b) result(r)
        class(intb),intent(in) :: a!,b
        class(intb),intent(in) :: b
        logical :: r
        r=a%be<b%be
end function


end module

program foo
use tata
implicit none
type(intb) :: d,d2
class(intb),allocatable :: cd,cd2
class(b_type),allocatable :: b1,b2
type(b_type) :: e1,e2
integer :: j
logical :: res
double precision :: t0,t1
j=1
allocate(e1%b,source=j)
allocate(e2%b,source=j)
        
d=intb(1)
d2=intb(3)
allocate(cd,source=d)        
allocate(cd2,source=d2)       
print *,d<d2 
allocate(b1,source=cd)
allocate(b2,source=cd2)
print *,'b1 extends btype to intb',extends_type_of(b1,d),same_type_as(b1,d)
print *,b1.lt.b2 
print *,b1<b2 
print *, cd<b1
deallocate(cd,cd2,b1,b2)

call cpu_time(t0)
do j=1,2000000
       allocate(cd,source=d)        
       allocate(cd2,source=d2)       
       res=cd<cd2 
       deallocate(cd,cd2)
end do
call cpu_time(t1)
print *,'time operator .lt.',t1-t0

call cpu_time(t0)
do j=1,2000000
       allocate(cd,source=d)        
       allocate(cd2,source=d2)       
       res=lowerthan_intb(cd,cd2) 
       deallocate(cd,cd2)
end do
call cpu_time(t1)
print *,'time selecttype .lt.',t1-t0

allocate(cd,source=d)        
allocate(cd2,source=d2)       
deallocate(cd%b,cd2%b)
call cpu_time(t0)
do j=1,2000000
       allocate(cd%b,source=j)        
       allocate(cd2%b,source=j)       
       res=lowerthan_intb_intb(cd,cd2) 
       deallocate(cd%b,cd2%b)
end do
call cpu_time(t1)
print *,'time .lt. intb class',t1-t0

call cpu_time(t0)
do j=1,2000000
       allocate(d%be,source=j)
       allocate(d2%be,source=j)
       res=lowerthan_intb_intb(d,d2) 
       deallocate(d%be,d2%be)
end do
call cpu_time(t1)
print *,'time .lt. intb type',t1-t0


deallocate(e1%b,e2%b)

call cpu_time(t0)
do j=1,2000000
       allocate(e1%b,source=j)
       allocate(e2%b,source=j)
       res=lowerthan_bt_bt(e1,e2) 
       deallocate(e1%b,e2%b)
end do
call cpu_time(t1)
print *,'time .lt. btype',t1-t0

call cpu_time(t0)
do j=1,2000000
       allocate(d%be,source=j)
       allocate(d2%be,source=j)
       res=lowerthan_be_be(d,d2) 
       deallocate(d%be,d2%be)
end do
call cpu_time(t1)
print *,'time .lt. be',t1-t0

end program

pbkenned1 · ‎05-22-2015

Thanks for submitting the issue. Sorry for the slow response, I'll take a look at this.

Patrick

FortranFan · ‎05-25-2015

It'll be interesting to see how Intel support team responds to this. The tests in the original post as well as in Message #2 mainly do allocate and deallocate and they do not appear to be a measure of performance for a derived type function call invoked via a generic operator. Besides, I don't quite think it is possible to get any meaningful measure of performance of such a function call any way and I don't think one can draw any meaningful comparisons among compilers - that is, the CPU time numbers are going to be so small to be of any relevance. Also, why should inlining be turned off?

Patrice_l_ · ‎05-26-2015

I turned off inlining because I thought it would be a better comparison of function call at the begin. Though now it does not appear to be a function/subroutine call problem.

pbkenned1 · ‎05-26-2015

I couldn't draw any meaningful conclusions here. With ifort, the top hot stops are heap allocation, followed by for__calc_num_elts(), followed by heap deallocation. With gfortran 4.9.1, all I'm getting is an endless series of 'operator lower than class* no implementation', I can't tell if it's in an infinite loop or not.

What gfortran version compiles/runs this normally (or am I missing something obvious)?

Patrick

Patrice_l_ · ‎05-26-2015

Oh yes there is a bug and it is fixed in gfortran 5.1 which i am using. I can provide you exec, object file and assembly if you want.

pbkenned1 · ‎05-26-2015

I compared the performance of gfortran-5.1 against ifort-16.0.0.056 (beta update #1). Indeed, gfortran is faster, regardless of whether or not I disable function inlining. Unfortunately I only have one Linux host capable of supporting gcc-5.1, and it doesn't have a performance analyzer installed, so it may take me some time to state anything conclusive.

Meantime, perhaps you could comment on gfortran's 'b1 extends btype to intb F T', versus ifort's 'b1 extends btype to intb T T'. Bug???

Patrick

[U558697]$ ./U558697-gf-5.1.0.exe
T
b1 extends btype to intb F T
T
T
F
time operator .lt. 0.87086799999999998
time selecttype .lt. 0.88086600000000004
time .lt. intb class 0.12498100000000001
time .lt. intb type 0.12698099999999979
time .lt. btype 0.12398100000000012
time .lt. be 0.12698000000000009

[U558697]$ ./U558697-if-16.0.0.056.exe
T
b1 extends btype to intb T T
T
T
F
time operator .lt. 1.03684200000000
time selecttype .lt. 1.01184600000000
time .lt. intb class 0.159976000000000
time .lt. intb type 0.161975000000000
time .lt. btype 0.161976000000000
time .lt. be 0.161975000000000
[U558697]$

Patrice_l_ · ‎05-26-2015

Thanks for confirming that, I have already reported the bug to gfortran here, https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66227 . I have myself trouble running the analysis tool as I am already in linux 4.0 and the pin binary does not support it.

FortranFan · ‎05-26-2015

Patrick Kennedy (Intel) wrote:

I couldn't draw any meaningful conclusions here. With ifort, the top hot stops are heap allocation, followed by for__calc_num_elts(), followed by heap deallocation. With gfortran 4.9.1, all I'm getting is an endless series of 'operator lower than class* no implementation', I can't tell if it's in an infinite loop or not.

What gfortran version compiles/runs this normally (or am I missing something obvious)?

Patrick

It'll be great if Intel experts can explain and illustrate the best ways to measure performance of used defined operators and type-bound procedures and inlined function calls, etc. Coders can then decide the best way to design their code. My own impression is CPU performance comparison across compilers at such a unit test level may not be all that revealing or useful, but I'd like to learn otherwise.

Anyways looking at the case in this thread, I'd probably set it up along the following lines:

program p

   use, intrinsic :: iso_fortran_env, only : i4 => int32, r8 => real64

   use tata, only : intb

   implicit none

   integer(i4), parameter :: NMAX = 2**16
   integer(i4), parameter :: MAXREPEAT = 10
   integer(i4) :: i
   integer(i4) :: j
   logical, allocatable :: res(:)
   type(intb), allocatable :: a(:)
   type(intb), allocatable :: b(:)
   real(r8) :: x
   real(r8) :: Start_Time
   real(r8) :: End_Time
   real(r8) :: Cpu_Times(MAXREPEAT)

   Loop_OP: do j = 1, MAXREPEAT

      allocate(a(NMAX), b(NMAX), res(NMAX))

      !.. Create a and b vectors using random numbers and arbitrary logic
      do i = 1, NMAX
         call random_number(x)
         if (x < 0.25_r8) then
            a(i) = intb(0)
            b(i) = intb(3)
         else if (x < 0.5_r8) then
            a(i) = intb(1)
            b(i) = intb(2)
         else if (x < 0.75_r8) then
            a(i) = intb(2)
            b(i) = intb(1)
         else
            a(i) = intb(3)
            b(i) = intb(0)
         end if
      end do

      call cpu_time(Start_Time)
      do i = 1, NMAX
         res(i) = (a(i) < b(i))
      end do
      call cpu_time(End_Time)

      !.. print some results from 1st iteration as a check
      if (j == 1) then
         print *, " Iteration 1:"
         print *, " res(1), a(1)%b, b(1)%b = ", res(1), a(1)%b, b(1)%b
         print *, " res(999), a(999)%b, b(999)%b = ", res(999), a(999)%b, b(999)%b
         print *, " res(NMAX), a(NMAX)%b, b(NMAX)%b = ", res(NMAX), a(NMAX)%b, b(NMAX)%b
      end if

      Cpu_Times(j) = End_Time - Start_Time
      deallocate(a, b, res)

   end do Loop_OP

   print *,'time operator .lt.',sum(Cpu_Times)/real(MAXREPEAT, kind=r8)

   stop

end program p

And with a check such as above, my own read is both Intel Fortran and gfortran show negligible CPU load from the generic operator bound to the derived type:

Patrice_l_ · ‎05-26-2015

FortranFan, I tested your program , and I just increase NMAX to 20 millions, here are my results. I see a significant difference.

gfortran 5.1

$ \time ./a.out 
  Iteration 1:
  res(1), a(1)%b, b(1)%b =  F           3           0
  res(999), a(999)%b, b(999)%b =  T           0           3
  res(NMAX), a(NMAX)%b, b(NMAX)%b =  T           1           2
 time operator .lt.   1.4333299999999962E-002
1.99user 0.31system 0:02.31elapsed 99%CPU (0avgtext+0avgdata 197116maxresident)k
0inputs+24outputs (0major+318897minor)pagefaults 0swaps


ifort

 $ \time ./a.out 
  Iteration 1:
  res(1), a(1)%b, b(1)%b =  T           0           3
  res(999), a(999)%b, b(999)%b =  F           2           1
  res(NMAX), a(NMAX)%b, b(NMAX)%b =  T           0           3
 time operator .lt.  2.666669999999998E-002
3.79user 0.12system 0:03.92elapsed 99%CPU (0avgtext+0avgdata 197572maxresident)k
0inputs+0outputs (0major+190539minor)pagefaults 0swaps

FortranFan · ‎05-26-2015

Patrice l. wrote:

FortranFan, I tested your program , and I just increase NMAX to 20 millions, here are my results. I see a significant difference.

..

My test case was simply an illustration on how I would perhaps check performance. But again for the specific derived type at hand, I don't think it is a measure of the performance of the generic operator .lt. - it's probably some overhead for setting up the DO loops based on what the two compilers think is optimal given the O3 option, especially when NMAX becomes too high and the compilers can interpret differently on how to vectorize, etc.. There simply isn't enough "meat" in each iteration.

Lorri_M_Intel · ‎05-27-2015

The use of CLASS does incur a necessary overhead.

When something is declared as a TYPE (even an extended TYPE) the compiler knows everything about that type, including what size it is for an ALLOCATE statement, and which type-bound procedure to use for an operator, etc. This can (but not necessarily) result in better optimized code.

When something is declared as a CLASS, all sorts of information needs to be determined at runtime based on the currently set dynamic type. In our implementation, the data structure for a CLASS is more involved than the data structure for a TYPE. Also, there is more code written to interrogate the extended data structure, and/or to make calls into the Fortran RTL for even more extended support, etc.

I believe what you are seeing is the overhead in ALLOCATE/DEALLOCATE, as well as an extra indirect call to get to the correct type-bound procedure for the .lt. operator.

I don't know the details of what gfortran is doing in this case, so I can't comment about the difference between their implementation and ours.

Does this help?

--Lorri

Patrice_l_ · ‎05-27-2015

FortranFan wrote:

Quote:

My test case was simply an illustration on how I would perhaps check performance. But again for the specific derived type at hand, I don't think it is a measure of the performance of the generic operator .lt. - it's probably some overhead for setting up the DO loops based on what the two compilers think is optimal given the O3 option, especially when NMAX becomes too high and the compilers can interpret differently on how to vectorize, etc.. There simply isn't enough "meat" in each iteration.

Hi,

Lori thanks for your comment, I understand where the extra time is coming from.

FortranFan, Let me give a little more background about what the test case ended up with so less 'meat'. I have an implementation of a map container, and let say I have to insert around 20 millions element and more. With a key of type integer , most of the work is allocate a node and call the .lt. operator. On my linux box, gfortran is 2x faster to do this. I think that globally I would like to see the intel compiler performing a little closer, and it might be helpful for intel developers to look at this simple test case.

FortranFan · ‎05-27-2015

Patrice l. wrote:

... I have an implementation of a map container, and let say I have to insert around 20 millions element and more. With a key of type integer , most of the work is allocate a node and call the .lt. operator. On my linux box, gfortran is 2x faster to do this. I think that globally I would like to see the intel compiler performing a little closer, and it might be helpful for intel developers to look at this simple test case.

I was misled by the title of this topic and your beginning line ("In the following simple code, where the operator .lt. is used a lot of times..") which suggested your focus is on the generic operator performance but it is the data allocation that is of interest to you.

Fortran standard leaves certain aspects of language implementations up to the "processor" (compiler in this case) and as explained by Lorri, the details of polymorphic objects fall under this category and ALLOCATE/DEALLOCATE would follow this. So comparison across compilers does not seem appropriate. If ALLOCATE/DEALLOCATE plays a major role in the performance of one's code, then perhaps a code redesign whereby frequent, intensive object allocation is not required is worth a consideration - OO literature in general (i.e., beyond Fortran) on design patterns do pay attention to this aspect.

Patrice_l_ · ‎05-27-2015

I could use pointers and pre-allocate an array, but this requires more care not to have memory leak, if i use allocatable component in the type, I don't have to worry to much. Using your program and comparing the time to do the random initialization with the time to compare the whole array, I get 0.27s vs 2.73335E-002. So much more time is spent calling intb which is an scalar integer allocation. gfortran gives 0.12999 vs 1.53333E-002.

I tried to remove the allocatable in the integer b_type. So intb is just an assignment a%b=i , then I get 6.0000E-002 vs 2.7666E-002 for ifort and 4.33330E-002 vs 1.00E-002 for gfortran.