- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
In the following simple code, where the operator .lt. is used a lot of times, i found that ifort is slower than gfortran by a factor of 2. First i thought it was the handling of the class(*) but in this version there is no class(*) , then i suspected a slow select type, but it just appears that accessing the component from the original class is the problem. what are your thoughts on this ?
gfortran -O3 -fno-inline-functions tata.f90 T b1 extends btype to intb F T T T F time operator .lt. 0.40333300000000000 time selecttype .lt. 0.31666599999999995 time .lt. intb 0.31000000000000016 time .lt. btype 7.3333999999999788E-002
ifort -O3 -fno-inline-functions tata.f90 T b1 extends btype to intb T T T T F time operator .lt. 0.723333000000000 time selecttype .lt. 0.670000000000000 time .lt. intb 0.670000000000000 time .lt. btype 0.110000000000000
module tata implicit none type :: b_type integer,allocatable :: b contains procedure,private :: lower_than generic,public :: operator(.lt.) => lower_than end type type,extends(b_type) :: intb contains procedure,private :: lower_than=>lowerthan_intb !generic,public :: operator(.lt.) => lowerthan_intb end type interface intb module procedure :: intb_constructor end interface contains function intb_constructor(i) result(a) type(intb) :: a integer :: i allocate(a%b,source=i) end function function lowerthan_intb_intb(a,b) result(r) class(intb),intent(in) :: a!,b class(intb),intent(in) :: b logical :: r r=a%b<b%b end function function lowerthan_intb(a,b) result(r) class(intb),intent(in) :: a!,b class(b_type),intent(in) :: b logical :: r !print *,'lowerthan intb' !select type(i1 => a%b) !type is(integer) select type(b1 => b) class is (intb) !select type(i2 => b1%b) !type is(integer) r=a%b<b1%b !end select end select !end select end function function lower_than(a,b) class(b_type),intent(in) :: a class(b_type),intent(in) :: b logical :: lower_than lower_than=.false. print *, 'operator lower than class* no implementation' end function function lowerthan_bt_bt(a,b) result(r) class(b_type),intent(in) :: a!,b class(b_type),intent(in) :: b logical :: r r=a%b<b%b end function end module program foo use tata implicit none type(intb) :: d,d2 class(intb),allocatable :: cd,cd2 class(b_type),allocatable :: b1,b2 type(b_type) :: e1,e2 integer :: j logical :: res double precision :: t0,t1 j=1 allocate(e1%b,source=j) allocate(e2%b,source=j) d=intb(1) d2=intb(3) allocate(cd,source=d) allocate(cd2,source=d2) print *,d<d2 allocate(b1,source=cd) allocate(b2,source=cd2) print *,'b1 extends btype to intb',extends_type_of(b1,d),same_type_as(b1,d) print *,b1.lt.b2 print *,b1<b2 print *, cd<b1 deallocate(cd,cd2,b1,b2) call cpu_time(t0) do j=1,2000000 allocate(cd,source=d) allocate(cd2,source=d2) res=cd<cd2 deallocate(cd,cd2) end do call cpu_time(t1) print *,'time operator .lt.',t1-t0 call cpu_time(t0) do j=1,2000000 allocate(cd,source=d) allocate(cd2,source=d2) res=lowerthan_intb(cd,cd2) deallocate(cd,cd2) end do call cpu_time(t1) print *,'time selecttype .lt.',t1-t0 call cpu_time(t0) do j=1,2000000 allocate(cd,source=d) allocate(cd2,source=d2) res=lowerthan_intb_intb(cd,cd2) deallocate(cd,cd2) end do call cpu_time(t1) print *,'time .lt. intb',t1-t0 deallocate(e1%b,e2%b) call cpu_time(t0) do j=1,2000000 allocate(e1%b,source=j) allocate(e2%b,source=j) res=lowerthan_bt_bt(e1,e2) deallocate(e1%b,e2%b) end do call cpu_time(t1) print *,'time .lt. btype',t1-t0 end program
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I did more testing, and most of the time comes from the allocation of the class with source= . but yet even without that I still found the gfortran version to be faster. The new code does some comparison between class and type and accessing the parent variable. I do not see it being an effect , so I narrowed it down to allocate class with source and call the lowerthan function .
Thanks for any hint on that.
module tata implicit none type :: b_type integer,allocatable :: b contains procedure,private :: lower_than generic,public :: operator(.lt.) => lower_than end type type,extends(b_type) :: intb integer,allocatable :: be contains procedure,private :: lower_than=>lowerthan_intb !generic,public :: operator(.lt.) => lowerthan_intb end type interface intb module procedure :: intb_constructor end interface contains function intb_constructor(i) result(a) type(intb) :: a integer :: i allocate(a%b,source=i) end function function lowerthan_intb_intb(a,b) result(r) class(intb),intent(in) :: a!,b class(intb),intent(in) :: b logical :: r r=a%b<b%b end function function lowerthan_intb(a,b) result(r) class(intb),intent(in) :: a!,b class(b_type),intent(in) :: b logical :: r !print *,'lowerthan intb' !select type(i1 => a%b) !type is(integer) select type(b1 => b) class is (intb) !select type(i2 => b1%b) !type is(integer) r=a%b<b1%b !end select end select !end select end function function lower_than(a,b) class(b_type),intent(in) :: a class(b_type),intent(in) :: b logical :: lower_than lower_than=.false. print *, 'operator lower than class* no implementation' end function function lowerthan_bt_bt(a,b) result(r) class(b_type),intent(in) :: a!,b class(b_type),intent(in) :: b logical :: r r=a%b<b%b end function function lowerthan_be_be(a,b) result(r) class(intb),intent(in) :: a!,b class(intb),intent(in) :: b logical :: r r=a%be<b%be end function end module program foo use tata implicit none type(intb) :: d,d2 class(intb),allocatable :: cd,cd2 class(b_type),allocatable :: b1,b2 type(b_type) :: e1,e2 integer :: j logical :: res double precision :: t0,t1 j=1 allocate(e1%b,source=j) allocate(e2%b,source=j) d=intb(1) d2=intb(3) allocate(cd,source=d) allocate(cd2,source=d2) print *,d<d2 allocate(b1,source=cd) allocate(b2,source=cd2) print *,'b1 extends btype to intb',extends_type_of(b1,d),same_type_as(b1,d) print *,b1.lt.b2 print *,b1<b2 print *, cd<b1 deallocate(cd,cd2,b1,b2) call cpu_time(t0) do j=1,2000000 allocate(cd,source=d) allocate(cd2,source=d2) res=cd<cd2 deallocate(cd,cd2) end do call cpu_time(t1) print *,'time operator .lt.',t1-t0 call cpu_time(t0) do j=1,2000000 allocate(cd,source=d) allocate(cd2,source=d2) res=lowerthan_intb(cd,cd2) deallocate(cd,cd2) end do call cpu_time(t1) print *,'time selecttype .lt.',t1-t0 allocate(cd,source=d) allocate(cd2,source=d2) deallocate(cd%b,cd2%b) call cpu_time(t0) do j=1,2000000 allocate(cd%b,source=j) allocate(cd2%b,source=j) res=lowerthan_intb_intb(cd,cd2) deallocate(cd%b,cd2%b) end do call cpu_time(t1) print *,'time .lt. intb class',t1-t0 call cpu_time(t0) do j=1,2000000 allocate(d%be,source=j) allocate(d2%be,source=j) res=lowerthan_intb_intb(d,d2) deallocate(d%be,d2%be) end do call cpu_time(t1) print *,'time .lt. intb type',t1-t0 deallocate(e1%b,e2%b) call cpu_time(t0) do j=1,2000000 allocate(e1%b,source=j) allocate(e2%b,source=j) res=lowerthan_bt_bt(e1,e2) deallocate(e1%b,e2%b) end do call cpu_time(t1) print *,'time .lt. btype',t1-t0 call cpu_time(t0) do j=1,2000000 allocate(d%be,source=j) allocate(d2%be,source=j) res=lowerthan_be_be(d,d2) deallocate(d%be,d2%be) end do call cpu_time(t1) print *,'time .lt. be',t1-t0 end program
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thanks for submitting the issue. Sorry for the slow response, I'll take a look at this.
Patrick
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
It'll be interesting to see how Intel support team responds to this. The tests in the original post as well as in Message #2 mainly do allocate and deallocate and they do not appear to be a measure of performance for a derived type function call invoked via a generic operator. Besides, I don't quite think it is possible to get any meaningful measure of performance of such a function call any way and I don't think one can draw any meaningful comparisons among compilers - that is, the CPU time numbers are going to be so small to be of any relevance. Also, why should inlining be turned off?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I turned off inlining because I thought it would be a better comparison of function call at the begin. Though now it does not appear to be a function/subroutine call problem.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I couldn't draw any meaningful conclusions here. With ifort, the top hot stops are heap allocation, followed by for__calc_num_elts(), followed by heap deallocation. With gfortran 4.9.1, all I'm getting is an endless series of 'operator lower than class* no implementation', I can't tell if it's in an infinite loop or not.
What gfortran version compiles/runs this normally (or am I missing something obvious)?
Patrick
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Oh yes there is a bug and it is fixed in gfortran 5.1 which i am using. I can provide you exec, object file and assembly if you want.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I compared the performance of gfortran-5.1 against ifort-16.0.0.056 (beta update #1). Indeed, gfortran is faster, regardless of whether or not I disable function inlining. Unfortunately I only have one Linux host capable of supporting gcc-5.1, and it doesn't have a performance analyzer installed, so it may take me some time to state anything conclusive.
Meantime, perhaps you could comment on gfortran's 'b1 extends btype to intb F T', versus ifort's 'b1 extends btype to intb T T'. Bug???
Patrick
[U558697]$ ./U558697-gf-5.1.0.exe
T
b1 extends btype to intb F T
T
T
F
time operator .lt. 0.87086799999999998
time selecttype .lt. 0.88086600000000004
time .lt. intb class 0.12498100000000001
time .lt. intb type 0.12698099999999979
time .lt. btype 0.12398100000000012
time .lt. be 0.12698000000000009
[U558697]$ ./U558697-if-16.0.0.056.exe
T
b1 extends btype to intb T T
T
T
F
time operator .lt. 1.03684200000000
time selecttype .lt. 1.01184600000000
time .lt. intb class 0.159976000000000
time .lt. intb type 0.161975000000000
time .lt. btype 0.161976000000000
time .lt. be 0.161975000000000
[U558697]$
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thanks for confirming that, I have already reported the bug to gfortran here, https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66227 . I have myself trouble running the analysis tool as I am already in linux 4.0 and the pin binary does not support it.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Patrick Kennedy (Intel) wrote:
I couldn't draw any meaningful conclusions here. With ifort, the top hot stops are heap allocation, followed by for__calc_num_elts(), followed by heap deallocation. With gfortran 4.9.1, all I'm getting is an endless series of 'operator lower than class* no implementation', I can't tell if it's in an infinite loop or not.
What gfortran version compiles/runs this normally (or am I missing something obvious)?
Patrick
It'll be great if Intel experts can explain and illustrate the best ways to measure performance of used defined operators and type-bound procedures and inlined function calls, etc. Coders can then decide the best way to design their code. My own impression is CPU performance comparison across compilers at such a unit test level may not be all that revealing or useful, but I'd like to learn otherwise.
Anyways looking at the case in this thread, I'd probably set it up along the following lines:
program p use, intrinsic :: iso_fortran_env, only : i4 => int32, r8 => real64 use tata, only : intb implicit none integer(i4), parameter :: NMAX = 2**16 integer(i4), parameter :: MAXREPEAT = 10 integer(i4) :: i integer(i4) :: j logical, allocatable :: res(:) type(intb), allocatable :: a(:) type(intb), allocatable :: b(:) real(r8) :: x real(r8) :: Start_Time real(r8) :: End_Time real(r8) :: Cpu_Times(MAXREPEAT) Loop_OP: do j = 1, MAXREPEAT allocate(a(NMAX), b(NMAX), res(NMAX)) !.. Create a and b vectors using random numbers and arbitrary logic do i = 1, NMAX call random_number(x) if (x < 0.25_r8) then a(i) = intb(0) b(i) = intb(3) else if (x < 0.5_r8) then a(i) = intb(1) b(i) = intb(2) else if (x < 0.75_r8) then a(i) = intb(2) b(i) = intb(1) else a(i) = intb(3) b(i) = intb(0) end if end do call cpu_time(Start_Time) do i = 1, NMAX res(i) = (a(i) < b(i)) end do call cpu_time(End_Time) !.. print some results from 1st iteration as a check if (j == 1) then print *, " Iteration 1:" print *, " res(1), a(1)%b, b(1)%b = ", res(1), a(1)%b, b(1)%b print *, " res(999), a(999)%b, b(999)%b = ", res(999), a(999)%b, b(999)%b print *, " res(NMAX), a(NMAX)%b, b(NMAX)%b = ", res(NMAX), a(NMAX)%b, b(NMAX)%b end if Cpu_Times(j) = End_Time - Start_Time deallocate(a, b, res) end do Loop_OP print *,'time operator .lt.',sum(Cpu_Times)/real(MAXREPEAT, kind=r8) stop end program p
And with a check such as above, my own read is both Intel Fortran and gfortran show negligible CPU load from the generic operator bound to the derived type:
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
FortranFan, I tested your program , and I just increase NMAX to 20 millions, here are my results. I see a significant difference.
gfortran 5.1 $ \time ./a.out Iteration 1: res(1), a(1)%b, b(1)%b = F 3 0 res(999), a(999)%b, b(999)%b = T 0 3 res(NMAX), a(NMAX)%b, b(NMAX)%b = T 1 2 time operator .lt. 1.4333299999999962E-002 1.99user 0.31system 0:02.31elapsed 99%CPU (0avgtext+0avgdata 197116maxresident)k 0inputs+24outputs (0major+318897minor)pagefaults 0swaps ifort $ \time ./a.out Iteration 1: res(1), a(1)%b, b(1)%b = T 0 3 res(999), a(999)%b, b(999)%b = F 2 1 res(NMAX), a(NMAX)%b, b(NMAX)%b = T 0 3 time operator .lt. 2.666669999999998E-002 3.79user 0.12system 0:03.92elapsed 99%CPU (0avgtext+0avgdata 197572maxresident)k 0inputs+0outputs (0major+190539minor)pagefaults 0swaps
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Patrice l. wrote:
FortranFan, I tested your program , and I just increase NMAX to 20 millions, here are my results. I see a significant difference.
..
My test case was simply an illustration on how I would perhaps check performance. But again for the specific derived type at hand, I don't think it is a measure of the performance of the generic operator .lt. - it's probably some overhead for setting up the DO loops based on what the two compilers think is optimal given the O3 option, especially when NMAX becomes too high and the compilers can interpret differently on how to vectorize, etc.. There simply isn't enough "meat" in each iteration.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
The use of CLASS does incur a necessary overhead.
When something is declared as a TYPE (even an extended TYPE) the compiler knows everything about that type, including what size it is for an ALLOCATE statement, and which type-bound procedure to use for an operator, etc. This can (but not necessarily) result in better optimized code.
When something is declared as a CLASS, all sorts of information needs to be determined at runtime based on the currently set dynamic type. In our implementation, the data structure for a CLASS is more involved than the data structure for a TYPE. Also, there is more code written to interrogate the extended data structure, and/or to make calls into the Fortran RTL for even more extended support, etc.
I believe what you are seeing is the overhead in ALLOCATE/DEALLOCATE, as well as an extra indirect call to get to the correct type-bound procedure for the .lt. operator.
I don't know the details of what gfortran is doing in this case, so I can't comment about the difference between their implementation and ours.
Does this help?
--Lorri
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
FortranFan wrote:
Quote:
My test case was simply an illustration on how I would perhaps check performance. But again for the specific derived type at hand, I don't think it is a measure of the performance of the generic operator .lt. - it's probably some overhead for setting up the DO loops based on what the two compilers think is optimal given the O3 option, especially when NMAX becomes too high and the compilers can interpret differently on how to vectorize, etc.. There simply isn't enough "meat" in each iteration.
Hi,
Lori thanks for your comment, I understand where the extra time is coming from.
FortranFan, Let me give a little more background about what the test case ended up with so less 'meat'. I have an implementation of a map container, and let say I have to insert around 20 millions element and more. With a key of type integer , most of the work is allocate a node and call the .lt. operator. On my linux box, gfortran is 2x faster to do this. I think that globally I would like to see the intel compiler performing a little closer, and it might be helpful for intel developers to look at this simple test case.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Patrice l. wrote:
... I have an implementation of a map container, and let say I have to insert around 20 millions element and more. With a key of type integer , most of the work is allocate a node and call the .lt. operator. On my linux box, gfortran is 2x faster to do this. I think that globally I would like to see the intel compiler performing a little closer, and it might be helpful for intel developers to look at this simple test case.
I was misled by the title of this topic and your beginning line ("In the following simple code, where the operator .lt. is used a lot of times..") which suggested your focus is on the generic operator performance but it is the data allocation that is of interest to you.
Fortran standard leaves certain aspects of language implementations up to the "processor" (compiler in this case) and as explained by Lorri, the details of polymorphic objects fall under this category and ALLOCATE/DEALLOCATE would follow this. So comparison across compilers does not seem appropriate. If ALLOCATE/DEALLOCATE plays a major role in the performance of one's code, then perhaps a code redesign whereby frequent, intensive object allocation is not required is worth a consideration - OO literature in general (i.e., beyond Fortran) on design patterns do pay attention to this aspect.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I could use pointers and pre-allocate an array, but this requires more care not to have memory leak, if i use allocatable component in the type, I don't have to worry to much. Using your program and comparing the time to do the random initialization with the time to compare the whole array, I get 0.27s vs 2.73335E-002. So much more time is spent calling intb which is an scalar integer allocation. gfortran gives 0.12999 vs 1.53333E-002.
I tried to remove the allocatable in the integer b_type. So intb is just an assignment a%b=i , then I get 6.0000E-002 vs 2.7666E-002 for ifort and 4.33330E-002 vs 1.00E-002 for gfortran.
![](/skins/images/DF2E495CEC88D713A66401CF495CD875/responsive_peak/images/icon_anonymous_message.png)
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page