module constants
    integer, parameter :: ip = 4
    integer, parameter :: rp = 8
end module constants

module vector_
    use constants
    implicit none
    private
    public :: vector
    public :: operator(+), operator(*)

    type :: vector
        real(rp), allocatable, dimension(:) :: vc
    contains
        generic :: init => init_ar
        procedure, private :: init_ar
    end type vector

    interface operator(+)
        procedure :: vplus
    end interface
    interface operator(*)
        procedure :: svproduct
        procedure :: vsproduct
    end interface

contains

    pure subroutine init_ar(this, ar)
        class(vector), intent(out) :: this
        real(rp), dimension(:), intent(in) :: ar
        this%vc = ar
    end subroutine init_ar

    !---- operators
    elemental type(vector) function vplus(lhs, rhs) result(vvp)
        type(vector), intent(in) :: lhs, rhs
        vvp%vc = lhs%vc + rhs%vc
    end function vplus

    elemental function svproduct(lhs, rhs) result(vr)
        real(rp), intent(in) :: lhs
        type(vector), intent(in) :: rhs
        type(vector) :: vr
        vr%vc = lhs * rhs%vc
    end function svproduct

    elemental function vsproduct(lhs, rhs) result(vr)
        type(vector), intent(in) :: lhs
        real(rp), intent(in) :: rhs
        type(vector) :: vr
        vr%vc = rhs * lhs%vc
    end function vsproduct

end module vector_

program test
    use constants
    use vector_
    implicit none
    integer(ip) :: i, n, j
    real(rp) :: t1, t2, t3, t4, t5, t6, t7
    type(vector) :: p1
    real(rp), dimension(:), allocatable :: p2
    real(rp), dimension(100) :: p3

    p3 = 1.0001d0
    p2 = p3
    call p1%init(p2)
    n = 1e7

    !1
    call CPU_TIME(t1)    ! 2.375
    do i = 1, n
        p1 = p1 + 2.d0 * p1
    end do

    !2
    call CPU_TIME(t2)    ! 0.297
    do i = 1, n
        p2 = p2 + 2.d0 * p2
    end do

    !3
    call CPU_TIME(t3)    ! 2.5
    do i = 1, n
        call op(p2)
    end do

    !4
    call CPU_TIME(t4)    ! 2.531
    do i = 1, n
        p3 = p3 + 2.d0 * p3
    end do

    !5
    call CPU_TIME(t5)    ! 2.515
    do i = 1, n
        do j = 1, 100
            p3(j) = p3(j) + 2.d0 * p3(j)
        end do
    end do

    !6
    call CPU_TIME(t6)    ! 0.234
    do i = 1, n
        call op(p3)
    end do
    call CPU_TIME(t7)

    print *, '1', t2 - t1
    print *, '2', t3 - t2
    print *, '3', t4 - t3
    print *, '4', t5 - t4
    print *, '5', t6 - t5
    print *, '6', t7 - t6

contains

    pure subroutine op(s)
        real(rp), dimension(:), intent(inout) :: s
        s = s + 2.d0 * s
    end subroutine op

end program test
Here I test operations on arrays. Three kinds of array are compared:
1. a derived type vector, which is really just a wrapper around an array
2. an allocatable array whose size is not fixed at compile time
3. an array with a fixed size
Six procedures are then timed, all performing the same operation (s = s + 2.d0 * s), and the time cost differs considerably between them.
With O2 the times are: proc1 2.375 s, proc2 0.297 s, proc3 2.5 s, proc4 2.531 s, proc5 2.515 s, proc6 0.234 s.
With O3, proc1, proc2 and proc6 are unchanged, while proc3, proc4 and proc5 drop to around 1.25 s.
So my question: is it possible to reach the speed of proc6 for a derived type with an overloaded operator?
How can that be done?
I hope you know that your test causes a floating-point overflow. Was that intentional? When doing these kinds of benchmark tests, I don't think it is appropriate to have values set to Inf or NaN.
Also, can you add a print statement at the very end, such as print*, p3(1)? Since the array is not accessed at the end, the optimizer might do something "smart" and remove its calculation.
Roman
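A quick numeric check of the overflow (an added note, not from the original posts, assuming the code in the first post): each pass computes s = s + 2*s, i.e. s -> 3*s, so after i passes p3 = 1.0001d0 * 3**i, which exceeds huge(1.d0) long before the 10**7 iterations finish.

    program overflow_check
        implicit none
        ! each benchmark pass does s = s + 2*s, i.e. s -> 3*s, so after i passes
        ! p3 = 1.0001d0 * 3.d0**i; this exceeds huge(1.d0) once i > ~646,
        ! long before the 10**7 iterations complete
        print *, huge(1.d0)               ! ~1.797d308
        print *, 308.d0 / log10(3.d0)     ! ~645.5 iterations until overflow
    end program overflow_check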
Roman wrote:
I hope you know that your test causes a floating-point overflow. Was that intentional? When doing these kinds of benchmark tests, I don't think it is appropriate to have values set to Inf or NaN.
Also, can you add a print statement at the very end, such as print*, p3(1)? Since the array is not accessed at the end, the optimizer might do something "smart" and remove its calculation.
Roman
Thanks for your correction. I realized my mistake after posting, so here is another test:
program test
    use vector_
    implicit none
    integer, parameter :: ip = 4, rp = 8
    integer(ip) :: i, n
    real(rp) :: t1, t2, t3, t4, t5, t6
    real(rp), dimension(:), allocatable :: p, pp
    real(rp), dimension(:), pointer :: e, ee
    type(vector) :: q, qq

    allocate(p(100), pp(100))
    p  = 1.001d0
    pp = 0.d0
    n  = 1e7

    ! the fastest version I know
    call CPU_TIME(t3)
    do i = 1, n
        call op(p, pp)
    end do
    call CPU_TIME(t4)
    print *, t4 - t3

    call qq%init(pp)
    call q%init(p)
    call CPU_TIME(t1)
    !e  => q%ptr()
    !ee => qq%ptr()
    do i = 1, n
        qq = qq + q + 2.d0 * q
        !call op(e, ee)
    end do
    call CPU_TIME(t2)
    print *, qq%vc(1:10)
    print *, t2 - t1

contains

    pure subroutine op(s, ss)
        real(rp), dimension(:), intent(in)    :: s
        real(rp), dimension(:), intent(inout) :: ss
        ss = ss + s + 2.d0 * s
    end subroutine op

end program test
Here I show the fastest and the slowest procedures; the second costs about ten times as much as the first.
I once changed vector to wrap a fixed-size array instead (real(rp), dimension(100) :: vc); the calculation then took about 2-3 times as long as the fastest procedure.
So I conclude that the allocatable attribute inside the operator functions is mainly responsible for the extra cost.
Also, I suspect a pure function in Fortran is not as fast as an inline function in C++ (just a guess; I know little about C++).
I feel Fortran lacks optimization for user-defined operators on derived types, or perhaps I am missing some knowledge of Fortran; I don't know.
For now I have rewritten my code to compute with intrinsic types and plain arrays.
Any suggestions about user-defined operator functions are welcome.
When you do the second test (qq = qq + q + 2.d0 * q), temporary array variables are created at each operator function call. If I replace these calls with the equivalent inline code, the test looks something like the following. If you run it, you will get time values similar to what you had before. As you can see, all of these allocations, deallocations and array copies are what slows the program down.
type(vector) :: v1, v2, v3

p  = 1.001d0
pp = 0.d0
call qq%init(pp)
call q%init(p)

call CPU_TIME(t1)
do i = 1, n
    allocate(v1%vc(size(q%vc)), v2%vc(size(q%vc)), v3%vc(size(q%vc)))
    v1%vc = 2.d0 * q%vc
    v2%vc = q%vc + v1%vc
    v3%vc = qq%vc + v2%vc
    qq%vc = v3%vc
    deallocate(v1%vc, v2%vc, v3%vc)
end do
call CPU_TIME(t2)
Roman wrote:
When you do the second test (qq = qq + q + 2.d0 * q), temporary array variables are created at each operator function call. If I replace these calls with the equivalent inline code, the test looks something like the following. If you run it, you will get time values similar to what you had before. As you can see, all of these allocations, deallocations and array copies are what slows the program down.
That's the point!
As far as I know, C++ has many user-defined operators and nobody complains about their speed.
So the question: is there a way to write user-defined operators in Fortran that avoids the temporary containers?
Or does Fortran effectively discourage defining operators for derived types?
You know, it is mentally hard to go back from operating on polynomials to dealing with raw arrays.
A polynomial here is essentially a special allocatable vector, and we always want to write the expression a = b + c rather than a%vc = b%vc + c%vc followed by round(a).
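One possible workaround, sketched here as an assumption rather than anything taken from this thread: keep the derived type but do the hot update through an in-place type-bound subroutine instead of the overloaded operators, so no temporary type(vector) results (and no allocate/deallocate of their components) happen per call. The binding name add_scaled is made up for illustration.

    ! assumed addition to the vector type:   procedure :: add_scaled
    pure subroutine add_scaled(this, v, a)
        ! computes this = this + v + a*v directly on the array components;
        ! no temporary type(vector) results are created per call
        class(vector), intent(inout) :: this
        type(vector),  intent(in)    :: v
        real(rp),      intent(in)    :: a
        this%vc = this%vc + v%vc + a * v%vc
    end subroutine add_scaled

    ! usage in the timing loop:
    !     call qq%add_scaled(q, 2.d0)     ! instead of  qq = qq + q + 2.d0 * q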
>>As far as I know, C++ has many user-defined operators and nobody complains about their speed.
If your derived operator in C++ is operating on objects that are a container with variable capacity, then you will see similar allocations for temporaries.
>>we always want to write the expression a = b + c rather than a%vc = b%vc + c%vc
A shorthand:

    associate (a => oa%vc, b => ob%vc, c => oc%vc)
        ...
        a = b + c
        ...
    end associate
Jim Dempsey
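For concreteness, a minimal sketch (not from the original posts) of how this could look in the second test above, assuming the allocatable-component vector type from the first post:

    associate (a => qq%vc, b => q%vc)
        do i = 1, n
            a = a + b + 2.d0 * b    ! plain array arithmetic: no operator calls,
        end do                      ! no temporaries, no allocate/deallocate
    end associate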
The parameterized derived type should theoretically address your issue. However, these are the run times for your last test program, run on 64-bit Windows with the latest compiler:
Optimization level | Fixed size (100) | Allocatable | Parameterized
O2                 | 0.81 s           | 3.3 s       | 25.4 s
O3, IPO            | 0.53 s           | 3.3 s       | 26.5 s
Web master: The above looks like a table when I edit the reply form but displays as a simple list when posted.
You can see the long run times for parameterized types. This is the first time I have tried this new feature and I am disappointed by the performance; it should be close to the times for the fixed-size array.
Here is the full code I used for the parameterized derived type. I made no attempt to tune it (data alignment, explicit vectorization, etc.), but I doubt any tuning is going to fix it.
module constants
    integer, parameter :: ip = 4
    integer, parameter :: rp = 8
end module constants

module vector_
    use constants
    implicit none
    private
    public :: vector
    public :: operator(+), operator(*)

    type :: vector(n)
        integer, len :: n
        real(rp) :: vc(n)
    contains
        generic :: init => init_ar
        procedure, private :: init_ar
    end type

    interface operator(+)
        procedure :: vplus
    end interface
    interface operator(*)
        procedure :: svproduct
        procedure :: vsproduct
    end interface

contains

    pure subroutine init_ar(this, ar)
        class(vector(*)), intent(out) :: this
        real(rp), intent(in) :: ar(:)
        this%vc = ar
    end

    elemental function vplus(lhs, rhs) result(vvp)
        type(vector(*)), intent(in) :: lhs
        type(vector(lhs%n)), intent(in) :: rhs
        type(vector(lhs%n)) vvp
        vvp%vc = lhs%vc + rhs%vc
    end

    elemental function svproduct(lhs, rhs) result(vr)
        real(rp), intent(in) :: lhs
        type(vector(*)), intent(in) :: rhs
        type(vector(rhs%n)) :: vr
        vr%vc = lhs * rhs%vc
    end function svproduct

    elemental function vsproduct(lhs, rhs) result(vr)
        type(vector(*)), intent(in) :: lhs
        real(rp), intent(in) :: rhs
        type(vector(lhs%n)) :: vr
        vr%vc = rhs * lhs%vc
    end

end module vector_

program test
    use vector_
    implicit none
    integer, parameter :: ip = 4, rp = 8
    integer(ip) :: i, n
    real(rp) :: t1, t2, t3, t4, t5, t6
    real(rp), dimension(:), allocatable :: p, pp
    real(rp), dimension(:), pointer :: e, ee
    type(vector(100)) :: q, qq

    allocate(p(100), pp(100))
    p  = 1.001_rp
    pp = 0.0_rp
    n  = 10000000

    ! the fastest version I know
    call CPU_TIME(t3)
    do i = 1, n
        call op(p, pp)
    end do
    call CPU_TIME(t4)
    print *, t4 - t3

    call qq%init(pp)
    call q%init(p)
    call CPU_TIME(t1)
    do i = 1, n
        qq = qq + q + 2.d0 * q
    end do
    call CPU_TIME(t2)
    print *, qq%vc(1:10)
    print *, t2 - t1
    pause

contains

    pure subroutine op(s, ss)
        real(rp), dimension(:), intent(in)    :: s
        real(rp), dimension(:), intent(inout) :: ss
        ss = ss + s + 2.d0 * s
    end

end program
jimdempseyatthecove wrote:
>>As far as I know, C++ has many user-defined operators and nobody complains about their speed.
If your derived operator in C++ is operating on objects that are a container with variable capacity, then you will see similar allocations for temporaries.
>>we always want to write the expression a = b + c rather than a%vc = b%vc + c%vc
A shorthand:

    associate (a => oa%vc, b => ob%vc, c => oc%vc)
        ...
        a = b + c
        ...
    end associate

Jim Dempsey
Part 1: I'm not familiar with C++; maybe I was mistaken.
Part 2: That partly solves the problem, but it doesn't look elegant :(
Andrew Smith wrote:
The parameterized derived type should theoretically address your issue. However, these are the run times for your last test program, run on 64-bit Windows with the latest compiler:

Optimization level | Fixed size (100) | Allocatable | Parameterized
O2                 | 0.81 s           | 3.3 s       | 25.4 s
O3, IPO            | 0.53 s           | 3.3 s       | 26.5 s

You can see the long run times for parameterized types. This is the first time I have tried this new feature and I am disappointed by the performance; it should be close to the times for the fixed-size array.
Impressive test...
This is the first time I have heard of this type.
What is the parameterized derived type designed for, if its performance is this poor?
Please raise this as a support issue for the performance of parameterized derived types. I would have expected performance somewhere between the fixed size and the allocatable vectors, not an order of magnitude slower! There really was no point in providing it like this.
Blatant bump
I have been looking at it, Andrew. I'll get it to Development soon.
(Internal tracking id: TBD)
Any progress please?
Was the idea of the parameterized derived type just an academic exercise or was it a serious attempt to introduce a unique performance benefit to the language?
Performance benefit? No. Indeed there are several on the standards committee who have recently expressed the opinion that this feature should never have been added to the language.
That is very disappointing news. From snippets of conversations over the last few years I was under the impression that parameterized derived types would give better opportunities for stack allocation and vectorization than allocatable vectors. I was hoping they would get near the speed of fixed-size vectors, since the performance loss from allocatable vectors is already large (6x in my test above).
But why is the performance another 10x slower than allocatable vectors? If this is expected, then they are pretty much useless in a high-performance language.
I have no idea, other than to make two observations:
- Each time a new, significant language feature was added, it took time for compilers to learn how to optimize them well. Consider array operations vs. DO loops.
- Any time you defer information to run-time, you lose performance. KIND type parameters are fine - those are always compile-time. But LEN parameters have been nothing but trouble for compiler implementors.
My advice would be to file a report with Intel and ask that the performance degradation be investigated. Maybe it's something simple, but don't get your hopes up too much.
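To illustrate the KIND/LEN distinction mentioned above (an added sketch, not code from this thread): a KIND parameter is always a compile-time constant, whereas a LEN parameter may be deferred to run time, which is what makes it hard to optimize.

    type :: vec(k, n)
        integer, kind :: k = kind(1.0d0)   ! KIND parameter: fixed at compile time
        integer, len  :: n                 ! LEN parameter: may only be known at run time
        real(k) :: vc(n)
    end type vec

    type(vec(n=100))            :: fixed      ! length known to the compiler
    type(vec(n=:)), allocatable :: deferred   ! length chosen at allocation time
    ! allocate(vec(n=m) :: deferred)          ! m is a run-time value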
Li L:
>>Web master: The above looks like a table when I edit the reply form but displays as a simple list when posted
The text of a general message on this forum uses a variable-pitch font. To get a fixed-pitch font, click on the {...} code button, select Plain Text, and enter or paste your text.
As for performance, consider experimenting with specifying the vector bounds in the operation:
elemental function svproduct(lhs, rhs) result(vr)
    real(rp), intent(in) :: lhs
    type(vector(*)), intent(in) :: rhs
    type(vector(rhs%n)) :: vr
    vr%vc(1:rhs%n) = lhs * rhs%vc(1:rhs%n)
end function svproduct
and the same for the other functions.
Jim Dempsey
Using my posted example I tried Jim's suggestion and got no improvement.
Then I tried /Qhost and got a speedup of 0.15 s for the fixed-size vector and 0.3 s for the parameterized derived type. This still leaves a whopping 25 s discrepancy. Can Intel explain why this is?
Andrew,
In the example in #9, I notice that the "member" functions are declared elemental; however, the usage is scalar (i.e., not an array of type(vector)).
What happens when you remove elemental from the derived-type functions?
Jim Dempsey
No significant change without elemental
Andrew Smith wrote:
Using my posted example I tried Jim's suggestion and got no improvement.
Then I tried /Qhost and got a speedup of 0.15 s for the fixed-size vector and 0.3 s for the parameterized derived type. This still leaves a whopping 25 s discrepancy. Can Intel explain why this is?
Please submit a ticket via the Online Service Center for further investigation.
Thank you,