module constants
    integer, parameter :: ip = 4
    integer, parameter :: rp = 8
end module constants

module vector_
    use constants
    implicit none
    private
    public :: vector
    public :: operator(+), operator(*)

    type :: vector
        real(rp), allocatable, dimension(:) :: vc
    contains
        generic :: init => init_ar
        procedure, private :: init_ar
    end type vector
    !---------------------------------------------
    interface operator(+)
        procedure :: vplus
    end interface
    interface operator(*)
        procedure :: svproduct
        procedure :: vsproduct
    end interface
    !-----
contains
    !&&
    pure subroutine init_ar(this, ar)
        class(vector), intent(out) :: this
        real(rp), dimension(:), intent(in) :: ar
        this%vc = ar
    end subroutine init_ar
    !----operator
    elemental type(vector) function vplus(lhs, rhs) result(vvp)
        type(vector), intent(in) :: lhs, rhs
        vvp%vc = lhs%vc + rhs%vc
    end function vplus
    !--
    elemental function svproduct(lhs, rhs) result(vr)
        real(rp), intent(in) :: lhs
        type(vector), intent(in) :: rhs
        type(vector) :: vr
        vr%vc = lhs * rhs%vc
    end function svproduct
    !--
    elemental function vsproduct(lhs, rhs) result(vr)
        type(vector), intent(in) :: lhs
        real(rp), intent(in) :: rhs
        type(vector) :: vr
        vr%vc = rhs * lhs%vc
    end function vsproduct
end module vector_

program test
    use constants
    use vector_
    implicit none
    integer(ip) :: i, n, j
    real(rp) :: t1, t2, t3, t4, t5, t6, t7
    type(vector) :: p1
    real(rp), dimension(:), allocatable :: p2
    real(rp), dimension(100) :: p3
    p3 = 1.0001d0
    p2 = p3
    call p1%init(p2)
    n = 1e7
    !1
    call CPU_TIME(t1) !2.375
    do i = 1, n
        p1 = p1 + 2.d0 * p1
    end do
    !2
    call CPU_TIME(t2) !0.297
    do i = 1, n
        p2 = p2 + 2.d0 * p2
    end do
    !3
    call CPU_TIME(t3) !2.5
    do i = 1, n
        call op(p2)
    end do
    !4
    call CPU_TIME(t4) !2.531
    do i = 1, n
        p3 = p3 + 2.d0 * p3
    end do
    !5
    call CPU_TIME(t5) !2.515
    do i = 1, n
        do j = 1, 100
            p3(j) = p3(j) + 2.d0 * p3(j)
        end do
    end do
    !6
    call CPU_TIME(t6) !0.234
    do i = 1, n
        call op(p3)
    end do
    call CPU_TIME(t7)
    print*, '1', t2 - t1
    print*, '2', t3 - t2
    print*, '3', t4 - t3
    print*, '4', t5 - t4
    print*, '5', t6 - t5
    print*, '6', t7 - t6
contains
    pure subroutine op(s)
        real(rp), dimension(:), intent(inout) :: s
        s = s + 2.d0 * s
    end subroutine op
end program test
Here I test operations on arrays. Three kinds of array are compared:
1. a derived type vector, which is simply a wrapper around an array
2. an allocatable array, whose size is not known at compile time
3. an array of fixed size
Six procedures are then timed, all performing the same update (s = s + 2.d0 * s), and I compare the time cost of each.
At O2 the times are: proc1 (2.375 s), proc2 (0.297 s), proc3 (2.5 s), proc4 (2.531 s), proc5 (2.515 s), proc6 (0.234 s).
At O3, proc1, proc2 and proc6 are unchanged, while proc3, proc4 and proc5 drop to around 1.25 s.
So I have a question: is it possible to reach the speed of proc6 for a derived type with overloaded operators?
How can it be done?
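One candidate I have in mind (only a sketch of mine, not tested code; the name vop is made up) is to keep the derived type but express the update as an in-place type-bound subroutine, so the loop body touches the component array directly, just like proc6, with no operator temporaries:

```fortran
!sketch: an in-place update added to module vector_
!(vop is a hypothetical name, not part of the code above)
pure subroutine vop(this)
    class(vector), intent(inout) :: this
    !same arithmetic as p1 = p1 + 2.d0 * p1, but no
    !temporary vector is allocated for the result
    this%vc = this%vc + 2.d0 * this%vc
end subroutine vop
```

With a binding `procedure :: vop` in the type, the timing loop would read `call p1%vop()` instead of `p1 = p1 + 2.d0 * p1`.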
I hope you know that your test causes a floating-point overflow. Was that intentional? When doing these kinds of benchmark tests, I don't think it is appropriate to have values set to Inf or NaN.
Also, can you add a print statement at the very end, such as print*, p3(1)? Since the array is never read afterwards, the optimizer might do something "smart" and remove the calculation entirely.
Roman
Thanks for the correction. I realized my mistakes after posting, so here is another test:
program test
    use vector_
    implicit none
    integer, parameter :: ip = 4, rp = 8
    integer(ip) :: i, n
    real(rp) :: t1, t2, t3, t4, t5, t6
    real(rp), dimension(:), allocatable :: p, pp
    real(rp), dimension(:), pointer :: e, ee
    type(vector) :: q, qq
    allocate(p(100), pp(100))
    p = 1.001d0
    pp = 0.d0
    n = 1e7
    !the fastest i know
    call CPU_TIME(t3)
    do i = 1, n
        call op(p, pp)
    end do
    call CPU_TIME(t4)
    print*, t4 - t3
    call qq%init(pp)
    call q%init(p)
    call CPU_TIME(t1)
    !e => q%ptr()
    !ee => qq%ptr()
    do i = 1, n
        qq = qq + q + 2.d0 * q
        !call op(e,ee)
    end do
    call CPU_TIME(t2)
    print*, qq%vc(1:10)
    print*, t2 - t1
contains
    pure subroutine op(s, ss)
        real(rp), dimension(:), intent(in) :: s
        real(rp), dimension(:), intent(inout) :: ss
        ss = ss + s + 2.d0 * s
    end subroutine op
end program test
Here I post the fastest and slowest procedures; the second takes about 10 times as long as the first.
I also once changed vector into a wrapper around a fixed-size array (real(rp), dimension(100) :: vc); then the calculation takes around 2-3 times as long as the fastest procedure.
So I conclude that the allocatable attribute in the operator functions is mainly responsible for the extra time cost.
Furthermore, I guess a pure function in Fortran is not as fast as an inline function in C++ (just a guess; I know little about C++).
I feel Fortran lacks optimization for overloaded operators on derived types, or perhaps I am missing some knowledge of Fortran; I don't know. For now I have rewritten my code to compute with intrinsic types and arrays.
Any suggestions about overloaded operator functions are welcome.
When you do the second test (qq = qq + q + 2.d0 * q), temporary array variables are created at each operator function call. If I expand those calls by hand, the test looks something like the following code. If you run it, you will get similar times to what you had before. As you can see, all of these allocations, deallocations and array copies are slowing the program down.
type(vector) :: v1, v2, v3
p = 1.001d0
pp = 0.d0
call qq%init(pp)
call q%init(p)
call CPU_TIME(t1)
do i = 1, n
    allocate(v1%vc(size(q%vc)), v2%vc(size(q%vc)), v3%vc(size(q%vc)))
    v1%vc = 2.d0 * q%vc
    v2%vc = q%vc + v1%vc
    v3%vc = qq%vc + v2%vc
    qq%vc = v3%vc
    deallocate(v1%vc, v2%vc, v3%vc)
end do
call CPU_TIME(t2)
That's the point!
As far as I know, C++ has many overloaded operators, and nobody complains about their speed.
So the question is: is there a way to construct overloaded operators in Fortran that avoids temporary containers, or does Fortran effectively discourage overloading operators for derived types?
You know, it is mentally hard to go back from operating on polynomials to dealing with plain arrays. A polynomial here is a special allocatable vector, and we always want to see the expression a = b + c, rather than a%vc = b%vc + c%vc, then round(a).
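For the record, the only mitigation I can see (a sketch of mine, not an established fix; add_axpy is a made-up name) is to fuse a whole expression such as qq = qq + q + 2.d0*q into a single in-place type-bound call, trading a little of the a = b + c readability for zero temporaries:

```fortran
!sketch: fused in-place update for module vector_,
!computing this = this + x + a*x in one pass
!(add_axpy is a hypothetical name; x must be a
! different variable from this, to avoid aliasing)
pure subroutine add_axpy(this, a, x)
    class(vector), intent(inout) :: this
    real(rp), intent(in) :: a
    type(vector), intent(in) :: x
    this%vc = this%vc + x%vc + a * x%vc
end subroutine add_axpy
```

With a binding `procedure :: add_axpy`, the hot loop becomes `call qq%add_axpy(2.d0, q)` in place of `qq = qq + q + 2.d0 * q`: one pass over the data, no allocation or deallocation per iteration.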
>>as far as i know, there exists many derived operator in C++, and no one complaining the speed
If your derived operator in C++ is operating on objects that are a container with variable capacity, then you will see similar allocations for temporaries.
>>we always want to see the expression: a = b + c, other than: a%vc = b%vc + c%vc
shorthand:
    associate (a => oa%vc, b => ob%vc, c => oc%vc)
        ...
        a = b + c
        ...
    end associate
Jim Dempsey
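Spelled out, the associate suggestion above might look like this in full (add_vectors and the argument names are placeholders of mine, not from the thread):

```fortran
!sketch of the associate idiom: the body reads like
!vector math while operating on the bare component arrays
subroutine add_vectors(oa, ob, oc)
    type(vector), intent(inout) :: oa
    type(vector), intent(in) :: ob, oc
    associate (a => oa%vc, b => ob%vc, c => oc%vc)
        a = b + c   !no operator overloads, no temporaries
    end associate
end subroutine add_vectors
```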
The parameterized derived type should theoretically address your issue. However, these are the run times (in seconds) for your last test program, built 64-bit with the latest Windows compiler:
| Optimization level | Fixed size (100) | Allocatable | Parameterized |
| O2 | 0.81 | 3.3 | 25.4 |
| O3, IPO | 0.53 | 3.3 | 26.5 |
Web master: The above looks like a table when I edit the reply form but displays as a simple list when posted.
You can see the long run times for parameterized types. This is the first time I have tried this new feature, and I am disappointed by the performance; it should be close to the times for the fixed-size array.
Here is the full code I used for the parameterized derived type. I made no attempt to tune it (data alignment, explicit vectorization, etc.), but I doubt any tuning is going to fix it.
module constants
    integer, parameter :: ip = 4
    integer, parameter :: rp = 8
end module constants

module vector_
    use constants
    implicit none
    private
    public :: vector
    public :: operator(+), operator(*)

    type :: vector(n)
        integer, len :: n
        real(rp) :: vc(n)
    contains
        generic :: init => init_ar
        procedure, private :: init_ar
    end type

    interface operator(+)
        procedure :: vplus
    end interface
    interface operator(*)
        procedure :: svproduct
        procedure :: vsproduct
    end interface
contains
    pure subroutine init_ar(this, ar)
        class(vector(*)), intent(out) :: this
        real(rp), intent(in) :: ar(:)
        this%vc = ar
    end

    elemental function vplus(lhs, rhs) result(vvp)
        type(vector(*)), intent(in) :: lhs
        type(vector(lhs%n)), intent(in) :: rhs
        type(vector(lhs%n)) :: vvp
        vvp%vc = lhs%vc + rhs%vc
    end

    elemental function svproduct(lhs, rhs) result(vr)
        real(rp), intent(in) :: lhs
        type(vector(*)), intent(in) :: rhs
        type(vector(rhs%n)) :: vr
        vr%vc = lhs * rhs%vc
    end function svproduct

    elemental function vsproduct(lhs, rhs) result(vr)
        type(vector(*)), intent(in) :: lhs
        real(rp), intent(in) :: rhs
        type(vector(lhs%n)) :: vr
        vr%vc = rhs * lhs%vc
    end
end module vector_

program test
    use vector_
    implicit none
    integer, parameter :: ip = 4, rp = 8
    integer(ip) :: i, n
    real(rp) :: t1, t2, t3, t4, t5, t6
    real(rp), dimension(:), allocatable :: p, pp
    real(rp), dimension(:), pointer :: e, ee
    type(vector(100)) :: q, qq
    allocate(p(100), pp(100))
    p = 1.001_rp
    pp = 0.0_rp
    n = 10000000
    !the fastest i know
    call CPU_TIME(t3)
    do i = 1, n
        call op(p, pp)
    end do
    call CPU_TIME(t4)
    print*, t4 - t3
    call qq%init(pp)
    call q%init(p)
    call CPU_TIME(t1)
    do i = 1, n
        qq = qq + q + 2.d0 * q
    end do
    call CPU_TIME(t2)
    print*, qq%vc(1:10)
    print*, t2 - t1
    pause
contains
    pure subroutine op(s, ss)
        real(rp), dimension(:), intent(in) :: s
        real(rp), dimension(:), intent(inout) :: ss
        ss = ss + s + 2.d0 * s
    end
end program
jimdempseyatthecove wrote:
>>as far as i know, there exists many derived operator in C++, and no one complaining the speed
If your derived operator in C++ is operating on objects that are a container with variable capacity, then you will see similar allocations for temporaries.
>>we always want to see the expression: a = b + c, other than: a%vc = b%vc + c%vc
shorthand:
    associate (a => oa%vc, b => ob%vc, c => oc%vc)
        ...
        a = b + c
        ...
    end associate
Jim Dempsey
Part 1: I'm not familiar with C++; maybe I was mistaken.
Part 2: that partly solves the problem, but it doesn't look elegant :(
Impressive test...
This is the first time I have encountered this type.
What is the parameterized derived type designed for, given such poor performance?
Please raise this as a support issue for the performance of parameterized derived types. I would have expected performance somewhere between the fixed size and the allocatable vectors, not an order of magnitude slower! There really was no point in providing it like this.
Blatant bump
I have been looking at it Andrew. I'll get it to Development soon.
(Internal tracking id: TBD)
Any progress please?
Was the idea of the parameterized derived type just an academic exercise or was it a serious attempt to introduce a unique performance benefit to the language?
Performance benefit? No. Indeed there are several on the standards committee who have recently expressed the opinion that this feature should never have been added to the language.
That is very disappointing news. From snippets of conversations over the last few years, I was under the impression that parameterized derived types would give better opportunities for stack allocation and vectorization than allocatable vectors. I was hoping they would get near the speed of fixed-size vectors, since the performance loss from allocatable vectors is large (6x in my test above).
But why is the performance another 10x slower than allocatable vectors? If this is expected, then they are pretty much useless in a high-performance language.
I have no idea, other than to make two observations:
- Each time a new, significant language feature was added, it took time for compilers to learn how to optimize them well. Consider array operations vs. DO loops.
- Any time you defer information to run-time, you lose performance. KIND type parameters are fine - those are always compile-time. But LEN parameters have been nothing but trouble for compiler implementors.
My advice would be to file a report with Intel and ask that the performance degradation be investigated. Maybe it's something simple, but don't get your hopes up too much.
Li L:
>>Web master: The above looks like a table when I edit the reply form but displays as a simple list when posted
The text of the general message on this forum is a variable pitch font. To get a fixed pitch font, Click on the {...} code button, select Plain Text, and enter/paste in your text.
As for performance, consider experimenting with specifying the vector bounds in the operation
elemental function svproduct(lhs, rhs) result(vr)
    real(rp), intent(in) :: lhs
    type(vector(*)), intent(in) :: rhs
    type(vector(rhs%n)) :: vr
    vr%vc(1:rhs%n) = lhs * rhs%vc(1:rhs%n)
end function svproduct
and the same for the other functions.
Jim Dempsey
Using my posted example, I tried Jim's suggestion and got no improvement.
Then I tried /Qhost and got a speedup of 0.15 s for the fixed-size vector and 0.3 s for the parameterized derived type. This still leaves a whopping 25 s discrepancy. Can Intel explain why this is?
Andrew,
In the example in #9, I notice that the "member" functions are declared elemental, but the usage is scalar (in other words, not arrays of type(vector)).
What happens when you remove "elemental" from the derived type functions?
Jim Dempsey
No significant change without elemental
Andrew Smith wrote:
Using my posted example, I tried Jim's suggestion and got no improvement.
Then I tried /Qhost and got a speedup of 0.15s for the fixed size vector and 0.3s for the parameterized derived type. This still leaves a whopping 25s discrepancy. Can Intel explain why this is?
Please submit the ticket via Online Service Center for further investigation.
Thank you,