Solved: How to get fast operations for a derived type? A comparison test shows optimization does nothing

Li_L_ · ‎04-08-2017

module constants
    integer,parameter:: ip = 4
    integer,parameter:: rp = 8
end module constants

module vector_
use constants
implicit none

    private
    public:: vector
    public:: operator(+),operator(*)

    type::  vector
        real(rp),allocatable,dimension(:):: vc
    contains
        generic::           init    =>  init_ar
        procedure,private:: init_ar    
    end type vector

!---------------------------------------------
    interface operator(+)
        procedure::  vplus
    end interface

    interface operator(*)
        procedure::  svproduct
        procedure::  vsproduct
    end interface
    
!-----
contains

    !&&
    pure subroutine init_ar(this,ar)
    class(vector),intent(out)::         this
    real(rp),dimension(:),intent(in)::  ar
        this%vc = ar
    end subroutine init_ar
    
!----operator
    elemental type(vector) function vplus(lhs,rhs) result(vvp)
    type(vector),intent(in)::   lhs,rhs
        vvp%vc = lhs%vc + rhs%vc
    end function vplus
    
    !--
    elemental function svproduct(lhs,rhs) result(vr)
    real(rp),intent(in)::               lhs
    type(vector),intent(in)::           rhs
    type(vector)::                      vr
        vr%vc = lhs * rhs%vc
    end function svproduct
    
    !--
    elemental function vsproduct(lhs,rhs) result(vr)
    type(vector),intent(in)::           lhs
    real(rp),intent(in)::               rhs
    type(vector)::                      vr
        vr%vc = rhs * lhs%vc
    end function vsproduct
    
end module vector_

program test
use constants
use vector_
implicit none
integer(ip)::   i,n,j
real(rp)::      t1,t2,t3,t4,t5,t6,t7
type(vector)::  p1
real(rp),dimension(:),allocatable:: p2
real(rp),dimension(100):: p3


    p3 = 1.0001d0
    p2 = p3
    call p1%init(p2)
    
    n = 1e7
    
    !1
    call CPU_TIME(t1)   !2.375
    do i=1,n
        p1 = p1 + 2.d0 * p1
    enddo
    
    !2
    call CPU_TIME(t2)   !0.297
    do i=1,n
        p2 = p2 + 2.d0 * p2
    enddo
    
    !3
    call CPU_TIME(t3)   !2.5
    do i=1,n
        call op(p2)
    enddo
    
    !4
    call CPU_TIME(t4)   !2.531
    do i=1,n
        p3 = p3 + 2.d0 * p3
    enddo
    
    !5
    call CPU_TIME(t5)   !2.515
    do i=1,n
        do j=1,100
            p3(j) = p3(j) + 2.d0 * p3(j)
        enddo
    enddo
    
    !6
    call CPU_TIME(t6)   !0.234
    do i=1,n
        call op(p3)
    enddo
    
    call CPU_TIME(t7)
    
    print*, '1',t2 - t1
    print*, '2',t3 - t2
    print*, '3',t4 - t3
    print*, '4',t5 - t4
    print*, '5',t6 - t5
    print*, '6',t7 - t6

contains

    pure subroutine op(s)
    real(rp),dimension(:),intent(inout):: s
        s = s + 2.d0 * s
    end subroutine op
    
end program test

Here I test array operations on three kinds of array:

1. a derived type vector, which is actually a wrapper around an allocatable array

2. an allocatable array, whose size is not known at compile time

3. an array with a fixed size

Six procedures are then timed, all of which compute s = s + 2.d0 * s.

The time cost differs a lot between the procedures.

With O2 the times are: proc1 (2.375 s), proc2 (0.297 s), proc3 (2.5 s), proc4 (2.531 s), proc5 (2.515 s), proc6 (0.234 s).

With O3, proc1, proc2 and proc6 are unchanged, while proc3, proc4 and proc5 drop to around 1.25 s.

So my question: is it possible to get the speed of proc6 for a derived type with overloaded operators?

How can I do it?

Roman1 · ‎04-10-2017

I hope you know that your test causes a floating point overflow. Was that intentional? When doing these kinds of benchmark tests, I don't think it is appropriate to have values set to Inf or NaN.

Also, can you add a print statement at the very end, such as print*,p3(1)? Since the array is not accessed at the very end, the optimizer might do something "smart" and remove its calculation.
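A minimal sketch of both points, assuming a stand-alone timing loop rather than the original test: the increment is kept small so the values stay finite, and a checksum is printed at the end so the optimizer cannot discard the loop. The program name and the variables a and acc are made up for illustration.

```fortran
program bench_sketch
   implicit none
   integer, parameter :: rp = kind(1.0d0)
   integer  :: i, n
   real(rp) :: t0, t1
   real(rp) :: a(100), acc(100)

   a   = 1.0e-9_rp
   acc = 0.0_rp
   n   = 10000000

   call cpu_time(t0)
   do i = 1, n
      ! acc grows linearly (by 3.0e-9 per pass), so it never overflows
      acc = acc + a + 2.0_rp * a
   end do
   call cpu_time(t1)

   print *, 'time (s):', t1 - t0
   ! Using the result after the loop keeps the computation observable
   print *, 'checksum:', sum(acc)
end program bench_sketch
```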

Roman

Li_L_ · ‎04-10-2017

Roman wrote:

I hope you know that your test causes a floating point overflow. Was that intentional? When doing these kinds of benchmark tests, I don't think it is appropriate to have values set to Inf or NaN.

Also, can you add a print statement at the very end, such as print*,p3(1)? Since the array is not accessed at the very end, the optimizer might do something "smart" and remove its calculation.

Roman

Thanks for the correction. I realized my mistake after posting, so here is another test:

program test
use vector_
implicit none
integer,parameter:: ip=4,rp=8
integer(ip)::       i,n
real(rp)::          t1,t2,t3,t4,t5,t6
real(rp),dimension(:),allocatable:: p,pp
real(rp),dimension(:),pointer::     e,ee
type(vector)::                      q,qq

    allocate(p(100),pp(100))
    p = 1.001d0
    pp = 0.d0
    n=1e7
    
    !the fastest i know
    call CPU_TIME(t3)
    do i=1,n
        call op(p,pp)
    enddo
    call CPU_TIME(t4)
    
    print*, t4 - t3
    
    
    call qq%init(pp)
    call q%init(p)
    
    call CPU_TIME(t1)
    !e   =>  q%ptr()
    !ee  =>  qq%ptr()
    do i=1,n
        qq = qq + q + 2.d0 * q
        !call op(e,ee)
    enddo
    call CPU_TIME(t2)
    
    print*, qq%vc(1:10)
    print*, t2-t1

contains

    pure subroutine op(s,ss)
    real(rp),dimension(:),intent(in)::  s
    real(rp),dimension(:),intent(inout):: ss
        ss = ss + s + 2.d0 * s
    end subroutine op
    
end program test

Here I post the fastest and the slowest procedures. The second takes about 10 times as long as the first.

I also tried changing the vector into a wrapper around a fixed-size array, like (real(rp),dimension(100):: vc).

Then the calculation takes around 2~3 times as long as the fastest procedure.
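For reference, the fixed-size variant described above might look roughly like this. It is a sketch, not the code that was actually timed: the module name vector_fixed and the constant nvc are made up, and only one of the two product operators is shown.

```fortran
module vector_fixed
   implicit none
   private
   public :: vector, operator(+), operator(*)

   integer, parameter :: rp  = kind(1.0d0)
   integer, parameter :: nvc = 100      ! hypothetical compile-time size

   type :: vector
      real(rp) :: vc(nvc)               ! fixed size, no allocatable attribute
   end type vector

   interface operator(+)
      module procedure vplus
   end interface

   interface operator(*)
      module procedure svproduct
   end interface

contains

   elemental function vplus(lhs, rhs) result(vr)
      type(vector), intent(in) :: lhs, rhs
      type(vector) :: vr
      vr%vc = lhs%vc + rhs%vc
   end function vplus

   elemental function svproduct(lhs, rhs) result(vr)
      real(rp),     intent(in) :: lhs
      type(vector), intent(in) :: rhs
      type(vector) :: vr
      vr%vc = lhs * rhs%vc
   end function svproduct

end module vector_fixed

program demo_fixed
   use vector_fixed
   implicit none
   type(vector) :: q, qq
   integer :: i

   q%vc  = 1.001d0
   qq%vc = 0.0d0
   do i = 1, 1000
      qq = qq + q + 2.0d0 * q   ! function results still exist, but need no heap allocation
   end do
   print *, qq%vc(1)
end program demo_fixed
```

Because the component size is a compile-time constant, the function results can in principle live on the stack, which is consistent with this variant being faster than the allocatable one.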

So I conclude that the <allocatable> attribute in the operator functions is mainly responsible for the extra time cost.

Furthermore, I guess a pure function in Fortran is not as fast as an inline function in C++ (ps: just a guess, I know little about C++).

I feel Fortran lacks optimization for user-defined operators on derived types, or do I lack some knowledge of Fortran? I don't know.

For now I have rewritten my code to compute with intrinsic data types and arrays.

Any suggestions about user-defined operator functions are welcome.

Roman1 · ‎04-11-2017

When you do the second test (qq = qq + q + 2.d0 * q) , there are temporary array variables created at each operator function call. If I remove these calls, the test would look something like the following code. If you run it, you will get similar time values to what you had before. As you can see, all of these allocations, deallocations and array copies are slowing the program down.

type(vector):: v1,v2,v3

    p = 1.001d0
    pp = 0.d0
    call qq%init(pp)
    call q%init(p)
    
    call CPU_TIME(t1)
    do i=1,n
       allocate(v1%vc(size(q%vc)), v2%vc(size(q%vc)), v3%vc(size(q%vc)) )
       v1%vc = 2.d0 * q%vc
       v2%vc = q%vc + v1%vc
       v3%vc = qq%vc + v2%vc
       qq%vc = v3%vc
       deallocate(v1%vc, v2%vc, v3%vc)
    enddo
    call CPU_TIME(t2)

Li_L_ · ‎04-12-2017

Roman wrote:

When you do the second test (qq = qq + q + 2.d0 * q) , there are temporary array variables created at each operator function call. If I remove these calls, the test would look something like the following code. If you run it, you will get similar time values to what you had before. As you can see, all of these allocations, deallocations and array copies are slowing the program down.
type(vector):: v1,v2,v3

    p = 1.001d0
    pp = 0.d0
    call qq%init(pp)
    call q%init(p)
    
    call CPU_TIME(t1)
    do i=1,n
       allocate(v1%vc(size(q%vc)), v2%vc(size(q%vc)), v3%vc(size(q%vc)) )
       v1%vc = 2.d0 * q%vc
       v2%vc = q%vc + v1%vc
       v3%vc = qq%vc + v2%vc
       qq%vc = v3%vc
       deallocate(v1%vc, v2%vc, v3%vc)
    enddo
    call CPU_TIME(t2)

That's the point!

As far as I know, there are many user-defined operators in C++, and no one complains about the speed.

So the question is: is there a similar way to construct user-defined operators in Fortran that avoids temporary containers?

Or does Fortran discourage user-defined operators for derived types?

You know, it's mentally hard to go back from operating on polynomials to dealing with native arrays.

A polynomial is a special allocatable vector, and we always want to write the expression a = b + c rather than a%vc = b%vc + c%vc followed by round(a).
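One possible way to keep the derived type but avoid function-result temporaries is a type-bound subroutine that updates its object in place. The sketch below is only an illustration under that assumption: the names vector_inplace and add_scaled are hypothetical and do not appear in the thread, and whether it actually reaches the speed of the plain-array version would have to be measured.

```fortran
module vector_inplace
   implicit none
   private
   public :: vector

   integer, parameter :: rp = kind(1.0d0)

   type :: vector
      real(rp), allocatable :: vc(:)
   contains
      procedure :: add_scaled          ! hypothetical in-place update
   end type vector

contains

   ! this%vc = this%vc + alpha * x%vc, with no function result to allocate
   pure subroutine add_scaled(this, alpha, x)
      class(vector), intent(inout) :: this
      real(rp),      intent(in)    :: alpha
      type(vector),  intent(in)    :: x
      this%vc = this%vc + alpha * x%vc
   end subroutine add_scaled

end module vector_inplace

program demo_inplace
   use vector_inplace
   implicit none
   type(vector) :: q, qq
   integer :: i

   allocate(q%vc(100), qq%vc(100))
   q%vc  = 1.001d0
   qq%vc = 0.0d0

   do i = 1, 1000
      ! mathematically the same update as qq = qq + q + 2.d0 * q
      call qq%add_scaled(3.0d0, q)
   end do
   print *, qq%vc(1)
end program demo_inplace
```

The price is that the expression syntax is lost: the caller writes a call instead of a = b + c.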

jimdempseyatthecove · ‎04-12-2017

>>as far as i know, there exists many derived operator in C++, and no one complaining the speed

If your derived operator in C++ is operating on objects that are a container with variable capacity, then you will see similar allocations for temporaries.

>>we always want to see the expression: a = b + c, other than: a%vc = b%vc + c%vc

shorthand: associate (a=>oa%vc, b=>ob%vc, c=>oc%vc)
...
a = b + c
...
end associate shorthand
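A complete version of this idea could look like the sketch below, assuming three objects oa, ob and oc of a simple wrapper type (the type definition here is a stand-in for the one in the thread). Inside the construct the associate names are ordinary arrays without the allocatable attribute, so oa%vc has to be allocated with the right shape beforehand.

```fortran
program associate_sketch
   implicit none
   integer, parameter :: rp = kind(1.0d0)

   type :: vector
      real(rp), allocatable :: vc(:)
   end type vector

   type(vector) :: oa, ob, oc

   allocate(oa%vc(4), ob%vc(4), oc%vc(4))
   ob%vc = 1.0_rp
   oc%vc = 2.0_rp

   ! a, b and c are aliases for the components, so the assignment below
   ! is a plain array operation with no operator functions involved
   shorthand: associate (a => oa%vc, b => ob%vc, c => oc%vc)
      a = b + c
   end associate shorthand

   print *, oa%vc
end program associate_sketch
```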

Jim Dempsey

Andrew_Smith · ‎04-12-2017

The parameterized derived type should theoretically address your issue. However these are the run times for your last test program run on 64 bit and latest windows compiler:

Optimization level   Fixed size (100)   Allocatable   Parameterized
O2                   0.81 s             3.3 s         25.4 s
O3, IPO              0.53 s             3.3 s         26.5 s

Web master: The above looks like a table when I edit the reply form but displays as a simple list when posted.

You can see the long run times for parameterized types. This is the first time I got to try this new feature and I am disappointed by the performance. It should be close to the times for the fixed size array.

Here is the full code I used for the parametrized derived type. I made no attempt to tune it (data alignment, explicit vectoring etc.) but I doubt any tuning is going to fix it.

module constants
    integer,parameter:: ip = 4
    integer,parameter:: rp = 8
end module constants

module vector_
   use constants
   implicit none
   private
   public :: vector
   public :: operator(+), operator(*)

   type :: vector(n)
      integer, len :: n
      real(rp) :: vc(n)
    contains
        generic :: init => init_ar
        procedure, private :: init_ar   
    end type

    interface operator(+)
        procedure::  vplus
    end interface

    interface operator(*)
        procedure::  svproduct
        procedure::  vsproduct
    end interface
contains

pure subroutine init_ar(this,ar)
    class(vector(*)),intent(out) :: this
    real(rp), intent(in) :: ar(:)
    this%vc = ar
end

elemental function vplus(lhs,rhs) result(vvp)
    type(vector(*)),intent(in) :: lhs
    type(Vector(lhs%n)), intent(in) :: rhs
    type(vector(lhs%n)) vvp
    vvp%vc = lhs%vc + rhs%vc
end

elemental function svproduct(lhs,rhs) result(vr)
   real(rp),intent(in) :: lhs
   type(vector(*)),intent(in) :: rhs
   type(vector(rhs%n)) :: vr

   vr%vc = lhs * rhs%vc
end function svproduct

elemental function vsproduct(lhs,rhs) result(vr)
   type(vector(*)),intent(in) :: lhs
   real(rp),intent(in) :: rhs
   type(vector(lhs%n)) :: vr
   vr%vc = rhs * lhs%vc
end

end module vector_
   
program test
   use vector_
   implicit none

   integer,parameter:: ip=4,rp=8
   integer(ip)::       i,n
   real(rp)::          t1,t2,t3,t4,t5,t6
   real(rp),dimension(:),allocatable:: p,pp
   real(rp),dimension(:),pointer::     e,ee
   type(vector(100)) :: q,qq

   allocate(p(100),pp(100))
   p = 1.001_rp
   pp = 0.0_rp
   n = 10000000

   !the fastest i know
   call CPU_TIME(t3)

   do i=1,n
      call op(p,pp)
   end do

   call CPU_TIME(t4)
   print*, t4 - t3

   call qq%init(pp)
   call q%init(p)
   call CPU_TIME(t1)

   do i=1,n
      qq = qq + q + 2.d0 * q
   end do

   call CPU_TIME(t2)
   print*, qq%vc(1:10)
   print*, t2-t1
   pause
contains

pure subroutine op(s,ss)
    real(rp),dimension(:),intent(in)::  s
    real(rp),dimension(:),intent(inout):: ss
    ss = ss + s + 2.d0 * s
end

end program

Li_L_ · ‎04-12-2017

jimdempseyatthecove wrote:

>>as far as i know, there exists many derived operator in C++, and no one complaining the speed

If your derived operator in C++ is operating on objects that are a container with variable capacity, then you will see similar allocations for temporaries.

>>we always want to see the expression: a = b + c, other than: a%vc = b%vc + c%vc

shorthand: associate (a=>oa%vc, b=>ob%vc, c=>oc%vc)
...
a = b + c
...
end associate shorthand

Jim Dempsey

Part 1: I'm not familiar with C++, so maybe I made a mistake.

Part 2: that partly solves the problem, but it doesn't look elegant :(

Li_L_ · ‎04-12-2017

Andrew Smith wrote:

The parameterized derived type should theoretically address your issue. However these are the run times for your last test program run on 64 bit and latest windows compiler:

Optimization level   Fixed size (100)   Allocatable   Parameterized
O2                   0.81 s             3.3 s         25.4 s
O3, IPO              0.53 s             3.3 s         26.5 s

Web master: The above looks like a table when I edit the reply form but displays as a simple list when posted.

You can see the long run times for parameterized types. This is the first time I got to try this new feature and I am disappointed by the performance. It should be close to the times for the fixed size array.

Impressive test...

This is the first time I have seen this kind of type.

What is the parameterized derived type designed for, with such poor performance...?

Andrew_Smith · ‎04-13-2017

Please raise this as a support issue for the performance of parameterized derived types. I would have expected performance somewhere between the fixed size and the allocatable vectors, not an order of magnitude slower! There really was no point in providing it like this.

Andrew_Smith · ‎04-21-2017

Blatant bump

Kevin_D_Intel · ‎04-21-2017

I have been looking at it Andrew. I'll get it to Development soon.

(Internal tracking id: TBD)

Andrew_Smith · ‎06-23-2017

Any progress please?

Was the idea of the parameterized derived type just an academic exercise or was it a serious attempt to introduce a unique performance benefit to the language?

Steve_Lionel · ‎06-23-2017

Performance benefit? No. Indeed there are several on the standards committee who have recently expressed the opinion that this feature should never have been added to the language.

Andrew_Smith · ‎06-23-2017

That is very disappointing news. From snippets of conversations over the last few years I was under the impression that parameterized derived types would give improved opportunities for stack memory allocation and vectorization compared to allocatable vectors. I was hoping they would get near the speed of fixed size vectors since the performance loss from allocatable vectors is large (6x in my test above).

But why is the performance another 10x slower than allocatable vectors? If this is expected then they are pretty much useless in a high performance language.

Steve_Lionel · ‎06-23-2017

I have no idea, other than to make two observations:

Each time a new, significant language feature was added, it took time for compilers to learn how to optimize them well. Consider array operations vs. DO loops.
Any time you defer information to run-time, you lose performance. KIND type parameters are fine - those are always compile-time. But LEN parameters have been nothing but trouble for compiler implementors.

My advice would be to file a report with Intel and ask that the performance degradation be investigated. Maybe it's something simple, but don't get your hopes up too much.
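To make the KIND versus LEN distinction concrete, here is a small sketch. The type vec and its parameters k and n are made up for illustration, and support for parameterized derived types varies between compilers, so treat it as an assumption-laden example rather than a tested recipe.

```fortran
program pdt_params_sketch
   implicit none

   ! k is a KIND parameter: it must be a constant expression, so the
   ! compiler knows the component's real kind when it generates code.
   ! n is a LEN parameter: it may differ per object and may be assumed (*)
   ! or deferred (:), so in general it is only known at run time.
   type :: vec(k, n)
      integer, kind :: k = kind(1.0d0)
      integer, len  :: n
      real(k)       :: vc(n)
   end type vec

   type(vec(n=3))              :: a   ! default kind k, length 3
   type(vec(k=kind(1.0), n=5)) :: b   ! single precision, length 5

   a%vc = 1.0d0
   b%vc = 2.0

   ! Type parameter inquiry uses the same syntax as component access
   print *, 'a: kind =', a%k, ' len =', a%n, ' size =', size(a%vc)
   print *, 'b: kind =', b%k, ' len =', b%n, ' size =', size(b%vc)
end program pdt_params_sketch
```

In the operator functions earlier in the thread the dummy arguments are declared type(vector(*)), so n is an assumed LEN parameter there and the result size depends on a run-time value.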

jimdempseyatthecove · ‎06-24-2017

Li L:

>>Web master: The above looks like a table when I edit the reply form but displays as a simple list when posted

The text of the general message on this forum is a variable pitch font. To get a fixed pitch font, Click on the {...} code button, select Plain Text, and enter/paste in your text.

As for performance, consider experimenting with specifying the vector bounds in the operation

elemental function svproduct(lhs,rhs) result(vr)
   real(rp),intent(in) :: lhs
   type(vector(*)),intent(in) :: rhs
   type(vector(rhs%n)) :: vr

   vr%vc(1:rhs%n) = lhs * rhs%vc(1:rhs%n)
end function svproduct

and the same for the other functions.

Jim Dempsey

Andrew_Smith · ‎06-26-2017

Using my posted example I tried Jim's suggestion and got no improvement.

Then I tried /Qhost and got a speedup of 0.15s for the fixed size vector and 0.3s for the parameterized derived type. This still leaves a whopping 25s discrepancy. Can Intel explain why this is?

jimdempseyatthecove · ‎06-26-2017

Andrew,

In the example #9, I notice that the "member" functions are declared elemental, however the usage is as scalar (iow not array of type(vector)).

What happens when you remove "elemental" from the derived type functions?

Jim Dempsey

Andrew_Smith · ‎06-26-2017

No significant change without elemental

Devorah_H_Intel · ‎06-27-2017

Andrew Smith wrote:

Using my posted example I tried Jim's suggestion and got no improvement.

Then I tried /Qhost and got a speedup of 0.15s for the fixed size vector and 0.3s for the parameterized derived type. This still leaves a whopping 25s discrepancy. Can Intel explain why this is?

Please submit the ticket via Online Service Center for further investigation.

Thank you,