Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

how to get the fast operation for derived type. a comparison test show optimization do nothing

Li_L_
New Contributor I
2,294 Views
module constants
    integer,parameter:: ip = 4
    integer,parameter:: rp = 8
end module constants

module vector_
use constants
implicit none

    private
    public:: vector
    public:: operator(+),operator(*)

    type::  vector
        real(rp),allocatable,dimension(:):: vc
    contains
        generic::           init    =>  init_ar
        procedure,private:: init_ar    
    end type vector

!---------------------------------------------
    interface operator(+)
        procedure::  vplus
    end interface

    interface operator(*)
        procedure::  svproduct
        procedure::  vsproduct
    end interface
    
!-----
contains

    !&&
    pure subroutine init_ar(this,ar)
    class(vector),intent(out)::         this
    real(rp),dimension(:),intent(in)::  ar
        this%vc = ar
    end subroutine init_ar
    
!----operator
    elemental type(vector) function vplus(lhs,rhs) result(vvp)
    type(vector),intent(in)::   lhs,rhs
        vvp%vc = lhs%vc + rhs%vc
    end function vplus
    
    !--
    elemental function svproduct(lhs,rhs) result(vr)
    real(rp),intent(in)::               lhs
    type(vector),intent(in)::           rhs
    type(vector)::                      vr
        vr%vc = lhs * rhs%vc
    end function svproduct
    
    !--
    elemental function vsproduct(lhs,rhs) result(vr)
    type(vector),intent(in)::           lhs
    real(rp),intent(in)::               rhs
    type(vector)::                      vr
        vr%vc = rhs * lhs%vc
    end function vsproduct
    
end module vector_

program test
use constants
use vector_
implicit none
integer(ip)::   i,n,j
real(rp)::      t1,t2,t3,t4,t5,t6,t7
type(vector)::  p1
real(rp),dimension(:),allocatable:: p2
real(rp),dimension(100):: p3


    p3 = 1.0001d0
    p2 = p3
    call p1%init(p2)
    
    n = 1e7
    
    !1
    call CPU_TIME(t1)   !2.375
    do i=1,n
        p1 = p1 + 2.d0 * p1
    enddo
    
    !2
    call CPU_TIME(t2)   !0.297
    do i=1,n
        p2 = p2 + 2.d0 * p2
    enddo
    
    !3
    call CPU_TIME(t3)   !2.5
    do i=1,n
        call op(p2)
    enddo
    
    !4
    call CPU_TIME(t4)   !2.531
    do i=1,n
        p3 = p3 + 2.d0 * p3
    enddo
    
    !5
    call CPU_TIME(t5)   !2.515
    do i=1,n
        do j=1,100
            p3(j) = p3(j) + 2.d0 * p3(j)
        enddo
    enddo
    
    !6
    call CPU_TIME(t6)   !0.234
    do i=1,n
        call op(p3)
    enddo
    
    call CPU_TIME(t7)
    
    print*, '1',t2 - t1
    print*, '2',t3 - t2
    print*, '3',t4 - t3
    print*, '4',t5 - t4
    print*, '5',t6 - t5
    print*, '6',t7 - t6

contains

    pure subroutine op(s)
    real(rp),dimension(:),intent(inout):: s
        s = s + 2.d0 * s
    end subroutine op
    
end program test

here i test the operation of arraies, and three kinds of array are chosen

1. the derived type vector which is actually an array

2. the allocatable array with undetermined size

3. the array with determined size

and then 6 kinds of procedures are tested, which are all dealing with (s = s + 2.d0 * s )

then i find the difference time cost for each procedures

for O2, we get the time cost: proc1(2.375s), proc2(0.297s), proc3(2.5s), proc4(2.531s), proc5(2.515), proc6(0.234)

for O3, proc1, proc2, proc6 unchanged time cost, and proc3, proc4, proc5 decrease to 1.25s around

 

so i have a question: is it possible to get the speed as the proc6 for derived type with overriding operation?

how to do it?

1 Solution
Steve_Lionel
Honored Contributor III
2,259 Views

I have no idea, other than to make two observations:

  1. Each time a new, significant language feature was added, it took time for compilers to learn how to optimize them well. Consider array operations vs. DO loops.
  2. Any time you defer information to run-time, you lose performance. KIND type parameters are fine - those are always compile-time. But LEN parameters have been nothing but trouble for compiler implementors.

My advice would be to file a report with Intel and ask that the performance degradation be investigated. Maybe it's something simple, but don't get your hopes up too much.

View solution in original post

0 Kudos
24 Replies
Roman1
New Contributor I
2,026 Views

I hope you know that your test causes a floating point overflow.  Was that intentional?  When doing these kinds of benchmark tests, I don't think it is appropriate to have values set to Inf, or Nan.

Also, can you add a print statement at the very end, such as  print*,p3(1).  Since the array is not accessed at the very end, the optimizer might to something "smart", and remove its calculation.

 

Roman

 

0 Kudos
Li_L_
New Contributor I
2,026 Views

Roman wrote:

I hope you know that your test causes a floating point overflow.  Was that intentional?  When doing these kinds of benchmark tests, I don't think it is appropriate to have values set to Inf, or Nan.

Also, can you add a print statement at the very end, such as  print*,p3(1).  Since the array is not accessed at the very end, the optimizer might to something "smart", and remove its calculation.

 

Roman

 

thanks for your correction. i realized my mistakes after my post, so i have another test

program test
use vector_
implicit none
integer,parameter:: ip=4,rp=8
integer(ip)::       i,n
real(rp)::          t1,t2,t3,t4,t5,t6
real(rp),dimension(:),allocatable:: p,pp
real(rp),dimension(:),pointer::     e,ee
type(vector)::                      q,qq

    allocate(p(100),pp(100))
    p = 1.001d0
    pp = 0.d0
    n=1e7
    
    !the fastest i know
    call CPU_TIME(t3)
    do i=1,n
        call op(p,pp)
    enddo
    call CPU_TIME(t4)
    
    print*, t4 - t3
    
    
    call qq%init(pp)
    call q%init(p)
    
    call CPU_TIME(t1)
    !e   =>  q%ptr()
    !ee  =>  qq%ptr()
    do i=1,n
        qq = qq + q + 2.d0 * q
        !call op(e,ee)
    enddo
    call CPU_TIME(t2)
    
    print*, qq%vc(1:10)
    print*, t2-t1

contains

    pure subroutine op(s,ss)
    real(rp),dimension(:),intent(in)::  s
    real(rp),dimension(:),intent(inout):: ss
        ss = ss + s + 2.d0 * s
    end subroutine op
    
end program test

 

here i post the fastest and slowest procedures. the second time cost is 10 times as the first time cost

 

i once changed the vector as wrapper of determined size array, like (real(rp),dimension(100):: vc)

then the calculation is around 2~3 times as the fastest procedure

so i realize the <allocatable> attr in operator function is manly responsible for the extra time cost

and more, i guess the pure function in Fortran is not as fast as inline function in C++ (ps: just guess, and i know little about C++)

 

i feel fortran lacks optimization for derived operator of derived type, or i lack some knowledge of fortran? i don't know. 

i rewrite my code and let it compute under native data type and array.

any suggestions about the derived operator function is welcome

0 Kudos
Roman1
New Contributor I
2,026 Views

When you do the second test (qq = qq + q + 2.d0 * q) , there are temporary array variables created at each operator function call.  If I remove these calls, the test would look something like the following code.   If you run it, you will get similar time values to what you had before.  As you can see, all of these allocations, deallocations and array copies are slowing the program down.

 

type(vector):: v1,v2,v3

    p = 1.001d0
    pp = 0.d0
    call qq%init(pp)
    call q%init(p)
    
    call CPU_TIME(t1)
    do i=1,n
       allocate(v1%vc(size(q%vc)), v2%vc(size(q%vc)), v3%vc(size(q%vc)) )
       v1%vc = 2.d0 * q%vc
       v2%vc = q%vc + v1%vc
       v3%vc = qq%vc + v2%vc
       qq%vc = v3%vc
       deallocate(v1%vc, v2%vc, v3%vc)
    enddo
    call CPU_TIME(t2)

 

0 Kudos
Li_L_
New Contributor I
2,026 Views

Roman wrote:

When you do the second test (qq = qq + q + 2.d0 * q) , there are temporary array variables created at each operator function call.  If I remove these calls, the test would look something like the following code.   If you run it, you will get similar time values to what you had before.  As you can see, all of these allocations, deallocations and array copies are slowing the program down.

 

type(vector):: v1,v2,v3

    p = 1.001d0
    pp = 0.d0
    call qq%init(pp)
    call q%init(p)
    
    call CPU_TIME(t1)
    do i=1,n
       allocate(v1%vc(size(q%vc)), v2%vc(size(q%vc)), v3%vc(size(q%vc)) )
       v1%vc = 2.d0 * q%vc
       v2%vc = q%vc + v1%vc
       v3%vc = qq%vc + v2%vc
       qq%vc = v3%vc
       deallocate(v1%vc, v2%vc, v3%vc)
    enddo
    call CPU_TIME(t2)

 

that's the point!

as far as i know, there exists many derived operator in C++, and no one complaining the speed

so the question: is there a similar way to construct derived operator in Fortran which avoids temporary containers

or Fortran discourages the derived operator for derived type

 

you know, it's mentally hard to recover operating polynomials to dealing with native array.

polynomials is a special allocatable vector, and we always want to see the expression:   a = b + c,  other than: a%vc = b%vc + c%vc, then round(a)

0 Kudos
jimdempseyatthecove
Honored Contributor III
2,026 Views

>>as far as i know, there exists many derived operator in C++, and no one complaining the speed

If your derived operator in C++ is operating on objects that are a container with variable capacity, then you will see similar allocations for temporaries.

>>we always want to see the expression:   a = b + c,  other than: a%vc = b%vc + c%vc

shorthand: associate (a=>oa%vc, b=>ob%vc, c=>ob%vc)
...
a = b + c
...
end associate shorthand

Jim Dempsey
 

 

0 Kudos
Andrew_Smith
New Contributor III
2,026 Views

The parameterized derived type should theoretically address your issue. However these are the run times for your last test program run on 64 bit and latest windows compiler:

Optimization level Fixed size (100) Allocatable Parameterized
O2 0.81 3.3 25.4
O3, IPO 0.53 3.3 26.5

Web master: The above looks like a table when I edit the reply form but displays as a simple list when posted.

You can see the long run times for parameterized types. This is the first time I got to try this new feature and I am dissapointed by the performance. It should be close to the times for the fixed size array.

Here is the full code I used for the parametrized derived type. I made no attempt to tune it (data alignment, explicit vectoring etc.) but I doubt any tuning is going to fix it.

module constants
    integer,parameter:: ip = 4
    integer,parameter:: rp = 8
end module constants

module vector_
   use constants
   implicit none
   private
   public :: vector
   public :: operator(+), operator(*)

   type :: vector(n)
      integer, len :: n
      real(rp) :: vc(n)
    contains
        generic :: init => init_ar
        procedure, private :: init_ar   
    end type

    interface operator(+)
        procedure::  vplus
    end interface

    interface operator(*)
        procedure::  svproduct
        procedure::  vsproduct
    end interface
contains

pure subroutine init_ar(this,ar)
    class(vector(*)),intent(out) :: this
    real(rp), intent(in) :: ar(:)
    this%vc = ar
end

elemental function vplus(lhs,rhs) result(vvp)
    type(vector(*)),intent(in) :: lhs
    type(Vector(lhs%n)), intent(in) :: rhs
    type(vector(lhs%n)) vvp
    vvp%vc = lhs%vc + rhs%vc
end

elemental function svproduct(lhs,rhs) result(vr)
   real(rp),intent(in) :: lhs
   type(vector(*)),intent(in) :: rhs
   type(vector(rhs%n)) :: vr

   vr%vc = lhs * rhs%vc
end function svproduct

elemental function vsproduct(lhs,rhs) result(vr)
   type(vector(*)),intent(in) :: lhs
   real(rp),intent(in) :: rhs
   type(vector(lhs%n)) :: vr
   vr%vc = rhs * lhs%vc
end

end module vector_
   
program test
   use vector_
   implicit none

   integer,parameter:: ip=4,rp=8
   integer(ip)::       i,n
   real(rp)::          t1,t2,t3,t4,t5,t6
   real(rp),dimension(:),allocatable:: p,pp
   real(rp),dimension(:),pointer::     e,ee
   type(vector(100)) :: q,qq

   allocate(p(100),pp(100))
   p = 1.001_rp
   pp = 0.0_rp
   n = 10000000

   !the fastest i know
   call CPU_TIME(t3)

   do i=1,n
      call op(p,pp)
   end do

   call CPU_TIME(t4)
   print*, t4 - t3

   call qq%init(pp)
   call q%init(p)
   call CPU_TIME(t1)

   do i=1,n
      qq = qq + q + 2.d0 * q
   end do

   call CPU_TIME(t2)
   print*, qq%vc(1:10)
   print*, t2-t1
   pause
contains

pure subroutine op(s,ss)
    real(rp),dimension(:),intent(in)::  s
    real(rp),dimension(:),intent(inout):: ss
    ss = ss + s + 2.d0 * s
end

end program

 

Li_L_
New Contributor I
2,026 Views

jimdempseyatthecove wrote:

>>as far as i know, there exists many derived operator in C++, and no one complaining the speed

If your derived operator in C++ is operating on objects that are a container with variable capacity, then you will see similar allocations for temporaries.

>>we always want to see the expression:   a = b + c,  other than: a%vc = b%vc + c%vc

shorthand: associate (a=>oa%vc, b=>ob%vc, c=>ob%vc)
...
a = b + c
...
end associate shorthand

Jim Dempsey
 

 

part1: i'm not familiar with c++, maybe i made a mistake

part2: that partly solves the problem, but it looks not elegant :(

0 Kudos
Li_L_
New Contributor I
2,026 Views

Andrew Smith wrote:

The parameterized derived type should theoretically address your issue. However these are the run times for your last test program run on 64 bit and latest windows compiler:

Optimization level
Fixed size (100)
Allocatable
Parameterized

O2
0.81
3.3
25.4

O3, IPO
0.53
3.3
26.5

Web master: The above looks like a table when I edit the reply form but displays as a simple list when posted.

You can see the long run times for parameterized types. This is the first time I got to try this new feature and I am dissapointed by the performance. It should be close to the times for the fixed size array.

Here is the full code I used for the parametrized derived type. I made no attempt to tune it (data alignment, explicit vectoring etc.) but I doubt any tuning is going to fix it.

module constants
    integer,parameter:: ip = 4
    integer,parameter:: rp = 8
end module constants

module vector_
   use constants
   implicit none
   private
   public :: vector
   public :: operator(+), operator(*)

   type :: vector(n)
      integer, len :: n
      real(rp) :: vc(n)
    contains
        generic :: init => init_ar
        procedure, private :: init_ar   
    end type

    interface operator(+)
        procedure::  vplus
    end interface

    interface operator(*)
        procedure::  svproduct
        procedure::  vsproduct
    end interface
contains

pure subroutine init_ar(this,ar)
    class(vector(*)),intent(out) :: this
    real(rp), intent(in) :: ar(:)
    this%vc = ar
end

elemental function vplus(lhs,rhs) result(vvp)
    type(vector(*)),intent(in) :: lhs
    type(Vector(lhs%n)), intent(in) :: rhs
    type(vector(lhs%n)) vvp
    vvp%vc = lhs%vc + rhs%vc
end

elemental function svproduct(lhs,rhs) result(vr)
   real(rp),intent(in) :: lhs
   type(vector(*)),intent(in) :: rhs
   type(vector(rhs%n)) :: vr

   vr%vc = lhs * rhs%vc
end function svproduct

elemental function vsproduct(lhs,rhs) result(vr)
   type(vector(*)),intent(in) :: lhs
   real(rp),intent(in) :: rhs
   type(vector(lhs%n)) :: vr
   vr%vc = rhs * lhs%vc
end

end module vector_
   
program test
   use vector_
   implicit none

   integer,parameter:: ip=4,rp=8
   integer(ip)::       i,n
   real(rp)::          t1,t2,t3,t4,t5,t6
   real(rp),dimension(:),allocatable:: p,pp
   real(rp),dimension(:),pointer::     e,ee
   type(vector(100)) :: q,qq

   allocate(p(100),pp(100))
   p = 1.001_rp
   pp = 0.0_rp
   n = 10000000

   !the fastest i know
   call CPU_TIME(t3)

   do i=1,n
      call op(p,pp)
   end do

   call CPU_TIME(t4)
   print*, t4 - t3

   call qq%init(pp)
   call q%init(p)
   call CPU_TIME(t1)

   do i=1,n
      qq = qq + q + 2.d0 * q
   end do

   call CPU_TIME(t2)
   print*, qq%vc(1:10)
   print*, t2-t1
   pause
contains

pure subroutine op(s,ss)
    real(rp),dimension(:),intent(in)::  s
    real(rp),dimension(:),intent(inout):: ss
    ss = ss + s + 2.d0 * s
end

end program

 

impressive test...

i know this type for the first time

what is the parameterized derived type designed for, with a so poor performance... 

0 Kudos
Andrew_Smith
New Contributor III
2,026 Views

Please raise this as a support issue for the performance of parameterized derived types. I would have expected performance somewhere between the fixed size and the allocatable vectors, not an order of magnitude slower! There really was no point in providing it like this.

0 Kudos
Andrew_Smith
New Contributor III
2,026 Views

Blatent bump

0 Kudos
Kevin_D_Intel
Employee
2,026 Views

I have been looking at it Andrew. I'll get it to Development soon.

(Internal tracking id: TBD)

0 Kudos
Andrew_Smith
New Contributor III
2,026 Views

Any progress please?

Was the idea of the parameterized derived type just an academic exercise or was it a serious attempt to introduce a unique performance benefit to the language?

Steve_Lionel
Honored Contributor III
2,026 Views

Performance benefit? No. Indeed there are several on the standards committee who have recently expressed the opinion that this feature should never have been added to the language.

0 Kudos
Andrew_Smith
New Contributor III
2,026 Views

That is very dissapointing news. From snippets of conversations over the last few years I was under the impression that parameterized derived types would give improved opportunities for stack memory allocation and vectorization compared to allocatable vectors. I was hoping they would get near the speed of fixed size vectors since the performance loss from allocatable vectors is large (6x in my test above).

But why is the performance 10x slower again than allocatable vectors? If this is expected then they are pretty much useless in a high performance language

0 Kudos
Steve_Lionel
Honored Contributor III
2,260 Views

I have no idea, other than to make two observations:

  1. Each time a new, significant language feature was added, it took time for compilers to learn how to optimize them well. Consider array operations vs. DO loops.
  2. Any time you defer information to run-time, you lose performance. KIND type parameters are fine - those are always compile-time. But LEN parameters have been nothing but trouble for compiler implementors.

My advice would be to file a report with Intel and ask that the performance degradation be investigated. Maybe it's something simple, but don't get your hopes up too much.

0 Kudos
jimdempseyatthecove
Honored Contributor III
2,026 Views

Li L:

>>Web master: The above looks like a table when I edit the reply form but displays as a simple list when posted

The text of the general message on this forum is a variable pitch font. To get a fixed pitch font, Click on the {...} code button, select Plain Text, and enter/paste in your text.

As for performance, consider experimenting with specifying the vector bounds in the operation

elemental function svproduct(lhs,rhs) result(vr)
   real(rp),intent(in) :: lhs
   type(vector(*)),intent(in) :: rhs
   type(vector(rhs%n)) :: vr

   vr%vc(1:rhs%n) = lhs * rhs%vc(1:rhs%n)
end function svproduct

and the same for the other functions.

Jim Dempsey

Andrew_Smith
New Contributor III
2,026 Views

Using my posted example I tried Jims suggestion and got no improvement.

Then I tried /Qhost and got a speedup of 0.15s for the fixed size vector and 0.3s for the parameterized derived type. This still leaves a whopping 25s discrepancy. Can Intel explain why this is?

 

jimdempseyatthecove
Honored Contributor III
2,026 Views

Andrew,

In the example #9, I notice that the "member" functions are declared elemental, however the usage is as scalar (iow not array of type(vector)).

What happens when you remove "elemental" from the derived type functions?

Jim Dempsey

0 Kudos
Andrew_Smith
New Contributor III
2,026 Views

No significant change without elemental

0 Kudos
Devorah_H_Intel
Moderator
1,771 Views

Andrew Smith wrote:

Using my posted example I tried Jims suggestion and got no improvement.

Then I tried /Qhost and got a speedup of 0.15s for the fixed size vector and 0.3s for the parameterized derived type. This still leaves a whopping 25s discrepancy. Can Intel explain why this is?

Please submit the ticket via Online Service Center for further investigation. 

Thank you,

 

0 Kudos
Reply