Intel® Fortran Compiler

Improving execution speed with derived types

jcac
Beginner
Hi,

I am investigating ways to improve the execution speed of a Fortran 90 program. This program uses a derived data type that holds, for each particle, all the values known at that particle.

I have been testing ideas in a small program, and I see a significant speed difference between standard arrays (faster) and the derived type (slower) in the following routine:

Code:

use base_database
!
implicit none
!
integer :: i, j, k, cyc
real :: start_time, end_time
real(kind=real_acc) :: othird
!
othird = 1.0/3.0
!
call cpu_time(start_time)
!
do cyc=1,ncycle
 do i=istart, iend
  do j=1,3
   do k=1,3
    par(i)%sigma(k,j) = par(i)%sigma(k,j) +othird*par(i)%rho*par(i)%rod(k,j)
   enddo
  enddo
 enddo
enddo
!
call cpu_time(end_time)
!
write(*,*) 'Time elapsed for derived test = ',end_time-start_time
!
call cpu_time(start_time)
!
do cyc=1,ncycle
 do i=istart, iend
  do j=1,3
   do k=1,3
    p_sigma(k,j,i) = p_sigma(k,j,i) + othird*p_rho(i)*p_rod(k,j,i)
   enddo
  enddo
 enddo
enddo
!
call cpu_time(end_time)
!
write(*,*) 'Time elapsed for array test = ',end_time-start_time
!
End

sigma and rod are 3-by-3 arrays held at each particle i. All the p_ arrays and the par array of the derived type are allocatable, and all the real variables are set to double precision (real_acc). The number of particles can be large, over 100,000.
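For reference, here is a minimal sketch of what the base_database module might contain, assuming the layout described above (the component names are taken from the code; the real_acc kind and the type name are assumptions):

module base_database
  implicit none
  integer, parameter :: real_acc = selected_real_kind(15)   ! assumed: double precision
  type particle                                             ! illustrative type name
     real(kind=real_acc) :: rho
     real(kind=real_acc) :: sigma(3,3)
     real(kind=real_acc) :: rod(3,3)
  end type particle
  type(particle), allocatable :: par(:)                     ! one entry per particle
  real(kind=real_acc), allocatable :: p_sigma(:,:,:)        ! (3,3,npart)
  real(kind=real_acc), allocatable :: p_rho(:)
  real(kind=real_acc), allocatable :: p_rod(:,:,:)          ! (3,3,npart)
  integer :: istart, iend, ncycle
end module base_database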

I believe that the difference is due to the way the values are held in memory, with the derived data types being non-contiguous.

Is there a way to speed up this type of operation and get closer to the standard array speed, while using derived data types?

James


jim_dempsey
Beginner
F90/F95 has implicit loops (array syntax) for many operations on arrays:
array = 0. ! zeros all elements in array
array3 = array1 + array2 ! sums each cell.
array3 = array1 + array2 * scale ! the scalar scale is applied to every element of array2
Try changing
do j=1,3
do k=1,3
par(i)%sigma(k,j) = par(i)%sigma(k,j) +othird*par(i)%rho*par(i)%rod(k,j)
enddo
enddo
to
par(i)%sigma = par(i)%sigma +othird*par(i)%rho*par(i)%rod
Please report back as to improvement if any
Jim Dempsey

jcac
Beginner
Jim,
Thank you for the suggestion. I have tried it out and it is actually slightly slower.
The results from the program are:
Time for derived type test: 1.703125 seconds
Time for modified test: 1.765265 seconds
For comparison the array takes 0.46875 seconds
These timings are for 100 cycles with 100,000 particles. The test machine is a laptop with a 2.0 GHz Pentium M processor running Windows XP. I am using the version 9.0 compiler; as I am new to it, I am just using the default settings for the Release configuration.
James
TimP
Honored Contributor III
Have you checked whether the inner loops are unrolled? If the compiler doesn't unroll as fully in the derived type case, you could check whether unrolling the source code helps. If so, you might file a problem report.
jcac
Beginner
I am not sure how to check for loop unrolling other than in the source. I have tried unrolling the inner loops, replacing
Code:
  do j=1,3
   do k=1,3
    par(i)%sigma(k,j) = par(i)%sigma(k,j) + othird*par(i)%rho*par(i)%rod(k,j)
   enddo
  enddo


with
Code:
    par(i)%sigma(1,1) = par(i)%sigma(1,1) + othird*par(i)%rho*par(i)%rod(1,1)
    par(i)%sigma(2,1) = par(i)%sigma(2,1) + othird*par(i)%rho*par(i)%rod(2,1)
    par(i)%sigma(3,1) = par(i)%sigma(3,1) + othird*par(i)%rho*par(i)%rod(3,1)
    par(i)%sigma(1,2) = par(i)%sigma(1,2) + othird*par(i)%rho*par(i)%rod(1,2)
    par(i)%sigma(2,2) = par(i)%sigma(2,2) + othird*par(i)%rho*par(i)%rod(2,2)
    par(i)%sigma(3,2) = par(i)%sigma(3,2) + othird*par(i)%rho*par(i)%rod(3,2)
    par(i)%sigma(1,3) = par(i)%sigma(1,3) + othird*par(i)%rho*par(i)%rod(1,3)
    par(i)%sigma(2,3) = par(i)%sigma(2,3) + othird*par(i)%rho*par(i)%rod(2,3)
    par(i)%sigma(3,3) = par(i)%sigma(3,3) + othird*par(i)%rho*par(i)%rod(3,3)


and the run time is identical for both cases, to the accuracy of the cpu_time function, which seems to be 1/64 of a second.
James
jim_dempsey
Beginner

In addition to using array syntax try adding a pointer to your class

type(YourParType), pointer :: pPar

...

do i=istart, iend

pPar => par(i)

pPar%sigma = pPar%sigma +othird*pPar%rho*pPar%rod
end do
The compiler should have optimized out the indexing...
unless you had range checking enabled
Note, you may need to add "target" where you declare your
"par(:)"
Jim Dempsey


anthonyrichards
New Contributor III

dimension p_sigma(9*imax), p_rod(9*imax), p_rho(imax)

real*8 oprho

integer icount, jcount, jk

do cyc=1,ncycle
icount=(istart-2)*9
do i=istart, iend
icount=icount+9
oprho=othird*p_rho(i)
do jk=1,9
jcount=jk+icount
p_sigma(jcount) = p_sigma(jcount) + oprho*p_rod(jcount)
enddo
enddo
enddo

The heart of the loop contains one real multiply, one real add, one integer add, and a single array-index calculation. This should be faster.

jim_dempsey
Beginner

Try the following and report back the timings

use base_database
!
implicit none
!
integer :: i, j, k, cyc
! vvv add
integer :: iCachPopulate
real :: start_time, end_time
real(kind=real_acc) :: othird
! vvv add: replace 'typePar' with your par type
type(typePar), pointer :: pPar
!
othird = 1.0/3.0
!
! vvv add loop to prime the processor cache
! vvv on first iteration. Get timing on second iteration
do iCachPopulate=1,2
call cpu_time(start_time)
!
do cyc=1,ncycle
do i=istart, iend
do j=1,3
do k=1,3
par(i)%sigma(k,j) = par(i)%sigma(k,j) +othird*par(i)%rho*par(i)%rod(k,j)
enddo
enddo
enddo
enddo
!
call cpu_time(end_time)
! vvv add end of processor cache prime
enddo
!
! vvv change "test" to "test 1"
write(*,*) 'Time elapsed for derived test 1 = ',end_time-start_time
!
! Next do above test with removing (k,j) to test
! implicit array computation speed
!
! vvv add loop to prime the processor cache
! vvv on first iteration. Get timing on second iteration
do iCachPopulate=1,2
call cpu_time(start_time)
!
do cyc=1,ncycle
do i=istart, iend
par(i)%sigma = par(i)%sigma +othird*par(i)%rho*par(i)%rod
enddo
enddo
!
call cpu_time(end_time)
! vvv add end of processor cache prime
enddo
!
! vvv change "test" to "test 2"
write(*,*) 'Time elapsed for derived test 2 = ',end_time-start_time
!
! Next do above test with pointer to par
! implicit array computation speed
!
! vvv add loop to prime the processor cache
! vvv on first iteration. Get timing on second iteration
do iCachPopulate=1,2
call cpu_time(start_time)
!
do cyc=1,ncycle
do i=istart, iend
pPar => par(i)
pPar%sigma = pPar%sigma +othird*pPar%rho*pPar%rod
enddo
enddo
!
call cpu_time(end_time)
! vvv add end of processor cache prime
enddo
!
! vvv change "test" to "test 3"
write(*,*) 'Time elapsed for derived test 3 = ',end_time-start_time
!
! vvv add loop to prime the processor cache
! vvv on first iteration. Get timing on second iteration
do iCachPopulate=1,2
call cpu_time(start_time)
!
do cyc=1,ncycle
do i=istart, iend
do j=1,3
do k=1,3
p_sigma(k,j,i) = p_sigma(k,j,i) + othird*p_rho(i)*p_rod(k,j,i)
enddo
enddo
enddo
enddo
!
call cpu_time(end_time)
! vvv add end of processor cache prime
enddo
!
write(*,*) 'Time elapsed for array test = ',end_time-start_time
!
End

jcac
Beginner
Thank you for your suggestions.
Pointers are a capability of Fortran 90 that I am not familiar with, and so I had not tried them.
Jim, the timings for test 1 and test 2 are identical to the tests they are derived from. Test 3 is slightly slower than test 2. In the code you give I am not clear on the likely benefit of the cache-populate step: I am running each loop through a number of cycles in order to increase the execution time to a point where useful comparisons can be made with the cpu_time function. Also, the space required to store the sigma array for every particle is over seven megabytes in this case, which I assume exceeds any cache size.
In the main program that I am looking to improve, derived data types are used because they make the code much easier to maintain and modify. These features are more important in this case than raw execution speed. Given that, however, it would still be useful to make the program run as fast as possible, since large analyses can run for hours to days. Minimising the number of operations inside loops is one of the exercises that will be done. My interest is in whether the speed penalty from using derived types can be reduced, or whether it is an inevitable consequence of the choice to use them.
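For illustration, one small example of that kind of loop-level trimming, sketched against the derived-type loop above (scale is an assumed local variable of kind real_acc): the per-particle product is hoisted out of the 3x3 loops so it is computed once per particle rather than nine times.

scale = othird*par(i)%rho          ! computed once per particle
do j=1,3
 do k=1,3
  par(i)%sigma(k,j) = par(i)%sigma(k,j) + scale*par(i)%rod(k,j)
 enddo
enddo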
My understanding is that the reason for the speed difference between derived types and basic arrays is that, in the derived type, the sigma and rod tensors for particles i and i+1 are not held in adjacent memory locations, and so the cache is not being used as efficiently. I had wondered if I could make an array of pointers that pointed to just the sigma or rod components and work on those, but from the compile errors when I tried this I see that you are not allowed to make individual components of a derived type a target.
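One layout that keeps a derived-type organisation while making each field contiguous across particles is a "structure of arrays": a single derived type whose components are arrays dimensioned over all the particles. A hedged sketch, with illustrative names that are not from the original code:

module soa_database
  implicit none
  integer, parameter :: real_acc = selected_real_kind(15)
  type particle_fields
     real(kind=real_acc), pointer :: rho(:)       => null()
     real(kind=real_acc), pointer :: sigma(:,:,:) => null()   ! (3,3,npart)
     real(kind=real_acc), pointer :: rod(:,:,:)   => null()   ! (3,3,npart)
  end type particle_fields
  type(particle_fields) :: part
contains
  subroutine alloc_particles(npart)
    integer, intent(in) :: npart
    allocate(part%rho(npart), part%sigma(3,3,npart), part%rod(3,3,npart))
  end subroutine alloc_particles
end module soa_database

The update would then read part%sigma(k,j,i) = part%sigma(k,j,i) + othird*part%rho(i)*part%rod(k,j,i), i.e. the same memory layout as the p_ arrays, but still grouped under one derived type.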
I have downloaded a trial version of the VTune analyzer, which I hope will help me with these comparisons once I learn how to use it.
James
anthonyrichards
New Contributor III
Can you not also skip the inner j,k loops if abs(p_rho(i)) is less than some small quantity?
Jugoslav_Dujic
Valued Contributor II
Tony, I'm not an expert on optimization, but AFAIK this sort of "optimization" is a frequent error. Namely, it is usually faster for the processor to evaluate a simple expression, because it can pipeline it into the FP part of the processor separately from the normal execution flow. Any kind of test involves jumping, which interferes with control flow and thus kills performance. Check for yourself:

program FloatTest

  integer, parameter :: IMAX = 10000000
  real x(IMAX)

  do i = 1, IMAX
     call random_number(x(i))
     if (x(i) .lt. 0.5) x(i) = 0.
  end do

  call cpu_time(time0)
  y = 0.
  do i = 1, IMAX
     y = y + x(i)*(x(i)-1) + x(i)
  end do
  call cpu_time(time1)

  write(*,*) "time =", time1-time0
  write(*,*) y

  call cpu_time(time0)
  y = 0.
  do i = 1, IMAX
     if (x(i) .gt. 0.) y = y + x(i)*(x(i)-1) + x(i)
  end do
  call cpu_time(time1)

  write(*,*) "time =", time1-time0
  write(*,*) y

end program


Jugoslav
jcac
Beginner
Out of interest, I compiled the same program on another system and found that the speed difference was much less pronounced, with the array case being only 5-10% faster.
The system was a Compaq Alpha running Tru64 UNIX, using the Compaq f90 compiler with the -fast flag.
So the actual difference between the derived-type and array operations seems to be strongly dependent on the compiler/OS/hardware combination.
James
anthonyrichards
New Contributor III
But the possible saving each time is 9 executions of the arithmetic expression plus address calculations, not one. Anyway, it was just a suggestion... its suitability depends, of course, on the frequency of zeros in one of the arrays.
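For concreteness, the guard being discussed would look something like this (eps is an assumed tolerance, not from the original code); whether it pays off depends on how often rho is effectively zero, and the branch itself has a cost, as noted above:

if (abs(par(i)%rho) > eps) then
   par(i)%sigma = par(i)%sigma + othird*par(i)%rho*par(i)%rod
end if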
jim_dempsey
Beginner

James, I ran a few tests indicating the derived type is faster than the 3-dimensional arrays. I don't entirely know why. Here is my test code:

! Test2.f90
module base_database

type typeOldPar
real(8) :: rho,sigma(3,3),rod(3,3)
end type typeOldPar

type typeNewPar
sequence
real(8) :: sigma(3,3) ! Offset = 0 (+ 9*8 = 72 = 4.5*16)
real(8) :: rho ! Offset = 72 (+ 8 = 80 = 5*16)
real(8) :: rod(3,3) ! Offset = 80 (+ 9*8 = 152 = 9.5*16)
real(8) :: padd ! Offset = 152 (+8 = 160)
end type typeNewPar
type(typeOldPar), allocatable :: par(:)
type(typeNewPar), allocatable :: parNew(:)
real(8), allocatable :: p_sigma(:,:,:), p_rho(:), p_rod(:,:,:)
end module base_database


!DEC$ ATTRIBUTES FORCEINLINE :: ComputeNewSigmaArray
subroutine ComputeNewSigmaArray(p)
use base_database
!
implicit none
type(typeNewPar) :: p
real(8), parameter :: othird = 1.0D0 / 3.0D0
real(8) :: scale
scale = othird*p%rho
p%sigma = p%sigma +scale*p%rod
end subroutine ComputeNewSigmaArray

!DEC$ ATTRIBUTES FORCEINLINE :: ComputeNewSigma3x3
subroutine ComputeNewSigma3x3(p)
use base_database
!
implicit none
type(typeNewPar) :: p
real(8), parameter :: othird = 1.0D0 / 3.0D0
real(8) :: scale
integer :: i,j
scale = othird*p%rho
do j=1,3
do i=1,3
p%sigma(i,j) = p%sigma(i,j) +scale*p%rod(i,j)
end do
end do
end subroutine ComputeNewSigma3x3

program Test2

use base_database
!
implicit none
!
integer :: i, j, k, cyc
real :: start_time, end_time
real(8), parameter :: othird = 1.0D0 / 3.0D0
integer :: istart, iend, ncycle, rcycle, ircycle
istart = 1
iend = 100000
ncycle = 100
rcycle = 3
allocate(par(iend))
allocate(parNew(iend))
allocate(p_sigma(3,3,iend))
allocate(p_rho(iend))
allocate(p_rod(3,3,iend))
!
do i=istart, iend
par(i)%sigma = 0.
par(i)%rho = 0.
par(i)%rod(k,j) = 0.
parNew(i)%sigma = 0.
parNew(i)%rho = 0.
parNew(i)%rod(k,j) = 0.
enddo
!
do ircycle = 1, rcycle
call cpu_time(start_time)
!
do cyc=1,ncycle
do i=istart, iend
do j=1,3
do k=1,3
par(i)%sigma(k,j) = par(i)%sigma(k,j) +othird*par(i)%rho*par(i)%rod(k,j)
enddo
enddo
enddo
enddo
!
call cpu_time(end_time)
!
write(*,*) 'Time elapsed for derived test 1 = ',end_time-start_time
enddo
!DEC$ IF(0)
On a P4 530 with other threads running
Time elapsed for derived test 1 = 0.7031250
Time elapsed for derived test 1 = 0.7656250
Time elapsed for derived test 1 = 0.7343750
!DEC$ ENDIF
!
do ircycle = 1, rcycle
call cpu_time(start_time)
!
do cyc=1,ncycle
do i=istart, iend
call ComputeNewSigmaArray(parNew(i))
enddo
enddo
!
call cpu_time(end_time)
!
write(*,*) 'Time elapsed for derived test 2 = ',end_time-start_time
enddo
!
!DEC$ IF(0)
On a P4 530 with other threads running
Time elapsed for derived test 2 = 1.000000
Time elapsed for derived test 2 = 1.015625
Time elapsed for derived test 2 = 1.078125
!DEC$ ENDIF
do ircycle = 1, rcycle
call cpu_time(start_time)
!
do cyc=1,ncycle
do i=istart, iend
call ComputeNewSigma3x3(parNew(i))
enddo
enddo
!
call cpu_time(end_time)
!
write(*,* ) 'Time elapsed for derived test 3 = ',end_time-start_time
enddo
!DEC$ IF(0)
On a P4 530 with other threads running
Time elapsed for derived test 3 = 1.000000
Time elapsed for derived test 3 = 0.9843750
Time elapsed for derived test 3 = 1.031250
!DEC$ ENDIF
!
do ircycle = 1, rcycle
call cpu_time(start_time)
!
do cyc=1,ncycle
do i=istart, iend
do j=1,3
do k=1,3
p_sigma(k,j,i) = p_sigma(k,j,i) + othird*p_rho(i)*p_rod(k,j,i)
enddo
enddo
enddo
enddo
!
call cpu_time(end_time)
!
write(*,*) 'Time elapsed for array test = ',end_time-start_time
enddo
!DEC$ IF(0)
On a P4 530 with other threads running
Time elapsed for array test = 1.703125
Time elapsed for array test = 1.703125
Time elapsed for array test = 1.718750
!DEC$ ENDIF
!
stop
end program Test2

jcac
Beginner

Jim,

What compiler version and flags are you using?

I am using Version 9.0.2713.2002 integrated with Visual Studio .NET 2002. I built your code as a new console project, using the default Release settings and got the following:

Time elapsed for derived test 1 = 1.703125
Time elapsed for derived test 1 = 1.671875
Time elapsed for derived test 1 = 1.671875

Time elapsed for derived test 2 = 1.687500
Time elapsed for derived test 2 = 1.687500
Time elapsed for derived test 2 = 1.687500

Time elapsed for derived test 3 = 1.640625
Time elapsed for derived test 3 = 1.656250
Time elapsed for derived test 3 = 1.640625

Time elapsed for array test = 1.312500
Time elapsed for array test = 1.296875
Time elapsed for array test = 1.281250

I did experiment with some options, for example /Qipo speeds up the array but not the derived type:
Time elapsed for derived test 1 = 1.703125
Time elapsed for derived test 1 = 1.656250
Time elapsed for derived test 1 = 1.640625
Time elapsed for derived test 2 = 1.703125
Time elapsed for derived test 2 = 1.687500
Time elapsed for derived test 2 = 1.671875
Time elapsed for derived test 3 = 1.640625
Time elapsed for derived test 3 = 1.656250
Time elapsed for derived test 3 = 1.640625
Time elapsed for array test = 1.093750
Time elapsed for array test = 1.078125
Time elapsed for array test = 1.078125
What I am not sure about is why the array test with your program is slower than in mine; I need to investigate this more.
I had to make one change to compile your program: the (k,j) had to be removed from rod when initializing the variables, otherwise I received a compile error.
Steven_L_Intel1
Employee
A hint to those posting code: if you are using MSIE, use the button for source and a box pops up. Otherwise, put code tags around the source in order to prevent punctuation from being interpreted as smileys.
jim_dempsey
Beginner

Compiler options:
/nologo /Zi /O3 /QaxP /QxP /fpp /fpe:0 /module:"$(INTDIR)/"
/object:"$(INTDIR)/" /traceback /libs:static /dbglibs /c

I made some additional tests in the program

Original derived type loop
Time elapsed for derived test par(i)%sigma(k,j) = 0.7500000
Time elapsed for derived test par(i)%sigma(k,j) = 0.8125000
Time elapsed for derived test par(i)%sigma(k,j) = 0.7500000

The call to the inlined function as in prior test code
Time elapsed for derived test ComputeNewSigmaArray = 1.015625
Time elapsed for derived test ComputeNewSigmaArray = 1.031250
Time elapsed for derived test ComputeNewSigmaArray = 1.046875

Bringing the contents of the above inlined function inline by hand
Time elapsed for derived test p%sigma = 1.109375
Time elapsed for derived test p%sigma = 1.109375
Time elapsed for derived test p%sigma = 1.062500
??^^ I was surprised this ran slightly slower than having the compiler inline the code

Results of call to inlined function
Time elapsed for derived test ComputeNewSigma3x3 = 1.093750
Time elapsed for derived test ComputeNewSigma3x3 = 1.093750
Time elapsed for derived test ComputeNewSigma3x3 = 0.9531250

Bringing the contents of the above inlined function inline by hand
Time elapsed for derived test p%sigma(k,j) = 1.078125
Time elapsed for derived test p%sigma(k,j) = 1.093750
Time elapsed for derived test p%sigma(k,j) = 0.9843750

Time elapsed for array test = 1.703125
Time elapsed for array test = 1.843750
Time elapsed for array test = 1.765625

Interestingly, the intuitive actions of creating local temps for scale and a pointer to the derived-type element interfered with the compiler's optimizations - so much for intuition.

This is a good example of why some time must be invested in examining the performance impact of different methods, in particular if this function is going to consume tens, hundreds, or thousands of hours of processor time.

This may be a good candidate for using a dual-core processor with OpenMP.
From my experience with OpenMP on my P4 530 with HT, FPU-intensive applications run slower. I am looking at replacing my motherboard and processor with something with true MP capabilities.
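A minimal sketch of what that might look like for the particle loop (assuming par, istart, iend and othird as in the earlier posts, and building with the compiler's OpenMP switch, e.g. /Qopenmp):

!$omp parallel do
do i=istart, iend
   par(i)%sigma = par(i)%sigma + othird*par(i)%rho*par(i)%rod
enddo
!$omp end parallel do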

In my case, my application on the P4 530 will take several months to complete the coarse-level computations. A dual- or quad-processor system, each with dual cores, looks tantalizing (at least until I look at the cost). On the low end, a dual-core P4 840; on the high end, a quad Xeon or quad Opteron system. But more likely something in between (dual processor, each with dual core).

My simulation is a tension structure built with tethers. One configuration has 6 tethers, the other has 8. The tether end points are connected to mass objects. I am simulating what could be called a compound pendulum with flexible and interconnected arms, a non-trivial computation. The purpose of the computation is a preliminary engineering study of a second-generation space elevator.

Jim Dempsey

jim_dempsey
Beginner

RE: (k,j) on initialization

Oops

Funny thing, my compiler did not balk at this (use of an uninitialized variable). Thanks for the catch. The bug got in there with a lazy cut/paste. The improper initialization will not adversely affect the results of the test runs.

Jim Dempsey
jcac
Beginner

Changing the compile flags by adding /O3 and the processor-specific options (/QaxB /QxB in my case), I did see a speed-up for the first test.

Time elapsed for derived test 1 = 0.6875000
Time elapsed for derived test 1 = 0.5312500
Time elapsed for derived test 1 = 0.5156250

Time elapsed for derived test 2 = 1.687500
Time elapsed for derived test 2 = 1.671875
Time elapsed for derived test 2 = 1.671875

Time elapsed for derived test 3 = 1.625000
Time elapsed for derived test 3 = 1.625000
Time elapsed for derived test 3 = 1.640625

Time elapsed for array test = 1.062500
Time elapsed for array test = 1.062500
Time elapsed for array test = 1.046875

I did also try compiling it with each module/routine split into a different file, and saw a speed increase for test 1 and the array test; tests 2 and 3 were not significantly altered:

Time elapsed for derived test 1 = 0.3125000
Time elapsed for derived test 1 = 0.2187500
Time elapsed for derived test 1 = 0.2031250

Time elapsed for array test = 0.2968750
Time elapsed for array test = 0.2968750
Time elapsed for array test = 0.2968750

For now it seems that I do not need to make any basic changes within the Fortran language to speed up the derived types. Once I have completed the algorithm and code optimisations I will clearly have to look carefully at the different compiler options.

In the longer term I am looking at implementing a parallel version. However, it will have to be based on MPI, as it is almost certain that the computing facilities available to me in the future will be distributed-memory systems.

The main code I am working on is a meshless solver for transient solid, structural, and fluid mechanics problems. As there is a lot of commonality with the n-body/SPH codes developed for astrophysics simulations, I can learn a lot from the MPI implementations that have been done in that area.
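As a starting point, a hedged sketch of the usual block decomposition of the particle range over MPI ranks (the program name and npart are illustrative); each rank would then update only the particles it owns and exchange neighbour data as required:

program mpi_partition_sketch
  use mpi
  implicit none
  integer :: ierr, rank, nprocs, istart, iend
  integer, parameter :: npart = 100000
  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
  istart = rank*npart/nprocs + 1        ! first particle owned by this rank
  iend   = (rank+1)*npart/nprocs        ! last particle owned by this rank
  write(*,*) 'rank ', rank, ' owns particles ', istart, ' to ', iend
  call MPI_Finalize(ierr)
end program mpi_partition_sketch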

James

jim_dempsey
Beginner

>>In the longer term I am looking at implementing a parallel version. However, it will have to be based on MPI, as it is almost certain that the computing facilities available to me in the future will be distributed-memory systems.<<

Many mid-range workstations (will) have dual-core capabilities, so I wouldn't discount adding a little OpenMP along with MPI. You can do both.
Jim Dempsey