I am investigating ways to improve the execution speed of a Fortran 90 program. The program uses a derived data type that holds, for each particle, all the values known at that particle.
I have been testing ideas in a small program and see a significant speed difference between standard arrays (faster) and the derived type (slower) in the following routine:
Code:
use base_database
!
implicit none
!
integer :: i, j, k, cyc
real :: start_time, end_time
real(kind=real_acc) :: othird
!
othird = 1.0/3.0
!
call cpu_time(start_time)
!
do cyc=1,ncycle
  do i=istart, iend
    do j=1,3
      do k=1,3
        par(i)%sigma(k,j) = par(i)%sigma(k,j) +othird*par(i)%rho*par(i)%rod(k,j)
      enddo
    enddo
  enddo
enddo
!
call cpu_time(end_time)
!
write(*,*) 'Time elapsed for derived test = ',end_time-start_time
!
call cpu_time(start_time)
!
do cyc=1,ncycle
  do i=istart, iend
    do j=1,3
      do k=1,3
        p_sigma(k,j,i) = p_sigma(k,j,i) + othird*p_rho(i)*p_rod(k,j,i)
      enddo
    enddo
  enddo
enddo
!
call cpu_time(end_time)
!
write(*,*) 'Time elapsed for array test = ',end_time-start_time
!
End
Sigma and rod are 3 by 3 arrays held at each particle i. All the p_ arrays and the par data type are allocatable and all the real variables are set to double precision (real_acc). The number of particles can be large, over 100,000.
I believe that the difference is due to the way the values are held in memory, with the derived data types being non-contiguous.
Is there a way to speed up this type of operation and get closer to the standard array speed, while using derived data types?
James
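One way to probe the memory-layout hypothesis is to print the byte stride between consecutive particles in each representation. The sketch below is only an illustration: `particle_t` is a hypothetical stand-in for the real type (which surely has more components), and `storage_size` is a Fortran 2008 intrinsic, so it assumes a newer compiler than the one used in this thread.

```fortran
program layout_sketch
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
  ! Hypothetical stand-in for the particle type; only the three
  ! components used in the timed loop are included.
  type :: particle_t
    real(real64) :: rho
    real(real64) :: sigma(3,3)
    real(real64) :: rod(3,3)
  end type particle_t
  type(particle_t), allocatable :: par(:)
  real(real64), allocatable :: p_sigma(:,:,:)

  allocate(par(2), p_sigma(3,3,2))

  ! storage_size returns bits; divide by 8 for bytes.
  ! Stride from par(i)%sigma to par(i+1)%sigma is one whole record.
  print *, 'derived type stride (bytes):', storage_size(par(1))/8
  ! Stride from p_sigma(:,:,i) to p_sigma(:,:,i+1) is nine reals.
  print *, 'plain array stride (bytes): ', 9*storage_size(p_sigma(1,1,1))/8
end program layout_sketch
```

The array of derived types is itself contiguous; what differs is that each component is strided by the full record size, so every extra component in the real type widens the gap between the values the loop actually touches.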
do k=1,3
par(i)%sigma(k,j) = par(i)%sigma(k,j) +othird*par(i)%rho*par(i)%rod(k,j)
enddo
enddo
Time for modified test: 1.765265 seconds
do j=1,3
  do k=1,3
    par(i)%sigma(k,j) = par(i)%sigma(k,j) + othird*par(i)%rho*par(i)%rod(k,j)
  enddo
enddo
with
Code:
par(i)%sigma(1,1) = par(i)%sigma(1,1) + othird*par(i)%rho*par(i)%rod(1,1)
par(i)%sigma(2,1) = par(i)%sigma(2,1) + othird*par(i)%rho*par(i)%rod(2,1)
par(i)%sigma(3,1) = par(i)%sigma(3,1) + othird*par(i)%rho*par(i)%rod(3,1)
par(i)%sigma(1,2) = par(i)%sigma(1,2) + othird*par(i)%rho*par(i)%rod(1,2)
par(i)%sigma(2,2) = par(i)%sigma(2,2) + othird*par(i)%rho*par(i)%rod(2,2)
par(i)%sigma(3,2) = par(i)%sigma(3,2) + othird*par(i)%rho*par(i)%rod(3,2)
par(i)%sigma(1,3) = par(i)%sigma(1,3) + othird*par(i)%rho*par(i)%rod(1,3)
par(i)%sigma(2,3) = par(i)%sigma(2,3) + othird*par(i)%rho*par(i)%rod(2,3)
par(i)%sigma(3,3) = par(i)%sigma(3,3) + othird*par(i)%rho*par(i)%rod(3,3)
and the run time is identical for both cases to the accuracy of the cpu_time function, which seems to be 1/64 seconds.
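If the 1/64 second granularity of `cpu_time` is too coarse to separate two variants, `system_clock` with a 64-bit count is one possible alternative. This is a minimal sketch, not taken from the thread: the workload loop is a placeholder, and `system_clock` reports wall-clock time rather than CPU time, so other processes on the machine will affect it.

```fortran
program timer_sketch
  use, intrinsic :: iso_fortran_env, only: int64, real64
  implicit none
  integer(int64) :: t0, t1, rate
  integer :: i
  real(real64) :: acc

  call system_clock(count_rate=rate)   ! clock ticks per second
  call system_clock(t0)

  acc = 0.0_real64
  do i = 1, 10000000                   ! placeholder workload
    acc = acc + sqrt(real(i, real64))
  end do

  call system_clock(t1)
  print *, 'clock resolution (s) =', 1.0_real64/real(rate, real64)
  print *, 'elapsed (s)          =', real(t1 - t0, real64)/real(rate, real64)
  print *, acc                         ! print the result so the loop is not optimized away
end program timer_sketch
```

The resolution actually obtained depends on the compiler and platform, so it is worth printing it, as above, before trusting small differences.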
In addition to using array syntax, try adding a pointer to your class:
type(YourParType), pointer :: pPar
...
do i=istart, iend
pPar => par(i)
"par(:)"
Message Edited by sblionel on 09-20-2005 04:25 PM
dimension p_sigma(9*imax), p_rod(9*imax), p_rho(imax)
real*8 oprho
integer icount, jcount
do cyc=1,ncycle
icount=(istart-2)*9
do i=istart, iend
icount=icount+9
oprho=othird*p_rho(i)
do jk=1,9
jcount=jk+icount
p_sigma(jcount) = p_sigma(jcount) + oprho*p_rod(jcount)
enddo
enddo
enddo
The heart of the loop contains one real multiply, one real add and one integer add and a single array index calculation. This should be faster.
Try the following and report back the timings
use base_database
!
implicit none
!
integer :: i, j, k, cyc
! vvv add
integer :: iCachPopulate
real :: start_time, end_time
real(kind=real_acc) :: othird
! vvv add; replace 'typePar' with your par type
type(typePar), pointer :: pPar
!
othird = 1.0/3.0
!
! vvv add loop to prime the processor cache
! vvv on first iteration. Get timing on second iteration
do iCachPopulate=1,2
call cpu_time(start_time)
!
do cyc=1,ncycle
do i=istart, iend
do j=1,3
do k=1,3
par(i)%sigma(k,j) = par(i)%sigma(k,j) +othird*par(i)%rho*par(i)%rod(k,j)
enddo
enddo
enddo
enddo
!
call cpu_time(end_time)
! vvv add end of processor cache prime
enddo
!
! vvv change "test" to "test 1"
write(*,*) 'Time elapsed for derived test 1 = ',end_time-start_time
!
! Next do above test with removing (k,j) to test
! implicit array computation speed
!
! vvv add loop to prime the processor cache
! vvv on first iteration. Get timing on second iteration
do iCachPopulate=1,2
call cpu_time(start_time)
!
do cyc=1,ncycle
do i=istart, iend
par(i)%sigma = par(i)%sigma +othird*par(i)%rho*par(i)%rod
enddo
enddo
!
call cpu_time(end_time)
! vvv add end of processor cache prime
enddo
!
! vvv change "test" to "test 2"
write(*,*) 'Time elapsed for derived test 2 = ',end_time-start_time
!
! Next do above test with pointer to par
! implicit array computation speed
!
! vvv add loop to prime the processor cache
! vvv on first iteration. Get timing on second iteration
do iCachPopulate=1,2
call cpu_time(start_time)
!
do cyc=1,ncycle
do i=istart, iend
pPar => par(i)
pPar%sigma = pPar%sigma +othird*pPar%rho*pPar%rod
enddo
enddo
!
call cpu_time(end_time)
! vvv add end of processor cache prime
enddo
!
! vvv change "test" to "test 3"
write(*,*) 'Time elapsed for derived test 3 = ',end_time-start_time
!
! vvv add loop to prime the processor cache
! vvv on first iteration. Get timing on second iteration
do iCachPopulate=1,2
call cpu_time(start_time)
!
do cyc=1,ncycle
do i=istart, iend
do j=1,3
do k=1,3
p_sigma(k,j,i) = p_sigma(k,j,i) + othird*p_rho(i)*p_rod(k,j,i)
enddo
enddo
enddo
enddo
!
call cpu_time(end_time)
! vvv add end of processor cache prime
enddo
!
write(*,*) 'Time elapsed for array test = ',end_time-start_time
!
End
program FloatTest
integer, parameter:: IMAX = 10000000
real x(IMAX)
do i=1,IMAX
call random_number(x(i))
if (x(i).lt.0.5) x(i) = 0.
end do
call cpu_time(time0)
y = 0.
do i=1,IMAX
y = y + x(i)*(x(i)-1)+x(i)
end do
call cpu_time(time1)
write(*,*) "time =", time1-time0
write(*,*) y
call cpu_time(time0)
y = 0.
do i=1,IMAX
if (x(i).gt.0.) y = y + x(i)*(x(i)-1)+x(i)
end do
call cpu_time(time1)
write(*,*) "time =", time1-time0
write(*,*) y
end program
Jugoslav
James, I ran a few tests indicating that the derived type is faster than the three-dimensional arrays. I don't entirely know why. Here is my test code:
! Test2.f90
module base_database
type typeOldPar
real(8) :: rho,sigma(3,3),rod(3,3)
end type typeOldPar
type typeNewPar
sequence
real(8) :: sigma(3,3) ! Offset = 0 (+ 9*8 = 72 = 4.5*16)
real(8) :: rho ! Offset = 72 (+ 8 = 80 = 5*16)
real(8) :: rod(3,3) ! Offset = 80 (+ 9*8 = 152 = 9.5*16)
real(8) :: padd ! Offset = 152 (+8 = 160)
end type typeNewPar
type(typeOldPar), allocatable :: par(:)
type(typeNewPar), allocatable :: parNew(:)
real(8), allocatable :: p_sigma(:,:,:), p_rho(:), p_rod(:,:,:)
end module base_database
!DEC$ ATTRIBUTES FORCEINLINE :: ComputeNewSigmaArray
subroutine ComputeNewSigmaArray(p)
use base_database
!
implicit none
type(typeNewPar) :: p
real(8), parameter :: othird = 1.0D0 / 3.0D0
real(8) :: scale
scale = othird*p%rho
p%sigma = p%sigma +scale*p%rod
end subroutine ComputeNewSigmaArray
!DEC$ ATTRIBUTES FORCEINLINE :: ComputeNewSigma3x3
subroutine ComputeNewSigma3x3(p)
use base_database
!
implicit none
type(typeNewPar) :: p
real(8), parameter :: othird = 1.0D0 / 3.0D0
real(8) :: scale
integer :: i,j
scale = othird*p%rho
do j=1,3
do i=1,3
p%sigma(i,j) = p%sigma(i,j) +scale*p%rod(i,j)
end do
end do
end subroutine ComputeNewSigma3x3
program Test2
use base_database
!
implicit none
!
integer :: i, j, k, cyc
real :: start_time, end_time
real(8), parameter :: othird = 1.0D0 / 3.0D0
integer :: istart, iend, ncycle, rcycle, ircycle
istart = 1
iend = 100000
ncycle = 100
rcycle = 3
allocate(par(iend))
allocate(parNew(iend))
allocate(p_sigma(3,3,iend))
allocate(p_rho(iend))
allocate(p_rod(3,3,iend))
!
do i=istart, iend
par(i)%sigma = 0.
par(i)%rho = 0.
par(i)%rod = 0.
parNew(i)%sigma = 0.
parNew(i)%rho = 0.
parNew(i)%rod = 0.
enddo
!
do ircycle = 1, rcycle
call cpu_time(start_time)
!
do cyc=1,ncycle
do i=istart, iend
do j=1,3
do k=1,3
par(i)%sigma(k,j) = par(i)%sigma(k,j) +othird*par(i)%rho*par(i)%rod(k,j)
enddo
enddo
enddo
enddo
!
call cpu_time(end_time)
!
write(*,*) 'Time elapsed for derived test 1 = ',end_time-start_time
enddo
!DEC$ IF(0)
On a P4 530 with other threads running
Time elapsed for derived test 1 = 0.7031250
Time elapsed for derived test 1 = 0.7656250
Time elapsed for derived test 1 = 0.7343750
!DEC$ ENDIF
!
do ircycle = 1, rcycle
call cpu_time(start_time)
!
do cyc=1,ncycle
do i=istart, iend
call ComputeNewSigmaArray(parNew(i))
enddo
enddo
!
call cpu_time(end_time)
!
write(*,*) 'Time elapsed for derived test 2 = ',end_time-start_time
enddo
!
!DEC$ IF(0)
On a P4 530 with other threads running
Time elapsed for derived test 2 = 1.000000
Time elapsed for derived test 2 = 1.015625
Time elapsed for derived test 2 = 1.078125
!DEC$ ENDIF
do ircycle = 1, rcycle
call cpu_time(start_time)
!
do cyc=1,ncycle
do i=istart, iend
call ComputeNewSigma3x3(parNew(i))
enddo
enddo
!
call cpu_time(end_time)
!
write(*,*) 'Time elapsed for derived test 3 = ',end_time-start_time
enddo
!DEC$ IF(0)
On a P4 530 with other threads running
Time elapsed for derived test 3 = 1.000000
Time elapsed for derived test 3 = 0.9843750
Time elapsed for derived test 3 = 1.031250
!DEC$ ENDIF
!
do ircycle = 1, rcycle
call cpu_time(start_time)
!
do cyc=1,ncycle
do i=istart, iend
do j=1,3
do k=1,3
p_sigma(k,j,i) = p_sigma(k,j,i) + othird*p_rho(i)*p_rod(k,j,i)
enddo
enddo
enddo
enddo
!
call cpu_time(end_time)
!
write(*,*) 'Time elapsed for array test = ',end_time-start_time
enddo
!DEC$ IF(0)
On a P4 530 with other threads running
Time elapsed for array test = 1.703125
Time elapsed for array test = 1.703125
Time elapsed for array test = 1.718750
!DEC$ ENDIF
!
stop
end program Test2
Jim,
What compiler version and flags are you using?
I am using Version 9.0.2713.2002 integrated with Visual Studio .NET 2002. I built your code as a new console project, using the default Release settings and got the following:
Time elapsed for derived test 1 = 1.703125
Time elapsed for derived test 1 = 1.671875
Time elapsed for derived test 1 = 1.671875
Time elapsed for derived test 2 = 1.687500
Time elapsed for derived test 2 = 1.687500
Time elapsed for derived test 2 = 1.687500
Time elapsed for derived test 3 = 1.640625
Time elapsed for derived test 3 = 1.656250
Time elapsed for derived test 3 = 1.640625
Time elapsed for array test = 1.312500
Time elapsed for array test = 1.296875
Time elapsed for array test = 1.281250
Time elapsed for derived test 1 = 1.656250
Time elapsed for derived test 1 = 1.640625
Time elapsed for derived test 2 = 1.687500
Time elapsed for derived test 2 = 1.671875
Time elapsed for derived test 3 = 1.656250
Time elapsed for derived test 3 = 1.640625
Time elapsed for array test = 1.078125
Time elapsed for array test = 1.078125
and tags around the source in order to prevent punctuation from being interpreted as smileys.
Compiler options:
/nologo /Zi /O3 /QaxP /QxP /fpp /fpe:0 /module:"$(INTDIR)/"
/object:"$(INTDIR)/" /traceback /libs:static /dbglibs /c
I made some additional tests in the program
Original derived type loop
Time elapsed for derived test par(i)%sigma(k,j) = 0.7500000
Time elapsed for derived test par(i)%sigma(k,j) = 0.8125000
Time elapsed for derived test par(i)%sigma(k,j) = 0.7500000
The call to the inlined function as in prior test code
Time elapsed for derived test ComputeNewSigmaArray = 1.015625
Time elapsed for derived test ComputeNewSigmaArray = 1.031250
Time elapsed for derived test ComputeNewSigmaArray = 1.046875
Bringing the contents of the above inlined function inline by hand
Time elapsed for derived test p%sigma = 1.109375
Time elapsed for derived test p%sigma = 1.109375
Time elapsed for derived test p%sigma = 1.062500
??^^ I was surprised this ran slightly slower than having the compiler inline the code
Results of call to inlined function
Time elapsed for derived test ComputeNewSigma3x3 = 1.093750
Time elapsed for derived test ComputeNewSigma3x3 = 1.093750
Time elapsed for derived test ComputeNewSigma3x3 = 0.9531250
Bringing the contents of the above inlined function inline by hand
Time elapsed for derived test p%sigma(k,j) = 1.078125
Time elapsed for derived test p%sigma(k,j) = 1.093750
Time elapsed for derived test p%sigma(k,j) = 0.9843750
Time elapsed for array test = 1.703125
Time elapsed for array test = 1.843750
Time elapsed for array test = 1.765625
Interestingly the intuitive actions of creating local temps for scale and a pointer to the derived type element interfered with the compiler's optimizations - so much for intuition.
This is a good example of why some time must be invested in examining the performance impact of different methods, particularly if this function is going to consume tens, hundreds, or thousands of hours of processor time.
This may be a good candidate for using a dual core processor with OpenMP.
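For what it is worth, the update of each particle is independent of every other particle, so the loop over `i` is the natural place for an OpenMP work-sharing directive. The following is a rough sketch under assumptions, not the poster's code: `particle_t`, the sizes, and the initial values are all made up for illustration. Without an OpenMP compiler flag the directives are treated as comments and the program runs serially.

```fortran
program omp_sketch
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
  type :: particle_t                 ! hypothetical stand-in for the thread's type
    real(real64) :: rho
    real(real64) :: sigma(3,3)
    real(real64) :: rod(3,3)
  end type particle_t
  integer, parameter :: n = 100000, ncycle = 100
  real(real64), parameter :: othird = 1.0_real64/3.0_real64
  type(particle_t), allocatable :: par(:)
  integer :: i, cyc

  allocate(par(n))
  do i = 1, n                        ! made-up initial values
    par(i)%rho = 1.0_real64
    par(i)%sigma = 0.0_real64
    par(i)%rod = 1.0_real64
  end do

  do cyc = 1, ncycle
    ! Each particle is updated independently, so iterations of the i loop
    ! can be shared among threads without synchronization.
    !$omp parallel do default(none) shared(par) private(i)
    do i = 1, n
      par(i)%sigma = par(i)%sigma + othird*par(i)%rho*par(i)%rod
    end do
    !$omp end parallel do
  end do

  print *, 'sigma(1,1) of first particle =', par(1)%sigma(1,1)
end program omp_sketch
```

A loop this simple does very little arithmetic per byte loaded, so any gain from extra cores may be limited by memory bandwidth; it would need to be timed rather than assumed.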
My experience with OpenMP on my P4 530 with HT is that FPU-intensive applications run slower. I am looking at replacing my motherboard and processor with something with true MP capabilities.
In my case, my application on the P4 530 will take several months to complete the coarse-level computations. A dual or quad processor system, each processor with dual cores, looks tantalizing (at least until I look at the cost). On the low end, a Dual Core P4 840. On the high end, a Quad Xeon or Quad Opteron system. But more likely something in between (dual processor, each with dual core).
My simulation is a tension structure built with tethers. One configuration has 6 tethers, the other has 8. The tether end points are connected to mass objects. I am simulating what could be called a compound pendulum with flexible and interconnected arms. A non-trivial computation. The purpose of the computation is a preliminary engineering study of a second generation space elevator.
Jim Dempsey
RE: (k,j) on initialization
Oops
Funny thing: my compiler did not balk at this (using an uninitialized variable). Thanks for the catch. The bug got in there with a lazy cut/paste. The improper initialization will not adversely affect the results of the test runs.
Changing the compile flags by adding /O3 and adding the processor specific options, /QaxB /QxB, in my case, I did see a speed up for the first test.
Time elapsed for derived test 1 = 0.6875000
Time elapsed for derived test 1 = 0.5312500
Time elapsed for derived test 1 = 0.5156250
Time elapsed for derived test 2 = 1.687500
Time elapsed for derived test 2 = 1.671875
Time elapsed for derived test 2 = 1.671875
Time elapsed for derived test 3 = 1.625000
Time elapsed for derived test 3 = 1.625000
Time elapsed for derived test 3 = 1.640625
Time elapsed for array test = 1.062500
Time elapsed for array test = 1.062500
Time elapsed for array test = 1.046875
I also tried compiling it with each module/routine split into a different file and saw a speed increase for test 1 and the array test; tests 2 and 3 were not significantly altered:
Time elapsed for derived test 1 = 0.3125000
Time elapsed for derived test 1 = 0.2187500
Time elapsed for derived test 1 = 0.2031250
Time elapsed for array test = 0.2968750
Time elapsed for array test = 0.2968750
Time elapsed for array test = 0.2968750
For now it seems that I do not need to make any basic changes within the FORTRAN language to speed up the derived types. Once I have completed the algorithm and code optimisations I will clearly have to carefully look at the different compiler options.
In the longer term I am looking at implementing a parallel version. However, it will have to be based on MPI, as it is almost certain that the computing facilities available to me in the future will be distributed memory systems.
The main code I am working on is a meshless solver for transient solid, structural and fluid mechanics problems. As there is a lot of commonality with the n-body/SPH codes developed for astrophysics simulations, I can learn a lot from the MPI implementations that have been done for that area.
James
>>In the longer term I am looking at implementing a parallel version. However it will have to be based on MPI as it is almost certain that the computing facilities available to me in the future will be distributed memory systems.<<