Focus your optimization on (in order):
vectorization if possible and if overall more beneficial
reduce number of cache line fetches
increase hardware prefetch hits
increase L1 utilization
increase L2 utilization
increase L3 utilization
(increase NUMA utilization)
parallel programming
Get your serial program to run fastest first, then work on parallel programming. Keep in mind that you will likely have to re-address your cache utilization since parallel programming will cause (for unified caches) cache evictions and diminished capacity.
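As a minimal sketch of the cache-line and vectorization points above (array names a and b are illustrative, not from this thread): Fortran stores arrays column-major, so keeping the first index in the innermost loop walks consecutive memory, uses each fetched cache line fully, and gives the compiler a straightforward vectorization candidate.

program loop_order
  implicit none
  integer, parameter :: n = 4096
  real(8), allocatable :: a(:,:), b(:,:)
  integer :: i, j
  allocate (a(n,n), b(n,n))
  a = 1.0d0
  do j = 1, n          ! outer loop over columns
    do i = 1, n        ! inner loop over rows: stride-1, cache friendly, vectorizable
      b(i,j) = 2.0d0 * a(i,j)
    end do
  end do
  print *, b(n,n)      ! use the result so the loops are not removed as dead code
end program loop_order

Swapping the i and j loops would touch a new cache line on nearly every iteration for arrays this large.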
Jim Dempsey
The individual arrays method is called Structure of Arrays (SOA).
As to which is better to program with... this will depend on how your actual program manipulates the data, which may be entirely different from your test program, or it may be similar to it. You will have to decide this.
The SOA format (depending on function and access requirements) can take advantage of the SSE (and now the recent AVX) small vector instructions. This means the simple do loops in your test code can perform two (SSE) or four (AVX) REAL(8) arithmetic operations per instruction. SOA, when amenable to vectorization, is also friendlier to cache access.
AOS has (or may have) an advantage where your objects are picked randomly, as this increases the cache hit ratio when accessing adjacent data (member variables).
Is 3x faster a good enough reason for using SOA?
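A minimal sketch of the two layouts being compared (type and field names are illustrative, not anyone's actual data in this thread): AOS keeps all properties of one element adjacent in memory, while SOA keeps each property in its own contiguous array, which is what a stride-1 vectorized loop over a single property wants.

program aos_vs_soa
  implicit none
  integer, parameter :: n = 1000000

  ! AOS: one record per particle; x and y of one particle are adjacent,
  ! but x(1), x(2), ... are separated by the record stride.
  type particle
    real(8) :: x, y
  end type particle
  type(particle), allocatable :: p(:)

  ! SOA: one contiguous array per property; a loop over x alone is stride-1.
  type particles
    real(8), allocatable :: x(:), y(:)
  end type particles
  type(particles) :: ps

  integer :: i
  real(8) :: s

  allocate (p(n), ps%x(n), ps%y(n))
  do i = 1, n
    p(i)    = particle(1.0d0*i, 2.0d0*i)
    ps%x(i) = 1.0d0*i
    ps%y(i) = 2.0d0*i
  end do

  ! The SOA loop reads two unit-stride streams, a good SSE/AVX candidate;
  ! the equivalent AOS loop would read p(i)%x and p(i)%y at the 16-byte record stride.
  s = 0.0d0
  do i = 1, n
    s = s + ps%x(i) * ps%y(i)
  end do
  print *, s, p(n)%x
end program aos_vs_soa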
Jim Dempsey
If the compiler has optimizations to deal with these data access patterns, they may need inline optimization to take effect when they involve optimizing across subroutine calls.
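A hedged illustration of that point (the module and routine names here are assumptions for the sketch, not anything from this thread): if the per-element work is hidden behind a call, the loop in the caller cannot be vectorized unless the compiler inlines the callee, for example through interprocedural optimization.

module kernels
  implicit none
contains
  pure subroutine axpy_point(a, x, y)
    ! per-element kernel: y = y + a*x for a single point
    real(8), intent(in)    :: a, x
    real(8), intent(inout) :: y
    y = y + a * x
  end subroutine axpy_point
end module kernels

program inline_demo
  use kernels
  implicit none
  integer, parameter :: n = 10000
  real(8) :: x(n), y(n)
  integer :: i
  x = 1.0d0
  y = 0.0d0
  ! With the call inlined, this loop is a plain stride-1 axpy the compiler can
  ! vectorize; with the call left opaque, the access pattern is hidden from it.
  do i = 1, n
    call axpy_point(2.0d0, x(i), y(i))
  end do
  print *, y(n)
end program inline_demo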
program derived_test
  type cell
    real(8), allocatable :: pd_E(:,:), pd_W(:,:), pd_N(:,:), pd_S(:,:), Ed_H(:,:), Ed_V(:,:)
  end type cell
  type(cell) :: cp
  integer :: i, j, n
  real(8) :: val
  n = 6000
  allocate (cp%pd_S(n,n))
  allocate (cp%Ed_V(n,n))
  allocate (cp%pd_E(n,n))
  allocate (cp%pd_W(n,n))
  allocate (cp%pd_N(n,n))
  allocate (cp%Ed_H(n,n))
  do j=1,n
    do i=1,n
      cp%Ed_V(i,j)=0.005*i
      cp%pd_S(i,j)=0.000045*j
    end do
  end do
  val = 0.0d0   ! initialize the accumulator before the reduction loop
  do j=1,n
    do i=1,n
      val=val+cp%Ed_V(i,j)*cp%pd_S(i,j)
    end do
  end do
end program derived_test
I guess this is also a SOA method?
Regards
Lars
Does your fluid dynamics system partition a large volume into smaller volumes and then compute interactions of particles within each smaller volume, plus between edge particles of neighboring smaller volumes? If so, your code may "move" a particle from one smaller volume to another by moving a particle index or pointer from one smaller volume (array) to another (as opposed to moving all particle property values). When this is the case, AOS may see better performance.
On the other hand, if your particles have low mobility (cross between smaller volumes infrequently), you might want to copy all the particle properties when they migrate and stick with SOA. Note that this carries a larger memory burden, since each volume must be allocated to hold the maximum particle density expected for that smaller volume.
To summarize, there is not a simple answer. You will need to experiment. Your end program may end up having two different pathways: one for fluid of high mobility and one for fluid of low mobility.
If you have time, google "PARSEC site:princeton.edu". Among its benchmarks is a test program called fluidanimate. It does this partitioning of the volumes and will give you a better idea of the technique.
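A minimal sketch of the index-moving idea (all names and sizes are illustrative, not taken from PARSEC): each cell stores only particle ids that index into global SOA property arrays, so migrating a particle between cells moves one integer rather than copying every property.

program cell_index_sketch
  implicit none
  integer, parameter :: nparticles = 4, maxpercell = 8, ncells = 2
  real(8) :: x(nparticles)                 ! global SOA property array (one of many)
  integer :: cell_ids(maxpercell, ncells)  ! particle ids held by each cell
  integer :: cell_count(ncells)            ! current occupancy of each cell
  integer :: pid, slot

  x = (/ 0.1d0, 0.4d0, 0.6d0, 0.9d0 /)

  ! Build phase: particles 1,2 in cell 1; particles 3,4 in cell 2.
  cell_count = (/ 2, 2 /)
  cell_ids(1:2, 1) = (/ 1, 2 /)
  cell_ids(1:2, 2) = (/ 3, 4 /)

  ! The particle in slot 2 of cell 1 crosses into cell 2: only its id moves;
  ! x(pid) (and every other property array) stays untouched.
  slot = 2
  pid  = cell_ids(slot, 1)
  cell_ids(slot, 1) = cell_ids(cell_count(1), 1)   ! backfill the emptied slot
  cell_count(1)     = cell_count(1) - 1
  cell_count(2)     = cell_count(2) + 1
  cell_ids(cell_count(2), 2) = pid

  print *, cell_count, cell_ids(1:cell_count(2), 2), x(pid)
end program cell_index_sketch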
Jim Dempsey
You continue with an apparent contradiction: initializing real(8) arrays with single precision constants. Two decades ago there were platforms with 64-bit single precision, but your syntax wouldn't have worked then, and I don't know of any current 64-bit single precision.
It's hard to draw realistic performance conclusions when you leave it up to the compiler to decide how far it should short-cut your code. For example, it could fuse your two loop nests:
do j=1,n
  do i=1,n
    cp%Ed_V(i,j)=0.005*i
    cp%pd_S(i,j)=0.000045*j
    val=val+cp%Ed_V(i,j)*cp%pd_S(i,j)
  end do
end do
Then, as the array values are never accessed outside this loop, nor is val ever used, the entire thing could be eliminated as dead code.
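For completeness, a hedged sketch of what addressing both comments might look like, reusing the declarations from Lars's program above: d0 suffixes make the constants genuinely double precision, and printing val keeps the loops from being discarded as dead code.

  val = 0.0d0
  do j = 1, n
    do i = 1, n
      cp%Ed_V(i,j) = 0.005d0    * i   ! d0 suffix: real(8) constants, no single precision intermediate
      cp%pd_S(i,j) = 0.000045d0 * j
      val = val + cp%Ed_V(i,j) * cp%pd_S(i,j)
    end do
  end do
  print *, val                        ! consume the result so the work cannot be eliminated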
@TimP, I didn't do anything but copy-paste the original code and rewrite the derived type (I did not want to compare two different things). I was just curious, since I use a lot of derived types and did not want the computational overhead indicated by quarkz's code (just for the record, in my compiled code example I use the value to print to the screen, so it is used).
Regards
Lars
