Focus your optimization on (in order):
vectorization if possible and if overall more beneficial
reduce number of cache line fetches
increase hardware prefetch hits
increase L1 utilization
increase L2 utilization
increase L3 utilization
(increase NUMA utilization)
parallel programming
Get your serial program to run fastest first, then work on parallel programming. Keep in mind that you will likely have to re-address your cache utilization since parallel programming will cause (for unified caches) cache evictions and diminished capacity.
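As a minimal sketch of the cache-line and vectorization points above (array names a and b are illustrative, not from this thread): Fortran stores arrays column-major, so keeping the first index in the innermost loop walks consecutive memory, uses each fetched cache line fully, and gives the compiler a straightforward vectorization candidate.

program loop_order
  implicit none
  integer, parameter :: n = 4096
  real(8), allocatable :: a(:,:), b(:,:)
  integer :: i, j
  allocate (a(n,n), b(n,n))
  a = 1.0d0
  do j = 1, n          ! outer loop over columns
    do i = 1, n        ! inner loop over rows: stride-1, cache friendly, vectorizable
      b(i,j) = 2.0d0 * a(i,j)
    end do
  end do
  print *, b(n,n)      ! use the result so the loops are not removed as dead code
end program loop_order

Swapping the i and j loops would touch a new cache line on nearly every iteration for arrays this large.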
Jim Dempsey
The individual arrays method is called Structure of Arrays (SOA).
As to which is better to program with... this will depend on how your actual program manipulates the data, which may be entirely different from your test program, or it may be similar to it. You will have to decide this.
The SOA format (depending on function and access requirements) can take advantage of the SSE (and now the recent AVX) small vector instructions. This means the simple do loops in your test code can perform two (SSE) or four (AVX) REAL(8) arithmetic operations per instruction. SOA, when amenable to vectorization, is also friendlier to cache access.
AOS has (or may have) an advantage where your objects are picked randomly, as this increases the cache hit ratio when accessing adjacent data (member variables).
Is 3x faster a good enough reason for using SOA?
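A minimal sketch of the two layouts being compared (type and field names are illustrative, not anyone's actual data in this thread): AOS keeps all properties of one element adjacent in memory, while SOA keeps each property in its own contiguous array, which is what a stride-1 vectorized loop over a single property wants.

program aos_vs_soa
  implicit none
  integer, parameter :: n = 1000000

  ! AOS: one record per particle; x and y of one particle are adjacent,
  ! but x(1), x(2), ... are separated by the record stride.
  type particle
    real(8) :: x, y
  end type particle
  type(particle), allocatable :: p(:)

  ! SOA: one contiguous array per property; a loop over x alone is stride-1.
  type particles
    real(8), allocatable :: x(:), y(:)
  end type particles
  type(particles) :: ps

  integer :: i
  real(8) :: s

  allocate (p(n), ps%x(n), ps%y(n))
  do i = 1, n
    p(i)    = particle(1.0d0*i, 2.0d0*i)
    ps%x(i) = 1.0d0*i
    ps%y(i) = 2.0d0*i
  end do

  ! The SOA loop reads two unit-stride streams, a good SSE/AVX candidate;
  ! the equivalent AOS loop would read p(i)%x and p(i)%y at the 16-byte record stride.
  s = 0.0d0
  do i = 1, n
    s = s + ps%x(i) * ps%y(i)
  end do
  print *, s, p(n)%x
end program aos_vs_soa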
Jim Dempsey
If the compiler has optimizations to deal with these data access patterns, they may need inline optimization to take effect when they involve optimizing across subroutine calls.
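A hedged illustration of that point (the module and routine names here are assumptions for the sketch, not anything from this thread): if the per-element work is hidden behind a call, the loop in the caller cannot be vectorized unless the compiler inlines the callee, for example through interprocedural optimization.

module kernels
  implicit none
contains
  pure subroutine axpy_point(a, x, y)
    ! per-element kernel: y = y + a*x for a single point
    real(8), intent(in)    :: a, x
    real(8), intent(inout) :: y
    y = y + a * x
  end subroutine axpy_point
end module kernels

program inline_demo
  use kernels
  implicit none
  integer, parameter :: n = 10000
  real(8) :: x(n), y(n)
  integer :: i
  x = 1.0d0
  y = 0.0d0
  ! With the call inlined, this loop is a plain stride-1 axpy the compiler can
  ! vectorize; with the call left opaque, the access pattern is hidden from it.
  do i = 1, n
    call axpy_point(2.0d0, x(i), y(i))
  end do
  print *, y(n)
end program inline_demo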
program derived_test
  type cell
    real(8), allocatable :: pd_E(:,:), pd_W(:,:), pd_N(:,:), pd_S(:,:), Ed_H(:,:), Ed_V(:,:)
  end type cell
  type(cell) :: cp
  integer :: i, j, n
  real(8) :: val
  n = 6000
  allocate (cp%pd_S(n,n))
  allocate (cp%Ed_V(n,n))
  allocate (cp%pd_E(n,n))
  allocate (cp%pd_W(n,n))
  allocate (cp%pd_N(n,n))
  allocate (cp%Ed_H(n,n))
  do j=1,n
    do i=1,n
      cp%Ed_V(i,j)=0.005*i
      cp%pd_S(i,j)=0.000045*j
    end do
  end do
  val = 0.0d0   ! initialize the accumulator before the reduction loop
  do j=1,n
    do i=1,n
      val=val+cp%Ed_V(i,j)*cp%pd_S(i,j)
    end do
  end do
end program derived_test
I guess this is also a SOA method?
Regards
Lars
Does your fluid dynamics system partition a large volume into smaller volumes and then compute interactions of particles within each smaller volume, plus between edge particles of neighboring smaller volumes? If so, your code may "move" a particle from one smaller volume to another by moving a particle index or pointer from one smaller volume (array) to another (as opposed to moving all particle property values). When this is the case, AOS may see better performance.
On the other hand, if your particles have low mobility (cross between smaller volumes infrequently), you might want to copy all the particle properties when they migrate and stick with SOA. Note that this carries a larger memory burden, since each volume must be allocated to hold the maximum particle density expected for that smaller volume.
To summarize, there is not a simple answer. You will need to experiment. Your end program may end up having two different pathways: one for fluid of high mobility and one for fluid of low mobility.
If you have time, google "PARSEC site:princeton.edu". Among its benchmarks is a test program called fluidanimate. It does this partitioning of the volumes and will give you a better idea of the technique.
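A minimal sketch of the index-moving idea (all names and sizes are illustrative, not taken from PARSEC): each cell stores only particle ids that index into global SOA property arrays, so migrating a particle between cells moves one integer rather than copying every property.

program cell_index_sketch
  implicit none
  integer, parameter :: nparticles = 4, maxpercell = 8, ncells = 2
  real(8) :: x(nparticles)                 ! global SOA property array (one of many)
  integer :: cell_ids(maxpercell, ncells)  ! particle ids held by each cell
  integer :: cell_count(ncells)            ! current occupancy of each cell
  integer :: pid, slot

  x = (/ 0.1d0, 0.4d0, 0.6d0, 0.9d0 /)

  ! Build phase: particles 1,2 in cell 1; particles 3,4 in cell 2.
  cell_count = (/ 2, 2 /)
  cell_ids(1:2, 1) = (/ 1, 2 /)
  cell_ids(1:2, 2) = (/ 3, 4 /)

  ! The particle in slot 2 of cell 1 crosses into cell 2: only its id moves;
  ! x(pid) (and every other property array) stays untouched.
  slot = 2
  pid  = cell_ids(slot, 1)
  cell_ids(slot, 1) = cell_ids(cell_count(1), 1)   ! backfill the emptied slot
  cell_count(1)     = cell_count(1) - 1
  cell_count(2)     = cell_count(2) + 1
  cell_ids(cell_count(2), 2) = pid

  print *, cell_count, cell_ids(1:cell_count(2), 2), x(pid)
end program cell_index_sketch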
Jim Dempsey
You continue with an apparent contradiction: initializing real(8) arrays with single precision constants. Two decades ago there were platforms with 64-bit single precision, but your syntax wouldn't have worked then, and I don't know of any current 64-bit single precision.
It's hard to draw realistic performance conclusions when you leave it up to the compiler to decide how far it should short-cut your code. For example, it could fuse your two loop nests:
do j=1,n
  do i=1,n
    cp%Ed_V(i,j)=0.005*i
    cp%pd_S(i,j)=0.000045*j
    val=val+cp%Ed_V(i,j)*cp%pd_S(i,j)
  end do
end do
Then, as the array values are never accessed outside this loop, nor is val ever used, the entire thing could be eliminated as dead code.
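For completeness, a hedged sketch of what addressing both comments might look like, reusing the declarations from Lars's program above: d0 suffixes make the constants genuinely double precision, and printing val keeps the loops from being discarded as dead code.

  val = 0.0d0
  do j = 1, n
    do i = 1, n
      cp%Ed_V(i,j) = 0.005d0    * i   ! d0 suffix: real(8) constants, no single precision intermediate
      cp%pd_S(i,j) = 0.000045d0 * j
      val = val + cp%Ed_V(i,j) * cp%pd_S(i,j)
    end do
  end do
  print *, val                        ! consume the result so the work cannot be eliminated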
@TimP, I didn't do anything but copy-paste the original code and rewrite the derived type (I did not want to compare two different things). I was just curious, since I use a lot of derived types and did not want the computational overhead indicated by quarkz's code (just for the record, in my compiled code example I use the value to print to the screen, so it is used).
Regards
Lars
