- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
For example: You have a collection of objects, each with mass(m), position(X,Y,Z), velocity(dX,dY,dZ), acceleration(ddX, ddY, ddZ), externalForce(xfX, xfY, xfZ), ... other properties.
While it is relatively straitforward conceptually to organized these into separate objects
Object(iObj)%Pos ! as real(8) :: Pos(3)
Doing so interferes with good vectorization opportunities that you get with
X(iObj)
Y(iObj)
Z(iObj)
Consider
DO iObj = 1, nObjs
Object(iObj)%acceleration = Object(iObj)%externalForce / Object(iObj)%mass
Object(iObj)%velocity = Object(iObj)%velocity + (Object(iObj)%acceleration* dT)
Object(iObj)%position = Object(iObj)%position+(Object(iObj)%velocity * dT)
END DO
versis
DO iObj = 1, nObjs
accelerationX(iObj) = externalForceX(iObj) / mass(iObj)
accelerationY(iObj) = externalForceY(iObj) / mass(iObj)
accelerationZ(iObj) = externalForceZ(iObj) / mass(iObj)
velocityX(iObj) = velocityX(iObj) + (accelerationX(iObj)* dT)
velocityY(iObj) = velocityY(iObj) + (accelerationY(iObj)* dT)
velocityZ(iObj) = velocityZ(iObj) + (accelerationZ(iObj)* dT)
positionX(iObj) = positionX(iObj)+(velocityX(iObj) * dT)
positionY(iObj) = positionY(iObj)+(velocityY(iObj) * dT)
positionZ(iObj) = positionZ(iObj)+(velocityZ(iObj) * dT)
END DO
While the former loop looks simpler, it will have to work harder (longer)due to less opportunities for vectorization. In this situation, the latter loop will vectorize better. (Breaking into 2 or 3 loops may help too.)
With millions of operations at stake (IOW long run times), run time matters.
A potential drawback to Structure Of Arrays is it is more difficlut to add and remove objects.
However, the performance difference is such that you can "delete" an object by setting its mass to 0.0.
And if you keep a list of these deleted objects, you can add first by overstriking a previously deleted object.
(And initial allocation having expansion space of nObjsMax - nObjs.)
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
When you loop over u(i) and v(i), consecutive memory accesses are contiguous, so a single instruction can load several elements at once, and loops can often be vectorized efficiently.
When you loop over vector(i)%u and vector(i)%v, accesses to u in consecutive iterations are no longer contiguous in memory. So the compiler may not be able to load multiple elements in a single instruction, or at best, will need additional instructions to rearrange the data so that it is contiguous for subsequent SIMD processing.
Actual performance impact will depend on how much calculation is done once the data has been loaded, (more is better), and the complexity of the structure (eg whether there are other components besides u and v) - less is better.
A possible compromise would be to put thearrays into a structure, vector%u(i) and vector%v(i). The structure could also include the array dimension. The individual array elements would still be contiguous in memory.
Array alignment was mentioned. It's a little easier for the compiler to know the alignment if the arrays aren't in a structure. But the compiler can make runtime tests for alignment, and peel iterations accordingly. If the arrays are large, the overhead should not be great. What is important is that the main arrays should have the same alignment - if they don't, pad until they do, or use "attributes align" directives.
You may find similar discussions for the C/C++ compiler, that refer to a Structure of Arrays (SoA) compared to an Array of Structures (AoS). See for example
http://software.intel.com/en-us/articles/a-guide-to-auto-vectorization-with-intel-c-compilers/ sect. 5.3
Unfortunately, there isn't yet a version with Fortran examples.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I believe that everyone asks this question as he/she starts thinking about performance improvements at the design stage. Points mentioned by Jim and Martyn are in-synch with those outlined in the document Martyn mentioned.
Below is a dumb test using the example given by Jim. I am using VS2005 on WinXP64-PRO with Dual Core Xeon. The results obtained in Release mode somehow contradict i.e. AOS performs better than SOA.
Since the test uses random number, it is not "real-world" problem and perhaps the compiler is doing something smarter. In the diagnostics I see that all the SOA loops are vectorized whereas the AOS loop is "partially" vectorized. For example, for N=1 million I get compute times as SOA = 0.76 s and AOS = 0.59 s.
Will it be possible for someone to guide by giving an example of a more meaningful test that shows SOA performing better than AOS? {On pages 15 through 18 in the document mentioned by Martyn, there an example (with points structure); however that example is similar to the one below.}
Abhi
----
[fortran] Module OM Implicit None Real(8), Parameter :: ZERO = 0.0d0, ONE=1.0d0 Type t_Position Real(8) :: X=ZERO, Y=ZERO, Z=ZERO End Type t_Position Type t_Velocity Real(8) :: u=ZERO, v=ZERO, w=ZERO End Type t_Velocity Type t_Acceleration Real(8) :: a=ZERO, b=ZERO, c=ZERO End Type t_Acceleration Type t_Force Real(8) :: F=ZERO, G=ZERO, H=ZERO End Type t_Force Type t_SOA Real(8), Allocatable :: mass(:) Real(8), Allocatable :: X(:), Y(:), Z(:) Real(8), Allocatable :: u(:), v(:), w(:) Real(8), Allocatable :: a(:), b(:), c(:) Real(8), Allocatable :: F(:), G(:), H(:) Contains Procedure :: Set => Set_SOA Procedure :: Release => Release_SOA End Type t_SOA Type t_AOS Real(8) :: mass Type(t_Position) :: Position Type(t_Velocity) :: Velocity Type(t_Acceleration) :: Acceleration Type(t_Force) :: Force End Type t_AOS Contains Subroutine Set_SOA(obj, N, iError) Class(t_SOA), Intent(OUT) :: obj Integer, Intent(IN) :: N Integer, Intent(OUT) :: iError iError = 0 if (Allocated(obj%mass)) then DeAllocate(obj%mass, & obj%X, obj%Y, obj%Z, & obj%u, obj%v, obj%w, & obj%a, obj%b, obj%c, & obj%F, obj%G, obj%H) endif Allocate(obj%mass(N), & obj%X(N), obj%Y(N), obj%Z(N), & obj%u(N), obj%v(N), obj%w(N), & obj%a(N), obj%b(N), obj%c(N), & obj%F(N), obj%G(N), obj%H(N), & stat=iError) obj%mass = ZERO obj%X = ZERO obj%Y = ZERO obj%Z = ZERO obj%u = ZERO obj%v = ZERO obj%w = ZERO obj%a = ZERO obj%b = ZERO obj%c = ZERO obj%F = ZERO obj%G = ZERO obj%H = ZERO End Subroutine Set_SOA Subroutine Release_SOA(obj) Class(t_SOA), Intent(OUT) :: obj if (Allocated(obj%mass) ) then DeAllocate(obj%mass, & obj%X, obj%Y, obj%Z, & obj%u, obj%v, obj%w, & obj%a, obj%b, obj%c, & obj%F, obj%G, obj%H) endif End Subroutine Release_SOA Subroutine Set_AOS(obj, N, iError) Type(t_AOS), Intent(OUT), Allocatable :: obj(:) Integer, Intent(IN) :: N Integer, Intent(OUT) :: iError iError = 0 Allocate(obj(N), stat=iError) End Subroutine Set_AOS Subroutine Release_AOS(obj) Type(t_AOS), Intent(OUT), Allocatable :: obj(:) End Subroutine Release_AOS End Module OM Program Test Use OM Implicit None Type(t_SOA) :: SOA Type(t_AOS), Allocatable :: AOS(:) Integer :: N Integer :: i, j Integer :: iError Real(8) :: ts, te, rn, dt dt = 1.0d-02 do i=1,10 Call Random_Seed() N = 1000000*i Write(*,"(A,I0)") "Array size N=", N Call SOA%Set(N, iError) Call Random_Number(SOA%mass) Call Random_Number(SOA%F) Call Random_Number(SOA%G) Call Random_Number(SOA%H) SOA%mass = ONE + SOA%mass SOA%F = ONE + SOA%F SOA%G = ONE + SOA%G SOA%H = ONE + SOA%H Call CPU_TIME(ts) do j = 1, N SOA%a(j) = SOA%F(j) / SOA%mass(j) SOA%b(j) = SOA%G(j) / SOA%mass(j) SOA%c(j) = SOA%H(j) / SOA%mass(j) end do do j = 1, N SOA%u(j) = SOA%u(j) + (SOA%a(j) * dt) SOA%v(j) = SOA%v(j) + (SOA%b(j) * dt) SOA%w(j) = SOA%w(j) + (SOA%c(j) * dt) end do do j = 1, N SOA%X(j) = SOA%X(j) + (SOA%u(j) * dt) SOA%Y(j) = SOA%Y(j) + (SOA%v(j) * dt) SOA%Z(j) = SOA%Z(j) + (SOA%w(j) * dt) end do Call CPU_TIME(te) Write(*,*) "SOA:", te - ts Call SOA%Release() Call Set_AOS(AOS, N, iError) Call Random_Number(AOS%mass) Call Random_Number(AOS%Force%F) Call Random_Number(AOS%Force%G) Call Random_Number(AOS%Force%H) AOS%mass = ONE + AOS%mass AOS%Force%F = ONE + AOS%Force%F AOS%Force%G = ONE + AOS%Force%G AOS%Force%H = ONE + AOS%Force%H Call CPU_TIME(ts) do j = 1, N AOS(j)%Acceleration%a = AOS(j)%Force%F / AOS(j)%mass AOS(j)%Acceleration%b = AOS(j)%Force%G / AOS(j)%mass AOS(j)%Acceleration%c = AOS(j)%Force%H / AOS(j)%mass AOS(j)%Velocity%u = AOS(j)%Velocity%u + (AOS(j)%Acceleration%a * dt) AOS(j)%Velocity%v = AOS(j)%Velocity%v + (AOS(j)%Acceleration%b * dt) AOS(j)%Velocity%w = AOS(j)%Velocity%w + (AOS(j)%Acceleration%c * dt) AOS(j)%Position%X = AOS(j)%Position%X + (AOS(j)%Velocity%u * dt) AOS(j)%Position%Y = AOS(j)%Position%Y + (AOS(j)%Velocity%v * dt) AOS(j)%Position%Z = AOS(j)%Position%Z + (AOS(j)%Velocity%w * dt) end do Call CPU_TIME(te) Write(*,*) "AOS:", te - ts Call Release_AOS(AOS) Write(*,*) end do End Program Test[/fortran]
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Such performance measurements are extremely difficult to get reliable results from.
For one thing, you use the various arrays from the structure one after another. If instead you use array operations or do-loops that are confined to a single array, the result may be very different.
Also ten iterations may not be enough to get a reliable timing. Have you checked the variation?
On the other hand, in many practical cases the do-loops may be very complex, annihilating most effects of optimisation.
Regards,
Arjen
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thanks for the sample code.
I took some liberties with your sample code and made the following changes
1) replace call to CPU_TIME with higher precision timer omp_get_wtime()
2) Added 3rd test were SOA perfromed with one loop as opposed to three loops
3) For each size, run each test three times,obtain minimum time.
Results:
[plain]C:TestAbhix64Release>abhi Array size N=1000000 SOA 3 loops: 5.296442192047834E-002 SOA 1 loop: 4.159056860953569E-002 AOS: 4.715456999838352E-002 Array size N=2000000 SOA 3 loops: 0.105941614136100 SOA 1 loop: 8.626186009496450E-002 AOS: 9.444041270762682E-002 Array size N=3000000 SOA 3 loops: 0.160792567767203 SOA 1 loop: 0.126706320792437 AOS: 0.141730781644583 Array size N=4000000 SOA 3 loops: 0.213049729354680 SOA 1 loop: 0.175986659713089 AOS: 0.189607863314450 Array size N=5000000 SOA 3 loops: 0.265978128649294 SOA 1 loop: 0.212588425725698 AOS: 0.236196809448302 Array size N=6000000 SOA 3 loops: 0.319670793600380 SOA 1 loop: 0.255262950435281 AOS: 0.281718607991934 Array size N=7000000 SOA 3 loops: 0.373268204741180 SOA 1 loop: 0.295668337494135 AOS: 0.330650757066905 Array size N=8000000 SOA 3 loops: 0.428482050076127 SOA 1 loop: 0.345974148251116 AOS: 0.378969475626945 Array size N=9000000 SOA 3 loops: 0.478422214277089 SOA 1 loop: 0.376372301019728 AOS: 0.423757964745164 Array size N=10000000 SOA 3 loops: 0.532217066735029 SOA 1 loop: 0.425057412125170 AOS: 0.472959278151393 [/plain]
SOA 1 loop has the advantage.
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I can reproduce Jim's results, that is, big SOA loop gives the best performance. However, I see the following:-
(1) When I break the SOA loop into three loops, the diagnostics says that loops are vectorized. If it is in one big loop then the dianostics say the loop is partially vectorized. However, the performance of the latter (i.e. partial vectorization) is better.
(2) If I don't break AOS into three loops then the diagnostics says that the loop is partially vectorized. Breaking it into three loops gives no vectorization. The performance of the former is way better; as expected.
I am not sure what is happening here; especially in (1).
Sincerely
Abhijit
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
a set of consistent recommendations would be very nice,and I hope you succeed in that.I am, however,
not sure you can reach that ideal.
(Missed the bit by the way where you increase the size - should have looked better :))
Regards,
Arjen
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
With this experience, I must say that I value maintenance efficiency much above performance (of course, within reasonable limits). I think that squeezing every bit of performance on the expense of code maintainability is generally a Bad Thing. The more complex the program is, the maintainability matters more.
With test timings above, where speed differences are within 10-15% range, I would select the more readable code (AOS) without a second thought. Your mileage may vary, of course.
After all, we all know that:
(ProgramSpeed * ProgramMemory) / (WritingEffort * MaintenanceEffort) = const
When you put the costs in the equation, you will know where your focus should be ;)
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
>> "LOOP WAS PARTIALLY VECTORIZED".
When programmed as multiple loops you do get full vectorization (due to fitting within limited number of XMM registers).
However, when coding as three seperate loops, the results of the first loop, although calculated faster, are evicted from L1 cache prior to use by the second loop, and results of second loop evicted from L1 cache prior to use by third loop. In areal simulationapplication you may have more cascading loops.
This illustrates that one must experiment with critical code sections to attain the best results.
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Object(i)%p = Object(i)%p + Object(i)%v * dt
Object_p(i) = Object_p(i) +Object_v(i) * dt
AOS does buy you some convienence in copying the structure.
Also AOS handles multiple collections of objects better (although is SOA your collections can be made up with different sub-ranges of the SOAs).
LOS (ListOf Structures) buys you convienence in high frequency allocate/deallocate of objects.
I concur that 10%-15% performance difference may not be significant in the larger picture. So if AOS or LOS buys you some convenience, then by all means, chose Structures.
In somecases though, performance will be much larger than that exhibited by this simple test application. Consider an object with many properties, but also with very few properties interacting between objects. An example of this might be gravitational interaction.
In this situation, the position, mass, and accumulator for gravitational force (per object) interact((n-1)**2)/2 times whereas velocity and position are n times. And color, temperature, angular velocity, radius, ... many other attributes may have very little interaction amongst objects. By using AOS, in this circumstance, youmaybe diluting the last level cache. This can be mitigated to some extentby strategically packing the structure. e.g. cache align and{pos(1:3), FG(1:3), M} within structure and padd structure to size of multiple of cache lines. IOW the pertinant information is placed within a single cache line.
You may also experiment with additional padd to produce an odd number (or prime number) of cache lines consumed by each object. As this may reduce aliasing in the cache.
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thanks to all for this interesting discussion.
The performance considerations made were based on the assumption that the structures elements are required in sequential order, so that for exampleAOS(2)%Acceleration%a will be used just after AOS(1)%Acceleration%a and so on. In our particular application which is a finite element code, in the do loop the structure elements are not contiguous anyway. In other words, the do loop goes over all elements but in more or less random order.
Do you think that in that case, would it matter whiter having AOSs or SOAs?
R//G
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thanks to all for this interesting discussion.
The performance considerations made were based on the assumption that the structures elements are required in sequential order, so that for exampleAOS(2)%Acceleration%a will be used just after AOS(1)%Acceleration%a and so on. In our particular application which is a finite element code, in the do loop the structure elements are not contiguous anyway. In other words, the do loop goes over all elements but in more or less random order.
Do you think that in that case, would it matter whiter having AOSs or SOAs?
R//G
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
SOA works best when a given attribute of all elements is processed independently.
AOS works best when all (several) attributes of each element is processes in proximity.
The FE state advancment can be written either way.
With AVX, and its longer vectors, SOA might see a boost in performance. In the simple test we performed, we were using REAL(8), with SSE small vector size of 2. With AVX, the REAL(8) small vector size doubles to 4 elements. The 44/41 time AOS/SOA could potentially go to 44/21. Although a real test would have to be run. Anyone here have a Sandy Bridge?
Jim

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page