Performance overhead using structures vs. arrays

Reinaldo_Garcia · ‎02-01-2011

I wonder if there would be a performance loss by using structures vs. arrays. For example:

vector(i)%u and vector(i)%v

vs.

u(i) and v(i)

I am working on a program that requires millions operations with these assignment. It would improve the model maintenance efficiency by using structures, but I am afraid to change the whole program to use structures because the potential performance overhead.

Any thoughts?

R//G

anthonyrichards · ‎02-01-2011

If you want the compiler to take advantage of SSE2 instructions which can load and process 64-bit words in pairs in 128-bit registers, then you need to be able to align your array elements to 16 byte boundaries. So it depends what alignment your structures come up with, I guess.

jimdempseyatthecove · ‎02-01-2011

Generally speaking Structure Of Arrays (SOA)vectorize better than Array Of Structures (AOS).

For example: You have a collection of objects, each with mass(m), position(X,Y,Z), velocity(dX,dY,dZ), acceleration(ddX, ddY, ddZ), externalForce(xfX, xfY, xfZ), ... other properties.

While it is relatively straitforward conceptually to organized these into separate objects

Object(iObj)%Pos ! as real(8) :: Pos(3)

Doing so interferes with good vectorization opportunities that you get with

X(iObj)
Y(iObj)
Z(iObj)

Consider

DO iObj = 1, nObjs
Object(iObj)%acceleration = Object(iObj)%externalForce / Object(iObj)%mass
Object(iObj)%velocity = Object(iObj)%velocity + (Object(iObj)%acceleration* dT)
Object(iObj)%position = Object(iObj)%position+(Object(iObj)%velocity * dT)
END DO

versis

DO iObj = 1, nObjs
accelerationX(iObj) = externalForceX(iObj) / mass(iObj)
accelerationY(iObj) = externalForceY(iObj) / mass(iObj)
accelerationZ(iObj) = externalForceZ(iObj) / mass(iObj)

velocityX(iObj) = velocityX(iObj) + (accelerationX(iObj)* dT)
velocityY(iObj) = velocityY(iObj) + (accelerationY(iObj)* dT)
velocityZ(iObj) = velocityZ(iObj) + (accelerationZ(iObj)* dT)

positionX(iObj) = positionX(iObj)+(velocityX(iObj) * dT)
positionY(iObj) = positionY(iObj)+(velocityY(iObj) * dT)
positionZ(iObj) = positionZ(iObj)+(velocityZ(iObj) * dT)
END DO

While the former loop looks simpler, it will have to work harder (longer)due to less opportunities for vectorization. In this situation, the latter loop will vectorize better. (Breaking into 2 or 3 loops may help too.)

With millions of operations at stake (IOW long run times), run time matters.

A potential drawback to Structure Of Arrays is it is more difficlut to add and remove objects.
However, the performance difference is such that you can "delete" an object by setting its mass to 0.0.
And if you keep a list of these deleted objects, you can add first by overstriking a previously deleted object.
(And initial allocation having expansion space of nObjsMax - nObjs.)

Jim Dempsey

Martyn_C_Intel · ‎02-01-2011

Yes, there is likely to be a substantial impact on performance, if these arrays are large.

When you loop over u(i) and v(i), consecutive memory accesses are contiguous, so a single instruction can load several elements at once, and loops can often be vectorized efficiently.

When you loop over vector(i)%u and vector(i)%v, accesses to u in consecutive iterations are no longer contiguous in memory. So the compiler may not be able to load multiple elements in a single instruction, or at best, will need additional instructions to rearrange the data so that it is contiguous for subsequent SIMD processing.

Actual performance impact will depend on how much calculation is done once the data has been loaded, (more is better), and the complexity of the structure (eg whether there are other components besides u and v) - less is better.

A possible compromise would be to put thearrays into a structure, vector%u(i) and vector%v(i). The structure could also include the array dimension. The individual array elements would still be contiguous in memory.

Array alignment was mentioned. It's a little easier for the compiler to know the alignment if the arrays aren't in a structure. But the compiler can make runtime tests for alignment, and peel iterations accordingly. If the arrays are large, the overhead should not be great. What is important is that the main arrays should have the same alignment - if they don't, pad until they do, or use "attributes align" directives.

You may find similar discussions for the C/C++ compiler, that refer to a Structure of Arrays (SoA) compared to an Array of Structures (AoS). See for example
http://software.intel.com/en-us/articles/a-guide-to-auto-vectorization-with-intel-c-compilers/ sect. 5.3
Unfortunately, there isn't yet a version with Fortran examples.

IDZ_A_Intel · ‎02-01-2011

Hi Jim and Martyn

I believe that everyone asks this question as he/she starts thinking about performance improvements at the design stage. Points mentioned by Jim and Martyn are in-synch with those outlined in the document Martyn mentioned.

Below is a dumb test using the example given by Jim. I am using VS2005 on WinXP64-PRO with Dual Core Xeon. The results obtained in Release mode somehow contradict i.e. AOS performs better than SOA.

Since the test uses random number, it is not "real-world" problem and perhaps the compiler is doing something smarter. In the diagnostics I see that all the SOA loops are vectorized whereas the AOS loop is "partially" vectorized. For example, for N=1 million I get compute times as SOA = 0.76 s and AOS = 0.59 s.

Will it be possible for someone to guide by giving an example of a more meaningful test that shows SOA performing better than AOS? {On pages 15 through 18 in the document mentioned by Martyn, there an example (with points structure); however that example is similar to the one below.}

Abhi

----

[fortran]      Module OM
         Implicit None
         Real(8), Parameter :: ZERO = 0.0d0, ONE=1.0d0
         Type t_Position
            Real(8) :: X=ZERO, Y=ZERO, Z=ZERO
         End Type t_Position
         Type t_Velocity
            Real(8) :: u=ZERO, v=ZERO, w=ZERO
         End Type t_Velocity
         Type t_Acceleration
            Real(8) :: a=ZERO, b=ZERO, c=ZERO
         End Type t_Acceleration
         Type t_Force
            Real(8) :: F=ZERO, G=ZERO, H=ZERO
         End Type t_Force
         
         Type t_SOA
            Real(8), Allocatable :: mass(:)
            Real(8), Allocatable :: X(:), Y(:), Z(:)
            Real(8), Allocatable :: u(:), v(:), w(:)
            Real(8), Allocatable :: a(:), b(:), c(:)
            Real(8), Allocatable :: F(:), G(:), H(:)
         Contains
            Procedure :: Set => Set_SOA
            Procedure :: Release => Release_SOA
         End Type t_SOA

         Type t_AOS
            Real(8) :: mass 
            Type(t_Position) :: Position
            Type(t_Velocity) :: Velocity
            Type(t_Acceleration) :: Acceleration
            Type(t_Force) :: Force 
         End Type t_AOS
         
      Contains
      
         Subroutine Set_SOA(obj, N, iError)
            Class(t_SOA), Intent(OUT) :: obj
            Integer, Intent(IN) :: N
            Integer, Intent(OUT) :: iError
            iError = 0
            if (Allocated(obj%mass)) then
               DeAllocate(obj%mass,             &
                          obj%X, obj%Y, obj%Z,  &
                          obj%u, obj%v, obj%w,  &
                          obj%a, obj%b, obj%c,  &
                          obj%F, obj%G, obj%H)
            endif
            Allocate(obj%mass(N),                     &
                     obj%X(N), obj%Y(N), obj%Z(N),    &
                     obj%u(N), obj%v(N), obj%w(N),    &
                     obj%a(N), obj%b(N), obj%c(N),    &
                     obj%F(N), obj%G(N), obj%H(N),    &
                     stat=iError)
            
            obj%mass = ZERO
            obj%X = ZERO
            obj%Y = ZERO
            obj%Z = ZERO    
            obj%u = ZERO
            obj%v = ZERO
            obj%w = ZERO    
            obj%a = ZERO
            obj%b = ZERO
            obj%c = ZERO    
            obj%F = ZERO
            obj%G = ZERO
            obj%H = ZERO                        
                     
         End Subroutine Set_SOA
         
         Subroutine Release_SOA(obj)
            Class(t_SOA), Intent(OUT) :: obj
            if (Allocated(obj%mass) ) then
               DeAllocate(obj%mass,             &
                          obj%X, obj%Y, obj%Z,  &
                          obj%u, obj%v, obj%w,  &
                          obj%a, obj%b, obj%c,  &
                          obj%F, obj%G, obj%H)
            endif
         End Subroutine Release_SOA         
         
         Subroutine Set_AOS(obj, N, iError)
            Type(t_AOS), Intent(OUT), Allocatable :: obj(:)
            Integer, Intent(IN)  :: N
            Integer, Intent(OUT) :: iError
            iError = 0
            Allocate(obj(N), stat=iError)
         End Subroutine Set_AOS

         Subroutine Release_AOS(obj)
            Type(t_AOS), Intent(OUT), Allocatable :: obj(:)
         End Subroutine Release_AOS
      
      End Module OM

      Program Test
         Use OM
         Implicit None
         Type(t_SOA) :: SOA
         Type(t_AOS), Allocatable :: AOS(:)
         
         Integer :: N
         Integer :: i, j
         Integer :: iError
         
         Real(8) :: ts, te, rn, dt
         
         dt = 1.0d-02
         
         do i=1,10
         
            Call Random_Seed()
            N = 1000000*i
            Write(*,"(A,I0)") "Array size N=", N
         
            Call SOA%Set(N, iError)
            Call Random_Number(SOA%mass)
            Call Random_Number(SOA%F)
            Call Random_Number(SOA%G)
            Call Random_Number(SOA%H)
            SOA%mass = ONE + SOA%mass
            SOA%F = ONE + SOA%F
            SOA%G = ONE + SOA%G
            SOA%H = ONE + SOA%H
            Call CPU_TIME(ts)
               do j = 1, N
                  SOA%a(j) = SOA%F(j) / SOA%mass(j)
                  SOA%b(j) = SOA%G(j) / SOA%mass(j)
                  SOA%c(j) = SOA%H(j) / SOA%mass(j)
               end do
               
               do j = 1, N
                  SOA%u(j) = SOA%u(j) + (SOA%a(j) * dt)
                  SOA%v(j) = SOA%v(j) + (SOA%b(j) * dt)
                  SOA%w(j) = SOA%w(j) + (SOA%c(j) * dt)
               end do
               
               do j = 1, N
                  SOA%X(j) = SOA%X(j)  +  (SOA%u(j) * dt)
                  SOA%Y(j) = SOA%Y(j)  +  (SOA%v(j) * dt)
                  SOA%Z(j) = SOA%Z(j)  +  (SOA%w(j) * dt)
               end do
            Call CPU_TIME(te)
            Write(*,*) "SOA:", te - ts
            Call SOA%Release()         
                
            Call Set_AOS(AOS, N, iError)
            Call Random_Number(AOS%mass)
            Call Random_Number(AOS%Force%F)
            Call Random_Number(AOS%Force%G)
            Call Random_Number(AOS%Force%H)
            AOS%mass = ONE + AOS%mass
            AOS%Force%F = ONE + AOS%Force%F
            AOS%Force%G = ONE + AOS%Force%G
            AOS%Force%H = ONE + AOS%Force%H
            Call CPU_TIME(ts)   
            do j = 1, N
               
               AOS(j)%Acceleration%a = AOS(j)%Force%F / AOS(j)%mass
               AOS(j)%Acceleration%b = AOS(j)%Force%G / AOS(j)%mass
               AOS(j)%Acceleration%c = AOS(j)%Force%H / AOS(j)%mass
               
               AOS(j)%Velocity%u = AOS(j)%Velocity%u + (AOS(j)%Acceleration%a * dt)
               AOS(j)%Velocity%v = AOS(j)%Velocity%v + (AOS(j)%Acceleration%b * dt)
               AOS(j)%Velocity%w = AOS(j)%Velocity%w + (AOS(j)%Acceleration%c * dt)
               
               AOS(j)%Position%X = AOS(j)%Position%X  +  (AOS(j)%Velocity%u * dt)
               AOS(j)%Position%Y = AOS(j)%Position%Y  +  (AOS(j)%Velocity%v * dt)
               AOS(j)%Position%Z = AOS(j)%Position%Z  +  (AOS(j)%Velocity%w * dt)
               
            end do
            Call CPU_TIME(te)

            Write(*,*) "AOS:", te - ts
            Call Release_AOS(AOS)

            Write(*,*)
         
         end do          
                  
      End Program Test[/fortran]

Arjen_Markus · ‎02-02-2011

Such performance measurements are extremely difficult to get reliable results from.

For one thing, you use the various arrays from the structure one after another. If instead you use array operations or do-loops that are confined to a single array, the result may be very different.

Also ten iterations may not be enough to get a reliable timing. Have you checked the variation?

On the other hand, in many practical cases the do-loops may be very complex, annihilating most effects of optimisation.

Regards,

Arjen

jimdempseyatthecove · ‎02-02-2011

Abhi,

Thanks for the sample code.

I took some liberties with your sample code and made the following changes

1) replace call to CPU_TIME with higher precision timer omp_get_wtime()
2) Added 3rd test were SOA perfromed with one loop as opposed to three loops
3) For each size, run each test three times,obtain minimum time.

Results:

[plain]C:TestAbhix64Release>abhi
Array size N=1000000
 SOA 3 loops:  5.296442192047834E-002
 SOA 1 loop:   4.159056860953569E-002
 AOS:          4.715456999838352E-002

Array size N=2000000
 SOA 3 loops:  0.105941614136100
 SOA 1 loop:   8.626186009496450E-002
 AOS:          9.444041270762682E-002

Array size N=3000000
 SOA 3 loops:  0.160792567767203
 SOA 1 loop:   0.126706320792437
 AOS:          0.141730781644583

Array size N=4000000
 SOA 3 loops:  0.213049729354680
 SOA 1 loop:   0.175986659713089
 AOS:          0.189607863314450

Array size N=5000000
 SOA 3 loops:  0.265978128649294
 SOA 1 loop:   0.212588425725698
 AOS:          0.236196809448302

Array size N=6000000
 SOA 3 loops:  0.319670793600380
 SOA 1 loop:   0.255262950435281
 AOS:          0.281718607991934

Array size N=7000000
 SOA 3 loops:  0.373268204741180
 SOA 1 loop:   0.295668337494135
 AOS:          0.330650757066905

Array size N=8000000
 SOA 3 loops:  0.428482050076127
 SOA 1 loop:   0.345974148251116
 AOS:          0.378969475626945

Array size N=9000000
 SOA 3 loops:  0.478422214277089
 SOA 1 loop:   0.376372301019728
 AOS:          0.423757964745164

Array size N=10000000
 SOA 3 loops:  0.532217066735029
 SOA 1 loop:   0.425057412125170
 AOS:          0.472959278151393
[/plain]

SOA 1 loop has the advantage.

Jim Dempsey

abhimodak · ‎02-02-2011

Hi Jim and Arjen

Thanks for your responses.

Jim, Initially I had SOA in one loop. But the diagnostics told me that "LOOP WAS PARTIALLY VECTORIZED". Hence, I broke it into three loops as you had mentioned in your first post. I am installing update 2 to composer and I will redo this test and report back. (I realized that I should have posted the complete output as well. My apologies.)

Arjen, I kind of agree and disagree with you at the same time. :) First note that there are not 10 iterations but I increase the array size. Secondly, in, say, 50% of the real-world problems, the computations with structure components will be done at the same time as in the dummy example above. You are of course correct that the real-world cases will also be complex and the operations may be scattered.

However, my intention is to see, if possible, where/when/how the performance hit can surely occur i.e. know the certain winners. For example, will one can readily check the effect of column-major vs row-major indexing. In some ways, SOA vs AOS is a similar issue and it would be great to know a priori.

Sincerely

Abhi

abhimodak · ‎02-02-2011

Hi

I can reproduce Jim's results, that is, big SOA loop gives the best performance. However, I see the following:-

(1) When I break the SOA loop into three loops, the diagnostics says that loops are vectorized. If it is in one big loop then the dianostics say the loop is partially vectorized. However, the performance of the latter (i.e. partial vectorization) is better.

(2) If I don't break AOS into three loops then the diagnostics says that the loop is partially vectorized. Breaking it into three loops gives no vectorization. The performance of the former is way better; as expected.

I am not sure what is happening here; especially in (1).

Sincerely
Abhijit

Arjen_Markus · ‎02-02-2011

Hi Abhi,

a set of consistent recommendations would be very nice,and I hope you succeed in that.I am, however,
not sure you can reach that ideal.

(Missed the bit by the way where you increase the size - should have looked better :))

Regards,

Arjen

Jugoslav_Dujic · ‎02-03-2011

Quoting Reinaldo Garcia

It would improve the model maintenance efficiency by using structures, but I am afraid to change the whole program to use structures because the potential performance overhead.

With this experience, I must say that I value maintenance efficiency much above performance (of course, within reasonable limits). I think that squeezing every bit of performance on the expense of code maintainability is generally a Bad Thing. The more complex the program is, the maintainability matters more.

With test timings above, where speed differences are within 10-15% range, I would select the more readable code (AOS) without a second thought. Your mileage may vary, of course.

After all, we all know that:

(ProgramSpeed * ProgramMemory) / (WritingEffort * MaintenanceEffort) = const

When you put the costs in the equation, you will know where your focus should be ;)

jimdempseyatthecove · ‎02-03-2011

Abhi,

>> "LOOP WAS PARTIALLY VECTORIZED".

When programmed as multiple loops you do get full vectorization (due to fitting within limited number of XMM registers).

However, when coding as three seperate loops, the results of the first loop, although calculated faster, are evicted from L1 cache prior to use by the second loop, and results of second loop evicted from L1 cache prior to use by third loop. In areal simulationapplication you may have more cascading loops.

This illustrates that one must experiment with critical code sections to attain the best results.

Jim Dempsey

jimdempseyatthecove · ‎02-03-2011

AOS and SOA are equally readable (maintainable). IMHO as it boils down to where you place the (index) and %.

Object(i)%p = Object(i)%p + Object(i)%v * dt
Object_p(i) = Object_p(i) +Object_v(i) * dt

AOS does buy you some convienence in copying the structure.

Also AOS handles multiple collections of objects better (although is SOA your collections can be made up with different sub-ranges of the SOAs).

LOS (ListOf Structures) buys you convienence in high frequency allocate/deallocate of objects.

I concur that 10%-15% performance difference may not be significant in the larger picture. So if AOS or LOS buys you some convenience, then by all means, chose Structures.

In somecases though, performance will be much larger than that exhibited by this simple test application. Consider an object with many properties, but also with very few properties interacting between objects. An example of this might be gravitational interaction.

In this situation, the position, mass, and accumulator for gravitational force (per object) interact((n-1)**2)/2 times whereas velocity and position are n times. And color, temperature, angular velocity, radius, ... many other attributes may have very little interaction amongst objects. By using AOS, in this circumstance, youmaybe diluting the last level cache. This can be mitigated to some extentby strategically packing the structure. e.g. cache align and{pos(1:3), FG(1:3), M} within structure and padd structure to size of multiple of cache lines. IOW the pertinant information is placed within a single cache line.

You may also experiment with additional padd to produce an odd number (or prime number) of cache lines consumed by each object. As this may reduce aliasing in the cache.

Jim Dempsey

Reinaldo_Garcia · ‎02-04-2011

Thanks to all for this interesting discussion.

The performance considerations made were based on the assumption that the structures elements are required in sequential order, so that for exampleAOS(2)%Acceleration%a will be used just after AOS(1)%Acceleration%a and so on. In our particular application which is a finite element code, in the do loop the structure elements are not contiguous anyway. In other words, the do loop goes over all elements but in more or less random order.

Do you think that in that case, would it matter whiter having AOSs or SOAs?

R//G

Reinaldo_Garcia · ‎02-04-2011

Thanks to all for this interesting discussion.

The performance considerations made were based on the assumption that the structures elements are required in sequential order, so that for exampleAOS(2)%Acceleration%a will be used just after AOS(1)%Acceleration%a and so on. In our particular application which is a finite element code, in the do loop the structure elements are not contiguous anyway. In other words, the do loop goes over all elements but in more or less random order.

Do you think that in that case, would it matter whiter having AOSs or SOAs?

R//G

jimdempseyatthecove · ‎02-04-2011

A lot will depend on how your FE system is written as to if SOA or AOS is more appropriate.

SOA works best when a given attribute of all elements is processed independently.

AOS works best when all (several) attributes of each element is processes in proximity.

The FE state advancment can be written either way.

With AVX, and its longer vectors, SOA might see a boost in performance. In the simple test we performed, we were using REAL(8), with SSE small vector size of 2. With AVX, the REAL(8) small vector size doubles to 4 elements. The 44/41 time AOS/SOA could potentially go to 44/21. Although a real test would have to be run. Anyone here have a Sandy Bridge?

Jim