Vectorization of linkedlist for particle tracking with Fortran ifort

AT · ‎12-17-2018

Hello Intel Community.

There are a few things which I would like to understand. I hope you can illuminate a bit on my confusions.

First, I have a set of mathematical operations which I need to carry out to millions of particles, and they don't depend on one another. I initially thought to have a linkedlist approach where my main loop will be something like

do while( associated(particle) )

     
(...)
particle => particle% next

end do

The question is how the compiler decides to vectorize this do while loop. The reason for the linkedlist is the ease to add and delete particles. The other option would be an array of a declared type, where the loop would be something like

do i = 1, nopart 

    particle(i)% (...) = (...)

    (...)


enddo

I guess the latter would be easy for the compiler to vectorize or ?

The next question is the directives for vectorization. I read that (INTEL) would suggest from now on to employ the openmp simd directives to tell the compiler explicitly to vectorize, right?

And finally, when compiling with the auto vectorization enabled, does it only include adding the flag -parallel -par-threshold=(XX) ?

Thanks very much

jimdempseyatthecove · ‎12-17-2018

Vectorization requires the like properties of the particles to be in adjacent memory (forget about scatter/gather). IOW:

real, allocatable, dimension(:) :: X, Y, Z, dX, dY, dZ, fX, fY, fZ, Mass, Charge, Spin, ...

! *** do i=1,noPart
  ... ! Force calculation
  dX =dX + (fX / Mass) * dT ! acceleration to velocity
  X = X + dX * dT  ! velocity to position
  ... ! other
! *** end do

Linked list, as well as having each particle object containing the above individual properties scatters the data about memory, or more importantly the like properties are scattered.

Keep in mind, that what is easy for you is hard for computation.

Note, your "do i=1, nopart" loop will experience no vectorizaton due to your having organized your data being organized by way particle objects as opposed to particle properties.

Jim Dempsey

AT · ‎12-17-2018

Thanks for your reply.

I am not sure I understand fully what you mean by the last sentence you wrote. I was considering that instead I would group the particle properties into a declared type such like:

type particle_prop

    real :: x(3)
    real :: u(3)
    (...)

end type 

type(particle_prop) :: particle(no_part)

And then do my loop over all particles. Would that also not be stored in contiguous memory. i.e. will particle(1:no_part)%x(1) not be stored adjacently, and therefore no ability to vectorize?

What do you mean by "forget about scatter/gather). IOW:"

Does Fortran not provide a way to overcome this when using declared types, and more importantly, is there no way to use linked list approach with vectorizable code?

Thank

jimdempseyatthecove · ‎12-17-2018

Two comparable sections of code:

real, allocatable, dimension(:) :: X, Y, Z, dX, dY, dZ, fX, fY, fZ, OmegaX, OmegaY, OmegaZ ! 3D properties as three 1D arrays per property
real, allocatable, dimension(:) :: Mass, Charge  ! 1D property
! other properties
...
allocate(X(nParticles), Y(nParticles), Z(nParticles))
allocate(dX(nParticles), dY(nParticles), dZ(nParticles))
allocate(fX(nParticles), fY(nParticles), fZ(nParticles))
allocate(OmegaX(nParticles), OmegaY(nParticles), OmegaZ(nParticles))
allocate(Mass(nParticles), Charge(nParticles))
...
!$omp parallel do private(j)
do i=1,nParticles
   fX(i) = 0.0
   fY(i) = 0.0
   fZ(i) = 0.0
   if(i > 1) call CalcForceG(i,1,i-1)
   if(i < nParticles) call CalcForceG(i,i+1,nParticles)
end do ! i=1,nParticles

...
subroutine CalcForceG(i,jBegin,jEnd)
   use YourModuleWithParticles
   implicit none
   integer :: i,jBegin,jEnd
   integer :: j
   real :: DistanceX, DistanceY, DistanceZ, rSquared, r, UnitX, UnitY, UnitZ, Fmagnitude
   do j=jBegin, jEnd
      DistanceX = X(i)-X(j)
      DistanceY = Y(i)-Y(j)
      DistanceZ = Z(i)-Z(j)
      rSquared = DistanceX**2 + DistanceY**2 + DistanceZ**2
      r = sqrt(rSquared)
      UnitX= DistanceX / r
      UnitY= DistanceY / r
      UnitZ= DistanceZ / r
      Fmagnitude = G * Mass(i) * Mass(j) / rSquared
      fX(i) = fX(i) + Fmagnitude * UnitX
      fY(i) = fY(i) + Fmagnitude * UnitY
      fZ(i) = fZ(i) + Fmagnitude * UnitZ
   end do
end subroutine CalcForceG

verses ----------------------------

type Particle
   real :: Position(3), Velocity(3), Force(3), Omega(3)
   real :: Mass, Charge
end type Particle
type(Particle), allocatable :: Particles(:)
...
allocate(Particles(nParticles))
...
!$omp parallel do private(j)
do i=1,nParticles
   Particles(i)%Force = 0.0
   if(i > 1) call CalcForceG(i,1,i-1)
   if(i < nParticles) call CalcForceG(i,i+1,nParticles)
end do ! i=1,nParticles

...
subroutine CalcForceG(i,jBegin,jEnd)
   use YourModuleWithParticles
   implicit none
   integer :: i,jBegin,jEnd
   integer :: j
   real :: Distance(3), rSquared, r, Unit(3), Fmagnitude
   do j=jBegin, jEnd
      Distance = Particles(i)%Position - Particles(j)%Position
      rSquared = Distance(1)**2 + Distance(2)**2 + Distance(3)**2
      r = sqrt(rSquared)
      Unit= Distance / r

      Fmagnitude = G * Mass(i) * Mass(j) / rSquared
      Particles(i)%Force = Particles(i)%Force + Fmagnitude * UnitX
   end do
end subroutine CalcForceG

The first section is commonly called Structure Of Arrays (SoA),
The second section is commonly called Array Of Structures (AoS)

In the first example all of the X(j)'s are contiguous, and thus vectoizable in "DistanceX = X(i)-X(j)"
Same with the Y(j) and Z(j) positions and the fX(j), fY(j) and fZ(j) thus making the do j= loop favorable to vectorization.

In the second example each particle has 14 real variables, thus:

Particle(j)%Position(1)
Is separated by 13 real variables (52 bytes) from
Particle(j+1)%Position(1)

Thus forcing the do j=loop to lose ability to be vectorized.

Are there a few more lines of code to write? A few more...
Is it harder to read? No - as easy to read
Is it more code efficient? Deffinately. AVX2 will iterate the do j= loop 1/8th as many times.

Jim Dempsey

jimdempseyatthecove · ‎12-17-2018

>> forget about scatter/gather

Assume AVX512 particles with real properties (4 byte) and 14 properties as above, yielding a record size of 56 bytes.

Although it may be unlikely, let us assume you manage to get the do j= loop of the AoS (second) sample code to vectorize (vector size of 16 real/float).

The SoA code (first example), when issuing a vector fetch from memory has:

(instruction fetch/decode) + 1 RAM memory access to read the cache line containing the vector.

The AoS code (second example), when issuing a vector fetch from memory has:

(instruction fetch/decode) + 14 RAM memory accesses (56*16/64) to read the cache lines containing the vector.

On AVX512 a gather of 16 real variables of structure of 56 bytes, spans 14 cache lines.

Although (in this hypothetical example) both do j= loops are vectorized and iterate the same number of times (nParticles / 16), the AoS code experiences 14 times the memory fetch overhead as does the SoA format.

Which technique would you think is more efficient?

Jim Dempsey

jimdempseyatthecove · ‎12-17-2018

BTW - forget about attempting to get the linked list to perform vector gather/scatter. While this can be done, the practicality of constructing a list of indices makes it inefficient as well as obfuscates any algorithm using it.

Jim Dempsey

jimdempseyatthecove · ‎12-17-2018

Side note

You might notice that the examples above perform more (abstract) calculations than necessary. IOW the distance particles A and B are calculated twice A-B and B-A. And that the common optimization technique of

do i=1,nParticles-1
do j=i+1,nParticles

and where you compute the forces between the two particles (A and B) once, that you will note that to accumulate these forces in multi-threaded code that you will either need a critical/atomic section or use a reduction that n-tuples the number of temporary arrays with the corresponding implicit critical section during the reduction. In many cases, the redundant computation is less than the cost of critical/atomic section or reduction. You can experiment with the reduction if you desire.

Jim Dempsey

AT · ‎12-18-2018

Hello Jim,

Thank you very much for the detailed examples. It has helped to understand how the memory access works much better, and I think it is very useful to be aware of those things. The reason why I were looking for linkedlist, and/or AoS is twofold. Dealing with a very large code (10k+ code lines) makes thing easier to structure them, but more importantly here is when having millions of particles which many of those die and others are introduced during runtime, it may become more helpful with a linkedlist approach, so that you can save memory space and avoid deallocation/allocation when the initial array of particles is not big enough. The latter problem is common for both cases AoS and SoA.
A bit unfortunate that you can't tell the compiler to store them as if they were seperate arrays, but on the code said remained arrays of structures. I guess if that was possible, it could be a nice feature

jimdempseyatthecove · ‎12-18-2018

Conceptually, a linked list is easier to visualize (manage) for the death and birth of particles. This said, one must take into consideration the length of time waiting for a simulation to complete (times the number of simulations) while running scalar non-streaming data verses vector streaming data of the SoA format.

Using SoA format, one would pre-allocate for worst case population, fill in for initial population, and index in the range of the initial population. Should a birth/split occur, add it to the unused cells (initially at the end of the current iteration space). Deaths are a little different.

One method for death handling is to take the end particle and move it to the dead cell (this is expensive when the particle has a large number of properties. Note for multi-threaded code this can be from the end of the partition of the population subset that the thread is working on.

A second technique is to convert a/the property/s of the dead cell such that including it in the population is benign to the calculation. e.g. for Mass, define your particle property as 1/Mass (then use * in place of /) and for dead cell, change the 1/Mass property to 0.0. At some point in time, you would eliminate the interior dead cells (say when you checkpoint the data or when you update a visualization).

While time to program is important, time waiting for program may be more important.

Jim Dempsey

AT · ‎12-18-2018

Thank you for your help Jim. Just to ensure one thing, I know what multithreaded code is, although my initial enquiry was on the compilers ability to vectorize i.e. using features like !omp simd etc. But I generally guess if the code can be multithreaded, then it can also be vectorized.

I was know checking the difference between having a variable x(3,nopart) and particle(1:nopart)% x(3) and making some loop test. I used omp_get_wtime. What I observed was for the first case it was taking around 1 second, while the latter two seconds. However, the latter felt much longer than two seconds!. I used my stopwatch from my smartphone and realized that the latter actually took around 10 seconds, while the first case was more and less the same as physical time.

Is that because omp_get_wtime does not account for the time when the processor is waiting for memory to be fetched ? in that case should a more reasonable comparison obviously take this into account, and what else function could I use

I tried mpi_wtime and cpu_time without any luck

AT · ‎12-18-2018

This was my example

type particle 
   real :: x(3) 
   real :: dum1 
   real :: dum2 
   real :: dum3
   real :: dum4 
   real :: dum5
   real :: dum6 
   real :: dum7 r
   real :: dum8 
end type 

allocate(particle(20*10e6))
t1 = omp_get_wtime()
do i = 1, 20*10e6 
   particle(i) % x(1) = exp(i) - exp(i/2) 
   particle(i) % x(2) = exp(i) - exp(i/2) 
   particle(i) % x(3) = exp(i) - exp(i/2) 
enddo 
t2 = omp_get_wtime() ! double precision
print*, t2-t1

! another example 
real, allocatable :: x(:,:) 
allocate(x(3,20*10e6)
t1 = omp_get_wtime()
do i = 1, 20*10e6 
   x(1,i) = exp(i) - exp(i/2) 
   x(2,i) = exp(i) - exp(i/2) 
   x(3,i) = exp(i) - exp(i/2) 
enddo

t2 = omp_get_wtime()
print*, t2-t1

the dums in type is just to create the real scenario of having many other members of that type and to create the steps in memory addresses as you mentioned previously.

compiled in double precision (-r8)

jimdempseyatthecove · ‎12-18-2018

Example code:

!  ATparticle.f90 
!
!  FUNCTIONS:
!  ATparticle - Entry point of console application.
!

!****************************************************************************
!
!  PROGRAM: ATparticle
!
!  PURPOSE:  Entry point for the console application.
!
!****************************************************************************
module my_module
    integer :: nParticles
    integer :: nTimeSteps
    real, parameter :: dT = 0.001 ! delta time ?
    real, parameter :: Gmks = 6.6742867E-11
    type particle_t
        real :: Pos(3)
        real :: Vel(3)
        real :: Force(3)
        real :: Mass
        real :: dum1
    end type particle_t
    
    type(particle_t), allocatable :: Particle(:)
    
    ! SoA format
    type SoA_t
        real, dimension(:), allocatable :: PosX, PosY, PosZ, VelX, VelY, VelZ, ForceX, ForceY, ForceZ, Mass, dum1
    end type SoA_t
    
    type(SoA_t) :: SoA
    
    contains
    
    subroutine Init(n,s)
        implicit none
        integer :: n, s
        nParticles = n
        nTimeSteps = s
        call InitAoS
        call InitSoA
    end subroutine Init

    subroutine InitAoS
        implicit none
        integer :: i
        allocate(Particle(nParticles))
        do i = 1, nParticles
            call RANDOM_NUMBER(Particle(i) % Pos)   ! random position 0.0-1.0 (this should be scaled to fit your domain)
            Particle(i) % Vel = 0.0                 ! at rest (can change to meet your needs)
            Particle(i) % Force = 0.0
            call RANDOM_NUMBER(Particle(i) % Mass)  ! arbitrary Mass
            Particle(i) % dum1 = 0.0
        end do
    end subroutine InitAoS
    
    subroutine InitSoA
        implicit none
        allocate(SoA%PosX(nParticles), SoA%PosY(nParticles), SoA%PosZ(nParticles), &
            SoA%VelX(nParticles), SoA%VelY(nParticles), SoA%VelZ(nParticles), &
            SoA%ForceX(nParticles), SoA%ForceY(nParticles), SoA%ForceZ(nParticles), &
            SoA%Mass(nParticles), &
            &SoA%dum1(nParticles))
        call RANDOM_NUMBER(SoA%PosX)
        call RANDOM_NUMBER(SoA%PosY)
        call RANDOM_NUMBER(SoA%PosZ)
        SoA%VelX = 0.0
        SoA%VelY = 0.0
        SoA%VelZ = 0.0
        SoA%ForceX = 0.0
        SoA%ForceY = 0.0
        SoA%ForceZ = 0.0
        call RANDOM_NUMBER(SoA%Mass)
        SoA%dum1 = 0.0
    end subroutine InitSoA
    
    subroutine Free
        implicit none
        call FreeAoS
        call FreeSoA
    end subroutine Free
    
    subroutine FreeAoS
        implicit none
        deallocate(Particle)
    end subroutine FreeAoS
    
    subroutine FreeSoA
        implicit none
        deallocate(SoA%PosX, SoA%PosY, SoA%PosZ, &
            SoA%VelX, SoA%VelY, SoA%VelZ, &
            SoA%ForceX, SoA%ForceY, SoA%ForceZ, &
            SoA%Mass, &
            SoA%dum1)
    end subroutine FreeSoA
    
end module my_module
    
program ATparticle
    use omp_lib
    use my_module
    implicit none
    real :: t0, t1, t2
    integer :: i
    call Init(int(20*10e3), 1)    ! nParticles, nTimeSteps
    ! repeat to eliminate once only time overhead
    do i=1,3
        t0 = omp_get_wtime()
        call AsParticle
        t1 = omp_get_wtime()
        call AsSoA
        t2 = omp_get_wtime()
        print *, t1-t0, t2-t1, (t1-t0) / (t2-t1)
    end do
    call Free
end program ATparticle

subroutine AsParticle
    use my_module
    implicit none
    integer :: i,j, iStep
    real :: distanceVector(3), forceMagnitude, forceVector(3), radiusSq, unitVector(3), accVector(3)
    do iStep = 1, nTimeSteps
        do i = 1, nParticles
            ! zero out force vector for this timestep
            Particle(i) % Force = 0.0
        enddo
        ! accumulate forces
        do i = 1, nParticles-1
            do j=i+1,nParticles
                distanceVector =  Particle(i) % Pos - Particle(j) % Pos
                radiusSq = sum(distanceVector**2)
                unitVector = distanceVector / sqrt(radiusSq)
                forceMagnitude = Gmks * Particle(i) % Mass * Particle(j) % Mass / radiusSq
                forceVector = unitVector * forceMagnitude
                Particle(i) % Force = Particle(i) % Force + forceVector
                Particle(j) % Force = Particle(j) % Force - forceVector
            end do
        enddo
        ! accumulate forces into velocity, and advance position
        do i = 1, nParticles
            accVector = Particle(i) % Force / Particle(i) % Mass
            Particle(i) % Vel = Particle(i) % Vel + accVector
            Particle(i) % Pos = Particle(i) % Pos + Particle(i) % Vel * dT
        enddo
    end do
end subroutine AsParticle
    
! another example
subroutine AsSoA
    use my_module
    implicit none
    integer :: i,j,iStep
    real :: distanceVectorX, distanceVectorY, distanceVectorZ, radiusSq, radius
    real :: unitVectorX, unitVectorY, unitVectorZ, forceMagnitude
    
    do iStep = 1, nTimeSteps
        ! zero out force vector for this timestep
        SoA % ForceX = 0.0
        SoA % ForceY = 0.0
        SoA % ForceZ = 0.0
        ! accumulate forces
        do i = 1, nParticles-1
            do j=i+1,nParticles
                distanceVectorX =  SoA % PosX(i) - SoA % PosX(j)
                distanceVectorY =  SoA % PosY(i) - SoA % PosY(j)
                distanceVectorZ =  SoA % PosZ(i) - SoA % PosZ(j)
                radiusSq = distanceVectorX**2 + distanceVectorY**2 + distanceVectorZ**2
                radius = sqrt(radiusSq)
                unitVectorX = distanceVectorX / radius
                unitVectorY = distanceVectorY / radius
                unitVectorZ = distanceVectorZ / radius
                forceMagnitude = Gmks * SoA % Mass(i) * SoA % Mass(j) / radiusSq
                SoA % ForceX(i) = SoA % ForceX(i) + forceMagnitude * unitVectorX
                SoA % ForceX(j) = SoA % ForceX(j) - forceMagnitude * unitVectorX
                SoA % ForceY(i) = SoA % ForceY(i) + forceMagnitude * unitVectorY
                SoA % ForceY(j) = SoA % ForceY(j) - forceMagnitude * unitVectorY
                SoA % ForceZ(i) = SoA % ForceZ(i) + forceMagnitude * unitVectorZ
                SoA % ForceZ(j) = SoA % ForceZ(j) - forceMagnitude * unitVectorZ
            end do
        enddo
        ! accumulate forces into velocity, and advance position
        SoA % VelX = SoA % VelX + SoA % ForceX / SoA % Mass
        SoA % VelY = SoA % VelY + SoA % ForceY / SoA % Mass
        SoA % VelZ = SoA % VelZ + SoA % ForceZ / SoA % Mass
        SoA % PosX = SoA % PosX + SoA % VelX * dT
        SoA % PosY = SoA % PosY + SoA % VelY * dT
        SoA % PosZ = SoA % PosZ + SoA % VelZ * dT
    end do
end subroutine AsSoA

Output on Core i7 2600K (first generation of AVX), single threaded

   75.87500       22.50000       3.372222
   76.25000       22.37500       3.407821
   75.75000       22.25000       3.404494
Press any key to continue . . .

Code not verified for correctness, used as representative example
Code does not account for coincident particles (separation of 0.0)
Code not parallelized

This illustrates for a system with AVX (1) that vectorization yields a 3.4x improvement in performance.

*** Edited after post error in unitVector calculation (used radiusSq instead of radius)

Jim Dempsey

jimdempseyatthecove · ‎12-18-2018

The correction above reduced the gain to 2.23x (due to formerly optimizing out the sqrt(radiusSq) from the code)

I haven't ported this to my AVX512 system as of this time.

Jim Dempsey