Hello Intel Community.
There are a few things which I would like to understand. I hope you can illuminate a bit on my confusions.
First, I have a set of mathematical operations which I need to carry out to millions of particles, and they don't depend on one another. I initially thought to have a linkedlist approach where my main loop will be something like
do while( associated(particle) ) (...) particle => particle% next end do
The question is how the compiler decides to vectorize this do while loop. The reason for the linkedlist is the ease to add and delete particles. The other option would be an array of a declared type, where the loop would be something like
do i = 1, nopart particle(i)% (...) = (...) (...) enddo
I guess the latter would be easy for the compiler to vectorize or ?
The next question is the directives for vectorization. I read that (INTEL) would suggest from now on to employ the openmp simd directives to tell the compiler explicitly to vectorize, right?
And finally, when compiling with the auto vectorization enabled, does it only include adding the flag -parallel -par-threshold=(XX) ?
Thanks very much
Vectorization requires the like properties of the particles to be in adjacent memory (forget about scatter/gather). IOW:
real, allocatable, dimension(:) :: X, Y, Z, dX, dY, dZ, fX, fY, fZ, Mass, Charge, Spin, ...
! *** do i=1,noPart ... ! Force calculation dX =dX + (fX / Mass) * dT ! acceleration to velocity X = X + dX * dT ! velocity to position ... ! other ! *** end do
Linked list, as well as having each particle object containing the above individual properties scatters the data about memory, or more importantly the like properties are scattered.
Keep in mind, that what is easy for you is hard for computation.
Note, your "do i=1, nopart" loop will experience no vectorizaton due to your having organized your data being organized by way particle objects as opposed to particle properties.
Thanks for your reply.
I am not sure I understand fully what you mean by the last sentence you wrote. I was considering that instead I would group the particle properties into a declared type such like:
type particle_prop real :: x(3) real :: u(3) (...) end type type(particle_prop) :: particle(no_part)
And then do my loop over all particles. Would that also not be stored in contiguous memory. i.e. will particle(1:no_part)%x(1) not be stored adjacently, and therefore no ability to vectorize?
What do you mean by "forget about scatter/gather). IOW:"
Does Fortran not provide a way to overcome this when using declared types, and more importantly, is there no way to use linked list approach with vectorizable code?
Two comparable sections of code:
real, allocatable, dimension(:) :: X, Y, Z, dX, dY, dZ, fX, fY, fZ, OmegaX, OmegaY, OmegaZ ! 3D properties as three 1D arrays per property real, allocatable, dimension(:) :: Mass, Charge ! 1D property ! other properties ... allocate(X(nParticles), Y(nParticles), Z(nParticles)) allocate(dX(nParticles), dY(nParticles), dZ(nParticles)) allocate(fX(nParticles), fY(nParticles), fZ(nParticles)) allocate(OmegaX(nParticles), OmegaY(nParticles), OmegaZ(nParticles)) allocate(Mass(nParticles), Charge(nParticles)) ... !$omp parallel do private(j) do i=1,nParticles fX(i) = 0.0 fY(i) = 0.0 fZ(i) = 0.0 if(i > 1) call CalcForceG(i,1,i-1) if(i < nParticles) call CalcForceG(i,i+1,nParticles) end do ! i=1,nParticles ... subroutine CalcForceG(i,jBegin,jEnd) use YourModuleWithParticles implicit none integer :: i,jBegin,jEnd integer :: j real :: DistanceX, DistanceY, DistanceZ, rSquared, r, UnitX, UnitY, UnitZ, Fmagnitude do j=jBegin, jEnd DistanceX = X(i)-X(j) DistanceY = Y(i)-Y(j) DistanceZ = Z(i)-Z(j) rSquared = DistanceX**2 + DistanceY**2 + DistanceZ**2 r = sqrt(rSquared) UnitX= DistanceX / r UnitY= DistanceY / r UnitZ= DistanceZ / r Fmagnitude = G * Mass(i) * Mass(j) / rSquared fX(i) = fX(i) + Fmagnitude * UnitX fY(i) = fY(i) + Fmagnitude * UnitY fZ(i) = fZ(i) + Fmagnitude * UnitZ end do end subroutine CalcForceG verses ---------------------------- type Particle real :: Position(3), Velocity(3), Force(3), Omega(3) real :: Mass, Charge end type Particle type(Particle), allocatable :: Particles(:) ... allocate(Particles(nParticles)) ... !$omp parallel do private(j) do i=1,nParticles Particles(i)%Force = 0.0 if(i > 1) call CalcForceG(i,1,i-1) if(i < nParticles) call CalcForceG(i,i+1,nParticles) end do ! i=1,nParticles ... subroutine CalcForceG(i,jBegin,jEnd) use YourModuleWithParticles implicit none integer :: i,jBegin,jEnd integer :: j real :: Distance(3), rSquared, r, Unit(3), Fmagnitude do j=jBegin, jEnd Distance = Particles(i)%Position - Particles(j)%Position rSquared = Distance(1)**2 + Distance(2)**2 + Distance(3)**2 r = sqrt(rSquared) Unit= Distance / r Fmagnitude = G * Mass(i) * Mass(j) / rSquared Particles(i)%Force = Particles(i)%Force + Fmagnitude * UnitX end do end subroutine CalcForceG
The first section is commonly called Structure Of Arrays (SoA),
The second section is commonly called Array Of Structures (AoS)
In the first example all of the X(j)'s are contiguous, and thus vectoizable in "DistanceX = X(i)-X(j)"
Same with the Y(j) and Z(j) positions and the fX(j), fY(j) and fZ(j) thus making the do j= loop favorable to vectorization.
In the second example each particle has 14 real variables, thus:
Is separated by 13 real variables (52 bytes) from
Thus forcing the do j=loop to lose ability to be vectorized.
Are there a few more lines of code to write? A few more...
Is it harder to read? No - as easy to read
Is it more code efficient? Deffinately. AVX2 will iterate the do j= loop 1/8th as many times.
>> forget about scatter/gather
Assume AVX512 particles with real properties (4 byte) and 14 properties as above, yielding a record size of 56 bytes.
Although it may be unlikely, let us assume you manage to get the do j= loop of the AoS (second) sample code to vectorize (vector size of 16 real/float).
The SoA code (first example), when issuing a vector fetch from memory has:
(instruction fetch/decode) + 1 RAM memory access to read the cache line containing the vector.
The AoS code (second example), when issuing a vector fetch from memory has:
(instruction fetch/decode) + 14 RAM memory accesses (56*16/64) to read the cache lines containing the vector.
On AVX512 a gather of 16 real variables of structure of 56 bytes, spans 14 cache lines.
Although (in this hypothetical example) both do j= loops are vectorized and iterate the same number of times (nParticles / 16), the AoS code experiences 14 times the memory fetch overhead as does the SoA format.
Which technique would you think is more efficient?
BTW - forget about attempting to get the linked list to perform vector gather/scatter. While this can be done, the practicality of constructing a list of indices makes it inefficient as well as obfuscates any algorithm using it.
You might notice that the examples above perform more (abstract) calculations than necessary. IOW the distance particles A and B are calculated twice A-B and B-A. And that the common optimization technique of
and where you compute the forces between the two particles (A and B) once, that you will note that to accumulate these forces in multi-threaded code that you will either need a critical/atomic section or use a reduction that n-tuples the number of temporary arrays with the corresponding implicit critical section during the reduction. In many cases, the redundant computation is less than the cost of critical/atomic section or reduction. You can experiment with the reduction if you desire.
Thank you very much for the detailed examples. It has helped to understand how the memory access works much better, and I think it is very useful to be aware of those things. The reason why I were looking for linkedlist, and/or AoS is twofold. Dealing with a very large code (10k+ code lines) makes thing easier to structure them, but more importantly here is when having millions of particles which many of those die and others are introduced during runtime, it may become more helpful with a linkedlist approach, so that you can save memory space and avoid deallocation/allocation when the initial array of particles is not big enough. The latter problem is common for both cases AoS and SoA.
A bit unfortunate that you can't tell the compiler to store them as if they were seperate arrays, but on the code said remained arrays of structures. I guess if that was possible, it could be a nice feature
Conceptually, a linked list is easier to visualize (manage) for the death and birth of particles. This said, one must take into consideration the length of time waiting for a simulation to complete (times the number of simulations) while running scalar non-streaming data verses vector streaming data of the SoA format.
Using SoA format, one would pre-allocate for worst case population, fill in for initial population, and index in the range of the initial population. Should a birth/split occur, add it to the unused cells (initially at the end of the current iteration space). Deaths are a little different.
One method for death handling is to take the end particle and move it to the dead cell (this is expensive when the particle has a large number of properties. Note for multi-threaded code this can be from the end of the partition of the population subset that the thread is working on.
A second technique is to convert a/the property/s of the dead cell such that including it in the population is benign to the calculation. e.g. for Mass, define your particle property as 1/Mass (then use * in place of /) and for dead cell, change the 1/Mass property to 0.0. At some point in time, you would eliminate the interior dead cells (say when you checkpoint the data or when you update a visualization).
While time to program is important, time waiting for program may be more important.
Thank you for your help Jim. Just to ensure one thing, I know what multithreaded code is, although my initial enquiry was on the compilers ability to vectorize i.e. using features like !omp simd etc. But I generally guess if the code can be multithreaded, then it can also be vectorized.
I was know checking the difference between having a variable x(3,nopart) and particle(1:nopart)% x(3) and making some loop test. I used omp_get_wtime. What I observed was for the first case it was taking around 1 second, while the latter two seconds. However, the latter felt much longer than two seconds!. I used my stopwatch from my smartphone and realized that the latter actually took around 10 seconds, while the first case was more and less the same as physical time.
Is that because omp_get_wtime does not account for the time when the processor is waiting for memory to be fetched ? in that case should a more reasonable comparison obviously take this into account, and what else function could I use
I tried mpi_wtime and cpu_time without any luck
This was my example
type particle real :: x(3) real :: dum1 real :: dum2 real :: dum3 real :: dum4 real :: dum5 real :: dum6 real :: dum7 r real :: dum8 end type allocate(particle(20*10e6)) t1 = omp_get_wtime() do i = 1, 20*10e6 particle(i) % x(1) = exp(i) - exp(i/2) particle(i) % x(2) = exp(i) - exp(i/2) particle(i) % x(3) = exp(i) - exp(i/2) enddo t2 = omp_get_wtime() ! double precision print*, t2-t1 ! another example real, allocatable :: x(:,:) allocate(x(3,20*10e6) t1 = omp_get_wtime() do i = 1, 20*10e6 x(1,i) = exp(i) - exp(i/2) x(2,i) = exp(i) - exp(i/2) x(3,i) = exp(i) - exp(i/2) enddo
t2 = omp_get_wtime()
the dums in type is just to create the real scenario of having many other members of that type and to create the steps in memory addresses as you mentioned previously.
compiled in double precision (-r8)
! ATparticle.f90 ! ! FUNCTIONS: ! ATparticle - Entry point of console application. ! !**************************************************************************** ! ! PROGRAM: ATparticle ! ! PURPOSE: Entry point for the console application. ! !**************************************************************************** module my_module integer :: nParticles integer :: nTimeSteps real, parameter :: dT = 0.001 ! delta time ? real, parameter :: Gmks = 6.6742867E-11 type particle_t real :: Pos(3) real :: Vel(3) real :: Force(3) real :: Mass real :: dum1 end type particle_t type(particle_t), allocatable :: Particle(:) ! SoA format type SoA_t real, dimension(:), allocatable :: PosX, PosY, PosZ, VelX, VelY, VelZ, ForceX, ForceY, ForceZ, Mass, dum1 end type SoA_t type(SoA_t) :: SoA contains subroutine Init(n,s) implicit none integer :: n, s nParticles = n nTimeSteps = s call InitAoS call InitSoA end subroutine Init subroutine InitAoS implicit none integer :: i allocate(Particle(nParticles)) do i = 1, nParticles call RANDOM_NUMBER(Particle(i) % Pos) ! random position 0.0-1.0 (this should be scaled to fit your domain) Particle(i) % Vel = 0.0 ! at rest (can change to meet your needs) Particle(i) % Force = 0.0 call RANDOM_NUMBER(Particle(i) % Mass) ! arbitrary Mass Particle(i) % dum1 = 0.0 end do end subroutine InitAoS subroutine InitSoA implicit none allocate(SoA%PosX(nParticles), SoA%PosY(nParticles), SoA%PosZ(nParticles), & SoA%VelX(nParticles), SoA%VelY(nParticles), SoA%VelZ(nParticles), & SoA%ForceX(nParticles), SoA%ForceY(nParticles), SoA%ForceZ(nParticles), & SoA%Mass(nParticles), & &SoA%dum1(nParticles)) call RANDOM_NUMBER(SoA%PosX) call RANDOM_NUMBER(SoA%PosY) call RANDOM_NUMBER(SoA%PosZ) SoA%VelX = 0.0 SoA%VelY = 0.0 SoA%VelZ = 0.0 SoA%ForceX = 0.0 SoA%ForceY = 0.0 SoA%ForceZ = 0.0 call RANDOM_NUMBER(SoA%Mass) SoA%dum1 = 0.0 end subroutine InitSoA subroutine Free implicit none call FreeAoS call FreeSoA end subroutine Free subroutine FreeAoS implicit none deallocate(Particle) end subroutine FreeAoS subroutine FreeSoA implicit none deallocate(SoA%PosX, SoA%PosY, SoA%PosZ, & SoA%VelX, SoA%VelY, SoA%VelZ, & SoA%ForceX, SoA%ForceY, SoA%ForceZ, & SoA%Mass, & SoA%dum1) end subroutine FreeSoA end module my_module program ATparticle use omp_lib use my_module implicit none real :: t0, t1, t2 integer :: i call Init(int(20*10e3), 1) ! nParticles, nTimeSteps ! repeat to eliminate once only time overhead do i=1,3 t0 = omp_get_wtime() call AsParticle t1 = omp_get_wtime() call AsSoA t2 = omp_get_wtime() print *, t1-t0, t2-t1, (t1-t0) / (t2-t1) end do call Free end program ATparticle subroutine AsParticle use my_module implicit none integer :: i,j, iStep real :: distanceVector(3), forceMagnitude, forceVector(3), radiusSq, unitVector(3), accVector(3) do iStep = 1, nTimeSteps do i = 1, nParticles ! zero out force vector for this timestep Particle(i) % Force = 0.0 enddo ! accumulate forces do i = 1, nParticles-1 do j=i+1,nParticles distanceVector = Particle(i) % Pos - Particle(j) % Pos radiusSq = sum(distanceVector**2) unitVector = distanceVector / sqrt(radiusSq) forceMagnitude = Gmks * Particle(i) % Mass * Particle(j) % Mass / radiusSq forceVector = unitVector * forceMagnitude Particle(i) % Force = Particle(i) % Force + forceVector Particle(j) % Force = Particle(j) % Force - forceVector end do enddo ! accumulate forces into velocity, and advance position do i = 1, nParticles accVector = Particle(i) % Force / Particle(i) % Mass Particle(i) % Vel = Particle(i) % Vel + accVector Particle(i) % Pos = Particle(i) % Pos + Particle(i) % Vel * dT enddo end do end subroutine AsParticle ! another example subroutine AsSoA use my_module implicit none integer :: i,j,iStep real :: distanceVectorX, distanceVectorY, distanceVectorZ, radiusSq, radius real :: unitVectorX, unitVectorY, unitVectorZ, forceMagnitude do iStep = 1, nTimeSteps ! zero out force vector for this timestep SoA % ForceX = 0.0 SoA % ForceY = 0.0 SoA % ForceZ = 0.0 ! accumulate forces do i = 1, nParticles-1 do j=i+1,nParticles distanceVectorX = SoA % PosX(i) - SoA % PosX(j) distanceVectorY = SoA % PosY(i) - SoA % PosY(j) distanceVectorZ = SoA % PosZ(i) - SoA % PosZ(j) radiusSq = distanceVectorX**2 + distanceVectorY**2 + distanceVectorZ**2 radius = sqrt(radiusSq) unitVectorX = distanceVectorX / radius unitVectorY = distanceVectorY / radius unitVectorZ = distanceVectorZ / radius forceMagnitude = Gmks * SoA % Mass(i) * SoA % Mass(j) / radiusSq SoA % ForceX(i) = SoA % ForceX(i) + forceMagnitude * unitVectorX SoA % ForceX(j) = SoA % ForceX(j) - forceMagnitude * unitVectorX SoA % ForceY(i) = SoA % ForceY(i) + forceMagnitude * unitVectorY SoA % ForceY(j) = SoA % ForceY(j) - forceMagnitude * unitVectorY SoA % ForceZ(i) = SoA % ForceZ(i) + forceMagnitude * unitVectorZ SoA % ForceZ(j) = SoA % ForceZ(j) - forceMagnitude * unitVectorZ end do enddo ! accumulate forces into velocity, and advance position SoA % VelX = SoA % VelX + SoA % ForceX / SoA % Mass SoA % VelY = SoA % VelY + SoA % ForceY / SoA % Mass SoA % VelZ = SoA % VelZ + SoA % ForceZ / SoA % Mass SoA % PosX = SoA % PosX + SoA % VelX * dT SoA % PosY = SoA % PosY + SoA % VelY * dT SoA % PosZ = SoA % PosZ + SoA % VelZ * dT end do end subroutine AsSoA
Output on Core i7 2600K (first generation of AVX), single threaded
75.87500 22.50000 3.372222
76.25000 22.37500 3.407821
75.75000 22.25000 3.404494
Press any key to continue . . .
Code not verified for correctness, used as representative example
Code does not account for coincident particles (separation of 0.0)
Code not parallelized
This illustrates for a system with AVX (1) that vectorization yields a 3.4x improvement in performance.
*** Edited after post error in unitVector calculation (used radiusSq instead of radius)
The correction above reduced the gain to 2.23x (due to formerly optimizing out the sqrt(radiusSq) from the code)
I haven't ported this to my AVX512 system as of this time.