Hello Intel Community.
There are a few things I would like to understand. I hope you can shed some light on my confusion.
First, I have a set of mathematical operations which I need to carry out on millions of particles, and the particles don't depend on one another. I initially thought of a linked-list approach, where my main loop would be something like
do while( associated(particle) )
(...)
particle => particle% next
end do
The question is whether and how the compiler can vectorize this do while loop. The reason for the linked list is the ease of adding and deleting particles. The other option would be an array of a derived type, where the loop would be something like
do i = 1, nopart
particle(i)% (...) = (...)
(...)
enddo
I guess the latter would be easier for the compiler to vectorize, or am I wrong?
The next question is about directives for vectorization. I read that Intel now recommends using the OpenMP simd directives to tell the compiler explicitly to vectorize, right?
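For example, on the array version I imagine something like this (just a sketch of what I mean, with made-up names):
real :: x(nopart), u(nopart), dt
integer :: i
!$omp simd
do i = 1, nopart
x(i) = x(i) + u(i) * dt
end do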
And finally, to compile with auto-vectorization enabled, is it only a matter of adding the flags -parallel -par-threshold=(XX)?
Thanks very much
Vectorization requires the like properties of the particles to be in adjacent memory (forget about scatter/gather). IOW:
real, allocatable, dimension(:) :: X, Y, Z, dX, dY, dZ, fX, fY, fZ, Mass, Charge, Spin, ...
do i=1,noPart
... ! Force calculation
dX(i) = dX(i) + (fX(i) / Mass(i)) * dT ! acceleration to velocity
X(i) = X(i) + dX(i) * dT ! velocity to position
... ! other
end do
A linked list, as well as having each particle object contain the individual properties above, scatters the data about memory; more importantly, the like properties are scattered.
Keep in mind that what is easy for you is hard for computation.
Note: your "do i=1, nopart" loop will experience no vectorization because you have organized your data as particle objects rather than as particle properties.
Jim Dempsey
Thanks for your reply.
I am not sure I fully understand what you mean by your last sentence. I was considering instead grouping the particle properties into a derived type such as:
type particle_prop
real :: x(3)
real :: u(3)
(...)
end type
type(particle_prop) :: particle(no_part)
And then do my loop over all particles. Would that not also be stored in contiguous memory? I.e., will particle(1:no_part)%x(1) not be stored adjacently, and therefore there would be no ability to vectorize?
What do you mean by "forget about scatter/gather"?
Does Fortran not provide a way to overcome this when using derived types, and more importantly, is there no way to use a linked-list approach with vectorizable code?
Thanks
Two comparable sections of code:
real, allocatable, dimension(:) :: X, Y, Z, dX, dY, dZ, fX, fY, fZ, OmegaX, OmegaY, OmegaZ ! 3D properties as three 1D arrays per property
real, allocatable, dimension(:) :: Mass, Charge ! 1D property
! other properties
...
allocate(X(nParticles), Y(nParticles), Z(nParticles))
allocate(dX(nParticles), dY(nParticles), dZ(nParticles))
allocate(fX(nParticles), fY(nParticles), fZ(nParticles))
allocate(OmegaX(nParticles), OmegaY(nParticles), OmegaZ(nParticles))
allocate(Mass(nParticles), Charge(nParticles))
...
!$omp parallel do private(j)
do i=1,nParticles
fX(i) = 0.0
fY(i) = 0.0
fZ(i) = 0.0
if(i > 1) call CalcForceG(i,1,i-1)
if(i < nParticles) call CalcForceG(i,i+1,nParticles)
end do ! i=1,nParticles
...
subroutine CalcForceG(i,jBegin,jEnd)
use YourModuleWithParticles
implicit none
integer :: i,jBegin,jEnd
integer :: j
real :: DistanceX, DistanceY, DistanceZ, rSquared, r, UnitX, UnitY, UnitZ, Fmagnitude
do j=jBegin, jEnd
DistanceX = X(i)-X(j)
DistanceY = Y(i)-Y(j)
DistanceZ = Z(i)-Z(j)
rSquared = DistanceX**2 + DistanceY**2 + DistanceZ**2
r = sqrt(rSquared)
UnitX= DistanceX / r
UnitY= DistanceY / r
UnitZ= DistanceZ / r
Fmagnitude = G * Mass(i) * Mass(j) / rSquared
fX(i) = fX(i) + Fmagnitude * UnitX
fY(i) = fY(i) + Fmagnitude * UnitY
fZ(i) = fZ(i) + Fmagnitude * UnitZ
end do
end subroutine CalcForceG
versus ----------------------------
type Particle
real :: Position(3), Velocity(3), Force(3), Omega(3)
real :: Mass, Charge
end type Particle
type(Particle), allocatable :: Particles(:)
...
allocate(Particles(nParticles))
...
!$omp parallel do private(j)
do i=1,nParticles
Particles(i)%Force = 0.0
if(i > 1) call CalcForceG(i,1,i-1)
if(i < nParticles) call CalcForceG(i,i+1,nParticles)
end do ! i=1,nParticles
...
subroutine CalcForceG(i,jBegin,jEnd)
use YourModuleWithParticles
implicit none
integer :: i,jBegin,jEnd
integer :: j
real :: Distance(3), rSquared, r, Unit(3), Fmagnitude
do j=jBegin, jEnd
Distance = Particles(i)%Position - Particles(j)%Position
rSquared = Distance(1)**2 + Distance(2)**2 + Distance(3)**2
r = sqrt(rSquared)
Unit= Distance / r
Fmagnitude = G * Particles(i)%Mass * Particles(j)%Mass / rSquared
Particles(i)%Force = Particles(i)%Force + Fmagnitude * Unit
end do
end subroutine CalcForceG
The first section is commonly called Structure of Arrays (SoA); the second section is commonly called Array of Structures (AoS).
In the first example all of the X(j)'s are contiguous, and thus "DistanceX = X(i)-X(j)" is vectorizable.
The same holds for the Y(j) and Z(j) positions and the fX, fY and fZ arrays, making the do j= loop favorable to vectorization.
In the second example each particle has 14 real variables; thus Particle(j)%Position(1) is separated from Particle(j+1)%Position(1) by 13 real variables (52 bytes), forcing the do j= loop to lose the ability to be vectorized.
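Schematically, the AoS layout in memory looks like this (each cell one 4-byte real):
| Pos(1..3) Vel(1..3) Force(1..3) Omega(1..3) Mass Charge | Pos(1..3) Vel(1..3) ...
|<------------------ Particle(j), 56 bytes -------------->|<----- Particle(j+1) ...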
Are there a few more lines of code to write? A few more...
Is it harder to read? No, just as easy to read.
Is it more efficient code? Definitely. AVX2 will iterate the do j= loop 1/8th as many times.
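And if the compiler needs a nudge, an OpenMP simd directive with scalar reductions on the do j= loop is one way to be explicit (a sketch only, reusing the SoA names from above):
real :: sumFX, sumFY, sumFZ
sumFX = 0.0; sumFY = 0.0; sumFZ = 0.0
!$omp simd private(DistanceX, DistanceY, DistanceZ, rSquared, r, Fmagnitude) reduction(+:sumFX, sumFY, sumFZ)
do j=jBegin, jEnd
DistanceX = X(i)-X(j)
DistanceY = Y(i)-Y(j)
DistanceZ = Z(i)-Z(j)
rSquared = DistanceX**2 + DistanceY**2 + DistanceZ**2
r = sqrt(rSquared)
Fmagnitude = G * Mass(i) * Mass(j) / rSquared
sumFX = sumFX + Fmagnitude * DistanceX / r
sumFY = sumFY + Fmagnitude * DistanceY / r
sumFZ = sumFZ + Fmagnitude * DistanceZ / r
end do
fX(i) = fX(i) + sumFX
fY(i) = fY(i) + sumFY
fZ(i) = fZ(i) + sumFZ
This accumulates into scalars, so the fX(i) store is hoisted out of the vector loop.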
Jim Dempsey
>> forget about scatter/gather
Assume AVX-512, particles with real (4-byte) properties, and 14 properties per particle as above, yielding a record size of 56 bytes.
Although it may be unlikely, let us assume you manage to get the do j= loop of the AoS (second) sample code to vectorize (vector size of 16 real/float).
The SoA code (first example), when issuing a vector fetch from memory has:
(instruction fetch/decode) + 1 RAM memory access to read the cache line containing the vector.
The AoS code (second example), when issuing a vector fetch from memory has:
(instruction fetch/decode) + 14 RAM memory accesses to read the cache lines containing the vector (56 bytes * 16 particles / 64 bytes per cache line = 14).
On AVX-512, a gather of 16 real variables from structures of 56 bytes spans 14 cache lines.
Although (in this hypothetical example) both do j= loops are vectorized and iterate the same number of times (nParticles / 16), the AoS code experiences 14 times the memory fetch overhead of the SoA format.
Which technique would you think is more efficient?
Jim Dempsey
BTW - forget about attempting to get the linked list to perform vector gather/scatter. While this can be done, the cost of constructing a list of indices makes it inefficient, and it obfuscates any algorithm using it.
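To see why, here is roughly what it would take (a sketch only, with hypothetical node and array names): you walk the list once just to pack the properties contiguously, which already costs a scalar pass over all particles before any vector work begins.
type(node_t), pointer :: p
real :: tmpX(maxPart) ! one scratch array per property
integer :: n
n = 0
p => head
do while ( associated(p) )
n = n + 1
tmpX(n) = p % x ! pack each property into its scratch array
p => p % next
end do
! ... vectorize over tmpX(1:n) ..., then walk the list again to scatter results back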
Jim Dempsey
Side note
You might notice that the examples above perform more (abstract) calculations than necessary. IOW the distance between particles A and B is calculated twice, as A-B and as B-A. There is the common optimization technique of
do i=1,nParticles-1
do j=i+1,nParticles
where you compute the force between each pair (A and B) only once. Note, however, that to accumulate these forces in multi-threaded code you will either need a critical/atomic section, or use a reduction, which n-tuples the number of temporary arrays and carries the corresponding implicit critical section during the reduction. In many cases the redundant computation costs less than the critical/atomic section or the reduction. You can experiment with the reduction if you desire; a sketch follows.
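A minimal sketch of the reduction variant, reusing the SoA arrays from the earlier example (and assuming your OpenMP supports Fortran array reductions):
!$omp parallel do private(j, DistanceX, DistanceY, DistanceZ, rSquared, r, Fmagnitude) reduction(+:fX, fY, fZ)
do i=1,nParticles-1
do j=i+1,nParticles
DistanceX = X(i)-X(j)
DistanceY = Y(i)-Y(j)
DistanceZ = Z(i)-Z(j)
rSquared = DistanceX**2 + DistanceY**2 + DistanceZ**2
r = sqrt(rSquared)
Fmagnitude = G * Mass(i) * Mass(j) / rSquared
fX(i) = fX(i) + Fmagnitude * DistanceX / r ! force on i from j
fX(j) = fX(j) - Fmagnitude * DistanceX / r ! equal and opposite on j
! ... same for fY and fZ ...
end do
end do
! each thread accumulates into private copies of fX, fY, fZ, which OpenMP sums at the end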
Jim Dempsey
Hello Jim,
Thank you very much for the detailed examples. They have helped me understand how memory access works much better, and I think it is very useful to be aware of these things. The reason I was looking at a linked list and/or AoS is twofold. When dealing with a very large code (10k+ lines), it makes things easier to structure. More importantly, when you have millions of particles, many of which die while others are introduced during runtime, a linked-list approach may be more helpful, since you can save memory and avoid deallocation/allocation when the initial array of particles is not big enough. The latter problem is common to both AoS and SoA.
It is a bit unfortunate that you can't tell the compiler to store them as if they were separate arrays while on the code side they remain arrays of structures. I guess if that were possible, it could be a nice feature.
Conceptually, a linked list is easier to visualize (and manage) for the death and birth of particles. That said, one must weigh the length of time spent waiting for a simulation to complete (times the number of simulations) when running scalar, non-streaming data versus vector, streaming data in the SoA format.
Using the SoA format, one would pre-allocate for the worst-case population, fill in the initial population, and index within the range of the current population. Should a birth/split occur, add it to the unused cells (initially at the end of the current iteration space). Deaths are a little different.
One method of death handling is to take the end particle and move it into the dead cell (this is expensive when the particle has a large number of properties). Note that for multi-threaded code this can be done from the end of the partition of the population subset that the thread is working on.
A second technique is to convert a property (or properties) of the dead cell such that including it in the population is benign to the calculation. E.g. for Mass, define your particle property as 1/Mass (then use * in place of /), and for a dead cell change the 1/Mass property to 0.0. At some point you would eliminate the interior dead cells (say, when you checkpoint the data or update a visualization). Sketches of both techniques follow.
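Minimal sketches of both techniques, assuming SoA arrays and a live-particle count nLive (names hypothetical):
! 1) swap-with-last: move the last live particle into dead slot k
X(k) = X(nLive); Y(k) = Y(nLive); Z(k) = Z(nLive)
! ... copy the remaining properties likewise ...
nLive = nLive - 1
! 2) benign value: store InvMass instead of Mass, and zero it for a dead cell
InvMass(k) = 0.0
! the acceleration update then needs no branch:
! dX(i) = dX(i) + fX(i) * InvMass(i) * dT ! contributes nothing when InvMass(i) == 0.0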
While time to program is important, time spent waiting for the program may be more important.
Jim Dempsey
Thank you for your help, Jim. Just to make one thing clear: I know what multithreaded code is; my initial enquiry was about the compiler's ability to vectorize, i.e. using features like !$omp simd etc. But I generally guess that if the code can be multithreaded, then it can also be vectorized.
I was now checking the difference between having a variable x(3,nopart) and particle(1:nopart)%x(3), making some loop tests. I used omp_get_wtime. What I observed was that the first case took around 1 second, while the latter took two seconds. However, the latter felt much longer than two seconds! I used the stopwatch on my smartphone and realized that the latter actually took around 10 seconds, while the first case was more or less the same as wall-clock time.
Is that because omp_get_wtime does not account for the time when the processor is waiting for memory to be fetched? In that case a more reasonable comparison should obviously take this into account; what other function could I use?
I tried mpi_wtime and cpu_time without any luck.
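I also looked at system_clock, something like this (not sure it measures what I want):
integer :: c0, c1, crate
call system_clock(c0, crate)
! ... timed loop ...
call system_clock(c1)
print *, real(c1-c0) / real(crate) ! elapsed wall-clock seconds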
This was my example:
type particle_t
real :: x(3)
real :: dum1
real :: dum2
real :: dum3
real :: dum4
real :: dum5
real :: dum6
real :: dum7
real :: dum8
end type
type(particle_t), allocatable :: particle(:)
allocate(particle(int(20*10e6)))
t1 = omp_get_wtime()
do i = 1, int(20*10e6)
particle(i) % x(1) = exp(real(i)) - exp(real(i/2))
particle(i) % x(2) = exp(real(i)) - exp(real(i/2))
particle(i) % x(3) = exp(real(i)) - exp(real(i/2))
enddo
t2 = omp_get_wtime() ! double precision
print*, t2-t1
! another example
real, allocatable :: x(:,:)
allocate(x(3,int(20*10e6)))
t1 = omp_get_wtime()
do i = 1, int(20*10e6)
x(1,i) = exp(real(i)) - exp(real(i/2))
x(2,i) = exp(real(i)) - exp(real(i/2))
x(3,i) = exp(real(i)) - exp(real(i/2))
enddo
t2 = omp_get_wtime()
print*, t2-t1
The dums in the type are just there to create the realistic scenario of having many other members in that type, and to create the strides in memory addresses you mentioned previously.
Compiled in double precision (-r8).
Example code:
! ATparticle.f90
!
! FUNCTIONS:
! ATparticle - Entry point of console application.
!
!****************************************************************************
!
! PROGRAM: ATparticle
!
! PURPOSE: Entry point for the console application.
!
!****************************************************************************
module my_module
integer :: nParticles
integer :: nTimeSteps
real, parameter :: dT = 0.001 ! delta time ?
real, parameter :: Gmks = 6.6742867E-11
type particle_t
real :: Pos(3)
real :: Vel(3)
real :: Force(3)
real :: Mass
real :: dum1
end type particle_t
type(particle_t), allocatable :: Particle(:)
! SoA format
type SoA_t
real, dimension(:), allocatable :: PosX, PosY, PosZ, VelX, VelY, VelZ, ForceX, ForceY, ForceZ, Mass, dum1
end type SoA_t
type(SoA_t) :: SoA
contains
subroutine Init(n,s)
implicit none
integer :: n, s
nParticles = n
nTimeSteps = s
call InitAoS
call InitSoA
end subroutine Init
subroutine InitAoS
implicit none
integer :: i
allocate(Particle(nParticles))
do i = 1, nParticles
call RANDOM_NUMBER(Particle(i) % Pos) ! random position 0.0-1.0 (this should be scaled to fit your domain)
Particle(i) % Vel = 0.0 ! at rest (can change to meet your needs)
Particle(i) % Force = 0.0
call RANDOM_NUMBER(Particle(i) % Mass) ! arbitrary Mass
Particle(i) % dum1 = 0.0
end do
end subroutine InitAoS
subroutine InitSoA
implicit none
allocate(SoA%PosX(nParticles), SoA%PosY(nParticles), SoA%PosZ(nParticles), &
SoA%VelX(nParticles), SoA%VelY(nParticles), SoA%VelZ(nParticles), &
SoA%ForceX(nParticles), SoA%ForceY(nParticles), SoA%ForceZ(nParticles), &
SoA%Mass(nParticles), &
SoA%dum1(nParticles))
call RANDOM_NUMBER(SoA%PosX)
call RANDOM_NUMBER(SoA%PosY)
call RANDOM_NUMBER(SoA%PosZ)
SoA%VelX = 0.0
SoA%VelY = 0.0
SoA%VelZ = 0.0
SoA%ForceX = 0.0
SoA%ForceY = 0.0
SoA%ForceZ = 0.0
call RANDOM_NUMBER(SoA%Mass)
SoA%dum1 = 0.0
end subroutine InitSoA
subroutine Free
implicit none
call FreeAoS
call FreeSoA
end subroutine Free
subroutine FreeAoS
implicit none
deallocate(Particle)
end subroutine FreeAoS
subroutine FreeSoA
implicit none
deallocate(SoA%PosX, SoA%PosY, SoA%PosZ, &
SoA%VelX, SoA%VelY, SoA%VelZ, &
SoA%ForceX, SoA%ForceY, SoA%ForceZ, &
SoA%Mass, &
SoA%dum1)
end subroutine FreeSoA
end module my_module
program ATparticle
use omp_lib
use my_module
implicit none
double precision :: t0, t1, t2 ! omp_get_wtime() returns double precision
integer :: i
call Init(int(20*10e3), 1) ! nParticles, nTimeSteps
! repeat to eliminate once only time overhead
do i=1,3
t0 = omp_get_wtime()
call AsParticle
t1 = omp_get_wtime()
call AsSoA
t2 = omp_get_wtime()
print *, t1-t0, t2-t1, (t1-t0) / (t2-t1)
end do
call Free
end program ATparticle
subroutine AsParticle
use my_module
implicit none
integer :: i,j, iStep
real :: distanceVector(3), forceMagnitude, forceVector(3), radiusSq, unitVector(3), accVector(3)
do iStep = 1, nTimeSteps
do i = 1, nParticles
! zero out force vector for this timestep
Particle(i) % Force = 0.0
enddo
! accumulate forces
do i = 1, nParticles-1
do j=i+1,nParticles
distanceVector = Particle(i) % Pos - Particle(j) % Pos
radiusSq = sum(distanceVector**2)
unitVector = distanceVector / sqrt(radiusSq)
forceMagnitude = Gmks * Particle(i) % Mass * Particle(j) % Mass / radiusSq
forceVector = unitVector * forceMagnitude
Particle(i) % Force = Particle(i) % Force + forceVector
Particle(j) % Force = Particle(j) % Force - forceVector
end do
enddo
! accumulate forces into velocity, and advance position
do i = 1, nParticles
accVector = Particle(i) % Force / Particle(i) % Mass
Particle(i) % Vel = Particle(i) % Vel + accVector
Particle(i) % Pos = Particle(i) % Pos + Particle(i) % Vel * dT
enddo
end do
end subroutine AsParticle
! another example
subroutine AsSoA
use my_module
implicit none
integer :: i,j,iStep
real :: distanceVectorX, distanceVectorY, distanceVectorZ, radiusSq, radius
real :: unitVectorX, unitVectorY, unitVectorZ, forceMagnitude
do iStep = 1, nTimeSteps
! zero out force vector for this timestep
SoA % ForceX = 0.0
SoA % ForceY = 0.0
SoA % ForceZ = 0.0
! accumulate forces
do i = 1, nParticles-1
do j=i+1,nParticles
distanceVectorX = SoA % PosX(i) - SoA % PosX(j)
distanceVectorY = SoA % PosY(i) - SoA % PosY(j)
distanceVectorZ = SoA % PosZ(i) - SoA % PosZ(j)
radiusSq = distanceVectorX**2 + distanceVectorY**2 + distanceVectorZ**2
radius = sqrt(radiusSq)
unitVectorX = distanceVectorX / radius
unitVectorY = distanceVectorY / radius
unitVectorZ = distanceVectorZ / radius
forceMagnitude = Gmks * SoA % Mass(i) * SoA % Mass(j) / radiusSq
SoA % ForceX(i) = SoA % ForceX(i) + forceMagnitude * unitVectorX
SoA % ForceX(j) = SoA % ForceX(j) - forceMagnitude * unitVectorX
SoA % ForceY(i) = SoA % ForceY(i) + forceMagnitude * unitVectorY
SoA % ForceY(j) = SoA % ForceY(j) - forceMagnitude * unitVectorY
SoA % ForceZ(i) = SoA % ForceZ(i) + forceMagnitude * unitVectorZ
SoA % ForceZ(j) = SoA % ForceZ(j) - forceMagnitude * unitVectorZ
end do
enddo
! accumulate forces into velocity, and advance position
SoA % VelX = SoA % VelX + SoA % ForceX / SoA % Mass
SoA % VelY = SoA % VelY + SoA % ForceY / SoA % Mass
SoA % VelZ = SoA % VelZ + SoA % ForceZ / SoA % Mass
SoA % PosX = SoA % PosX + SoA % VelX * dT
SoA % PosY = SoA % PosY + SoA % VelY * dT
SoA % PosZ = SoA % PosZ + SoA % VelZ * dT
end do
end subroutine AsSoA
Output on Core i7 2600K (first generation of AVX), single threaded
75.87500 22.50000 3.372222
76.25000 22.37500 3.407821
75.75000 22.25000 3.404494
Press any key to continue . . .
Code not verified for correctness, used as representative example
Code does not account for coincident particles (separation of 0.0)
Code not parallelized
This illustrates, for a system with AVX(1), that vectorization yields a 3.4x improvement in performance.
*** Edited after posting: error in unitVector calculation (used radiusSq instead of radius)
Jim Dempsey
The correction above reduced the gain to 2.23x (previously the compiler had optimized the sqrt(radiusSq) out of the code).
I haven't ported this to my AVX-512 system yet.
Jim Dempsey