FORTRAN "derived type" - code optimization

Capitaine__Julien · ‎08-02-2019

Good morning, good afternoon,

(structure of that message: context and presentation of the code, presentations of my issues, questions)

==================== CONTEXT AND PRESENTATION OF THE CODE ====================

So over the last year I have been implementing a module of particles in a fluid code (MHD to be precised). The code is parallelized with MPI. Everything is declared accordingly and working well in the current state.

My particles are defined thanks to a FORTRAN derived type, a.k.a a structure, which allows to aggregate data of different type, essentially integers and real in the present.

The justification for that choice is the following:

- The particles' positions are continuous, meaning they are not defined over points of the grid (associated to the fluid description) but can take any value; in the limit of machine precision. So the position of a particle needs to be a real obviously. (3 actually --> 3D space in the model)

- But I also need to localize the particles over the grid and so I save the indices relatively to the grid so it is faster to go and take a fluid quantity in the array and interpolate it at the position of the particle. So I am using integer type value for that.

Here is the declaration of the corresponding type in my module:

-------------------------------------------------------------

    type PIC_particles
    ! sequence
    integer           :: active
    integer           :: wrong_dt
    integer           :: nb_wrong_dt
    integer           :: tag
    integer           :: Idx
    integer           :: Idy
    integer           :: Idz
    real(dp)          :: mass
    real(dp)          :: charge
    real(dp)          :: x
    real(dp)          :: y
    real(dp)          :: z
    real(dp)          :: px
    real(dp)          :: py
    real(dp)          :: pz
    real(dp)          :: half_px
    real(dp)          :: half_py
    real(dp)          :: half_pz

-------------------------------------------------------------

===================================================================

==================== ISSUES ====================
Now, when writing the code I used explicit DO loops almost everywhere.

Let's say I am having N particles for instance, the algorithm is the following.

- DO LOOP over the number of particles N.

- IF condition to check to active feature: equal to 1 --> the particle is still existing in the simulation box

equal to 0 --> the particle does not exist and will be removed

- Doing some operations (interpolation of the e.m. field (1st order spline function), evolving the momentum of the particle, evolving the position, checking position relatively to the box boundaries)

- Going to the next particle

=============================================

==================== QUESTIONS ====================

This is what I would like to improve: replace all DO LOOP by global operations.

My questions are:

1) Would the module be faster with global operations? (seems to be good to ask first - it represents quite a work, so if there is no point in the end let's stop it here) Does "global operations" mean improved performances?

2) How does it happens in memory when doing that? Does the machine go and take the data where they are, take them somewhere else to perform a sum or a product, and then go to find the place the result is stored?

Regarding the state of the memory and 2D array operations for example, I know it is faster to write the most inner (innermost?) loop on the column and the outermost on the line because the data in the columns are contiguous and the computer does not go back and forth. (I think it is that way, but it may be the opposite)

Is there the same kind of 'trick' to take into account regarding structures? ==> How are the data stored?!

3) If I have something like some million particles (few * 10^6) in a given processor, is there a risk to overload the memory? How does the computer handle this? (knowing that MHD quantities are also taking some memory space)

4) Is there a risk that data could get mixed from one particle to another in the process? (I am a physicist and learnt computational stuff while working, I still that 'fear' when doing something new)

5) Should I create intermediary 1D arrays?

Justification of that question: when interpolating the x-component of the magnetic field for example, I am doing Bx(Idx, Idy, Idz) currently. Is it possible to do Bx( PIC(:)%Idx, PIC(:)%Idy, PIC(:)%Idz ) (PIC is the name of structure) or should I create 1D arrays Index_x, Index_y, Index_z and then do Bx( Index_x(:), Index_y(:), Index_z(:) )? (maybe a stupid question)

================================================

These last days, the more I think about it the more I have the feeling my code could be more compact and faster.

But before embarking on this work I thought it was a good idea to ask for some advices.

I am grateful to you if you read so far and I hope you will reply.

Kindly

Julien Capitaine

jimdempseyatthecove · ‎08-03-2019

>> If I have something like some million particles (few * 10^6) in a given processor...Would the module be faster with global operations... I know it is faster to write the most inner

With the quantity of particles, any your expression for the need of "faster operations", you'd be wise to structure your data to take advantage of the SIMD (Single Instruction, Multiple Data) instruction set of the modern CPU designs. These instructions are otherwise known as Vector class instructions. Current small vector sizes are SSE (2 doubles), AVX/AVX2 (4 doubles) and AVX512 (8 doubles). The requirement for use of these instructions are that the like data be adjacent in memory (there also may be an alignment requirement on the earlier CPUs that introduced these instructions).

To facilitate the compiler to use these instructions, you will have to replace your type PIC_particles, also known as Array of Structures, with the format typically used in High Performance Computing (HPC) known as Structure of Arrays:

type PIC_particlesSOA
    integer, allocatable           :: active(:) ! allocate to number of particles
    integer, allocatable            :: wrong_dt(:) ! allocate to number of particles
    integer, allocatable            :: nb_wrong_dt(:) ! allocate to number of particles
    integer, allocatable            :: tag(:) ! allocate to number of particles
    integer, allocatable            :: Idx(:) ! allocate to number of particles
    integer, allocatable            :: Idy(:) ! allocate to number of particles
    integer, allocatable            :: Idz(:) ! allocate to number of particles
    real(dp), allocatable           :: mass(:) ! allocate to number of particles
    real(dp), allocatable           :: charge(:) ! allocate to number of particles
    real(dp), allocatable           :: x(:) ! allocate to number of particles
    real(dp), allocatable           :: y(:) ! allocate to number of particles
    real(dp), allocatable           :: z(:) ! allocate to number of particles
    real(dp), allocatable           :: px(:) ! allocate to number of particles
    real(dp), allocatable           :: py(:) ! allocate to number of particles
    real(dp), allocatable           :: pz(:) ! allocate to number of particles
    real(dp), allocatable           :: half_px(:) ! allocate to number of particles
    real(dp), allocatable           :: half_py(:) ! allocate to number of particles
    real(dp), allocatable           :: half_pz(:) ! allocate to number of particles
    ...

Note, the allocation would be made to the maximum number of desired particle positions (in event a particle may split/nucleate)

For particle deletion, I suggest removing the tail-end delete-able particles, then scan for delete-able particles, and then move the last particle into the deleted particle slot (as opposed to squishing the remainder the particle arrays).

An alternative method is leave the deleted particles in place, but condition the appropriate values (e.g. position, mass), such that you avoid divide by zero. For example, each delete particle has a position equal to its array index plus a number that is outside your "box" such that you will not calculate a particle separation of 0.0. Also set the mass to 0.0 such that the particle will not contribute to the momentum, etc.. Then deletion (shuffle from tail end...) will only occur when a sufficient number of deleted particles is reached (or potentially reached within a small vector).

While Fortran is capable of using an array of indices (... intermediary 1D arrays), the efficiency in doing so is targeted at ease for the programmer as opposed for computational ease for the CPU. A little more programming effort can go a long way to reduce your wait time later.

Jim Dempsey