Solved: How to vectorize two nested fortran loops with OpenMP directive with COLLAPSE clause ?

Yann_D_ · ‎02-20-2023

Hello

I want to vectorize the two nested loops because their length is small.

The example is the following :

module mystuff

USE ISO_FORTRAN_ENV
integer :: nx
integer :: nz
real(REAL64) :: bega

contains

subroutine compute_Correction_Elem_Veloc (Accel, Forces, Displ, Veloc, V0, g1, ddt)
real(REAL64), dimension(1:nx-2, 1:nz-2, 0:1), intent(INOUT) :: Accel
real(REAL64), dimension(0:nx-1, 0:nz-1, 0:1), intent(INOUT) :: Forces
real(REAL64), dimension(1:nx-2, 1:nz-2, 0:1), intent(INOUT) :: Displ
real(REAL64), dimension(1:nx-2, 1:nz-2, 0:1), intent(INOUT) :: Veloc
real(REAL64), dimension(1:nx-2, 1:nz-2, 0:1), intent(IN)    :: V0
real(REAL64), intent(IN) :: g1, ddt
!
integer  :: ii, jj
real(REAL64) :: sc2
!
sc2 = bega * ddt
!$OMP SIMD COLLAPSE(2)
do jj = 1, nz-2
   do ii = 1, nx-2
      Veloc (ii,jj,0) = V0   (ii,jj,0) + ddt *  Forces(ii,jj,0)
      Accel (ii,jj,0) = Accel(ii,jj,0) + g1  *  Forces(ii,jj,0)
      Displ (ii,jj,0) = Displ(ii,jj,0) + sc2 * (Veloc(ii,jj,0) + V0(ii,jj,0) )
      Veloc (ii,jj,1) = V0   (ii,jj,1) + ddt *  Forces(ii,jj,1)
      Accel (ii,jj,1) = Accel(ii,jj,1) + g1  *  Forces(ii,jj,1)
      Displ (ii,jj,1) = Displ(ii,jj,1) + sc2 * (Veloc(ii,jj,1) + V0(ii,jj,1) )
   end do
end do
!
return
end subroutine compute_Correction_Elem_Veloc


end module mystuff

The compilation sequence is

ifort -c -O3 -xHost -ftz -g -traceback -implicitnone -qopt-report=5 -qopt-report-phase=vec -fpp -DGM_VEC -pad -align array64byte -c mystuff_v01.f90

The report is

Intel(R) Advisor can now assist with vectorization and show optimization
  report messages with your source code.
See "https://software.intel.com/en-us/intel-advisor-xe" for details.

Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.6.0 Build 20220226_000000

Compiler options: -c -O3 -xHost -ftz -g -traceback -implicitnone -qopt-report=5 -qopt-report-phase=vec -fpp -DGM_VEC -pad -align array64byte -c

Begin optimization report for: MYSTUFF::COMPUTE_CORRECTION_ELEM_VELOC

    Report from: Vector optimizations [vec]


LOOP BEGIN at mystuff_v01.f90(26,25)
   remark #15329: vectorization support: irregularly indexed store was emulated for the variable <veloc(ii,jj,0)>, part of index is nonlinearly computed   [ mystuff_v01.f90(25,7) ]
   remark #15328: vectorization support: irregularly indexed load was emulated for the variable <v0(ii,jj,0)>, part of index is nonlinearly computed   [ mystuff_v01.f90(25,25) ]
   remark #15328: vectorization support: irregularly indexed load was emulated for the variable <forces(ii,jj,0)>, part of index is nonlinearly computed   [ mystuff_v01.f90(25,49) ]
   remark #15329: vectorization support: irregularly indexed store was emulated for the variable <accel(ii,jj,0)>, part of index is nonlinearly computed   [ mystuff_v01.f90(26,7) ]
   remark #15328: vectorization support: irregularly indexed load was emulated for the variable <accel(ii,jj,0)>, part of index is nonlinearly computed
   remark #15328: vectorization support: irregularly indexed load was emulated for the variable <forces(ii,jj,0)>, part of index is nonlinearly computed   [ mystuff_v01.f90(26,49) ]
   remark #15329: vectorization support: irregularly indexed store was emulated for the variable <displ(ii,jj,0)>, part of index is nonlinearly computed   [ mystuff_v01.f90(27,7) ]
   remark #15328: vectorization support: irregularly indexed load was emulated for the variable <displ(ii,jj,0)>, part of index is nonlinearly computed   [ mystuff_v01.f90(27,25) ]
   remark #15328: vectorization support: irregularly indexed load was emulated for the variable <veloc(ii,jj,0)>, part of index is nonlinearly computed   [ mystuff_v01.f90(27,49) ]
   remark #15328: vectorization support: irregularly indexed load was emulated for the variable <v0(ii,jj,0)>, part of index is nonlinearly computed   [ mystuff_v01.f90(27,66) ]
   remark #15329: vectorization support: irregularly indexed store was emulated for the variable <veloc(ii,jj,1)>, part of index is nonlinearly computed   [ mystuff_v01.f90(28,7) ]
   remark #15328: vectorization support: irregularly indexed load was emulated for the variable <v0(ii,jj,1)>, part of index is nonlinearly computed   [ mystuff_v01.f90(28,25) ]
   remark #15328: vectorization support: irregularly indexed load was emulated for the variable <forces(ii,jj,1)>, part of index is nonlinearly computed   [ mystuff_v01.f90(28,49) ]
   remark #15329: vectorization support: irregularly indexed store was emulated for the variable <accel(ii,jj,1)>, part of index is nonlinearly computed   [ mystuff_v01.f90(29,7) ]
   remark #15328: vectorization support: irregularly indexed load was emulated for the variable <accel(ii,jj,1)>, part of index is nonlinearly computed   [ mystuff_v01.f90(29,25) ]
   remark #15328: vectorization support: irregularly indexed load was emulated for the variable <forces(ii,jj,1)>, part of index is nonlinearly computed   [ mystuff_v01.f90(29,49) ]
   remark #15329: vectorization support: irregularly indexed store was emulated for the variable <displ(ii,jj,1)>, part of index is nonlinearly computed   [ mystuff_v01.f90(30,7) ]
   remark #15328: vectorization support: irregularly indexed load was emulated for the variable <displ(ii,jj,1)>, part of index is nonlinearly computed   [ mystuff_v01.f90(30,25) ]
   remark #15328: vectorization support: irregularly indexed load was emulated for the variable <veloc(ii,jj,1)>, part of index is nonlinearly computed   [ mystuff_v01.f90(30,49) ]
   remark #15328: vectorization support: irregularly indexed load was emulated for the variable <v0(ii,jj,1)>, part of index is nonlinearly computed   [ mystuff_v01.f90(30,66) ]
   remark #15305: vectorization support: vector length 2
   remark #15309: vectorization support: normalized vectorization overhead 0.073
   remark #15301: OpenMP SIMD LOOP WAS VECTORIZED
   remark #15462: unmasked indexed (or gather) loads: 14 
   remark #15463: unmasked indexed (or scatter) stores: 6 
   remark #15475: --- begin vector cost summary ---
   remark #15476: scalar cost: 134 
   remark #15477: vector cost: 191.500 
   remark #15478: estimated potential speedup: 0.690 
   remark #15482: vectorized math library calls: 2 
   remark #15488: --- end vector cost summary ---
LOOP END

LOOP BEGIN at mystuff_v01.f90(26,25)
<Remainder loop for vectorization>
LOOP END
===========================================================================

If I remove the OpenMP direction OMP SIMD COLLAPSE and I recompile, the new report is the following :

Intel(R) Advisor can now assist with vectorization and show optimization
  report messages with your source code.
See "https://software.intel.com/en-us/intel-advisor-xe" for details.

Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.6.0 Build 20220226_000000

Compiler options: -c -O3 -xHost -ftz -g -traceback -implicitnone -qopt-report=5 -qopt-report-phase=vec -fpp -DGM_VEC -pad -align array64byte -c

Begin optimization report for: MYSTUFF::COMPUTE_CORRECTION_ELEM_VELOC

    Report from: Vector optimizations [vec]


LOOP BEGIN at mystuff_v00.f90(23,1)
   remark #15542: loop was not vectorized: inner loop was already vectorized

   LOOP BEGIN at mystuff_v00.f90(24,4)
   <Peeled loop for vectorization>
   LOOP END

   LOOP BEGIN at mystuff_v00.f90(24,4)
      remark #15389: vectorization support: reference veloc(ii,jj,1) has unaligned access   [ mystuff_v00.f90(28,7) ]
      remark #15389: vectorization support: reference v0(ii,jj,1) has unaligned access   [ mystuff_v00.f90(28,25) ]
      remark #15389: vectorization support: reference forces(ii,jj,1) has unaligned access   [ mystuff_v00.f90(28,49) ]
      remark #15389: vectorization support: reference veloc(ii,jj,0) has unaligned access   [ mystuff_v00.f90(25,7) ]
      remark #15389: vectorization support: reference v0(ii,jj,0) has unaligned access   [ mystuff_v00.f90(25,25) ]
      remark #15389: vectorization support: reference forces(ii,jj,0) has unaligned access   [ mystuff_v00.f90(25,49) ]
      remark #15389: vectorization support: reference accel(ii,jj,0) has unaligned access   [ mystuff_v00.f90(26,7) ]
      remark #15389: vectorization support: reference accel(ii,jj,0) has unaligned access   [ mystuff_v00.f90(26,25) ]
      remark #15389: vectorization support: reference forces(ii,jj,0) has unaligned access   [ mystuff_v00.f90(26,49) ]
      remark #15389: vectorization support: reference accel(ii,jj,1) has unaligned access   [ mystuff_v00.f90(29,7) ]
      remark #15389: vectorization support: reference accel(ii,jj,1) has unaligned access   [ mystuff_v00.f90(29,25) ]
      remark #15389: vectorization support: reference forces(ii,jj,1) has unaligned access   [ mystuff_v00.f90(29,49) ]
      remark #15389: vectorization support: reference displ(ii,jj,1) has unaligned access   [ mystuff_v00.f90(30,7) ]
      remark #15389: vectorization support: reference displ(ii,jj,1) has unaligned access   [ mystuff_v00.f90(30,25) ]
      remark #15389: vectorization support: reference veloc(ii,jj,1) has unaligned access   [ mystuff_v00.f90(30,49) ]
      remark #15389: vectorization support: reference v0(ii,jj,1) has unaligned access   [ mystuff_v00.f90(30,66) ]
      remark #15389: vectorization support: reference displ(ii,jj,0) has unaligned access   [ mystuff_v00.f90(27,7) ]
      remark #15389: vectorization support: reference displ(ii,jj,0) has unaligned access   [ mystuff_v00.f90(27,25) ]
      remark #15389: vectorization support: reference veloc(ii,jj,0) has unaligned access   [ mystuff_v00.f90(27,49) ]
      remark #15389: vectorization support: reference v0(ii,jj,0) has unaligned access   [ mystuff_v00.f90(27,66) ]
      remark #15381: vectorization support: unaligned access used inside loop body
      remark #15305: vectorization support: vector length 4
      remark #15309: vectorization support: normalized vectorization overhead 0.892
      remark #15300: LOOP WAS VECTORIZED
      remark #15442: entire loop may be executed in remainder
      remark #15450: unmasked unaligned unit stride loads: 10 
      remark #15451: unmasked unaligned unit stride stores: 6 
      remark #15475: --- begin vector cost summary ---
      remark #15476: scalar cost: 55 
      remark #15477: vector cost: 16.250 
      remark #15478: estimated potential speedup: 3.240 
      remark #15488: --- end vector cost summary ---
   LOOP END

   LOOP BEGIN at mystuff_v00.f90(24,4)
   <Remainder loop for vectorization>
   LOOP END
LOOP END
===========================================================================

If I look at the final lines of the first report, It seems to me that the estimated potential speedup with the OpenMP directive is very bad with respect to the last one. Is this normal ? What do I miss ?

Thank you for your comments.

jimdempseyatthecove · ‎02-21-2023

I made changes to your code and it now vectorizes.

main.f90 is unchanged.

module mystuff

use mytypes

real(REAL64) :: bega

contains

subroutine Correction_Elem_Veloc (Elem, bega, gam1, dt)
implicit none

type(Element), intent(INOUT) :: Elem
real(REAL64),  intent(IN)    :: bega, gam1, dt

real(REAL64) :: sc2
integer :: nx, nz

nx = Elem%ngllx
nz = Elem%ngllz

sc2 = bega * dt
CALL compute_correction_Elem_Veloc (Elem%Accel, Elem%Forces, Elem%Displ, &
                                    Elem%Veloc, Elem%V0,     gam1, dt, sc2, nx*nz)



end subroutine Correction_Elem_Veloc

    end module mystuff

! Move out of the module contained procedure
! IOW make it an external procedure
! External procedures take the passed arguments as-is
! meaning the arguments can change rank and shape (you specify the size)
! ***** Compile with interface checking off *****
    
subroutine compute_Correction_Elem_Veloc (Accel, Forces, Displ, Veloc, V0, g1, ddt, sc2, n)
use mytypes
implicit none
integer :: n
real(REAL64), dimension(n), intent(INOUT) :: Accel
real(REAL64), dimension(n), intent(INOUT) :: Forces
real(REAL64), dimension(n), intent(INOUT) :: Displ
real(REAL64), dimension(n), intent(INOUT) :: Veloc
real(REAL64), dimension(n), intent(IN)    :: V0
real(REAL64), intent(IN) :: g1, ddt, sc2

integer :: ii
!$OMP SIMD
do ii = 1, size(Accel)
     Veloc (ii) =V0   (ii) + ddt * Forces(ii)
     Accel (ii) =Accel(ii) + g1  * Forces(ii)
     Displ (ii) =Displ(ii) + sc2 * (Veloc (ii) +V0(ii) )
end do
!
return
end subroutine compute_Correction_Elem_Veloc

Using VTune, the majority of the time in compute_Correction_Elem_Veloc is in this section:
.B2.13::                        ; Preds .B2.13 .B2.12
                                ; Execution count [5.44e+00]
        vmovupd   ymm9, YMMWORD PTR [rax+r10*8]                 ;51.18
        vmovupd   ymm6, YMMWORD PTR [rdx+r10*8]                 ;51.36
        vmovupd   ymm7, YMMWORD PTR [rcx+r10*8]                 ;52.18
        vmovdqa   ymm8, ymm2                                    ;51.6
        vfmadd213pd ymm8, ymm6, ymm9                            ;51.6
        vfmadd231pd ymm7, ymm6, ymm1                            ;52.6
        vaddpd    ymm6, ymm8, ymm9                              ;53.48
        vmovupd   YMMWORD PTR [rbx+r10*8], ymm8                 ;51.6
        vmovupd   YMMWORD PTR [rcx+r10*8], ymm7                 ;52.6
        vfmadd213pd ymm6, ymm0, YMMWORD PTR [rsi+r10*8]         ;53.6
        vmovupd   YMMWORD PTR [rsi+r10*8], ymm6                 ;53.6
        add       r10, 4                                        ;50.1
        cmp       r10, r9                                       ;50.1
        jb        .B2.13        ; Prob 82%                      ;50.1

Jim Dempsey

View solution in original post

jimdempseyatthecove · ‎02-20-2023

Due to (nz-2, and nx-2):

do jj = 1, nz-2
   do ii = 1, nx-2

You have gaps in the blocks of data. You can only vectorize the inner loop. IOW these loops cannot be collapsed.

Jim Dempsey

JohnNichols · ‎02-20-2023

Jim

I fail to see why the code uses NZ-2 etc.. if this was lisp one would write a few functions to allow for the equations, there has to be a more efficient method?

jimdempseyatthecove · ‎02-20-2023

Edit,

In looking closer at your code, you have Forces dimensioned as:

real(REAL64), dimension(0:nx-1, 0:nz-1, 0:1), intent(INOUT) :: Forces

Nothing inherently wrong with that, except that your loop is iterating the in indices from 1 to nx-2 and nz-2

This means that indices 0 and nz-1 are skipped over in your do loop. This is the reason why you could not collapse the two loop with vectorization. The collapse requires the collapsed indices to be contiguous.

If you really need collapsing of these loops, then consider dimensioning Forces the same as the other arrays, and then, add the two boundary forces as a separate array.

You will have to trade off "looks odd" with "runs better".

Jim Dempsey

jimdempseyatthecove · ‎02-20-2023

Yann,

If you do take the route of separating the interior forces from the perimeter force then consider

1) In the caller (or module variables), attribute your arrays with:

!DIR$ ATTRIBUTES ALIGN : bytesOfAlignment :: YourArray

2) in your called subroutine use the compiler directive !DIR$ ASSUME_ALIGNED YourArray:bytesOfAlignment

Generally you use 64 for bytesOfAlignment

*** Caution, it is the programmers responsibility to assure the calling arguments meet the assumptions of the called subroutine.

Jim Dempsey

Yann_D_ · ‎02-21-2023

Hello Mr Dempsey,

I'm glad to read your comments !

Some time has passed since my previous thread on Intel's forums (Can't understand why fortran inner loop not vectorized).

It's the same application and your advices let me strongly improve its performances. Thank you very much for this.

Some remarks to answer your replies.

1. As the situation is a bit more complicated than I thought, I create a small program (attached below) that contains everything to mimic the way the whole application does the computations.

- a module named mytypes.f90 with the simplified definition of the derived type Element

- a module named mystuff.f90 with the computing routines

- a main program to run it, its results are meaningless

- a script compile.sh to compile the source files

- a script run.sh to execute the program

2. The dimensions of array "Forces", are not really a problem for me here, one can change them to the same as other arrays, i.e. 1:nx-2, 1:nz-2

It's done in the attached example.

3. nx and nz are less than ten (5 or 9 ), but not constants through all the spectral elements (spatial discretization method). Moreover if I put a directive like

!DIR$ LOOP COUNT (3,7)

which are the possible lengths for the loops (nx and nz are equal to 5 or 9 in my examples) the compiler does not vectorize anymore : the performances fall. So, even on short loops like these, vectorization gives a real gain. So no loop count directive. This is the reason why I try to collapse the two loops to get a longer one.

4. I try to put the two attribute directives you talked about in this example but the only place the compiler accepts the directive "!$DIR$ ASSUME_ALIGNED " is in mytypes.f90 not in the called routine. I don't manage to compile with it in mystuff.f90 . If I try, I get messages like this :

mystuff.f90(26): remark #7867: This directive is misplaced or not supported on this platform.

(I keep them in comments in mystuff.f90)

With 2. and 4. , I compile the code and in the reports, I still get lots of bad messages like

remark #15328: vectorization support: irregularly indexed load was emulated for the variable <veloc(ii,jj,1)>, part of index is nonlinearly computed [ mystuff.f90(49,49) ]

So I should have something wrong !

Regards,

jimdempseyatthecove · ‎02-20-2023

John,

Assume you have a string you wish to simulate.

Assume the end points are affixed to some structure.

Assume the dynamics of the string is simulated by synthesizing beads along the length of the string.

The forces for motion applies to the beads (indices 1:nz-2)

Yet the forces within the string extends to the two end point fixtures (the 0'th and nz-1'th indices). Thus the forces require two additional variables (in the string example). The OP had something akin to a 2D mesh.

Jim Dempsey

jimdempseyatthecove · ‎02-20-2023

BTW if the value of nx is large, then a collapse (if possible) would be of negligible value.

Jim Dempsey

jimdempseyatthecove · ‎02-21-2023

I made changes to your code and it now vectorizes.

main.f90 is unchanged.

module mystuff

use mytypes

real(REAL64) :: bega

contains

subroutine Correction_Elem_Veloc (Elem, bega, gam1, dt)
implicit none

type(Element), intent(INOUT) :: Elem
real(REAL64),  intent(IN)    :: bega, gam1, dt

real(REAL64) :: sc2
integer :: nx, nz

nx = Elem%ngllx
nz = Elem%ngllz

sc2 = bega * dt
CALL compute_correction_Elem_Veloc (Elem%Accel, Elem%Forces, Elem%Displ, &
                                    Elem%Veloc, Elem%V0,     gam1, dt, sc2, nx*nz)



end subroutine Correction_Elem_Veloc

    end module mystuff

! Move out of the module contained procedure
! IOW make it an external procedure
! External procedures take the passed arguments as-is
! meaning the arguments can change rank and shape (you specify the size)
! ***** Compile with interface checking off *****
    
subroutine compute_Correction_Elem_Veloc (Accel, Forces, Displ, Veloc, V0, g1, ddt, sc2, n)
use mytypes
implicit none
integer :: n
real(REAL64), dimension(n), intent(INOUT) :: Accel
real(REAL64), dimension(n), intent(INOUT) :: Forces
real(REAL64), dimension(n), intent(INOUT) :: Displ
real(REAL64), dimension(n), intent(INOUT) :: Veloc
real(REAL64), dimension(n), intent(IN)    :: V0
real(REAL64), intent(IN) :: g1, ddt, sc2

integer :: ii
!$OMP SIMD
do ii = 1, size(Accel)
     Veloc (ii) =V0   (ii) + ddt * Forces(ii)
     Accel (ii) =Accel(ii) + g1  * Forces(ii)
     Displ (ii) =Displ(ii) + sc2 * (Veloc (ii) +V0(ii) )
end do
!
return
end subroutine compute_Correction_Elem_Veloc

Using VTune, the majority of the time in compute_Correction_Elem_Veloc is in this section:
.B2.13::                        ; Preds .B2.13 .B2.12
                                ; Execution count [5.44e+00]
        vmovupd   ymm9, YMMWORD PTR [rax+r10*8]                 ;51.18
        vmovupd   ymm6, YMMWORD PTR [rdx+r10*8]                 ;51.36
        vmovupd   ymm7, YMMWORD PTR [rcx+r10*8]                 ;52.18
        vmovdqa   ymm8, ymm2                                    ;51.6
        vfmadd213pd ymm8, ymm6, ymm9                            ;51.6
        vfmadd231pd ymm7, ymm6, ymm1                            ;52.6
        vaddpd    ymm6, ymm8, ymm9                              ;53.48
        vmovupd   YMMWORD PTR [rbx+r10*8], ymm8                 ;51.6
        vmovupd   YMMWORD PTR [rcx+r10*8], ymm7                 ;52.6
        vfmadd213pd ymm6, ymm0, YMMWORD PTR [rsi+r10*8]         ;53.6
        vmovupd   YMMWORD PTR [rsi+r10*8], ymm6                 ;53.6
        add       r10, 4                                        ;50.1
        cmp       r10, r9                                       ;50.1
        jb        .B2.13        ; Prob 82%                      ;50.1

Jim Dempsey

jimdempseyatthecove · ‎02-21-2023

OOPS

Change:

CALL compute_correction_Elem_Veloc (Elem%Accel, Elem%Forces, Elem%Displ, &
                                    Elem%Veloc, Elem%V0,     gam1, dt, sc2, nx*nz)

to

CALL compute_correction_Elem_Veloc (Elem%Accel, Elem%Forces, Elem%Displ, &
                                    Elem%Veloc, Elem%V0,     gam1, dt, sc2, size(Elem%Accel))

Jim Dempsey

jimdempseyatthecove · ‎02-21-2023

I used 9,9 for input and ran the main loop 30 billion times (then stopped VTune run after about 30 seconds).

I did modify you main to loop the call for timing purposes.

Jim Dempsey

JohnNichols · ‎02-21-2023

Jim

I understand your points, the real issue is this is a computationally time intensive analysis. You will seek to speed it up, but even the fastest horse in the world can only run so fast. Sometimes you need a chevy.

Good improvement, perhaps a small Honda civic.

John

Yann_D_ · ‎02-22-2023

Thank you for your replies.

I know the trick of changing a 3D array to a 1D array when passing it as an argument to get a longer loop but I thought there were a way to do the job without it.

So I will implement your idea in the real application and see what I get back !

Thank you.

jimdempseyatthecove · ‎02-22-2023

Would I be correct that your application will have a large number of these elements?

If this be so, then consider:

type Elements
  ...
  real(REAL64) :: dimension(:,:,:,:) :: Forces ! allocate( Forces(nx, nz, 2, nElements) )
...

IOW the entire blob of forces (etc...) can be worked upon in a single call.

This will reduce the number of calls and more importantly, reduce the number of times the code performs the alignment checks and peel preamble, and post-able remainder to once for nElements.

Jim Dempsey

Yann_D_ · ‎02-23-2023

The number of elements can be quite large but the treatment is not the same for all elements. It's an unstructured mesh. The boundary elements are treated differently. And there are some other kind of elements too. Maybe a reordering may help but it implies lots of modifications in different parts of the source code. I'll talk about this to the other developpers.

Regards,

Yann

jimdempseyatthecove · ‎02-23-2023

Yann,

With some expansion of your simplified test program, you can model the step integration for a representative number of elements. The same model would contain the two or more different structures and implementation for your elements. Making timing test runs will give you a metric for a cost/benefit ratio.

>>treatment is not the same for all elements.

When the treatment is static per element, then you can have different classes of elements, each executed using their specific treatment.

When the treatment is dynamic per element, but the frequency and degree of change in treatment is small, then you may be able to employ a technique of incorporating "NULL" elements in each class. IOW elements that are computed, but have no effect on the results. To move an element from one class to another, you nullify it in the current class and then overlay a NULL element in the destination class. This technique permits the arrays to be allocated once (with spare entries of NULLs for element movement) or have a minimal amount of reallocation.

You have to be careful about division by 0.0. Depending on your model, this might involve making the mass or charge 0.0 or other such property that facilitates no net change in results while avoiding an IF test and branch. The idea is to get the most out of SIMD execution.

If you need some consulting help, I am available.

Jim Dempsey