Solved: Re: Can't understand why fortran inner loop not vectorized

Yann_D_ · ‎11-10-2022

Hello,

I try to vectorize a Fortran application. I look at the oneAPI ifort report for vectorization and I don't understand why it can't vectorize loops like the following example.

The if then else endif construct exists only to show that even if the different instructions are treated in different loops , there is quite no vectorization.

module selement
implicit none
!---------------------------------------
type :: simu_element
!---------------------------------------
   doubleprecision, dimension(:,:,:), pointer :: Forces
   doubleprecision, dimension(:,:,:), pointer :: Veloc
   doubleprecision, dimension(:,:,:), pointer :: Displ
   doubleprecision, dimension(:,:,:), pointer :: Accel
   doubleprecision, dimension(:,:,:), pointer :: V0
!!---------------------------------------
end type simu_element
!---------------------------------------

!---------------------------------------
type :: element
!--------------------------------------
integer :: ngllx, ngllz
integer :: nbsimu
type(simu_element), dimension(:), pointer :: simu
!---------------------------------------
end type element
!---------------------------------------

contains 

! ###########################################################
subroutine Prediction_Elem_Veloc (Elem,alpha,bega, dt)
implicit none

type (Element), intent (INOUT) :: Elem
double precision, intent (IN) :: bega, dt, alpha
double precision ::  sc1, sc2
integer :: nx, nz
integer :: is, ii, jj, kk
double precision, dimension(Elem%ngllx-2) :: tab1
IF (dt > 0.0) then

     sc1 = alpha * dt
     sc2 = alpha * dt**2 * (0.5d0 - bega)
     nx = Elem%ngllx ; nz = Elem%ngllz
     do is = 0, Elem%nbsimu-1
        do kk = 0, 1
           do jj = 1, nz-2
              do ii = 1, nx-2
                 tab1(ii) = Elem%simu(is)%Accel (ii,jj,kk)
              end do
              do ii = 1, nx-2
                 Elem%simu(is)%Forces(ii,jj,kk) = Elem%simu(is)%Displ(ii,jj,kk) + sc1 * Elem%simu(is)%Veloc(ii,jj,kk) + sc2 * tab1(ii)
              end do
              do ii = 1, nx-2
                 Elem%simu(is)%V0    (ii,jj,kk) = Elem%simu(is)%Veloc(ii,jj,kk)
              end do
              do ii = 1, nx-2
                 Elem%simu(is)%Accel (ii,jj,kk) = 0.d0
              end do
           end do
        end do
     end do
else
     sc1 = alpha * dt
     sc2 = alpha * dt**2 * (0.5d0 - bega)
     nx = Elem%ngllx ; nz = Elem%ngllz
     do is = 0, Elem%nbsimu-1
        do kk = 0, 1
           do jj = 1, nz-2
              do ii = 1, nx-2
                 tab1(ii) = Elem%simu(is)%Accel (ii,jj,kk)
              end do
              do ii = 1, nx-2
                 Elem%simu(is)%Forces(ii,jj,kk) = Elem%simu(is)%Displ(ii,jj,kk) + sc1 * Elem%simu(is)%Veloc(ii,jj,kk) + sc2 * tab1(ii)
                 Elem%simu(is)%V0    (ii,jj,kk) = Elem%simu(is)%Veloc(ii,jj,kk)
                 Elem%simu(is)%Accel (ii,jj,kk) = 0.d0
              end do
           end do
        end do
     end do
end if

return
end subroutine Prediction_Elem_Veloc

end module selement

The compilation sequence is

ifort -c -O3 -xCORE-AVX2 -ftz -g -traceback -qopt-report=5 -qopt-report-phase=vec  Element.f90

And the optimization report is

Intel(R) Advisor can now assist with vectorization and show optimization
  report messages with your source code.
See "https://software.intel.com/en-us/intel-advisor-xe" for details.

Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.6.0 Build 20220226_000000

Compiler options: -c -O3 -xCORE-AVX2 -ftz -g -traceback -qopt-report=5 -qopt-report-phase=vec

Begin optimization report for: SELEMENT::PREDICTION_ELEM_VELOC

    Report from: Vector optimizations [vec]


LOOP BEGIN at Element.f90(42,6)
   remark #15542: loop was not vectorized: inner loop was already vectorized

   LOOP BEGIN at Element.f90(45,15)
      remark #15542: loop was not vectorized: inner loop was already vectorized

      LOOP BEGIN at Element.f90(44,12)
         remark #15542: loop was not vectorized: inner loop was already vectorized

         LOOP BEGIN at Element.f90(45,15)
            remark #15388: vectorization support: reference tab1(ii) has aligned access   [ Element.f90(46,18) ]
            remark #15328: vectorization support: non-unit strided load was emulated for the variable <elem(is,ii,jj,kk)>, stride is unknown to compiler   [ Element.f90(46,18
) ]
            remark #15305: vectorization support: vector length 2
            remark #15300: LOOP WAS VECTORIZED
            remark #15449: unmasked aligned unit stride stores: 1 
            remark #15452: unmasked strided loads: 1 
            remark #15475: --- begin vector cost summary ---
            remark #15476: scalar cost: 4 
            remark #15477: vector cost: 2.500 
            remark #15478: estimated potential speedup: 1.600 
            remark #15488: --- end vector cost summary ---
         LOOP END

         LOOP BEGIN at Element.f90(45,15)
         <Remainder loop for vectorization>
         LOOP END

         LOOP BEGIN at Element.f90(48,15)
            remark #15344: loop was not vectorized: vector dependence prevents vectorization
            remark #15346: vector dependence: assumed FLOW dependence between elem(is,ii,jj,kk) (49:18) and elem(is,ii,jj,kk) (49:18)
            remark #15346: vector dependence: assumed ANTI dependence between elem(is,ii,jj,kk) (49:18) and elem(is,ii,jj,kk) (49:18)
         LOOP END

         LOOP BEGIN at Element.f90(51,15)
            remark #15344: loop was not vectorized: vector dependence prevents vectorization
            remark #15346: vector dependence: assumed FLOW dependence between elem(is,ii,jj,kk) (52:18) and elem(is,ii,jj,kk) (52:18)
            remark #15346: vector dependence: assumed ANTI dependence between elem(is,ii,jj,kk) (52:18) and elem(is,ii,jj,kk) (52:18)
         LOOP END

         LOOP BEGIN at Element.f90(54,15)
            remark #15335: loop was not vectorized: vectorization possible but seems inefficient. Use vector always directive or -vec-threshold0 to override 
            remark #15329: vectorization support: non-unit strided store was emulated for the variable <elem(is,ii,jj,kk)>, stride is unknown to compiler   [ Element.f90(55,1
8) ]
            remark #15305: vectorization support: vector length 2
            remark #15399: vectorization support: unroll factor set to 4
            remark #15453: unmasked strided stores: 1 
            remark #15475: --- begin vector cost summary ---
            remark #15476: scalar cost: 3 
            remark #15477: vector cost: 3.000 
            remark #15478: estimated potential speedup: 1.000 
            remark #15488: --- end vector cost summary ---
         LOOP END

         LOOP BEGIN at Element.f90(54,15)
         <Remainder>
         LOOP END
      LOOP END
   LOOP END
LOOP END

LOOP BEGIN at Element.f90(64,6)
   remark #15542: loop was not vectorized: inner loop was already vectorized

   LOOP BEGIN at Element.f90(67,15)
      remark #15542: loop was not vectorized: inner loop was already vectorized

      LOOP BEGIN at Element.f90(66,12)
         remark #15542: loop was not vectorized: inner loop was already vectorized

         LOOP BEGIN at Element.f90(67,15)
         <Multiversioned v1>
            remark #15388: vectorization support: reference tab1(ii) has aligned access   [ Element.f90(68,18) ]
            remark #15389: vectorization support: reference elem(is,ii,jj,kk) has unaligned access   [ Element.f90(68,18) ]
            remark #15381: vectorization support: unaligned access used inside loop body
            remark #15305: vectorization support: vector length 4
            remark #15309: vectorization support: normalized vectorization overhead 0.750
            remark #15300: LOOP WAS VECTORIZED
            remark #15449: unmasked aligned unit stride stores: 1 
            remark #15450: unmasked unaligned unit stride loads: 1 
            remark #15475: --- begin vector cost summary ---
            remark #15476: scalar cost: 4 
            remark #15477: vector cost: 1.000 
            remark #15478: estimated potential speedup: 3.960 
            remark #15488: --- end vector cost summary ---
         LOOP END

         LOOP BEGIN at Element.f90(67,15)
         <Remainder loop for vectorization, Multiversioned v1>
         LOOP END

         LOOP BEGIN at Element.f90(67,15)
         <Multiversioned v2>
            remark #15388: vectorization support: reference tab1(ii) has aligned access   [ Element.f90(68,18) ]
            remark #15328: vectorization support: non-unit strided load was emulated for the variable <elem(is,ii,jj,kk)>, stride is unknown to compiler   [ Element.f90(68,18
) ]
            remark #15305: vectorization support: vector length 2
            remark #15300: LOOP WAS VECTORIZED
            remark #15449: unmasked aligned unit stride stores: 1 
            remark #15452: unmasked strided loads: 1 
            remark #15475: --- begin vector cost summary ---
            remark #15476: scalar cost: 4 
            remark #15477: vector cost: 2.500 
            remark #15478: estimated potential speedup: 1.600 
            remark #15488: --- end vector cost summary ---
         LOOP END

         LOOP BEGIN at Element.f90(67,15)
         <Remainder loop for vectorization, Multiversioned v2>
         LOOP END

         LOOP BEGIN at Element.f90(70,15)
            remark #15344: loop was not vectorized: vector dependence prevents vectorization
            remark #15346: vector dependence: assumed OUTPUT dependence between elem(is,ii,jj,kk) (71:18) and elem(is,ii,jj,kk) (73:18)
            remark #15346: vector dependence: assumed OUTPUT dependence between elem(is,ii,jj,kk) (73:18) and elem(is,ii,jj,kk) (71:18)
         LOOP END
      LOOP END
   LOOP END
LOOP END
===========================================================================

Thank you for your advices.

Regards,

jimdempseyatthecove · ‎11-11-2022

At issue here (IMHO) is because you are using POINTER as opposed to allocatable, combined with 1+3 indices being involved, that the compiler optimization code had too many things to keep track of.

In particular, being POINTERs, ...Forces, could just as well point to ...Displ and/or ...Elem.

Do you really need POINTER as opposed to ALLOCATABLE?

As an alternative, try simplifying the work of the optimization code through use of ASSOCIATE. For example:

associate(Forces=>Elem%simu(is)%Forces, Displ=>Elem%simu(is)%Displ, Veloc=>Elem%simu(is)%Veloc)
  do ii = 1, nx-2
    Forces(ii,jj,kk) = Displ(ii,jj,kk) + sc1 * Veloc(ii,jj,kk) + sc2 * tab1(ii)
  end do
end associate

And similar changes for other loops.

Note, at times when associate does not work around the optimization issue, then resort to using a contained procedure.

subroutine Prediction_Elem_Veloc
  ...
  ! do ii = 1, nx-2
  !   Elem%simu(is)%Forces(ii,jj,kk) = Elem%simu(is)%Displ(ii,jj,kk) + sc1 * Elem%simu(is)%Veloc(ii,jj,kk) + sc2 * tab1(ii)
  ! end do
  call CalcForces(Elem%simu(is)%Forces, Elem%simu(is)%Displ, Elem%simu(is)%Veloc)
  ...
  return
  contains
  subroutine CalcForces(Forces, Displ, Veloc)
    implicit none
    doubleprecision, dimension(:,:,:) :: Forces ! without pointer
    doubleprecision, dimension(:,:,:) :: Veloc
    doubleprecision, dimension(:,:,:) :: Displ
    ! Note, ii, jj, kk, nx2, sc1, sc2, and tab1 are declared via host association
    do ii = 1, nx-2
      Forces(ii,jj,kk) = Displ(ii,jj,kk) + sc1 * Veloc(ii,jj,kk) + sc2 * tab1(ii)
    end do
  end subroutine CalcForces
  ...
end subroutine Prediction_Elem_Veloc

Jim Dempsey

View solution in original post

jimdempseyatthecove · ‎11-11-2022

At issue here (IMHO) is because you are using POINTER as opposed to allocatable, combined with 1+3 indices being involved, that the compiler optimization code had too many things to keep track of.

In particular, being POINTERs, ...Forces, could just as well point to ...Displ and/or ...Elem.

Do you really need POINTER as opposed to ALLOCATABLE?

As an alternative, try simplifying the work of the optimization code through use of ASSOCIATE. For example:

associate(Forces=>Elem%simu(is)%Forces, Displ=>Elem%simu(is)%Displ, Veloc=>Elem%simu(is)%Veloc)
  do ii = 1, nx-2
    Forces(ii,jj,kk) = Displ(ii,jj,kk) + sc1 * Veloc(ii,jj,kk) + sc2 * tab1(ii)
  end do
end associate

And similar changes for other loops.

Note, at times when associate does not work around the optimization issue, then resort to using a contained procedure.

subroutine Prediction_Elem_Veloc
  ...
  ! do ii = 1, nx-2
  !   Elem%simu(is)%Forces(ii,jj,kk) = Elem%simu(is)%Displ(ii,jj,kk) + sc1 * Elem%simu(is)%Veloc(ii,jj,kk) + sc2 * tab1(ii)
  ! end do
  call CalcForces(Elem%simu(is)%Forces, Elem%simu(is)%Displ, Elem%simu(is)%Veloc)
  ...
  return
  contains
  subroutine CalcForces(Forces, Displ, Veloc)
    implicit none
    doubleprecision, dimension(:,:,:) :: Forces ! without pointer
    doubleprecision, dimension(:,:,:) :: Veloc
    doubleprecision, dimension(:,:,:) :: Displ
    ! Note, ii, jj, kk, nx2, sc1, sc2, and tab1 are declared via host association
    do ii = 1, nx-2
      Forces(ii,jj,kk) = Displ(ii,jj,kk) + sc1 * Veloc(ii,jj,kk) + sc2 * tab1(ii)
    end do
  end subroutine CalcForces
  ...
end subroutine Prediction_Elem_Veloc

Jim Dempsey

Yann_D_ · ‎11-15-2022

Hello Jim,

thank you for your help.

This code was developed some years ago. At the early 2000, "only POINTER was available for derived types" told me the main programmer of this code. But you're right, now, one can replace POINTER by ALLOCATABLE.

So I begin to test ...

1. ALLOCATABLE alone, doesn't change anything, the computing loop is not vectorized.

2. Next, I try with the ASSOCIATE construction like in your proposal but the computing loop is still not vectorized.

3. So, finally, I write a contained procedure, but at a higher level :

     do is = 0, Elem%nbsimu-1
        CALL compute_Prediction_Elem_Veloc_2 (Elem%ngllx, Elem%ngllz, Elem%simu(is)%Accel, Elem%simu(is)%Forces, Elem%simu(is)%Displ, Elem%simu(is)%Veloc, Elem%simu(is)%V0, alpha, dt, bega)
     end do

contains

subroutine compute_Prediction_Elem_Veloc (nx, nz, Accel, Forces, Displ, Veloc, V0, alpha, dt, bega)
integer,  intent(in)  ::  nx, nz
double precision, dimension(1:nx-2, 1:nz-2, 0:1),  intent(INOUT)  ::  Accel
double precision, dimension(0:nx-1, 0:nz-1, 0:1),  intent(OUT)    ::  Forces
double precision, dimension(1:nx-2, 1:nz-2, 0:1),  intent(IN)     ::  Displ
double precision, dimension(1:nx-2, 1:nz-2, 0:1),  intent(IN)     ::  Veloc
double precision, dimension(1:nx-2, 1:nz-2, 0:1),  intent(OUT)    ::  V0
double precision, intent (IN) :: bega, dt, alpha

double precision, dimension(nx-2) :: tab1
integer ::  ii, jj, kk
double precision ::  sc1, sc2
     sc1 = alpha * dt
     sc2 = alpha * dt**2 * (0.5d0 - bega)
     do kk = 0, 1
        do jj = 1, nz-2
           do ii = 1, nx-2
              tab1(ii) = Accel (ii,jj,kk)
           end do
           do ii = 1, nx-2
              Forces(ii,jj,kk) = Displ(ii,jj,kk) + sc1 * Veloc(ii,jj,kk) + sc2 * tab1(ii)
              V0    (ii,jj,kk) = Veloc(ii,jj,kk)
              Accel (ii,jj,kk) = 0.d0
           end do
        end do
     end do
return
end subroutine compute_Prediction_Elem_Veloc

And the optimization report tells us it's vectorized ! Good news.

         LOOP BEGIN at Element.f90(104,12) inlined into Element.f90(43,14)
            remark #15389: vectorization support: reference forces(ii,jj,kk) has unaligned access   [ Element.f90(105,15) ]
            remark #15389: vectorization support: reference displ(ii,jj,kk) has unaligned access   [ Element.f90(105,34) ]
            remark #15389: vectorization support: reference veloc(ii,jj,kk) has unaligned access   [ Element.f90(105,58) ]
            remark #15388: vectorization support: reference tab1(ii) has aligned access   [ Element.f90(105,82) ]
            remark #15389: vectorization support: reference v0(ii,jj,kk) has unaligned access   [ Element.f90(106,15) ]
            remark #15389: vectorization support: reference veloc(ii,jj,kk) has unaligned access   [ Element.f90(106,15) ]
            remark #15389: vectorization support: reference accel(ii,jj,kk) has unaligned access   [ Element.f90(107,15) ]
            remark #15381: vectorization support: unaligned access used inside loop body
            remark #15305: vectorization support: vector length 4
            remark #15399: vectorization support: unroll factor set to 4
            remark #15309: vectorization support: normalized vectorization overhead 0.170
            remark #15300: LOOP WAS VECTORIZED
            remark #15448: unmasked aligned unit stride loads: 1 
            remark #15450: unmasked unaligned unit stride loads: 2 
            remark #15451: unmasked unaligned unit stride stores: 3 
            remark #15475: --- begin vector cost summary ---
            remark #15476: scalar cost: 20 
            remark #15477: vector cost: 6.250 
            remark #15478: estimated potential speedup: 2.910 
            remark #15488: --- end vector cost summary ---
         LOOP END

So I think I will continue in this direction.

I have an other question related to these nested loops :

This application is based on a spatial discretization done with spectral elements. So nx and nz are less than 10.

=> Is there a way to tell this to the compiler ?

=> Can something be done to fuse the two most internal loops (ii, jj) into a longer one ?

Regards,

jimdempseyatthecove · ‎11-11-2022

Edit. Also try:

associate(Forces=>Elem%simu(is)%Forces(:,jj,kk), Displ=>Elem%simu(is)%Displ(:,jj,kk) Veloc=>Elem%simu(is)%Veloc(:,jj,kk))
  do ii = 1, nx-2
    Forces(ii) = Displ(ii) + sc1 * Veloc(ii) + sc2 * tab1(ii)
  end do
end associate

And:

subroutine Prediction_Elem_Veloc
  ...
  ! do ii = 1, nx-2
  !   Elem%simu(is)%Forces(ii,jj,kk) = Elem%simu(is)%Displ(ii,jj,kk) + sc1 * Elem%simu(is)%Veloc(ii,jj,kk) + sc2 * tab1(ii)
  ! end do
  call CalcForces(Elem%simu(is)%Forces(:,jj,kk), Elem%simu(is)%Displ(:,jj,kk), Elem%simu(is)%Veloc(:,jj,kk))
  ...
  return
  contains
  subroutine CalcForces(Forces, Displ, Veloc)
    implicit none
    doubleprecision, dimension(:) :: Forces
    doubleprecision, dimension(:) :: Veloc
    doubleprecision, dimension(:) :: Displ
    ! Note, ii, nx2, sc1, sc2, and tab1 are declared via host association
    do ii = 1, nx-2
      Forces(ii) = Displ(ii) + sc1 * Veloc(ii) + sc2 * tab1(ii)
    end do
  end subroutine CalcForces
  ...
end subroutine Prediction_Elem_Veloc

Jim Dempsey

jimdempseyatthecove · ‎11-15-2022

=> Is there a way to tell this to the compiler ?

LOOP COUNT (intel.com)

!dir$ loop count max(10)

=> Can something be done to fuse the two most internal loops (ii, jj) into a longer one ?

*** Why is Forces 0-based indexed when the others are 1-based?

IOW Forces(ii,jj,kk) = Displ(ii,jj,kk) + sc1 * Veloc(ii,jj,kk) + sc2 * tab1(ii)

does not compute Forces(0,jj,kk)

? Is size(Forces) == size(Accel) == size(Displ) == size(Veloc) == size(tab1) == size(V0) ??

If so, have the contained procedure receive the 3D arrays as 1D

sketch:

 
    ...
    call compute_Prediction_Elem_Veloc(size(Forces), Accel, Forces, Displ, Veloc, V0, alpha, dt, bega)
    ...
contains
subroutine compute_Prediction_Elem_Veloc(n, Accel, Forces, Displ, Veloc, V0, alpha, dt, bega)
    integer,  intent(in)  ::  n
    double precision, dimension(n),  intent(INOUT)  ::  Accel
    double precision, dimension(n),  intent(OUT)    ::  Forces
    double precision, dimension(n),  intent(IN)     ::  Displ
    double precision, dimension(n),  intent(IN)     ::  Veloc
    double precision, dimension(n),  intent(OUT)    ::  V0
    double precision, intent (IN) :: bega, dt, alpha

    integer ::  i
    double precision ::  sc1, sc2
    sc1 = alpha * dt
    sc2 = alpha * dt**2 * (0.5d0 - bega)
    do i=1,n
        Forces(i) = Displ(i) + sc1 * Veloc(i) + sc2 * Accel(i)
        V0    (i) = Veloc(i)
        Accel (i) = 0.d0
    end do
    return
end subroutine compute_Prediction_Elem_Veloc

*** the above requires each of the arrays to be allocatable (contiguous) and same sizes.

Also, there is no need for array tab1

Jim Dempsey

Yann_D_ · ‎11-15-2022

Forces has different bounds than the other arrays, because it needs some "ghost cells" around for other computations elsewhere in the application. But we will examine the reverse : see if we can increase the loops and extend the other arrays with zero filled cells to the size of Forces. It depends where the "ghost cells" of Forces are used and can be overwritten.

It's up to me now.

your advices are very appreciated.

Thank you.

jimdempseyatthecove · ‎11-15-2022

>>Force...some "ghost cells" ...

Wouldn't this apply to the other cells as well?

IOW if you place "ghost cells" at the same indicies as those in Force into the other arrays, then you could perform the integration inclusive of the "ghost cells". While this may have unnecessary computation (integrations) of cells, it may recover more performance due to keeping within a single vectored loop. IOW vector over the top of the "ghost cells" thus eliminating start/stop sections of disjointed loops.

Jim Dempsey

TobiasK · ‎11-16-2022

Very nice solution from Jim, thanks!

In addition to the solution a few more ideas to consider:

Generally pointers are bad because because:

1) they are allowed to point to the same memory locations, or just partially overlap

2) the stride between the elements is unknown - here you could add the contiguous attribute to at least ensure that the stride is 1, same holds for assumed shape (:) arrays declared in an interface

There are cases where pointers are beneficial, e.g. if you want to avoid workspace allocations and implement your own workspace management. In the example here, you could actually avoid a copy by swapping the pointers V0 und VELOC instead of copying VELOC to V0 if you do not need two separate copies of VELOC. With allocatables you can achieve this by adding another dimension in VELOC and swap between the dimensions with something like this:

...

integer :: veloc_sel,v0_sel

veloc_sel=mod(timestep,2)+1

v0_sel=mod(timestep-1,2)+1

veloc(:,:,:,vsel) ==veloc

veloc(:,:,:,v0_sel)==v0

...

The trick proposed to move the computational stuff into a separate routine works because in Fortran allocatable and pointer are attributes which you can remove by passing the arrays to a subroutine. If the subroutine is small enough the compiler will inline the function call and there is practically no overhead. This is what happens in your code.

Per definition, the subroutine must not be called with overlapping arguments, so 1) is solved implicitly and 2) is usually solved by using assumed size (*) or explicit shape (n) declarations. Assumed shape (:) still allows for non-unit stride between the elements, and usually the compiler implements a runtime check if the stride is =1 or greater. To remove that implicitly generated code, assumed shape arrays can be declared as contiguous just as pointers.

The one drawback of using an subroutine call and removing the pointer / allocatable attribute is that the bounds are reset to lbound=1. e.g. if your forces array has the lower bound=0 and you simply pass the whole array the lower bound inside the subroutine is again 1 and the upper bound is increased by 1! To be clear, assumed shape arrays (:) inside a function will always have lbound=1.

...

module fun

contains

subroutine print_bounds(a)

integer, intent(in),contiguous :: a(:)

write(*,*) 'function side',lbound(a),ubound(a)

end subroutine print_bounds

end module fun

program bounds

use fun

integer :: a(-15:11)

write(*,*) 'call site',lbound(a),ubound(a)

call print_bounds(a)

end program bounds

...

If you know, nx and nz are never larger than 10 and this also corresponds to the arrays, I would consider

...

type :: simu_element

!---------------------------------------

doubleprecision, dimension(8,8,0:1) :: Forces

doubleprecision, dimension(8,8,0:1) :: Veloc

doubleprecision, dimension(8,8,0:1) :: Displ

doubleprecision, dimension(8,8,0:1) :: Accel

doubleprecision, dimension(8,8,0:1) :: V0

!!---------------------------------------

end type simu_element

...

if it does not waste a large amount of memory.

Another advice: try to always use the contiguous attribute when defining a interface and using assumed shape (:) arrays. Otherwise the compiler has to include runtime checks for the stride between the elements of the array.

Best

Tobias

Yann_D_ · ‎11-16-2022

Hello Tobias and thank you for your advices.

Some comments:

- I have removed all pointer attributes that I can in the application.

- I use explicite bounds declarations to be more readable and understandable, especially when there are lbounds different of 1. So, here, I can't use the contiguous attribute but I will try to remember it in case of assumed shape.

- I know and use the trick of the swapping pointers instead of copying arrays, especially in iterative methods but the loop above is a misleading example for the use of VELOC and V0. In other routines, they are treated differently.

- Concerning the size of the arrays, most of the runs use few grid points for the spatial discretization but it may happen that nx and nz are as large as 35.

Regards,

jimdempseyatthecove · ‎11-16-2022

Tobias,

>>you could actually avoid a copy by swapping the pointers V0 und VELOC instead of copying VELOC to V0...

I believe in this case it is required to have two copies of the velocity: current velocity and predicted velocity at next time step.

The two velocities are then integrated using a non-Euler method. IOW when multi-step method is used (Euler is One-Step method).

The other tips are noteworthy.

IMHO compiler optimization missed opportunities are greatly reduced when you simplify the code. In this case treating the multi-dimensional array as a single dimension array. Note, this works in this case due to the algorithm (code) being applied. This may not be suitable in all cases.

Jim Dempsey

Yann_D_ · ‎11-16-2022

So, I continue my optimization work. I write small contained subroutines. I'm facing a strange situation : the compiler finds flow dependences between array's elements and intent(in) scalar ! Below, there are some extracts of the files, but I put the source file and the report in attachment.

In the main subroutine I call the contained subroutine like this :

  nx = Elem%ngllx ; nz = Elem%ngllz
  do is = 0, Elem%nbsimu-1
     CALL compute_correction_Elem_Veloc (Elem%simu(is)%Accel, Elem%simu(is)%Forces, Elem%simu(is)%Displ, &
                                         Elem%simu(is)%Veloc, Elem%simu(is)%V0,     Elem%Massmat)
     CALL compute_correction_Elem_Veloc_2 (Elem%simu(is)%Accel, Elem%simu(is)%Forces, Elem%simu(is)%Displ, &
                                           Elem%simu(is)%Veloc, Elem%simu(is)%V0,     Elem%Massmat)
     CALL compute_correction_Elem_Veloc_orig (Elem%simu(is)%Accel, Elem%simu(is)%Forces, Elem%simu(is)%Displ, &
                                              Elem%simu(is)%Veloc, Elem%simu(is)%V0,     Elem%Massmat)
  end do    ! of do is = 0, Elem%nbsimu-1
!
return

There are three call to different routines ; it's just to show the different versions I tried and the analysis of the compiler. In the real application there will be only one, the one that works.

First, the call to the original algorithm

subroutine compute_Correction_Elem_Veloc_orig (Accel, Forces, Displ, Veloc, V0, Massmat)
double precision, dimension(1:nx-2, 1:nz-2, 0:1),  intent(INOUT)  :: Accel
double precision, dimension(0:nx-1, 0:nz-1, 0:1),  intent(INOUT)  :: Forces
double precision, dimension(1:nx-2, 1:nz-2),       intent(IN)     :: MassMat
double precision, dimension(1:nx-2, 1:nz-2, 0:1),  intent(INOUT)  :: Displ
double precision, dimension(1:nx-2, 1:nz-2, 0:1),  intent(INOUT)  :: Veloc
double precision, dimension(1:nx-2, 1:nz-2, 0:1),  intent(IN)     :: V0
!
integer ::  ii, jj, kk
  do kk = 0, 1
     do jj = 1, nz-2
        do ii = 1, nx-2
           Forces(ii,jj,kk) = MassMat(ii,jj)         * Forces(ii,jj,kk)
           Veloc (ii,jj,kk) = V0     (ii,jj,kk) + dt * Forces(ii,jj,kk)
           Accel (ii,jj,kk) = Accel  (ii,jj,kk) + gam1 / dt * (Veloc(ii,jj,kk) - V0(ii,jj,kk) )
           Displ (ii,jj,kk) = Displ  (ii,jj,kk) + bega * dt * (Veloc(ii,jj,kk) + V0(ii,jj,kk) )
        end do
     end do
  end do
!
return
end subroutine compute_Correction_Elem_Veloc_orig

There are dependences, so no immediate vectorization is possible

I don't find a solution without a temporary array, so here I I propose :

subroutine compute_Correction_Elem_Veloc (Accel, Forces, Displ, Veloc, V0, Massmat)
double precision, dimension(1:nx-2, 1:nz-2, 0:1),  intent(INOUT)  :: Accel
double precision, dimension(0:nx-1, 0:nz-1, 0:1),  intent(INOUT)  :: Forces
double precision, dimension(1:nx-2, 1:nz-2),       intent(IN)     :: MassMat
double precision, dimension(1:nx-2, 1:nz-2, 0:1),  intent(INOUT)  :: Displ
double precision, dimension(1:nx-2, 1:nz-2, 0:1),  intent(INOUT)  :: Veloc
double precision, dimension(1:nx-2, 1:nz-2, 0:1),  intent(IN)     :: V0
!
integer ::  ii, jj, kk
double precision, dimension(nx-2) :: tab1
  do kk = 0, 1
     do jj = 1, nz-2
        do ii = 1, nx-2
           tab1(ii) = MassMat(ii,jj)  * Forces(ii,jj,kk)
        end do
        do ii = 1, nx-2
           Veloc (ii,jj,kk) = V0     (ii,jj,kk) + dt * Forces(ii,jj,kk)
           Accel (ii,jj,kk) = Accel  (ii,jj,kk) + gam1 / dt * (Veloc(ii,jj,kk) - V0(ii,jj,kk) )
           Displ (ii,jj,kk) = Displ  (ii,jj,kk) + bega * dt * (Veloc(ii,jj,kk) + V0(ii,jj,kk) )
           Forces(ii,jj,kk) = tab1(ii)
        end do
     end do
  end do
!
return
end subroutine compute_Correction_Elem_Veloc

The compiler vectorizes the first loop on ii, but not the second one with the explanation :

      LOOP BEGIN at correction.f90(66,9)
         remark #15344: loop was not vectorized: vector dependence prevents vectorization
         remark #15346: vector dependence: assumed FLOW dependence between veloc(ii,jj,kk) (67:12) and dt (69:12)
         remark #15346: vector dependence: assumed ANTI dependence between dt (69:12) and veloc(ii,jj,kk) (67:12)
      LOOP END

And I don't understand why it finds a dependence between veloc and dt.

I have tried an other version, based on line 17 juste above :

Veloc (ii,jj,kk) = V0     (ii,jj,kk) + dt * Forces(ii,jj,kk)

that one can rewrite (in mathematical terms) to avoid the just computed vlue of VELOC in the RHS of lines 18 and 19 :

Veloc (ii,jj,kk) - V0     (ii,jj,kk) = dt * Forces(ii,jj,kk)

By this substitution, I have the second version :

subroutine compute_Correction_Elem_Veloc_2 (Accel, Forces, Displ, Veloc, V0, Massmat)
double precision, dimension(1:nx-2, 1:nz-2, 0:1),  intent(INOUT)  :: Accel
double precision, dimension(0:nx-1, 0:nz-1, 0:1),  intent(INOUT)  :: Forces
double precision, dimension(1:nx-2, 1:nz-2),       intent(IN)     :: MassMat
double precision, dimension(1:nx-2, 1:nz-2, 0:1),  intent(INOUT)  :: Displ
double precision, dimension(1:nx-2, 1:nz-2, 0:1),  intent(INOUT)  :: Veloc
double precision, dimension(1:nx-2, 1:nz-2, 0:1),  intent(IN)     :: V0
!
integer ::  ii, jj, kk
double precision, dimension(nx-2) :: tab1
  do kk = 0, 1
     do jj = 1, nz-2
        do ii = 1, nx-2
           tab1(ii) = MassMat(ii,jj)  * Forces(ii,jj,kk)
        end do
        do ii = 1, nx-2
           Veloc (ii,jj,kk) = V0   (ii,jj,kk) + dt   * tab1(ii)
           Accel (ii,jj,kk) = Accel(ii,jj,kk) + gam1 * tab1(ii)
           Displ (ii,jj,kk) = Displ(ii,jj,kk) + bega * dt * (dt * Forces(ii,jj,kk) + 2.0D0 * V0(ii,jj,kk) )
           Forces(ii,jj,kk) = tab1(ii)
        end do
     end do
  end do
!
return
end subroutine compute_Correction_Elem_Veloc_2

And again the compiler sees dependences :

LOOP BEGIN at correction.f90(13,9)
   remark #15344: loop was not vectorized: vector dependence prevents vectorization
   remark #15346: vector dependence: assumed FLOW dependence between tab1(ii) (14:12) and tab1(ii) (20:12)
   remark #15346: vector dependence: assumed ANTI dependence between tab1(ii) (20:12) and tab1(ii) (14:12)

   LOOP BEGIN at correction.f90(12,6)
      remark #15344: loop was not vectorized: vector dependence prevents vectorization
      remark #15346: vector dependence: assumed FLOW dependence between tab1(ii) (14:12) and tab1(ii) (20:12)
      remark #15346: vector dependence: assumed ANTI dependence between tab1(ii) (20:12) and tab1(ii) (14:12)

      LOOP BEGIN at correction.f90(13,9)
         remark #15344: loop was not vectorized: vector dependence prevents vectorization
         remark #15346: vector dependence: assumed FLOW dependence between veloc(ii,jj,kk) (17:12) and bega (19:12)
         remark #15346: vector dependence: assumed ANTI dependence between bega (19:12) and veloc(ii,jj,kk) (17:12)
      LOOP END

      LOOP BEGIN at correction.f90(16,9)
      LOOP END
   LOOP END
LOOP END

sorry for the log post !

Regards,

Yann_D_ · ‎11-16-2022

just one more small thing :

if I pass as an argument the scalar that is involved in the dependence, the dependence disappears.

TobiasK · ‎11-16-2022

Sure, I can't judge if veloc and v0 needs to be copied or not, just a remark. The same holds for the intermediate array tab which I would remove in the code example here.

By the way I would also advise to use the compiler flags: -pad and -align array64byte.

Sorry I missed that you are already using explicit shape arrays, these arrays are by definition contiguous, so no contiguous specifier needed, it is only needed for assumed shape arrays or pointers. Be aware, if the array is non-contiguous at call side, the compiler will create temporary copies in that case - this is actually one thing where assumed shape arrays do make a lot of sense, when you want to pass a non-contiguous array slice to a subroutine it can be implemented without an additional copy, if the interface does not specify the contiguous attribute.

Another thing that just came into my mind is to use OpenMP simd to force vectorization. You can enable them by -qopenmp-simd without having to compile with -qopenmp so that the compiler only honors !$omp simd and uses the information for vectorization.

https://www.openmp.org/spec-html/5.0/openmpsu42.html

With this clause you can specify a lot additional information like safelen, simdlen, stride, alignment, nontemporal stores etc. It is truly a very powerful tool, even if you do not use openmp anywhere in the code and much more portable than any other compiler directives.

But as Jim said, keeping the code simple allows the compiler to do the right thing, most of the time.

Best

Tobias

Yann_D_ · ‎11-16-2022

thank you for the flags.

I think that the "-warn all" flag warns us of temporary copies and it's a good thing !

I will take a look at the SIMD openMP directive ; thank you for the advice.

Regards,

jimdempseyatthecove · ‎11-16-2022

For performance reasons, you might want to consider extracting:

Forces(0,0,0:1), Forces(nx-1,0,0:1), Forces(0,nx-1,0:1), Forces(nx-1,nx-1,0:1) into, say Forces_perimeter(0:1,0:1,0:1)

Or something like that. Doing so, makes the calculations receive contiguous array sections. As written, you will have to stride over the 0'th and nx-1'th cells of the Forces array.

Jim Dempsey

TobiasK · ‎11-18-2022

One last general comment here, be careful when you read the opt-report, the opt-report also contains remainder loops.

If you are really interested in performance and vectorization the opt-report is a good starting point but to be really sure what code parts are actually executed I highly recommend using our performance analysis tools, e.g. VTune

https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler.html

For short loops with trip count 8 you might look at the wrong thing when you only rely on the opt-report.

Happy tuning ahead!

Tobias