- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Hello,

I try to vectorize a Fortran application. I look at the oneAPI ifort report for vectorization and I don't understand why it can't vectorize loops like the following example.

The if then else endif construct exists only to show that even if the different instructions are treated in different loops , there is quite no vectorization.

```
module selement
implicit none
!---------------------------------------
type :: simu_element
!---------------------------------------
doubleprecision, dimension(:,:,:), pointer :: Forces
doubleprecision, dimension(:,:,:), pointer :: Veloc
doubleprecision, dimension(:,:,:), pointer :: Displ
doubleprecision, dimension(:,:,:), pointer :: Accel
doubleprecision, dimension(:,:,:), pointer :: V0
!!---------------------------------------
end type simu_element
!---------------------------------------
!---------------------------------------
type :: element
!--------------------------------------
integer :: ngllx, ngllz
integer :: nbsimu
type(simu_element), dimension(:), pointer :: simu
!---------------------------------------
end type element
!---------------------------------------
contains
! ###########################################################
subroutine Prediction_Elem_Veloc (Elem,alpha,bega, dt)
implicit none
type (Element), intent (INOUT) :: Elem
double precision, intent (IN) :: bega, dt, alpha
double precision :: sc1, sc2
integer :: nx, nz
integer :: is, ii, jj, kk
double precision, dimension(Elem%ngllx-2) :: tab1
IF (dt > 0.0) then
sc1 = alpha * dt
sc2 = alpha * dt**2 * (0.5d0 - bega)
nx = Elem%ngllx ; nz = Elem%ngllz
do is = 0, Elem%nbsimu-1
do kk = 0, 1
do jj = 1, nz-2
do ii = 1, nx-2
tab1(ii) = Elem%simu(is)%Accel (ii,jj,kk)
end do
do ii = 1, nx-2
Elem%simu(is)%Forces(ii,jj,kk) = Elem%simu(is)%Displ(ii,jj,kk) + sc1 * Elem%simu(is)%Veloc(ii,jj,kk) + sc2 * tab1(ii)
end do
do ii = 1, nx-2
Elem%simu(is)%V0 (ii,jj,kk) = Elem%simu(is)%Veloc(ii,jj,kk)
end do
do ii = 1, nx-2
Elem%simu(is)%Accel (ii,jj,kk) = 0.d0
end do
end do
end do
end do
else
sc1 = alpha * dt
sc2 = alpha * dt**2 * (0.5d0 - bega)
nx = Elem%ngllx ; nz = Elem%ngllz
do is = 0, Elem%nbsimu-1
do kk = 0, 1
do jj = 1, nz-2
do ii = 1, nx-2
tab1(ii) = Elem%simu(is)%Accel (ii,jj,kk)
end do
do ii = 1, nx-2
Elem%simu(is)%Forces(ii,jj,kk) = Elem%simu(is)%Displ(ii,jj,kk) + sc1 * Elem%simu(is)%Veloc(ii,jj,kk) + sc2 * tab1(ii)
Elem%simu(is)%V0 (ii,jj,kk) = Elem%simu(is)%Veloc(ii,jj,kk)
Elem%simu(is)%Accel (ii,jj,kk) = 0.d0
end do
end do
end do
end do
end if
return
end subroutine Prediction_Elem_Veloc
end module selement
```

The compilation sequence is

`ifort -c -O3 -xCORE-AVX2 -ftz -g -traceback -qopt-report=5 -qopt-report-phase=vec Element.f90`

And the optimization report is

```
Intel(R) Advisor can now assist with vectorization and show optimization
report messages with your source code.
See "https://software.intel.com/en-us/intel-advisor-xe" for details.
Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.6.0 Build 20220226_000000
Compiler options: -c -O3 -xCORE-AVX2 -ftz -g -traceback -qopt-report=5 -qopt-report-phase=vec
Begin optimization report for: SELEMENT::PREDICTION_ELEM_VELOC
Report from: Vector optimizations [vec]
LOOP BEGIN at Element.f90(42,6)
remark #15542: loop was not vectorized: inner loop was already vectorized
LOOP BEGIN at Element.f90(45,15)
remark #15542: loop was not vectorized: inner loop was already vectorized
LOOP BEGIN at Element.f90(44,12)
remark #15542: loop was not vectorized: inner loop was already vectorized
LOOP BEGIN at Element.f90(45,15)
remark #15388: vectorization support: reference tab1(ii) has aligned access [ Element.f90(46,18) ]
remark #15328: vectorization support: non-unit strided load was emulated for the variable <elem(is,ii,jj,kk)>, stride is unknown to compiler [ Element.f90(46,18
) ]
remark #15305: vectorization support: vector length 2
remark #15300: LOOP WAS VECTORIZED
remark #15449: unmasked aligned unit stride stores: 1
remark #15452: unmasked strided loads: 1
remark #15475: --- begin vector cost summary ---
remark #15476: scalar cost: 4
remark #15477: vector cost: 2.500
remark #15478: estimated potential speedup: 1.600
remark #15488: --- end vector cost summary ---
LOOP END
LOOP BEGIN at Element.f90(45,15)
<Remainder loop for vectorization>
LOOP END
LOOP BEGIN at Element.f90(48,15)
remark #15344: loop was not vectorized: vector dependence prevents vectorization
remark #15346: vector dependence: assumed FLOW dependence between elem(is,ii,jj,kk) (49:18) and elem(is,ii,jj,kk) (49:18)
remark #15346: vector dependence: assumed ANTI dependence between elem(is,ii,jj,kk) (49:18) and elem(is,ii,jj,kk) (49:18)
LOOP END
LOOP BEGIN at Element.f90(51,15)
remark #15344: loop was not vectorized: vector dependence prevents vectorization
remark #15346: vector dependence: assumed FLOW dependence between elem(is,ii,jj,kk) (52:18) and elem(is,ii,jj,kk) (52:18)
remark #15346: vector dependence: assumed ANTI dependence between elem(is,ii,jj,kk) (52:18) and elem(is,ii,jj,kk) (52:18)
LOOP END
LOOP BEGIN at Element.f90(54,15)
remark #15335: loop was not vectorized: vectorization possible but seems inefficient. Use vector always directive or -vec-threshold0 to override
remark #15329: vectorization support: non-unit strided store was emulated for the variable <elem(is,ii,jj,kk)>, stride is unknown to compiler [ Element.f90(55,1
8) ]
remark #15305: vectorization support: vector length 2
remark #15399: vectorization support: unroll factor set to 4
remark #15453: unmasked strided stores: 1
remark #15475: --- begin vector cost summary ---
remark #15476: scalar cost: 3
remark #15477: vector cost: 3.000
remark #15478: estimated potential speedup: 1.000
remark #15488: --- end vector cost summary ---
LOOP END
LOOP BEGIN at Element.f90(54,15)
<Remainder>
LOOP END
LOOP END
LOOP END
LOOP END
LOOP BEGIN at Element.f90(64,6)
remark #15542: loop was not vectorized: inner loop was already vectorized
LOOP BEGIN at Element.f90(67,15)
remark #15542: loop was not vectorized: inner loop was already vectorized
LOOP BEGIN at Element.f90(66,12)
remark #15542: loop was not vectorized: inner loop was already vectorized
LOOP BEGIN at Element.f90(67,15)
<Multiversioned v1>
remark #15388: vectorization support: reference tab1(ii) has aligned access [ Element.f90(68,18) ]
remark #15389: vectorization support: reference elem(is,ii,jj,kk) has unaligned access [ Element.f90(68,18) ]
remark #15381: vectorization support: unaligned access used inside loop body
remark #15305: vectorization support: vector length 4
remark #15309: vectorization support: normalized vectorization overhead 0.750
remark #15300: LOOP WAS VECTORIZED
remark #15449: unmasked aligned unit stride stores: 1
remark #15450: unmasked unaligned unit stride loads: 1
remark #15475: --- begin vector cost summary ---
remark #15476: scalar cost: 4
remark #15477: vector cost: 1.000
remark #15478: estimated potential speedup: 3.960
remark #15488: --- end vector cost summary ---
LOOP END
LOOP BEGIN at Element.f90(67,15)
<Remainder loop for vectorization, Multiversioned v1>
LOOP END
LOOP BEGIN at Element.f90(67,15)
<Multiversioned v2>
remark #15388: vectorization support: reference tab1(ii) has aligned access [ Element.f90(68,18) ]
remark #15328: vectorization support: non-unit strided load was emulated for the variable <elem(is,ii,jj,kk)>, stride is unknown to compiler [ Element.f90(68,18
) ]
remark #15305: vectorization support: vector length 2
remark #15300: LOOP WAS VECTORIZED
remark #15449: unmasked aligned unit stride stores: 1
remark #15452: unmasked strided loads: 1
remark #15475: --- begin vector cost summary ---
remark #15476: scalar cost: 4
remark #15477: vector cost: 2.500
remark #15478: estimated potential speedup: 1.600
remark #15488: --- end vector cost summary ---
LOOP END
LOOP BEGIN at Element.f90(67,15)
<Remainder loop for vectorization, Multiversioned v2>
LOOP END
LOOP BEGIN at Element.f90(70,15)
remark #15344: loop was not vectorized: vector dependence prevents vectorization
remark #15346: vector dependence: assumed OUTPUT dependence between elem(is,ii,jj,kk) (71:18) and elem(is,ii,jj,kk) (73:18)
remark #15346: vector dependence: assumed OUTPUT dependence between elem(is,ii,jj,kk) (73:18) and elem(is,ii,jj,kk) (71:18)
LOOP END
LOOP END
LOOP END
LOOP END
===========================================================================
```

Thank you for your advices.

Regards,

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

At issue here (IMHO) is because you are using POINTER as opposed to allocatable, combined with 1+3 indices being involved, that the compiler optimization code had too many things to keep track of.

In particular, being POINTERs, ...Forces, could just as well point to ...Displ and/or ...Elem.

Do you really need POINTER as opposed to ALLOCATABLE?

As an alternative, try simplifying the work of the optimization code through use of ASSOCIATE. For example:

```
associate(Forces=>Elem%simu(is)%Forces, Displ=>Elem%simu(is)%Displ, Veloc=>Elem%simu(is)%Veloc)
do ii = 1, nx-2
Forces(ii,jj,kk) = Displ(ii,jj,kk) + sc1 * Veloc(ii,jj,kk) + sc2 * tab1(ii)
end do
end associate
```

And similar changes for other loops.

Note, at times when associate does not work around the optimization issue, then resort to using a contained procedure.

```
subroutine Prediction_Elem_Veloc
...
! do ii = 1, nx-2
! Elem%simu(is)%Forces(ii,jj,kk) = Elem%simu(is)%Displ(ii,jj,kk) + sc1 * Elem%simu(is)%Veloc(ii,jj,kk) + sc2 * tab1(ii)
! end do
call CalcForces(Elem%simu(is)%Forces, Elem%simu(is)%Displ, Elem%simu(is)%Veloc)
...
return
contains
subroutine CalcForces(Forces, Displ, Veloc)
implicit none
doubleprecision, dimension(:,:,:) :: Forces ! without pointer
doubleprecision, dimension(:,:,:) :: Veloc
doubleprecision, dimension(:,:,:) :: Displ
! Note, ii, jj, kk, nx2, sc1, sc2, and tab1 are declared via host association
do ii = 1, nx-2
Forces(ii,jj,kk) = Displ(ii,jj,kk) + sc1 * Veloc(ii,jj,kk) + sc2 * tab1(ii)
end do
end subroutine CalcForces
...
end subroutine Prediction_Elem_Veloc
```

Jim Dempsey

Link Copied

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

At issue here (IMHO) is because you are using POINTER as opposed to allocatable, combined with 1+3 indices being involved, that the compiler optimization code had too many things to keep track of.

In particular, being POINTERs, ...Forces, could just as well point to ...Displ and/or ...Elem.

Do you really need POINTER as opposed to ALLOCATABLE?

As an alternative, try simplifying the work of the optimization code through use of ASSOCIATE. For example:

```
associate(Forces=>Elem%simu(is)%Forces, Displ=>Elem%simu(is)%Displ, Veloc=>Elem%simu(is)%Veloc)
do ii = 1, nx-2
Forces(ii,jj,kk) = Displ(ii,jj,kk) + sc1 * Veloc(ii,jj,kk) + sc2 * tab1(ii)
end do
end associate
```

And similar changes for other loops.

Note, at times when associate does not work around the optimization issue, then resort to using a contained procedure.

```
subroutine Prediction_Elem_Veloc
...
! do ii = 1, nx-2
! Elem%simu(is)%Forces(ii,jj,kk) = Elem%simu(is)%Displ(ii,jj,kk) + sc1 * Elem%simu(is)%Veloc(ii,jj,kk) + sc2 * tab1(ii)
! end do
call CalcForces(Elem%simu(is)%Forces, Elem%simu(is)%Displ, Elem%simu(is)%Veloc)
...
return
contains
subroutine CalcForces(Forces, Displ, Veloc)
implicit none
doubleprecision, dimension(:,:,:) :: Forces ! without pointer
doubleprecision, dimension(:,:,:) :: Veloc
doubleprecision, dimension(:,:,:) :: Displ
! Note, ii, jj, kk, nx2, sc1, sc2, and tab1 are declared via host association
do ii = 1, nx-2
Forces(ii,jj,kk) = Displ(ii,jj,kk) + sc1 * Veloc(ii,jj,kk) + sc2 * tab1(ii)
end do
end subroutine CalcForces
...
end subroutine Prediction_Elem_Veloc
```

Jim Dempsey

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Hello Jim,

thank you for your help.

This code was developed some years ago. At the early 2000, "only POINTER was available for derived types" told me the main programmer of this code. But you're right, now, one can replace POINTER by ALLOCATABLE.

So I begin to test ...

1. ALLOCATABLE alone, doesn't change anything, the computing loop is not vectorized.

2. Next, I try with the ASSOCIATE construction like in your proposal but the computing loop is still not vectorized.

3. So, finally, I write a contained procedure, but at a higher level :

```
do is = 0, Elem%nbsimu-1
CALL compute_Prediction_Elem_Veloc_2 (Elem%ngllx, Elem%ngllz, Elem%simu(is)%Accel, Elem%simu(is)%Forces, Elem%simu(is)%Displ, Elem%simu(is)%Veloc, Elem%simu(is)%V0, alpha, dt, bega)
end do
contains
subroutine compute_Prediction_Elem_Veloc (nx, nz, Accel, Forces, Displ, Veloc, V0, alpha, dt, bega)
integer, intent(in) :: nx, nz
double precision, dimension(1:nx-2, 1:nz-2, 0:1), intent(INOUT) :: Accel
double precision, dimension(0:nx-1, 0:nz-1, 0:1), intent(OUT) :: Forces
double precision, dimension(1:nx-2, 1:nz-2, 0:1), intent(IN) :: Displ
double precision, dimension(1:nx-2, 1:nz-2, 0:1), intent(IN) :: Veloc
double precision, dimension(1:nx-2, 1:nz-2, 0:1), intent(OUT) :: V0
double precision, intent (IN) :: bega, dt, alpha
double precision, dimension(nx-2) :: tab1
integer :: ii, jj, kk
double precision :: sc1, sc2
sc1 = alpha * dt
sc2 = alpha * dt**2 * (0.5d0 - bega)
do kk = 0, 1
do jj = 1, nz-2
do ii = 1, nx-2
tab1(ii) = Accel (ii,jj,kk)
end do
do ii = 1, nx-2
Forces(ii,jj,kk) = Displ(ii,jj,kk) + sc1 * Veloc(ii,jj,kk) + sc2 * tab1(ii)
V0 (ii,jj,kk) = Veloc(ii,jj,kk)
Accel (ii,jj,kk) = 0.d0
end do
end do
end do
return
end subroutine compute_Prediction_Elem_Veloc
```

And the optimization report tells us it's vectorized ! Good news.

```
LOOP BEGIN at Element.f90(104,12) inlined into Element.f90(43,14)
remark #15389: vectorization support: reference forces(ii,jj,kk) has unaligned access [ Element.f90(105,15) ]
remark #15389: vectorization support: reference displ(ii,jj,kk) has unaligned access [ Element.f90(105,34) ]
remark #15389: vectorization support: reference veloc(ii,jj,kk) has unaligned access [ Element.f90(105,58) ]
remark #15388: vectorization support: reference tab1(ii) has aligned access [ Element.f90(105,82) ]
remark #15389: vectorization support: reference v0(ii,jj,kk) has unaligned access [ Element.f90(106,15) ]
remark #15389: vectorization support: reference veloc(ii,jj,kk) has unaligned access [ Element.f90(106,15) ]
remark #15389: vectorization support: reference accel(ii,jj,kk) has unaligned access [ Element.f90(107,15) ]
remark #15381: vectorization support: unaligned access used inside loop body
remark #15305: vectorization support: vector length 4
remark #15399: vectorization support: unroll factor set to 4
remark #15309: vectorization support: normalized vectorization overhead 0.170
remark #15300: LOOP WAS VECTORIZED
remark #15448: unmasked aligned unit stride loads: 1
remark #15450: unmasked unaligned unit stride loads: 2
remark #15451: unmasked unaligned unit stride stores: 3
remark #15475: --- begin vector cost summary ---
remark #15476: scalar cost: 20
remark #15477: vector cost: 6.250
remark #15478: estimated potential speedup: 2.910
remark #15488: --- end vector cost summary ---
LOOP END
```

So I think I will continue in this direction.

I have an other question related to these nested loops :

This application is based on a spatial discretization done with spectral elements. So nx and nz are less than 10.

=> Is there a way to tell this to the compiler ?

=> Can something be done to fuse the two most internal loops (ii, jj) into a longer one ?

Regards,

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Edit. Also try:

```
associate(Forces=>Elem%simu(is)%Forces(:,jj,kk), Displ=>Elem%simu(is)%Displ(:,jj,kk) Veloc=>Elem%simu(is)%Veloc(:,jj,kk))
do ii = 1, nx-2
Forces(ii) = Displ(ii) + sc1 * Veloc(ii) + sc2 * tab1(ii)
end do
end associate
```

And:

```
subroutine Prediction_Elem_Veloc
...
! do ii = 1, nx-2
! Elem%simu(is)%Forces(ii,jj,kk) = Elem%simu(is)%Displ(ii,jj,kk) + sc1 * Elem%simu(is)%Veloc(ii,jj,kk) + sc2 * tab1(ii)
! end do
call CalcForces(Elem%simu(is)%Forces(:,jj,kk), Elem%simu(is)%Displ(:,jj,kk), Elem%simu(is)%Veloc(:,jj,kk))
...
return
contains
subroutine CalcForces(Forces, Displ, Veloc)
implicit none
doubleprecision, dimension(:) :: Forces
doubleprecision, dimension(:) :: Veloc
doubleprecision, dimension(:) :: Displ
! Note, ii, nx2, sc1, sc2, and tab1 are declared via host association
do ii = 1, nx-2
Forces(ii) = Displ(ii) + sc1 * Veloc(ii) + sc2 * tab1(ii)
end do
end subroutine CalcForces
...
end subroutine Prediction_Elem_Veloc
```

Jim Dempsey

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

=> Is there a way to tell this to the compiler ?

!dir$ loop count max(10)

=> Can something be done to fuse the two most internal loops (ii, jj) into a longer one ?

*** Why is Forces 0-based indexed when the others are 1-based?

IOW Forces(ii,jj,kk) = Displ(ii,jj,kk) + sc1 * Veloc(ii,jj,kk) + sc2 * tab1(ii)

does not compute Forces(0,jj,kk)

? Is size(Forces) == size(Accel) == size(Displ) == size(Veloc) == size(tab1) == size(V0) ??

If so, have the contained procedure receive the 3D arrays as 1D

sketch:

```
...
call compute_Prediction_Elem_Veloc(size(Forces), Accel, Forces, Displ, Veloc, V0, alpha, dt, bega)
...
contains
subroutine compute_Prediction_Elem_Veloc(n, Accel, Forces, Displ, Veloc, V0, alpha, dt, bega)
integer, intent(in) :: n
double precision, dimension(n), intent(INOUT) :: Accel
double precision, dimension(n), intent(OUT) :: Forces
double precision, dimension(n), intent(IN) :: Displ
double precision, dimension(n), intent(IN) :: Veloc
double precision, dimension(n), intent(OUT) :: V0
double precision, intent (IN) :: bega, dt, alpha
integer :: i
double precision :: sc1, sc2
sc1 = alpha * dt
sc2 = alpha * dt**2 * (0.5d0 - bega)
do i=1,n
Forces(i) = Displ(i) + sc1 * Veloc(i) + sc2 * Accel(i)
V0 (i) = Veloc(i)
Accel (i) = 0.d0
end do
return
end subroutine compute_Prediction_Elem_Veloc
```

*** the above requires each of the arrays to be allocatable (contiguous) and same sizes.

Also, there is no need for array tab1

Jim Dempsey

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Forces has different bounds than the other arrays, because it needs some "ghost cells" around for other computations elsewhere in the application. But we will examine the reverse : see if we can increase the loops and extend the other arrays with zero filled cells to the size of Forces. It depends where the "ghost cells" of Forces are used and can be overwritten.

It's up to me now.

your advices are very appreciated.

Thank you.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

>>Force...some "ghost cells" ...

Wouldn't this apply to the other cells as well?

IOW if you place "ghost cells" at the same indicies as those in Force into the other arrays, then you could perform the integration inclusive of the "ghost cells". While this may have unnecessary computation (integrations) of cells, it may recover more performance due to keeping within a single vectored loop. IOW vector over the top of the "ghost cells" thus eliminating start/stop sections of disjointed loops.

Jim Dempsey

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Very nice solution from Jim, thanks!

In addition to the solution a few more ideas to consider:

Generally pointers are bad because because:

1) they are allowed to point to the same memory locations, or just partially overlap

2) the stride between the elements is unknown - here you could add the contiguous attribute to at least ensure that the stride is 1, same holds for assumed shape (:) arrays declared in an interface

There are cases where pointers are beneficial, e.g. if you want to avoid workspace allocations and implement your own workspace management. In the example here, you could actually avoid a copy by swapping the pointers V0 und VELOC instead of copying VELOC to V0 if you do not need two separate copies of VELOC. With allocatables you can achieve this by adding another dimension in VELOC and swap between the dimensions with something like this:

...

integer :: veloc_sel,v0_sel

veloc_sel=mod(timestep,2)+1

v0_sel=mod(timestep-1,2)+1

veloc(:,:,:,vsel) ==veloc

veloc(:,:,:,v0_sel)==v0

...

The trick proposed to move the computational stuff into a separate routine works because in Fortran allocatable and pointer are attributes which you can remove by passing the arrays to a subroutine. If the subroutine is small enough the compiler will inline the function call and there is practically no overhead. This is what happens in your code.

Per definition, the subroutine must not be called with overlapping arguments, so 1) is solved implicitly and 2) is usually solved by using assumed size (*) or explicit shape (n) declarations. Assumed shape (:) still allows for non-unit stride between the elements, and usually the compiler implements a runtime check if the stride is =1 or greater. To remove that implicitly generated code, assumed shape arrays can be declared as contiguous just as pointers.

The one drawback of using an subroutine call and removing the pointer / allocatable attribute is that the bounds are reset to lbound=1. e.g. if your forces array has the lower bound=0 and you simply pass the whole array the lower bound inside the subroutine is again 1 and the upper bound is increased by 1! To be clear, assumed shape arrays (:) inside a function will always have lbound=1.

...

module fun

contains

subroutine print_bounds(a)

integer, intent(in),contiguous :: a(:)

write(*,*) 'function side',lbound(a),ubound(a)

end subroutine print_bounds

end module fun

program bounds

use fun

integer :: a(-15:11)

write(*,*) 'call site',lbound(a),ubound(a)

call print_bounds(a)

end program bounds

...

If you know, nx and nz are never larger than 10 and this also corresponds to the arrays, I would consider

...

type :: simu_element

!---------------------------------------

doubleprecision, dimension(8,8,0:1) :: Forces

doubleprecision, dimension(8,8,0:1) :: Veloc

doubleprecision, dimension(8,8,0:1) :: Displ

doubleprecision, dimension(8,8,0:1) :: Accel

doubleprecision, dimension(8,8,0:1) :: V0

!!---------------------------------------

end type simu_element

...

if it does not waste a large amount of memory.

Another advice: try to always use the contiguous attribute when defining a interface and using assumed shape (:) arrays. Otherwise the compiler has to include runtime checks for the stride between the elements of the array.

Best

Tobias

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Hello Tobias and thank you for your advices.

Some comments:

- I have removed all pointer attributes that I can in the application.

- I use explicite bounds declarations to be more readable and understandable, especially when there are lbounds different of 1. So, here, I can't use the contiguous attribute but I will try to remember it in case of assumed shape.

- I know and use the trick of the swapping pointers instead of copying arrays, especially in iterative methods but the loop above is a misleading example for the use of VELOC and V0. In other routines, they are treated differently.

- Concerning the size of the arrays, most of the runs use few grid points for the spatial discretization but it may happen that nx and nz are as large as 35.

Regards,

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Tobias,

>>you could actually avoid a copy by swapping the pointers V0 und VELOC instead of copying VELOC to V0...

I believe in this case it is required to have two copies of the velocity: current velocity and predicted velocity at next time step.

The two velocities are then integrated using a non-Euler method. IOW when multi-step method is used (Euler is One-Step method).

The other tips are noteworthy.

IMHO compiler optimization missed opportunities are greatly reduced when you simplify the code. In this case treating the multi-dimensional array as a single dimension array. Note, this works in this case due to the algorithm (code) being applied. This may not be suitable in all cases.

Jim Dempsey

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

So, I continue my optimization work. I write small contained subroutines. I'm facing a strange situation : the compiler finds flow dependences between array's elements and intent(in) scalar ! Below, there are some extracts of the files, but I put the source file and the report in attachment.

In the main subroutine I call the contained subroutine like this :

```
nx = Elem%ngllx ; nz = Elem%ngllz
do is = 0, Elem%nbsimu-1
CALL compute_correction_Elem_Veloc (Elem%simu(is)%Accel, Elem%simu(is)%Forces, Elem%simu(is)%Displ, &
Elem%simu(is)%Veloc, Elem%simu(is)%V0, Elem%Massmat)
CALL compute_correction_Elem_Veloc_2 (Elem%simu(is)%Accel, Elem%simu(is)%Forces, Elem%simu(is)%Displ, &
Elem%simu(is)%Veloc, Elem%simu(is)%V0, Elem%Massmat)
CALL compute_correction_Elem_Veloc_orig (Elem%simu(is)%Accel, Elem%simu(is)%Forces, Elem%simu(is)%Displ, &
Elem%simu(is)%Veloc, Elem%simu(is)%V0, Elem%Massmat)
end do ! of do is = 0, Elem%nbsimu-1
!
return
```

There are three call to different routines ; it's just to show the different versions I tried and the analysis of the compiler. In the real application there will be only one, the one that works.

First, the call to the original algorithm

```
subroutine compute_Correction_Elem_Veloc_orig (Accel, Forces, Displ, Veloc, V0, Massmat)
double precision, dimension(1:nx-2, 1:nz-2, 0:1), intent(INOUT) :: Accel
double precision, dimension(0:nx-1, 0:nz-1, 0:1), intent(INOUT) :: Forces
double precision, dimension(1:nx-2, 1:nz-2), intent(IN) :: MassMat
double precision, dimension(1:nx-2, 1:nz-2, 0:1), intent(INOUT) :: Displ
double precision, dimension(1:nx-2, 1:nz-2, 0:1), intent(INOUT) :: Veloc
double precision, dimension(1:nx-2, 1:nz-2, 0:1), intent(IN) :: V0
!
integer :: ii, jj, kk
do kk = 0, 1
do jj = 1, nz-2
do ii = 1, nx-2
Forces(ii,jj,kk) = MassMat(ii,jj) * Forces(ii,jj,kk)
Veloc (ii,jj,kk) = V0 (ii,jj,kk) + dt * Forces(ii,jj,kk)
Accel (ii,jj,kk) = Accel (ii,jj,kk) + gam1 / dt * (Veloc(ii,jj,kk) - V0(ii,jj,kk) )
Displ (ii,jj,kk) = Displ (ii,jj,kk) + bega * dt * (Veloc(ii,jj,kk) + V0(ii,jj,kk) )
end do
end do
end do
!
return
end subroutine compute_Correction_Elem_Veloc_orig
```

There are dependences, so no immediate vectorization is possible

I don't find a solution without a temporary array, so here I I propose :

```
subroutine compute_Correction_Elem_Veloc (Accel, Forces, Displ, Veloc, V0, Massmat)
double precision, dimension(1:nx-2, 1:nz-2, 0:1), intent(INOUT) :: Accel
double precision, dimension(0:nx-1, 0:nz-1, 0:1), intent(INOUT) :: Forces
double precision, dimension(1:nx-2, 1:nz-2), intent(IN) :: MassMat
double precision, dimension(1:nx-2, 1:nz-2, 0:1), intent(INOUT) :: Displ
double precision, dimension(1:nx-2, 1:nz-2, 0:1), intent(INOUT) :: Veloc
double precision, dimension(1:nx-2, 1:nz-2, 0:1), intent(IN) :: V0
!
integer :: ii, jj, kk
double precision, dimension(nx-2) :: tab1
do kk = 0, 1
do jj = 1, nz-2
do ii = 1, nx-2
tab1(ii) = MassMat(ii,jj) * Forces(ii,jj,kk)
end do
do ii = 1, nx-2
Veloc (ii,jj,kk) = V0 (ii,jj,kk) + dt * Forces(ii,jj,kk)
Accel (ii,jj,kk) = Accel (ii,jj,kk) + gam1 / dt * (Veloc(ii,jj,kk) - V0(ii,jj,kk) )
Displ (ii,jj,kk) = Displ (ii,jj,kk) + bega * dt * (Veloc(ii,jj,kk) + V0(ii,jj,kk) )
Forces(ii,jj,kk) = tab1(ii)
end do
end do
end do
!
return
end subroutine compute_Correction_Elem_Veloc
```

The compiler vectorizes the first loop on ii, but not the second one with the explanation :

```
LOOP BEGIN at correction.f90(66,9)
remark #15344: loop was not vectorized: vector dependence prevents vectorization
remark #15346: vector dependence: assumed FLOW dependence between veloc(ii,jj,kk) (67:12) and dt (69:12)
remark #15346: vector dependence: assumed ANTI dependence between dt (69:12) and veloc(ii,jj,kk) (67:12)
LOOP END
```

And I don't understand why it finds a dependence between veloc and dt.

I have tried an other version, based on line 17 juste above :

`Veloc (ii,jj,kk) = V0 (ii,jj,kk) + dt * Forces(ii,jj,kk)`

that one can rewrite (in mathematical terms) to avoid the just computed vlue of VELOC in the RHS of lines 18 and 19 :

`Veloc (ii,jj,kk) - V0 (ii,jj,kk) = dt * Forces(ii,jj,kk)`

By this substitution, I have the second version :

```
subroutine compute_Correction_Elem_Veloc_2 (Accel, Forces, Displ, Veloc, V0, Massmat)
double precision, dimension(1:nx-2, 1:nz-2, 0:1), intent(INOUT) :: Accel
double precision, dimension(0:nx-1, 0:nz-1, 0:1), intent(INOUT) :: Forces
double precision, dimension(1:nx-2, 1:nz-2), intent(IN) :: MassMat
double precision, dimension(1:nx-2, 1:nz-2, 0:1), intent(INOUT) :: Displ
double precision, dimension(1:nx-2, 1:nz-2, 0:1), intent(INOUT) :: Veloc
double precision, dimension(1:nx-2, 1:nz-2, 0:1), intent(IN) :: V0
!
integer :: ii, jj, kk
double precision, dimension(nx-2) :: tab1
do kk = 0, 1
do jj = 1, nz-2
do ii = 1, nx-2
tab1(ii) = MassMat(ii,jj) * Forces(ii,jj,kk)
end do
do ii = 1, nx-2
Veloc (ii,jj,kk) = V0 (ii,jj,kk) + dt * tab1(ii)
Accel (ii,jj,kk) = Accel(ii,jj,kk) + gam1 * tab1(ii)
Displ (ii,jj,kk) = Displ(ii,jj,kk) + bega * dt * (dt * Forces(ii,jj,kk) + 2.0D0 * V0(ii,jj,kk) )
Forces(ii,jj,kk) = tab1(ii)
end do
end do
end do
!
return
end subroutine compute_Correction_Elem_Veloc_2
```

And again the compiler sees dependences :

```
LOOP BEGIN at correction.f90(13,9)
remark #15344: loop was not vectorized: vector dependence prevents vectorization
remark #15346: vector dependence: assumed FLOW dependence between tab1(ii) (14:12) and tab1(ii) (20:12)
remark #15346: vector dependence: assumed ANTI dependence between tab1(ii) (20:12) and tab1(ii) (14:12)
LOOP BEGIN at correction.f90(12,6)
remark #15344: loop was not vectorized: vector dependence prevents vectorization
remark #15346: vector dependence: assumed FLOW dependence between tab1(ii) (14:12) and tab1(ii) (20:12)
remark #15346: vector dependence: assumed ANTI dependence between tab1(ii) (20:12) and tab1(ii) (14:12)
LOOP BEGIN at correction.f90(13,9)
remark #15344: loop was not vectorized: vector dependence prevents vectorization
remark #15346: vector dependence: assumed FLOW dependence between veloc(ii,jj,kk) (17:12) and bega (19:12)
remark #15346: vector dependence: assumed ANTI dependence between bega (19:12) and veloc(ii,jj,kk) (17:12)
LOOP END
LOOP BEGIN at correction.f90(16,9)
LOOP END
LOOP END
LOOP END
```

sorry for the log post !

Regards,

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

just one more small thing :

if I pass as an argument the scalar that is involved in the dependence, the dependence disappears.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Sure, I can't judge if veloc and v0 needs to be copied or not, just a remark. The same holds for the intermediate array tab which I would remove in the code example here.

By the way I would also advise to use the compiler flags: -pad and -align array64byte.

Sorry I missed that you are already using explicit shape arrays, these arrays are by definition contiguous, so no contiguous specifier needed, it is only needed for assumed shape arrays or pointers. Be aware, if the array is non-contiguous at call side, the compiler will create temporary copies in that case - this is actually one thing where assumed shape arrays do make a lot of sense, when you want to pass a non-contiguous array slice to a subroutine it can be implemented without an additional copy, if the interface does not specify the contiguous attribute.

Another thing that just came into my mind is to use OpenMP simd to force vectorization. You can enable them by -qopenmp-simd without having to compile with -qopenmp so that the compiler only honors !$omp simd and uses the information for vectorization.

https://www.openmp.org/spec-html/5.0/openmpsu42.html

With this clause you can specify a lot additional information like safelen, simdlen, stride, alignment, nontemporal stores etc. It is truly a very powerful tool, even if you do not use openmp anywhere in the code and much more portable than any other compiler directives.

But as Jim said, keeping the code simple allows the compiler to do the right thing, most of the time.

Best

Tobias

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

thank you for the flags.

I think that the "-warn all" flag warns us of temporary copies and it's a good thing !

I will take a look at the SIMD openMP directive ; thank you for the advice.

Regards,

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

For performance reasons, you might want to consider extracting:

Forces(0,0,0:1), Forces(nx-1,0,0:1), Forces(0,nx-1,0:1), Forces(nx-1,nx-1,0:1) into, say Forces_perimeter(0:1,0:1,0:1)

Or something like that. Doing so, makes the calculations receive contiguous array sections. As written, you will have to stride over the 0'th and nx-1'th cells of the Forces array.

Jim Dempsey

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

One last general comment here, be careful when you read the opt-report, the opt-report also contains remainder loops.

If you are really interested in performance and vectorization the opt-report is a good starting point but to be really sure what code parts are actually executed I highly recommend using our performance analysis tools, e.g. VTune

https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler.html

For short loops with trip count 8 you might look at the wrong thing when you only rely on the opt-report.

Happy tuning ahead!

Tobias

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page