topic How to vectorize two nested fortran loops with OpenMP directive with COLLAPSE clause ? in Intel® Fortran Compiler

How to vectorize two nested fortran loops with OpenMP directive with COLLAPSE clause ?

Yann_D_ — Mon, 20 Feb 2023 16:19:07 GMT

Hello

I want to vectorize the two nested loops because their length is small.

The example is the following :

module mystuff USE ISO_FORTRAN_ENV integer :: nx integer :: nz real(REAL64) :: bega contains subroutine compute_Correction_Elem_Veloc (Accel, Forces, Displ, Veloc, V0, g1, ddt) real(REAL64), dimension(1:nx-2, 1:nz-2, 0:1), intent(INOUT) :: Accel real(REAL64), dimension(0:nx-1, 0:nz-1, 0:1), intent(INOUT) :: Forces real(REAL64), dimension(1:nx-2, 1:nz-2, 0:1), intent(INOUT) :: Displ real(REAL64), dimension(1:nx-2, 1:nz-2, 0:1), intent(INOUT) :: Veloc real(REAL64), dimension(1:nx-2, 1:nz-2, 0:1), intent(IN) :: V0 real(REAL64), intent(IN) :: g1, ddt ! integer :: ii, jj real(REAL64) :: sc2 ! sc2 = bega * ddt !$OMP SIMD COLLAPSE(2) do jj = 1, nz-2 do ii = 1, nx-2 Veloc (ii,jj,0) = V0 (ii,jj,0) + ddt * Forces(ii,jj,0) Accel (ii,jj,0) = Accel(ii,jj,0) + g1 * Forces(ii,jj,0) Displ (ii,jj,0) = Displ(ii,jj,0) + sc2 * (Veloc(ii,jj,0) + V0(ii,jj,0) ) Veloc (ii,jj,1) = V0 (ii,jj,1) + ddt * Forces(ii,jj,1) Accel (ii,jj,1) = Accel(ii,jj,1) + g1 * Forces(ii,jj,1) Displ (ii,jj,1) = Displ(ii,jj,1) + sc2 * (Veloc(ii,jj,1) + V0(ii,jj,1) ) end do end do ! return end subroutine compute_Correction_Elem_Veloc end module mystuff

The compilation sequence is

ifort -c -O3 -xHost -ftz -g -traceback -implicitnone -qopt-report=5 -qopt-report-phase=vec -fpp -DGM_VEC -pad -align array64byte -c mystuff_v01.f90

The report is

Intel(R) Advisor can now assist with vectorization and show optimization report messages with your source code. See "https://software.intel.com/en-us/intel-advisor-xe" for details. Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.6.0 Build 20220226_000000 Compiler options: -c -O3 -xHost -ftz -g -traceback -implicitnone -qopt-report=5 -qopt-report-phase=vec -fpp -DGM_VEC -pad -align array64byte -c Begin optimization report for: MYSTUFF::COMPUTE_CORRECTION_ELEM_VELOC Report from: Vector optimizations [vec] LOOP BEGIN at mystuff_v01.f90(26,25) remark #15329: vectorization support: irregularly indexed store was emulated for the variable <veloc(ii,jj,0)>, part of index is nonlinearly computed [ mystuff_v01.f90(25,7) ] remark #15328: vectorization support: irregularly indexed load was emulated for the variable <v0(ii,jj,0)>, part of index is nonlinearly computed [ mystuff_v01.f90(25,25) ] remark #15328: vectorization support: irregularly indexed load was emulated for the variable <forces(ii,jj,0)>, part of index is nonlinearly computed [ mystuff_v01.f90(25,49) ] remark #15329: vectorization support: irregularly indexed store was emulated for the variable <accel(ii,jj,0)>, part of index is nonlinearly computed [ mystuff_v01.f90(26,7) ] remark #15328: vectorization support: irregularly indexed load was emulated for the variable <accel(ii,jj,0)>, part of index is nonlinearly computed remark #15328: vectorization support: irregularly indexed load was emulated for the variable <forces(ii,jj,0)>, part of index is nonlinearly computed [ mystuff_v01.f90(26,49) ] remark #15329: vectorization support: irregularly indexed store was emulated for the variable <displ(ii,jj,0)>, part of index is nonlinearly computed [ mystuff_v01.f90(27,7) ] remark #15328: vectorization support: irregularly indexed load was emulated for the variable <displ(ii,jj,0)>, part of index is nonlinearly computed [ mystuff_v01.f90(27,25) ] remark #15328: vectorization support: irregularly indexed load was emulated for the variable <veloc(ii,jj,0)>, part of index is nonlinearly computed [ mystuff_v01.f90(27,49) ] remark #15328: vectorization support: irregularly indexed load was emulated for the variable <v0(ii,jj,0)>, part of index is nonlinearly computed [ mystuff_v01.f90(27,66) ] remark #15329: vectorization support: irregularly indexed store was emulated for the variable <veloc(ii,jj,1)>, part of index is nonlinearly computed [ mystuff_v01.f90(28,7) ] remark #15328: vectorization support: irregularly indexed load was emulated for the variable <v0(ii,jj,1)>, part of index is nonlinearly computed [ mystuff_v01.f90(28,25) ] remark #15328: vectorization support: irregularly indexed load was emulated for the variable <forces(ii,jj,1)>, part of index is nonlinearly computed [ mystuff_v01.f90(28,49) ] remark #15329: vectorization support: irregularly indexed store was emulated for the variable <accel(ii,jj,1)>, part of index is nonlinearly computed [ mystuff_v01.f90(29,7) ] remark #15328: vectorization support: irregularly indexed load was emulated for the variable <accel(ii,jj,1)>, part of index is nonlinearly computed [ mystuff_v01.f90(29,25) ] remark #15328: vectorization support: irregularly indexed load was emulated for the variable <forces(ii,jj,1)>, part of index is nonlinearly computed [ mystuff_v01.f90(29,49) ] remark #15329: vectorization support: irregularly indexed store was emulated for the variable <displ(ii,jj,1)>, part of index is nonlinearly computed [ mystuff_v01.f90(30,7) ] remark #15328: vectorization support: irregularly indexed load was emulated for the variable <displ(ii,jj,1)>, part of index is nonlinearly computed [ mystuff_v01.f90(30,25) ] remark #15328: vectorization support: irregularly indexed load was emulated for the variable <veloc(ii,jj,1)>, part of index is nonlinearly computed [ mystuff_v01.f90(30,49) ] remark #15328: vectorization support: irregularly indexed load was emulated for the variable <v0(ii,jj,1)>, part of index is nonlinearly computed [ mystuff_v01.f90(30,66) ] remark #15305: vectorization support: vector length 2 remark #15309: vectorization support: normalized vectorization overhead 0.073 remark #15301: OpenMP SIMD LOOP WAS VECTORIZED remark #15462: unmasked indexed (or gather) loads: 14 remark #15463: unmasked indexed (or scatter) stores: 6 remark #15475: --- begin vector cost summary --- remark #15476: scalar cost: 134 remark #15477: vector cost: 191.500 remark #15478: estimated potential speedup: 0.690 remark #15482: vectorized math library calls: 2 remark #15488: --- end vector cost summary --- LOOP END LOOP BEGIN at mystuff_v01.f90(26,25) <Remainder loop for vectorization> LOOP END ===========================================================================

If I remove the OpenMP direction OMP SIMD COLLAPSE and I recompile, the new report is the following :

Intel(R) Advisor can now assist with vectorization and show optimization report messages with your source code. See "https://software.intel.com/en-us/intel-advisor-xe" for details. Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.6.0 Build 20220226_000000 Compiler options: -c -O3 -xHost -ftz -g -traceback -implicitnone -qopt-report=5 -qopt-report-phase=vec -fpp -DGM_VEC -pad -align array64byte -c Begin optimization report for: MYSTUFF::COMPUTE_CORRECTION_ELEM_VELOC Report from: Vector optimizations [vec] LOOP BEGIN at mystuff_v00.f90(23,1) remark #15542: loop was not vectorized: inner loop was already vectorized LOOP BEGIN at mystuff_v00.f90(24,4) <Peeled loop for vectorization> LOOP END LOOP BEGIN at mystuff_v00.f90(24,4) remark #15389: vectorization support: reference veloc(ii,jj,1) has unaligned access [ mystuff_v00.f90(28,7) ] remark #15389: vectorization support: reference v0(ii,jj,1) has unaligned access [ mystuff_v00.f90(28,25) ] remark #15389: vectorization support: reference forces(ii,jj,1) has unaligned access [ mystuff_v00.f90(28,49) ] remark #15389: vectorization support: reference veloc(ii,jj,0) has unaligned access [ mystuff_v00.f90(25,7) ] remark #15389: vectorization support: reference v0(ii,jj,0) has unaligned access [ mystuff_v00.f90(25,25) ] remark #15389: vectorization support: reference forces(ii,jj,0) has unaligned access [ mystuff_v00.f90(25,49) ] remark #15389: vectorization support: reference accel(ii,jj,0) has unaligned access [ mystuff_v00.f90(26,7) ] remark #15389: vectorization support: reference accel(ii,jj,0) has unaligned access [ mystuff_v00.f90(26,25) ] remark #15389: vectorization support: reference forces(ii,jj,0) has unaligned access [ mystuff_v00.f90(26,49) ] remark #15389: vectorization support: reference accel(ii,jj,1) has unaligned access [ mystuff_v00.f90(29,7) ] remark #15389: vectorization support: reference accel(ii,jj,1) has unaligned access [ mystuff_v00.f90(29,25) ] remark #15389: vectorization support: reference forces(ii,jj,1) has unaligned access [ mystuff_v00.f90(29,49) ] remark #15389: vectorization support: reference displ(ii,jj,1) has unaligned access [ mystuff_v00.f90(30,7) ] remark #15389: vectorization support: reference displ(ii,jj,1) has unaligned access [ mystuff_v00.f90(30,25) ] remark #15389: vectorization support: reference veloc(ii,jj,1) has unaligned access [ mystuff_v00.f90(30,49) ] remark #15389: vectorization support: reference v0(ii,jj,1) has unaligned access [ mystuff_v00.f90(30,66) ] remark #15389: vectorization support: reference displ(ii,jj,0) has unaligned access [ mystuff_v00.f90(27,7) ] remark #15389: vectorization support: reference displ(ii,jj,0) has unaligned access [ mystuff_v00.f90(27,25) ] remark #15389: vectorization support: reference veloc(ii,jj,0) has unaligned access [ mystuff_v00.f90(27,49) ] remark #15389: vectorization support: reference v0(ii,jj,0) has unaligned access [ mystuff_v00.f90(27,66) ] remark #15381: vectorization support: unaligned access used inside loop body remark #15305: vectorization support: vector length 4 remark #15309: vectorization support: normalized vectorization overhead 0.892 remark #15300: LOOP WAS VECTORIZED remark #15442: entire loop may be executed in remainder remark #15450: unmasked unaligned unit stride loads: 10 remark #15451: unmasked unaligned unit stride stores: 6 remark #15475: --- begin vector cost summary --- remark #15476: scalar cost: 55 remark #15477: vector cost: 16.250 remark #15478: estimated potential speedup: 3.240 remark #15488: --- end vector cost summary --- LOOP END LOOP BEGIN at mystuff_v00.f90(24,4) <Remainder loop for vectorization> LOOP END LOOP END ===========================================================================

If I look at the final lines of the first report, It seems to me that the estimated potential speedup with the OpenMP directive is very bad with respect to the last one. Is this normal ? What do I miss ?

Thank you for your comments.

Re: How to vectorize two nested fortran loops with OpenMP directive with COLLAPSE clause ?

jimdempseyatthecove — Mon, 20 Feb 2023 18:13:59 GMT

Due to (nz-2, and nx-2):

do jj = 1, nz-2 do ii = 1, nx-2

You have gaps in the blocks of data. You can only vectorize the inner loop. IOW these loops cannot be collapsed.

Jim Dempsey

Re: How to vectorize two nested fortran loops with OpenMP directive with COLLAPSE clause ?

JohnNichols — Mon, 20 Feb 2023 19:09:59 GMT

Jim

I fail to see why the code uses NZ-2 etc.. if this was lisp one would write a few functions to allow for the equations, there has to be a more efficient method?

Re: How to vectorize two nested fortran loops with OpenMP directive with COLLAPSE clause ?

jimdempseyatthecove — Mon, 20 Feb 2023 21:41:45 GMT

Edit,

In looking closer at your code, you have Forces dimensioned as:

real(REAL64), dimension(0:nx-1, 0:nz-1, 0:1), intent(INOUT) :: Forces

Nothing inherently wrong with that, except that your loop is iterating the in indices from 1 to nx-2 and nz-2

This means that indices 0 and nz-1 are skipped over in your do loop. This is the reason why you could not collapse the two loop with vectorization. The collapse requires the collapsed indices to be contiguous.

If you really need collapsing of these loops, then consider dimensioning Forces the same as the other arrays, and then, add the two boundary forces as a separate array.

You will have to trade off "looks odd" with "runs better".

Jim Dempsey

Re: How to vectorize two nested fortran loops with OpenMP directive with COLLAPSE clause ?

jimdempseyatthecove — Mon, 20 Feb 2023 21:48:16 GMT

John,

Assume you have a string you wish to simulate.

Assume the end points are affixed to some structure.

Assume the dynamics of the string is simulated by synthesizing beads along the length of the string.

The forces for motion applies to the beads (indices 1:nz-2)

Yet the forces within the string extends to the two end point fixtures (the 0'th and nz-1'th indices). Thus the forces require two additional variables (in the string example). The OP had something akin to a 2D mesh.

Jim Dempsey

Re: How to vectorize two nested fortran loops with OpenMP directive with COLLAPSE clause ?

jimdempseyatthecove — Mon, 20 Feb 2023 21:50:33 GMT

BTW if the value of nx is large, then a collapse (if possible) would be of negligible value.

Jim Dempsey

Re: How to vectorize two nested fortran loops with OpenMP directive with COLLAPSE clause ?

jimdempseyatthecove — Mon, 20 Feb 2023 22:02:21 GMT

Yann,

If you do take the route of separating the interior forces from the perimeter force then consider

1) In the caller (or module variables), attribute your arrays with:

!DIR$ ATTRIBUTES ALIGN : bytesOfAlignment :: YourArray

2) in your called subroutine use the compiler directive !DIR$ ASSUME_ALIGNED YourArray:bytesOfAlignment

Generally you use 64 for bytesOfAlignment

*** Caution, it is the programmers responsibility to assure the calling arguments meet the assumptions of the called subroutine.

Jim Dempsey

Re: How to vectorize two nested fortran loops with OpenMP directive with COLLAPSE clause ?

Yann_D_ — Tue, 21 Feb 2023 13:58:46 GMT

Hello Mr Dempsey,

I'm glad to read your comments !

Some time has passed since my previous thread on Intel's forums (Can't understand why fortran inner loop not vectorized).

It's the same application and your advices let me strongly improve its performances. Thank you very much for this.

Some remarks to answer your replies.

1. As the situation is a bit more complicated than I thought, I create a small program (attached below) that contains everything to mimic the way the whole application does the computations.

- a module named mytypes.f90 with the simplified definition of the derived type Element

- a module named mystuff.f90 with the computing routines

- a main program to run it, its results are meaningless

- a script compile.sh to compile the source files

- a script run.sh to execute the program

2. The dimensions of array "Forces", are not really a problem for me here, one can change them to the same as other arrays, i.e. 1:nx-2, 1:nz-2

It's done in the attached example.

3. nx and nz are less than ten (5 or 9 ), but not constants through all the spectral elements (spatial discretization method). Moreover if I put a directive like

!DIR$ LOOP COUNT (3,7)

which are the possible lengths for the loops (nx and nz are equal to 5 or 9 in my examples) the compiler does not vectorize anymore : the performances fall. So, even on short loops like these, vectorization gives a real gain. So no loop count directive. This is the reason why I try to collapse the two loops to get a longer one.

4. I try to put the two attribute directives you talked about in this example but the only place the compiler accepts the directive "!$DIR$ ASSUME_ALIGNED " is in mytypes.f90 not in the called routine. I don't manage to compile with it in mystuff.f90 . If I try, I get messages like this :

mystuff.f90(26): remark #7867: This directive is misplaced or not supported on this platform.

(I keep them in comments in mystuff.f90)

With 2. and 4. , I compile the code and in the reports, I still get lots of bad messages like

remark #15328: vectorization support: irregularly indexed load was emulated for the variable <veloc(ii,jj,1)>, part of index is nonlinearly computed [ mystuff.f90(49,49) ]

So I should have something wrong !

Regards,

Re: How to vectorize two nested fortran loops with OpenMP directive with COLLAPSE clause ?

jimdempseyatthecove — Tue, 21 Feb 2023 18:15:55 GMT

I made changes to your code and it now vectorizes.

main.f90 is unchanged.

module mystuff use mytypes real(REAL64) :: bega contains subroutine Correction_Elem_Veloc (Elem, bega, gam1, dt) implicit none type(Element), intent(INOUT) :: Elem real(REAL64), intent(IN) :: bega, gam1, dt real(REAL64) :: sc2 integer :: nx, nz nx = Elem%ngllx nz = Elem%ngllz sc2 = bega * dt CALL compute_correction_Elem_Veloc (Elem%Accel, Elem%Forces, Elem%Displ, & Elem%Veloc, Elem%V0, gam1, dt, sc2, nx*nz) end subroutine Correction_Elem_Veloc end module mystuff ! Move out of the module contained procedure ! IOW make it an external procedure ! External procedures take the passed arguments as-is ! meaning the arguments can change rank and shape (you specify the size) ! ***** Compile with interface checking off ***** subroutine compute_Correction_Elem_Veloc (Accel, Forces, Displ, Veloc, V0, g1, ddt, sc2, n) use mytypes implicit none integer :: n real(REAL64), dimension(n), intent(INOUT) :: Accel real(REAL64), dimension(n), intent(INOUT) :: Forces real(REAL64), dimension(n), intent(INOUT) :: Displ real(REAL64), dimension(n), intent(INOUT) :: Veloc real(REAL64), dimension(n), intent(IN) :: V0 real(REAL64), intent(IN) :: g1, ddt, sc2 integer :: ii !$OMP SIMD do ii = 1, size(Accel) Veloc (ii) =V0 (ii) + ddt * Forces(ii) Accel (ii) =Accel(ii) + g1 * Forces(ii) Displ (ii) =Displ(ii) + sc2 * (Veloc (ii) +V0(ii) ) end do ! return end subroutine compute_Correction_Elem_Veloc

Using VTune, the majority of the time in compute_Correction_Elem_Veloc is in this section: .B2.13:: ; Preds .B2.13 .B2.12 ; Execution count [5.44e+00] vmovupd ymm9, YMMWORD PTR [rax+r10*8] ;51.18 vmovupd ymm6, YMMWORD PTR [rdx+r10*8] ;51.36 vmovupd ymm7, YMMWORD PTR [rcx+r10*8] ;52.18 vmovdqa ymm8, ymm2 ;51.6 vfmadd213pd ymm8, ymm6, ymm9 ;51.6 vfmadd231pd ymm7, ymm6, ymm1 ;52.6 vaddpd ymm6, ymm8, ymm9 ;53.48 vmovupd YMMWORD PTR [rbx+r10*8], ymm8 ;51.6 vmovupd YMMWORD PTR [rcx+r10*8], ymm7 ;52.6 vfmadd213pd ymm6, ymm0, YMMWORD PTR [rsi+r10*8] ;53.6 vmovupd YMMWORD PTR [rsi+r10*8], ymm6 ;53.6 add r10, 4 ;50.1 cmp r10, r9 ;50.1 jb .B2.13 ; Prob 82% ;50.1

Jim Dempsey

Re: How to vectorize two nested fortran loops with OpenMP directive with COLLAPSE clause ?

jimdempseyatthecove — Tue, 21 Feb 2023 18:19:18 GMT

I used 9,9 for input and ran the main loop 30 billion times (then stopped VTune run after about 30 seconds).

I did modify you main to loop the call for timing purposes.

Jim Dempsey

Re: How to vectorize two nested fortran loops with OpenMP directive with COLLAPSE clause ?

jimdempseyatthecove — Tue, 21 Feb 2023 18:21:59 GMT

OOPS

Change:

CALL compute_correction_Elem_Veloc (Elem%Accel, Elem%Forces, Elem%Displ, & Elem%Veloc, Elem%V0, gam1, dt, sc2, nx*nz)

CALL compute_correction_Elem_Veloc (Elem%Accel, Elem%Forces, Elem%Displ, & Elem%Veloc, Elem%V0, gam1, dt, sc2, size(Elem%Accel))

Jim Dempsey

Re: How to vectorize two nested fortran loops with OpenMP directive with COLLAPSE clause ?

JohnNichols — Tue, 21 Feb 2023 18:50:22 GMT

Jim

I understand your points, the real issue is this is a computationally time intensive analysis. You will seek to speed it up, but even the fastest horse in the world can only run so fast. Sometimes you need a chevy.

Good improvement, perhaps a small Honda civic.

John

Re: How to vectorize two nested fortran loops with OpenMP directive with COLLAPSE clause ?

Yann_D_ — Wed, 22 Feb 2023 17:25:58 GMT

Thank you for your replies.

I know the trick of changing a 3D array to a 1D array when passing it as an argument to get a longer loop but I thought there were a way to do the job without it.

So I will implement your idea in the real application and see what I get back !

Thank you.

Re: How to vectorize two nested fortran loops with OpenMP directive with COLLAPSE clause ?

jimdempseyatthecove — Wed, 22 Feb 2023 19:36:28 GMT

Would I be correct that your application will have a large number of these elements?

If this be so, then consider:

type Elements ... real(REAL64) :: dimension(:,:,:,:) :: Forces ! allocate( Forces(nx, nz, 2, nElements) ) ...

IOW the entire blob of forces (etc...) can be worked upon in a single call.

This will reduce the number of calls and more importantly, reduce the number of times the code performs the alignment checks and peel preamble, and post-able remainder to once for nElements.

Jim Dempsey

Re: How to vectorize two nested fortran loops with OpenMP directive with COLLAPSE clause ?

Yann_D_ — Thu, 23 Feb 2023 15:57:57 GMT

The number of elements can be quite large but the treatment is not the same for all elements. It's an unstructured mesh. The boundary elements are treated differently. And there are some other kind of elements too. Maybe a reordering may help but it implies lots of modifications in different parts of the source code. I'll talk about this to the other developpers.

Regards,

Yann

Re: How to vectorize two nested fortran loops with OpenMP directive with COLLAPSE clause ?

jimdempseyatthecove — Thu, 23 Feb 2023 19:24:23 GMT

Yann,

With some expansion of your simplified test program, you can model the step integration for a representative number of elements. The same model would contain the two or more different structures and implementation for your elements. Making timing test runs will give you a metric for a cost/benefit ratio.

>>treatment is not the same for all elements.

When the treatment is static per element, then you can have different classes of elements, each executed using their specific treatment.

When the treatment is dynamic per element, but the frequency and degree of change in treatment is small, then you may be able to employ a technique of incorporating "NULL" elements in each class. IOW elements that are computed, but have no effect on the results. To move an element from one class to another, you nullify it in the current class and then overlay a NULL element in the destination class. This technique permits the arrays to be allocated once (with spare entries of NULLs for element movement) or have a minimal amount of reallocation.

You have to be careful about division by 0.0. Depending on your model, this might involve making the mass or charge 0.0 or other such property that facilitates no net change in results while avoiding an IF test and branch. The idea is to get the most out of SIMD execution.

If you need some consulting help, I am available.

Jim Dempsey