Vectorization and OpenMP threadprivate

AONym · ‎09-27-2019

I'm trying to vectorize this loop in Fortran XE2019 u5:

DO iRadius = 1, nRadii
    Grev(iRadius, 1)=Dradius(iRadius-1)*dt*A1rev(iRadius, 1)+Brev(iRadius, 1)-sRadius(iRadius-1)*omSq(iRun)*dt*A2rev(iRadius, 1)
    Grev(iRadius, 2)=Dradius(iRadius)*dt*A1rev(iRadius, 2)+Brev(iRadius, 2)-sRadius(iRadius)*omSq(iRun)*dt*A2rev(iRadius, 2)
    Grev(iRadius, 3)=Dradius(iRadius+1)*dt*A1rev(iRadius, 3)+Brev(iRadius, 3)-sRadius(iRadius+1)*omSq(iRun)*dt*A2rev(iRadius, 3)
END DO L_R2_Serial

All the variables except iRadius are REAL(8). Grev is declared in this subroutine as DIMENSION(10000, 3), INTENT(OUT). A1rev, A2rev and Brev are declared in a separate module as DIMENSION(10000, 3). There is also the OpenMP line

!$OMP THREADPRIVATE(Brev, A1rev, A2rev)

Compiling with /O3 in VisualStudio 2019 gives, in the optimization report

LOOP BEGIN at ...MyProgram.f90(414,1)
...MyProgram.f90(488,9):remark #25084: Preprocess Loopnests: Moving Out Store 
...MyProgram.f90(414,1):remark #17104: loop was not parallelized: existence of parallel dependence
...MyProgram.f90(414,1):remark #17106: parallel dependence: assumed FLOW dependence between GREV(IRADIUS,1) (420:13) and A1REV (427:13)
...MyProgram.f90(414,1):remark #17106: parallel dependence: assumed ANTI dependence between A1REV (427:13) and GREV(IRADIUS,1) (420:13)
...MyProgram.f90(414,1):remark #15344: loop was not vectorized: vector dependence prevents vectorization
...MyProgram.f90(414,1):remark #15346: vector dependence: assumed FLOW dependence between GREV(IRADIUS,1) (420:13) and A1REV (427:13)
...MyProgram.f90(414,1):remark #15346: vector dependence: assumed ANTI dependence between A1REV (427:13) and GREV(IRADIUS,1) (420:13)
...MyProgram.f90(414,1):remark #25015: Estimate of max trip count of loop=10002
LOOP END

However, if I remove the OMP threadprivate line, I get

LOOP BEGIN at ...MyProgram.f90(414,1)
...MyProgram.f90(488,9):remark #25084: Preprocess Loopnests: Moving Out Store 
...MyProgram.f90(420,13):remark #15388: vectorization support: reference GREV(IRADIUS,1) has aligned access
...MyProgram.f90(420,13):remark #15388: vectorization support: reference DRADIUS(IRADIUS-1) has aligned access
...MyProgram.f90(420,48):remark #15388: vectorization support: reference A1REV(IRADIUS,1) has aligned access
...MyProgram.f90(420,51):remark #15388: vectorization support: reference BREV(IRADIUS,1) has aligned access
...MyProgram.f90(420,69):remark #15388: vectorization support: reference SRADIUS(IRADIUS-1) has aligned access
...MyProgram.f90(420,116):remark #15388: vectorization support: reference A2REV(IRADIUS,1) has aligned access
...MyProgram.f90(423,13):remark #15388: vectorization support: reference GREV(IRADIUS,2) has aligned access
...MyProgram.f90(423,13):remark #15388: vectorization support: reference DRADIUS(IRADIUS) has aligned access
...MyProgram.f90(423,46):remark #15388: vectorization support: reference A1REV(IRADIUS,2) has aligned access
...MyProgram.f90(423,49):remark #15388: vectorization support: reference BREV(IRADIUS,2) has aligned access
...MyProgram.f90(423,67):remark #15388: vectorization support: reference SRADIUS(IRADIUS) has aligned access
...MyProgram.f90(423,112):remark #15388: vectorization support: reference A2REV(IRADIUS,2) has aligned access
...MyProgram.f90(427,13):remark #15388: vectorization support: reference GREV(IRADIUS,3) has aligned access
...MyProgram.f90(427,13):remark #15388: vectorization support: reference DRADIUS(IRADIUS+1) has aligned access
...MyProgram.f90(427,48):remark #15388: vectorization support: reference A1REV(IRADIUS,3) has aligned access
...MyProgram.f90(427,51):remark #15388: vectorization support: reference BREV(IRADIUS,3) has aligned access
...MyProgram.f90(427,69):remark #15388: vectorization support: reference SRADIUS(IRADIUS+1) has aligned access
...MyProgram.f90(427,116):remark #15388: vectorization support: reference A2REV(IRADIUS,3) has aligned access
...MyProgram.f90(414,1):remark #15305: vectorization support: vector length 4
...MyProgram.f90(414,1):remark #15309: vectorization support: normalized vectorization overhead 0.400
...MyProgram.f90(414,1):remark #15300: LOOP WAS VECTORIZED
...MyProgram.f90(414,1):remark #15448: unmasked aligned unit stride loads: 15 
...MyProgram.f90(414,1):remark #15449: unmasked aligned unit stride stores: 3 
...MyProgram.f90(414,1):remark #15475: --- begin vector cost summary ---
...MyProgram.f90(414,1):remark #15476: scalar cost: 65 
...MyProgram.f90(414,1):remark #15477: vector cost: 11.250 
...MyProgram.f90(414,1):remark #15478: estimated potential speedup: 5.770 
...MyProgram.f90(414,1):remark #15488: --- end vector cost summary ---
...MyProgram.f90(414,1):remark #25015: Estimate of max trip count of loop=2500
LOOP END

which is what I'm trying to achieve. Unfortunately, due to parallelization in the caller of this subroutine, A1rev, A2rev and Brev need to be different for each thread.

I don't understand why the optimizer is seeing this flow and anti flow dependence. In fact, A1rev etc are not written in this subroutine, only in the caller.

Can someone suggest what is necessary to get the vectorization I am looking for?

jimdempseyatthecove · ‎09-28-2019

Not sure why the difference. As a work around, the procedure could be passed in Brev, A1rev, A2rev as dummy arguments (tpBrev, tpA1rev, tpA2rev), then those used in the loop.

Note, it is difficult to assess OpenMP issues without seeing the full context of the !$OMP directives. In this case "Preprocess Loopnests: Moving Out Store" is suspicious of a nested loop. Can you rewrite your code sketching anything related to how the parallel region, inclusive of nesting, is organized (you can elide non-pertinent code using ...).

Jim Dempsey

AONym · ‎09-28-2019

@jimdempseyatthecove: Here's an outline showing the OMP structure:

Subroutine GOF:

!$OMP PARALLEL DEFAULT(SHARED)
!$OMP DO PRIVATE(iRun, nRadii, line,  ...
DO iRun=1, nRuns
...
   CALL CS(..., iRun)
...
END DO ! iRun
!$OMP END DO
!$OMP BARRIER
!$OMP END PARALLEL
-----------------------------------------------
Subroutine CS:

REAL(8), DIMENSION(10000, 3) :: Grev
nThreads= GC(... , Grev)
-----------------------------------------------
Function GC:

USE CI
USE RI
REAL(8), DIMENSION(10000), INTENT(OUT) :: Grev
REAL(8), DIMENSION(0:10001) :: Dradius, sRadius
...
L_R2_Serial: &
    DO iRadius = 1, nRadii
        Grev(iRadius, 1)=Dradius(iRadius-1)*dt*A1rev(iRadius, 1)+Brev(iRadius, 1)-sRadius(iRadius-1)*omSq(iRun)*dt*A2rev(iRadius, 1)
        Grev(iRadius, 2)=Dradius(iRadius)*dt*A1rev(iRadius, 2)+Brev(iRadius, 2)-sRadius(iRadius)*omSq(iRun)*dt*A2rev(iRadius, 2)
        Grev(iRadius, 3)=Dradius(iRadius+1)*dt*A1rev(iRadius, 3)+Brev(iRadius, 3)-sRadius(iRadius+1)*omSq(iRun)*dt*A2rev(iRadius, 3)
    END DO L_R2_Serial
-----------------------------------------------
Module CI:

!DIR$ IF DEFINED(ALIGNED)
!DIR$ attributes align : 64 :: Brev, A1rev, A2rev
!DIR$ ENDIF
REAL(8), DIMENSION(10000, 3 ) :: Brev, A1rev, A2rev
!$OMP THREADPRIVATE(Brev, A1rev, A2rev)   ! **** if this is removed, vectorization is successful
REAL(8) :: dt
-----------------------------------------------
Module RI:

REAL(8), DIMENSION(32) :: omSq

jimdempseyatthecove · ‎09-29-2019

Shape of Grev on line 23 does not match shape used on line 28.

As sketched, it is unknown as to if Grev on line 16 is a local array to subroutine CS or a dummy argument of CS or a global (module) array.

If Grev on line 16 references the same memory for all threads, then the DO loop on line 27 is not correct.

Do CS or GC contain !$OMP directives, if so, include this with your sketch.

Jim Dempsey

John_Campbell · ‎09-29-2019

Would the following change help ? It could have the benefit of processing memory sequentially

do k = 1,3
  DO iRadius = 1, nRadii
    Grev(iRadius, k)=Brev(iRadius, k) + dt*( Dradius(iRadius+k-2)*A1rev(iRadius, k) - sRadius(iRadius+k-2)*A2rev(iRadius, k)*omSq(iRun) )
  END DO
end do

AONym · ‎09-29-2019

@jimdempseyatthecove: I made a mistake in copying. The correct declaration of Grev in GC is

REAL(8), DIMENSION(10000, 3), INTENT(OUT) :: Grev

As far as !$OMP directives, CS does contain a few !$OMP CRITICAL / !$OMP END CRITICAL surrounding debugging outputs that should be contiguous for each thread, but nothing else. GC did contain a number or them, but I removed them all for this test.

@John Campbell: This is a good idea. The original code was DO iRadius / DO k, then it was changed to reverse the subscripts of Grev, then the k loop was unrolled, then the subscripts were reversed back to the original!

AONym · ‎09-29-2019

In addition to implementing John Campbell's suggestion, I used jimdempseyatthecove's idea of changing A1rev, A2rev and Brev from threadprivate in module CI to local in CS, and passed as arguments to all subroutines that used them, including GC. Now the code is

DO iRadius = 1, nRadii
    DO j = 1, 3
        k = j - 2 ! -1, 0, +1
        Grev(iRadius, j)=Dradius(iRadius+k)*dt*A1rev(iRadius, j) + Brev(iRadius, j) - sRadius(iRadius+k)*omSq(iRun)*dt*A2rev(iRadius, j)
    END DO
END DO

and the optimization report shows vectorization as shown in the second report in my initial post.

John_Campbell · ‎09-29-2019

Is there any difference if you have the alternative DO order:

DO j = 1, 3
   k = j - 2 ! -1, 0, +1
   DO iRadius = 1, nRadii

This would process sequentially and could be faster for AVX for large nRadii, although could already be changed by optimisation ?

AONym · ‎10-03-2019

@John Campbell: The loop order you suggested (j, then iRadius) is slightly faster (9:06 vs 8:48 m:ss), but there is something strange in the optimization report.

For DO iRadius / DO j, I get

LOOP BEGIN at ...MyProgram.f90(416,1)
...MyProgram.f90(504,5):remark #25084: Preprocess Loopnests: Moving Out Store 
...MyProgram.f90(425,13):remark #15388: vectorization support: reference GREV(IRADIUS,J) has aligned access
...MyProgram.f90(425,13):remark #15388: vectorization support: reference DRADIUS(IRADIUS+J-2) has aligned access
...MyProgram.f90(425,52):remark #15388: vectorization support: reference A1REV(IRADIUS,J) has aligned access
...MyProgram.f90(425,72):remark #15388: vectorization support: reference BREV(IRADIUS,J) has aligned access
...MyProgram.f90(425,70):remark #15388: vectorization support: reference SRADIUS(IRADIUS+J-2) has aligned access
...MyProgram.f90(425,124):remark #15388: vectorization support: reference A2REV(IRADIUS,J) has aligned access
...MyProgram.f90(425,13):remark #15388: vectorization support: reference GREV(IRADIUS,J) has aligned access
...MyProgram.f90(425,13):remark #15388: vectorization support: reference DRADIUS(IRADIUS+J-2) has aligned access
...MyProgram.f90(425,52):remark #15388: vectorization support: reference A1REV(IRADIUS,J) has aligned access
...MyProgram.f90(425,72):remark #15388: vectorization support: reference BREV(IRADIUS,J) has aligned access
...MyProgram.f90(425,70):remark #15388: vectorization support: reference SRADIUS(IRADIUS+J-2) has aligned access
...MyProgram.f90(425,124):remark #15388: vectorization support: reference A2REV(IRADIUS,J) has aligned access
...MyProgram.f90(425,13):remark #15388: vectorization support: reference GREV(IRADIUS,J) has aligned access
...MyProgram.f90(425,13):remark #15388: vectorization support: reference DRADIUS(IRADIUS+J-2) has aligned access
...MyProgram.f90(425,52):remark #15388: vectorization support: reference A1REV(IRADIUS,J) has aligned access
...MyProgram.f90(425,72):remark #15388: vectorization support: reference BREV(IRADIUS,J) has aligned access
...MyProgram.f90(425,70):remark #15388: vectorization support: reference SRADIUS(IRADIUS+J-2) has aligned access
...MyProgram.f90(425,124):remark #15388: vectorization support: reference A2REV(IRADIUS,J) has aligned access
...MyProgram.f90(416,1):remark #15305: vectorization support: vector length 4
...MyProgram.f90(416,1):remark #15309: vectorization support: normalized vectorization overhead 0.400
...MyProgram.f90(416,1):remark #15300: LOOP WAS VECTORIZED
...MyProgram.f90(416,1):remark #15448: unmasked aligned unit stride loads: 15 
...MyProgram.f90(416,1):remark #15449: unmasked aligned unit stride stores: 3 
...MyProgram.f90(416,1):remark #15475: --- begin vector cost summary ---
...MyProgram.f90(416,1):remark #15476: scalar cost: 65 
...MyProgram.f90(416,1):remark #15477: vector cost: 11.250 
...MyProgram.f90(416,1):remark #15478: estimated potential speedup: 5.770 
...MyProgram.f90(416,1):remark #15488: --- end vector cost summary ---

But for DO j / DO iRadius, I get

LOOP BEGIN at ...MyProgram.f90(419,9)
...MyProgram.f90(421,9):remark #25084: Preprocess Loopnests: Moving Out Store 
...MyProgram.f90(419,9):remark #17108: loop was not parallelized: insufficient computational work
...MyProgram.f90(420,13):remark #15389: vectorization support: reference GREV(IRADIUS,J) has unaligned access
...MyProgram.f90(420,13):remark #15388: vectorization support: reference DRADIUS(IRADIUS+J-2) has aligned access
...MyProgram.f90(420,52):remark #15389: vectorization support: reference A1REV(IRADIUS,J) has unaligned access
...MyProgram.f90(420,72):remark #15389: vectorization support: reference BREV(IRADIUS,J) has unaligned access
...MyProgram.f90(420,70):remark #15388: vectorization support: reference SRADIUS(IRADIUS+J-2) has aligned access
...MyProgram.f90(420,124):remark #15389: vectorization support: reference A2REV(IRADIUS,J) has unaligned access
...MyProgram.f90(419,9):remark #15381: vectorization support: unaligned access used inside loop body
...MyProgram.f90(419,9):remark #15305: vectorization support: vector length 4
...MyProgram.f90(419,9):remark #15399: vectorization support: unroll factor set to 4
...MyProgram.f90(419,9):remark #15309: vectorization support: normalized vectorization overhead 0.171
...MyProgram.f90(419,9):remark #15300: LOOP WAS VECTORIZED
...MyProgram.f90(419,9):remark #15448: unmasked aligned unit stride loads: 2 
...MyProgram.f90(419,9):remark #15450: unmasked unaligned unit stride loads: 3 
...MyProgram.f90(419,9):remark #15451: unmasked unaligned unit stride stores: 1 
...MyProgram.f90(419,9):remark #15475: --- begin vector cost summary ---
...MyProgram.f90(419,9):remark #15476: scalar cost: 22 
...MyProgram.f90(419,9):remark #15477: vector cost: 10.250 
...MyProgram.f90(419,9):remark #15478: estimated potential speedup: 2.140 
...MyProgram.f90(419,9):remark #15488: --- end vector cost summary ---
...MyProg

One difference is the aligned vs unaligned arrays. However, the arrays are allocated aligned to 64 bytes and the first dimension is 10000, which is divisible by 8, so the columns are also 64-byte aligned. The only difference between the two builds is the reversal of the DO loops.

The other difference is the "unit stride loads"; not sure what this means.

jimdempseyatthecove · ‎10-04-2019

Try this (untested code):

DO j = 1, 3
    k = j - 2 ! -1, 0, +1
    ! use ASSOCIATE to re-interpret 2D arrays as a 1D slice
    ASSOCIATE(qGrev=>Grev(:,j), qA1Rev=>A1rev(:,j), qBrev=>Brev(:,j), qA2Rev=>A2rev(:,j))
    ! inform the compiler of alignment (that you asserted was true)
    !DIR$ ASSUME_ALIGNED qGrev:64, qA1rev:64, qBrev:64, qA2rev:64
    DO iRadius = 1, nRadii
        qGrev(iRadius)=Dradius(iRadius+k)*dt*qA1rev(iRadius) + qBrev(iRadius) - sRadius(iRadius+k)*omSq(iRun)*dt*qA2rev(iRadius)
    END DO
    !DIR$ END ASSOCIATE
END DO

"unit stride" means an array slice has elements that are adjacent in memory A(1,J), A(2), A(3,J),... as opposed to A(I,1), A(I,2), A(I,3),...

Jim Dempsey