- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

I'm trying to vectorize this loop in Fortran XE2019 u5:

DO iRadius = 1, nRadii Grev(iRadius, 1)=Dradius(iRadius-1)*dt*A1rev(iRadius, 1)+Brev(iRadius, 1)-sRadius(iRadius-1)*omSq(iRun)*dt*A2rev(iRadius, 1) Grev(iRadius, 2)=Dradius(iRadius)*dt*A1rev(iRadius, 2)+Brev(iRadius, 2)-sRadius(iRadius)*omSq(iRun)*dt*A2rev(iRadius, 2) Grev(iRadius, 3)=Dradius(iRadius+1)*dt*A1rev(iRadius, 3)+Brev(iRadius, 3)-sRadius(iRadius+1)*omSq(iRun)*dt*A2rev(iRadius, 3) END DO L_R2_Serial

All the variables except iRadius are REAL(8). Grev is declared in this subroutine as DIMENSION(10000, 3), INTENT(OUT). A1rev, A2rev and Brev are declared in a separate module as DIMENSION(10000, 3). There is also the OpenMP line

!$OMP THREADPRIVATE(Brev, A1rev, A2rev)

Compiling with /O3 in VisualStudio 2019 gives, in the optimization report

LOOP BEGIN at ...MyProgram.f90(414,1) ...MyProgram.f90(488,9):remark #25084: Preprocess Loopnests: Moving Out Store ...MyProgram.f90(414,1):remark #17104: loop was not parallelized: existence of parallel dependence ...MyProgram.f90(414,1):remark #17106: parallel dependence: assumed FLOW dependence between GREV(IRADIUS,1) (420:13) and A1REV (427:13) ...MyProgram.f90(414,1):remark #17106: parallel dependence: assumed ANTI dependence between A1REV (427:13) and GREV(IRADIUS,1) (420:13) ...MyProgram.f90(414,1):remark #15344: loop was not vectorized: vector dependence prevents vectorization ...MyProgram.f90(414,1):remark #15346: vector dependence: assumed FLOW dependence between GREV(IRADIUS,1) (420:13) and A1REV (427:13) ...MyProgram.f90(414,1):remark #15346: vector dependence: assumed ANTI dependence between A1REV (427:13) and GREV(IRADIUS,1) (420:13) ...MyProgram.f90(414,1):remark #25015: Estimate of max trip count of loop=10002 LOOP END

However, if I remove the OMP threadprivate line, I get

LOOP BEGIN at ...MyProgram.f90(414,1) ...MyProgram.f90(488,9):remark #25084: Preprocess Loopnests: Moving Out Store ...MyProgram.f90(420,13):remark #15388: vectorization support: reference GREV(IRADIUS,1) has aligned access ...MyProgram.f90(420,13):remark #15388: vectorization support: reference DRADIUS(IRADIUS-1) has aligned access ...MyProgram.f90(420,48):remark #15388: vectorization support: reference A1REV(IRADIUS,1) has aligned access ...MyProgram.f90(420,51):remark #15388: vectorization support: reference BREV(IRADIUS,1) has aligned access ...MyProgram.f90(420,69):remark #15388: vectorization support: reference SRADIUS(IRADIUS-1) has aligned access ...MyProgram.f90(420,116):remark #15388: vectorization support: reference A2REV(IRADIUS,1) has aligned access ...MyProgram.f90(423,13):remark #15388: vectorization support: reference GREV(IRADIUS,2) has aligned access ...MyProgram.f90(423,13):remark #15388: vectorization support: reference DRADIUS(IRADIUS) has aligned access ...MyProgram.f90(423,46):remark #15388: vectorization support: reference A1REV(IRADIUS,2) has aligned access ...MyProgram.f90(423,49):remark #15388: vectorization support: reference BREV(IRADIUS,2) has aligned access ...MyProgram.f90(423,67):remark #15388: vectorization support: reference SRADIUS(IRADIUS) has aligned access ...MyProgram.f90(423,112):remark #15388: vectorization support: reference A2REV(IRADIUS,2) has aligned access ...MyProgram.f90(427,13):remark #15388: vectorization support: reference GREV(IRADIUS,3) has aligned access ...MyProgram.f90(427,13):remark #15388: vectorization support: reference DRADIUS(IRADIUS+1) has aligned access ...MyProgram.f90(427,48):remark #15388: vectorization support: reference A1REV(IRADIUS,3) has aligned access ...MyProgram.f90(427,51):remark #15388: vectorization support: reference BREV(IRADIUS,3) has aligned access ...MyProgram.f90(427,69):remark #15388: vectorization support: reference SRADIUS(IRADIUS+1) has aligned access ...MyProgram.f90(427,116):remark #15388: vectorization support: reference A2REV(IRADIUS,3) has aligned access ...MyProgram.f90(414,1):remark #15305: vectorization support: vector length 4 ...MyProgram.f90(414,1):remark #15309: vectorization support: normalized vectorization overhead 0.400 ...MyProgram.f90(414,1):remark #15300: LOOP WAS VECTORIZED ...MyProgram.f90(414,1):remark #15448: unmasked aligned unit stride loads: 15 ...MyProgram.f90(414,1):remark #15449: unmasked aligned unit stride stores: 3 ...MyProgram.f90(414,1):remark #15475: --- begin vector cost summary --- ...MyProgram.f90(414,1):remark #15476: scalar cost: 65 ...MyProgram.f90(414,1):remark #15477: vector cost: 11.250 ...MyProgram.f90(414,1):remark #15478: estimated potential speedup: 5.770 ...MyProgram.f90(414,1):remark #15488: --- end vector cost summary --- ...MyProgram.f90(414,1):remark #25015: Estimate of max trip count of loop=2500 LOOP END

which is what I'm trying to achieve. Unfortunately, due to parallelization in the caller of this subroutine, A1rev, A2rev and Brev need to be different for each thread.

I don't understand why the optimizer is seeing this flow and anti flow dependence. In fact, A1rev etc are not written in this subroutine, only in the caller.

Can someone suggest what is necessary to get the vectorization I am looking for?

Link Copied

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Not sure why the difference. As a work around, the procedure could be passed in Brev, A1rev, A2rev as dummy arguments (tpBrev, tpA1rev, tpA2rev), then those used in the loop.

Note, it is difficult to assess OpenMP issues without seeing the full context of the !$OMP directives. In this case "**Preprocess Loopnests: Moving Out Store**" is suspicious of a nested loop. Can you rewrite your code sketching anything related to how the parallel region, inclusive of nesting, is organized (you can elide non-pertinent code using ...).

Jim Dempsey

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

@jimdempseyatthecove: Here's an outline showing the OMP structure:

Subroutine GOF: !$OMP PARALLEL DEFAULT(SHARED) !$OMP DO PRIVATE(iRun, nRadii, line, ... DO iRun=1, nRuns ... CALL CS(..., iRun) ... END DO ! iRun !$OMP END DO !$OMP BARRIER !$OMP END PARALLEL ----------------------------------------------- Subroutine CS: REAL(8), DIMENSION(10000, 3) :: Grev nThreads= GC(... , Grev) ----------------------------------------------- Function GC: USE CI USE RI REAL(8), DIMENSION(10000), INTENT(OUT) :: Grev REAL(8), DIMENSION(0:10001) :: Dradius, sRadius ... L_R2_Serial: & DO iRadius = 1, nRadii Grev(iRadius, 1)=Dradius(iRadius-1)*dt*A1rev(iRadius, 1)+Brev(iRadius, 1)-sRadius(iRadius-1)*omSq(iRun)*dt*A2rev(iRadius, 1) Grev(iRadius, 2)=Dradius(iRadius)*dt*A1rev(iRadius, 2)+Brev(iRadius, 2)-sRadius(iRadius)*omSq(iRun)*dt*A2rev(iRadius, 2) Grev(iRadius, 3)=Dradius(iRadius+1)*dt*A1rev(iRadius, 3)+Brev(iRadius, 3)-sRadius(iRadius+1)*omSq(iRun)*dt*A2rev(iRadius, 3) END DO L_R2_Serial ----------------------------------------------- Module CI: !DIR$ IF DEFINED(ALIGNED) !DIR$ attributes align : 64 :: Brev, A1rev, A2rev !DIR$ ENDIF REAL(8), DIMENSION(10000, 3 ) :: Brev, A1rev, A2rev !$OMP THREADPRIVATE(Brev, A1rev, A2rev) ! **** if this is removed, vectorization is successful REAL(8) :: dt ----------------------------------------------- Module RI: REAL(8), DIMENSION(32) :: omSq

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Shape of Grev on line 23 does not match shape used on line 28.

As sketched, it is unknown as to if Grev on line 16 is a local array to subroutine CS or a dummy argument of CS or a global (module) array.

If Grev on line 16 references the same memory for all threads, then the DO loop on line 27 is not correct.

Do CS or GC contain !$OMP directives, if so, include this with your sketch.

Jim Dempsey

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Would the following change help ? It could have the benefit of processing memory sequentially

do k = 1,3 DO iRadius = 1, nRadii Grev(iRadius, k)=Brev(iRadius, k) + dt*( Dradius(iRadius+k-2)*A1rev(iRadius, k) - sRadius(iRadius+k-2)*A2rev(iRadius, k)*omSq(iRun) ) END DO end do

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

@jimdempseyatthecove: I made a mistake in copying. The correct declaration of Grev in GC is

REAL(8), DIMENSION(10000, 3), INTENT(OUT) :: Grev

As far as !$OMP directives, CS does contain a few !$OMP CRITICAL / !$OMP END CRITICAL surrounding debugging outputs that should be contiguous for each thread, but nothing else. GC did contain a number or them, but I removed them all for this test.

@John Campbell: This is a good idea. The original code was DO iRadius / DO k, then it was changed to reverse the subscripts of Grev, then the k loop was unrolled, then the subscripts were reversed back to the original!

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

In addition to implementing John Campbell's suggestion, I used jimdempseyatthecove's idea of changing A1rev, A2rev and Brev from threadprivate in module CI to local in CS, and passed as arguments to all subroutines that used them, including GC. Now the code is

DO iRadius = 1, nRadii DO j = 1, 3 k = j - 2 ! -1, 0, +1 Grev(iRadius, j)=Dradius(iRadius+k)*dt*A1rev(iRadius, j) + Brev(iRadius, j) - sRadius(iRadius+k)*omSq(iRun)*dt*A2rev(iRadius, j) END DO END DO

and the optimization report shows vectorization as shown in the second report in my initial post.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Is there any difference if you have the alternative DO order:

DO j = 1, 3 k = j - 2 ! -1, 0, +1 DO iRadius = 1, nRadii

This would process sequentially and could be faster for AVX for large nRadii, although could already be changed by optimisation ?

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

@John Campbell: The loop order you suggested (j, then iRadius) is slightly faster (9:06 vs 8:48 m:ss), but there is something strange in the optimization report.

For DO iRadius / DO j, I get

LOOP BEGIN at ...MyProgram.f90(416,1) ...MyProgram.f90(504,5):remark #25084: Preprocess Loopnests: Moving Out Store ...MyProgram.f90(425,13):remark #15388: vectorization support: reference GREV(IRADIUS,J) has aligned access ...MyProgram.f90(425,13):remark #15388: vectorization support: reference DRADIUS(IRADIUS+J-2) has aligned access ...MyProgram.f90(425,52):remark #15388: vectorization support: reference A1REV(IRADIUS,J) has aligned access ...MyProgram.f90(425,72):remark #15388: vectorization support: reference BREV(IRADIUS,J) has aligned access ...MyProgram.f90(425,70):remark #15388: vectorization support: reference SRADIUS(IRADIUS+J-2) has aligned access ...MyProgram.f90(425,124):remark #15388: vectorization support: reference A2REV(IRADIUS,J) has aligned access ...MyProgram.f90(425,13):remark #15388: vectorization support: reference GREV(IRADIUS,J) has aligned access ...MyProgram.f90(425,13):remark #15388: vectorization support: reference DRADIUS(IRADIUS+J-2) has aligned access ...MyProgram.f90(425,52):remark #15388: vectorization support: reference A1REV(IRADIUS,J) has aligned access ...MyProgram.f90(425,72):remark #15388: vectorization support: reference BREV(IRADIUS,J) has aligned access ...MyProgram.f90(425,70):remark #15388: vectorization support: reference SRADIUS(IRADIUS+J-2) has aligned access ...MyProgram.f90(425,124):remark #15388: vectorization support: reference A2REV(IRADIUS,J) has aligned access ...MyProgram.f90(425,13):remark #15388: vectorization support: reference GREV(IRADIUS,J) has aligned access ...MyProgram.f90(425,13):remark #15388: vectorization support: reference DRADIUS(IRADIUS+J-2) has aligned access ...MyProgram.f90(425,52):remark #15388: vectorization support: reference A1REV(IRADIUS,J) has aligned access ...MyProgram.f90(425,72):remark #15388: vectorization support: reference BREV(IRADIUS,J) has aligned access ...MyProgram.f90(425,70):remark #15388: vectorization support: reference SRADIUS(IRADIUS+J-2) has aligned access ...MyProgram.f90(425,124):remark #15388: vectorization support: reference A2REV(IRADIUS,J) has aligned access ...MyProgram.f90(416,1):remark #15305: vectorization support: vector length 4 ...MyProgram.f90(416,1):remark #15309: vectorization support: normalized vectorization overhead 0.400 ...MyProgram.f90(416,1):remark #15300: LOOP WAS VECTORIZED ...MyProgram.f90(416,1):remark #15448: unmasked aligned unit stride loads: 15 ...MyProgram.f90(416,1):remark #15449: unmasked aligned unit stride stores: 3 ...MyProgram.f90(416,1):remark #15475: --- begin vector cost summary --- ...MyProgram.f90(416,1):remark #15476: scalar cost: 65 ...MyProgram.f90(416,1):remark #15477: vector cost: 11.250 ...MyProgram.f90(416,1):remark #15478: estimated potential speedup: 5.770 ...MyProgram.f90(416,1):remark #15488: --- end vector cost summary ---

But for DO j / DO iRadius, I get

LOOP BEGIN at ...MyProgram.f90(419,9) ...MyProgram.f90(421,9):remark #25084: Preprocess Loopnests: Moving Out Store ...MyProgram.f90(419,9):remark #17108: loop was not parallelized: insufficient computational work ...MyProgram.f90(420,13):remark #15389: vectorization support: reference GREV(IRADIUS,J) has unaligned access ...MyProgram.f90(420,13):remark #15388: vectorization support: reference DRADIUS(IRADIUS+J-2) has aligned access ...MyProgram.f90(420,52):remark #15389: vectorization support: reference A1REV(IRADIUS,J) hasunaligned access...MyProgram.f90(420,72):remark #15389: vectorization support: reference BREV(IRADIUS,J) has unaligned access ...MyProgram.f90(420,70):remark #15388: vectorization support: reference SRADIUS(IRADIUS+J-2) has aligned access ...MyProgram.f90(420,124):remark #15389: vectorization support: reference A2REV(IRADIUS,J) has unaligned access ...MyProgram.f90(419,9):remark #15381: vectorization support: unaligned access used inside loop body ...MyProgram.f90(419,9):remark #15305: vectorization support: vector length 4 ...MyProgram.f90(419,9):remark #15399: vectorization support: unroll factor set to 4 ...MyProgram.f90(419,9):remark #15309: vectorization support: normalized vectorization overhead 0.171 ...MyProgram.f90(419,9):remark #15300: LOOP WAS VECTORIZED ...MyProgram.f90(419,9):remark #15448: unmasked aligned unit stride loads: 2 ...MyProgram.f90(419,9):remark #15450: unmasked unaligned unit stride loads: 3 ...MyProgram.f90(419,9):remark #15451: unmasked unaligned unit stride stores: 1 ...MyProgram.f90(419,9):remark #15475: --- begin vector cost summary --- ...MyProgram.f90(419,9):remark #15476: scalar cost: 22 ...MyProgram.f90(419,9):remark #15477: vector cost: 10.250 ...MyProgram.f90(419,9):remark #15478: estimated potential speedup: 2.140 ...MyProgram.f90(419,9):remark #15488: --- end vector cost summary --- ...MyProg

One difference is the aligned vs unaligned arrays. However, the arrays are allocated aligned to 64 bytes and the first dimension is 10000, which is divisible by 8, so the columns are also 64-byte aligned. The only difference between the two builds is the reversal of the DO loops.

The other difference is the "unit stride loads"; not sure what this means.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Try this (untested code):

DO j = 1, 3 k = j - 2 ! -1, 0, +1 ! use ASSOCIATE to re-interpret 2D arrays as a 1D slice ASSOCIATE(qGrev=>Grev(:,j), qA1Rev=>A1rev(:,j), qBrev=>Brev(:,j), qA2Rev=>A2rev(:,j)) ! inform the compiler of alignment (that you asserted was true) !DIR$ ASSUME_ALIGNED qGrev:64, qA1rev:64, qBrev:64, qA2rev:64 DO iRadius = 1, nRadii qGrev(iRadius)=Dradius(iRadius+k)*dt*qA1rev(iRadius) + qBrev(iRadius) - sRadius(iRadius+k)*omSq(iRun)*dt*qA2rev(iRadius) END DO !DIR$ END ASSOCIATE END DO

"unit stride" means an array slice has elements that are adjacent in memory A(1,J), A(2), A(3,J),... as opposed to A(I,1), A(I,2), A(I,3),...

Jim Dempsey

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page