IFort not vectorizing loop in specific cases

Grund__Aalexander · ‎04-03-2014

I have a big set of code with OMP4.0 directives (target, simd...)
In one module the compiler throws lot's of warnings about "loops not vectorized with simd" although it should.

I cut the code down to the bare minimum that still produces this behaviour:

   SUBROUTINE simdTest

      IMPLICIT NONE

      INTEGER ::  i, j, k, sr, tn,nzb,nzt,nxl,nxr,nys,nyn
      REAL    ::  s1, s2, s3, s4
      REAL, DIMENSION(:,:,:), ALLOCATABLE :: u,v,pt,rmask,sums_l
      REAL, DIMENSION(:,:), ALLOCATABLE :: usws,vsws,shf

      !$omp parallel do schedule(runtime) private(s1,s2,s3)
      DO  k = nzb, nzt+1
        !$omp simd collapse( 2 ) reduction( +: s1, s2, s3 )
        DO  i = nxl, nxr
           DO  j =  nys, nyn
              s1 = s1 + u(k,j,i)  * rmask(j,i,sr)
              s2 = s2 + v(k,j,i)  * rmask(j,i,sr)
              s3 = s3 + pt(k,j,i) * rmask(j,i,sr)
           ENDDO
        ENDDO
        sums_l(k,1,tn) = s1
        sums_l(k,2,tn) = s2
        sums_l(k,4,tn) = s3
      ENDDO

      !$omp parallel do reduction( +: s1, s2, s3, s4) schedule(runtime)
      DO  i = nxl, nxr
       DO  j =  nys, nyn
          s1 = s1 + usws(j,i) * rmask(j,i,sr)
          s2 = s2 + vsws(j,i) * rmask(j,i,sr)
          s3 = s3 + shf(j,i)  * rmask(j,i,sr)
          s4 = s4 + 0.0
       ENDDO
      ENDDO
      sums_l(nzb,12,tn) = s1
      sums_l(nzb,14,tn) = s2
      sums_l(nzb,16,tn) = s3

   END SUBROUTINE

If you compile this with "ifort -openmp -O2" it will warn about the first loop. If you remove literally anything (even from second loop) it will vectorize.
Message from vec-report is "subscript to complex".

Could you explain that? IMO not vectorizing the first loop would lead to significant performance loss.

TimP · ‎04-03-2014

I'm afraid this is too much cut down to be a useful example. I'm concerned about the uninitialized reduction variables, unallocated arrays, and do-nothing reduction on s4.

Without the omp simd reduction, the compiler would be influenced by the architecture target in deciding whether to vectorize a sum reduction with a strided operand (and it usually makes a good decision). If you could put the k index last on u,v, pt it would be a lot more favorable to this organization of loops.

Grund__Aalexander · ‎04-03-2014

The uninitialized variables do not influence the optimization as far as I could see. In the real code, they are globals in a separate module and set before calling the function.

As far as I remember, using the reduction(+: list) clause will initialize the variables in the list with 0.

However here is another version of the code:

   SUBROUTINE simdTest

      IMPLICIT NONE

      INTEGER ::  i, j, k, sr, tn,nzb,nzt,nxl,nxr,nys,nyn
      REAL    ::  s1, s2, s3, s4
      REAL, DIMENSION(:,:,:), ALLOCATABLE :: u,v,pt,rmask,sums_l,rflags_invers
      REAL, DIMENSION(:,:), ALLOCATABLE :: usws,vsws,shf
      
      nzb = 0; nxl = 0; nys = 0; sr = 5; tn = 0
      nzt = 100; nxr = 100; nyn = 100
      allocate(u(nzt+1,nyn,nxr))
      allocate(v(nzt+1,nyn,nxr))
      allocate(pt(nzt+1,nyn,nxr))
      allocate(rmask(nzt+1,nyn,nxr))
      allocate(rflags_invers(nzt+1,nyn,nxr))
      allocate(sums_l(nzt+1,20,1))
      allocate(usws(nyn,nxr))
      allocate(vsws(nyn,nxr))
      allocate(shf(nyn,nxr))

      !$omp parallel do schedule(runtime) private(s1,s2,s3)
      DO  k = nzb, nzt+1
        s1=0;s2=0;s3=0
        !$omp simd collapse( 2 ) reduction( +: s1, s2, s3 )
        DO  i = nxl, nxr
           DO  j =  nys, nyn
              s1 = s1 + u(k,j,i)  * rmask(j,i,sr) * rflags_invers(j,i,k+1)
              s2 = s2 + v(k,j,i)  * rmask(j,i,sr) * rflags_invers(j,i,k+1)
              s3 = s3 + pt(k,j,i) * rmask(j,i,sr) * rflags_invers(j,i,k+1)
           ENDDO
        ENDDO
        sums_l(k,1,tn) = s1
        sums_l(k,2,tn) = s2
        sums_l(k,4,tn) = s3
      ENDDO

      s1=0;s2=0;s3=0;s4=0
      !$omp parallel do reduction( +: s1, s2, s3, s4) schedule(runtime)
      DO  i = nxl, nxr
       DO  j =  nys, nyn
          s1 = s1 + usws(j,i) * rmask(j,i,sr) * rflags_invers(j,i,k+1)
          s2 = s2 + vsws(j,i) * rmask(j,i,sr) * rflags_invers(j,i,k+1)
          s3 = s3 + shf(j,i)  * rmask(j,i,sr) * rflags_invers(j,i,k+1)
          s4 = s4 + shf(j,i)
       ENDDO
      ENDDO
      sums_l(nzb,12,tn) = s1
      sums_l(nzb,14,tn) = s2
      sums_l(nzb,16,tn) = s3

   END SUBROUTINE

Please note that the loop order is reasonable with the additional array (it is there in real code) and also changing the s4 reduction does not influence the behavior either.

jimdempseyatthecove · ‎04-03-2014

At issue here is your index order for the arrays are not favorable to vectorization. Either the loops should be reordered or the dimensions of u, v, pt be reorganized to index as (j,i,k).

Jim Dempsey

Grund__Aalexander · ‎04-03-2014

Well I cannot change the order of the array dimensions. reordering the loops for u,v,pt will make them worse for rflags_invers

But strangely: If you remove anything (e.g. s4 = s4 + shf(j,i) from 2nd loop) then the first loop gets vectorized. Can you explain that?

jimdempseyatthecove · ‎04-03-2014

For this loop (with stride indexing) I would look at the disassembly window from the debugger or VTune (not the assembler output as it will not be fully optimized). Merely seeing vectorized or not in the report is insufficient to tell what was done.

>>worse for rflags_invers

rflags_inverse is read once per iteration and used three times. This cost should be less than three strided fetches.

**** alternate suggestion

SUBROUTINE simdTest

   IMPLICIT NONE

   INTEGER ::  i, j, k, sr, tn,nzb,nzt,nxl,nxr,nys,nyn
   REAL    ::  s1, s2, s3, s4
   REAL, DIMENSION(:,:,:), ALLOCATABLE :: u,v,pt,rmask,sums_l,rflags_invers
   REAL, DIMENSION(:,:), ALLOCATABLE :: usws,vsws,shf, k_u, k_v, k_pt ! ****
   
   nzb = 0; nxl = 0; nys = 0; sr = 5; tn = 0
   nzt = 100; nxr = 100; nyn = 100
   allocate(u(nzt+1,nyn,nxr))
   allocate(v(nzt+1,nyn,nxr))
   allocate(pt(nzt+1,nyn,nxr))
   ! vvvvvvv
   allocate(k_u(nyn,nxr))  ! ****************
   allocate(k_v(nyn,nxr))  ! ****************
   allocate(k_pt(nyn,nxr)) ! ****************
   ! ^^^^^^^
   allocate(rmask(nzt+1,nyn,nxr))
   allocate(rflags_invers(nzt+1,nyn,nxr))
   allocate(sums_l(nzt+1,20,1))
   allocate(usws(nyn,nxr))
   allocate(vsws(nyn,nxr))
   allocate(shf(nyn,nxr))

   !$omp parallel do schedule(runtime) private(s1,s2,s3)
   DO  k = nzb, nzt+1
     s1=0;s2=0;s3=0
     k_u = u(k,:,:)   ! ****************
     k_v = v(k,:,:)   ! ****************
     k_pt = pt(k,:,:) ! ****************
     !$omp simd collapse( 2 ) reduction( +: s1, s2, s3 )
     DO  i = nxl, nxr
        DO  j =  nys, nyn
           s1 = s1 + k_u(j,i)  * rmask(j,i,sr) * rflags_invers(j,i,k+1) ! ****************
           s2 = s2 + k_v(j,i)  * rmask(j,i,sr) * rflags_invers(j,i,k+1) ! ****************
           s3 = s3 + k_pt(j,i) * rmask(j,i,sr) * rflags_invers(j,i,k+1) ! ****************
        ENDDO
     ENDDO
     sums_l(k,1,tn) = s1
     sums_l(k,2,tn) = s2
     sums_l(k,4,tn) = s3
   ENDDO

   s1=0;s2=0;s3=0;s4=0
   !$omp parallel do reduction( +: s1, s2, s3, s4) schedule(runtime)
   DO  i = nxl, nxr
    DO  j =  nys, nyn
       s1 = s1 + usws(j,i) * rmask(j,i,sr) * rflags_invers(j,i,k+1)
       s2 = s2 + vsws(j,i) * rmask(j,i,sr) * rflags_invers(j,i,k+1)
       s3 = s3 + shf(j,i)  * rmask(j,i,sr) * rflags_invers(j,i,k+1)
       s4 = s4 + shf(j,i)
    ENDDO
   ENDDO
   sums_l(nzb,12,tn) = s1
   sums_l(nzb,14,tn) = s2
   sums_l(nzb,16,tn) = s3

END SUBROUTINE

Jim Dempsey