Why does this not vectorize?

AThar2 · ‎04-15-2019

Hello

I will show a code which shows kind of same structure as my original code. This one, though, does not really do anything physically sensible and certain things might seem strange to you, like e.g. I am converting to double and then returning it single (in the original case there is a reason why - but for now I am trying to keep what I think could spoil my attempt to vectorize).

Unfortunately, I was still not able to reproduce exactly what is happening to my original case, so please allow me to explain the problem.

When I add

!DEC$ ATTRIBUTES VECTOR :: HIT

in the definition of HIT, certain array elements (usually the last 4-5) their results are VERY wrong, orders of thousands. When the attribute vector is remove I get the correct results. Do you know what the reasons could be? Note, I still got the forceinline in the subroutine definition and where I call it, like shown in the example as well.

In the code I am about to show does not have same problem, however, it just seem to hang when I enable the simd loop.

Again, my main problem is the former, but as mentioned the code shown could probably illuminate potential bugs in my original case.

Thanks very much

      program TEST_SIMD

      implicit none

      real,dimension(:), allocatable :: x,y,z, u, v, w, dt
      integer                        :: nopart, i, ns
      real                           :: fluct, dtloc
      integer                        :: general_seed(2), randseed
      data general_seed /123456789, 987654321/
      nopart = 10

      allocate(x(nopart),y(nopart),z(nopart), u(nopart), v(nopart),w(nopart), dt(nopart))

      call RANDOM_SEED( put = general_seed )
      do i = 1, nopart
         call random_number(fluct)
         x(i) = fluct
         call random_number(fluct)
         y(i) = fluct
         call random_number(fluct)
         z(i) = fluct
         call random_number(fluct)
         u(i) = fluct+10.
         call random_number(fluct)
         v(i) = fluct+ 10.
         call random_number(fluct)
         w(i) = fluct + 10.

      enddo

!$OMP SIMD private(dtloc)
      do i = 1, nopart

!DIR$ FORCEINLINE
         call HIT(x(i), y(i), z(i), u(i), v(i), w(i), ns, dtloc, 1, 10 , 1, 10 ,120)
         dt(i) = dtloc

      enddo

      print*, dt

      contains

!DEC$ ATTRIBUTES FORCEINLINE :: HIT
      subroutine HIT(xp, yp, zp, up, vp, wp, ns, dtloc, nf1, nf2 , nn1, nn2, ic)
!DEC$ ATTRIBUTES VECTOR :: HIT

      integer                  , intent(in)    ::  nf1, nf2, nn1, nn2, ic
      real                     , intent(in)    :: xp, yp, zp  !particle loc and vel
      real                     , intent(in)    :: up, vp, wp
      integer                  , intent(out)   :: ns
      real                     , intent(out)   :: dtloc

      real                                     :: time(10000)
      integer                                  :: nf, k, ipc, nn
      integer                                  :: ii, jj, i1, i2, i3
      real(kind=8)                            :: updble(3), xpdble(3)
      real(kind=8)                            :: vecsign, un, xn,normun
      real(kind=8)                            :: vecn(3), xd(3)
      real                                     :: uaux(3)
      real                                     :: tmin
      real                                     :: t1
      real, parameter :: dtstep = 1e-3
      integer                                  :: imin, icor

      ipc = 121
      updble(1) = dble(up)
      updble(2) = dble(vp)
      updble(3) = dble(wp)
      uaux(1)   = up
      uaux(2)   = vp
      uaux(3)   = wp


      xpdble(1) = dble(xp)
      xpdble(2) = dble(yp)
      xpdble(3) = dble(zp)

      ii = 0
      jj = 0

      do nf = nf1, nf2


         if( ipc == ic )then
            vecsign = 1.
         else
            vecsign = -1.
         endif

         xd(1)  =  xp
         xd(2)  =  yp
         xd(3)  =  zp

         do nn = nn1, nn2
            jj       = jj  + 1


            time(jj) = 1.0e6

            un       = vecsign* L2NORM(REAL(updble)) ! we normally do dot_product with a normal vector defined elsewhere, hence in that case the double prec would have been more useful than here.
            xn       = vecsign * L2NORM(REAL(xd))
!DIR$ FORCEINLINE
            normun   = un / L2NORM(uaux)


               ii             = ii + 1
               time(ii) = REAL(abs( xn/un ))
         enddo
      enddo

      ns = ii

      dtloc = 2.*minval( time(1:ns) ) *dtstep


      end subroutine

!DEC$ ATTRIBUTES FORCEINLINE :: L2NORM
      pure real function L2NORM( vec )  ! THIS IS NORMALLY DEFINED IN A DIFFERENT MODULE
!DEC$ ATTRIBUTES VECTOR :: L2NORM

      implicit none

      real, intent(in)     :: vec(3)
      real                 :: tmp

      tmp    = dot_product( vec, vec )
      L2NORM = sqrt( tmp )

      end function L2NORM

      end program TEST_SIMD

jimdempseyatthecove · ‎04-15-2019

On my system HIT is not vectorized due to:

"simd loop was not vectorized: scalar assignment in simd loop is prohibited, consider private, lastprivate or reduction clauses (line: 75, column: 7)"

Note, in MS VS the above message shows in the Compiler Optimization Report window, but not in the annotated source code which states CALL HIT was inlined.

I suggest you consider removing ns from the CALL HIT, and replace on the same call dtloc withy dt(i)

However, when I do that I get Infinity's as output. I am not going to spend any more time on this, as debugging your code should be your responsibility.

A suggestion I have is to use the remove ns, and replace dtloc with dt(i), then...
Simplify (for the compiler in vectorizing) the contents of HIT by making the local arrays in HIT: updbl(1) becomes up_d, updbl(2) becomes vp_d, updbl(3) becomes wp_d, etc And reformulate your L2NORM to take in three args as X, Y, Z, and replace dot_product with sqrt(X*X + Y*Y + Z*Z)

Then see if you get: SIMD, INLINE'd HIT .AND. correct results.

Jim Dempsey

AThar2 · ‎04-15-2019

Hello Jim,

Apologies for the bug, my intention was not for you to get into that. My example was not as clean as I wanted.

thanks for your suggestions. When I applied them, it seemed that the results no longer changed when I had the !DEC$ ATTRIBUTES VECTOR :: HIT.

So that is good, although, it is a bit strange why the error was occurring. Sometimes I do feel the responses from the compiler can be quite random when simd is enabled, or the simd warning to be confusing/misleading.

Regarding the simd message you mentioned. I did also see this in my *.optrt file regarding HIT. Although, this message can sometimes be quite misleading, since when I revise my code over and over , I don't see any variables being assigned without having the private attribute.

Also, sometimes when I have e.g.

this% xd(ip) = bla bla ! where ip is the loop index

Sometimes the compiler likes it and accepts to vectorize it. Other times it complains that there is an assignment, so the loop is not simd vectorized. My thumb rule is just to avoid any structures within a simd loop, although it can be inconvenient sometimes.

jimdempseyatthecove · ‎04-15-2019

>>Sometimes I do feel the responses from the compiler can be quite random when simd is enabled

It is the programmer's responsibility to provide code suitable to be SIMD-ized when using SIMD. Same issue if(when) you program using parallel regions (same with recursive, pure, elemental, ...).

I presume the issue with "simd loop was not vectorized: scalar assignment in simd loop is prohibited" was that you had scalar outs, in particular ns.

Consider the implications of:

call HIT(x(i), y(i), z(i), u(i), v(i), w(i), ns, dtloc, 1, 10 , 1, 10 ,120)
dt(ns) = dtloc

With the above (ns), dt(ns) is not in lock-step with the other arrays x(i),...

While we can look at the code in #1 and see that (ns) is not used, the compiler may have had too many things on its mind to elide ns until later in the optimization phase.

Additional caveat.

In this specific case, all calls to HIT have the same values for nn1 and nn2. This may not necessarily be the case in your actual application. Should they differ on different calls, then each lane of the vector may require to iterate a different number of times in your do nn=nn1,nn2 loop. This won't vector.

Similar issue with ii and jj should these conditionally differ (else why not use do nn=1,nf2-nf1+1 and then use nn in place of ii and jj).

Jim Dempsey

AThar2 · ‎04-19-2019

Hello Jim,

Thank you. Yes, unfortunately on the real application it can indeed differ with regards to values of nn1 and nn2. These refer to some sort of mapping between volume cells and their associated triangulated surfaces. Since the cells can be geometrically different, you can have different number of triangulated surfaces. This is a bit typical when you got mapping - find it difficult to still make it vectorizable.

But are you saying if my nn1 and nn2 are varying from , the execution may terminate the vectorization process ?

What do you mean by that dt(ns) is not in lock-step?

jimdempseyatthecove · ‎04-20-2019

Mathematically a vector can be considered x(1:ns), from the CPU perspective, say one with AVX2, and using REAL(4), a SIMD vector a slice of x(pos:pos+7). or think of this as vec(0:7). With lanes 0-7

Graraphics in a text window is difficult, lets turn this vec vertical. All SIMD loops must execute the same code, and same number of iterations of interior loops should there be internal loops. When "tracing" the potential internal loops, the accesses to/from array elements of the vector must look like:

vec(0) xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
vec(1) xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
vec(2) xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
vec(3) xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
vec(4) xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
vec(5) xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
vec(6) xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
vec(7) xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

IOW same number of reads*, same number of writes*.. same code path, same number of internal loop iterations for each lane in the vector..

*The SIMD instruction set of the CPU can have a read or write of say this 8 lane vector with a mask. The mask can inhibit a lane from being read or written.

Your array x(1:ns) is decomposed into a series of these vec(0:7), though the last vec has 8 lanes, it may have fewer of these lanes producing data that is used. On systems without SIMD masking capability, the remainder part is performed in a scalar manner.

nn1 and nn2 can vary for each entry into the SIMD loop, but they cannot differ for each lane of the SIMD vector.

>> lock-step

x(indexX) decomposed into vecX(indexX:indexX+7)
dt(indexDT) decomposed into vecDT(indexDT:indexDT+7)

where indexX does not necessarily have to equal indexDT

IOW the starting representative indices need not be the same, but the same number of iterations and increments must occur.

Jim Dempsey