Counting within a SIMD loop in Fortran

AThar2 · ‎03-21-2019

I made this simplified test code

program openmp_test

implicit none


type :: t_data
   integer :: size_
   integer, allocatable :: counter(:)
end type t_data

type t_stat
   integer :: trc(20)
end type t_stat


type(t_data)  :: td
type(t_stat) :: ts
integer                    :: i , ip

allocate(td% counter(100))
td%size_ = 100

ts% trc = 3

td% counter = 10

ip = 40

!$OMP SIMD 
do i = 1, 10  ! the loop I want to vectorize
   ip = ip + 1
   td% counter(ip) = td% counter(ip) + 1
   ts% trc(8)      = ts% trc(8) + 1
enddo


do i = 0, 11
   print*, td% counter(i+40)
enddo

print*, "TRC" , ts% trc(8)


end program openmp_test

Now my question is how can I correctly/safely make this loop so my ip is incremented corretly and that I get my ts% trc counter correct?

I am aware of some of the different clauses, however, some seem not to work with simd? For example, if I do !$OMP SIMD SHARED(ip) it complains that it was not expecting such clause. Similarly for firstprivate. Is that purposely made?

Also, seems that there is an internal error when doing reduction(+:ts%trc(8)). Does reduction clause normally accepts declared type?. If so, I will report this to intel Support team

I am using ifort 19.0.1

Juergen_R_R · ‎03-21-2019

Internal compiler errors are always errors and should be reported to the Support team.

TimP · ‎03-21-2019

The compiler probably attempts to hoist ip and ts%trc from the loop, unless you have cut back optimization. That would be clearly better; besides the loop is too short for simd. As Juergen said, internal error is always a reportable bug.

omp simd has no firstprivate clause. reduction clause takes care of that and lastprivate. Intel compilers have a habit of breaking when omp simd is set without correct clauses. It should issue a diagnostic rather than an internal error.

AThar2 · ‎03-21-2019

Hello Tim,

This is just an attempt to replicate my real application.

I still did not get how to instruct the compiler about the two counting variables. The problem is that the compiler complains that it CANNOT vectorise the loop because of local assignments. If I e.g. put ip to private (although that would be wrong) it does not complain any more.

Also, does reduction clauses not accept declared type variables?

jimdempseyatthecove · ‎03-21-2019

...
!dir$ if(.false.)
!$OMP SIMD 
do i = 1, 10  ! the loop I want to vectorize
   ip = ip + 1
   td% counter(ip) = td% counter(ip) + 1
   ts% trc(8)      = ts% trc(8) + 1
enddo
!dir$ else
   i = 10
   !dir$ vector
   td% counter(ip+1:ip+i) = td% counter(ip+1:ip+i) + 1
   ts% trc(8)      = ts% trc(8) + i
   ip = ip + i
!dir$ endif
...

Jim Dempsey

AThar2 · ‎03-21-2019

Hello Jim,

Thanks I got your point. However, I my loop is actually larger and a bit more complicated. So with the example I made was to try to show a part of it and what I am trying to do.

My question really is if there is a way to apply reduction clauses?

Would that have worked for example:

!$OMP SIMD REDUCTION(+:trc)
do i = 1, 10  ! the loop I want to vectorize
   ip = ip + 1
   trc(8)      = trc(8) + 1
enddo

jimdempseyatthecove · ‎03-22-2019

The loop you have shown in #6 is operating on a scalar (iow a single element in an array). IOW vectorization is nonsensical. Also, if you intend to parallelize this loop (we will assume the iteration count will be much larger than 10, say N) then:

ip = ip + 1

is likely not what you intend it to be, or, if it is, will not do what you intend it to do.

In the case that it is what you intend it to do (e.g. a counter), then include ip in the reduction clause .OR. place !$omp atomic in front of that statement.

On the other hand, if (when) ip is used as an index, you will need to code something like this:

ip_start = ip
!$OMP SIMD PRIVATE(ip) REDUCTION(+:trc)
do i = 1, N  ! the loop I want to vectorize
   ip = ip_start + i
   trc(8)      = trc(8) + 1
   ...
enddo
ip = ip_start + N

If possible, please show either your actual loop or a facsimile that shows the characteristics of your loop as well as exhibits failure to vectorize.

Jim Dempsey

jimdempseyatthecove · ‎03-22-2019

Note, if ip always starts out at 1, then use do ip=1, N and use ip instead of i in the loop.

Jim Dempsey

AThar2 · ‎03-23-2019

Hello Jim,

I ended up manipulating the loop so it always goes from 1 to N and then doing the indexing inside the loop as you showed.

BTW: is there a tump rule when one should not vectorize since the overhead could be larger.

In one of my example I have three nested loops like

do k = 1, nk  ! ~30-60
do j = 1, nj    ! ~90-300
do i = 1, ni    ! ~10-20 -> the loop to be vectorized

This is repeated millions of time, however, as the most nested loop is only 10-20 it is probably not worth doing anything.

Notice that it is not possible to reverse the order of loop (it is not sensical in my application)

Finally, I know that statements like BREAK/EXIT or while loops are not vectorizable as such, but what about CYCLE/CONTINUE in general.

Thanks very much in advance

jimdempseyatthecove · ‎03-23-2019

>>as the most nested loop is only 10-20 it is probably not worth doing anything

Not so. if your inner loop code is vectorizable, then depending on your vector load (16 floats/8 doubles, 8 floats/4 doubles, or 4 floats/2 doubles), the inner loop trip count can be reduced by the inverse of the vector load. And it looks likely that at your high end loop counts (360,000) that you should consider parallelizing the loops either just the outer loop or the outer two loops with the collapse(2) clause.

>>CYCLE

In the above code (3 nested loops) it is OK to cycle or exit a loop that is neither vectorized nor parallelized as this makes your loop count in determinant. This said, with the loop count as short as the inner loop (10-20) note that unused lanes within a vector cost you nothing (assuming no divide by 0 or other faulting condition). Therefor you can (for this short loop) compute event the extra iterations, and then ignore the meaningless data. For longer vectorized loops, consider batching the each iteration by vector width

! padd arrays to fill vector width
do i=1,niPadded,VectorWidth ! 2, 4, 8, 16
  a(i:i+VectorWidth-1) = ...
  ...

Then afterwards use the unpadded size a(1:ni)

Jim Dempsey

jimdempseyatthecove · ‎03-23-2019

Additional information:

IIF the inner most loop count (ni) is a parameter, (known at compile time) then the compiler may vectorize the code without you doing anything to the code. Aligning and padding may eliminate some unnecessary code.

Jim Dempsey

AThar2 · ‎03-26-2019

jimdempseyatthecove wrote:
>>as the most nested loop is only 10-20 it is probably not worth doing anything
Not so. if your inner loop code is vectorizable, then depending on your vector load (16 floats/8 doubles, 8 floats/4 doubles, or 4 floats/2 doubles), the inner loop trip count can be reduced by the inverse of the vector load. And it looks likely that at your high end loop counts (360,000) that you should consider parallelizing the loops either just the outer loop or the outer two loops with the collapse(2) clause.
>>CYCLE
In the above code (3 nested loops) it is OK to cycle or exit a loop that is neither vectorized nor parallelized as this makes your loop count in determinant. This said, with the loop count as short as the inner loop (10-20) note that unused lanes within a vector cost you nothing (assuming no divide by 0 or other faulting condition). Therefor you can (for this short loop) compute event the extra iterations, and then ignore the meaningless data. For longer vectorized loops, consider batching the each iteration by vector width
! padd arrays to fill vector width
do i=1,niPadded,VectorWidth ! 2, 4, 8, 16
  a(i:i+VectorWidth-1) = ...
  ...
Then afterwards use the unpadded size a(1:ni)

Jim Dempsey

I just came across an issue which I think is related to the lack of not having what you just suggested?

I have a loop of i = 1, novar where at one case novar is 100 (does not seem a high number), however, it is a long routine with a few other routine calls. Everything has been made vectorized (at least according to the optimisation report etc), however, when running with the optimised mode (-O3) I get the error that my array is out of bounds. It is clearly not! I have tried to run at full DEBUG mode and everything runs fine.

If I instead say do i = 1, 100 ---> IT WORKS!

So can I assume that it may be because I need to padd the arrays.

Also what do you mean by

Then afterwards use the unpadded size a(1:ni)

It just means that my entire array of `a` can now be used?

Juergen_R_R · ‎03-26-2019

I have a loop of i = 1, novar where at one case novar is 100 (does not seem a high number), however, it is a long routine with a few other routine calls. Everything has been made vectorized (at least according to the optimisation report etc), however, when running with the optimised mode (-O3) I get the error that my array is out of bounds. It is clearly not! I have tried to run at full DEBUG mode and everything runs fine.
If I instead say do i = 1, 100 ---> IT WORKS!

Are you sure that novar is properly set in your parallelized setup? Maybe it runs out of scope?

AThar2 · ‎03-26-2019

I am not running in parallel at all yet. So I am not sure how else That could run out of scope?

jimdempseyatthecove · ‎03-26-2019

Without seeing your code it is hard to tell what is going on.

IIF you are using the vector scheme presented in #10, then you must padd the array(s) to vector width. With array size of 100 (not padded), and vector width of say 8. Your last reference becomes array(97:97+8-1) or (97:104).

IIF you are not using the vector scheme presented in #10, and not padding, then you have a genuine address outside of bounds.

I suggest you debug what is going on. Note, adding statements to optimized code often hides the problem.

Jim Dempsey

AThar2 · ‎03-26-2019

Thanks Jim.

Two things

how do you debug in vectorized mode? - as soon as I do print statement, the compiler refuses to vectorize and then my code will work.

Is what you mentioned how it always work? Because i have many times used omp simd without payinh attention to padding. For example when I reset certain various arrays before each time step I do

!$omp simd

do i=1,size_

are(i) = 0.

(...)

enddo

I never encountered a problem before now.

Also, when you are saying this. Does it mean I need to consider the indices from 97 to 100 in a serialized mode by having a separete piece of code to account for this?

thanks very much

jimdempseyatthecove · ‎03-26-2019

I said: IIF you are using the vector scheme presented in #10

Your code in #16 is not.

>>Also, when you are saying this. Does it mean I need to consider the indices from 97 to 100 in a serialized mode by having a separete piece of code to account for this?

Well you haven't shown the code. Generally (when coding do I=1,size) you do not have to worry about the partial vector at the end.

However, in a specific case where you pass Array(97) to a subroutine/function that declares the dummy as X(VectorWidth), and then manipulate X as if it had a full width of the vector to use, then you have a coding problem.

Jim Dempsey

AThar2 · ‎03-26-2019

Okay I guess I misunderstod what you meant by vector scheme. But no I am not, I am just doing a conventional do loop

it is basically:

!$omp simd

do I=1,no_particles

enddo

Inside I am running over particles, hence, i am operating over arrays of size 3 (x,y,z)for each loop.

I will try to simplify a case for illustration, however, I am just concerned that case might just be problemfree and it is only when the things become a bit more complicated iget this problem

TimP · ‎03-27-2019

In an attempt to explain what Jim probably means, if you have a loop inside a procedure with a fixed size declared, it is likely to break if you call the procedure with a different vector length. The problem might be exposed by vectorization. When you have argument cross checking enabled or use an explicit interface, it should catch this violation of Fortran standard.

jimdempseyatthecove · ‎03-27-2019

>>get the error that my array is out of bounds

When you get that error... what was the value of the invalid index as compared to the last (first) valid index.

Also, in order to get this error, you must have enabled array out of bounds checking. I think that this disables vectorization, though if it does not, there could be a runtime bug where the oob indication is made on the full width of the vector as opposed to the partial width that may be in effect (vector instruction with mask).

Jim Dempsey

AThar2 · ‎03-27-2019

Thanks for the answers Jim & Tim.

I made some significant progress I would say.

I found two bugs, one which is a silent bug (i.e. the results would be wrong, but the code is valid), and the one which triggers the error I mentioned.

BTW: sometime it says segmentation fault and other time I get array out of bounds. This is only happening when having the OMP SIMD loop enabled.

As I explained my routine which contains the omp simd loop is type bound, hence, it got the declaration (class). IF I change this to type it works!!

I have tried to recreate a simplified version, however, it does not seem to recreate the problem. Perhaps you can early on spot a potential issue.

Two issues:

1) The first section of the code where I have three loops and doing simd loop over `rw=1,randw`. As you observe I am calling a procedure to add some terms into my arrays. Where to put the various terms are determined by `ic`. This ic can be completely arbitrary. Basically, it is an index pointing to which grid cell a particle is located. It means that two or more particles can have THE SAME ic. I tried to recreate that by setting up an ic array that have some repeatability - and if you run a vectorized and serial mode you will realize that the print statements are not giving the same results. Clearly, because my omp simd needs somehow a reduction. But how do I enforce this for such applications where my reduction arrays are obtained from another routine and they are properties of a declared type?

2) The dum_loop routine works. However, in my a bit more lengthty application it did not work before I changed ` class(part1), intent(inout) :: this` to `type(part1), intent(inout) :: this`. Clearly, I had to change the procedure declaration from `pass(this)` to `nopass`

BTW: All calculations are completely random and might be very primitive

   module mod_part

      implicit none
      type part1
         real , allocatable :: duma(:), dumb(:), dumc(:), dumd (:)
         real  , allocatable :: mass(:), xmom(:), ymom(:), zmom(:)
      contains
         procedure, pass(this) :: ADD_FUEL_SOURCES
         procedure, pass(this) :: dum_loop
      end type

   contains

      subroutine ADD_FUEL_SOURCES(this, m1, m2, u1,v1,w1, nd, ic)
!$OMP DECLARE SIMD(ADD_FUEL_SOURCES) UNIFORM(this)
      implicit none
      class(part1), intent(inout)    :: this
      integer, intent(in)            :: ic
      real, intent(in)               :: nd, m1,m2,w1,u1,v1
      real                           :: amd, dmasum, dm, mc

      dm = m1  - m2

      amd = nd * dm


      this% mass(ic) = this% mass(ic) + amd
      this% xmom(ic) = this% xmom(ic) + amd*u1
      this% ymom(ic) = this% ymom(ic) + amd*v1
      this% zmom(ic) = this% zmom(ic) + amd*w1
   !   this% enthalpy(ic) = this% enthalpy(ic) + nd*hfuel



      end subroutine ADD_FUEL_SOURCES

      subroutine dum_loop(this,size_)

      implicit none
      class(part1), intent(inout)    :: this
      integer, intent(in)            :: size_
      integer                        :: i

!$OMP SIMD
      do i = 1, size_
         this% mass(i) = i + 1-size_+size_/4.3
      enddo

      end subroutine dum_loop

   end module


   !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
      program TEST_OMPSIMD

      use mod_part

      implicit none


      integer :: noz, inj, rw , nsl , fn ,i, randw, nozcnt, injcnt, ind
      integer, allocatable :: ic(:)
      real :: m1,m2, u1,v1,w1, nd

      type(part1) :: part

      noz   = 60
      nsl   = 120
      randw = 10*noz*nsl

      allocate(part% duma(noz*nsl*randw),part% dumb(noz*nsl*randw),&
      &        part% dumc(noz*nsl*randw),part% dumd(noz*nsl*randw),&
      &        part% mass(randw*noz*nsl),part% xmom(randw*noz*nsl),part% ymom(randw*noz*nsl),&
      &        part% zmom(randw*noz*nsl),ic(randw*noz*nsl))
      randw = 10
      ind = 0
      do i = 1, randw
         if( mod(i,3 ) ) then
            ind = 1
         else
            ind = ind + 1
         endif
      enddo
         ic = ind

nozcnt = 0
injcnt = 0
   ind = 0


      do fn = 1, noz
      do inj= 1, nsl
!$OMP SIMD private(ind)
      do rw = 1, randw
         ind         = ind + rw
         part% duma( ind ) = REAL(fn*rw*inj-rw*inj )
         part% dumb( ind ) = REAL(fn*rw*inj-1-rw*inj)
         part% dumc( ind ) = REAL(fn*rw*inj-2-rw*inj)
         part% dumd( ind ) = REAL(fn*rw*inj-3-rw*inj)
         m1 = part% duma( ind )-part% dumb( ind )
         nd = part% dumc(ind)*m1-part% dumd( ind )
         m2 = m1 + nd
         u1 = part% duma(ind)/part% dumb(ind)
         v1 = part% dumb(ind)/100.
         w1 = part% dumb(ind)
      !   ic = rw
         !print*, m1,m2, u1,v1,w1,nd,ic
         call part% ADD_FUEL_SOURCES(m1,m2, u1,v1,w1,nd,ic(ind))
      enddo

      enddo
      enddo

      print*, sum(part%duma)+sum(part%dumb)+sum(part%dumc)+sum(part%dumd)
      print*, sum(part%mass ), sum(part%ymom )




      call part% dum_loop(10000)
      print*, sum(part% mass(1:10000))

      end program TEST_OMPSIMD