Intel® Fortran Compiler

Counting within a SIMD loop in Fortran

AThar2
Beginner

I made this simplified test code

program openmp_test

implicit none


type :: t_data
   integer :: size_
   integer, allocatable :: counter(:)
end type t_data

type t_stat
   integer :: trc(20)
end type t_stat


type(t_data)  :: td
type(t_stat) :: ts
integer                    :: i , ip

allocate(td% counter(100))
td%size_ = 100

ts% trc = 3

td% counter = 10

ip = 40

!$OMP SIMD 
do i = 1, 10  ! the loop I want to vectorize
   ip = ip + 1
   td% counter(ip) = td% counter(ip) + 1
   ts% trc(8)      = ts% trc(8) + 1
enddo


do i = 0, 11
   print*, td% counter(i+40)
enddo

print*, "TRC" , ts% trc(8)


end program openmp_test

Now my question is: how can I correctly/safely write this loop so that ip is incremented correctly and my ts% trc counter comes out correct?

I am aware of some of the different clauses; however, some seem not to work with SIMD. For example, if I write !$OMP SIMD SHARED(ip), the compiler complains that it was not expecting such a clause, and similarly for FIRSTPRIVATE. Is that intentional?

Also, there seems to be an internal compiler error when doing reduction(+:ts%trc(8)). Does the reduction clause normally accept derived-type components? If so, I will report this to the Intel support team.
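To make the question concrete, this is the kind of rewrite I am wondering about (just a sketch; ts_add is a local integer I made up for the reduction, and in this toy example ip is simply an offset plus the loop index):

ip     = 40
ts_add = 0          ! local integer, stands in for ts% trc(8) inside the loop

!$OMP SIMD REDUCTION(+:ts_add)
do i = 1, 10
   td% counter(ip + i) = td% counter(ip + i) + 1   ! index computed from i, no carried dependence
   ts_add              = ts_add + 1
enddo

ts% trc(8) = ts% trc(8) + ts_add
ip         = ip + 10

Is that the kind of approach one is supposed to take here, or is there a clause combination I am missing?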

I am using ifort 19.0.1

jimdempseyatthecove
Honored Contributor III
      noz   = 60
      nsl   = 120
      randw = 10*noz*nsl

      allocate(part% duma(noz*nsl*randw),part% dumb(noz*nsl*randw),&
      &        part% dumc(noz*nsl*randw),part% dumd(noz*nsl*randw),&
      &        part% mass(randw*noz*nsl),part% xmom(randw*noz*nsl),part% ymom(randw*noz*nsl),&
      &        part% zmom(randw*noz*nsl),ic(randw*noz*nsl))
      randw = 10

On my test system I have 16GB

Each allocation is 60*120*10*60*120 elements (* 4 bytes for real or integer) = 2,073,600,000 bytes; times 9 allocations == ~18 GB.

While the allocations succeeded in virtual memory, the Windows page file was thrashing horribly. I assume the allocation size for this test program should have been randw and not randw*noz*nsl.

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

In Debug build, you do have an index out of range:

   ind = 0
   do fn = 1, noz
   do inj= 1, nsl
!$OMP SIMD private(ind)
   do rw = 1, randw
      ind         = ind + rw   ! index reaches 72001 (size+1), out of range

Two issues with the above. First, private may need to be firstprivate so that the initial value of ind is copied in. *** However, this may not be your intention.

*** but this is not what the serial code will do. Because:

1) assuming ind on first iteration starts at 0 (Debug build it will, Release ???) ind has the values 1, 3, 6,  10, 15, 21, 28, 36...
2) On subsequent outer iterations it resumes from its previous value in single-threaded mode; in parallel mode the additional threads may resume with junk.
3) In any event (either event) ind will eventually index out of range of allocation (but may not have with allocation as in #22 above)
4) Due to ind being non-sequential, there is no way that this loop can be SIMD-ized.

The following modifications to your code vectorize the inner loop. **** However, this may or may not be what you intended to do.

   module mod_part

      implicit none
      type part1
         real , allocatable :: duma(:), dumb(:), dumc(:), dumd (:)
         real  , allocatable :: mass(:), xmom(:), ymom(:), zmom(:)
      contains
         procedure, pass(this) :: ADD_FUEL_SOURCES
         procedure, pass(this) :: dum_loop
      end type

   contains

      subroutine ADD_FUEL_SOURCES(this, m1, m2, u1,v1,w1, nd, ic)
!$OMP DECLARE SIMD(ADD_FUEL_SOURCES) UNIFORM(this)
      implicit none
      class(part1), intent(inout)    :: this
      integer, intent(in)            :: ic
      real, intent(in)               :: nd, m1,m2,w1,u1,v1
      real                           :: amd, dmasum, dm, mc

      dm = m1  - m2

      amd = nd * dm


      this% mass(ic) = this% mass(ic) + amd
      this% xmom(ic) = this% xmom(ic) + amd*u1
      this% ymom(ic) = this% ymom(ic) + amd*v1
      this% zmom(ic) = this% zmom(ic) + amd*w1
   !   this% enthalpy(ic) = this% enthalpy(ic) + nd*hfuel



      end subroutine ADD_FUEL_SOURCES

      subroutine dum_loop(this,size_)

      implicit none
      class(part1), intent(inout)    :: this
      integer, intent(in)            :: size_
      integer                        :: i

!$OMP SIMD
      do i = 1, size_
         this% mass(i) = i + 1-size_+size_/4.3
      enddo

      end subroutine dum_loop

   end module


   !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
      program TEST_OMPSIMD

      use mod_part

      implicit none


      integer :: noz, inj, rw , nrw, nsl , fn ,i, randw, nozcnt, injcnt, ind, indoffset
      integer, allocatable :: ic(:)
      real :: m1,m2, u1,v1,w1, nd

      type(part1) :: part

      noz   = 60
      nsl   = 120
      nrw   = 10
      randw = nrw*noz*nsl

      allocate(part% duma(randw),part% dumb(randw),&
      &        part% dumc(randw),part% dumd(randw),&
      &        part% mass(randw),part% xmom(randw),part% ymom(randw),&
      &        part% zmom(randw),ic(randw))
      randw = 10
      ind = 0
      do i = 1, randw
         if( mod(i,3) /= 0 ) then
            ind = 1
         else
            ind = ind + 1
         endif
      enddo
         ic = ind

nozcnt = 0
injcnt = 0

! assuming linear array allocation, dum?(nrw*noz*nsl) is equivalent to dum?(nrw, noz, nsl)
      do fn = 1, noz
      do inj= 1, nsl
          indoffset = (inj-1)*nrw + (fn-1)*nrw*nsl
!$OMP SIMD private(ind)
      do rw = 1, nrw
         ind         = indoffset + rw
         part% duma( ind ) = REAL(fn*rw*inj-rw*inj )
         part% dumb( ind ) = REAL(fn*rw*inj-1-rw*inj)
         part% dumc( ind ) = REAL(fn*rw*inj-2-rw*inj)
         part% dumd( ind ) = REAL(fn*rw*inj-3-rw*inj)
         m1 = part% duma( ind )-part% dumb( ind )
         nd = part% dumc(ind)*m1-part% dumd( ind )
         m2 = m1 + nd
         u1 = part% duma(ind)/part% dumb(ind)
         v1 = part% dumb(ind)/100.
         w1 = part% dumb(ind)
      !   ic = rw
         !print*, m1,m2, u1,v1,w1,nd,ic
         call part% ADD_FUEL_SOURCES(m1,m2, u1,v1,w1,nd,ic(ind))
      enddo

      enddo
      enddo

      print*, sum(part%duma)+sum(part%dumb)+sum(part%dumc)+sum(part%dumd)
      print*, sum(part%mass ), sum(part%ymom )




      call part% dum_loop(10000)
      print*, sum(part% mass(1:10000))

      end program TEST_OMPSIMD

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

Note, private(ind) is valid now because it is not read before it is written. Also, if later you OpenMP-parallelize the outer loop, or the outer two loops with collapse(2), the code will still work, assuming no overlapping ranges of ind amongst threads.
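For illustration, a sketch (not tested) of what that might look like, with the loop body taken from the program above. Note that the indirect update through ic(ind) inside ADD_FUEL_SOURCES is only safe this way if the ic values do not overlap between threads:

!$OMP PARALLEL DO COLLAPSE(2) PRIVATE(indoffset, ind, rw, m1, m2, u1, v1, w1, nd)
      do fn = 1, noz
      do inj= 1, nsl
          indoffset = (inj-1)*nrw + (fn-1)*nrw*nsl
!$OMP SIMD PRIVATE(ind)
      do rw = 1, nrw
         ind = indoffset + rw
         part% duma( ind ) = REAL(fn*rw*inj-rw*inj )
         part% dumb( ind ) = REAL(fn*rw*inj-1-rw*inj)
         part% dumc( ind ) = REAL(fn*rw*inj-2-rw*inj)
         part% dumd( ind ) = REAL(fn*rw*inj-3-rw*inj)
         m1 = part% duma( ind )-part% dumb( ind )
         nd = part% dumc(ind)*m1-part% dumd( ind )
         m2 = m1 + nd
         u1 = part% duma(ind)/part% dumb(ind)
         v1 = part% dumb(ind)/100.
         w1 = part% dumb(ind)
         ! safe only if the ic(ind) values are disjoint across threads
         call part% ADD_FUEL_SOURCES(m1,m2, u1,v1,w1,nd,ic(ind))
      enddo
      enddo
      enddo
!$OMP END PARALLEL DO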

Jim Dempsey

AThar2
Beginner

Hello Jim 

Thanks very much indeed.

The only problem I am still seeing, or am not entirely sure about, is what happens when ic has a repeated index within one fn and nsl iteration.

Imagine that my ic is a mapping to another grid. Hence, its size does not depend on the dum? arrays. The same goes for the source terms, which are the same size as ic.

So imagine my rw = 2:6 all got the index ic = 5. In vectorized mode they will each independently access ic = 5 for the mass term and update it. However, I would need all contributions added up at ic = 5.

So I need some sort of reduction, don't I? How can I do this with my arrays mass, xmom, etc. when they are part of the derived type?

jimdempseyatthecove
Honored Contributor III

Fully vectorizing a loop with an indirect index is problematic. The ic(ind) should thwart usual vectorization: while dum?(ind) is accessed contiguously, the values held in ic(ind) are not known to be contiguous. This said, compiler optimizations have come a long way. I haven't looked at the generated assembly code on my test system (Core i7 2600K with AVX), which does not have scatter/gather. The vectorization report stated it vectorized two lanes of the possible 8. It could possibly be that it generated code to loop over ind with a stride of 2 and then check whether ic(ind) and ic(ind+1) are adjacent; if so, call a 2-wide vector version of ADD_FUEL_SOURCES, and if not, call a scalar version of ADD_FUEL_SOURCES twice. Don't know.

This said, if (when) you add parallelization, the compiler will not be able to make any determination about possible conflicts amongst threads. This means that you will have to smarten up your code when you go parallel. As to what to do, this greatly depends on your actual code, in particular what is actually done in your do rw loop and in your ADD_FUEL_SOURCES, IOW which portion to favor in your optimization.

Jim Dempsey

 

AThar2
Beginner

Yes, although IC(1:end) is contiguous, the way we would access it in the rw loop is completely arbitrary. Would a reduction help in this situation?

For example, reduction(+:this% mass). Although it seems that reduction cannot cope with derived-type components, so I would need to find a workaround, or simply use auxiliary arrays for mass etc., do the reduction on those, and then copy the auxiliary arrays back into the derived type after the loop?

Is that usually how it is done?
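Concretely, I mean something like this (just a sketch; mass_tmp and xmom_tmp are made-up auxiliary arrays, and the contributions mirror what ADD_FUEL_SOURCES does):

real, allocatable :: mass_tmp(:), xmom_tmp(:)   ! made-up auxiliary arrays

allocate(mass_tmp(size(part% mass)), xmom_tmp(size(part% xmom)))
mass_tmp = 0.
xmom_tmp = 0.

!$OMP SIMD PRIVATE(ind, m1, m2, nd, u1) REDUCTION(+:mass_tmp, xmom_tmp)
do rw = 1, nrw
   ind = indoffset + rw
   m1  = part% duma(ind) - part% dumb(ind)
   nd  = part% dumc(ind)*m1 - part% dumd(ind)
   m2  = m1 + nd
   u1  = part% duma(ind)/part% dumb(ind)
   ! the ADD_FUEL_SOURCES contributions, accumulated into the local arrays instead
   mass_tmp(ic(ind)) = mass_tmp(ic(ind)) + nd*(m1 - m2)
   xmom_tmp(ic(ind)) = xmom_tmp(ic(ind)) + nd*(m1 - m2)*u1
enddo

! copy the reduced contributions back into the derived type after the loop
part% mass = part% mass + mass_tmp
part% xmom = part% xmom + xmom_tmp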

PS: My rw loop is literally what I have shown, just with many more components of my derived types. But they all involve trivial operations, done inside the loop and independent of each rw iteration. It is only at the end, when I call ADD_FUEL_SOURCES, that I face the indirect indexing.

 

Cheers!

jimdempseyatthecove
Honored Contributor III

>>For example, reduction(+:this% mass).

While that would work, it would be extremely inefficient: not only do you have mass, but you also have xmom, ymom, zmom, ???mom, each allocated to the full extent of the number of particles. It may (may stressed) be better to determine, for each xxx(ic), a list of contributors, that is to say a list of ind's that contribute to the fuel sources. *** But the strategy greatly depends on the computation required between the caller and the callee. For example (not saying this is what you do), in the caller you might use

...
do rw = 1, nrw
  ind         = indoffset + rw
  part% duma( ind ) = REAL(fn*rw*inj-rw*inj )
  part% dumb( ind ) = REAL(fn*rw*inj-1-rw*inj)
  part% dumc( ind ) = REAL(fn*rw*inj-2-rw*inj)
  part% dumd( ind ) = REAL(fn*rw*inj-3-rw*inj)
  m1(ind) = part% duma( ind )-part% dumb( ind )
  nd(ind) = part% dumc(ind)*m1(ind)-part% dumd( ind )
  m2(ind) = m1(ind) + nd(ind)
  u1(ind) = part% duma(ind)/part% dumb(ind)
  v1(ind) = part% dumb(ind)/100.
  w1(ind) = part% dumb(ind)
enddo
! the above loop is fully vectorizable
! but it has not performed:
! call part% ADD_FUEL_SOURCES(m1,m2, u1,v1,w1,nd,ic(ind))

!dir$ if defined(UseSimpleMethod)
  do rw = 1, nrw
    ind         = indoffset + rw
    call part% ADD_FUEL_SOURCES(m1(ind),m2(ind), u1(ind),v1(ind),w1(ind),nd(ind),ic(ind))
  end do
!dir$ else
  !dir$ if defined(Use2WideMethod)
    do rw = 1, nrw, 2
      ind         = indoffset + rw
      if(rw == nrw) then
        call part% ADD_FUEL_SOURCES(m1(ind),m2(ind), u1(ind),v1(ind),w1(ind),nd(ind),ic(ind))
      else
        if(ic(ind) + 1 == ic(ind+1)) then
          ! adjacent cells
          call part% ADD_FUEL_SOURCES_X2(m1(ind:ind+1),m2(ind:ind+1), u1(ind:ind+1),v1(ind:ind+1),w1(ind:ind+1),nd(ind:ind+1),ic(ind)) ! updates cells ic(ind) and ic(ind)+1, i.e. ic(ind+1)
        else
          call part% ADD_FUEL_SOURCES(m1(ind),m2(ind), u1(ind),v1(ind),w1(ind),nd(ind),ic(ind))
          call part% ADD_FUEL_SOURCES(m1(ind+1),m2(ind+1), u1(ind+1),v1(ind+1),w1(ind+1),nd(ind+1),ic(ind+1))
        endif
      endif 
    end do
  !dir$ else
    ... You expand to 4-wide and 8-wide
  !dir$ endif  
!dir$ endif
  ...

Is it more work? Yes. Is it faster? TBD. This depends on the computation recoverable by vectorizing ADD_FUEL_SOURCES.
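To sketch the "list of contributors" idea mentioned above (names invented; assuming the per-ind values m1(ind), m2(ind), nd(ind), ... were stored by the first loop): bucket the ind's by their target cell once, then each cell is updated independently with no conflicting stores:

      integer :: ncell, c, k
      integer, allocatable :: cnt(:), ptr(:), list(:)

      ncell = size(part% mass)
      allocate(cnt(ncell), ptr(ncell+1), list(size(ic)))

      cnt = 0                          ! count contributors per cell
      do ind = 1, size(ic)
         cnt(ic(ind)) = cnt(ic(ind)) + 1
      enddo
      ptr(1) = 1                       ! prefix sum: start of each cell's contributor list
      do c = 1, ncell
         ptr(c+1) = ptr(c) + cnt(c)
      enddo
      cnt = 0                          ! fill the contributor lists
      do ind = 1, size(ic)
         c = ic(ind)
         list(ptr(c) + cnt(c)) = ind
         cnt(c) = cnt(c) + 1
      enddo

      do c = 1, ncell                  ! cells are independent, so no write conflicts
         do k = ptr(c), ptr(c+1) - 1
            ind = list(k)
            part% mass(c) = part% mass(c) + nd(ind)*(m1(ind) - m2(ind))
         enddo
      enddo

Whether building the lists pays for itself again depends on how much work there is per contribution.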

Also not shown is the case where ic(ind) == ic(ind+1) (and possibly ic(ind+2), ...), IOW where adjacent ic(x)'s hold the same index. The question there is whether the arguments to ADD_FUEL_SOURCES can be summed, so that a single call to the scalar variant of the subroutine suffices.

Without seeing the actual code, it is difficult to offer the correct assessment.

Do you need professional help on this?

Jim Dempsey

AThar2
Beginner

Thanks very much for your time Jim!

I am not sure I understand the second part of the code. Is the 'Use2WideMethod' branch meant to make the code vectorize when you have ic(ind) == ic(ind+1), or just to reduce the cost?

 

Also, I am not sure what you mean here:

Also not shown is the case where ic(ind) == ic(ind+1) (and possibly ic(ind+2), ...), IOW where adjacent ic(x)'s hold the same index. The question there is whether the arguments to ADD_FUEL_SOURCES can be summed, so that a single call to the scalar variant of the subroutine suffices.

Is that related to the add_fuel_sources_X2 call?

 

Regarding professional help, I am not sure I am entitled to it (depends on what it involves), as this work is related to my PhD research.

jimdempseyatthecove
Honored Contributor III

When ic(ind) == ic(ind+1), this means that two adjacent ind's in array ic contain the same index value (duplicates) held in ic(ind) and ic(ind+1).

IIF (if and only if) the components (m1(ind), nd(ind), ...) can be added to (m1(ind+1), nd(ind+1), ...) respectively and submitted in a single call to ADD_FUEL_SOURCES, and that gives the same result as making two individual calls to ADD_FUEL_SOURCES, one with (m1(ind), nd(ind), ...) and a second with (m1(ind+1), nd(ind+1), ...), then you will save time. This assumes the frequency of identical values in ic(ind) and ic(ind+1) is sufficient to offset the overhead of deciding whether or not to combine.

The ADD_FUEL_SOURCES_X2 call is different. It would be required (and be called only) when ic(ind) + 1 == ic(ind+1), IOW when two adjacent cells of ic hold two sequential numbers (e.g. ic(ind)==4, ic(ind+1)==5).
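For illustration, a hypothetical shape of such an X2 variant (interface invented, not tested); because the two target cells are adjacent, the updates become contiguous two-element stores:

      subroutine ADD_FUEL_SOURCES_X2(this, m1, m2, u1, v1, w1, nd, ic0)
      ! hypothetical 2-wide variant: the caller guarantees the targets are cells ic0 and ic0+1
      class(part1), intent(inout) :: this
      real, intent(in)            :: m1(2), m2(2), u1(2), v1(2), w1(2), nd(2)
      integer, intent(in)         :: ic0
      real                        :: amd(2)

      amd = nd * (m1 - m2)

      this% mass(ic0:ic0+1) = this% mass(ic0:ic0+1) + amd
      this% xmom(ic0:ic0+1) = this% xmom(ic0:ic0+1) + amd*u1
      this% ymom(ic0:ic0+1) = this% ymom(ic0:ic0+1) + amd*v1
      this% zmom(ic0:ic0+1) = this% zmom(ic0:ic0+1) + amd*w1

      end subroutine ADD_FUEL_SOURCES_X2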

Jim Dempsey
