Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

optimization help/vectorization/SIMD questions

Izaak_Beekman
New Contributor II

Hi,

Consider the following code snippet: 

    do i = 1, size(rhs,dim=2)
       if (n == biggest_int) exit !Overflow!
       n1 = n
       n = n + 1
       n1on = real(n1,WP)/real(n,WP)
       ! Add SIMD dir?
       !!!!DIR$ SIMD PRIVATE(p,k)
       do concurrent (j=1:size(lhs)) ! I'm nervous about p and k getting stepped on
          delta(j) = rhs(j,i) - local_res(j)%M(1)
          local_res(j)%M(1) = local_res(j)%M(1) + delta(j)/real(n,WP)
          !DIR$ LOOP COUNT (1,2,3,4,5)
          do p = local_res(j)%p,2,-1 !iterate backwards because new M4 depends on old M3,M2 etc.
             sum(j) = 0
             !DIR$ LOOP COUNT (0,1,2,3,4)
             do k = 1,p-2
                sum(j) = sum(j) + &
                     local_res(j)%binkp(k,p)*((-delta(j)/n)**k)*n1on*local_res(j)%M(p-k)
             end do
             local_res(j)%M(p) = n1on*local_res(j)%M(p) + sum(j) + &
                  ((delta(j)/n)**p)*(n1**p + n1*((-1)**p))/n
          end do
          local_res(j)%n   = n
          local_res(j)%min = min(lhs(j)%min,rhs(j,i))
          local_res(j)%max = max(lhs(j)%max,rhs(j,i))
       end do
    end do

Note that the outermost do loop has data dependencies… It is performing a data reduction operation over one dimension of rhs.

Iterations of the next do loop (expressed as do concurrent) may be performed in any order. My reading of ‘Modern Fortran Explained’, p. 360:

“any variable referenced is either previously defined in the same iteration, or its value is not affected by any other iteration;”

To me this means that the loop indices contained within this `do concurrent` loop, p and k, are *NOT* in danger of getting stepped on, since they are “previously defined in the same iteration” of the current `do concurrent` loop. Is this correct?

The loops *inside* the `do concurrent` are order dependent. The loop over p moves backwards, because we are updating values of `local_res(j)%M(p)` using the old (relative to the iteration over p) values of `local_res(j)%M(2)` … `local_res(j)%M(p-1)` for the update.

The loop over k has a data dependency because it is a sum reduction. The sum may be performed in any order, so long as two iterations don’t try to write the sum at once.

Typically the upper bound of the `do concurrent` loop will be quite large, so my idea was to apply optimizations to this loop, such as SIMD vectorization. Am I on the right track here?

If the SIMD directive in the snippet above is *not* commented out (by removing 3 of the 4 !s) then the compiler gives an error:

error #8592: Within a SIMD region, a DO-loop control-variable must not be specified in a PRIVATE/REDUCTION/FIRSTPRIVATE/LASTPRIVATE SIMD clause.   

do p = local_res(j)%p,2,-1 !iterate backwards because new M4 depends on old M3,M2 etc. -------------^

Does this mean I don’t need the PRIVATE clause because p and k are already safe from getting stepped on, or does it mean that the directive would tell the compiler that the loops over p and k have no data dependencies?
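My current guess, sketched below (unverified), is that the clause can simply be dropped, since the error suggests the inner DO indices are already treated as private within the SIMD region:

```fortran
! Guess (unverified): drop PRIVATE(p,k); error #8592 suggests the compiler
! already treats the control variables of inner DO loops as private.
!DIR$ SIMD
do j = 1, size(lhs)
   ! ... same loop body as above ...
end do
```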

Any and all advice is greatly welcome.

Thanks!

11 Replies
jimdempseyatthecove
Honored Contributor III

Zaak,

do concurrent implies that you intend multi-threading. Consider:

do i = 1, size(rhs,dim=2)
   if (n == biggest_int) exit !Overflow!
   n1 = n
   n = n + 1
   n1on = real(n1,WP)/real(n,WP)
   !$OMP PARALLEL DO SIMD PRIVATE(p,k)
   do j=1, size(lhs) ! I'm nervous about p and k getting stepped on
      delta(j) = rhs(j,i) - local_res(j)%M(1)
      local_res(j)%M(1) = local_res(j)%M(1) + delta(j)/real(n,WP)
      !DIR$ LOOP COUNT (1,2,3,4,5)
      do p = local_res(j)%p,2,-1 !iterate backwards because new M4 depends on old M3,M2 etc.
         sum(j) = 0
         !DIR$ LOOP COUNT (0,1,2,3,4)
         do k = 1,p-2
            sum(j) = sum(j) + &
                 local_res(j)%binkp(k,p)*((-delta(j)/n)**k)*n1on*local_res(j)%M(p-k)
         end do
         local_res(j)%M(p) = n1on*local_res(j)%M(p) + sum(j) + &
              ((delta(j)/n)**p)*(n1**p + n1*((-1)**p))/n
      end do
      local_res(j)%n   = n
      local_res(j)%min = min(lhs(j)%min,rhs(j,i))
      local_res(j)%max = max(lhs(j)%max,rhs(j,i))
   end do
end do

(unverified)

Jim Dempsey

Izaak_Beekman
New Contributor II

Thanks for the response Jim. A couple of notes:

  1. The do concurrent is not strictly necessary; I’d be happy to use a normal do loop.
  2. There is no guarantee in the standard that compilers will implement do concurrent using threads; they are free to choose.
  3. I’d prefer to work on single-threaded optimizations, because the software may easily be parallelized at a higher/coarser level using MPI/coarrays/OpenMP.
  4. I want to learn more about SIMD, vectorization and pipelining.
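Incidentally, if a compiler with Fortran 2018 support were available, the intent for p and k could be stated directly on the do concurrent itself (a sketch, assuming F2018 locality-specifier support):

```fortran
! Fortran 2018 locality specifiers (sketch; requires F2018 support).
! LOCAL gives each iteration its own p and k, so they cannot be
! stepped on regardless of how iterations are interleaved.
do concurrent (j = 1:size(lhs)) local(p, k)
   ! ... loop body as above ...
end do
```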
Steven_L_Intel1
Employee

Keep in mind that DO CONCURRENT tells the compiler that there are no iteration dependencies.
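A minimal sketch of the distinction:

```fortran
! Fine as DO CONCURRENT: each iteration reads and writes only a(i).
do concurrent (i = 1:n)
   a(i) = 2.0*a(i)
end do

! NOT valid as DO CONCURRENT: a(i) reads a(i-1), which another
! iteration defines, so the result depends on iteration order.
! do concurrent (i = 2:n)
!    a(i) = a(i-1) + 1.0
! end do
```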

Izaak_Beekman
New Contributor II

Steve Lionel (Intel) wrote:

Keep in mind that DO CONCURRENT tells the compiler that there are no iteration dependencies.

Right, the iterations over j may be performed in any order. However, for each j iteration, the loops nested within j are order dependent.

jimdempseyatthecove
Honored Contributor III

Steve,

When you have !$OMP PARALLEL DO SIMD...

and you have OpenMP stubs enabled, does the enclosed loop still have the SIMD attributes/characteristics?

I would imagine that when sequential code generation is enabled, it would not.

Jim Dempsey

Steven_L_Intel1
Employee

SIMD is for vectorization - I am not aware that there is any cross-effect here.

TimP
Honored Contributor III
Inner loops could be SIMD-vectorized if they weren't so short. The LOOP COUNT MAX specification may be sufficient to avoid counterproductive SIMD.
Izaak_Beekman
New Contributor II

While none of these comments has directly answered my questions, I think I’ve realized that the inner loops are preventing me from achieving my goal… While it means writing special code for the most common cases (p=2,3,4), I can eliminate (manually unroll) these loops, and then the j loop (which will have high iteration counts) should be able to be parallelized or vectorized.
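For the record, here is roughly what the unroll might look like for the case `local_res(j)%p == 3`, derived from the general recurrence above (unverified; `d` is a scratch variable introduced here):

```fortran
! Hypothetical manual unroll of the p-loop for local_res(j)%p == 3.
! M(3) must be updated before M(2), because the p=3 update uses the
! old M(2) (this mirrors the backwards p-loop).
d = delta(j)/real(n,WP)
local_res(j)%M(3) = n1on*local_res(j)%M(3) &
     + local_res(j)%binkp(1,3)*(-d)*n1on*local_res(j)%M(2) &
     + d**3*(real(n1,WP)**3 - real(n1,WP))/real(n,WP)
local_res(j)%M(2) = n1on*local_res(j)%M(2) &
     + d**2*(real(n1,WP)**2 + real(n1,WP))/real(n,WP)
```

Converting n and n1 to real(WP) before exponentiation also sidesteps the mixed-mode integer arithmetic in the original expression.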

One parting question: does `do concurrent` always result in the equivalent of !$OMP PARALLEL DO (if the compiler determines it’s worthwhile), or does it ever generate SIMD code with the Intel compiler?

Steven_L_Intel1
Employee

Why not both? DO CONCURRENT helps with vectorization by eliminating the possibility of loop-carried dependencies. If you have -parallel specified, it will put a !DIR$ PARALLEL (not an OpenMP !$OMP PARALLEL) in front of the loop.

TimP
Honored Contributor III

do concurrent will not generate (OpenMP) threaded code unless auto-parallelized by -parallel (and possibly -par-threshold:nn).  If it did so, -par-report would flag it.

Your directives instruct the compiler to generate separate optimized code for each designated count if feasible.  I have doubts.

Izaak_Beekman
New Contributor II

Tim Prince wrote:

do concurrent will not generate (OpenMP) threaded code unless auto-parallelized by -parallel (and possibly -par-threshold:nn).  If it did so, -par-report would flag it.

Your directives instruct the compiler to generate separate optimized code for each designated count if feasible.  I have doubts.

Doubts?
