Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

optimization help/vectorization/SIMD questions

Izaak_Beekman
New Contributor II

Hi,

Consider the following code snippet: 

    do i = 1, size(rhs,dim=2)
       if (n == biggest_int) exit !Overflow!
       n1 = n
       n = n + 1
       n1on = real(n1,WP)/real(n,WP)
       ! Add SIMD dir?
       !!!!DIR$ SIMD PRIVATE(p,k)
       do concurrent (j=1:size(lhs)) ! I'm nervous about p and k getting stepped on
          delta(j) = rhs(j,i) - local_res(j)%M(1)
          local_res(j)%M(1) = local_res(j)%M(1) + delta(j)/real(n,WP)
          !DIR$ LOOP COUNT (1,2,3,4,5)
          do p = local_res(j)%p,2,-1 !iterate backwards because new M4 depends on old M3,M2 etc.
             sum(j) = 0
             !DIR$ LOOP COUNT (0,1,2,3,4)
             do k = 1,p-2
                sum(j) = sum(j) + &
                     local_res(j)%binkp(k,p)*((-delta(j)/n)**k)*n1on*local_res(j)%M(p-k)
             end do
             local_res(j)%M(p) = n1on*local_res(j)%M(p) + sum(j) + &
                  ((delta(j)/n)**p)*(n1**p + n1*((-1)**p))/n
          end do
          local_res(j)%n   = n
          local_res(j)%min = min(lhs(j)%min,rhs(j,i))
          local_res(j)%max = max(lhs(j)%max,rhs(j,i))
       end do
    end do

Note that the outermost do loop has data dependencies… It is performing a data reduction operation over one dimension of rhs.

Iterations of the next do loop (expressed as do concurrent) may be performed in any order. My reading of ‘Modern Fortran Explained’, p. 360:

“any variable referenced is either previously defined in the same iteration, or its value is not affected by any other iteration;”

To me this means that the loop indices contained within this `do concurrent` loop, p and k, are *NOT* in danger of getting stepped on, since they are “previously defined in the same iteration” of the current `do concurrent` loop. Is this correct?

The loops *inside* the `do concurrent` are order dependent. The loop over p moves backwards, because we are updating values of `local_res(j)%M(p)` using the old (relative to the iteration over p) values of `local_res(j)%M(2)` … `local_res(j)%M(p-1)` for the update.

The loop over k has a data dependency because it is a sum reduction. The sum may be performed in any order, so long as two iterations don’t try to write the sum at once.

Typically the upper bound of the `do concurrent` loop will be quite large, so my idea was to apply optimizations to this loop, such as SIMD vectorization. Am I on the right track here?

If the SIMD directive in the snippet above is *not* commented out (by removing 3 of the 4 !s) then the compiler gives an error:

error #8592: Within a SIMD region, a DO-loop control-variable must not be specified in a PRIVATE/REDUCTION/FIRSTPRIVATE/LASTPRIVATE SIMD clause.   

do p = local_res(j)%p,2,-1 !iterate backwards because new M4 depends on old M3,M2 etc. -------------^

Does this mean I don’t need the PRIVATE clause because p and k are already safe from getting stepped on, or does it mean that the directive would tell the compiler that the loops over p and k have no data dependencies?
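My current guess, sketched below (unverified), is that the clause can simply be dropped, since the error suggests the inner DO indices are already treated as private within the SIMD region:

```fortran
! Guess (unverified): drop PRIVATE(p,k); error #8592 suggests the compiler
! already treats the control variables of inner DO loops as private.
!DIR$ SIMD
do j = 1, size(lhs)
   ! ... same loop body as above ...
end do
```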

Any and all advice is greatly welcome.

Thanks!

11 Replies
jimdempseyatthecove
Honored Contributor III

Zaak,

do concurrent implies that you intend multi-threading. Consider:

do i = 1, size(rhs,dim=2)
   if (n == biggest_int) exit !Overflow!
   n1 = n
   n = n + 1
   n1on = real(n1,WP)/real(n,WP)
   !$OMP PARALLEL DO SIMD PRIVATE(p,k)
   do j=1, size(lhs) ! I'm nervous about p and k getting stepped on
      delta(j) = rhs(j,i) - local_res(j)%M(1)
      local_res(j)%M(1) = local_res(j)%M(1) + delta(j)/real(n,WP)
      !DIR$ LOOP COUNT (1,2,3,4,5)
      do p = local_res(j)%p,2,-1 !iterate backwards because new M4 depends on old M3,M2 etc.
         sum(j) = 0
         !DIR$ LOOP COUNT (0,1,2,3,4)
         do k = 1,p-2
            sum(j) = sum(j) + &
                 local_res(j)%binkp(k,p)*((-delta(j)/n)**k)*n1on*local_res(j)%M(p-k)
         end do
         local_res(j)%M(p) = n1on*local_res(j)%M(p) + sum(j) + &
              ((delta(j)/n)**p)*(n1**p + n1*((-1)**p))/n
      end do
      local_res(j)%n   = n
      local_res(j)%min = min(lhs(j)%min,rhs(j,i))
      local_res(j)%max = max(lhs(j)%max,rhs(j,i))
   end do
end do

(unverified)

Jim Dempsey

Izaak_Beekman
New Contributor II

Thanks for the response Jim. A couple of notes:

  1. The do concurrent is not strictly necessary; I’d be happy to use a normal do loop.
  2. There is no guarantee in the standard that compilers will implement do concurrent using threads; they are free to choose.
  3. I’d prefer to work on single-threaded optimizations, because the software may easily be parallelized at a higher/coarser level using MPI/coarrays/OpenMP.
  4. I want to learn more about SIMD, vectorization and pipelining.
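Incidentally, if a compiler with Fortran 2018 support were available, the intent for p and k could be stated directly on the do concurrent itself (a sketch, assuming F2018 locality-specifier support):

```fortran
! Fortran 2018 locality specifiers (sketch; requires F2018 support).
! LOCAL gives each iteration its own p and k, so they cannot be
! stepped on regardless of how iterations are interleaved.
do concurrent (j = 1:size(lhs)) local(p, k)
   ! ... loop body as above ...
end do
```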
Steven_L_Intel1
Employee

Keep in mind that DO CONCURRENT tells the compiler that there are no iteration dependencies.
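A minimal sketch of the distinction:

```fortran
! Fine as DO CONCURRENT: each iteration reads and writes only a(i).
do concurrent (i = 1:n)
   a(i) = 2.0*a(i)
end do

! NOT valid as DO CONCURRENT: a(i) reads a(i-1), which another
! iteration defines, so the result depends on iteration order.
! do concurrent (i = 2:n)
!    a(i) = a(i-1) + 1.0
! end do
```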

Izaak_Beekman
New Contributor II

Steve Lionel (Intel) wrote:

Keep in mind that DO CONCURRENT tells the compiler that there are no iteration dependencies.

Right, the iterations over j may be performed in any order. However, for each j iteration, the loops nested within j are order dependent.

jimdempseyatthecove
Honored Contributor III

Steve,

When you have !$OMP PARALLEL DO SIMD...

and you have OpenMP stubs enabled, does the enclosed loop still have the SIMD attributes/characteristics?

I would imagine that when sequential code generation is enabled, it would not.

Jim Dempsey

Steven_L_Intel1
Employee

SIMD is for vectorization - I am not aware that there is any cross-effect here.

TimP
Honored Contributor III
Inner loops could be SIMD-vectorized if they weren't so short. The LOOP COUNT MAX specification may be sufficient to avoid counterproductive SIMD.
Izaak_Beekman
New Contributor II

While none of these comments has directly answered my questions, I think I’ve realized that the inner loops are preventing me from achieving my goal… While it means writing special code for the most common cases (p=2,3,4), I can eliminate (manually unroll) these loops, and then the j loop (which will have high iteration counts) should be able to be parallelized or vectorized.
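For the record, here is roughly what the unroll might look like for the case `local_res(j)%p == 3`, derived from the general recurrence above (unverified; `d` is a scratch variable introduced here):

```fortran
! Hypothetical manual unroll of the p-loop for local_res(j)%p == 3.
! M(3) must be updated before M(2), because the p=3 update uses the
! old M(2) (this mirrors the backwards p-loop).
d = delta(j)/real(n,WP)
local_res(j)%M(3) = n1on*local_res(j)%M(3) &
     + local_res(j)%binkp(1,3)*(-d)*n1on*local_res(j)%M(2) &
     + d**3*(real(n1,WP)**3 - real(n1,WP))/real(n,WP)
local_res(j)%M(2) = n1on*local_res(j)%M(2) &
     + d**2*(real(n1,WP)**2 + real(n1,WP))/real(n,WP)
```

Converting n and n1 to real(WP) before exponentiation also sidesteps the mixed-mode integer arithmetic in the original expression.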

One parting question: does `do concurrent` always result in the equivalent of !$OMP PARALLEL DO (if the compiler determines it’s worthwhile), or does it ever generate SIMD code with the Intel compiler?

Steven_L_Intel1
Employee

Why not both? DO CONCURRENT helps with vectorization by eliminating the possibility of loop-carried dependencies. If you have -parallel specified, it will put a !DIR$ PARALLEL (not an OpenMP !$OMP PARALLEL) in front of the loop.

TimP
Honored Contributor III

do concurrent will not generate (OpenMP) threaded code unless auto-parallelized by -parallel (and possibly -par-threshold:nn).  If it did so, -par-report would flag it.

Your directives instruct the compiler to generate separate optimized code for each designated count if feasible.  I have doubts.

Izaak_Beekman
New Contributor II

Tim Prince wrote:

do concurrent will not generate (OpenMP) threaded code unless auto-parallelized by -parallel (and possibly -par-threshold:nn).  If it did so, -par-report would flag it.

Your directives instruct the compiler to generate separate optimized code for each designated count if feasible.  I have doubts.

Doubts?
