Hi,
Consider the following code snippet:
```fortran
do i = 1, size(rhs,dim=2)
   if (n == biggest_int) exit !Overflow!
   n1 = n
   n = n + 1
   n1on = real(n1,WP)/real(n,WP)
   ! Add SIMD dir?
   !!!!DIR$ SIMD PRIVATE(p,k)
   do concurrent (j=1:size(lhs)) ! I'm nervous about p and k getting stepped on
      delta(j) = rhs(j,i) - local_res(j)%M(1)
      local_res(j)%M(1) = local_res(j)%M(1) + delta(j)/real(n,WP)
      !DIR$ LOOP COUNT (1,2,3,4,5)
      do p = local_res(j)%p,2,-1 !iterate backwards because new M4 depends on old M3,M2 etc.
         sum(j) = 0
         !DIR$ LOOP COUNT (0,1,2,3,4)
         do k = 1,p-2
            sum(j) = sum(j) + &
               local_res(j)%binkp(k,p)*((-delta(j)/n)**k)*n1on*local_res(j)%M(p-k)
         end do
         local_res(j)%M(p) = n1on*local_res(j)%M(p) + sum(j) + &
            ((delta(j)/n)**p)*(n1**p + n1*((-1)**p))/n
      end do
      local_res(j)%n = n
      local_res(j)%min = min(lhs(j)%min,rhs(j,i))
      local_res(j)%max = max(lhs(j)%max,rhs(j,i))
   end do
end do
```
Note that the outermost do loop has data dependencies: it is performing a reduction over one dimension of rhs.
Iterations of the next do loop (expressed as `do concurrent`) may be performed in any order. My reading of 'Modern Fortran Explained', p. 360:
> any variable referenced is either previously defined in the same iteration, or its value is not affected by any other iteration;
To me this means that the loop indices used within this `do concurrent` construct, p and k, are *NOT* in danger of getting stepped on, since they are "previously defined in the same iteration" of the current `do concurrent` loop. Is this correct?
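As a side note, my understanding is that the locality could also be made explicit, either with the Fortran 2018 `local(p, k)` clause on the `do concurrent` header or with a `block` construct inside it, so that each iteration gets fresh copies by construction. A reduced, self-contained toy sketch (not the actual code above):

```fortran
program dc_local_demo
   implicit none
   integer :: j
   real :: a(8), b(8)
   a = [(real(j), j = 1, 8)]
   ! Each iteration gets its own p and acc via BLOCK, so concurrent
   ! iterations have nothing to step on.  (Fortran 2018 alternatively
   ! allows: do concurrent (j = 1:8) local(p, acc).)
   do concurrent (j = 1:8)
      block
         integer :: p
         real    :: acc
         acc = 0.0
         do p = 1, 3              ! ordinary sequential inner loop
            acc = acc + a(j)**p   ! iteration-local index and accumulator
         end do
         b(j) = acc
      end block
   end do
   print *, b
end program dc_local_demo
```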
The loops *inside* the `do concurrent` are order dependent. The loop over p moves backwards because we update `local_res(j)%M(p)` using the old (relative to the iteration over p) values of `local_res(j)%M(2)` … `local_res(j)%M(p-1)`.
The loop over k has a data dependency because it is a sum reduction. The sum may be performed in any order, so long as two iterations don’t try to write the sum at once.
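Incidentally, since the order doesn't matter, the k loop could also be written as an array-expression reduction. One wrinkle: the accumulator array is named `sum`, which shadows the intrinsic, so this sketch uses a hypothetical `ksum` scalar instead:

```fortran
! Sketch only: the k loop as an order-free reduction via the SUM
! intrinsic.  ksum is a hypothetical real(WP) scalar standing in for
! sum(j), whose declaration shadows the intrinsic SUM in the original.
ksum = sum([( local_res(j)%binkp(k,p) * ((-delta(j)/n)**k) &
              * n1on * local_res(j)%M(p-k), k = 1, p-2 )])
```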
Typically the upper bound of the `do concurrent` loop will be quite large, so my idea was to apply some optimizations to this loop, such as SIMD vectorization. Am I on the right track here?
If the SIMD directive on line 7 of the snippet above is *not* commented out (by removing three of the four `!`s), then the compiler gives an error:
```text
error #8592: Within a SIMD region, a DO-loop control-variable must not be specified in a PRIVATE/REDUCTION/FIRSTPRIVATE/LASTPRIVATE SIMD clause.
do p = local_res(j)%p,2,-1 !iterate backwards because new M4 depends on old M3,M2 etc.
-------------^
```
Does this mean I don't need the private clause and that p and k are safe from getting stepped on, or does it mean the directive is telling the compiler that the loops over p and k have no data dependencies?
Any and all advice is most welcome.
Thanks!
---
Zaak,
`do concurrent` implies you intend multi-threading. Consider:
```fortran
do i = 1, size(rhs,dim=2)
   if (n == biggest_int) exit !Overflow!
   n1 = n
   n = n + 1
   n1on = real(n1,WP)/real(n,WP)
   !$OMP PARALLEL SIMD PRIVATE(p,k)
   do j=1, size(lhs) ! I'm nervous about p and k getting stepped on
      delta(j) = rhs(j,i) - local_res(j)%M(1)
      local_res(j)%M(1) = local_res(j)%M(1) + delta(j)/real(n,WP)
      !DIR$ LOOP COUNT (1,2,3,4,5)
      do p = local_res(j)%p,2,-1 !iterate backwards because new M4 depends on old M3,M2 etc.
         sum(j) = 0
         !DIR$ LOOP COUNT (0,1,2,3,4)
         do k = 1,p-2
            sum(j) = sum(j) + &
               local_res(j)%binkp(k,p)*((-delta(j)/n)**k)*n1on*local_res(j)%M(p-k)
         end do
         local_res(j)%M(p) = n1on*local_res(j)%M(p) + sum(j) + &
            ((delta(j)/n)**p)*(n1**p + n1*((-1)**p))/n
      end do
      local_res(j)%n = n
      local_res(j)%min = min(lhs(j)%min,rhs(j,i))
      local_res(j)%max = max(lhs(j)%max,rhs(j,i))
   end do
end do
```
(unverified)
Jim Dempsey
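A side note on the directive pair above: if I'm reading the OpenMP spec right, the combined construct for a threaded, vectorized loop is spelled `!$OMP PARALLEL DO SIMD`; `PARALLEL SIMD` alone is not a standard combination. A conforming sketch of the same idea (loop body elided, as in the snippet above):

```fortran
!$OMP PARALLEL DO SIMD PRIVATE(p, k)   ! standard combined construct
do j = 1, size(lhs)
   ! ... same loop body as in the snippet above ...
end do
!$OMP END PARALLEL DO SIMD             ! optional closing directive
```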
---
Thanks for the response, Jim. A couple of notes:
- The `do concurrent` is not strictly necessary; I'd be happy to use a normal do loop.
- There is no guarantee in the standard that compilers will implement `do concurrent` using threads; they are free to choose.
- I'd prefer to work on single-threaded optimizations, because the software may be easily parallelized at a higher/coarser level using MPI/coarrays/OpenMP.
- I want to learn more about SIMD, vectorization, and pipelining.
---
Keep in mind that DO CONCURRENT tells the compiler that there are no iteration dependencies.
---
Steve Lionel (Intel) wrote:
> Keep in mind that DO CONCURRENT tells the compiler that there are no iteration dependencies.
Right, the iterations over j may be performed in any order. However, for each j iteration, the loops nested within it are order dependent.
---
Steve,
When you have `!$OMP PARALLEL SIMD`... and you have OpenMP stubs enabled, does the enclosed loop still have the SIMD attributes/characteristics?
I would imagine that when Generate Sequential Code is enabled it would not.
Jim Dempsey
---
SIMD is for vectorization; I am not aware of any cross-effect here.
---
While none of these comments has directly answered my questions, I think I've realized that the inner loops are preventing me from achieving my goal. Although it means writing special code for the most common cases (p = 2, 3, 4), I can eliminate (manually unroll) these loops, and then the j loop (which will have high iteration counts) should be amenable to parallelization or vectorization.
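To make that concrete, here is a sketch of the unrolled update for the most common case, `local_res(j)%p == 4` (the p = 2 and p = 3 cases are the obvious truncations). `d`, `m2`, and `m3` are hypothetical temporaries; the snapshots matter because each moment update must see the values from before this data point:

```fortran
! Sketch: the p and k loops unrolled by hand for local_res(j)%p == 4.
d  = delta(j)/n
m2 = local_res(j)%M(2)   ! old second moment
m3 = local_res(j)%M(3)   ! old third moment
local_res(j)%M(4) = n1on*local_res(j)%M(4)            &
   + local_res(j)%binkp(1,4)*(-d)  *n1on*m3           &
   + local_res(j)%binkp(2,4)*(d**2)*n1on*m2           &
   + (d**4)*(n1**4 + n1)/n
local_res(j)%M(3) = n1on*m3                           &
   + local_res(j)%binkp(1,3)*(-d)*n1on*m2             &
   + (d**3)*(n1**3 - n1)/n
local_res(j)%M(2) = n1on*m2 + (d**2)*(n1**2 + n1)/n
```

With the inner loops gone, the j loop body is straight-line arithmetic, which is the shape vectorizers like.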
One parting question: does `do concurrent` always result in the equivalent of `!$OMP PARALLEL DO` (if the compiler determines it's worthwhile), or does it ever generate SIMD code with the Intel compiler?
---
Why not both? DO CONCURRENT helps with vectorization by eliminating the possibility of loop-carried dependencies. If you have -parallel specified, it will put a !DIR$ PARALLEL (not !$OMP PARALLEL) in front of the loop.
---
do concurrent will not generate (OpenMP) threaded code unless auto-parallelized by -parallel (and possibly -par-threshold:nn). If it did so, -par-report would flag it.
Your LOOP COUNT directives instruct the compiler to generate separately optimized code for each designated trip count, if feasible. I have doubts.
---
Tim Prince wrote:
> do concurrent will not generate (OpenMP) threaded code unless auto-parallelized by -parallel (and possibly -par-threshold:nn). If it did so, -par-report would flag it.
> Your LOOP COUNT directives instruct the compiler to generate separately optimized code for each designated trip count, if feasible. I have doubts.
Doubts?
