Intel® Fortran Compiler

any limit to #nested loops that ifort analyses for parallelism?

high_end_c_
Beginner

I have a rather long FORTRAN code where the kernel comprises over 8 levels of nested DO loops, yet I only get information in the optrpt files for the innermost loops (about 4 levels). The outer loops are not referenced at all, and if I use the 'annotated' (HTML) version there are no embedded comments for the outer loops. I've tried -qopt-report=5 but had no joy.

The code is sensitive so I cannot just share it, but if really needed I can try to reproduce the same lack of info from the compiler with another kernel.
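
For shape, the sort of stripped-down reproducer I have in mind would look like this (entirely made up, not the real kernel):

! hypothetical reproducer (not the real kernel): an 8-deep DO nest,
! compiled with e.g.  ifort -O2 -parallel -qopt-report=5 deep_nest.f90
program deep_nest
  implicit none
  integer, parameter :: n = 8
  real :: a(n,n,n,n)
  integer :: i1, i2, i3, i4, i5, i6, i7, i8
  a = 0.0
  ! the outer four loops accumulate into the same elements, so every
  ! level gives the dependency analysis something to decide on
  do i1 = 1, n
     do i2 = 1, n
        do i3 = 1, n
           do i4 = 1, n
              do i5 = 1, n
                 do i6 = 1, n
                    do i7 = 1, n
                       do i8 = 1, n
                          a(i8,i7,i6,i5) = a(i8,i7,i6,i5) + 1.0
                       end do
                    end do
                 end do
              end do
           end do
        end do
     end do
  end do
  print *, a(1,1,1,1)
end program deep_nest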

Hints welcome, M

high_end_c_
Beginner

Hi all, somebody suggested that the compiler may focus on the 'leaves' (innermost loops) rather than the 'root' (outermost enclosing DO loop). Whilst I can see that applies to optimising variables into registers and to optimising data for vectorisation, my presumption is that any OpenMP compiler would rather implement coarse-grained than fine-grained parallelism, and thus needs to consider the outermost loops as candidates.
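
To make concrete what I mean by coarse versus fine grained, a minimal sketch (again, a made-up kernel, nothing from the real code):

! coarse grained: one fork/join, each thread takes a chunk of the outer i loop
subroutine coarse(a, b, c, m, n)
   implicit none
   integer, intent(in) :: m, n
   real, intent(in)    :: a(m,n), b(m,n)
   real, intent(out)   :: c(m,n)
   integer :: i, j
   !$omp parallel do private(j)
   do i = 1, n
      do j = 1, m
         c(j,i) = a(j,i) + b(j,i)
      end do
   end do
   !$omp end parallel do
end subroutine coarse

! fine grained: the same work, but threads fork/join once per outer iteration
subroutine fine(a, b, c, m, n)
   implicit none
   integer, intent(in) :: m, n
   real, intent(in)    :: a(m,n), b(m,n)
   real, intent(out)   :: c(m,n)
   integer :: i, j
   do i = 1, n
      !$omp parallel do
      do j = 1, m
         c(j,i) = a(j,i) + b(j,i)
      end do
      !$omp end parallel do
   end do
end subroutine fine

The coarse version pays the fork/join cost once; the fine version pays it on every outer iteration.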

I am likely barking up the wrong tree, but maybe there is a "max time" or "max amount of work" budget the compiler spends before giving up. So if it starts innermost (to check vectorisation options) it may run out of steam before it can examine the outer loops to determine whether any dependencies prevent parallelism. But how would I know this? And is there a flag (or flags) to tell the compiler to keep on going...

cheers, michael
https://highendcompute.co.uk

 

jimdempseyatthecove
Honored Contributor III

>> my presumption is that any OpenMP compiler would rather implement coarse-grained than fine-grained parallelism, and thus needs to consider the outermost loops as candidates.

Are you confusing OpenMP programmer directive programming with auto-parallelism?

You, as the programmer, are responsible for selecting the appropriate level at which to inject parallelism into your application, or for electing not to parallelize when it would be counterproductive. Use VTune or other profiling means to assess the practicality of parallelizing, as well as where to apply your directives.

Jim Dempsey

TimP
Honored Contributor III

I would not expect auto-parallel to consistently analyze more levels of loops than are required for the SPEC benchmarks.  Many of those are already set up for OpenMP, with the requirement that OpenMP be disabled for benchmarking, so the loop nesting is already reasonable in most cases.

As Jim suggested, an application of any complexity is better handled by explicit OpenMP directives. There are directives to guide auto-parallel, but they aren't so popular where OpenMP is no more difficult.  ifort is intended to work with both auto-parallel and OpenMP in the same application.  My expectation is that an OpenMP directive overrides auto-parallel within its scope.  Of course, separate compilation of procedures could overcome that limit.

It used to be that OpenMP directives would disable multi-level loop optimizations, although the newer compilers should perform those on inner loops, leaving the outer loops under control of OpenMP.  This situation argues against specifying an excessive number of loops in a collapse clause.  In my experience, ifort is more versatile than other compilers in applying the simd clause to outer loops, which requires some degree of multi-level optimization.
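
To sketch that division of labour (a made-up nest, just for illustration): OpenMP over a modest number of outer levels, with the innermost loop left for the vectorizer:

! OpenMP takes the two outer loops; the inner k loop is left for
! the compiler to vectorize. Collapsing many more levels would take
! those loops away from the multi-level loop optimizations.
subroutine nest3(a, b, n)
   implicit none
   integer, intent(in) :: n
   real, intent(inout) :: a(n,n,n)
   real, intent(in)    :: b(n,n,n)
   integer :: i, j, k
   !$omp parallel do collapse(2)
   do i = 1, n
      do j = 1, n
         !$omp simd
         do k = 1, n
            a(k,j,i) = a(k,j,i) + b(k,j,i)
         end do
      end do
   end do
   !$omp end parallel do
end subroutine nest3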

high_end_c_
Beginner

jimdempseyatthecove wrote:

Are you confusing OpenMP programmer directive programming with auto-parallelism? [...]

 

Thanks Jim. I do appreciate the difference, but doesn't the compiler's auto-parallelisation have to reason in a similar way to an OpenMP programmer? And with -parallel and -qopt-report one can see which loops have been considered, *generally*. But my question is why I get nothing in the optrpt when I have ~8 levels of nested loops and ONLY the innermost ones have optrpt comments. Sorry if I was unclear in the original Q. Yrs, M

high_end_c_
Beginner

Tim P. wrote:

I would not expect auto-parallel to consistently analyze more levels of loops than are required for the SPEC benchmarks. [...]


Thanks Tim. Further to my reply to Jim, I do prefer to do the OMP myself. I was just expecting some dependency analysis from the par report for the loops all the way out to the outermost one. I've seen and used that many times to guide me as to why I may (or may not) be able to parallelise some loops (or what, as an OMP programmer, I'd have to address), and sometimes to help me ensure I have my OpenMP data clauses correct. Hence this part of ifort has been an invaluable tool.

With this current code, with a much larger number of nested levels of DO, I suddenly (it seems) get no such report for the outer levels. It's why I am not getting this info that I ask about. Are you saying that Intel just looks at how many levels are in SPEC, works on that number of levels to ensure really good benchmark stats, and for deeper nests works from the innermost (as one would expect) until it hits that number, and then does not even try to look at the remaining outer loops? That also suggests there is a parameter (e.g. set to the number of levels in SPEC) that one could amend; if that's possible, that's what I'd like to try.

For now, I'll get back to those in the code's discipline to discuss the natural parallelism they believe is inherent in their problem, to attack this from a higher level and expose the parallelism (as well as taking a long print-out to determine data dependencies manually).

Best wishes, Michael

jimdempseyatthecove
Honored Contributor III

A common problem for the compiler in placing auto-parallelization optimally is that it cannot determine the loop iteration counts, or the size of the thread pool, at compile time. You do have available to you:

LOOP COUNT

General Compiler Directive: Specifies the iterations (typical trip count) for a DO loop.

!DIR$ LOOP COUNT (n1[,n2]...)

!DIR$ LOOP COUNT= n1[,n2]...

!DIR$ LOOP COUNT MAX(n1), MIN(n1), AVG(n1)

!DIR$ LOOP COUNT MAX=n1, MIN=n1, AVG=n1

n1, n2: a non-negative integer constant.

The value of the loop count affects heuristics used in software pipelining, vectorization, and loop transformations.

Argument forms and their meaning:

n1 [, n2]: the next DO loop will iterate n1, n2, or some other number of times.

MAX, MIN, and AVG: the next DO loop has the specified maximum, minimum, and average number (n1) of iterations.

Example

Consider the following:

!DIR$ LOOP COUNT (10000)
do i = 1, m
   b(i) = a(i) + 1   ! this is likely to enable the loop to get software-pipelined
enddo

Note that you can specify more than one LOOP COUNT directive for a DO loop. For example, the following directives are valid:

!DIR$ LOOP COUNT (10, 20, 30) 
!DIR$ LOOP COUNT MAX=100, MIN=3, AVG=17 
DO 
...

Jim Dempsey

TimP
Honored Contributor III

A caution with !dir$ loop count:  

If you omit the MAX, MIN, and AVG clauses, the directive is likely to reduce performance for any count other than those specified, so you should use one of those clauses when the loop count varies.  For example, if the loop count varies from 1 to 99, !dir$ loop count avg=50 should work.  Without the directive, such a short loop (if vectorized) might be unrolled excessively.

I'm not aware of how this works with auto-parallel, but Jim's suggestion is good.  For example, if a loop is asserted to execute at most 5 times, auto-parallelization should avoid making that the only parallel loop. If you use OpenMP, you need to specify the outer parallel loop and the collapse parameter yourself, but the loop count directive might still help.
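
For the 1-to-99 case above, the placement would look like this (made-up routine, for illustration only):

! m varies from call to call, anywhere in 1..99; the AVG hint
! stops the compiler tuning the loop for one fixed trip count
subroutine scale_row(a, m)
   implicit none
   integer, intent(in) :: m
   real, intent(inout) :: a(m)
   integer :: i
   !dir$ loop count avg=50
   do i = 1, m
      a(i) = 2.0 * a(i)
   end do
end subroutine scale_row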

high_end_c_
Beginner

Useful reminder of "!dir$ loop count", many thanks.

But my main question is why I get nothing back in the optrpt for these outer loops, whereas I do for the inner loops. If the compiler told me there are dependencies, or too little work, for said loops (as it does for the inner ones) then I'd know it was at least checking; it seems strange that there is nothing in the optrpt relating to the line numbers of the outer few loops...

Hope that's a useful new angle? M

 
