Inflated report of vector speedup without Qunroll

TimP · ‎04-21-2016

I found that the reported vector speedup is more realistic when it is based on compilation with Qunroll4. This is only partly explained by the weak non-vector performance of Intel compilers without that option.

In that connection, the advice sometimes issued to cut back unrolling when time is spent in the remainder loop appears wrong. A vectorized remainder loop performs as well as the main loop would without unrolling. Where Advisor claims more efficiency for the vector loop without unrolling, it doesn't look right.
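To make the remainder argument concrete, here is a minimal C sketch of how a trip count might be divided between an unrolled vector main loop, a vectorized remainder, and a scalar tail. The names `loop_split` and `split_trip_count` are hypothetical, and the model ignores peel loops and masked remainders; it is not a description of what the Intel compiler actually generates.

```c
#include <assert.h>
#include <stddef.h>

/* Iterations handled by each part of a vectorized loop (illustrative model). */
typedef struct {
    size_t main_iters;    /* covered by the unrolled vector main loop */
    size_t vec_rem_iters; /* covered by the vectorized remainder, one vector at a time */
    size_t scalar_iters;  /* left for a scalar tail */
} loop_split;

/* n: trip count; vl: elements per vector (e.g. 8 floats for 256-bit vectors);
   unroll: unroll factor of the vector main loop.
   Assumption: no peel loop and no masked remainder. */
loop_split split_trip_count(size_t n, size_t vl, size_t unroll)
{
    assert(vl > 0 && unroll > 0);
    loop_split s;
    size_t chunk = vl * unroll;           /* elements consumed per main-loop pass */
    s.main_iters = (n / chunk) * chunk;
    size_t rest = n - s.main_iters;
    s.vec_rem_iters = (rest / vl) * vl;   /* whole vectors that still fit */
    s.scalar_iters = rest - s.vec_rem_iters;
    return s;
}
```

Under this model, a trip count of 60 with 8 elements per vector and unroll 4 puts 32 iterations in the main loop, 24 in the vectorized remainder, and 4 in the scalar tail. The 24 remainder iterations are still full-width vector work, which is the same work an un-unrolled main loop would have done.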

Intel comparisons with GNU compilers always seem to be based on not setting good unroll options, taking advantage of the GNU default being worse than Intel's. In the application I'm characterizing now in Advisor, unroll4 gives a 4% overall gain even though the top 10 hotspots are vectorizable.

Zakhar_M_Intel1 · ‎04-22-2016

Hi Tim,

Which target ISA and which Advisor version did you work with (my guesstimate: AVX2 and Advisor 2016 Update 3)?

For your first question (about efficiency vs. unroll): it's important to remember that the vectorizer/Advisor gain and efficiency metrics are upper bounds that assume the code is not totally memory bound (i.e. Vectorization_Gain = Wallclock_Speedup only if the code is VPU bound). Higher unroll factors amortize the memory/compute balance, making loop performance closer to its computation-related upper bound and therefore bringing observed speed-ups closer to the reported efficiency (especially if the unrolling was done by the vectorizer, not by HLO).

This is just a general consideration; the real answer will naturally depend on multiple factors, first of all on what kind of source code you are dealing with.
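The upper-bound point can be illustrated with a deliberately crude C model. It assumes that compute and memory time overlap perfectly, so loop time is the larger of the two, and that vectorization divides only the compute part by the reported gain. Both assumptions and the function name `modeled_speedup` are illustrative only; this is not how Advisor computes its metrics.

```c
#include <assert.h>

static double max2(double a, double b) { return a > b ? a : b; }

/* compute_time: scalar compute time of the loop (arbitrary units)
   memory_time:  time needed to move the data, unchanged by vectorization (assumption)
   vector_gain:  compute-only gain, playing the role of the reported upper bound */
double modeled_speedup(double compute_time, double memory_time, double vector_gain)
{
    assert(compute_time > 0.0 && memory_time >= 0.0 && vector_gain >= 1.0);
    double scalar_time = max2(compute_time, memory_time);
    double vector_time = max2(compute_time / vector_gain, memory_time);
    return scalar_time / vector_time;
}
```

With a compute time of 8, a gain of 4, and a memory time of 1, the modeled speedup equals the gain (4). Raising the memory time to 4 cuts the speedup to 2, and at a memory time of 8 the loop is fully memory bound and the modeled speedup is 1.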

For your second question/suggestion: I completely agree that unconditionally disabling unrolling is not proven to be effective for already vectorized remainders. That's why recent versions of Advisor mark this Recommendation as "Low Confidence".

Generally, it's not always trivial to achieve significant speed-ups when tweaking vectorized remainders on AVX2. However, that is not the case for AVX (with older compilers) and AVX-512. In the first case you often deal with scalar remainders (which are very inefficient); in the second case masked vectorization applies by default too frequently (masked remainders are more powerful in terms of delivering >1x speed-up on a large class of codes, which doesn't mean, however, that they are necessarily highly efficient by default).

Anyway, we are currently in the process of improving the quality of peel/remainder-related Recommendations for AVX and AVX-512, and we will additionally focus on cases like yours. Thank you for the insightful comment; it is really a pleasure to stay in touch with knowledgeable users like you!

Finally, I won't comment on Intel compiler vs. GNU compiler "comparisons" (this is definitely not what my team works on).

In our product we try to deliver reasonable value to GCC users as well; for example you may sometimes notice that Advisor has some GCC-specific recommendations, like glibc vs. svml and so on.

Thanks again for the insightful discussion. If you have a couple of simple test cases/reproducers, they would also be useful for us when refactoring the peel/remainder advice.

TimP · ‎04-23-2016

Yes, the vectorized remainders are most useful with avx2 target.

I did just complete a comparison of arch settings for the application I'm working on. The almost negligible gain for avx2 over avx is due to more frequent use of full 256-bit-wide operations, so the compiler correctly decides to include a vector remainder.

I found one case where ifort vectorizes (with permutation) but the vector loop isn't entered at run time. I didn't see a clear indication of this in advisor display, but advisor was counting the time as scalar time. By changing source code to remove permutations and distribution I was able to double performance of a barely significant code section and eliminate versioning.

I have a number of loops where advisor 2017 reports 40% efficiency. This is clearly a significant improvement over scalar. I haven't finished checking whether this is always associated with (justified) warning about memory access pattern.

Advisor 2017 appears to be reporting correctly on cases where I need to make omp simd conditional on avx. The compiler doesn't attempt vectorization without a private clause (it suggests lastprivate) but this overrules "seems inefficient."
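As a rough illustration of making omp simd conditional on avx with a lastprivate clause, here is a C sketch. The application in this thread is Fortran, so the C form, the kernel, and the name `scale_keep_last` are all assumptions made for the example. `__AVX__` is the predefined macro that GCC, Clang, and the Intel compiler set when targeting AVX.

```c
#include <assert.h>

/* Hypothetical kernel: t is a per-iteration temporary whose final value is
   needed after the loop, so it is listed as lastprivate rather than private. */
double scale_keep_last(const double *a, double *b, int n, double s)
{
    assert(n > 0);  /* keeps the value of t after the loop well defined */
    double t = 0.0;
#ifdef __AVX__
    /* Directive applied only for AVX targets; other targets compile the plain loop. */
#pragma omp simd lastprivate(t)
#endif
    for (int i = 0; i < n; ++i) {
        t = a[i] * s;
        b[i] = t + 1.0;
    }
    return t;  /* value from the sequentially last iteration */
}
```

The directive only takes effect when OpenMP SIMD support is enabled at compile time; otherwise the pragma is ignored and the loop is left to the auto-vectorizer. The caller must also ensure that `a` and `b` do not overlap, since the directive asserts that iterations are independent.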

I didn't succeed in running a gfortran-compiled version of this application under Advisor 2017. Advisor 2015 could do that sometimes and report on avx2 usage and maybe even point to the source of hot loops. It's to be expected that it's not useful without the Intel compiler.

Zakhar_M_Intel1 · ‎04-26-2016

Tim,

You mentioned that there was "one case where ifort vectorizes (with permutation) but the vector loop isn't entered at run time. I didn't see a clear indication of this in advisor display". Do you mean there were no Advisor recommendations / "vector issues" regarding "ineffective peel/remainder"? If so, this seems to be a bug (I believe we fixed all such cases in recent Updates).

I was also surprised by "didn't succeed in running gfortran compiled version of this application under advisor 2017." We try to make the product useful for gfortran users, not only for ifort customers. What kind of error message or wrong behavior do you observe when running Advisor 2017 against a gfortran-compiled application?

Zakhar

TimP · ‎04-27-2016

Although it was given as the last of several low-confidence hints, a loop count avg directive was sufficient to resolve the cases where only a remainder loop was executed. I got another 2% performance out of this, but too much use of these directives caused a run-time crash. I also eliminated the cases where vectorization depended on 32-byte-aligned arrays, and now there is no advantage in them.
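For readers unfamiliar with the hint, the following is a C sketch of a loop count avg directive on a short-trip-count loop. The thread's code is Fortran, where the Intel directive is written as a `!DIR$ LOOP COUNT` comment; the C pragma below is the Intel C/C++ counterpart. The kernel `axpy_short` and the value 6 are hypothetical, and the guard only covers the classic Intel compiler, which defines `__INTEL_COMPILER`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel that is usually called with a small n. */
void axpy_short(double *y, const double *x, double alpha, size_t n)
{
    assert(x != NULL && y != NULL);
#if defined(__INTEL_COMPILER)
    /* Hint only: tells the compiler the typical trip count (assumed to be 6 here)
       so it can choose a vectorization strategy suited to short loops. */
#pragma loop_count avg(6)
#endif
    for (size_t i = 0; i < n; ++i)
        y[i] += alpha * x[i];
}
```

The hint is meant to change only the compiler's code generation choices, not the loop's semantics, so a value that matches the measured average trip count is the safest choice.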