At the suggestion of VTune, I added /Qprec-div- /Qprec-sqrt- to /standard-semantics /QxAVX2 and got a significant speedup. I didn't find any documentation on this.
Sorry Tim, I'm not sure what you're asking.
If you're asking "Does the /standard-semantics switch imply /Qprec-div" the answer is "No".
If you're asking "Does /Qprec-div- improve speed?" the answer is "Yes, it could", and yes, that is mentioned in the documentation; I copied the section from the 16.0 documentation below.
If you're asking something else, well, you'll have to ask it again. Or I need more coffee this morning.
This option improves precision of floating-point divides. It has a slight impact on speed.
With some optimizations, such as -msse2 (Linux* OS) or /arch:SSE2 (Windows* OS), the compiler may change floating-point division computations into multiplication by the reciprocal of the denominator. For example, A/B is computed as A * (1/B) to improve the speed of the computation.
However, sometimes the value produced by this transformation is not as accurate as full IEEE division. When it is important to have fully precise IEEE division, use this option to disable the floating-point division-to-multiplication optimization. The result is more accurate, with some loss of performance.
If you specify -no-prec-div (Linux* OS and OS X*) or /Qprec-div- (Windows* OS), it enables optimizations that give slightly less precise results than full IEEE division.
I agree that the documentation appears to imply that standard-semantics and prec-div should work independently.
I tried to write a more complete original post, but the forum dropped it three times. VTune reported a divide stall rate of 0.29 under /standard-semantics and suggested I unset prec-div. With /Qprec-div- /Qprec-sqrt- it reports 0.20, along with a 3% overall speedup, even though little time is spent in vectorized loops with divide and sqrt (vectorization being correctly suppressed elsewhere by the "seems inefficient" heuristic).
The application has several cases of pre-inversion written into the source code to avoid repeated divides. Some of those stop ifort from optimizing loop nesting. It has an "illegal" mixture of single and double precision and was always built with /4R8, so it seems that full accuracy is not a concern.
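The pre-inversion idiom mentioned above looks roughly like this (a C sketch with invented names, not code from the application): hoisting the reciprocal out of the loop replaces n divides with one divide plus n multiplies, at the cost of one extra rounding per element, which is essentially the same trade the compiler makes under /Qprec-div-.

```c
#include <stddef.h>

/* Straightforward form: one IEEE-rounded divide per element. */
void scale_div(double *out, const double *a, double b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] / b;
}

/* Pre-inversion: one divide, then n multiplies.  Each result can
 * differ from a[i]/b by about one ulp, because 1.0/b is rounded
 * before the multiply. */
void scale_mul(double *out, const double *a, double b, size_t n)
{
    double inv_b = 1.0 / b;
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] * inv_b;
}
```

With /Qprec-div in effect, the hand-written scale_mul form keeps its speed advantage, since the compiler is only forbidden from introducing the reciprocal itself, not from honoring one written in the source.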
I don't consider it illogical for /standard-semantics to set /Qprec-div and /Qprec-sqrt, but it is a surprise if it is not documented. I was also surprised to see it make so much difference on Haswell.
gfortran might optimize with inversion under -O -ffast-math or -freciprocal-math, but I don't think any compiler other than Intel's makes this a default.
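For reference, the relevant flags on each compiler as I understand them (a summary worth double-checking against each compiler's documentation):

```shell
# Intel ifort: imprecise divide is allowed under default optimization;
# these force full IEEE division/sqrt:
#   Linux/OS X: -prec-div -prec-sqrt     (disable: -no-prec-div -no-prec-sqrt)
#   Windows:    /Qprec-div /Qprec-sqrt   (disable: /Qprec-div- /Qprec-sqrt-)
#
# gfortran: reciprocal math is opt-in only:
#   gfortran -O2 -freciprocal-math ...   # also implied by -ffast-math
```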