Intel® Fortran Compiler

Optimized code use of the inverse square root instruction

David_DiLaura1
New Contributor I
863 Views

I have seen the compiler produce optimized code (/O3 with Pentium 4 extensions) that uses the Intel inverse square root instruction and a multiply, substituting these two instructions for a square root and a divide. I'm working on engineering code that has numerous occurrences of /sqrt(), and in not one of them does the compiler choose the inverse square root instruction instead. VTune shows me the code that the compiler generates.

Hints? Suggestions?

David

0 Kudos
3 Replies
Steven_L_Intel1
Employee
First, are the compile options exactly the same across the two programs? There is an option, /Qprec-sqrt, that governs use of this instruction sequence. The default for an optimized compile is /Qprec-sqrt-, which allows use of these sequences. However, if you also use certain options such as /fltconsistency or /Op, this optimization is disabled.

If they are the same, then I suggest you create a test case and submit it to Intel Premier Support so that we can have optimization experts examine it.
David_DiLaura1
New Contributor I

Steve,

The compile options are the same; /Op is not used. I even get the same phenomenon when I explicitly use /Qprec-sqrt- (or even /prec-). Something else must be preventing the optimizer from making these changes. I'll submit something to Premier Support.

David

TimP
Honored Contributor III
The main reason for using inverse sqrt is to enable more parallel use of the floating-point unit. /fp:precise sets /Qprec-sqrt, unless it is followed by /Qprec-sqrt-. In my experience, ifort uses inverse sqrt mainly in vectorized code (single precision only).
The Penryn CPUs improve the latency of IEEE sqrt so much that there is far less reason to accept the more verbose code and corner-case problems of the inverse sqrt and iterative improvement scheme. A single inverse sqrt and multiply, such as you describe, gives only about 12 bits of precision, so no compiler uses it without the iterative improvement to get about 23 bits of precision.
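
The precision argument above can be checked numerically. The following is a minimal Python sketch, not compiler output: it emulates a low-precision reciprocal square root by rounding 1/sqrt(y) to 12 significant bits (an assumption standing in for the hardware estimate, which does not produce these exact values), then applies one Newton-Raphson step, r' = r * (1.5 - 0.5 * y * r * r), with every intermediate rounded to single precision. The helper names (to_f32, rsqrt_estimate, newton_refine, accurate_bits) are hypothetical and exist only for this illustration.

```python
import math
import struct


def to_f32(value: float) -> float:
    """Round a Python float (a double) to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", value))[0]


def rsqrt_estimate(y: float, bits: int = 12) -> float:
    """Emulated low-precision 1/sqrt(y), rounded to `bits` significant bits.

    ASSUMPTION: a stand-in for a hardware estimate, not the real instruction.
    """
    exact = 1.0 / math.sqrt(y)
    mantissa, exponent = math.frexp(exact)  # exact == mantissa * 2**exponent
    scale = 1 << bits
    rounded = round(mantissa * scale) / scale
    return to_f32(math.ldexp(rounded, exponent))


def newton_refine(r: float, y: float) -> float:
    """One Newton-Raphson step for 1/sqrt(y), evaluated in single precision."""
    y32 = to_f32(y)
    t = to_f32(to_f32(y32 * r) * r)        # y * r * r, close to 1 when r is good
    correction = to_f32(1.5 - 0.5 * t)     # 1.5 - 0.5 * y * r^2
    return to_f32(r * correction)


def accurate_bits(approx: float, exact: float) -> float:
    """Number of correct leading bits, computed as -log2(relative error)."""
    rel = abs(approx - exact) / abs(exact)
    return math.inf if rel == 0.0 else -math.log2(rel)


if __name__ == "__main__":
    x = to_f32(7.0)
    for y in (2.0, 3.0, 10.0):
        exact = x / math.sqrt(y)           # reference value in double precision
        r0 = rsqrt_estimate(y)             # estimate only
        r1 = newton_refine(r0, y)          # estimate plus one refinement step
        print(
            f"y={y}: estimate*x has {accurate_bits(to_f32(x * r0), exact):.1f} bits, "
            f"refined*x has {accurate_bits(to_f32(x * r1), exact):.1f} bits"
        )
```

Under these assumptions the bare estimate times x should come out with roughly 12 correct bits, and the refined value with more than 20, which is consistent with the figures quoted above. The refinement costs several extra multiplies and a subtract per result, which is the "more verbose code" being traded against the latency of a true sqrt and divide.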