topic Double precision Newton-Raphson in Intel® C++ Compiler
https://community.intel.com/t5/Intel-C-Compiler/Double-precision-Newton-Raphson/m-p/938907#M17889
Hello,
I have never seen a compiler (GNU or Intel) generate Newton-Raphson (NR) sequences for faster double-precision (DP) divides or square roots. I know that there are no DP equivalents of RCPSS, RCPPS, RSQRTSS and RSQRTPS. Three questions:
 - Why are there no DP equivalents of RCPSS, RCPPS, RSQRTSS and RSQRTPS?
 - Is it possible, with compiler flags, to generate NR sequences for DP from the existing fast single-precision RCP and RSQRT instructions (with more NR iterations, probably 4 or 5 instead of 2)?
 - If it is not possible, why? Is it not efficient? Is there no demand for faster DP divides or square roots with close-to-DP precision?

Thank you in advance
eoseret, Wed, 31 Oct 2012 09:33:41 GMT
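For readers unfamiliar with the construct being asked about, here is a minimal sketch of a DP reciprocal built from the single-precision RCPSS seed and refined with NR steps in double. The names `recip_seed` and `nr_recip` are hypothetical, the non-SSE fallback is an assumption added only so the sketch compiles elsewhere, and nothing here is claimed to be what either compiler emits.

```c
#include <assert.h>
#include <math.h>

#if defined(__SSE__) || defined(_M_X64) || defined(_M_IX86)
#include <xmmintrin.h>   /* _mm_set_ss, _mm_rcp_ss, _mm_cvtss_f32 */
#define HAVE_RCPSS 1
#endif

/* Low-precision seed for 1/a. With SSE this is RCPSS, whose documented
   relative error is at most 1.5 * 2^-12. The fallback is an assumption
   for non-x86 builds and simply uses a single-precision divide. */
static double recip_seed(double a)
{
#ifdef HAVE_RCPSS
    return (double)_mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss((float)a)));
#else
    return (double)(1.0f / (float)a);
#endif
}

/* Hypothetical helper: approximate 1/a with `iters` NR steps in double.
   Only meaningful when a is finite, nonzero, and within float range,
   because the seed is computed from (float)a. */
double nr_recip(double a, int iters)
{
    assert(iters >= 0);
    double x = recip_seed(a);
    for (int i = 0; i < iters; ++i)
        x = x * (2.0 - a * x);   /* x_{k+1} = x_k * (2 - a * x_k) */
    return x;
}
```

A divide `n / d` would then be written as `n * nr_recip(d, 3)`. The result is not correctly rounded and special cases (zero, infinity, NaN, denormals, values outside float range) are not handled, which is one reason such a sequence would only be acceptable under a non-precise FP model. Starting from the RCPSS seed, three steps should already reach roughly DP accuracy rather than the 4 or 5 guessed in the question.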
https://community.intel.com/t5/Intel-C-Compiler/Double-precision-Newton-Raphson/m-p/938908#M17890
Once in a while, consideration is given to making rcpps et al. sufficiently accurate (as the original AMD version was) to get a double N-R result in 2 iterations (I guess you would count 3). There seems to be consensus that it's not worthwhile.
On Sandy Bridge, you might consider that the lack of an AVX-256 parallel divide leaves an opening. The improvements in Ivy Bridge and later seem to be a better way to fix this than adopting N-R.
Maybe you can see your wish partly granted in the Intel® Xeon Phi™ implementation.
TimP, Wed, 31 Oct 2012 12:03:18 GMT
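The iteration counts mentioned in this reply can be checked with a short derivation. It assumes exact arithmetic (rounding in the NR steps is ignored) and uses the documented RCPPS bound of 1.5 * 2^-12 on the relative error; the symbols a, x_k and e_k are introduced here for illustration only.

```latex
\begin{align*}
x_{k+1} &= x_k\,(2 - a\,x_k), \qquad e_k := 1 - a\,x_k \\
e_{k+1} &= 1 - a\,x_k\,(2 - a\,x_k) = (1 - a\,x_k)^2 = e_k^2 \\
|e_0| &\le 1.5 \cdot 2^{-12}
  \;\Rightarrow\; |e_1| \le 2^{-22.8},\quad |e_2| \le 2^{-45.6},\quad |e_3| \le 2^{-91.3} \\
|e_2| &\le 2^{-53} \;\Leftrightarrow\; |e_0| \le 2^{-13.25}
\end{align*}
```

So with the current seed, two steps give about 45 bits and a third is needed for the full 53-bit significand, while a seed accurate to roughly 13.25 bits or better would get there in two steps. That appears to be the point of the remark about a more accurate rcpps.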
https://community.intel.com/t5/Intel-C-Compiler/Double-precision-Newton-Raphson/m-p/938909#M17891
Thanks for your answer.
According to http://www.mersenneforum.org/showthread.php?t=11765 large speedups (up to 3.5x) can be gained by using DP NR. I will try to reproduce them myself. If I can get speedups greater than 1.2x, I think it is well worth making the compiler generate NR sequences for divides and square roots by default (under non-precise FP models), exactly as it does for single precision. Vectorization is orthogonal: packed versions are available for both the IEEE (slow) and non-IEEE (fast) instructions, even though I know that on Ivy Bridge VDIVPS/D (ymm) will be natively 256 bits wide, unlike on Sandy Bridge, implying a 2x speedup for this instruction (Ivy Bridge compared to Sandy Bridge).
eoseret, Wed, 31 Oct 2012 13:40:53 GMT
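The square-root case mentioned above can be sketched the same way, starting from the single-precision RSQRTSS seed. This is only an illustration under stated assumptions: `rsqrt_seed` and `nr_sqrt` are hypothetical names, the non-SSE fallback exists only so the sketch compiles elsewhere, and no speedup is claimed or measured here.

```c
#include <assert.h>
#include <math.h>

#if defined(__SSE__) || defined(_M_X64) || defined(_M_IX86)
#include <xmmintrin.h>   /* _mm_set_ss, _mm_rsqrt_ss, _mm_cvtss_f32 */
#define HAVE_RSQRTSS 1
#endif

/* Low-precision seed for 1/sqrt(a). With SSE this is RSQRTSS (documented
   relative error at most 1.5 * 2^-12). The fallback is an assumption for
   non-x86 builds and uses single-precision library calls instead. */
static double rsqrt_seed(double a)
{
#ifdef HAVE_RSQRTSS
    return (double)_mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss((float)a)));
#else
    return (double)(1.0f / sqrtf((float)a));
#endif
}

/* Hypothetical helper: approximate sqrt(a) with `iters` NR steps on
   y ~ 1/sqrt(a), then one multiply by a. Requires a > 0, finite, and
   within float range, because the seed is computed from (float)a. */
double nr_sqrt(double a, int iters)
{
    assert(iters >= 0 && a > 0.0);
    double y = rsqrt_seed(a);
    const double h = 0.5 * a;
    for (int i = 0; i < iters; ++i)
        y = y * (1.5 - h * y * y);   /* y_{k+1} = y_k * (1.5 - 0.5*a*y_k^2) */
    return a * y;                    /* sqrt(a) = a * (1/sqrt(a)) */
}
```

Each step roughly squares the relative error (with a factor of about 1.5), so three steps from the RSQRTSS seed should land within a few units in the last place of the DP result. As with the divide, the result is not correctly rounded and zero, infinity and NaN inputs need separate handling.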
https://community.intel.com/t5/Intel-C-Compiler/Double-precision-Newton-Raphson/m-p/938910#M17892
The much faster division and sqrt on Ivy Bridge would greatly alleviate the problem (nearly 2x speedup), but for now they are still executed as a sequence of 128-bit-wide operations.
According to the URL you posted, about 48 bits of accuracy was all that was desired from the "double" division. That would correspond to the ICL option /Qimf-accuracy-bits:48 (or maybe 44), in case you have a context where that option is implemented. I couldn't see whether they were considering vectorized code, which is the situation where Intel compilers make use of the lower-accuracy options.
TimP, Wed, 31 Oct 2012 14:37:00 GMT