Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Double precision Newton-Raphson

eoseret
Beginner

Hello,
I have never seen a compiler (GNU or Intel) generate Newton-Raphson (NR) sequences to speed up double precision (DP) divides or square roots. I know that there are no DP equivalents of RCPSS, RCPPS, RSQRTSS and RSQRTPS. Three questions:
 - Why are there no DP equivalents of RCPSS, RCPPS, RSQRTSS and RSQRTPS?
 - Is it possible, with compiler flags, to generate NR sequences for DP from the existing fast single precision RCP and RSQRT instructions (with more NR iterations, probably 4 or 5 instead of 2)?
 - If it is not possible, why not? Is it not efficient? Is there no demand for faster DP divides or square roots with precision close to DP?


Thank you in advance

3 Replies
TimP
Honored Contributor III
Once in a while, consideration is given to making rcpps et al. sufficiently accurate (as the original AMD version was) to get a double N-R result in 2 iterations (I guess you would count 3). There seems to be a consensus that it's not worthwhile. On Sandy Bridge, you might consider that the lack of an AVX-256 parallel divide leaves an opening. The improvements in Ivy Bridge et al. seem to be a better way to fix this than adopting N-R. Maybe you can see your wish partly granted in the Intel® Xeon Phi™ implementation.
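A short derivation may help explain why the iteration count depends so strongly on the accuracy of the seed. This is a sketch in exact arithmetic (rounding is ignored); the symbols d, x_n and e_n are introduced here for illustration and are not from the thread. The 1.5·2^-12 bound is the documented relative error of RCPPS; the 14-bit seed is only an assumption about what the remark on the original AMD version refers to.

$$
x_{n+1} = x_n\,(2 - d\,x_n), \qquad x_n = \frac{1 - e_n}{d}
\;\Longrightarrow\;
x_{n+1} = \frac{(1 - e_n)(1 + e_n)}{d} = \frac{1 - e_n^2}{d},
\qquad e_{n+1} = e_n^2 .
$$

$$
|e_0| \le 1.5 \cdot 2^{-12} \approx 2^{-11.4}
\;\Longrightarrow\;
|e_1| \lesssim 2^{-22.8}, \quad |e_2| \lesssim 2^{-45.7}, \quad |e_3| \lesssim 2^{-91} .
$$

$$
|e_0| \le 2^{-14}
\;\Longrightarrow\;
|e_1| \le 2^{-28}, \quad |e_2| \le 2^{-56} .
$$

Each step doubles the number of correct bits, so a seed of about 11.4 bits needs 3 steps to exceed the 53-bit DP significand, while a seed of about 14 bits would get there in 2.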
eoseret
Beginner
Thanks for your answer. According to http://www.mersenneforum.org/showthread.php?t=11765, strong speedups (up to 3.5x) can be gained by using DP NR. I will try to reproduce them myself. If I can get speedups greater than 1.2x, I think it would be well worth having the compiler generate NR sequences for divides and square roots by default (under non-precise FP models), exactly as it does for single precision. Vectorization is orthogonal: packed versions are available for both the IEEE (slow) and non-IEEE (fast) instructions, even though I know that on Ivy Bridge VDIVPS/D (ymm) will be natively 256 bits wide, unlike on Sandy Bridge, implying a 2x speedup for this instruction (Ivy Bridge compared to Sandy Bridge).
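For anyone who wants to try the same experiment, here is a minimal scalar sketch of a DP reciprocal refined from the single precision RCPSS seed. It is not what any compiler emits and not the code from the linked thread; the names recip_seed, nr_recip and nr_div are made up for illustration. It assumes the divisor is finite, nonzero and within float range, and it falls back to a plain float divide where SSE is not available.

```c
#if defined(__SSE__) || defined(__x86_64__) || defined(_M_X64)
#include <xmmintrin.h>
#define HAVE_RCPSS 1
#endif

/* Hypothetical helper: low-precision seed for 1/d.
   Assumes d is finite, nonzero and representable as a float. */
static double recip_seed(double d)
{
#ifdef HAVE_RCPSS
    /* RCPSS: documented relative error of at most 1.5 * 2^-12 */
    return (double)_mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss((float)d)));
#else
    /* Portable fallback: a single precision divide as the seed */
    return (double)(1.0f / (float)d);
#endif
}

/* Approximate 1/d with the given number of Newton-Raphson steps.
   Each step roughly doubles the number of correct bits. */
double nr_recip(double d, int iterations)
{
    double x = recip_seed(d);
    for (int i = 0; i < iterations; ++i) {
        x = x * (2.0 - d * x);   /* x_{n+1} = x_n * (2 - d * x_n) */
    }
    return x;
}

/* Approximate a / b as a * (1/b). The result is not guaranteed to be
   correctly rounded, unlike the IEEE divide instruction. */
double nr_div(double a, double b, int iterations)
{
    return a * nr_recip(b, iterations);
}
```

With the RCPSS seed, 3 steps should be enough to reach roughly DP accuracy (within a few ulps), so nr_div(a, b, 3) is the variant to time against a plain a / b. Whether it wins will depend on the microarchitecture and on how much independent work surrounds the divide.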
TimP
Honored Contributor III
The much faster division and sqrt on Ivy Bridge would greatly alleviate the problem (nearly 2x speedup), but for now they are still sequenced as 128-bit-wide operations. According to the URL you posted, about 48 bits of accuracy was all that was wanted from the "double" division. That would correspond to the ICL option /Qimf-accuracy-bits:48 (or maybe 44), in case you have a context where that option is implemented. I couldn't tell whether they were considering vectorized code, which is the situation where Intel compilers make use of the lower-accuracy options.
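The square root case can be sketched the same way, starting from the RSQRTSS seed. Again this is only an illustration under stated assumptions: rsqrt_seed, nr_rsqrt and nr_sqrt are hypothetical names, the input must be positive, finite and within float range, and the result is not correctly rounded. Stopping after 2 steps from a seed of about 12 bits should land in the mid-40s of bits, in the neighbourhood of the reduced accuracy discussed above; 3 steps get close to full DP.

```c
#if defined(__SSE__) || defined(__x86_64__) || defined(_M_X64)
#include <xmmintrin.h>
#define HAVE_RSQRTSS 1
#else
#include <math.h>
#endif

/* Hypothetical helper: low-precision seed for 1/sqrt(d).
   Assumes d is positive, finite and representable as a float. */
static double rsqrt_seed(double d)
{
#ifdef HAVE_RSQRTSS
    /* RSQRTSS: documented relative error of at most 1.5 * 2^-12 */
    return (double)_mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss((float)d)));
#else
    /* Portable fallback: single precision sqrt and divide as the seed */
    return (double)(1.0f / sqrtf((float)d));
#endif
}

/* Approximate 1/sqrt(d) with the given number of Newton-Raphson steps. */
double nr_rsqrt(double d, int iterations)
{
    double y = rsqrt_seed(d);
    double half_d = 0.5 * d;
    for (int i = 0; i < iterations; ++i) {
        y = y * (1.5 - half_d * y * y);   /* y_{n+1} = y_n * (1.5 - 0.5 * d * y_n^2) */
    }
    return y;
}

/* Approximate sqrt(d) as d * (1/sqrt(d)). Does not handle d == 0. */
double nr_sqrt(double d, int iterations)
{
    return d * nr_rsqrt(d, iterations);
}
```

Comparing nr_sqrt(d, 2) and nr_sqrt(d, 3) against the library sqrt over a range of inputs is a simple way to see how many bits each iteration count actually delivers on a given machine.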