Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Fast approximations of transcendental operations

velvia
Beginner
864 Views

Hi,

You might know the "Carmack" trick for getting a fast and crude approximation of 1/sqrt(x). It is:

float Q_rsqrt( float number )
{
	long i;
	float x2, y;
	const float threehalfs = 1.5F;
 
	x2 = number * 0.5F;
	y  = number;
	i  = * ( long * ) &y;                  
	i  = 0x5f3759df - ( i >> 1 ); 
	y  = * ( float * ) &i;
	y  = y * ( threehalfs - ( x2 * y * y ) );
 
	return y;
}

It reinterprets the float's bits as an integer and shifts them one to the right, which in the process divides the exponent by 2. There are more elaborate explanations of the 0x5f3759df constant, which one can find on Wikipedia. It is about 10 times faster than a regular pow(x, -0.5) but has low accuracy (only the first 2 digits are correct).

I've heard that x86-64 has an rsqrt operation that does this kind of trick. Is there a documented way to call such a function? Are there any other transcendental functions also available in a fast (low accuracy) version?

I need it in one of my ordinary differential equation solvers, where x^(1/3), x^(1/5), ... need to be computed to choose the best order at run time depending on the timestep. When the ODE does not contain any transcendental function, this becomes the bottleneck, and I need to speed it up. I might adapt the "Carmack" trick, but I was wondering whether one was already implemented in x86-64.

0 Kudos
15 Replies
TimP
Honored Contributor III
864 Views

The imf-precision options of the Intel compilers permit accepting low-accuracy approximations.  You'd have to test whether they apply to the case you have in mind.  I suspect they apply only when fast transcendentals are enabled (that's a default), and they may be restricted to vectorizable situations.  You didn't show how you intend to apply the sqrt approximation to your case.  There is a cube root function in SVML (for use when x^(1/3.f) is vectorized).

0 Kudos
Bernard
Valued Contributor I
864 Views

>>>I've heard that x86-64 has a rsqrt operation that does this kind of trick. Is there a documented way on how to call such a function?>>>

I presume that you are referring to the rsqrtps SSE instruction or to its intrinsic representation _mm_rsqrt_ps(__m128 x). In that case you can either use inline assembly to call the rsqrtps instruction (use ICC for 64-bit compilation of inline assembly) or use the _mm_rsqrt_ps intrinsic to calculate the reciprocal of the square root. For non-vectorized code you can use the rsqrtss instruction, or its intrinsic form _mm_rsqrt_ss(__m128 x).

Here is an interesting discussion on stackoverflow.com which is partially related to your question:

http://stackoverflow.com/questions/1528727/why-is-sse-scalar-sqrtx-slower-than-rsqrtx-x
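A minimal scalar sketch using the intrinsic (the function name rsqrt_approx is illustrative only; the documented accuracy of rsqrtss is a maximum relative error of 1.5 * 2^-12, about 12 bits):

```cpp
#include <cmath>
#include <immintrin.h>

// Approximate 1/sqrt(x) for one float via the SSE rsqrtss instruction.
// Accuracy is about 12 bits; one Newton-Raphson step (shown commented
// out below) would refine it to nearly full float precision.
float rsqrt_approx(float x) {
    __m128 v = _mm_set_ss(x);    // load x into the low lane
    __m128 r = _mm_rsqrt_ss(v);  // hardware approximate reciprocal sqrt
    float y = _mm_cvtss_f32(r);  // extract the low lane
    // Optional refinement: y = y * (1.5f - 0.5f * x * y * y);
    return y;
}
```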

0 Kudos
velvia
Beginner
864 Views

Hi,

 

I don't need vector instructions. The imf-precision options did not speed up my code, and I need the low-accuracy functions only in some parts of the code, so they were not an option. I ended up adapting Carmack's trick. Here is an example of a function that takes the cube root of a double. It is quite fast and the first 2 digits are correct.

    #include <cstdint>

    double fast_third_root(double x) {
        std::uint64_t i;
        double ans;

        i = *(std::uint64_t*) &x;          // reinterpret the double's bits
        i = 0x2a9f8a7be393b600 + (i / 3);  // dividing i by 3 divides the exponent by 3
        ans = *(double*) &i;

        return ans;
    }
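As a side note, the pointer casts above are formally undefined behavior in C++ (strict aliasing); a sketch of the same trick using std::memcpy, which compilers optimize to the same register move (valid for x > 0 only; the function name here is just to distinguish the variant):

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Same bit-level trick, but with std::memcpy for the type punning to
// avoid the strict-aliasing undefined behavior of the pointer casts.
// Sketch only; assumes x > 0.
double fast_third_root_memcpy(double x) {
    std::uint64_t i;
    std::memcpy(&i, &x, sizeof i);   // reinterpret the double's bits
    i = 0x2a9f8a7be393b600 + i / 3;  // dividing i by 3 divides the exponent by 3
    double ans;
    std::memcpy(&ans, &i, sizeof ans);
    return ans;
}
```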
0 Kudos
KitturGanesh
Employee
864 Views

Here some more info that might help:

>  is about 10 times faster than a regular pow(x, -0.5)

The question of interest is how much faster it is than 1/sqrt(x) or invsqrt(x) on any platform. If you have something like for (i=0; i<N; i++) { ... = ... 1/sqrt(x) ... }, the Intel compiler will most likely generate an inline sequence for 1/sqrt(x); this holds if the loop is vectorized as well.

Also,

 > has a low accuracy (only the first 2 digits are correct)

If the above level of accuracy is sufficient, you can also look at using the -fimf-accuracy-bits switch to get a lower-precision approximation.

_Kittur

 

0 Kudos
Bernard
Valued Contributor I
864 Views

You can pass _mm_rsqrt_ps() a __m128 vector loaded with just one scalar. The latency is only 5 cycles and the throughput is 1 cycle; your code has a division that, if not optimized out by the compiler, could be slower.

 

0 Kudos
velvia
Beginner
864 Views

Hi,

In my case, what I really need is to compute x^(1/q) for q = 3, 5, 7, 9, ..., so I can't use the rsqrt functions.

I can't use -fimf-accuracy because I can't apply the low-accuracy functions to the whole program (or even a whole program unit). The places where I need low accuracy are very specific; in most places, I need full accuracy.

0 Kudos
jimdempseyatthecove
Honored Contributor III
864 Views

How many q's will you need to go?

Is x ever less than 1.0D0? (or is x always less than 1.0D0?), ...

Jim Dempsey

0 Kudos
velvia
Beginner
864 Views

At a given time, I only need one value of q. Usually it is

Some computation -> x^(1/3) -> Some computation -> x^(1/5) -> Stop.

For x, it is >> 1.

 

0 Kudos
Jeff_Arnold
Beginner
864 Views

> I can't use the -fimf-accuracy as I can't use the low accuracy functions for the whole program (even the whole program unit). The places where I only need low accuracy are very specific. On most places, I need full accuracy.

The -fimf-accuracy option can be applied to any individual compilation unit, so it need not apply to the whole program.  In addition, you can specify exactly the functions to which the option is to be applied.  E.g.,

-fimf-accuracy-bits=23:sinf,cosf,logf

See the compiler documentation for more details.

However, there is no guarantee that a routine exists with exactly the accuracy requested.  (Obviously, the math libraries supplied with icc don't contain routines for every possible accuracy.)  What is guaranteed is that a routine with at least that accuracy will be used.  Thus, the expectation of a performance benefit from requiring less accuracy may not be met.

0 Kudos
KitturGanesh
Employee
864 Views

Or maybe create a file with all the functions requiring low accuracy and compile just that file with the option? Another suggestion is to use a table lookup (already containing the values, if there aren't too many q's), where the only cost is the indexing. Just a thought...

_Kittur

0 Kudos
jimdempseyatthecove
Honored Contributor III
864 Views

>> I really need is to compute x^(1/q) for q = 3, 5, 7, 9, ...
>> Some computation -> x^(1/3) -> Some computation -> x^(1/5) -> Stop

The above describes an iterative process (pseudo code)

x=initialValue
for(q=3;;q+=2) {
  someComputation
  if(converged) break;
  x^(1/q)
}

What you might consider is a multi-threaded approach (2 threads), where one thread performs someComputation while the other thread computes the next x^(1/q), using Newton-Raphson with the prior root as the initial guess (thus saving an iteration or two).

Depending on the size of someComputation, the result of the next x^(1/q) may already be available when it is needed.
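A rough sketch of that overlap using std::async for the second thread (all names here are placeholders, not from the actual solver):

```cpp
#include <cmath>
#include <future>

// Hypothetical sketch of the two-thread overlap described above: while
// the current "someComputation" runs on the main thread, a second
// thread already computes the next x^(1/q).
double pipeline_step(double x, int q) {
    // Start computing the next root asynchronously.
    std::future<double> root = std::async(std::launch::async, [x, q] {
        return std::pow(x, 1.0 / q);  // stand-in for a Newton-Raphson root
    });
    // ... someComputation would run here, overlapped with the root ...
    return root.get();  // x^(1/q), ideally ready by the time it is needed
}
```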

Jim Dempsey

0 Kudos
velvia
Beginner
864 Views

Thanks Jim, but x is different for every q, so that approach won't work. Anyway, the "John Carmack" trick is fast and gives me enough accuracy.

 

0 Kudos
jimdempseyatthecove
Honored Contributor III
864 Views

The "John Carmack" trick shown is for sqrt. Would you mind sharing the 3rd, 5th, 7th, 9th, ... root approximations?

Also, if you have double versions, it would be appreciated if you show these too (or provide a link).

Jim Dempsey

 

0 Kudos
jimdempseyatthecove
Honored Contributor III
864 Views

Velvia,

I understood the trick of manipulating the integer representation, where the (i / 3) also divides the exponent (but, interestingly, not the implicit leading 1 bit). I suppose the additional bits in the mantissa portion of the magic number more or less adjust for that missing bit. I presume this also requires that x be positive (-2.0 is the cube root of -8.0). That could be handled by saving and clearing the sign bit, doing the diddle, then restoring the sign bit (valid for odd-order roots; even-order roots of negative numbers are not real).

This is a neat hack for an approximation. You could use it as the first approximation of the root, then apply one or more Newton-Raphson steps to add bits of precision. I think IPP uses a similar trick for a fast, higher-precision sqrt. No reason it couldn't be used here too if needed.
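To illustrate, a Newton-Raphson step for y = x^(1/3) solves f(y) = y^3 - x = 0, giving the update y <- (2y + x/y^2)/3. Applied to the bit-hack seed posted earlier in the thread, two steps already recover most of the digits (a sketch, assuming x > 0; the function name is illustrative):

```cpp
#include <cstdint>
#include <cstring>

// Refine the bit-hack cube-root seed with Newton-Raphson steps.
// Each step roughly squares the relative error, so the few-percent
// seed error drops to about 1e-6 or better after two steps.
double third_root_refined(double x) {
    // Bit-hack seed (same magic constant as in the earlier post).
    std::uint64_t i;
    std::memcpy(&i, &x, sizeof i);
    i = 0x2a9f8a7be393b600 + i / 3;
    double y;
    std::memcpy(&y, &i, sizeof y);
    // Newton-Raphson for f(y) = y^3 - x:  y <- (2*y + x / y^2) / 3
    y = (2.0 * y + x / (y * y)) / 3.0;
    y = (2.0 * y + x / (y * y)) / 3.0;
    return y;
}
```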

Thanks for the post.

Jim Dempsey

0 Kudos
TimP
Honored Contributor III
864 Views

The whole thread got confusing when it started with a discussion of the approximate reciprocal sqrt instruction even though what was apparently wanted were cube and 5th root approximations.

The accuracy options in the Intel compilers and libraries offer implied selection of the number of Newton iterations for sqrt, as well as of whether infinities, NaNs, and zeros are handled.  It's not clear whether those options apply to the higher-order roots, except that there is library support for cbrt.  In some situations, the compiler could call the parallel cbrt function for a scalar result, but it takes a lot of checking over multiple responses in this thread to guess whether anyone is interested in that.

Higher accuracy than the basic Carmack approximation might be obtained with a short polynomial.  There has been debate about whether the approximate sqrt should be accurate to 11, 12, 13, or 14 bits (AMD made various choices at different times).  Considerations include whether a float-precision result is to be obtained with a single Newton iteration, or a double-precision result with 2 iterations.

0 Kudos
Reply