x87 FSIN/FCOS mediocre accuracy

Stathis_K · ‎05-16-2010

Hello everyone.

My apologies if this isn't the place to post my question. I couldn't find a forum on general x86 questions. Feel free to relocate it to a more appropriate place.

The x87 FPU specs say that FSIN and FCOS can compute any angle in the range \\pm 2^63, which is roughly \\pm 1E18. My tests showed that even for smaller arguments, say in the order of 1E10, they produce results that are correct only for the first 10 digits and the rest are all wrong.

My question is, how accurate are the computations of FSIN and FCOS? Since they operate on the stack register, which is 80bit wide, shouldn't they at least provide correct double precision results ?

I can provide code samples, if asked.

Best regards,
Stathis

Shane_S_Intel · ‎05-16-2010

Stathis, my recommendation would be to use the double precison sin/cos functions provided along with the Intel compiler. These do a fully accurate argument reduction based on the Payne-Hanek algorithm. The X87 FSIN/FCOS instructions do handle arguments up to 2^63, but use an internal 66 bit value of pi to do the argument reduction. Because only 66 bits of pi are used to perform the reduction, the final computed result may very well be different than if a fully accurate pi is used. This probably explains why you're seeing different results.

djplatt · ‎06-26-2010

Hi Shane

I have the same problem trying to perform mathematically rigorous computations using interval arithmetic. I expected large arguments to the trig functions to cause inaccuracy because of the reduction step but I also see large (in terms of units of last place) errors if arguments are, for example, near Pi/2 in the case of cos.

Assuming I restrict myself to arguments of magnitude <=Pi/4 in the case of fsincos, can anything mathematically rigorous be said about the accuracy of results obtained in normal rounding mode by pushing a double, executing fsincos and popping the results as two doubles. How close are they to the correctly rounded exact answer?

The same question goes for f2xm1, fyl2x, fsqrt and fpatan. What accuracy can I be assured of over what ranges of argument?

What about the library routines sin, cos, atan, exp and log? Do we have provable error bounds for them?

Thanks

Dave

eliosh · ‎06-27-2010

Using precise PI value can aleviate the problem of the range reduction but cannot cure it.

Consider the most simple case of a periodic function with period equal to one. For simplicity, assume also that your computer has 3 significant digits.

First, you want to compute f(0.158) no problem with that. Now, you want to compute f(1.158) which is, of course, should be equal to f(0.158). However, instead of 1.158 you will get 1.16 (three significant digits) and, after the range reduction (which does not introduce any error), you will end up with computing f(0.16) which might be quite different form f(0.158). You can extrapolate the idea to computing f(100.158) which will result in computing f(0) instead of f(0.158).

Shane_S_Intel · ‎06-27-2010

Dave, rigorous error bounds for most of the x87 transcendental functions (f2xm1, fsincos, fyl2x...) were provided when the Pentium processor released. I recall an appendix in the orginal Pentium manual which showed a collection of ulp (unit in the last place) charts - don't recall the upper bound for all of thembut it was something like 1.8 ulps. This would be for double-extended 80 bit fp inputs. Results for the fsin/fcos were for arguments < pi/4becausefor the reduction for larger inputs, a 66-bit value of pi (or pi/2) was used. The fsqrt instruction is correctly rounded (<= .5 ulps round to nearest) since thisis needed for IEEE754 adherence. All Intel provided libm (libimf) functions, both single and double precision, should be "almost" correctly rounded (i.e. errors < .55 ulps). Provable error bounds have also been establised. Note for libimf functions this bound is also true for large inputs to the sin/cos/tan functions because a precise value of pi is used for the reduction step, and the error introducedat that stepis compensated for and taken into account when the overall error bound is calculated.

djplatt · ‎06-27-2010

Shane, thanks very much for this. I have seen John Harrison's paper (Formal Verification of Floating Point
Trigonometric Functions) which gives me all the comfort I need for sin and cos in the Intel library. Is there something equally rigorous for exp, log and atan? If so, I am home and dry.

By the way, the link

http://developer.intel.com/software/opensource/numerics/index.htm

cited by John appears to been moved. Any ideas where too?

Best

Dave

Shane_S_Intel · ‎06-27-2010

I will need to check with John ... not sure that exp (note that itisnt an x87 transcendental operation) was ever formally verified but techniques similar to those described in http://download.intel.com/technology/itj/q41999/pdf/transendental.pdf, and more specifically papers by Peter Tang - nicely described by J. M. Mueller Elementary functions: algorithms and implementation, Birkhaser, 1997 provide an indication of the techniques used to verify accuracy.

djplatt · ‎06-28-2010

Thanks again Shane. Most helpful. By exp/log I meant f2xm1/fyl2x respectively. Where they available on the Pentium? My best course of action now seems to be a good nose at Mueller's book, I think we've got it.

Dave

Shane_S_Intel · ‎06-28-2010

Maybe this is it:http://download.intel.com/software/opensource/numerics/libm.pdfLet me add that John, using higher levelformalverification tools, establised error bounds forthe x87 transcendental operation ( f2xm1, fyl2x,...)implementations on the Itanium processors. If I recall correctly, the original Pentium error bounds were establised using the techniques described in Mueller's book, essentially the combinederror, that is, the arg. redution error + table lookup error +polynomial approximaton error + associated roundoff errors.

davidimcintosh · ‎04-18-2011

Also, the link for the source is http://download.intel.com/software/opensource/numerics/libm.zip.

Thanks Shane for the info. I have looked at the paper you refer to above - very good info. I have also looked at the assembler source, and I am uncertain of one thing. What is the relationship between the built-in intrinsic f2xm1, and the assembler code? The assembler code does not seem to use f2xm1 anywhere. If f2xm1 a machine implementation equivalent to one of the assembler routines in the library?

One other question. The function exp(-x^2/2) is very very commonly needed in both scientific and financial calculations. Because of the x^2 term, there are various techniques for improving the accuracy of this calculation. For example, I have often seen this code (paraphrased):
double xMostSignificantPart = trunc(x * 16.) * (1./16.)
result = exp(-xMostSignificantPart * xMostSignificantPart * 0.5)
* exp((xMostSignificantPart - x) * (x + xMostSignificantPart) * 0.5)
The above code, however, is a little misguided it seems, since the calcualtion of exp(y) is always reduced to calculation of 2^(...). In other words, I would think it best to treat this as the calculation
exp(-x^2/2) = 2^( - (x*sqrt(ln_2(e)/2))^2 )
and split up x*sqrt(ln_2(e)/2 ) into multiple parts for the calculation. Have you or any of the others at Intel every looked at this problem? It is not difficult, but I was wondering if anyone had looked at it and if perhaps someone already had some nice, highly optimized code for doing this calculation. Thanks.

davidimcintosh · ‎04-19-2011

I think the code to optimally calculate exp(-x^2/2) would be something like this. Can any of the Intel guru's give me a bit of a critical look ath this? thanks.:

#pragma warning(push)
#pragma warning(disable: 4035)

#define setExtendedPrecisionOn(nSavedCW) \
short int nSavedCW; \
{ short int nNewCW; \
__asm \
{ \
__asm fstcw nSavedCW /*get the current Control Word to retain all setting bits*/ \
__asm fstcw nNewCW /*get the current Control Word to retain all setting bits*/ \
__asm fwait /*to insure the storage instruction is completed*/ \
__asm or nNewCW, 0x0300 /* set the precision control bits both on */ \
__asm fldcw nNewCW /* load the modified Control Word */ \
} \
}

/*
//Calculating e^(-x*x/2).
//We first write this as
// e^(-x*x/2) = 2^(-ln_2(e)*x*x/2)
// = 2^(- (x*sqrt(ln_2(e)/2) * (x*sqrt(ln_2(e)/2) )
// = 2^(- (x/sqrt(2*ln(2))) * (x/sqrt(2*ln(2))) )
// = 2^(- (x/sqrt(ln(4))) * (x/sqrt(ln(4))) )
// = 2^(- y * y )
//where y = x * 1/sqrt(ln(4)).
//so we calculate y, and then calculate 2^(-y*y).
//to calculate 2^(-y*y), we split y up into a "most significant part", the upper bits, say z, and its lower bits, (y-z),
//so we write y=z+(y-z). then y^2 = z^2 + (y+z)*(y-z), which we further break down as
// y^2 = [z^2] + (z^2 - [z^2]) + (y+z)*(y-z), where is V rounded to the nearest integer.
// If we ensure that the number of significant bits in z is at MOST half the number of significant bits
// in our double representation, then we maxinmize the number of significant bits in (z^2 - [z^2]) + (y+z)*(y-z)
//*/

// const long double over_sqrt_ln_4 = 0.84932180028801904272150283410289; /* = 1/sqrt(ln(4)) 0x3FFED96D274C04529621 */
// const unsigned char s_1_over_sqrt_ln_4[10] = { 0x3F, 0xFE, 0xD9, 0x6D, 0x27, 0x4C, 0x04, 0x52, 0x96, 0x21 }; /* assuming ? endian */
const unsigned char s_1_over_sqrt_ln_4[10] = { 0x21, 0x96, 0x52, 0x04, 0x4C, 0x27, 0x6D, 0xD9, 0xFE, 0x3F }; /* assuming ? endian */

/*sqrt( (16384+64) ) : y magnitude at which the value becomes 0*/
//const double s_d128(128.24975633505117891897083806151);
/*sqrt( (16384+64)*ln(4) ) : x magnitude at which the value becomes 0*/
const double s_dZeroLimit(151.00254849405675372508994769305);//151.00254849405675146123095288122 exactly, 151.00254849405675372508994769305 rounded up to 53 bits (5312929852576791/2^45 )

/*Must be bigger than log_2(s_d128)+1, so bigger than 8, to ensure the magnitude of our "fractional" part is indeed <1 */
/*To ensure our "most significant part" squared does not loose any bits, less than 53/2-log_2(s_d128), so less than 19; */
/* 53 makes it work in double (53 bit) precision - we could choose an upper limit of 64/2-log_2(s_d128)~25, but why bother. */
/* 18 is good for both. */
const double s_dShift(18);

long double exp_X2_2( double x )
{
setExtendedPrecisionOn(nSavedCW);

__asm fld x /* st(0) = x */
__asm fabs /* st(0) = |x| */

__asm fcom s_dZeroLimit /*if x >= sqrt( (16384+64)*ln(4) ), return 0; st(0) = |x|, st(1) = s_dShift */
__asm fstsw ax /*copy the Status Word containing the result to AX */
__asm fwait /*insure the previous instruction is completed */
__asm sahf /*transfer the condition codes to the CPU's flag register */
#ifdef _DEBUG
__asm jpe error_handler /* indeterminate. */
#endif
__asm jb notBig /* if st0 < sqrt( (16384+64)*ln(4) ), jump to notBig */
/* if st0 >= sqrt( (16384+64)*ln(4) ), e(-st0*st0) is below even the smallest */
/* denormalized extended double, so we return 0. If 128*sqrt(ln(4)) <= st0 < sqrt( (16384+64)*ln(4) ) */
/* e(-|x|*|x|) is denormalized, but we will calculate it anyway. */

/*Numerically zero */
__asm ffree st(0) /* if st0 >= sqrt(16384+64), e(-st0*st0) is below even the smallest */
__asm fldz /* denormalized extended double, so we return 0. If 128 <= st0 < sqrt(16384+64) */
__asm jmp done /* e(-st0*st0) is denormalized, but we will calculate it anyway. */

__asm notBig:
__asm fld tbyte ptr [s_1_over_sqrt_ln_4] /* st(0) = 1/sqrt(ln(4)), st(1) = |x| */
__asm fmulp st(1), st(0) /* st(0) = |x|/sqrt(ln(4)) = |x|*sqrt(ln_2(e)/2) = y, and exp(-x^2/2)=2^(-y^2) */
__asm fld1 /* st(0)=1.0, st(1) = y */
__asm fcomp st(1) /*if st(0) >= 1.0, optimize; st(0) = y again */
__asm fstsw ax /*copy the Status Word containing the result to AX */
__asm fwait /*insure the previous instruction is completed */
__asm sahf /*transfer the condition codes to the CPU's flag register */
__asm jb notSmall /* 1 < y */

/*We can do no better than using the built in 2^x-1 */
__asm fmul st(0), st(0) /* st(0) = y^2 */
__asm fchs /* st(0) = -y^2 */
__asm f2xm1 /* st(0) = 2^( -y^2 )-1 */
__asm fld1 /* st(0) = 1., st(1) = 2^( - y^2 ) -1. */
__asm faddp st(1), st(0) /* st(0) = 2^( - y^2 ) */
__asm jmp done

/* optimized case */
__asm notSmall:
__asm fld s_dShift /* st(0) = s_dShift, st(1) = y */
__asm fld st(1) /* st(0) = y, st(1) = s_dShift, st(2)=y */
__asm fscale /* st(0) = 2^s_dShift*y, st(1) = s_dShift, st(2)=y */
__asm frndint /* st(0) = rnd(y*2^s_dShift), st(1) = s_dShift, st(2)=y */
__asm fxch st(1) /* st(0) = s_dShift, st(1) = rnd(y*2^s_dShift), st(2) = y */
__asm fchs /* st(0) = -s_dShift, st(1) = rnd(y*2^s_dShift), st(2) = y */
__asm fxch st(1) /* st(0) = rnd(y*2^s_dShift), st(1) = -s_dShift, st(2) = y */
__asm fscale /* st(0) = z = rnd(y*2^s_dShift)/2^s_dShift, st(1) = -s_dShift, st(2) = y */
__asm fst st(1) /* st(0) = z, st(1) = z, st(2) = y */
__asm fmul st(0), st(1) /* st(0) = z^2, st(1) = z, st(2) = y */
__asm fld st(0) /* st(0) = z^2, st(1) = z^2, st(2) = z, st(3) = y */
__asm frndint /* st(0) = [z^2], st(1) = z^2, st(2) = z, st(3) = y */
__asm fchs /* st(0) = -[z^2], st(1) = z^2, st(2) = z, st(3) = y */
__asm fadd st(1), st(0) /* st(0) = -[z^2], st(1) = z^2-[z^2], st(2) = z, st(3) = y */
__asm fld st(2) /* st(0) = z, st(1) = -[z^2], st(2) = z^2-[z^2], st(3) = z, st(4) = y */
__asm fsub st(0), st(4) /* st(0) = -(y-z) = z-y, st(1) = -[z^2], st(2) = z^2-[z^2], st(3) = z, st(4) = y */
__asm fxch st(3) /* st(0) = z, st(1) = -[z^2], st(2) = z^2-[z^2], st(3) = -(y-z), st(4) = y */
__asm faddp st(4), st(0) /* st(0) = -[z^2], st(1) = z^2-[z^2], st(2) = -(y-z), st(3) = y+z */
__asm fxch st(3) /* st(0) = y+z, st(1) = z^2-[z^2], st(2) = -(y-z), st(3) = -[z^2] */
__asm fmulp st(2), st(0) /* st(0) = z^2-[z^2], st(1) = -(y-z)(y+z), st(2) = -[z^2] */
__asm fsubp st(1), st(0) /* st(0) = -(y-z)(y+z)-(z^2-[z^2]), st(1) = -[z^2] */

#ifdef _DEBUG
__asm fld1 /* st(0) = 1., st(1) = -(y-z)(y+z)-(z^2-[z^2]), st(2) = -[z^2] */
__asm fld st(1) /* st(0) = -(y-z)(y+z)-(z^2-[z^2]), st(1) = 1., st(2) = -(y-z)(y+z)-(z^2-[z^2]), st(3) = -[z^2] */
__asm fabs /* st(0) = |-(y-z)(y+z)-(z^2-[z^2])|, st(1) = 1., st(2) = -(y-z)(y+z)-(z^2-[z^2]), st(3) = -[z^2] */
__asm fcomp st(1) /*if st(0) >= 1.0, error, st(0) = 1., st(1) = -(y-z)(y+z)-(z^2-[z^2]), st(2) = -[z^2] */
__asm fstsw ax /*copy the Status Word containing the result to AX */
__asm fwait /*insure the previous instruction is completed */
__asm sahf /*transfer the condition codes to the CPU's flag register */
__asm ja error_handler /* indeterminate. */
__asm fxch st(1) /* st(0) = -(y-z)(y+z)-(z^2-[z^2]), st(1) = 1., st(2) = -[z^2] */
__asm f2xm1 /* st(0) = 2^( - ((y-z)(y+z)+(z^2-[z^2])) ) -1, st(1) = 1., st(2) = -[z^2] */
#else
__asm f2xm1 /* st(0) = 2^( - ((y-z)(y+z)+(z^2-[z^2])) ) -1, st(1) = -[z^2] */
__asm fld1 /* st(0) = 1., st(1) = 2^( - ((y-z)(y+z)+(z^2-[z^2])) ) -1, st(2) = -[z^2] */
#endif

__asm faddp st(1), st(0) /* st(0) = 2^( - ((y-z)(y+z)+(z^2-[z^2])) ), st(1) = -[z^2] */
__asm fscale /* st(1) = 2^( - ((y-z)(y+z)+(z^2-[z^2])) -[z^2])=2^(-y^2), st(1)= -[z^2] */
__asm ffree st(1)
__asm done:
__asm fldcw nSavedCW /* restore the previous Control Word */
#ifdef _DEBUG
return;
error_handler:
INTERNAL_THROW("Invalud x-value passed to exp_X2_2");
#endif

}

#pragma warning( pop )