standards , range , accuracy trigonometric functions

RBato · ‎08-01-2014

Dear Intel Community

How is standards , range , accuracy trigonometric functions in Intel C++ Compiler ? How is Your advise ?

TimP · ‎08-01-2014

Your question seems to cover a variety of topics, many of which probably don't interest you, if you haven't taken the time to read a few bits of references such as

https://software.intel.com/sites/products/documentation/studio/composer/en-us/2011Update/compiler_c/copts/common_options/option_fast_transcendentals.htm

https://software.intel.com/en-us/articles/consistency-of-floating-point-results-using-the-intel-compiler

-fast-transcendentals option involves accepting fewer bits of accuracy in the interest of accepting vectorized math functions. In some corner cases it may increase roundoff error from 0.5 ULP with that option off to nearly 4 ULP with the option set.

Maybe if you would check some of these references about icc you could ask more specific questions.

Bernard · ‎08-02-2014

Probably the most accurate transcendental functions are those implemented in micro-code: fsin , fcos.

TimP · ‎08-02-2014

iliyapolak wrote:

Probably the most accurate transcendental functions are those implemented in micro-code: fsin , fcos.

Those implement long double precision (64-bit precision). They have the somewhat peculiar feature of working only for |x| < (1<<64). Within that range, range reduction doesn't use any extra precision beyond what is provided by the accident of Pi being correct to 66 bits due to the 65th and 66th bits being zero.

Although icc brings its own math libraries for float and double data types, long double relies on glibc provided with gcc on linux and is not fully covered on Windows. You could check this by examining the object files built by the compiler, using nm on linux or dumpbin on Windows (where you would see long double implemented as double). So you could look up references on glibc to learn about the linux long double math functions.

Bernard · ‎08-02-2014

I was under assumption that x87 trigo instruction implementation uses 80-bit precision PI value.

I have found this thread very informative.

https://software.intel.com/en-us/forums/topic/289702

TimP · ‎08-02-2014

Precision of 80-bit format is 64 in hardware default. Normally set down to 53 under windows and by intel fortran main program. Internal pi value is 66 bit precision by good luck. Within fcos fsin fsincos full accuracy should be produced as if in 64-bit precision mode.

Bernard · ‎08-02-2014

Of course precision of long double is 64-bit.

Thanks for correcting me.

Bernard · ‎08-03-2014

Regarding my previous post I wanted to add that if x87 fsin,fcos implementation is based on Taylor approximation with the Horner scheme then also pre-computed coefficients should have 64-bit precision. Intel vectorized implementation if it is based on Horner scheme then coefficients probably have precision of 53-bit. It could be interested to compare the results from both of implementation when the argument is close to PI/2 and -PI/2.

I still think that x87 trigo which is implemented as a micro-code assists probably have hardcoded value of PI which cannot be set by OS to 53-bit. I could be wrong on my assumption.

RBato · ‎08-03-2014

how is precision , and is counted together precision Taylor complex trigonometric function and Horner calculated in x87 ?

TimP · ‎08-03-2014

I do want to discuss the subject of Chebyshev economized polynomials and the like, and the methods for organizing polynomial evaluation faster but as accurate as Horner (published millenia before Horner re-popularized the method), in upcoming blogs. It's not as important a subject as it might otherwise be due to the possibility of gaining better performance yet by tabular interpolation, which of course is 53-bit double for vectorizability.

You may remember the notorious "Pentium fdiv bug" which showed that even x87 division involved some kind of tabular lookup as long as 2 decades ago, with such a big table that ordinary test suites might not expose errors.

As Iliya said, the x87 precision mode doesn't affect the built-in x87 math functions (other than sqrt) which use long double. But Intel gave up on making those performance competitive, beginning with P4.

The x87 doesn't implement any serious protection against accuracy degradation during range reduction such as you would see in Moshier's implementations on netlib.org, so the long double accuracy is retained only over a small interval. There are non-vector math library implementations which should avoid accuracy degradation due to range reduction in sin(). Intel appears to have devoted much effort to finding methods which are sufficiently accurate without costing performance, but you will note that default vectorized math functions are rated at up to 4 ULP error vs. 0.5 ULP for non-vector, and reduced accuracy vector library functions are offered.

Bernard · ‎08-04-2014

>>>Intel appears to have devoted much effort to finding methods which are sufficiently accurate without costing performance, but you will note that default vectorized math functions are rated at up to 4 ULP error vs. 0.5 ULP for non-vector, and reduced accuracy vector library functions are offered>>>

I think that most non hard real-time applications will not benefit from 64-bit precision.

Bernard · ‎08-04-2014

Rafał B. wrote:

how is precision , and is counted together precision Taylor complex trigonometric function and Horner calculated in x87 ?

The quoted title below partly answers your question. Just gogle it this is pdf doc.

"Accurate Floating Point Arithmetic through Hardware Error-Free Transformations"

Consider following scenario where there are two implementation of Taylor approximation: naive and Horner scheme.

The naive implementation simply implements in code Taylor formula and Horner scheme uses pre-calculated coefficient without lengthy div operation. By comparing the results you can expect that Horner scheme will outperform in terms of resulting precision and accuracy the naive implementation.

The culprit of lesser precision can be usage of division operation in naive method.

Btw, I have in MASM assembly both of those implementation and I plan to test for accurracy of the result.

RBato · ‎08-04-2014

x87 is that only method for representation trigonometric functions in Intel Core i5 , Intel Xenon E5 - 2609 0 ?

RBato · ‎08-04-2014

Thank You for Your answer .
Under word vector do You think Taylor method on complex surface 4 ULP and scalar Horner sheme 0,5 ULP ? In result gave 53 bit precision instead 64 bit precision ?

RBato · ‎08-04-2014

How are translated trigonometric functions of x87 on Intel C + +?

Bernard · ‎08-04-2014

Rafał B. wrote:

x87 is that only method for representation trigonometric functions in Intel Core i5 , Intel Xenon E5 - 2609 0 ?

If you mean machine code instructions then yes. On the other hand Intel has special and transcendental functions vector libraries which have lesser accuracy ,but have greater throughput.

Btw, there are three version of those function where HA (high accurate) has precision < 1.0 ULP.

https://software.intel.com/sites/products/documentation/hpc/mkl/vml/functions/_listfunc.html

Bernard · ‎08-04-2014

Rafał B. wrote:

How are translated trigonometric functions of x87 on Intel C + +?

Compiler emits opcodes for x87 built-in functions like fcos , fsin etc....

IIRC for compilation of 64-bit executables containing floating point code compiler will not emit x87 opcodes.

Bernard · ‎08-04-2014

@Rafal B

If you are interested in AMD K5 implementation of transcendental function google "K5 Transcendental Functions"

TimP · ‎08-06-2014

Complex transcendentals involve combinations such that the 4 ULP claim would be best case applying relative to the nominal 24-bit float or 53-bit double precision for the vector math function library, or close to 0.5 ULP for the non-vector implementations.

The glibc implementation of expl() (presumably used on linux by both icc and gcc) isn't necessarily done carefully enough to quote any specific error estimate relative to the nominal 64-bit precision (if you over-ride any measures your compiler may take to set 53-bit precision). That isn't so difficult to improve upon with your own implementation, but you can expect poor performance in comparison with double vectorized libraries. If you're running under Visual Studio, you don't have satisfactory long double math libraries.

You don't have an option to request x87 based float or double libraries except by running IA32 mode with the 32-bit compilers, where you give up on vectorization. In current Intel Windows C++, -Qlong-double enables use of x87 to support long double even when using SSE2 to support double. According to the doc, __LONG_DOUBLE_SIZE() macro is available to test whether long double is set to 53-bit precision (macro returning 64) or 64-bit (macro returning 80). I don't know whether this should track any change of setting you make by using intrinsics or asm to change it (or does it simply reflect the option used to compile that object). The old option /Qpc80 to set 64-bit precision mode seems to have been deprecated, but I don't see it in the current deprecation list.

MKL offers VSL vector math functions in various accuracies. I haven't had the incentive to use those.

You brought up Horner polynomial as a possible way of implementing math functions, but I'm not getting a sense of your specific interest. Horner-like implementation of economized polynomials (e.g. Chebyshev economization) looks more competitive with likely tabular methods, by using fma implementation on corei7-4, and it offers a little extra precision also by taking advantage of fma. Horner popularized the algorithm in English language literature but it was published long before in other languages. I'm trying to look up my drafts on the subject. It's peripheral to the subject of Intel C++ and I'm having difficulty thinking how much might be topical. This x87 stuff was a diversion from my impression of the original question on the thread.

As we've brought up x87 code and polynomial evaluation (pretty much off forum topic), here's a snippet of old tanh() code (one of the few elementary functions where you want to augment the curve fits buried in the x87 ROM):

__inline_mathcodeNP (tanh, __x, \
if(__fabsl(__x) <= .34657){                           \
        long double __x2 = __x * __x;   \
        long double __x4 = __x2*__x2;   \
          return __x + __x2*__x*(      \
-0.3333333333333333333028L             \
+__x2*(0.133333333333333321200L       \
+__x2*-0.5396825396825207695E-01L      \
+__x4*(0.218694885360028124E-01L      \
+__x2*-0.88632355226515778E-02         \
+__x4*(0.3592127817609080E-02         \
+__x2*-0.14558300258105E-02)           \
+__x4*__x4*(0.5899693119329E-03       \
+__x2*-0.238614526828E-03              \
+__x4*(0.9399418484E-04               \
+__x2*-0.294863013E-04)))));}          \
return 1 - 2 / (expl(__x + __x) + 1))

I suppose what someone called naive would be individual calculation of each of the expanded powers of __x, relying on the compiler to extract any commonality, but not specifying an order of evaluation to control roundoff. The coefficients are derived by Chebyshev economization of a polynomial fit for tanh(__x)/__x over the interval |__x| < .34657 ; where the formula 1-2/(exp(2x)-1) would suffer cancellation error.

The full latency of evaluation of the polynomial series is meant to be around 0.2 of what is involved in a plain Horner series. There would be more instances of underflow than in the Horner series but not nearly as many as in the "naive" evaluation.

and here's an example of how to make expl() consistently accurate to 62 bit or better precision by using the internal curve fit and log base 2(e) while limiting roundoff error :

# define __exp_code \
   long double __value;                                               \
   long double __exponent;                                            \
long double __temp;                                           \
__asm ("fldl2e" : "=t" (__temp) );                                    \
__x -= .693147178739309310913*(__exponent = rintl(__x * __temp));     \
__x = (__x-1.82063599850414622404E-09*__exponent)*__temp;             \
__asm ("f2xm1" : "=t" (__value) : "0" (__x));                         \
__asm ("fscale" : "=t" (__value) : "0" (__value+1), "u" (__exponent)); \
return __value
__inline_mathcode_ (long double, __expl, __x, __exp_code)

The 2 double constants (of opposite sign) should add up to a long double accurate value of natural log of 2 and might be replaced by the corresponding same signed pair M_LN2LO and M_LN2HI seen in some (non-standard?) math.h implementations.

Now maybe you begin to appreciate having some math function work done for you.

Bernard · ‎08-07-2014

>>>You brought up Horner polynomial as a possible way of implementing math functions, but I'm not getting a sense of your specific interest. Horner-like implementation of economized polynomials (e.g. Chebyshev economization) looks more competitive with likely tabular methods,>>>

I plan to implement math functions by using also Chebyshev economization. Later I would like to perform various tests of speed and accuracy.

TimP · ‎08-07-2014

Suitable implementations of Chebyshev economization give a small improvement in accuracy when comparing with a Taylor series, due to improved convergence of the polynomial within the chosen range. As in my posting above, in the case of the leading term linear in x, it's likely to be appropriate to economize based on fitting f(x)/x, using enough terms that the leading coefficient comes out to 1.0 in the data type.

More important for accuracy is arranging the series for accurate order of evaluation. Horner's method accomplishes that inherently for the most part, at least in the case where the magnitude of terms decreases with order (the reduced interval must be chosen to accomplish that). For the case where the leading coefficient is 1.0 or even 2.0, improved accuracy may be obtained by adding the entire leading term separately at the end (more so with fma, where the roundoff after the last multiplication is eliminated). It's convenient when this also is faster.

At the risk of repeating myself, a pure Horner evaluation is uncompetitive for performance, given that current CPUs offer pipeline capability to evaluate several terms in parallel. I could imagine an algorithm to optimize polynomial evaluation far better than compilers do, under Fortran style rules for re-association.

The question of accuracy comparison between Chebyshev economized polynomial and a true minimax polynomial is more difficult but perhaps inconsequential. A minimax polynomial should have slightly less maximum error than a Chebyshev economized one of the same number of terms, but a good Chebyshev polynomial is more accurate over most of the range. Even at the point where the minimax is more accurate if evaluated in extra precision, it may not be so when evaluated in target precision. Also, a satisfactory minimax polynomial requires a more accurate reference implementation of the function, and the extra calculation may be tedious if using a multiple precision library, where ordinary long double may be sufficient to fit a double data type Chebyshev polynomial, and ancient utilities like bc work as well. I translated published Chebyshev economization software to bc myself.

gnu mpc and mpfr have greatly improved the selection of choices since I started playing with this, and the chances of getting a high quality math library by default with your compiler are better as well (with possible exception of simd math libraries, which are a big part of the value proposition for Intel compilers).

I've never found an easily accessible documentation of the practical side of this stuff, although it's clear that experts have worked on it for decades, so I feel somewhat justified in writing about it myself, if I can find out what people will find useful. Plauger's "The Standard C Library" was an excellent reference but didn't spark any subsequent interest in discussing the subject. Things were more difficult back when IBMers had to allow for hex floating point and Plauger had to deal with the mixed byte order of VAX.