 Mark as New
 Bookmark
 Subscribe
 Mute
 Subscribe to RSS Feed
 Permalink
 Report Inappropriate Content
Dear Intel Community
How is standards , range , accuracy trigonometric functions in Intel C++ Compiler ? How is Your advise ?
Link Copied
 Mark as New
 Bookmark
 Subscribe
 Mute
 Subscribe to RSS Feed
 Permalink
 Report Inappropriate Content
Your question seems to cover a variety of topics, many of which probably don't interest you, if you haven't taken the time to read a few bits of references such as
fasttranscendentals option involves accepting fewer bits of accuracy in the interest of accepting vectorized math functions. In some corner cases it may increase roundoff error from 0.5 ULP with that option off to nearly 4 ULP with the option set.
Maybe if you would check some of these references about icc you could ask more specific questions.
 Mark as New
 Bookmark
 Subscribe
 Mute
 Subscribe to RSS Feed
 Permalink
 Report Inappropriate Content
Probably the most accurate transcendental functions are those implemented in microcode: fsin , fcos.
 Mark as New
 Bookmark
 Subscribe
 Mute
 Subscribe to RSS Feed
 Permalink
 Report Inappropriate Content
iliyapolak wrote:
Probably the most accurate transcendental functions are those implemented in microcode: fsin , fcos.
Those implement long double precision (64bit precision). They have the somewhat peculiar feature of working only for x < (1<<64). Within that range, range reduction doesn't use any extra precision beyond what is provided by the accident of Pi being correct to 66 bits due to the 65th and 66th bits being zero.
Although icc brings its own math libraries for float and double data types, long double relies on glibc provided with gcc on linux and is not fully covered on Windows. You could check this by examining the object files built by the compiler, using nm on linux or dumpbin on Windows (where you would see long double implemented as double). So you could look up references on glibc to learn about the linux long double math functions.
 Mark as New
 Bookmark
 Subscribe
 Mute
 Subscribe to RSS Feed
 Permalink
 Report Inappropriate Content
I was under assumption that x87 trigo instruction implementation uses 80bit precision PI value.
I have found this thread very informative.
 Mark as New
 Bookmark
 Subscribe
 Mute
 Subscribe to RSS Feed
 Permalink
 Report Inappropriate Content
 Mark as New
 Bookmark
 Subscribe
 Mute
 Subscribe to RSS Feed
 Permalink
 Report Inappropriate Content
Of course precision of long double is 64bit.
Thanks for correcting me.
 Mark as New
 Bookmark
 Subscribe
 Mute
 Subscribe to RSS Feed
 Permalink
 Report Inappropriate Content
Regarding my previous post I wanted to add that if x87 fsin,fcos implementation is based on Taylor approximation with the Horner scheme then also precomputed coefficients should have 64bit precision. Intel vectorized implementation if it is based on Horner scheme then coefficients probably have precision of 53bit. It could be interested to compare the results from both of implementation when the argument is close to PI/2 and PI/2.
I still think that x87 trigo which is implemented as a microcode assists probably have hardcoded value of PI which cannot be set by OS to 53bit. I could be wrong on my assumption.
 Mark as New
 Bookmark
 Subscribe
 Mute
 Subscribe to RSS Feed
 Permalink
 Report Inappropriate Content
how is precision , and is counted together precision Taylor complex trigonometric function and Horner calculated in x87 ?
 Mark as New
 Bookmark
 Subscribe
 Mute
 Subscribe to RSS Feed
 Permalink
 Report Inappropriate Content
I do want to discuss the subject of Chebyshev economized polynomials and the like, and the methods for organizing polynomial evaluation faster but as accurate as Horner (published millenia before Horner repopularized the method), in upcoming blogs. It's not as important a subject as it might otherwise be due to the possibility of gaining better performance yet by tabular interpolation, which of course is 53bit double for vectorizability.
You may remember the notorious "Pentium fdiv bug" which showed that even x87 division involved some kind of tabular lookup as long as 2 decades ago, with such a big table that ordinary test suites might not expose errors.
As Iliya said, the x87 precision mode doesn't affect the builtin x87 math functions (other than sqrt) which use long double. But Intel gave up on making those performance competitive, beginning with P4.
The x87 doesn't implement any serious protection against accuracy degradation during range reduction such as you would see in Moshier's implementations on netlib.org, so the long double accuracy is retained only over a small interval. There are nonvector math library implementations which should avoid accuracy degradation due to range reduction in sin(). Intel appears to have devoted much effort to finding methods which are sufficiently accurate without costing performance, but you will note that default vectorized math functions are rated at up to 4 ULP error vs. 0.5 ULP for nonvector, and reduced accuracy vector library functions are offered.
 Mark as New
 Bookmark
 Subscribe
 Mute
 Subscribe to RSS Feed
 Permalink
 Report Inappropriate Content
>>>Intel appears to have devoted much effort to finding methods which are sufficiently accurate without costing performance, but you will note that default vectorized math functions are rated at up to 4 ULP error vs. 0.5 ULP for nonvector, and reduced accuracy vector library functions are offered>>>
I think that most non hard realtime applications will not benefit from 64bit precision.
 Mark as New
 Bookmark
 Subscribe
 Mute
 Subscribe to RSS Feed
 Permalink
 Report Inappropriate Content
Rafał B. wrote:
how is precision , and is counted together precision Taylor complex trigonometric function and Horner calculated in x87 ?
The quoted title below partly answers your question. Just gogle it this is pdf doc.
"Accurate Floating Point Arithmetic through Hardware ErrorFree Transformations"
Consider following scenario where there are two implementation of Taylor approximation: naive and Horner scheme.
The naive implementation simply implements in code Taylor formula and Horner scheme uses precalculated coefficient without lengthy div operation. By comparing the results you can expect that Horner scheme will outperform in terms of resulting precision and accuracy the naive implementation.
The culprit of lesser precision can be usage of division operation in naive method.
Btw, I have in MASM assembly both of those implementation and I plan to test for accurracy of the result.
 Mark as New
 Bookmark
 Subscribe
 Mute
 Subscribe to RSS Feed
 Permalink
 Report Inappropriate Content
x87 is that only method for representation trigonometric functions in Intel Core i5 , Intel Xenon E5  2609 0 ?
 Mark as New
 Bookmark
 Subscribe
 Mute
 Subscribe to RSS Feed
 Permalink
 Report Inappropriate Content
Thank You for Your answer .
Under word vector do You think Taylor method on complex surface 4 ULP and scalar Horner sheme 0,5 ULP ? In result gave 53 bit precision instead 64 bit precision ?
 Mark as New
 Bookmark
 Subscribe
 Mute
 Subscribe to RSS Feed
 Permalink
 Report Inappropriate Content

How are translated trigonometric functions of x87 on Intel C + +?
 Mark as New
 Bookmark
 Subscribe
 Mute
 Subscribe to RSS Feed
 Permalink
 Report Inappropriate Content
Rafał B. wrote:
x87 is that only method for representation trigonometric functions in Intel Core i5 , Intel Xenon E5  2609 0 ?
If you mean machine code instructions then yes. On the other hand Intel has special and transcendental functions vector libraries which have lesser accuracy ,but have greater throughput.
Btw, there are three version of those function where HA (high accurate) has precision < 1.0 ULP.
https://software.intel.com/sites/products/documentation/hpc/mkl/vml/functions/_listfunc.html
 Mark as New
 Bookmark
 Subscribe
 Mute
 Subscribe to RSS Feed
 Permalink
 Report Inappropriate Content
Rafał B. wrote:
How are translated trigonometric functions of x87 on Intel C + +?
Compiler emits opcodes for x87 builtin functions like fcos , fsin etc....
IIRC for compilation of 64bit executables containing floating point code compiler will not emit x87 opcodes.
 Mark as New
 Bookmark
 Subscribe
 Mute
 Subscribe to RSS Feed
 Permalink
 Report Inappropriate Content
@Rafal B
If you are interested in AMD K5 implementation of transcendental function google "K5 Transcendental Functions"
 Mark as New
 Bookmark
 Subscribe
 Mute
 Subscribe to RSS Feed
 Permalink
 Report Inappropriate Content
Complex transcendentals involve combinations such that the 4 ULP claim would be best case applying relative to the nominal 24bit float or 53bit double precision for the vector math function library, or close to 0.5 ULP for the nonvector implementations.
The glibc implementation of expl() (presumably used on linux by both icc and gcc) isn't necessarily done carefully enough to quote any specific error estimate relative to the nominal 64bit precision (if you override any measures your compiler may take to set 53bit precision). That isn't so difficult to improve upon with your own implementation, but you can expect poor performance in comparison with double vectorized libraries. If you're running under Visual Studio, you don't have satisfactory long double math libraries.
You don't have an option to request x87 based float or double libraries except by running IA32 mode with the 32bit compilers, where you give up on vectorization. In current Intel Windows C++, Qlongdouble enables use of x87 to support long double even when using SSE2 to support double. According to the doc, __LONG_DOUBLE_SIZE() macro is available to test whether long double is set to 53bit precision (macro returning 64) or 64bit (macro returning 80). I don't know whether this should track any change of setting you make by using intrinsics or asm to change it (or does it simply reflect the option used to compile that object). The old option /Qpc80 to set 64bit precision mode seems to have been deprecated, but I don't see it in the current deprecation list.
MKL offers VSL vector math functions in various accuracies. I haven't had the incentive to use those.
You brought up Horner polynomial as a possible way of implementing math functions, but I'm not getting a sense of your specific interest. Hornerlike implementation of economized polynomials (e.g. Chebyshev economization) looks more competitive with likely tabular methods, by using fma implementation on corei74, and it offers a little extra precision also by taking advantage of fma. Horner popularized the algorithm in English language literature but it was published long before in other languages. I'm trying to look up my drafts on the subject. It's peripheral to the subject of Intel C++ and I'm having difficulty thinking how much might be topical. This x87 stuff was a diversion from my impression of the original question on the thread.
As we've brought up x87 code and polynomial evaluation (pretty much off forum topic), here's a snippet of old tanh() code (one of the few elementary functions where you want to augment the curve fits buried in the x87 ROM):
__inline_mathcodeNP (tanh, __x, \
if(__fabsl(__x) <= .34657){ \
long double __x2 = __x * __x; \
long double __x4 = __x2*__x2; \
return __x + __x2*__x*( \
0.3333333333333333333028L \
+__x2*(0.133333333333333321200L \
+__x2*0.5396825396825207695E01L \
+__x4*(0.218694885360028124E01L \
+__x2*0.88632355226515778E02 \
+__x4*(0.3592127817609080E02 \
+__x2*0.14558300258105E02) \
+__x4*__x4*(0.5899693119329E03 \
+__x2*0.238614526828E03 \
+__x4*(0.9399418484E04 \
+__x2*0.294863013E04)))));} \
return 1  2 / (expl(__x + __x) + 1))
I suppose what someone called naive would be individual calculation of each of the expanded powers of __x, relying on the compiler to extract any commonality, but not specifying an order of evaluation to control roundoff. The coefficients are derived by Chebyshev economization of a polynomial fit for tanh(__x)/__x over the interval __x < .34657 ; where the formula 12/(exp(2x)1) would suffer cancellation error.
The full latency of evaluation of the polynomial series is meant to be around 0.2 of what is involved in a plain Horner series. There would be more instances of underflow than in the Horner series but not nearly as many as in the "naive" evaluation.
and here's an example of how to make expl() consistently accurate to 62 bit or better precision by using the internal curve fit and log base 2(e) while limiting roundoff error :
# define __exp_code \
long double __value; \
long double __exponent; \
long double __temp; \
__asm ("fldl2e" : "=t" (__temp) ); \
__x = .693147178739309310913*(__exponent = rintl(__x * __temp)); \
__x = (__x1.82063599850414622404E09*__exponent)*__temp; \
__asm ("f2xm1" : "=t" (__value) : "0" (__x)); \
__asm ("fscale" : "=t" (__value) : "0" (__value+1), "u" (__exponent)); \
return __value
__inline_mathcode_ (long double, __expl, __x, __exp_code)
The 2 double constants (of opposite sign) should add up to a long double accurate value of natural log of 2 and might be replaced by the corresponding same signed pair M_LN2LO and M_LN2HI seen in some (nonstandard?) math.h implementations.
Now maybe you begin to appreciate having some math function work done for you.
 Mark as New
 Bookmark
 Subscribe
 Mute
 Subscribe to RSS Feed
 Permalink
 Report Inappropriate Content
>>>You brought up Horner polynomial as a possible way of implementing math functions, but I'm not getting a sense of your specific interest. Hornerlike implementation of economized polynomials (e.g. Chebyshev economization) looks more competitive with likely tabular methods,>>>
I plan to implement math functions by using also Chebyshev economization. Later I would like to perform various tests of speed and accuracy.
 Mark as New
 Bookmark
 Subscribe
 Mute
 Subscribe to RSS Feed
 Permalink
 Report Inappropriate Content
Suitable implementations of Chebyshev economization give a small improvement in accuracy when comparing with a Taylor series, due to improved convergence of the polynomial within the chosen range. As in my posting above, in the case of the leading term linear in x, it's likely to be appropriate to economize based on fitting f(x)/x, using enough terms that the leading coefficient comes out to 1.0 in the data type.
More important for accuracy is arranging the series for accurate order of evaluation. Horner's method accomplishes that inherently for the most part, at least in the case where the magnitude of terms decreases with order (the reduced interval must be chosen to accomplish that). For the case where the leading coefficient is 1.0 or even 2.0, improved accuracy may be obtained by adding the entire leading term separately at the end (more so with fma, where the roundoff after the last multiplication is eliminated). It's convenient when this also is faster.
At the risk of repeating myself, a pure Horner evaluation is uncompetitive for performance, given that current CPUs offer pipeline capability to evaluate several terms in parallel. I could imagine an algorithm to optimize polynomial evaluation far better than compilers do, under Fortran style rules for reassociation.
The question of accuracy comparison between Chebyshev economized polynomial and a true minimax polynomial is more difficult but perhaps inconsequential. A minimax polynomial should have slightly less maximum error than a Chebyshev economized one of the same number of terms, but a good Chebyshev polynomial is more accurate over most of the range. Even at the point where the minimax is more accurate if evaluated in extra precision, it may not be so when evaluated in target precision. Also, a satisfactory minimax polynomial requires a more accurate reference implementation of the function, and the extra calculation may be tedious if using a multiple precision library, where ordinary long double may be sufficient to fit a double data type Chebyshev polynomial, and ancient utilities like bc work as well. I translated published Chebyshev economization software to bc myself.
gnu mpc and mpfr have greatly improved the selection of choices since I started playing with this, and the chances of getting a high quality math library by default with your compiler are better as well (with possible exception of simd math libraries, which are a big part of the value proposition for Intel compilers).
I've never found an easily accessible documentation of the practical side of this stuff, although it's clear that experts have worked on it for decades, so I feel somewhat justified in writing about it myself, if I can find out what people will find useful. Plauger's "The Standard C Library" was an excellent reference but didn't spark any subsequent interest in discussing the subject. Things were more difficult back when IBMers had to allow for hex floating point and Plauger had to deal with the mixed byte order of VAX.
 Subscribe to RSS Feed
 Mark Topic as New
 Mark Topic as Read
 Float this Topic for Current User
 Bookmark
 Subscribe
 Printer Friendly Page