Solved: Optimization of sine function's taylor expansion - Page 2

Bernard · ‎05-24-2012

Hello!
While coding in assembly various series expansions of many functions i tried to optimize my code mainly by using rcpps instruction instead of divaps.When performing tests oncode which calculates sine function by taylor series i did a few measurements as explained here :"http://software.intel.com/en-us/forums/showthread.php?t=52482" and the result was between 10-12 cycles per first term of sine expansion(i used 14 terms).
I would like to ask you how can i rewrite this code in order to gain speed of execution improvment.
[bash]movups xmm0,argument movups xmm1,argument mulps xmm1,xmm1 mulps xmm1,xmm0 mov ebx,OFFSET coef movups xmm2,[ebx] rcpps xmm3,xmm2 ;rcpps used instead of divaps mulps xmm1,xmm3 subps xmm0,xmm1[/bash]

bronxzv · ‎06-08-2012

calls 1e6 times fastsin() the result in millisecond is 63

so it's 63 ns per iteration or ~ 120 clocks on your CPU, it does't match your previous reports IIRC

if you keep only the polynomial (get rid of the strange domain check) you should begin to see timings nearer than mine

View solution in original post

SergeyKostrov · ‎05-25-2012

Quoting iliyapolak

...The best idea is try to study various approximation methodsandto compare them for speed and accuracy.

I'll provide you with performance numbers, compared to CRT-function 'sin', for:

Normalized Series ( up to 7th term )
Normalized Series ( up to 9th term )
Chebyshev Polynomial ( up to 7th term )
Chebyshev Polynomial ( up to 9th term )
Normalized Chebyshev Polynomial ( up to 7th term )
Normalized Chebyshev Polynomial ( up to 9th term )
Normalized Taylor Series ( up to 7th term )
Normalized Taylor Series ( up to 9th term )

MyAPI isimplemented in C as functions and macros.

Best regards,
Sergey

Bernard · ‎05-25-2012

Thanks Sergey you will save me a lot of coding :).
As far i know CRT- sin uses taylor approximation(polynomial multiplication)up to 14th term combined with excessive range reduction and overflow checking so we have a lot of branches in that code.
Sadly my implementation of sin(written in java)is based on taylor series it is simply for loop with pow library function called from inside the loop so it is not so fast as others algorithms I need to rewrite it.
Sergey while comparing various approximations i think that good idea is to code it in assembly language in order to remove compiler-generated code when compiled from high-level language.

bronxzv · ‎05-26-2012

I'll be interested to hear about your findings (speed vs. precision) of the different methods

AFAIK Taylor seriesaremuch worse thanother approximations such as Legendre/Chebyshev/Jacobi polynomials, or best of allto minimize the absolute error: Minimax approximations

Bernard · ‎05-26-2012

In my library i was able to achieve an accuracy beetwen 0.001 and 0.000001 depends on formula being used .I remember that the worse accuracy was acnhieved with gnalternating sign polynomial Probably because of catastrophjc cancellation.Results of calculations were compared to Mathematica 8

TimP · ‎05-26-2012

That's the reason for range reduction so that Horner's method gives you sum in order of increasing magnitude, using the well known precaution of evaluating
x + x**3 * (a3 + x**2 * (a5 + .....
for the case a0==0 a1=1 a2==0 ....
As others indicated, Cheybyshev or mini-max economization of the polynomial gives you better convergence (and accuracy).

You've demonstrated fully that beginning by concentrating on intrinsics without getting the algorithm right is pointless.

Bernard · ‎05-26-2012

Thanks TimP for your help.I must say that i put to much faith in taylor expansion Need to learn other methods.

SergeyKostrov · ‎05-26-2012

Hi Iliya,

Quoting iliyapolak

...As far i know CRT- sin uses taylor approximation(polynomial multiplication)up to 14th term...

I wonder how is it possible?All powers of the Taylorseries forsin(X)are odd:

sin(X) ~= X - X^3/3! + X^5/5! - X^7/7! + X^9/9! - X^11/11! + X^13/13! - X^15/15! + X^17/17! - ...

Unless it is partially normalized:

sin(X) ~= X* (1 - X^2/3! + X^4/5! - X^6/7! + X^8/9! - X^10/11! + X^12/13! - X^14/15! )

Bernard · ‎05-26-2012

Sorry made a mistake my english is still not perfect.I mean count of terms composing the taylor series and not 14th power of x.

SergeyKostrov · ‎05-27-2012

Quoting Sergey Kostrov

Quoting iliyapolak
...The best idea is try to study various approximation methodsandto compare them for speed and accuracy.

I'll provide you with performance numbers, compared to CRT-function 'sin', for:

Normalized Series ( up to 7th term )
Normalized Series ( up to 9th term )
Chebyshev Polynomial ( up to 7th term )
Chebyshev Polynomial ( up to 9th term )
Normalized Chebyshev Polynomial ( up to 7th term )
Normalized Chebyshev Polynomial ( up to 9th term )
Normalized Taylor Series ( up to 7th term )
Normalized Taylor Series ( up to 9th term )

MyAPI isimplemented in C as functions and macros.

Here are results for C functions:

[plain]Normalized Chebyshev Polynomial 7t Sin(45)=0.70710740923532034 - 140 ticks AE=+0.00000062804877288 BT Normalized Series 7t Sin(45)=0.70710647085826561 - 141 ticks AE=-0.00000031032828185 Normalized Taylor Series 7t Sin(45)=0.70710646957517809 - 141 ticks AE=-0.00000031161136937 Normalized Taylor Series 9t Sin(45)=0.70710678293686713 - 156 ticks AE=+0.00000000175031967 Normalized Chebyshev Polynomial 9t Sin(45)=0.70710678268237914 - 172 ticks AE=+0.00000000149583168 BA Normalized Series 9t Sin(45)=0.70710678421995354 - 187 ticks AE=+0.00000000303340608 Interpolated Sin(45)=0.70710676908493042 - 187 ticks AE=-0.00000001210161704 Chebyshev Polynomial 7t Sin(45)=0.70710740923532023 - 188 ticks AE=+0.00000062804877277 Chebyshev Polynomial 9t Sin(45)=0.70710678268237925 - 219 ticks AE=+0.00000000149583179 CRT Sin(45)=0.70710678118654746 - 312 ticks AE=N/A [/plain]

Notes:

1. Visual Studio 2005 / Microsoft C++ compiler / Release configuration / All Optimizations Disabled
2. All tests for a Double-Precision data type ( 'double' ) / FPU precision was set to 53-bits
3. Every test was executed 4,194,304 ( 2^22 ) times with a Process priority set to Normal
4. These functions without verifications ofinput parameters
5. A '7t'means up to 7th term
6. Interpolated Sine function uses a 1.00 degree step for a LUT of Sine Values generated by a CRT-function 'sin'
7. AE ( Absolute Error )= User-function-Value - CRT-function-Value
8. BT - Best Time / BA - Best Accuracy

Best regards,
Sergey

SergeyKostrov · ‎05-27-2012

Quoting iliyapolak

...The best idea is try to study various approximation methodsandto compare them for speed and accuracy...

Hi Iliya,

Take into account that accuracy of approximation if sine valuesvaries. This is how it looks like for a range from 0 to 90 degrees
for a Normalized Chebyshev Polynomial up to 9th term:

[cpp]... Normalized Chebyshev Polynomial 9t CRT-function 'sin' Sin( 0) = 0.00000000000000000000 = 0.00000000000000000000 - AE = 0.000000000000000000 Sin( 1) = 0.01745240642776235700 = 0.01745240643728351200 - AE = -0.000000000009521155 Sin( 2) = 0.03489949668347069200 = 0.03489949670250096900 - AE = -0.000000000019030277 Sin( 3) = 0.05233595621442847100 = 0.05233595624294383500 - AE = -0.000000000028515364 Sin( 4) = 0.06975647370616093500 = 0.06975647374412530200 - AE = -0.000000000037964368 Sin( 5) = 0.08715574270029286000 = 0.08715574274765816600 - AE = -0.000000000047365306 Sin( 6) = 0.10452846321094722000 = 0.10452846326765347000 - AE = -0.000000000056706251 Sin( 7) = 0.12186934333917229000 = 0.12186934340514748000 - AE = -0.000000000065975184 Sin( 8) = 0.13917310088490520000 = 0.13917310096006544000 - AE = -0.000000000075160239 Sin( 9) = 0.15643446495598134000 = 0.15643446504023087000 - AE = -0.000000000084249524 Sin(10) = 0.17364817757369913000 = 0.17364817766693033000 - AE = -0.000000000093231201 Sin(11) = 0.19080899527445139000 = 0.19080899537654480000 - AE = -0.000000000102093417 Sin(12) = 0.20791169070693508000 = 0.20791169081775934000 - AE = -0.000000000110824266 Sin(13) = 0.22495105422445330000 = 0.22495105434386500000 - AE = -0.000000000119411703 Sin(14) = 0.24192189547182422000 = 0.24192189559966773000 - AE = -0.000000000127843514 Sin(15) = 0.25881904496641395000 = 0.25881904510252074000 - AE = -0.000000000136106793 Sin(16) = 0.27563735567281122000 = 0.27563735581699916000 - AE = -0.000000000144187939 Sin(17) = 0.29237170457066503000 = 0.29237170472273677000 - AE = -0.000000000152071744 Sin(18) = 0.30901699421520712000 = 0.30901699437494740000 - AE = -0.000000000159740277 Sin(19) = 0.32556815428998409000 = 0.32556815445715670000 - AE = -0.000000000167172609 Sin(20) = 0.34202014315132739000 = 0.34202014332566871000 - AE = -0.000000000174341319 Sin(21) = 0.35836794936408850000 = 0.35836794954530027000 - AE = -0.000000000181211768 Sin(22) = 0.37460659322817519000 = 0.37460659341591201000 - AE = -0.000000000187736826 Sin(23) = 0.39073112829542028000 = 0.39073112848927377000 - AE = -0.000000000193853489 Sin(24) = 0.40673664287632461000 = 0.40673664307580021000 - AE = -0.000000000199475603 Sin(25) = 0.42261826153621362000 = 0.42261826174069944000 - AE = -0.000000000204485817 Sin(26) = 0.43837114658035231000 = 0.43837114678907740000 - AE = -0.000000000208725093 Sin(27) = 0.45399049952756826000 = 0.45399049973954675000 - AE = -0.000000000211978490 Sin(28) = 0.46947156257193301000 = 0.46947156278589081000 - AE = -0.000000000213957796 Sin(29) = 0.48480962003205796000 = 0.48480962024633706000 - AE = -0.000000000214279094 Sin(30) = 0.49999999978756438000 = 0.49999999999999994000 - AE = -0.000000000212435569 Sin(31) = 0.51503807470229079000 = 0.51503807491005416000 - AE = -0.000000000207763362 Sin(32) = 0.52991926403380607000 = 0.52991926423320490000 - AE = -0.000000000199398831 Sin(33) = 0.54463903482879883000 = 0.54463903501502708000 - AE = -0.000000000186228255 Sin(34) = 0.55919290330392202000 = 0.55919290347074690000 - AE = -0.000000000166824887 Sin(35) = 0.57357643621167165000 = 0.57357643635104605000 - AE = -0.000000000139374401 Sin(36) = 0.58778525219088762000 = 0.58778525229247314000 - AE = -0.000000000101585518 Sin(37) = 0.60181502310146640000 = 0.60181502315204827000 - AE = -0.000000000050581872 Sin(38) = 0.61566147534288251000 = 0.61566147532565829000 - AE = 0.000000000017224222 Sin(39) = 0.62932039115611915000 = 0.62932039104983739000 - AE = 0.000000000106281761 Sin(40) = 0.64278760990861683000 = 0.64278760968653925000 - AE = 0.000000000222077579 Sin(41) = 0.65605902936184934000 = 0.65605902899050728000 - AE = 0.000000000371342068 Sin(42) = 0.66913060692114923000 = 0.66913060635885824000 - AE = 0.000000000562290992 Sin(43) = 0.68199836086740095000 = 0.68199836006249848000 - AE = 0.000000000804902478 Sin(44) = 0.69465837157023880000 = 0.69465837045899725000 - AE = 0.000000001111241543 Sin(45) = 0.70710678268237914000 = 0.70710678118654746000 - AE = 0.000000001495831681 Sin(46) = 0.71933980231473449000 = 0.71933980033865108000 - AE = 0.000000001976083408 Sin(47) = 0.73135370419195334000 = 0.73135370161917046000 - AE = 0.000000002572782876 Sin(48) = 0.74314482878804677000 = 0.74314482547739424000 - AE = 0.000000003310652530 Sin(49) = 0.75470958444175995000 = 0.75470958022277201000 - AE = 0.000000004218987937 Sin(50) = 0.76604444845135888000 = 0.76604444311897801000 - AE = 0.000000005332380870 Sin(51) = 0.77714596814850889000 = 0.77714596145697090000 - AE = 0.000000006691537990 Sin(52) = 0.78801076195092989000 = 0.78801075360672201000 - AE = 0.000000008344207880 Sin(53) = 0.79863552039351415000 = 0.79863551004729283000 - AE = 0.000000010346221324 Sin(54) = 0.80901700713761193000 = 0.80901699437494745000 - AE = 0.000000012762664481 Sin(55) = 0.81915205995818574000 = 0.81915204428899180000 - AE = 0.000000015669193942 Sin(56) = 0.82903759170854807000 = 0.82903757255504174000 - AE = 0.000000019153506337 Sin(57) = 0.83867059126240617000 = 0.83867056794542405000 - AE = 0.000000023316982123 Sin(58) = 0.84804812443294242000 = 0.84804809615642596000 - AE = 0.000000028276516462 Sin(59) = 0.85716733486866747000 = 0.85716730070211233000 - AE = 0.000000034166555141 Sin(60) = 0.86602544492579625000 = 0.86602540378443860000 - AE = 0.000000041141357654 Sin(61) = 0.87461975651689605000 = 0.87461970713939574000 - AE = 0.000000049377500311 Sin(62) = 0.88294765193557789000 = 0.88294759285892688000 - AE = 0.000000059076651016 Sin(63) = 0.89100659465699772000 = 0.89100652418836779000 - AE = 0.000000070468629931 Sin(64) = 0.89879413011395126000 = 0.89879404629916704000 - AE = 0.000000083814784224 Sin(65) = 0.90630788644835647000 = 0.90630778703664994000 - AE = 0.000000099411706533 Sin(66) = 0.91354557523791691000 = 0.91354545764260087000 - AE = 0.000000117595316040 Sin(67) = 0.92050499219778437000 = 0.92050485345244037000 - AE = 0.000000138745343992 Sin(68) = 0.92718401785703108000 = 0.92718385456678742000 - AE = 0.000000163290243660 Sin(69) = 0.93358061820976956000 = 0.93358042649720174000 - AE = 0.000000191712567821 Sin(70) = 0.93969284534074993000 = 0.93969262078590832000 - AE = 0.000000224554841610 Sin(71) = 0.94551883802529291000 = 0.94551857559931674000 - AE = 0.000000262425976172 Sin(72) = 0.95105682230340971000 = 0.95105651629515353000 - AE = 0.000000306008256179 Sin(73) = 0.95630511202798463000 = 0.95630475596303544000 - AE = 0.000000356064949192 Sin(74) = 0.96126210938689549000 = 0.96126169593831889000 - AE = 0.000000413448576597 Sin(75) = 0.96592630539896773000 = 0.96592582628906831000 - AE = 0.000000479109899421 Sin(76) = 0.97029628038365956000 = 0.97029572627599647000 - AE = 0.000000554107663087 Sin(77) = 0.97437070440439277000 = 0.97437006478523525000 - AE = 0.000000639619157528 Sin(78) = 0.97814833768545340000 = 0.97814760073380558000 - AE = 0.000000736951647817 Sin(79) = 0.98162803100239626000 = 0.98162718344766398000 - AE = 0.000000847554732286 Sin(80) = 0.98480872604589964000 = 0.98480775301220802000 - AE = 0.000000973033691620 Sin(81) = 0.98768945575902911000 = 0.98768834059513777000 - AE = 0.000001115163891341 Sin(82) = 0.99026934464787952000 = 0.99026806874157036000 - AE = 0.000001275906309162 Sin(83) = 0.99254760906557782000 = 0.99254615164132198000 - AE = 0.000001457424255835 Sin(84) = 0.99452355746963761000 = 0.99452189536827329000 - AE = 0.000001662101364319 Sin(85) = 0.99619659065267530000 = 0.99619469809174555000 - AE = 0.000001892560929750 Sin(86) = 0.99756620194650159000 = 0.99756405025982420000 - AE = 0.000002151686677387 Sin(87) = 0.99863197739962362000 = 0.99862953475457383000 - AE = 0.000002442645049783 Sin(88) = 0.99939359592819632000 = 0.99939082701909576000 - AE = 0.000002768909100559 Sin(89) = 0.99985082944048176000 = 0.99984769515639127000 - AE = 0.000003134284090489 Sin(90) = 1.00000354293488590000 = 1.00000000000000000000 - AE = 0.000003542934885914 ... [/cpp]

SergeyKostrov · ‎05-27-2012

Precision Control of FPU also needs to be taken into account and here is example ( for 'double' data type ):

[cpp] 24-bit accuracy CRT Sin( 45.0 ) = 0.7071067966408575200000 Normalized: Chebyshev Polynomial 9t Sin( 45.0 ) = 0.7071067690849304200000 53-bit accuracy CRT Sin( 45.0 ) = 0.7071067811865474600000 Normalized: Chebyshev Polynomial 9t Sin( 45.0 ) = 0.7071067826823791400000 64-bit accuracy CRT Sin( 45.0 ) = 0.7071067811865474600000 Normalized: Chebyshev Polynomial 9t Sin( 45.0 ) = 0.7071067826823791400000 Default accuracy CRT Sin( 45.0 ) = 0.7071067811865474600000 Normalized: Chebyshev Polynomial 9t Sin( 45.0 ) = 0.7071067826823791400000 [/cpp]

As you can see 53-bit, 64-bit and Default results are identical. I always recommend to useDefault settings
for FPU and in that case results on different platforms with different C/C++ compilers will be highly consistent!

Best regards,
Sergey

TimP · ‎05-27-2012

I can't get much information from what you just said. I don't know whether you mean hardware or software default, in the case of x87 FPU.
The hardware default for x87 FPU is 64-bit precision mode, but Windows and Intel compiler practice is to set (by "default") 53-bit precision (which gives same results as SSE2 double, aside from exponent range). Microsoft libraries don't support the Intel options /Qlong-double -pc80, so their results aren't predictable for that case; printf will see only 64-bit doubles, never 80-bit long doubles.
gnu compilers don't set 53-bit precision mode in their run-time initialization, so there you get 64-bit precision mode. linux glibc libraries have full support for long-double 64-bit precision mode. mingw compilers rely on Microsoft libraries, thus no library support for long double.
When printf formats a double, IEEE standards support zeroing out the insignificant decimal digits, but nothing requires that various compilers make this same choice.

Bernard · ‎05-28-2012

I mustsay that you have made a great job thanks for posting that. I have a few question regarding your methodology.
How did you make a running-time measurement? Did you use KeGetTickCount function,or maybe an inline rdtsc?
For interpolated sine what interpolation did you use linear interpolation or lagrange interpolation(quadratic or cubic)?
For normalized Taylor Series did you use polynomial fit or maybe for loop and how you calculated powers of an rad argument (many times i used calls to library function pow , but it is non efficient method ).
Were chebyshev polynomial's coefficients pre-calculated , or did your measuremant take into account calculation of chebyshev coefficients ( many calls to library cos).

Bernard · ‎05-28-2012

Sergey
CRT sin implementation is more than two times slower than your library normalized taylor series sin.I think that extensive range reduction ,input checking and overflow checking are responsible for that.

Bernard · ‎05-28-2012

Take into account that accuracy of approximation if sine valuesvaries. This is how it looks like for a range from 0 to 90 degrees
for a Normalized Chebyshev Polynomial up to 9th term

If I 'am not wrong CRT sin function uses or is based on FDLIBM implementation.I can see that AE rises when an angle is close to 90 degrees.When I waschecking my librarysine function for an accuracy I also saw
such a behaviour when an argument was closer toPi/2 error was largerwhen compared tosmaller angles closer to 0.
I suppose that this is taylor series inducedinaccuracy.

bronxzv · ‎05-28-2012

we can seethat the error vary wildly within a range of values, that's why when comparing algorithmsthestandardmethodsare tocompare themaximum absolute error or theaverage absolute errorover thedomain

Bernard · ‎05-28-2012

I suppose that Sergey did some averaging of error's values.He wrote that he ran his tests 2^22 times, but he did not say how many values of 2^22 tests were averaged.

TimP · ‎05-28-2012

As the convergence properties of Taylor series for sin(x) are poor at Pi/2, this isn't a satisfactory method. Chebyshev or minimax economization corrects this. However, given that range reduction is needed in general, it is likely to be used to reduce the range of series evaluation to |x| < Pi/4.
For quick and dirty usage, a rational approximation may be better if it is to cover the entire range |x| < Pi/2. As tan() lends itself to a rational approximation (with odd polynomial in numerator and even in denominator), the trigonometric formulae can be used to deduce the appropriate form for cos() sin()....
Minimax economization of numerator and denominator polynomials is needed to make this competitive.

Bernard · ‎05-28-2012

It seems that I need to rewrite my library completely because I did not take into account taylor expansion
inacurracies.

Bernard · ‎05-28-2012

TimP
Slightly off-topic question.Is it possible for Intel profiler (the one that is part of Parallel Studio) to profile java class file?I 'am interested in finding hot-spots ,function calls count and cpu-time spent while calculating various pieces of code.

bronxzv · ‎05-28-2012

also don't forget to use the Horner's scheme to evaluate your polynomials as suggested by TimP, see an exampleof sin(x) below(coefficients taken from Abramowitz & Stegun, 4.3.97 p.76), the maximum absolute error is around 2.1e-9 with only 5 terms (based on actual measurements)

[cpp]inline double EvalPoly5(double a0, double a1, double a2, double a3, double a4, double a5, double x) // Returns a5*x^5 + a4*x^4 + a3*x^3 + a2*x^2 + a1*x + a0 { return ((((a5*x+a4)*x+a3)*x+a2)*x+a1)*x+a0; } inline double SinApprox(double x) // 0 <= x <= Pi/2 { const double a0 = 1.0, a2 = -.1666666664, a4 = 0.0083333315, a6 = -.0001984090, a8 = .0000027526, a10 = -.0000000239; return EvalPoly5(a0,a2,a4,a6,a8,a10,x*x)*x; }[/cpp]
as you can see this kind ofcode is straightforward to vectorize and the polynomial evaluation is a perfect fit for FMA