Intel® ISA Extensions

Optimization of sine function's Taylor expansion

Bernard
Valued Contributor I
37,691 Views

Hello!
While coding in assembly various series expansions of many functions, I tried to optimize my code mainly by using the rcpps instruction instead of divps. When performing tests on code which calculates the sine function by its Taylor series, I did a few measurements as explained here: "http://software.intel.com/en-us/forums/showthread.php?t=52482", and the result was 10-12 cycles per first term of the sine expansion (I used 14 terms).
I would like to ask you how I can rewrite this code in order to improve its speed of execution.
[bash]movups xmm0, argument      ; xmm0 = x
movups xmm1, argument      ; xmm1 = x
mulps  xmm1, xmm1          ; xmm1 = x^2
mulps  xmm1, xmm0          ; xmm1 = x^3
mov    ebx, OFFSET coef
movups xmm2, [ebx]         ; load coefficient ( 3! for the first term )
rcpps  xmm3, xmm2          ; rcpps used instead of divps
mulps  xmm1, xmm3          ; xmm1 ~= x^3 / 3!
subps  xmm0, xmm1          ; xmm0 = x - x^3/3!
[/bash]
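For comparison, here is a minimal C++ intrinsics sketch (my addition, not the original poster's code): with the reciprocals of the factorials precomputed as constants, the polynomial can be evaluated in Horner form using only multiplies and subtracts, so neither divps nor the less accurate rcpps (roughly 12-bit precision) is needed.

[cpp]// A sketch of a 4-wide Taylor sine in Horner form ( assumes SSE and
// that the argument is already range-reduced to a small interval ).
#include <xmmintrin.h>
#include <cstdio>

static __m128 sin4_ps( __m128 x )
{
    const __m128 c3 = _mm_set1_ps( 1.0f / 6.0f );    // 1/3!
    const __m128 c5 = _mm_set1_ps( 1.0f / 120.0f );  // 1/5!
    const __m128 c7 = _mm_set1_ps( 1.0f / 5040.0f ); // 1/7!
    const __m128 x2 = _mm_mul_ps( x, x );
    // sin(x) ~= x * ( 1 - x^2*( 1/3! - x^2*( 1/5! - x^2*(1/7!) ) ) )
    __m128 p = _mm_sub_ps( c5, _mm_mul_ps( x2, c7 ) );
    p = _mm_sub_ps( c3, _mm_mul_ps( x2, p ) );
    p = _mm_sub_ps( _mm_set1_ps( 1.0f ), _mm_mul_ps( x2, p ) );
    return _mm_mul_ps( x, p );
}

int main()
{
    float in[4] = { 0.1f, 0.5f, 1.0f, 1.5f }, out[4];
    _mm_storeu_ps( out, sin4_ps( _mm_loadu_ps( in ) ) );
    for( int i = 0; i < 4; i++ )
        printf( "sin(%g) ~= %.7f\n", in[i], out[i] );
    return 0;
}[/cpp]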

0 Kudos
1 Solution
bronxzv
New Contributor II
37,500 Views

calls 1e6 times fastsin(): the result in milliseconds is 63

so it's 63 ns per iteration, or ~120 clocks on your CPU; it doesn't match your previous reports IIRC

if you keep only the polynomial (get rid of the strange domain check) you should begin to see timings nearer to mine

View solution in original post

0 Kudos
342 Replies
Bernard
Valued Contributor I
3,210 Views
I took a different approach with my sin function, definitely not a fast one.
I used a for loop with a compound statement calculating factorials, and Math.pow library calls to calculate powers of the argument. The results were summed in an array. As I said, very slow compared to your method.

For example, here is code for the Gamma Stirling approximation (formula and coefficients taken from Abramowitz and Stegun), written in Java.

[bash]public double GammaStirlingApprox(double x){
    double sum = 0;
    double result = 0;
    if(x > Double.MAX_VALUE){
        return Double.NaN;
    }else{
        double ln;
        double pow;
        double pi_sqrt;
        double two_pi = 2*Math.PI;
        // Denominators and numerators of the Stirling series terms
        double[] coef  = {12.0, 288.0, 51840.0, 2488320.0, 209018880.0,
                          75246796800.0, 902961561600.0, 86684309913600.0};
        double[] coef2 = {1.0, 1.0, 139.0, 571.0, 163879.0, 5246819.0,
                          534703531.0, 4483131259.0};
        double[] array  = new double[coef.length];
        double[] array2 = new double[coef.length];
        // array[i] = x^(i+1)
        for(int i = 0; i < array.length; i++){
            array[i] = Math.pow(x, i + 1);
        }
        // array2[j] = coef2[j] / ( coef[j] * x^(j+1) )
        for(int j = 0; j < array2.length; j++){
            array2[j] = array[j]*coef[j];
            array2[j] = coef2[j]/array2[j];
        }
        ln = Math.exp(-x);
        pow = Math.pow(x, x - 0.5);
        pi_sqrt = Math.sqrt(two_pi);
        sum = ln*pow*pi_sqrt;
        result = 1 + array2[0] + array2[1] - array2[2] - array2[3]
                   + array2[4] + array2[5] - array2[6] - array2[7];
    }
    return sum*result;
}[/bash]
0 Kudos
SergeyKostrov
Valued Contributor II
3,210 Views

Hi everybody, This is my consolidated response on posts from Iliyapolak:

>>...How did you make a running-time measurement?

In code it looks like:

[cpp]...
// Sub-Test 1.1 - CRT
{
    ///*
    CrtPrintf( RTU("Sub-Test 1.1\n") );
    g_uiTicksStart = ::GetTickCount();
    for( RTint t = 0; t < NUMBER_OF_TESTS; t++ )
    {
        dResult = CrtSin( dA );
    }
    CrtPrintf( RTU("Completed in %ld ticks\n"), ( RTint )( ::GetTickCount() - g_uiTicksStart ) );
    CrtPrintf( RTU("\tCRT Sin( % 6.1f ) = % 24.22f\n"), dA, dResult );
    //*/
}
...[/cpp]


>>...Did you use KeGetTickCount function, or maybe an inline rdtsc?

Please see above. I use the Win32 API function 'GetTickCount'.

>>...For interpolated sine what interpolation did you use? Linear interpolation...

Yes.

>>Lagrange interpolation?..

No, unfortunately, it is on hold for a long time.

>>...For Normalized Taylor Series did you use polynomial fit or maybe for loop and how you calculated powers...

In code it looks like:

[cpp]...
// Normalized Series ( up to 7th term )
// Sin(x) = x*( a - x^2*( b - x^2*( c - x^2*d )))
// a = 1.570796329000
// b = 0.645964096000
// c = 0.079692625990
// d = 0.004681754102
...
RTdouble CalcSinNS7( RTdouble dX )
{
    ...
    return ( RTdouble )( dX * ( 1.570796329L -
                          dX * dX * ( 0.645964096000L -
                          dX * dX * ( 0.0796926259900L -
                          dX * dX * ( 0.004681754102L ) ) ) ) * iSign );
}
...[/cpp]
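For context, a note of mine: the coefficients above equal (PI/2)^(2k+1) / (2k+1)!, which is consistent with the input being the angle divided by 90 degrees, i.e. the sine of d degrees would be CalcSinNS7( d / 90.0 ). A minimal standalone sketch under that assumption (my code, not Sergey's library):

[cpp]#include <cstdio>
#include <cmath>

// Normalized 7-term series; x = degrees / 90.0, for x in [0, 1].
double SinNS7( double x )
{
    return x * ( 1.570796329 -
           x * x * ( 0.645964096000 -
           x * x * ( 0.0796926259900 -
           x * x * 0.004681754102 ) ) );
}

int main()
{
    const double PI = 3.14159265358979323846;
    printf( "NS7 Sin(30) = %.16f\n", SinNS7( 30.0 / 90.0 ) );    // ~0.5
    printf( "CRT sin(30) = %.16f\n", sin( 30.0 * PI / 180.0 ) ); // 0.5
    return 0;
}[/cpp]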


>>...Were Chebyshev polynomial's coefficients pre-calculated...

Yes.

>>...did your measurement take into account calculation of Chebyshev coefficients...

No, because all Chebyshev coefficients are pre-calculated and saved in constants.

>>...( many calls to library cos)...

No, a different method ( Telescoping a Series ) was used.

>>...CRT sin implementation is more than two times slower than your library Normalized Taylor series sin
>>I think that extensive range reduction, input checking and overflow checking are responsible for that...

Yes, that's correct. A special set of macros to calculate sine, cosine, etc., is even faster, because all the
overheads ( like making a call, setting up the stack, clean-up, return ) are minimized.

>>...I suppose that Sergey did some averaging of error's values. He wrote that he ran his tests 2^22 times,
>>but he did not say how many values of 2^22 tests were averaged...

No. For example, for all calls like 'CalcNTS9(45)' the Absolute Error is the same.

>>...I used for loop with compound statement calculating factorials and Math.pow library calls to calculate
>>powers of an argument. The results were summed in array. As I said very slow compared to your method...

I agree that it affects the performance of your calculations. As another example, this is how a macro declaration
looks:

[cpp]...
RTdouble g_dNF3 = 1.0L / 6.0L;      // 3! = 6
RTdouble g_dNF5 = 1.0L / 120.0L;    // 5! = 120
RTdouble g_dNF7 = 1.0L / 5040.0L;   // 7! = 5040
RTdouble g_dNF9 = 1.0L / 362880.0L; // 9! = 362880
...
#define CrtSinNTS9( A ) ( A - A * A * A * ( g_dNF3 - A * A * ( g_dNF5 - A * A * ( g_dNF7 - g_dNF9 * A * A ) ) ) )
...[/cpp]
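A usage sketch of that macro (my addition, repeating the declarations above; note the argument is in radians here, and, being a macro, A is evaluated many times, so it must not have side effects):

[cpp]#include <cstdio>
#include <cmath>

typedef double RTdouble;

RTdouble g_dNF3 = 1.0 / 6.0;      // 3! = 6
RTdouble g_dNF5 = 1.0 / 120.0;    // 5! = 120
RTdouble g_dNF7 = 1.0 / 5040.0;   // 7! = 5040
RTdouble g_dNF9 = 1.0 / 362880.0; // 9! = 362880

#define CrtSinNTS9( A ) ( A - A * A * A * ( g_dNF3 - A * A * ( g_dNF5 - A * A * ( g_dNF7 - g_dNF9 * A * A ) ) ) )

int main()
{
    RTdouble dA = 0.52359877559829887; // 30 degrees in radians
    printf( "NTS9 Sin = %.17f\n", CrtSinNTS9( dA ) ); // ~0.5
    printf( "CRT  sin = %.17f\n", sin( dA ) );        // 0.5
    return 0;
}[/cpp]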
0 Kudos
SergeyKostrov
Valued Contributor II
3,210 Views
Hi everybody, This is my consolidated response on a post from Bronxzv:

>>...We can see that the error vary wildly within a range of values, that's why when comparing
>>algorithms the standard methods are to compare the maximum absolute error or the average
>>absolute error over the domain...

It could be a set of different statistical methods, like:

- Min error
- Max error
- Average error
- Standard Deviation of all errors

and it depends on how important that analysis is for a given project. There is no question that for mission-critical
applications it must be done.
0 Kudos
SergeyKostrov
Valued Contributor II
3,210 Views
Hi everybody, This is my consolidated response on posts from TimP:

>>...I can't get much information from what you just said. I don't know whether you mean hardware or
>>software default, in the case of x87 FPU...

I control precision with the CRT function '_control87', because '_control87' changes the control words
for both the x87 and the SSE2 units. In code it looks like:

[cpp]...
// uiControlWordx87 = CrtControl87( _RTFPU_PC_24, _RTFPU_MCW_PC );
uiControlWordx87 = CrtControl87( _RTFPU_PC_53, _RTFPU_MCW_PC );
// uiControlWordx87 = CrtControl87( _RTFPU_PC_64, _RTFPU_MCW_PC );
// uiControlWordx87 = CrtControl87( _RTFPU_CW_DEFAULT, _RTFPU_MCW_PC );
...[/cpp]
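For readers without the RT wrappers, the equivalent calls with the standard '<float.h>' names would look as follows (a sketch of mine; it assumes a 32-bit MSVC build, since '_MCW_PC' precision control is not supported on x64):

[cpp]#include <float.h>
#include <stdio.h>

int main( void )
{
    unsigned int uiControlWordx87;
    // uiControlWordx87 = _control87( _PC_24, _MCW_PC );      // 24-bit
    uiControlWordx87 = _control87( _PC_53, _MCW_PC );         // 53-bit
    // uiControlWordx87 = _control87( _PC_64, _MCW_PC );      // 64-bit
    // uiControlWordx87 = _control87( _CW_DEFAULT, _MCW_PC ); // defaults
    printf( "Control word bits: 0x%08X\n", uiControlWordx87 );
    return 0;
}[/cpp]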


>>...MinGW compilers rely on Microsoft libraries...

Please take a look at a case with some inconsistency... I missed it about 1.5 years ago during intensive testing and
detected it only recently.

I have a couple of tangent methods in a template class, and here are the results when a 'float' data type
is used:

[cpp]MSC ( VS2005 SP1 / Windows CE / MSVCR80.DLL for ARMV4 CPU used ) - Microsoft C++ compiler

CRT            CrtTan(45)       = 1.0000000000000000000000
Normalized TS  msF.TanNTS9(45)  = 1.0000000000000000000000
Normalized TS  msF.TanNTS11(45) = 1.0000000000000000000000

MSC ( VS2005 SP1 / Windows XP / MSVCR80.DLL used ) - Microsoft C++ compiler

CRT            CrtTan(45)       = 1.0000000000000000000000
Normalized TS  msF.TanNTS9(45)  = 0.9999999403953552200000 <- Incorrect!
Normalized TS  msF.TanNTS11(45) = 1.0000000000000000000000

MGW ( v3.4.2 / Windows XP / Uses MSVCRT.DLL ) - MinGW C++ compiler

CRT            CrtTan(45)       = 1.0000000000000000000000
Normalized TS  msF.TanNTS9(45)  = 1.0000000000000000000000
Normalized TS  msF.TanNTS11(45) = 1.0000000000000000000000

BCC ( v5.5.1 / Windows XP / Doesn't Use Any Microsoft DLLs ) - Borland C++ compiler

CRT            CrtTan(45)       = 1.0000000000000000000000
Normalized TS  msF.TanNTS9(45)  = 1.0000000000000000000000
Normalized TS  msF.TanNTS11(45) = 1.0000000000000000000000

TCC ( v3.0.0 / MS-DOS / Doesn't Use Any Microsoft DLLs ) - Turbo C++ compiler

CRT            CrtTan(45)       = 1.0000000000000000000000
Normalized TS  msF.TanNTS9(45)  = 1.0000000000000000000000
Normalized TS  msF.TanNTS11(45) = 1.0000000000000000000000[/cpp]


Summary is as follows:

A Win32 console application that uses MSVCR80.DLL calculated Tan(45) using Normalized Taylor Series ( up to 9th term ) method:

msF.TanNTS9(45) = 0.9999999403953552200000

Compiled by Microsoft C++ compiler on Windows XP Desktop 32-bit platform.

A Win32 console application that uses MSVCRT.DLL calculated Tan(45) using Normalized Taylor Series ( up to 9th term ) method:

msF.TanNTS9(45) = 1.0000000000000000000000

Compiled by MinGW C++ compiler on Windows XP Desktop 32-bit platform.

At the moment I can't explain why the result of Tan(45) is not equal to '1.0'. I suppose this is due to some differences in
the CRT function '_control87' between these two DLLs, but I could be wrong.

0 Kudos
Bernard
Valued Contributor I
3,210 Views
Sergey!
Does your library also contain various implementations of special functions?
For example, here is my code which calculates the Bessel function of the first kind, J0(x). The polynomial approximation is taken from Abramowitz and Stegun, p. 369, 9.4.1.
The first part of the code contains for loops with array multiplications, but the second part of the code is based on polynomial multiplications. I would like to ask you how to accurately measure the running time of these two parts of the code.

[bash]public double BesselJ0(double x){
    double result = 0;
    if(x >= -3.0 && x <= 3.0){
        // Abramowitz and Stegun 9.4.1, |x| <= 3, |error| < 5e-8
        double third = x/3.0;
        double sq = third*third;
        double[] coef = {-2.2499997, 1.2656208, -0.3163866,
                          0.0444479, -0.0039444, 0.0002100};
        double[] array = new double[coef.length];
        double sum = 0;
        for(int i = 0; i < array.length; i++){
            array[i] = coef[i]*Math.pow(sq, i + 1); // coef[i]*(x/3)^(2(i+1))
        }
        for(int j = 0; j < array.length; j++){
            sum += array[j];
        }
        result = 1.0 + sum;
    }else if(x > 3.0 && x < Double.MAX_VALUE){
        // Abramowitz and Stegun 9.4.3, x >= 3:
        // J0(x) = f0 * cos(theta0) / sqrt(x)
        double three = 3.0/x;
        double[] coef2 = {-0.00000077, -0.00552740, -0.00009512,
                           0.00137237, -0.00072805,  0.00014476};
        double[] coef3 = {-0.04166397, -0.00003954,  0.00262573,
                          -0.00054125, -0.00029333,  0.00013558};
        // Horner evaluation of the two degree-6 polynomials in (3/x)
        double c = coef2[0] + three*(coef2[1] + three*(coef2[2]
                 + three*(coef2[3] + three*(coef2[4] + three*coef2[5]))));
        double d = coef3[0] + three*(coef3[1] + three*(coef3[2]
                 + three*(coef3[3] + three*(coef3[4] + three*coef3[5]))));
        double f = 0.79788456 + three*c;
        double theta = x - 0.78539816 + three*d;
        result = f*Math.cos(theta)/Math.sqrt(x);
    }
    return result;
}[/bash]


0 Kudos
TimP
Honored Contributor III
3,210 Views
MSVC and ICL compiled Win32 applications include the equivalent of a _control87 function call to set 53-bit precision mode. You can reset it with your own _control87 call. MinGW-compiled Win32 (but not x64) applications would be expected to start up in 64-bit precision mode. Any result formatted by printf will have been rounded to double, so digits beyond 17 would be meaningless if they had not been set to zero.
Your first quoted result looks as if it had been calculated in float (24-bit precision). There is a _control87 setting which does that.
0 Kudos
SergeyKostrov
Valued Contributor II
3,210 Views
Quoting iliyapolak
...Does your library also contain various implementations of special functions?...

No.

...I would like to ask you how to accurately measure the running time of these two parts of code?

You could consider two versions; here is the pseudo code for both:

Version 1 - Uses Win32 API function GetTickCount

...
Declare Start, End, Diff1, Diff2 variables ( int32 )

Function()
{
Start = GetTickCount()
//Codes for Part1...
End = GetTickCount()
Diff1 = End - Start

Start = GetTickCount()
//Codes for Part2...
End = GetTickCount()
Diff2 = End - Start

return
}
...

Version 2 - Uses RDTSC instruction ( more accurate measurements )

...
Declare Start, End, Diff1, Diff2 variables ( int64 )

Function()
{
Start = RDTSC
//Codes for Part1...
End = RDTSC
Diff1 = End - Start

Start = RDTSC
//Codes for Part2...
End = RDTSC
Diff2 = End - Start

return
}
...
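A concrete C sketch of Version 2 (my addition; it assumes the '__rdtsc' intrinsic, which is declared in '<intrin.h>' on MSVC and '<x86intrin.h>' on GCC/MinGW):

[cpp]#include <stdio.h>
#include <math.h>
#ifdef _MSC_VER
    #include <intrin.h>
#else
    #include <x86intrin.h>
#endif

#define NUMBER_OF_TESTS 1000000

int main( void )
{
    volatile double dA = 0.5; // volatile forces a real call every iteration
    volatile double dResult;
    unsigned long long uiStart, uiEnd;

    uiStart = __rdtsc();
    for( int t = 0; t < NUMBER_OF_TESTS; t++ )
    {
        dResult = sin( dA );
    }
    uiEnd = __rdtsc();

    // Divide by the iteration count for a per-call estimate; single-call
    // RDTSC deltas are unreliable because of out-of-order execution.
    printf( "~%.1f clocks per call\n",
            ( double )( uiEnd - uiStart ) / NUMBER_OF_TESTS );
    return 0;
}[/cpp]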
0 Kudos
Bernard
Valued Contributor I
3,210 Views


Sergey!
I cannot directly use those functions you mentioned, because my code is written in Java, but the time-difference calculation can be performed exactly as you wrote. AFAIK the Java system library contains a high-precision time measurement method (System.nanoTime) which, I think, is translated by jvm.dll directly to RDTSC, but the overhead of such a translation could be significant only if the JVM JIT compiler does not cache the translation (does not recognize it as hot-spot code).

0 Kudos
SergeyKostrov
Valued Contributor II
3,210 Views
Quoting iliyapolak
...such a behaviour: when an argument was closer to Pi/2 the error was larger when compared to smaller angles closer to 0.
I suppose that this is Taylor-series-induced inaccuracy...


I would also mention these two problems:

- Limitations of the IEEE 754 standard
- These series and polynomials only approximate a curve of some function

Consider an equation for a Normalized Series up to the 9th term that calculates sine:

Sin(x) ~= x*( a - x^2*( b - x^2*( c - x^2*( d - x^2*e )))) ( S )

All coefficients are related to PI, that is, a = PI/2, b = ((PI/2)^3)/3!, and so on.

In the equation S two sets of coefficients could be used. For example:

Set of Coefficients 1 - less accurate ( SC1 ):

a = 1.5707963290000
b = 0.6459640960000
c = 0.0796926259900
d = 0.0046817541020
e = 0.0001604411842

Set of Coefficients 2 - very accurate ( SC2 ):

a = 1.57079632679489661923132169163980000
b = 0.64596409750624625365575656389795000
c = 0.07969262624616704512050554949047800
d = 0.00468175413531868810068546393395340
e = 0.00016044118478735982187266087016347

Then, for a double-precision data type and 53-bit FPU precision, the sine of 30 degrees is:

Normalized Series up to 9th term with SC1 Sin(30) = 0.5000000008100623500000 ( V1 )
Normalized Series up to 9th term with SC2 Sin(30) = 0.5000000000202800000000 ( V2 )

The true value for the sine of 30 degrees is 0.5:

AE1 = V1 - 0.5 = 0.00000000081006235
AE2 = V2 - 0.5 = 0.00000000002028000

You can see that the Absolute Error AE2 is smaller than AE1 by a factor of ~40:

AE1 / AE2 = 0.00000000081006235 / 0.00000000002028000 = 39.94

A Normalized Series up to the 11th term with the following Set of Coefficients ( SC3 )

a = 1.57079632679489661923132169163980000
b = 0.64596409750624625365575656389795000
c = 0.07969262624616704512050554949047800
d = 0.00468175413531868810068546393395340
e = 0.00016044118478735982187266087016347
f = 0.00000359253000000000000000000000000

does the trick:

Sin(30) = 0.5000000000000000000000

and I verified it with all C/C++ compilers that I use.
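Here is a small self-check that reproduces those numbers (my sketch, not Sergey's library code; it assumes the same normalization of the argument by 90 degrees):

[cpp]#include <cstdio>
#include <cmath>

// Normalized 9-term series with a caller-supplied coefficient set.
static double SinNS9( double x, const double c[5] )
{
    const double x2 = x * x;
    return x * ( c[0] - x2 * ( c[1] - x2 * ( c[2] - x2 * ( c[3] - x2 * c[4] ) ) ) );
}

int main()
{
    const double dSC1[5] = { 1.5707963290000, 0.6459640960000, 0.0796926259900,
                             0.0046817541020, 0.0001604411842 };
    const double dSC2[5] = { 1.57079632679489661923132169163980000,
                             0.64596409750624625365575656389795000,
                             0.07969262624616704512050554949047800,
                             0.00468175413531868810068546393395340,
                             0.00016044118478735982187266087016347 };
    const double dX = 30.0 / 90.0;
    const double dV1 = SinNS9( dX, dSC1 );
    const double dV2 = SinNS9( dX, dSC2 );
    printf( "SC1: Sin(30) = %.22f  AE1 = %.17f\n", dV1, fabs( dV1 - 0.5 ) );
    printf( "SC2: Sin(30) = %.22f  AE2 = %.17f\n", dV2, fabs( dV2 - 0.5 ) );
    return 0;
}[/cpp]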

Best regards,
Sergey

0 Kudos
SergeyKostrov
Valued Contributor II
3,210 Views
Here is a screenshot ( results for the Microsoft C++ compiler ):


0 Kudos
Bernard
Valued Contributor I
3,210 Views
We have to deal with curve-approximation inaccuracies and with the finite precision of the numbers that represent and approximate the curve. As you showed in your last post, when more terms are added to the normalized series the result gets closer to the real value, but sometimes traces of rounding and of the inexact representation of various numbers remain.

When writing my library I did some analysis of methods taken from Abramowitz and Stegun and got mixed results, because the limit of convergence was different when a formula was evaluated programmatically and compared to the exact mathematical statement. Many times I was stuck with widely varying divergence: when my implementation ( as exact and faithful to the book's formulas as possible ) was evaluated, the software simply failed to produce the exact value. I blamed the Java compiler's implementation of the IEEE 754 standard and the hardware's inexact representation of real numbers more than the formula's accuracy ( which is mathematically exact, as the book states ).

For example, from a mathematical point of view the Taylor series of sine converges with higher precision as more terms are added, but try to approximate it programmatically with x > 3 and it will fail, partly because of the catastrophic cancellation induced by the alternating-sign series.
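The cancellation for larger arguments is easy to demonstrate (a sketch of mine, not Iliya's code): summing the alternating Taylor series of sin(x) in 'float' for x = 10 loses most significant digits, because individual terms grow to about 2756 in magnitude before cancelling down to a result smaller than 1.

[cpp]#include <stdio.h>
#include <math.h>

int main( void )
{
    float x = 10.0f, term = x, sum = x;
    for( int n = 1; n <= 20; n++ )
    {
        // Next Taylor term: t(n+1) = t(n) * -x^2 / ( 2n * (2n+1) )
        term *= -x * x / ( ( 2 * n ) * ( 2 * n + 1 ) );
        sum  += term;
    }
    printf( "float Taylor sum : %.7f\n", sum );          // only a few correct digits
    printf( "CRT   sin(10)    : %.7f\n", sin( 10.0 ) );  // -0.5440211
    return 0;
}[/cpp]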
0 Kudos
bronxzv
New Contributor II
3,210 Views

It could be a set of different statistical methods like:

- Min error
- Max error
- Average error
- Standard Deviation of all errors


Sure, but I don't really see a real-world use for the min error: for example, a very rough LUT-based solution may have several exact results (up to the LSB) over the domain (i.e. min error = 0.0), while a good minimax polynomial has not a single exact result (i.e. min error > 0.0). I'll be interested to see your error analysis here: http://software.intel.com/en-us/forums/showpost.php?p=185933
displayed as max AE and/or average AE over [0.0; 90.0] deg, if it's a simple option of your software
0 Kudos
SergeyKostrov
Valued Contributor II
3,210 Views
Quoting iliyapolak
...When writing my library I did some analysis of methods taken from Abramowitz and Stegun and got mixed results
the formula when implemented in software failed to produce the exact value. I blamed the Java compiler implementation of the IEEE 754 standard
more than formula accuracy...

I recommend running a set of very simple tests in order to verify the quality of the implementation of the IEEE 754 Standard in Java, like:

[cpp]...
Verification: ( 0.1f*0.1f ) for 'float':

24-bit  : [ 0.1 * 0.1 = 0.01000000070780516 ]
Default : [ 0.1 * 0.1 = 0.01000000029802323 ]

Verification: ( 0.1L*0.1L ) for 'double':

24-bit  : [ 0.1 * 0.1 = 0.00999999977648258 ]
53-bit  : [ 0.1 * 0.1 = 0.01000000000000000 ]
64-bit  : [ 0.1 * 0.1 = 0.01000000000000000 ]
Default : [ 0.1 * 0.1 = 0.01000000000000000 ]
...[/cpp]
Can you reproduce these numbers?
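For a C/C++ side-by-side check, a sketch of mine ('volatile' blocks compile-time constant folding, so the multiplication really happens at run time under the precision set with '_control87'; a 32-bit MSVC build is assumed):

[cpp]#include <float.h>
#include <stdio.h>

int main( void )
{
    volatile float  fX = 0.1f, fY = 0.1f;
    volatile double dX = 0.1,  dY = 0.1;

    _control87( _PC_24, _MCW_PC );
    printf( "24-bit : [ 0.1 * 0.1 = %.17f ]\n", ( double )( fX * fY ) );

    _control87( _PC_53, _MCW_PC );
    printf( "53-bit : [ 0.1 * 0.1 = %.17f ]\n", dX * dY );

    _control87( _CW_DEFAULT, _MCW_PC ); // restore defaults
    return 0;
}[/cpp]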


0 Kudos
SergeyKostrov
Valued Contributor II
3,210 Views
Quoting iliyapolak
...implementation of the IEEE 754 standard...


Discussions about consistency of floating-point results will never stop. Please take a look at a very good article regarding
that subject:

Consistency of Floating-Point Results using the Intel Compiler
by Dr. Martyn J. Corden & David Kreitzer

Best regards,
Sergey

PS: I have very consistent results across different platforms and C/C++ compilers.

0 Kudos
Bernard
Valued Contributor I
3,210 Views

Can you reproduce these numbers?


Here are my results for the multiplication of x = 0.1f and y = 0.1f, both declared as float values:

x = 0.10000000149011612000000000000000
y = 0.10000000149011612000000000000000

product = 0.01000000070780515700000000000000

Here are the results for the multiplication of the Java double values x1 = 0.1d and y1 = 0.1d:

x1 = 0.10000000000000000000000000000000
y1 = 0.10000000000000000000000000000000

product1 = 0.01000000000000000200000000000000

Sergey, how can I use 53-bit precision if Java only supports the standard IEEE 754 float and double values?

Please read this PDF written by W. Kahan about the awful performance of the Java floating-point implementation:
www.cs.berkeley.edu/~wkahan/JAVAhurt.pdf

0 Kudos
TimP
Honored Contributor III
3,210 Views
IEEE 64-bit double has 53-bit precision by definition. Decimal conversion functions can conform to IEEE 754 while zeroing out digits beyond the significant 17; maybe yours is rounding earlier so as to hide the differences. Also, the strictest standards of IEEE 754 don't apply to values < 0.1.
0 Kudos
Bernard
Valued Contributor I
3,210 Views
In order to fully analyze the results of floating-point calculations, which can often require thousands or millions of operations, one needs to know how the calculations were affected by round-off error, its propagation, and its accumulation. For millions of iterative operations it is nearly impossible to predict what the error will be in some arbitrary step out of the millions needed to perform the calculation.
0 Kudos
SergeyKostrov
Valued Contributor II
3,209 Views
Hi Iliya,

I see that your result ( the product of 0.1*0.1 / single-precision ) is consistent with my result in Post #56:

[cpp]Java           : 0.01000000070780515700000000000000
Java (rounded) : 0.01000000070780516
C/C++          : 0.01000000070780516[/cpp]
and the same applies to the double-precision data type.

I could assume that the Java VM initialized the FPU to 24-bit precision in the 1st case, and to 53-bit ( default ) precision
in the 2nd case.

So, in both cases the results are consistent.

My concern is as follows: in the 1st case I would expect initialization of the FPU to the Default settings instead of 24-bit.

Do you need another test ( a matrix multiplication ) to verify Java's floating-point support?

>>...How can I use 53-bit precision if Java only supports standard IEEE 754 float and double values...

In C/C++ applications a developer can control precision with the functions '_control87' or
'_controlfp'. Microsoft added several extensions, like '_control87_2' or '_controlfp_s', but
these two functions are Microsoft specific.

In the Java API you should look for some method similar to the '_control87' function. If Java
doesn't have a method that controls the settings of the FPU, I would consider a Java Native Interface ( JNI ) DLL
that implements a wrapper around the '_control87' function.

Note: In the case of using inline assembler with SSE2 instructions in a JNI DLL, precision could
be controlled with a set of macros, like '_MM_SET_ROUNDING_MODE', or
intrinsic functions, like '_mm_setcsr'. Take a look at the attached 'xmmintrin.h' header file.
Remember that the '_control87' function controls both units ( x87 and SSE2 ).
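For reference, a minimal sketch of saving, setting, and restoring the MXCSR rounding mode with those intrinsics (my addition; note this controls rounding, not precision, as the next reply points out):

[cpp]#include <xmmintrin.h>
#include <stdio.h>

int main( void )
{
    unsigned int uiCsr = _mm_getcsr(); // save the current MXCSR
    printf( "MXCSR = 0x%08X\n", uiCsr );

    _MM_SET_ROUNDING_MODE( _MM_ROUND_TOWARD_ZERO );
    // ... SSE computations with truncation rounding ...

    _mm_setcsr( uiCsr );               // restore MXCSR
    return 0;
}[/cpp]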

Best regards,
Sergey

0 Kudos
SergeyKostrov
Valued Contributor II
3,217 Views
Here are reference data for the sine function to verify Java's floating-point support.

Two files are enclosed; this is a short example:

[cpp]Microsoft C++ compiler
Normalized Series up to 11 term sine vs. CRT sine ( 53-bit precision )

                    NS11t                CRT                    Absolute Error
...
Sin( 29.0 ) = 0.4848096202463386 = 0.4848096202463371 - AE =  0.0000000000000016
Sin( 30.0 ) = 0.5000000000000000 = 0.4999999999999999 - AE =  0.0000000000000001
Sin( 31.0 ) = 0.5150380749100507 = 0.5150380749100542 - AE = -0.0000000000000034
...

Borland C++ compiler
Normalized Series up to 11 term sine vs. CRT sine ( 53-bit precision )
...
Sin( 29.0 ) = 0.4848096202463387 = 0.4848096202463371 - AE =  0.0000000000000016
Sin( 30.0 ) = 0.5000000000000000 = 0.4999999999999999 - AE =  0.0000000000000001
Sin( 31.0 ) = 0.5150380749100507 = 0.5150380749100542 - AE = -0.0000000000000034
...[/cpp]
0 Kudos
Bernard
Valued Contributor I
3,217 Views

Sergey!
I think the FPU initialization is performed by the JVM automatically ( maybe hardcoded and done at jvm.dll startup, or when compiling FP code; I need to disassemble jvm.dll and look for instructions which access the MXCSR register ), and the programmer is not given the option to manipulate the settings as is done by the Microsoft _control87 function.
Soon I will switch from Java to C with inline assembly programming for my math library.

-I would consider a Java Native Interface ( JNI ) DLL-
The better option is to write it in the C language.

-Do you need another test ( a matrix multiplication ) to verify Java's floating point support-
Yes.

-My concern is as follows: in the 1st case I would expect initialization of the FPU to Default settings instead of 24-bit-
Because the results obtained from the C/C++ and Java calculations are the same when rounded, I suppose that the CPU's rounding algorithm ( both of us have Intel CPUs ) is responsible for this.

0 Kudos
bronxzv
New Contributor II
3,217 Views

...SSE2 instructions in a JNI DLL a precision could be controlled...
be controlled


note that there isn't the notion of precision control with SSEx/AVXx anymore: all SS/PS instructions work at 24-bit precision and SD/PD instructions at 53-bit precision; MXCSR has 2 rounding-mode bits but no FP precision control (only the precision exception flag and mask), unlike the x87 control word
0 Kudos