Intel® ISA Extensions

Optimization of sine function's taylor expansion

Bernard
Valued Contributor I
14,578 Views

Hello!
While coding various series expansions of many functions in assembly, I tried to optimize my code mainly by using the rcpps instruction instead of divps. When testing code which calculates the sine function by its Taylor series, I did a few measurements as explained here: "http://software.intel.com/en-us/forums/showthread.php?t=52482", and the result was between 10 and 12 cycles per first term of the sine expansion (I used 14 terms).
I would like to ask how I can rewrite this code in order to improve its speed of execution.
[bash]movups xmm0, argument      ; x
movups xmm1, argument      ; x
mulps  xmm1, xmm1          ; x^2
mulps  xmm1, xmm0          ; x^3
mov    ebx, OFFSET coef
movups xmm2, [ebx]         ; load coefficient for the first term
rcpps  xmm3, xmm2          ; rcpps used instead of divps
mulps  xmm1, xmm3          ; x^3 * (1/coef)
subps  xmm0, xmm1          ; x - x^3/coef (first correction term)
[/bash]
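For reference, here is a hedged C sketch (mine, not the poster's code) of what the snippet above computes, assuming coef holds 3!: if the coefficient is stored as a precomputed reciprocal, neither divps nor rcpps is needed at all.

[bash]
#include <stdio.h>

/* Illustrative only: one Taylor correction term, x - x^3/3!,
   with the reciprocal 1/3! precomputed so no division is needed. */
static const float inv_fact3 = 1.0f / 6.0f;   /* 1/3! */

static float first_term(float x)
{
    float x3 = x * x * x;          /* x^3 */
    return x - x3 * inv_fact3;     /* x - x^3/3! */
}

int main(void)
{
    printf("%f\n", first_term(0.5f));  /* ~0.4792 vs sin(0.5) = 0.4794 */
    return 0;
}
[/bash]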

0 Kudos
1 Solution
bronxzv
New Contributor II
14,387 Views

calling fastsin() 1e6 times, the result in milliseconds is 63

so it's 63 ns per iteration or ~120 clocks on your CPU; it doesn't match your previous reports IIRC

If you keep only the polynomial (get rid of the strange domain check) you should begin to see timings nearer to mine.
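To illustrate the suggestion (this is a sketch, not the fastsin() from this thread): a polynomial-only sine for |x| <= Pi/2, with precomputed reciprocal factorials and Horner evaluation in x^2, so the timed work is just multiplies and adds.

[bash]
#include <math.h>
#include <stdio.h>

/* Minimal sketch: Taylor polynomial for sin(x) up to x^13, valid for |x| <= Pi/2.
   Coefficients are precomputed reciprocals of the odd factorials. */
static double poly_sin(double x)
{
    const double c3  = -1.0 / 6.0;
    const double c5  =  1.0 / 120.0;
    const double c7  = -1.0 / 5040.0;
    const double c9  =  1.0 / 362880.0;
    const double c11 = -1.0 / 39916800.0;
    const double c13 =  1.0 / 6227020800.0;
    double x2 = x * x;
    /* Horner's scheme in x^2, one final multiply by x */
    double p = c13;
    p = p * x2 + c11;
    p = p * x2 + c9;
    p = p * x2 + c7;
    p = p * x2 + c5;
    p = p * x2 + c3;
    p = p * x2 + 1.0;
    return x * p;
}

int main(void)
{
    printf("%.15f %.15f\n", poly_sin(0.5), sin(0.5));
    return 0;
}
[/bash]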


0 Kudos
342 Replies
Bernard
Valued Contributor I
774 Views
Here are the results for the fastexp() function, which is almost the same as fastsin():
1e6 iterations, called with a loop counter.

start value of fastexp() is: 20642489
end value of fastexp() is: 20642520
delta of fastexp() is: 31 millisec
0 Kudos
bronxzv
New Contributor II
774 Views
So after all it looks like your compiler unrolls it 10x (see mov eax, 1000000 in the ASM). Since you sum the results of the polynomial evaluation and your argument series is perfectly regular and known at compile time, it's actually "optimized" away and that defeats your measurements => one more time, I'd advise using my (now one week old!) example instead: it was carefully designed to avoid such compiler issues.
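For what it's worth, here is a minimal harness sketch that avoids this pitfall (the GetTickCount() timer and the placeholder function are assumptions, not code from this thread): the accumulator is volatile and the starting argument depends on a runtime value, so the calls cannot be constant-folded or summed away at compile time.

[bash]
#include <stdio.h>
#include <windows.h>   /* GetTickCount(); assumed timer, the thread never says which one was used */

/* placeholder standing in for the fastexp()/fastsin() under test */
static double fastexp(double x)
{
    return 1.0 + x*(1.0 + x*(1.0/2 + x*(1.0/6 + x*(1.0/24))));
}

int main(int argc, char **argv)
{
    volatile double sink = 0.0;      /* volatile sink: the results must actually be produced */
    double arg = (double)argc;       /* runtime-dependent start value, unknown at compile time */
    int i;
    DWORD start, delta;
    (void)argv;
    start = GetTickCount();
    for (i = 0; i < 1000000; ++i) {
        sink += fastexp(arg);
        arg  += 0.000001;
    }
    delta = GetTickCount() - start;
    printf("delta: %lu millisec, sink = %f\n", (unsigned long)delta, (double)sink);
    return 0;
}
[/bash]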
0 Kudos
Bernard
Valued Contributor I
774 Views
I will give your method a try and use it as a model to defeat compiler unrolling and inlining. But my other methods, which are also implemented as polynomial approximations, were not inlined and unrolled. I will post the asm code of those methods. Doesn't it seem strange that the compiler used such an optimization only in the case of fastsin()? Bronxzv, thanks for your help; I learned many interesting things by reading your answers.
0 Kudos
SergeyKostrov
Valued Contributor II
774 Views
Hi everybody,

Iliya, I just finished reading all the latest posts from Page #9 and I'll try to do some testing with your 'FastSin' function
implemented in C. So, in which thread do you want to continue the discussion? At the moment you already have two threads...

Best regards,
Sergey
0 Kudos
Bernard
Valued Contributor I
774 Views
Hi Sergey
Glad to see you back. :)
I prefer to continue the discussion in this thread because it is already well known and the most viewed.
I'm uploading a very interesting document named "K5 Transcendental Functions".
This PDF describes the testing and hardware implementation (in the AMD K5 CPU) of transcendental functions.
From this document you can clearly see that the engineers used Mathematica as a reference model to test their fixed-precision implementation against. Also you can learn that a Taylor expansion with pre-calculated coefficients was used.
0 Kudos
bronxzv
New Contributor II
774 Views


Also you can learn that a Taylor expansion with pre-calculated coefficients was used

actually they say the coefficients were generated on the fly (page 165), hence the choice of Taylor series, which is easy and fast

AFAIK modern methods use precomputed coefficients with minimax polynomials computed offline; for a software implementation the memory used by the coefficients is a non-issue anyway

also note that the range reduction algorithm they report isn't easily amenable to a software implementation since it works only with multiple-precision arithmetic; see Elementary Functions: Algorithms and Implementation, 2nd edition (Jean-Michel Muller, Birkhäuser), pages 173-191 for other range reduction methods such as Cody and Waite or Payne and Hanek
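As a rough illustration of the Cody-and-Waite idea (a minimal sketch under simplifying assumptions, not the algorithm from the book): pi/2 is split into a high part and a low correction so that x - k*pi/2 can be formed without catastrophic cancellation, at least for arguments that are not huge.

[bash]
#include <math.h>
#include <stdio.h>

/* Sketch of Cody-and-Waite style range reduction for sin(x), x not huge.
   Constants are the usual split of pi/2 into leading bits plus a correction;
   a production version would carry more pieces or use Payne-Hanek for huge x. */
static const double PIO2_HI = 1.5707963267341256e+00;   /* leading bits of pi/2 */
static const double PIO2_LO = 6.0771005065061922e-11;   /* pi/2 - PIO2_HI */

static double reduce_pio2(double x, int *quadrant)
{
    double k = floor(x / (PIO2_HI + PIO2_LO) + 0.5);     /* nearest multiple of pi/2 */
    *quadrant = (int)k & 3;
    /* subtract k*pi/2 in two steps; k*PIO2_HI is (nearly) exact for moderate k */
    return (x - k * PIO2_HI) - k * PIO2_LO;
}

int main(void)
{
    int q;
    double r = reduce_pio2(100.0, &q);   /* 100 rad -> small residue + quadrant */
    printf("r = %.17f, quadrant = %d\n", r, q);
    printf("sin(r) = %.17f, sin(100) = %.17f\n", sin(r), sin(100.0));  /* quadrant 0: sin(x) = sin(r) */
    return 0;
}
[/bash]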
0 Kudos
Bernard
Valued Contributor I
774 Views
Hi everybody!

Another set of tests, this time for the Stirling approximation of the gamma() function.
There are two implementations.
One of them is based on 1-D arrays, library pow() calls and division inside a for-loop
to calculate the coefficients and to sum the series expansion. Both functions use 14 terms of the Stirling expansion.
The second method is faster because filling the coefficient arrays and the summation loop are eliminated,
and pow() library calls are not used either. Instead the function relies on a simple calculation of
consecutive powers of the argument x, each multiplied by a constant denominator, with the corresponding numerator divided by that product.
Pre-calculation of the coefficients was not possible because the denominators depend on the argument x.
Accuracy compared to the Mathematica code is between 15 and 16 decimal places.

Code for the C implementation of the Stirling gamma function, slower version:

[bash]inline double gamma(double x){
    double sum,result;
    sum = 0;
    result = 0;
    if( x > gamma_huge){
        return (x-x)/(x-x);   //NaN
    }else if( x < one){
        return (x-x)/(x-x);
    }else{
        /* Pi, one and gamma_huge are constants defined elsewhere in the poster's code */
        double ln,power,pi_sqrt,two_pi,arg;
        two_pi = 2*Pi;
        double coef1[] = {1.0,1.0,139.0,571.0,163879.0,
            5246819.0,534703531.0,4483131259.0,
            432261921612371.0, 6232523202521089.0,
            25834629665134204969.0,1579029138854919086429.0,
            746590869962651602203151.0,1511513601028097903631961.0 };
        double coef2[] = {12.0,288.0,51840.0,2488320.0,209018880.0,
            75246796800.0, 902961561600.0,86684309913600.0,
            514904800886784000.0, 86504006548979712000.0,
            13494625021640835072000.0,9716130015581401251840000.0,
            116593560186976815022080000.0,2798245444487443560529920000.0 };
        int const size = 14;
        double temp[size];
        double temp2[size];
        double temp3[size];
        int i,j;
        long k;
        k = 0;
        /* The loop bodies and the tail of this function were garbled when the post
           was rendered; they are reconstructed here from the surrounding description
           and from the faster fastgamma() version posted later in this thread. */
        for(i = 0; i < size; i++){
            k++;
            temp[i] = pow(x,k);               // x^1 ... x^14
        }
        for(j = 0; j < size; j++){
            temp2[j] = (temp[j]*coef2[j]);    // x^k * denominator
        }
        for(int j1 = 0; j1 < size; j1++){
            temp3[j1] = coef1[j1]/temp2[j1];  // numerator / (x^k * denominator)
        }
        result = one;
        for(i = 0; i < size; i++){
            int sign = ((i/2) % 2 == 0) ? 1 : -1;  // Stirling series sign pattern + + - - ...
            result += sign*temp3[i];
        }
        ln = exp(-x);
        arg = x-0.5;
        power = pow(x,arg);
        pi_sqrt = sqrt(two_pi);
        sum = ln*power*pi_sqrt;               // sqrt(2*Pi) * x^(x-0.5) * e^(-x)
    }
    return sum*result;
}[/bash]
Results of 1e6 iterations of gamma(), called with an argument incremented by 0.000001. Microsoft compiler, optimized:

gamma() start value is: 13214329
gamma() end value is: 13214688
execution time of gamma() 1e6 iterations is: 359 millisec
gamma 1.999999997453735200000000

Intel Compiler optimized for speed

gamma() start value is: 24295767
gamma() end value is: 24296016
execution time of gamma() 1e6 iterations is: 249 millisec
gamma 1.999999997453735200000000

Intel compiler settings

/Zi /nologo /W3 /MP /O2 /Ob2 /Oi /Ot /Oy /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /EHsc /GS- /Gy /arch:SSE2 /fp:precise /QxHost /Zc:wchar_t /Zc:forScope /Fp"Release\inline.c.pch" /FAcs /Fa"Release" /Fo"Release" /Fd"Release\vc100.pdb" /Gd /TP
0 Kudos
Bernard
Valued Contributor I
774 Views

Tested fastsin() today with 1e6 iterations; the result was 15 millisec, i.e. ~33.39 cycles per iteration on my CPU. Finally the results are close to yours.

results

start val of fastsin() 13214314
end val of fastsin() 13214329
running time of fastsin() release code is: 15 millisec
fastsin() is: 0.891207360591512180000000

bronxzv
"AFAIK modern methodsuseprecomputed coefficientswith minimax polynomials computed offline, for a software implementation the memory used by the coefficients is a non-issue anyway".

When I studied the book "Numerical Recipes in C" I saw that many of their implementations are based on
calculating coefficients in for-loops. To optimize the speed of execution, the authors do not use library pow() calls for the powers. A few times I also saw division being used instead of reciprocal multiplication.
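As a small illustration of that style (a sketch, not code from the book): keep a running power of x instead of calling pow(), and multiply by precomputed reciprocals instead of dividing inside the loop.

[bash]
#include <stdio.h>

/* Sketch: evaluate sum of coef_recip[k] * x^k without pow() and without
   division in the loop; the reciprocals 1/denom[k] are precomputed once. */
static double series(double x, const double *coef_recip, int n)
{
    double sum = 0.0;
    double xk  = 1.0;                 /* running power: x^0, x^1, ... */
    int k;
    for (k = 0; k < n; ++k) {
        sum += coef_recip[k] * xk;    /* multiply by precomputed 1/denom[k] */
        xk  *= x;                     /* next power, no pow() call */
    }
    return sum;
}

int main(void)
{
    /* example: 1/k! for exp(x), stored as reciprocals up front */
    const double inv_fact[6] = { 1.0, 1.0, 1.0/2, 1.0/6, 1.0/24, 1.0/120 };
    printf("%f\n", series(1.0, inv_fact, 6));   /* ~2.7167, a rough exp(1) */
    return 0;
}
[/bash]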

0 Kudos
Bernard
Valued Contributor I
774 Views
Results for the faster version of the gamma Stirling approximation, which uses 10 Stirling series coefficients. I was not able to use the Horner scheme directly because of accumulated error.

Code for faster version

[bash]inline double fastgamma(double x){
    double sum,result;
    sum = 0;
    result = 0;
    if( x > gamma_huge){
        return (x-x)/(x-x);
    }else if( x < one){
        return (x-x)/(x-x);
    }else{
        double ln,power,pi_sqrt,two_pi,arg;
        two_pi = 2*Pi;
        double nom1,nom2,nom3,nom4,nom5,nom6,nom7,nom8,nom9,nom10;
        double denom1,denom2,denom3,denom4,denom5,denom6,denom7,denom8,denom9,denom10;
        double coef1,coef2,coef3,coef4,coef5,coef6,coef7,coef8,coef9,coef10;
        double z1,z2,z3,z4,z5,z6,z7,z8,z9,z10;
        nom1 = 1.0;
        nom2 = 1.0;
        nom3 = 139.0;
        nom4 = 571.0;
        nom5 = 163879.0;
        nom6 = 5246819.0;
        nom7 = 534703531.0;
        nom8 = 4483131259.0;
        nom9 = 432261921612371.0;
        nom10 = 6232523202521089.0;
        denom1 = 12.0;
        denom2 = 288.0;
        denom3 = 51840.0;
        denom4 = 2488320.0;
        denom5 = 209018880.0;
        denom6 = 75246796800.0;
        denom7 = 902961561600.0;
        denom8 = 86684309913600.0;
        denom9 = 514904800886784000.0;
        denom10 = 86504006548979712000.0;
        z1 = x;
        z2 = z1*z1;   //x^2
        z3 = z2*z1;   //x^3
        z4 = z3*z1;   //x^4
        z5 = z4*z1;   //x^5
        z6 = z5*z1;   //x^6
        z7 = z6*z1;   //x^7
        z8 = z7*z1;   //x^8
        z9 = z8*z1;   //x^9
        z10 = z9*z1;  //x^10
        ln = exp(-x);
        arg = x-0.5;
        power = pow(x,arg);
        pi_sqrt = sqrt(two_pi);
        sum = ln*power*pi_sqrt;   // sqrt(2*Pi) * x^(x-0.5) * e^(-x)
        coef1 = nom1/(z1*denom1);
        coef2 = nom2/(z2*denom2);
        coef3 = nom3/(z3*denom3);
        coef4 = nom4/(z4*denom4);
        coef5 = nom5/(z5*denom5);
        coef6 = nom6/(z6*denom6);
        coef7 = nom7/(z7*denom7);
        coef8 = nom8/(z8*denom8);
        coef9 = nom9/(z9*denom9);
        coef10 = nom10/(z10*denom10);
        result = one+coef1+coef2-coef3-coef4+coef5+coef6-coef7-coef8+coef9+coef10;
    }
    return sum*result;
}[/bash]
Results of 1e6 iterations of fastgamma(), called with an argument starting at 2.0 and incremented by 0.000001.

fastgamma() start value is 13214688
fastgamma() end value is 13214875
execution time of fastgamma() 1e6 iterations is: 187 millisec
fastgamma 2.000000016396600500000000

The same test compiled with Intel compiler

fastgamma() start value is 24296016
fastgamma() end value is 24296157
execution time of fastgamma() 1e6 iterations is: 141 millisec
fastgamma 2.000000016396600500000000

Compiler settings

/Zi /nologo /W3 /MP /O2 /Ob2 /Oi /Ot /Oy /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /EHsc /GS- /Gy /arch:SSE2 /fp:precise /QxHost /Zc:wchar_t /Zc:forScope /Fp"Release\inline.c.pch" /FAcs /Fa"Release" /Fo"Release" /Fd"Release\vc100.pdb" /Gd /TP
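For comparison with the code above, here is a hedged sketch (not a replacement for the posted fastgamma()) of the Horner form that was mentioned as suffering from accumulated error: the same ten-term correction series written as a polynomial in t = 1/x, using the nomk/denomk ratios and the same + + - - sign pattern.

[bash]
#include <math.h>
#include <stdio.h>

/* Sketch: the ten-term Stirling correction series rewritten in Horner form in t = 1/x. */
static double stirling_correction(double x)
{
    const double a1  =  1.0 / 12.0;
    const double a2  =  1.0 / 288.0;
    const double a3  = -139.0 / 51840.0;
    const double a4  = -571.0 / 2488320.0;
    const double a5  =  163879.0 / 209018880.0;
    const double a6  =  5246819.0 / 75246796800.0;
    const double a7  = -534703531.0 / 902961561600.0;
    const double a8  = -4483131259.0 / 86684309913600.0;
    const double a9  =  432261921612371.0 / 514904800886784000.0;
    const double a10 =  6232523202521089.0 / 86504006548979712000.0;
    double t = 1.0 / x;
    double p = a10;
    p = p * t + a9;
    p = p * t + a8;
    p = p * t + a7;
    p = p * t + a6;
    p = p * t + a5;
    p = p * t + a4;
    p = p * t + a3;
    p = p * t + a2;
    p = p * t + a1;
    return 1.0 + p * t;
}

int main(void)
{
    double x = 3.0;
    /* sqrt(2*Pi) * x^(x-0.5) * e^(-x) * correction, same prefactor as the post above */
    double g = exp(-x) * pow(x, x - 0.5) * sqrt(2.0 * 3.14159265358979323846) * stirling_correction(x);
    printf("gamma(3) ~ %.15f (exact value is 2)\n", g);
    return 0;
}
[/bash]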
0 Kudos
SergeyKostrov
Valued Contributor II
774 Views

I found a couple of notes related to some concerns about Performance and Security Cookie Checking ( or Buffer Security Checks )...

1. In Post #132 ( current thread )
// ...Still as you can see native code is slower maybe probably because of overhead of security cookie beign checked on the stack...

2. In Post #142 ( current thread )
// ...indeed starting a function with a useless "rep stosd" is not the best thing to do if you want fast code...

3. In a thread 'Java polynomial approx. faster than C code polynomial approx.':

http://software.intel.com/en-us/forums/showthread.php?t=105973&o=a&s=lr

In Post #5:

Zi /nologo /W3 /WX- /O2 /Ob2 /Oi /Ot /Oy /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE"
/Gm- /EHsc /GS- /Gy /arch:SSE2 /fp:precise /Zc:wchar_t /Zc:forScope
/Fp"Release\inline.c.pch" /FAcs /Fa"Release" /Fo"Release" /Fd"Release\vc100.pdb" /Gd /analyze-/errorReport:queue

I see that the compiler option Buffer Security Check is disabled: '/GS-'.

Best regards,
Sergey

0 Kudos
SergeyKostrov
Valued Contributor II
774 Views
Quoting iliyapolak
Results for the faster version of gamma stirling approximation which uses 10 stirling series coefficients...

Hi Iliya,

Wouldn't it be better to have a different thread for it? For example, all discussions could be separated as follows:

'FastSin' in a Thread A
'FastGamma' in a Thread B
'FastExp' in a Thread C

etc.

I think in that case the content of the threads in the forum will be more consistent and useful for everybody.

Best regards,
Sergey
0 Kudos
Bernard
Valued Contributor I
774 Views

Wouldn't it be better to have a different thread for it? For example, all discussions could be separated as follows:

'FastSin' in a Thread A
'FastGamma' in a Thread B
'FastExp' in a Thread C

Sergey!
I think it is a good idea to separate the discussion into different threads.
Now I'm starting with the gamma function optimization and speed-of-execution testing.
0 Kudos
Bernard
Valued Contributor I
774 Views
It was not disabled when I compiled in Debug mode; hence those pesky rep stosd instructions at the function's prologue were present.
0 Kudos
SergeyKostrov
Valued Contributor II
774 Views
I've made a comment regarding Buffer Security Checks ( compiler option /GS ) in Post #160. Please take a look.
0 Kudos
Bernard
Valued Contributor I
790 Views
Hi Sergey!

I created a new thread on the gamma Stirling approximation. Can you try to test those functions?

0 Kudos
Bernard
Valued Contributor I
789 Views

I've made a comment regarding Buffer Security Checks ( compiler option /GS ) in Post #160. Please take a look.


The code was compiled in release mode with buffer security checks disabled. The earlier compilations were in Debug mode.
0 Kudos
SergeyKostrov
Valued Contributor II
790 Views
Hi Iliya,

I'd like to make a comment regarding your method of error processing, that is:
...
return ( (x-x) / (x-x) );
...

For any 'x', the expression ( (x-x) / (x-x) ) equals -1.#IND00000000000. Here is a Test-Case:
...
double x = -25.0L;
double y = ( x-x ) / ( x-x );
printf( "%.15f\n", y );
...
Output:
-1.#IND00000000000

CRT trigonometric functions 'sin' or 'cos' always return some value because they do a range reduction. Why did you decide
to use ( (x-x) / (x-x) )?

Best regards,
Sergey

0 Kudos
Bernard
Valued Contributor I
790 Views
Hi Sergey, I did not implement range reduction in fastsin(). The function simply returns a NaN value when the input is out of range. I know that the Taylor series for sine gives wrong results when the argument in radians is larger than 3, so I used a simple input check to calculate only the range 0 < |x| < Pi/2. Btw, please take a look at the speed-of-execution testing of the various fastgamma() implementations; you can see that the Horner scheme approach can be 4x faster than the iterative method.
P.S
In the fastsin() function posted here the input checking is written incorrectly. It should be:

if(x > (Pi/2)){
    return (x-x)/(x-x);      // return NaN
}else if(x < -(Pi/2)){
    return (-x+x)/(-x+x);    // return NaN
}else{
    // fastsin() code here .....
}
0 Kudos
SergeyKostrov
Valued Contributor II
790 Views
Quoting iliyapolak
...I did not implement range reduction in fastsin.

[SergeyK] Almost the same ( 75% ) for the trigonometric API is in my case.

The function simply returns NaN value when an input is out of range...

Hi Iliya,

I don't know if you have some requirements for your API or you're doing some R&D. So, I simply would like to express
my opinion and it is as follows:

Let me mention a well-known fact: the sine function repeats itself every 2*PI radians. In the case of
large angles, a multiple of 2*PI can be subtracted so that the new value of the angle lies
between 0 and 2*PI. Then let's think about practical applications of some API. What if somebody needs to calculate
the sine of 750 degrees ( PI/6 or 30 degrees )? It is possible to note that it is mandatory for the user
to do their own range reduction, but a CRT function 'sin' would be very impractical if it didn't do range reduction and
everybody were forced to do it. It would create a "perfect" case for inconsistencies of FP operations
with trigonometric functions, and everybody would complain about it.
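A minimal sketch of that idea (illustrative only, not the API discussed here): bring the angle into [0, 2*PI) before evaluating anything else. Very large angles would need a Cody-and-Waite or Payne-and-Hanek style reduction to keep the residue accurate.

[bash]
#include <math.h>
#include <stdio.h>

/* Sketch: naive range reduction into [0, 2*PI) by subtracting a multiple of 2*PI. */
static double reduce_2pi(double x)
{
    const double TWO_PI = 6.283185307179586476925286766559;
    double k = floor(x / TWO_PI);
    return x - k * TWO_PI;
}

int main(void)
{
    double deg750 = 750.0 * 3.14159265358979323846 / 180.0;  /* 750 degrees in radians */
    double r = reduce_2pi(deg750);
    printf("reduced = %f rad, sin = %f (expected sin(30 deg) = 0.5)\n", r, sin(r));
    return 0;
}
[/bash]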

Best regards,
Sergey

0 Kudos
SergeyKostrov
Valued Contributor II
790 Views
...let's think about practical applications of some API. What if somebody needs to calculate the sine of 750 degrees ( PI/6 or 30 degrees )?..


I decided to apply a more flexible approach when working on implementation of trigonometric API:

- For a set of macros:

input value is in radians
there is no range reduction
always returns some value
speed rating ( ***** )

- For a set of methods in a template class:

input value could be in radians or in degrees
there is no range reduction
always returns some value
speed rating ( **** )

- For a set of non-template based external functions:

input value could be in radians or in degrees
there is a range reduction
always returns some value
speed rating ( *** )

All these sets are implemented in C, absolutely portable, faster than standard CRT-functions ( speed rating *** ),
and less accurate than standard CRT-functions.
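To make the tiering concrete, here is a rough sketch of what the two extremes could look like (entirely a guess at the shape, not Sergey's actual API): a macro with no checks at all, and an external function that adds a simple range reduction on top of the same polynomial.

[bash]
#include <math.h>
#include <stdio.h>

/* Illustrative two tiers only; names and coefficients are not from this thread. */

/* Tier "macro": radians only, no range reduction, caller keeps |x| small */
#define FAST_SIN_POLY(x) \
    ((x) * (1.0 + (x)*(x) * (-1.0/6.0 + (x)*(x) * (1.0/120.0 + (x)*(x) * (-1.0/5040.0)))))

/* Tier "external function": accepts any angle, reduces into [-PI, PI] first,
   then evaluates the same polynomial */
double FastSinReduced(double x)
{
    const double TWO_PI = 6.283185307179586476925286766559;
    x -= TWO_PI * floor(x / TWO_PI + 0.5);   /* nearest multiple of 2*PI */
    return FAST_SIN_POLY(x);
}

int main(void)
{
    printf("%f %f\n", FAST_SIN_POLY(0.5),
           FastSinReduced(0.5 + 4.0 * 3.14159265358979323846));  /* same angle + 2 full turns */
    return 0;
}
[/bash]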

Best regards,
Sergey

0 Kudos
SergeyKostrov
Valued Contributor II
790 Views
...All these sets are implemented in C, absolutely portable, faster than standard CRT-functions...


And one more comment regarding performance...

Performance of different Sine functions compared to the CRT function 'sin' when tested on different platforms:

- Normalized Taylor Series up to 7th term is the Fastest implementation
for Windows Desktop Platforms

- Normalized Series up to 7th term is the Fastest implementation
for Windows CE Platforms

- Normalized Chebyshev Polynomial up to 9th term is the Fastest implementation
for 16-bit Platforms ( like, MS-DOS )

0 Kudos