Solved: Optimization of sine function's Taylor expansion - Page 8

Bernard · ‎05-24-2012

Hello!
While coding various series expansions of many functions in assembly, I tried to optimize my code mainly by using the rcpps instruction instead of divps. When testing code which calculates the sine function by Taylor series, I did a few measurements as explained here: "http://software.intel.com/en-us/forums/showthread.php?t=52482" and the result was between 10-12 cycles per first term of the sine expansion (I used 14 terms).
I would like to ask you how I can rewrite this code in order to improve its speed of execution.
[bash]
movups xmm0,argument
movups xmm1,argument
mulps  xmm1,xmm1
mulps  xmm1,xmm0
mov    ebx,OFFSET coef
movups xmm2,[ebx]
rcpps  xmm3,xmm2       ;rcpps used instead of divps
mulps  xmm1,xmm3
subps  xmm0,xmm1
[/bash]

bronxzv · ‎06-08-2012

calling fastsin() 1e6 times, the result in milliseconds is 63

so it's 63 ns per iteration or ~ 120 clocks on your CPU, it doesn't match your previous reports IIRC

if you keep only the polynomial (get rid of the strange domain check) you should begin to see timings nearer to mine


Bernard · ‎06-20-2012

Performance of different Sine functions compared to CRT-function 'sin' when tested on different platforms:

Interesting, but did you test the standard Taylor expansion for sine() exactly as I have written it, for both speed and accuracy?
My result was 16 ms for 1e6 iterations, i.e. exactly 35.616 cycles per iteration for fastsin() without the input checking. As I remember, your results were higher, something like 140 cycles per iteration.

Bernard · ‎06-21-2012

I don't know if you have some requirements for your API or you're doing some R&D. So, I simply would like to express
my opinion and it is as follows

Hi Sergey
I'm doing it for fun and for the sake of learning. I have only 8 months of serious programming experience and I'm interested mainly in numerical analysis and computational physics (taking my first steps). So I thought that the best way to learn is to use a book like "Handbook of Mathematical Functions" and try to develop my own implementations of the various formulas. I have already written 52 functions (if you are interested I can send you a Java source file) with varying degrees of accuracy and speed of execution.

>>Let me mention a well known fact that the sine trigonometric function repeats itself every 2*PI

I know this pretty well, and as I stated above it is not serious API code and it will not be distributed among users. Not now, but maybe later. :)
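
For completeness, here is a minimal sketch of how the periodicity mentioned in the quote could be used to accept arguments outside -Pi/2..Pi/2 instead of rejecting them. The function names are hypothetical and this is not code from the thread. It folds the argument with sin(x + 2*Pi) = sin(x) and sin(Pi - x) = sin(x), then evaluates a Taylor polynomial up to x^15. Note that this simple reduction loses accuracy for very large |x|, because 2*Pi itself is only known to double precision.

```cpp
#include <cmath>

// Hypothetical sketch: fold any finite x into [-pi/2, pi/2].
double reduce_to_half_pi(double x)
{
    const double pi = 3.14159265358979323846;
    double r = std::remainder(x, 2.0 * pi); // r is now in [-pi, pi]
    if (r > pi / 2.0)  r = pi - r;          // sin(r) = sin(pi - r)
    if (r < -pi / 2.0) r = -pi - r;         // sin(r) = sin(-pi - r)
    return r;
}

// Taylor polynomial up to x^15, evaluated in Horner form on r^2.
double sin_reduced(double x)
{
    const double r = reduce_to_half_pi(x);
    const double s = r * r;
    const double c[] = { -1.0 / 6.0, 1.0 / 120.0, -1.0 / 5040.0, 1.0 / 362880.0,
                         -1.0 / 39916800.0, 1.0 / 6227020800.0,
                         -1.0 / 1307674368000.0 }; // -1/3! ... -1/15!
    double p = c[6];
    for (int i = 5; i >= 0; --i)
        p = c[i] + s * p;
    return r + r * s * p;
}
```

On the reduced interval the first omitted term is (Pi/2)^17 / 17!, about 6e-12, so this sketch is less accurate than the 23rd-degree fastsin() discussed in this thread.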

Bernard · ‎06-21-2012

I decided to apply a more flexible approach when working on implementation of trigonometric API

This is a very good reference model of how to implement and develop code.

>>less accurate than standard CRT-functions

Why don't you implement your sine functions exactly as fastsin(), which has an absolute error (AE) of 0 when compared to CRT sin()?

Do you have any experience in Java programming?
Have you considered developing an API which is based only on complex numbers and complex functions?

SergeyKostrov · ‎06-21-2012

Quoting iliyapolak

Performance of different Sine functions compared to CRT-function 'sin' when tested on different platforms:
Interesting, but did you test the standard Taylor expansion for sine() exactly as I have written it, for both speed and accuracy?..

It is already in progress and I'll post results soon.

Best regards,
Sergey

SergeyKostrov · ‎06-21-2012

Hi Iliya,

Quoting iliyapolak

...So I thought that the best way to learn is to use a book like "Handbook of Mathematical Functions" and try to develop
my own implementations of the various formulas. I have already written 52 functions (if you are interested I can send you a Java source file)
with varying degrees of accuracy and speed of execution...

That would be awesome if you share your work. Since it is in Java, a forum like "Android Applications on Intel Architecture" or "MKL" could be the most suitable. Please confirm it with the Moderators of Intel Forums.

Thank you!

Best regards,
Sergey

SergeyKostrov · ‎06-21-2012

Quoting iliyapolak

...
>>less accurate than standard CRT-functions

Why don't you implement your sine functions exactly as fastsin(), which has an absolute error (AE) of 0 when compared to CRT sin()?

[SergeyK] It was decided that for a Real-Time environment the number of terms in the series has to be reduced to 7 and 9.
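
To illustrate the accuracy cost of such a reduction, here is a hedged sketch (hypothetical helper names, not Sergey's code) that truncates the Taylor series of sin(x) after a chosen odd power and reports the first omitted term. It assumes that "7" and "9" refer to the highest power kept, which appears to match the "7t" and "9t" values for Sin( 30.0 ) posted in this thread.

```cpp
#include <cmath>

// Hypothetical helper: Taylor series of sin(x) truncated after the x^degree
// term (degree must be odd). Summed term by term for clarity, not for speed.
double taylor_sin(double x, int degree)
{
    double term = x;   // current term, starts at x^1 / 1!
    double sum = x;
    for (int k = 3; k <= degree; k += 2) {
        term *= -x * x / (static_cast<double>(k - 1) * k); // build the next odd term
        sum += term;
    }
    return sum;
}

// Magnitude of the first omitted term, |x|^(degree+2) / (degree+2)!.
// For an alternating series with shrinking terms (|x| <= pi/2 here) this
// bounds the truncation error.
double first_omitted_term(double x, int degree)
{
    double term = std::fabs(x);
    for (int k = 3; k <= degree + 2; k += 2)
        term *= x * x / (static_cast<double>(k - 1) * k);
    return term;
}
```

For x = Pi/6 (30 degrees) the degree-7 series is off by about 8.1e-9 and the degree-9 series by about 2.0e-11, in line with the posted 7t and 9t results.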

Do you have any experience in Java programming?

[SergeyK] My experience with Java is insignificant. I would say: 95% is C/C++ and 5% is all the other programming languages,
like FoxPro, Java, Gupta, Assembler, Visual Basic, Pascal.

Have you considered developing an API which is based only on complex numbers and complex functions?

[SergeyK] No.

Best regards,
Sergey

Bernard · ‎06-21-2012

Thank You Sergey!
I decided to post the source code of my Functions class. It is still in beta, but all methods work. You can find there many esoteric functions like Bessel, Kelvin, Sine Integral, Cosine Integral, various Gamma functions and many more. As I stated earlier in my post, none of those methods are written for speed of execution; moreover they are very slow, look here: http://software.intel.com/en-us/forums/showthread.php?t=106032
You can see how pre-calculation of coefficients and the Horner scheme can improve speed of execution (in the case of my gamma function it was ~4x faster).
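
As a rough illustration of the idea (a sketch with made-up names, not the code from the Functions class), compare a naive series that rebuilds each power and factorial with a Horner form whose reciprocal-factorial coefficients are compile-time constants. Both evaluate sin(x) up to the x^11 term.

```cpp
#include <cmath>

// Naive form: every term recomputes a power and a factorial, then divides.
double sin_naive(double x)
{
    double sum = 0.0;
    double sign = 1.0;
    for (int n = 1; n <= 11; n += 2) {
        double fact = 1.0;
        for (int k = 2; k <= n; ++k)
            fact *= k;                          // n! rebuilt for every term
        sum += sign * std::pow(x, n) / fact;    // one pow and one divide per term
        sign = -sign;
    }
    return sum;
}

// Horner form: coefficients -1/3!, 1/5!, ..., -1/11! are constants, so each
// step of the loop is one multiply and one add.
double sin_horner(double x)
{
    static constexpr double c[] = {
        -1.0 / 6.0, 1.0 / 120.0, -1.0 / 5040.0, 1.0 / 362880.0, -1.0 / 39916800.0
    };
    const double s = x * x;
    double p = c[4];
    for (int i = 3; i >= 0; --i)
        p = c[i] + s * p;
    return x + x * s * p;
}
```

The two functions agree to within rounding; how much faster the Horner form runs depends on the compiler and CPU, so it should be measured rather than assumed.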
Please feel free to use my code, to port it to C/C++, to improve it and to test it. And do not forget to give your opinion on my work. :)
P.S.
You can open the Java source file in any editor, even Notepad, but I recommend using the Eclipse Indigo IDE.

Bernard · ‎06-21-2012

Since it is in Java, a forum like "Android Applications on Intel Architecture" or "MKL" could be the most suitable. Please confirm it with the Moderators of Intel Forums.

Hi Sergey

Now I'm porting the Java "Functions" class to a C++ static library. Where possible I will optimize for speed. I have already ported a few functions. When the job is done I will post the source file.

>>or "MKL"

I think that this forum is too math-centric and math-oriented, and more suited to professional mathematicians than to programmers.

Bernard · ‎06-21-2012

It is already in progress and I'll post results soon

If you are interested in testing a more complicated function than sin(), please look here: http://software.intel.com/en-us/forums/showthread.php?t=106032

Bernard · ‎06-21-2012

My experience with Java is insignificant. I would say: 95% is C/C++ and 5% is all the other programming languages,
like FoxPro, Java, Gupta, Assembler, Visual Basic, Pascal

I think that learning x86 assembly should be obligatory for professional programmers. It should be 70% C/C++ and 30% assembly.

SergeyKostrov · ‎06-22-2012

Quoting Sergey Kostrov

Quoting iliyapolak
Performance of different Sine functions compared to CRT-function 'sin' when tested on different platforms:
Interesting, but did you test the standard Taylor expansion for sine() exactly as I have written it, for both speed and accuracy?..

It is already in progress and I'll post results soon...

Comments:

- Number of Iterations - 2^22
- Time of execution of a single call has to be calculated as follows: ( Completed in XXX ticks ) divided by ( Number of iterations ), for example: 297 / 2^22
- Microsoft C++ compiler
- All optimizations disabled
- Release configuration

...
Application - ScaLibTestApp - WIN32_MSC
Tests: Start
> Test1067 Start <
Sub-Test 1.1
Completed in 297 ticks
CRT Sin( 30.0 ) = 0.4999999999999999400000

Sub-Test 2.1
Completed in 234 ticks
Normalized Series 7t Sin( 30.0 ) = 0.4999999918690232700000

Sub-Test 3.1
Completed in 234 ticks
Normalized Series 9t Sin( 30.0 ) = 0.5000000000202800000000

Sub-Test 4.1
Completed in 266 ticks
Normalized Series 11t Sin( 30.0 ) = 0.5000000000000000000000

Sub-Test 5.1
Completed in 266 ticks
Chebyshev Polynomial 7t Sin( 30.0 ) = 0.4999999476616695500000

Sub-Test 6.1
Completed in 328 ticks
Chebyshev Polynomial 9t Sin( 30.0 ) = 0.4999999997875643800000

Sub-Test 7.1
Completed in 219 ticks
Normalized:
Chebyshev Polynomial 7t Sin( 30.0 ) = 0.4999999476616694400000

Sub-Test 8.1
Completed in 234 ticks
Normalized:
Chebyshev Polynomial 9t Sin( 30.0 ) = 0.4999999997875643800000

Sub-Test 9.1
Completed in 203 ticks
Normalized:
Taylor Series 7t Sin( 30.0 ) = 0.4999999918690232200000

Sub-Test 10.1
Completed in 234 ticks
Normalized:
Taylor Series 9t Sin( 30.0 ) = 0.5000000000202798900000

Sub-Test 11.1
Completed in 532 ticks
Normalized:
Taylor Series 23t Sin( 30.0 ) = 0.4999999999999999400000 - FastSinV1 - Not Optimized

Sub-Test 12.1
Completed in 453 ticks
Normalized:
Taylor Series 23t Sin( 30.0 ) = 0.4999999999999999400000 - FastSinV2 - Optimized

Sub-Test 13.1
Completed in 109 ticks
Normalized:
Taylor Series 11t Sin( 30.0 ) = 0.4999999999999643100000 - C Macro

Sub-Test 14.1
Completed in 266 ticks
1.00 deg step for a LUT of Sine Values:
Interpolated Sin( 30.0 ) = 0.5000000000000000000000

Sub-Test 15.1
Completed in 265 ticks
1.00 deg step for a LUT of Sine Values:
Interpolated Cos( 30.0 ) = 0.8660254037844386000000

> Test1067 End <
Tests: Completed
...

I'll post modified code of your 'FastSin' function later. I need to get back to my regular project-related work.

Best regards,
Sergey

SergeyKostrov · ‎06-22-2012

Quoting iliyapolak

...I would like to ask you how I can rewrite this code in order to improve its speed of execution...

Hi Iliya, I would say this is your "sacred" question and here are a couple of versions of your 'FastSin' function.

[cpp]
...
// Normalized Taylor Series ( up to 23rd term ) - V1 - Not Optimized
RTdouble FastSinV1( RTdouble dX );

RTdouble FastSinV1( RTdouble dX )
{
    RTdouble dSum = 0.0L;

    if( dX > ( MC_PI / 2.0L ) )             // Valid range is -Pi/2 <= x <= Pi/2
    {
        return ( dX - dX )/( dX - dX );     // Returns NaN
    }
    else
    if( dX < ( -MC_PI / 2.0L ) )            // Rejects x < -Pi/2
    {
        return (-dX + dX )/(-dX + dX );     // Returns NaN
    }
    else
    {
        RTdouble dCoef1, dCoef2, dCoef3, dCoef4, dCoef5, dCoef6, dCoef7,
                 dCoef8, dCoef9, dCoef10, dCoef11, dRad, dSqr;

        dCoef1  = -0.16666666666666666666666666666667000;   // 1/3!
        dCoef2  =  0.00833333333333333333333333333333000;   // 1/5!
        dCoef3  = -1.9841269841269841269841269841270e-04;   // 1/7!
        dCoef4  =  2.7557319223985890652557319223986e-06;   // 1/9!
        dCoef5  = -2.5052108385441718775052108385442e-08;   // 1/11!
        dCoef6  =  1.6059043836821614599392377170155e-10;   // 1/13!
        dCoef7  = -7.6471637318198164759011319857881e-13;   // 1/15!
        dCoef8  =  2.8114572543455207631989455830103e-15;   // 1/17!
        dCoef9  = -8.2206352466243297169559812368723e-18;   // 1/19!
        dCoef10 =  1.9572941063391261230847574373505e-20;   // 1/21!
        dCoef11 = -3.8681701706306840377169119315228e-23;   // 1/23!

        dRad = dX;
        dSqr = dX * dX;     // dX^2

        dSum = dRad + dRad*dSqr*( dCoef1 + dSqr*( dCoef2 + dSqr*( dCoef3 + dSqr*( dCoef4 +
               dSqr*( dCoef5 + dSqr*( dCoef6 + dSqr*( dCoef7 + dSqr*( dCoef8 +
               dSqr*( dCoef9 + dSqr*( dCoef10 + dSqr*( dCoef11 )))))))))));
    }
    return ( RTdouble )dSum;
}
...
[/cpp]

SergeyKostrov · ‎06-22-2012

[cpp]
...
// Normalized Taylor Series ( up to 23rd term ) - V2 - Optimized
RTdouble dCoef1  = -0.16666666666666666666666666666667000;   // 1/3!
RTdouble dCoef2  =  0.00833333333333333333333333333333000;   // 1/5!
RTdouble dCoef3  = -1.9841269841269841269841269841270e-04;   // 1/7!
RTdouble dCoef4  =  2.7557319223985890652557319223986e-06;   // 1/9!
RTdouble dCoef5  = -2.5052108385441718775052108385442e-08;   // 1/11!
RTdouble dCoef6  =  1.6059043836821614599392377170155e-10;   // 1/13!
RTdouble dCoef7  = -7.6471637318198164759011319857881e-13;   // 1/15!
RTdouble dCoef8  =  2.8114572543455207631989455830103e-15;   // 1/17!
RTdouble dCoef9  = -8.2206352466243297169559812368723e-18;   // 1/19!
RTdouble dCoef10 =  1.9572941063391261230847574373505e-20;   // 1/21!
RTdouble dCoef11 = -3.8681701706306840377169119315228e-23;   // 1/23!

RTdouble FastSinV2( RTdouble dX );

RTdouble FastSinV2( RTdouble dX )
{
    if( dX > ( MC_PI / 2.0L ) )         // Valid range is -Pi/2 <= x <= Pi/2
    {
        return (dX-dX)/(dX-dX);         // Returns NaN
    }
    else
    if( dX < ( -MC_PI / 2.0L ) )        // Rejects x < -Pi/2
    {
        return (-dX+dX)/(-dX+dX);       // Returns NaN
    }

    return ( RTdouble )dX + dX*dX*dX*( dCoef1 + dX*dX*( dCoef2 + dX*dX*( dCoef3 + dX*dX*( dCoef4 +
           dX*dX*( dCoef5 + dX*dX*( dCoef6 + dX*dX*( dCoef7 + dX*dX*( dCoef8 +
           dX*dX*( dCoef9 + dX*dX*( dCoef10 + dX*dX*( dCoef11 )))))))))));
}
...
[/cpp]

SergeyKostrov · ‎06-22-2012

A version of 'FastSinV2' without checks for a range 0 < x < Pi/2 will be your fastest version
in C language:

[cpp]
...
// Normalized Taylor Series ( up to 23rd term ) - V3 - Optimized - Without Checks for a range 0 < x < Pi/2
RTdouble dCoef1  = -0.16666666666666666666666666666667000;   // 1/3!
RTdouble dCoef2  =  0.00833333333333333333333333333333000;   // 1/5!
RTdouble dCoef3  = -1.9841269841269841269841269841270e-04;   // 1/7!
RTdouble dCoef4  =  2.7557319223985890652557319223986e-06;   // 1/9!
RTdouble dCoef5  = -2.5052108385441718775052108385442e-08;   // 1/11!
RTdouble dCoef6  =  1.6059043836821614599392377170155e-10;   // 1/13!
RTdouble dCoef7  = -7.6471637318198164759011319857881e-13;   // 1/15!
RTdouble dCoef8  =  2.8114572543455207631989455830103e-15;   // 1/17!
RTdouble dCoef9  = -8.2206352466243297169559812368723e-18;   // 1/19!
RTdouble dCoef10 =  1.9572941063391261230847574373505e-20;   // 1/21!
RTdouble dCoef11 = -3.8681701706306840377169119315228e-23;   // 1/23!

RTdouble FastSinV3( RTdouble dX );

RTdouble FastSinV3( RTdouble dX )
{
    return ( RTdouble )dX + dX*dX*dX*( dCoef1 + dX*dX*( dCoef2 + dX*dX*( dCoef3 + dX*dX*( dCoef4 +
           dX*dX*( dCoef5 + dX*dX*( dCoef6 + dX*dX*( dCoef7 + dX*dX*( dCoef8 +
           dX*dX*( dCoef9 + dX*dX*( dCoef10 + dX*dX*( dCoef11 )))))))))));
}
...
[/cpp]

SergeyKostrov · ‎06-22-2012

Here is another set of performance numbers:

Application - ScaLibTestApp - WIN32_MSC
Tests: Start
> Test1067 Start <

Sub-Test 1.1
Completed in 297 ticks
CRT Sin( 30.0 ) = 0.4999999999999999400000

Sub-Test 2.1
Completed in 235 ticks
Normalized Series 7t Sin( 30.0 ) = 0.4999999918690232700000

Sub-Test 3.1
Completed in 234 ticks
Normalized Series 9t Sin( 30.0 ) = 0.5000000000202800000000

Sub-Test 4.1
Completed in 266 ticks
Normalized Series 11t Sin( 30.0 ) = 0.5000000000000000000000

Sub-Test 5.1
Completed in 265 ticks
Chebyshev Polynomial 7t Sin( 30.0 ) = 0.4999999476616695500000

Sub-Test 6.1
Completed in 328 ticks
Chebyshev Polynomial 9t Sin( 30.0 ) = 0.4999999997875643800000

Sub-Test 7.1
Completed in 219 ticks
Normalized:
Chebyshev Polynomial 7t Sin( 30.0 ) = 0.4999999476616694400000

Sub-Test 8.1
Completed in 219 ticks
Normalized:
Chebyshev Polynomial 9t Sin( 30.0 ) = 0.4999999997875643800000

Sub-Test 9.1
Completed in 203 ticks
Normalized:
Taylor Series 7t Sin( 30.0 ) = 0.4999999918690232200000

Sub-Test 10.1
Completed in 234 ticks
Normalized:
Taylor Series 9t Sin( 30.0 ) = 0.5000000000202798900000

Sub-Test 11.1
Completed in 516 ticks
Normalized:
Taylor Series 23t Sin( 30.0 ) = 0.4999999999999999400000 - FastSinV1 - Not Optimized

Sub-Test 12.1
Completed in 406 ticks
Normalized:
Taylor Series 23t Sin( 30.0 ) = 0.4999999999999999400000 - FastSinV2 - Optimized

Sub-Test 13.1
Completed in 360 ticks
Normalized:
Taylor Series 23t Sin( 30.0 ) = 0.4999999999999999400000 - FastSinV3 - Optimized / No Checks

Sub-Test 14.1
Completed in 109 ticks
Normalized:
Taylor Series 11t Sin( 30.0 ) = 0.4999999999999643100000 - C Macro

Sub-Test 15.1
Completed in 266 ticks
1.00 deg step for a LUT of Sine Values:
Interpolated Sin( 30.0 ) = 0.5000000000000000000000

Sub-Test 16.1
Completed in 265 ticks
1.00 deg step for a LUT of Sine Values:
Interpolated Cos( 30.0 ) = 0.8660254037844386000000

> Test1067 End <
Tests: Completed

SergeyKostrov · ‎06-22-2012

Here are performance numbers for 'sin' C-Macros:

...
Completed in 62 ticks
Normalized:
Taylor Series 7t Sin( 30.0 ) = 0.4999999918690232200000 - C Macro
...
Completed in 94 ticks
Normalized:
Taylor Series 9t Sin( 30.0 ) = 0.5000000000202798900000 - C Macro
...
Completed in 109 ticks
Normalized:
Taylor Series 11t Sin( 30.0 ) = 0.4999999999999643100000 - C Macro
...

Bernard · ‎06-23-2012

Sub-Test 11.1
Completed in 532 ticks
Normalized:
Taylor Series 23t Sin( 30.0 ) = 0.4999999999999999400000 - FastSinV1 - Not Optimized

Sub-Test 12.1
Completed in 453 ticks
Normalized:
Taylor Series 23t Sin( 30.0 ) = 0.4999999999999999400000 - FastSinV2 - Optimized

Hi Sergey!
Thanks for testing. I have a few questions regarding the results and the method of testing.
Bronxzv, in one of his responses, told me not to test with a constant value because of possible compiler optimization. Here is the quote from his post: "also, as previously noted, I don't think that calling it with a constant is the best idea, instead I'll advise to sample the domain like in the example I provided the other day".

What does XXX Ticks stand for? Is it nanoseconds or CPU cycles?

Here is my test for fastsin() written exactly as your unoptimized version:

Tested fastsin() today with 1e6 iterations and the result was 15 ms, i.e. ~33.39 cycles per iteration on my CPU.

results

start val of fastsin() 13214314
end val of fastsin() 13214329
running time of fastsin() release code is: 15 millisec
fastsin() is: 0.891207360591512180000000

Why are your results so large, almost as large as my slow implementation of the gamma Stirling approximation? Please compare here: http://software.intel.com/en-us/forums/showthread.php?t=105474

Bernard · ‎06-23-2012

Sergey!

Did you download the Java source file which I uploaded yesterday?
There is plenty of room there for implementing the Horner scheme with coefficient pre-calculation.
For one example, the optimization of the gamma Stirling approximation, please look here: http://software.intel.com/en-us/forums/showpost.php?p=188061

Here is the code and test results

[bash]
inline double fastgamma3(double x){
    double result,sum,num,denom;
    result = 0;
    sum = 0;
    if(x >= 0.01f && x <= one){
        double const coef1 = 6.69569585833067770821885e+6;
        double const coef2 = 407735.985300921332020398;
        double const coef3 = 1.29142492667105836457693e+6;
        double const coef4 = 1.00000000000000000000000000e+00;
        double const coef5 = 6.69558099277749024219574e+6;
        double const coef6 = 4.27571696102861619139483e+6;
        double const coef7 = -2.89391642413453042503323e+6;
        double const coef8 = 317457.367152592609873458;

        num = coef1+x*(coef2+x*(coef3));//MiniMaxApproximation calculated by Mathematica 8
        denom = coef4+x*(coef5+x*(coef6+x*(coef7+x*(coef8))));//MiniMaxApproximation calculated by Mathematica 8
        return num/denom;
    }else if( x >= one && x <= gamma_huge){
        double const coef_1 = 0.08333333333333333333333333;
        double const coef_2 = 0.00347222222222222222222222;
        double const coef_3 = -0.00268132716049382716049383;
        double const coef_4 = -0.000229472093621399176954733;
        double const coef_5 = 0.000784039221720066627474035;
        double const coef_6 = 0.0000697281375836585777429399;
        double const coef_7 = -0.000592166437353693882864836;
        double const coef_8 = -0.0000517179090826059219337058;
        double const coef_9 = 0.000839498720672087279993358;
        double const coef_10 = 0.0000720489541602001055908572;
        double ln,power,pi_sqrt,two_pi,arg;

        two_pi = 2*Pi;
        double invx = 1/x;
        ln = exp(-x);
        arg = x-0.5;
        power = pow(x,arg);
        pi_sqrt = sqrt(two_pi);
        sum = ln*power*pi_sqrt;
        result = one+invx*(coef_1+invx*(coef_2+invx*(coef_3+invx*(coef_4+invx*(coef_5+
                 invx*(coef_6+invx*(coef_7+invx*(coef_8+invx*(coef_9+invx*(coef_10))))))))));
    }
    return sum*result;
}
[/bash]

Speed of execution for first branch (MiniMaxApproximation), 1e6 iterations

fastgamma3() start value is 25488363
fastgamma3() end value is 25488379
execution time of fastgamma3() 1e6 iterations is: 16 millisec
fastgamma3() is: 1.489191725597434100000000

IDZ_A_Intel · ‎06-25-2012

Quoting iliyapolak

...Results of loop-overhead testing
As you can see I cannot measure loop overhead; moreover I also checked with the debugger that the empty for-loop is executed. Priority was set with the help of Process Explorer. Assembly instructions can be counted, so the overhead is the sum of a few x86 instructions like "jae [target], add 1, cmp some_value" and should be not more than a few cycles per iteration.

...
start value of loop_overhead : 5600529
end value of loop_overhead : 5600529
delta of loop_overhead is : 0
...

Hi Iliya,

This is a follow up on two Posts #117 and #114. I think you need to disable ALL optimizations in order to measure the overhead of
an empty 'for' statement. Intel C++ compiler could easily "remove" it. Since it didn't and your result was 0, something else
was wrong. I'll take a look at it some time this week.
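
One possible explanation, offered only as a guess, is timer resolution: GetTickCount typically advances in steps of roughly 10 to 16 ms, so a loop that finishes in a millisecond or two can report a delta of 0 even though it ran. The sketch below is hypothetical (not code from the thread) and times a nearly empty loop with std::chrono::steady_clock. The volatile counter forces a load and a store per iteration, so the loop cannot be removed, but it also means the measurement includes that memory access and is not the overhead of a truly empty loop.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical sketch: time a loop whose only work is incrementing a volatile
// counter. Returns the elapsed time in nanoseconds and writes the final
// counter value to *final_count so the caller can confirm the loop ran.
std::int64_t time_empty_loop(long iterations, long* final_count)
{
    volatile long counter = 0;
    const auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < iterations; ++i)
        counter = counter + 1;      // volatile access: cannot be optimized away
    const auto t1 = std::chrono::steady_clock::now();
    *final_count = counter;
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
}
```

Dividing the returned nanoseconds by the iteration count gives a per-iteration figure that can be compared with the "few cycles" estimate in the quoted post.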

Best regards,
Sergey

SergeyKostrov · ‎06-25-2012

Quoting iliyapolak

...What does XXX Ticks stand for? Is it nanoseconds or CPU cycles?

[SergeyK] XXX means a number of ticks. The Win32 API function 'GetTickCount' returns a value in
milliseconds, or 'ticks'. In my tests every 'sine' function is called 2^22 times and the time
for one call is calculated by dividing 'TicksNumber' by '2^22'. For example:

78 ms (ticks ) / 2^22 = 0.000018596649169921875 ms ~= 0.0000186 ms

...

SergeyKostrov · ‎06-25-2012

Quoting iliyapolak

...Why your results are so large almost as my slow implementation...

[SergeyK] Because your computer is faster.

Iliya, you're trying to compare incomparable values. Let's assume that for the same function:

- Value VA in ms was obtained on a computer A with CPU frequency FA
- Value VB in ms was obtained on a computer B with CPU frequency FB
- If value VB is less than value VA, computer B is faster than computer A

All performance numbers I usually post are for reference only.

You need to compare your value(s) against other value(s) ( a "reference" ) obtained on the same
computer with a similar CRT-function.

Once again, I don't measure the absolute performance of some function in milliseconds, nanoseconds or
clock cycles. I always measure relative performance. Let's say I've set a target to outperform some
"reference function" from the CRT library. If my function is faster, the target is achieved. I know that accuracy is
affected since 7, or 9, or 11 terms are used, but that is another issue.

Best regards,
Sergey