Intel® ISA Extensions

Optimization of the sine function's Taylor expansion

Bernard
Valued Contributor I
37,655 Views

Hello!
While coding various series expansions of functions in assembly, I tried to optimize my code mainly by using the rcpps instruction instead of divps. When performing tests on code which calculates the sine function by its Taylor series, I did a few measurements, as explained here: "http://software.intel.com/en-us/forums/showthread.php?t=52482", and the result was between 10-12 cycles per first term of the sine expansion (I used 14 terms).
I would like to ask how I can rewrite this code in order to improve its speed of execution.
[bash]movups xmm0,argument
movups xmm1,argument
mulps  xmm1,xmm1
mulps  xmm1,xmm0
mov    ebx,OFFSET coef
movups xmm2,[ebx]
rcpps  xmm3,xmm2      ; rcpps used instead of divps
mulps  xmm1,xmm3
subps  xmm0,xmm1[/bash]

1 Solution
bronxzv
New Contributor II
37,464 Views

calling fastsin() 1e6 times gives a result in milliseconds of 63

so it's 63 ns per iteration, or ~120 clocks on your CPU; it doesn't match your previous reports IIRC

if you keep only the polynomial (get rid of the strange domain check) you should begin to see timings nearer to mine


342 Replies
Bernard
Valued Contributor I
2,052 Views
Hi Sergey
I did a very interesting comparison of three methods of sin(x) calculation.
The first method is based entirely on coefficient calculation (nothing has been pre-calculated); it uses a for-loop, array accesses, and Math.pow calls to obtain each Taylor-series term x^n/n!.
There is a very large absolute error when the result is compared to the library Math.sin function. I suppose the source of the error is the accumulated round-off error of the division inside the for-loop. Time measurement of this method was based on Java's System.nanoTime, which returns the time difference in nanoseconds. The result returned was a whopping 274415 nanoseconds. I suppose the reason for this is very inefficient code which has two array accesses, for-loop overhead, many calls to the external library Math.pow, divisions, and
an assignment inside the loop.
The second method was much improved compared to the first one. I simply pre-calculated the reciprocals of the factorials and did the polynomial multiplication. The running time was 907 nanoseconds, compared to the Math.sin call (2264 nanoseconds). The absolute error was 0.0 when compared to the library Math.sin.
Here are the results of three sin(0.45)functions compared for speed of execution and accuracy:

value of sine inefficient implementation of taylor series

0.43724781952696220000000000000000

time of calculation of sine is: 274452 nanoseconds

Math.sin

0.43496553411123023000000000000000

time of calculation of Math.sin java library is : 2264 nanoseconds

poly fit taylor series

0.43496553411123023000000000000000

time of calculation polynomial fit of sin is : 906 nanoseconds

Absolute Error compared to Java Math.sin for inefficient method is : 0.002282285415731944

Absolute Error compared to Java Math.sin for polynomial fit method is :0.0


Bernard
Valued Contributor I
2,052 Views


note that there isn't the notion of precision control with SSEx/AVXx anymore; all SS/PS instructions work at 24-bit precision and SD/PD instructions at 53-bit precision. MXCSR has 2 rounding-mode bits but no FP precision control (only the precision exception flag and mask), unlike the x87


For high-precision calculation SSE cannot beat the scalar x87 FPU.
Bernard
Valued Contributor I
2,052 Views
Here is code for those two methods.
Please compare the results in post #65.

[java]public class SineFunc {
    // sin(x) ~ x - x^3/3! + x^5/5! - x^7/7! ...
    public static void main(String[] args) {
        double val = Double.parseDouble(args[0]);
        int j = 1;                        // variable initialization
        int exp;
        double[] array = new double[11];
        int k;
        int denom;
        long start = System.nanoTime();
        for (int i = 0; i < array.length; i++) {
            exp = (2 * j++) + 1;          // odd exponents 3, 5, 7, ...
            denom = 1;
            for (k = 1; k <= exp; k++) {
                denom *= k;               // exp! recomputed from scratch each pass
            }                             // (int overflows above 12!, a likely source of the large error)
            array[i] = Math.pow(val, exp) / denom;
        }
        double sine = val - array[0] + array[1] - array[2] + array[3] - array[4]
                    + array[5] - array[6] + array[7] - array[8] + array[9] - array[10];
        long end = System.nanoTime();
        System.out.printf("%15s \n", "value of sine inefficient implementation of taylor series ");
        System.out.printf(" %.32f \n", sine);
        System.out.println("time of calculation of sine is: " + (end - start) + " \t nanoseconds");
        System.out.println();

        double val2 = 0;
        long start2 = System.nanoTime();
        val2 = Math.sin(val);
        long end2 = System.nanoTime();
        System.out.printf("%15s \n", "Math.sin ");
        System.out.printf("%.32f \n", val2);
        System.out.println("time of calculation of Math.sin java library is : " + (end2 - start2) + "\t nanoseconds");

        long start3 = System.nanoTime();
        double coef1,coef2,coef3,coef4,coef5,coef6,coef7,coef8,coef9,coef10,coef11,radian,sin_result,x1;
        sin_result = 0;
        coef1  = -0.16666666666666666666666666666667;    // reciprocal value of factorial 3!
        coef2  =  0.00833333333333333333333333333333;    // reciprocal value of factorial 5!
        coef3  = -1.984126984126984126984126984127e-4;   // reciprocal value of factorial 7!
        coef4  =  2.7557319223985890652557319223986e-6;  // reciprocal value of factorial 9!
        coef5  = -2.5052108385441718775052108385442e-8;  // reciprocal value of factorial 11!
        coef6  =  1.6059043836821614599392377170155e-10; // reciprocal value of factorial 13!
        coef7  = -7.6471637318198164759011319857881e-13; // reciprocal value of factorial 15!
        coef8  =  2.8114572543455207631989455830103e-15; // reciprocal value of factorial 17!
        coef9  = -8.2206352466243297169559812368723e-18; // reciprocal value of factorial 19!
        coef10 =  1.9572941063391261230847574373505e-20; // reciprocal value of factorial 21!
        coef11 = -3.8681701706306840377169119315228e-23; // reciprocal value of factorial 23!
        radian = val;
        x1 = val * val;   // x^2
        sin_result = radian + radian*x1*(coef1+x1*(coef2+x1*(coef3+x1*(coef4+x1*(coef5+x1*(coef6+x1*(coef7+x1*(coef8+x1*(coef9+x1*(coef10+x1*(coef11)))))))))));   // polynomial fit of the Taylor series
        long end3 = System.nanoTime();
        System.out.printf("%15s \n", "poly fit taylor series");
        System.out.printf("%.32f \n", sin_result);
        System.out.println("time of calculation polynomial fit of sin is : " + (end3 - start3) + " \t nanoseconds ");
        System.out.println("Absolute Error compared to Java Math.sin for inefficient method is : " + (sine - val2));
        System.out.println("Absolute Error compared to Java Math.sin for polynomial fit method is :" + (sin_result - val2));
    }
}[/java]
SergeyKostrov
Valued Contributor II
2,052 Views
...
AE = 0.43496553411123023000000000000000 - 0.43496553411123023000000000000000 = 0.0
...

Did you get a zero AE for the complete range of input values? For example, from -180 degrees to +180 degrees?

I can see from your result that an expansion of the Taylor series to the 21st or 23rd term could be used in order to
get a reference table of sine values.

In my case I know the source of errors, and it is related to the coefficients that I currently use.
Bernard
Valued Contributor I
2,052 Views
Now I am posting my results for the range [0, Pi/2]
sin(0.1), step 0.1

value of sine inefficient implementation of taylor series

0.09986005837615321000000000000000

time of calculation of sine is: 285774 nanoseconds

Math.sin

0.09983341664682815000000000000000

time of calculation of Math.sin java library is : 2717 nanoseconds

poly fit taylor series

0.09983341664682815000000000000000

time of calculation polynomial fit of sin is : 906 nanoseconds

Absolute Error compared to Java Math.sin for inefficient method is : 2.66417293250526E-5

Absolute Error compared to Java Math.sin for polynomial fit method is :0.0
sin(0.2)

value of sine inefficient implementation of taylor series

0.19888046700922576000000000000000

time of calculation of sine is: 264488 nanoseconds

Math.sin

0.19866933079506122000000000000000

time of calculation of Math.sin java library is : 2717 nanoseconds

poly fit taylor series

0.19866933079506122000000000000000

time of calculation polynomial fit of sin is : 906 nanoseconds

Absolute Error compared to Java Math.sin for inefficient method is : 2.1113621416454786E-4

Absolute Error compared to Java Math.sin for polynomial fit method is :0.0

sin(0.3)

value of sine inefficient implementation of taylor series

0.29622157615613700000000000000000

time of calculation of sine is: 265394 nanoseconds

Math.sin

0.29552020666133955000000000000000

time of calculation of Math.sin java library is : 2265 nanoseconds

poly fit taylor series

0.29552020666133955000000000000000

time of calculation polynomial fit of sin is : 906 nanoseconds

Absolute Error compared to Java Math.sin for inefficient method is : 7.013694947974325E-4

Absolute Error compared to Java Math.sin for polynomial fit method is :0.0

sin(0.4)

value of sine inefficient implementation of taylor series

0.39104373607380610000000000000000

time of calculation of sine is: 326082 nanoseconds

Math.sin

0.38941834230865050000000000000000

time of calculation of Math.sin java library is : 2717 nanoseconds

poly fit taylor series

0.38941834230865050000000000000000

time of calculation polynomial fit of sin is : 453 nanoseconds

Absolute Error compared to Java Math.sin for inefficient method is : 0.0016253937651555805

Absolute Error compared to Java Math.sin for polynomial fit method is :0.0
sin(0.5)

value of sine inefficient implementation of taylor series

0.48250729701915250000000000000000

time of calculation of sine is: 302532 nanoseconds

Math.sin

0.47942553860420300000000000000000

time of calculation of Math.sin java library is : 2265 nanoseconds

poly fit taylor series

0.47942553860420300000000000000000

time of calculation polynomial fit of sin is : 906 nanoseconds

Absolute Error compared to Java Math.sin for inefficient method is : 0.003081758414949509

Absolute Error compared to Java Math.sin for polynomial fit method is :0.0
sin(0.6)

value of sine inefficient implementation of taylor series

0.56977260924909550000000000000000

time of calculation of sine is: 360501 nanoseconds

Math.sin

0.56464247339503540000000000000000

time of calculation of Math.sin java library is : 2265 nanoseconds

poly fit taylor series

0.56464247339503540000000000000000

time of calculation polynomial fit of sin is : 906 nanoseconds

Absolute Error compared to Java Math.sin for inefficient method is : 0.0051301358540600805

Absolute Error compared to Java Math.sin for polynomial fit method is :0.0
sin(0.7)

value of sine inefficient implementation of taylor series

0.65200002302055430000000000000000

time of calculation of sine is: 329252 nanoseconds

Math.sin

0.64421768723769100000000000000000

time of calculation of Math.sin java library is : 2265 nanoseconds

poly fit taylor series

0.64421768723769100000000000000000

time of calculation polynomial fit of sin is : 906 nanoseconds

Absolute Error compared to Java Math.sin for inefficient method is : 0.007782335782863248

Absolute Error compared to Java Math.sin for polynomial fit method is :0.0
sin(0.8)

value of sine inefficient implementation of taylor series

0.72834988859044850000000000000000

time of calculation of sine is: 273546 nanoseconds

Math.sin

0.71735609089952280000000000000000

time of calculation of Math.sin java library is : 4529 nanoseconds

poly fit taylor series

0.71735609089952280000000000000000

time of calculation polynomial fit of sin is : 453 nanoseconds

Absolute Error compared to Java Math.sin for inefficient method is : 0.010993797690925677

Absolute Error compared to Java Math.sin for polynomial fit method is :0.0
sin(0.9)

value of sine inefficient implementation of taylor series

0.79798255621569720000000000000000

time of calculation of sine is: 323817 nanoseconds

Math.sin

0.78332690962748340000000000000000

time of calculation of Math.sin java library is : 3623 nanoseconds

poly fit taylor series

0.78332690962748340000000000000000

time of calculation polynomial fit of sin is : 906 nanoseconds

Absolute Error compared to Java Math.sin for inefficient method is : 0.014655646588213833

Absolute Error compared to Java Math.sin for polynomial fit method is :0.0
sin(1.0)

value of sine inefficient implementation of taylor series

0.86005837615321990000000000000000

time of calculation of sine is: 268111 nanoseconds

Math.sin

0.84147098480789650000000000000000

time of calculation of Math.sin java library is : 3170 nanoseconds

poly fit taylor series

0.84147098480789650000000000000000

time of calculation polynomial fit of sin is : 906 nanoseconds

Absolute Error compared to Java Math.sin for inefficient method is : 0.01858739134532339

Absolute Error compared to Java Math.sin for polynomial fit method is :0.0
sin(1.1)

value of sine inefficient implementation of taylor series

0.91373769865993570000000000000000

time of calculation of sine is: 366842 nanoseconds

Math.sin

0.89120736006143540000000000000000

time of calculation of Math.sin java library is : 3623 nanoseconds

poly fit taylor series

0.89120736006143540000000000000000

time of calculation polynomial fit of sin is : 1359 nanoseconds

Absolute Error compared to Java Math.sin for inefficient method is : 0.022530338598500288

Absolute Error compared to Java Math.sin for polynomial fit method is :0.0
sin(1.2)

value of sine inefficient implementation of taylor series

0.95818087399276370000000000000000

time of calculation of sine is: 420736 nanoseconds

Math.sin

0.93203908596722630000000000000000

time of calculation of Math.sin java library is : 3623 nanoseconds

poly fit taylor series

0.93203908596722630000000000000000

time of calculation polynomial fit of sin is : 905 nanoseconds

Absolute Error compared to Java Math.sin for inefficient method is : 0.02614178802553746

Absolute Error compared to Java Math.sin for polynomial fit method is :0.0
sin(1.3)

value of sine inefficient implementation of taylor series

0.99254825240862400000000000000000

time of calculation of sine is: 267658 nanoseconds

Math.sin

0.96355818541719300000000000000000

time of calculation of Math.sin java library is : 3623 nanoseconds

poly fit taylor series

0.96355818541719300000000000000000

time of calculation polynomial fit of sin is : 453 nanoseconds

Absolute Error compared to Java Math.sin for inefficient method is : 0.02899006699143103

Absolute Error compared to Java Math.sin for polynomial fit method is :0.0
sin(1.556) very close to Pi/2

value of sine inefficient implementation of taylor series

1.02879965351909490000000000000000

time of calculation of sine is: 379523 nanoseconds

Math.sin

0.99989053635379590000000000000000

time of calculation of Math.sin java library is : 3623 nanoseconds

poly fit taylor series

0.99989053635379600000000000000000

time of calculation polynomial fit of sin is : 906 nanoseconds

Absolute Error compared to Java Math.sin for inefficient method is : 0.02890911716529898

Absolute Error compared to Java Math.sin for polynomial fit method is :1.1102230246251565E-16

As you can see, only in the last test is there a difference, albeit a small one. AFAIK Java's Math.sin calls the Java StrictMath class, whose algorithms are based on the FDLIBM 5.3 library.
The FDLIBM library uses a polynomial of degree 13 for sin calculation.
Please look also at the very long execution time of the inefficient method; the difference is more than 200 times. I think the array accesses are responsible for this.

SergeyKostrov
Valued Contributor II
2,052 Views
Quoting iliyapolak
...For millions of iterative operations it is nearly impossible to predict what the error will be at some arbitrary step out of the millions
needed to perform the calculation...


The accumulated error can be detected and analyzed. In a very simple test, which I used some time ago, a constant 0.001 is
added 1000 times to a single-precision variable:

[cpp]...
RTuint uiControlWordx87 = 0;
RTint i = 0;
uiControlWordx87 = CrtControl87( _RTFPU_CW_DEFAULT, _RTFPU_MCW_PC );
RTfloat fSum = 0.0f;
for( i = 0; i < 1000; i++ )
{
    fSum += 0.001f;
    CrtPrintf( RTU("Step: %4ld Current Value = %.7f True Value = %.7f AE = % .10f\n"),
               ( RTint )i+1, ( RTfloat )fSum, ( RTfloat )( i+1 ) / 1000.0f,
               ( RTfloat )fSum - ( ( RTfloat )( i+1 ) / 1000.0f ) );
}
CrtPrintf( RTU("Result = %.16f AE = %.16f\n"), fSum, ( fSum - 1.0f ) );
...[/cpp]


The results are as follows, and you can see that errors start almost from the beginning:

[cpp]...
Step:    1 Current Value=0.0010000 True Value=0.001000 AE= 0.0000000000
Step:    2 Current Value=0.0020000 True Value=0.002000 AE= 0.0000000001
Step:    3 Current Value=0.0030000 True Value=0.003000 AE= 0.0000000000
Step:    4 Current Value=0.0040000 True Value=0.004000 AE= 0.0000000002
Step:    5 Current Value=0.0050000 True Value=0.005000 AE= 0.0000000004
...
Step:  995 Current Value=0.9949908 True Value=0.995000 AE= -0.0000092340
Step:  996 Current Value=0.9959908 True Value=0.996000 AE= -0.0000092468
Step:  997 Current Value=0.9969907 True Value=0.997000 AE= -0.0000092597
Step:  998 Current Value=0.9979907 True Value=0.998000 AE= -0.0000092726
Step:  999 Current Value=0.9989907 True Value=0.999000 AE= -0.0000092854
Step: 1000 Current Value=0.9999907 True Value=1.000000 AE= -0.0000092983
Test Result = 0.9999907016754150 AE = -0.0000092983245850
True Result = 1.0000000000000000
...[/cpp]


In order to see the exact error at some step, the binary representations ( in IEEE 754 format ) of
the current and true values have to be used. In other words, a bit-to-bit comparison must be done.

The same test with double-precision variables produced the following results:

[cpp]...
Step:  1 Current Value=0.0010000000000000 True Value=0.0010000000000000 AE=0.0000000000000000
Step:  2 Current Value=0.0020000000000000 True Value=0.0020000000000000 AE=0.0000000000000000
Step:  3 Current Value=0.0030000000000000 True Value=0.0030000000000000 AE=0.0000000000000000
Step:  4 Current Value=0.0040000000000000 True Value=0.0040000000000000 AE=0.0000000000000000
Step:  5 Current Value=0.0050000000000000 True Value=0.0050000000000000 AE=0.0000000000000000
...
Step: 69 Current Value=0.0690000000000000 True Value=0.0690000000000000 AE=0.0000000000000000
Step: 70 Current Value=0.0700000000000000 True Value=0.0700000000000000 AE=0.0000000000000000
Step: 71 Current Value=0.0710000000000000 True Value=0.0710000000000000 AE=0.0000000000000001
...[/cpp]
Bernard
Valued Contributor I
2,052 Views

The accumulated error can be detected and analyzed. In a very simple test, which I used some time ago, a constant 0.001 is
added 1000 times to a single-precision variable


Sergey,
You are right, but only when we take into account a simple calculation, be it the summation or the product of two inexactly represented binary numbers.
Please take a look at my code which calculates the sine function (the "inefficient method") and analyze the algorithm, which uses a for-loop to calculate the sine series coefficients. As you can see, the AE is very large, and although this code at the beginning only involved 4 loop cycles, the running time of this function is 100 times slower than the library function. It is not as easy as with a simple summation to immediately spot the source of an error and explain the very slow running time.
Now think about some physical process which an algorithm tries to describe, with time resolution measured in microseconds and the calculation performed inside maybe nested for-loops with dozens of operations per cycle.
How can you effectively find and track the source of a calculation error and the various contributions to it, and calculate the error's propagation and accumulation? I think this would be a very daunting and tedious task.
Btw, the last sentence was taken from the book "A First Course in Numerical Analysis" by P. Rabinowitz and A. Ralston.
bronxzv
New Contributor II
2,052 Views

I think that this would be a very daunting and tedious task


not so tedious using interval arithmetic

http://en.wikipedia.org/wiki/Interval_arithmetic

it's easy, for example, to design an interval arithmetic framework, best in a language that supports operator overloading like C++ (so that it's just as readable as classical arithmetic), and to see how the intervals get tighter or wider based on the code arrangement; it works well even for very complex formulas and a lot of iterations
Bernard
Valued Contributor I
2,052 Views
In my numerical analysis book nothing is said about interval arithmetic. The authors describe various statistical methods for measuring the dispersion of an error.
SergeyKostrov
Valued Contributor II
2,052 Views
Hi Iliya,

A while ago I posted some results for a test case that multiplies 0.1 by 0.1. I just detected some
inconsistency in the results for the single-precision ( 'float' ) data type. Here is an example:


A test ( 0.1 * 0.1 ) executed from a DLL:

24-bit : [ 0.1 * 0.1 = 0.01000000070780516 ]
53-bit : [ 0.1 * 0.1 = 0.01000000029802323 ]
64-bit : [ 0.1 * 0.1 = 0.01000000029802323 ]
Default : [ 0.1 * 0.1 = 0.01000000029802323 ]

A test ( 0.1 * 0.1 ) executed from an EXE-program:

24-bit : [ 0.1 * 0.1 = 0.01000000070780516 ]
53-bit : [ 0.1 * 0.1 = 0.01000000070780516 ]
64-bit : [ 0.1 * 0.1 = 0.01000000070780516 ]
Default : [ 0.1 * 0.1 = 0.01000000070780516 ]

My question: is it related to the '_control87' or 'printf' CRT functions, or to something else?

Best regards,
Sergey

Bernard
Valued Contributor I
2,052 Views

My question: is it related to the '_control87' or 'printf' CRT functions, or to something else

Hi Sergey
Hard to answer without the running code example. Does your DLL call the printf function, or are you running it under the VS debugger? If it does call printf, put a breakpoint on the printf call (printf is implemented in MSVCRT.DLL) and single-step into the function;
check the floating-point register values or the memory arguments passed to printf carefully, and look for the AE in the processed values. You can do the same procedure with your EXE code.
In order to fully track or follow the printf call chain, a kernel debugger must be used to inspect the memory-mapped I/O or I/O ports used by the video card to read the values into the frame buffer and display them, but I think this would be overkill.
You can also write, in assembly or inline assembly, a simple procedure involving the 0.1 multiplication and look at the FPU register context directly, without formatting and displaying output; by doing this you can narrow down your AE and eliminate any contribution (modification of the result) from the printf function.
Bernard
Valued Contributor I
2,052 Views
[SergeyK] An example is in Post #45. Please take a look.


I wrote my time-measurement code almost exactly like yours, but the difference calculated by the calls to the GetTickCount() function was 0. I read about the time resolution, or granularity, of this function, and the MSDN site clearly states that the resolution is in milliseconds. Therefore I switched to using inline RDTSC for more accurate measurement. So how exactly did you calculate the time delta and achieve nanosecond accuracy?
SergeyKostrov
Valued Contributor II
2,052 Views
Quoting iliyapolak
[SergeyK] An example is in Post #45. Please take a look.


I wrote my time-measurement code almost exactly like yours, but the difference calculated by the calls to the GetTickCount() function was 0. I read about the time resolution, or granularity, of this function, and the MSDN site clearly states that the resolution is in milliseconds.

[SergeyK] Yes, that is correct. I'm always measuring relative performance because many platforms
are supported. I simply need to know the performance ratio between different
functions executed on the same platform, or on different platforms.

Iliya, I'll do a test with the RDTSC instruction and will post my results as soon as they are ready.

Therefore I switched to using inline RDTSC for more accurate measurement. So how exactly did you calculate the time delta and achieve nanosecond accuracy?


I've executed the calls more than 4 million times ( 2^22 ). As I said, I didn't try to measure performance in nanoseconds.
A pseudo code looks like:

...
NUM_OF_TEST = 2^22

StartTime = ::GetTickCount();
for( i=0;i < NUM_OF_TEST; i++)
{
Call to a Sine Function
}
EndTime = ::GetTickCount();

Printf( "Delta: %d", ( EndTime - StartTime ) );
...

Best regards,
Sergey

Bernard
Valued Contributor I
2,052 Views
Hi Sergey
I spent a few hours on performance testing of my elementary functions library (all functions implemented as a polynomial fit). The performance of every function was measured with inline RDTSC. I did not calculate average values over many thousands of runs (I will do that later).
For example, the same implementation of the sin(x) function written in Java and tested with the Java System.nanoTime() method returned, on average over 1000 runs, ~1000 nanoseconds, which means ~2226 CPU cycles (my CPU runs at 2.226 GHz). In comparison, the sin(x) function written in C as a console app and measured with inline RDTSC achieved a speed of execution of ~862 (hex) cycles; translated to nanoseconds it is 964.24. Please bear in mind that the above calculations are not accurate and cannot be relied upon to draw conclusions, because it is well known that on multi-core CPUs with varying frequency (a consequence of SpeedStep technology) accurate time measurement with the help of RDTSC cannot be guaranteed.
Here is my implementation of inline RDTSC.
The running time of CRT sin(x) was 36396 cycles in hex; after translation to decimal units I got a whopping ~99796 nanoseconds. This is the cost of the CRT library call.

[bash]double fastsin(double x){
    double sum = 0;
    double half_pi,zero_arg;
    half_pi  = Pi/2;
    zero_arg = Zero;
    if(x > half_pi){
        return (x-x)/(x-x);   // NaN for an out-of-range argument
    }else if(x < zero_arg){
        return (x-x)/(x-x);
    }else{
        int start_lo,start_hi,time_lo,time_hi;
        _asm{
            xor eax,eax
            xor edx,edx
            cpuid
            rdtsc
            mov start_lo,eax
            mov start_hi,edx
        }
        double coef1,coef2,coef3,coef4,coef5,coef6,coef7,coef8,coef9,coef10,coef11,rad,sqr;
        coef1  = -0.16666666666666666666666666666667;
        coef2  =  0.00833333333333333333333333333333;
        coef3  = -1.984126984126984126984126984127e-4;
        coef4  =  2.7557319223985890652557319223986e-6;
        coef5  = -2.5052108385441718775052108385442e-8;
        coef6  =  1.6059043836821614599392377170155e-10;
        coef7  = -7.6471637318198164759011319857881e-13;
        coef8  =  2.8114572543455207631989455830103e-15;
        coef9  = -8.2206352466243297169559812368723e-18;
        coef10 =  1.9572941063391261230847574373505e-20;
        coef11 = -3.8681701706306840377169119315228e-23;
        rad = x;
        sqr = x*x;   // x^2
        sum = rad+rad*sqr*(coef1+sqr*(coef2+sqr*(coef3+sqr*(coef4+sqr*(coef5+sqr*(coef6+sqr*(coef7+sqr*(coef8+sqr*(coef9+sqr*(coef10+sqr*(coef11)))))))))));
        _asm{
            xor eax,eax
            xor edx,edx
            cpuid
            rdtsc
            sub eax,start_lo
            mov time_lo,eax
            sub edx,start_hi
            mov time_hi,edx
        }
        printf("%15s \n","running time of fastsin() start_lo value :");
        printf("%10d \n",start_lo);
        printf("%15s \n","running time of fastsin(): start_hi value:");
        printf("10d \n",start_hi);   // note: missing '%', so the literal "10d" is printed
        printf("%15s \n"," running time of fastsin() :end_lo value:");
        printf("%10d \n",time_lo);
        printf("%15s \n","running time of fastsin() :end_hi value:");
        printf("%10d \n",time_hi);
        printf(" Sum:%.24f \n", sum);
    }
    return sum;
}[/bash]

and here is console screen from one of the many tests
fastsin(x)
running time of fastsin() start_lo value :
1874052280 // starting value of eax
running time of fastsin(): start_hi value:
10d // starting value of edx
running time of fastsin() :end_lo value:
788 // eax value after subtraction
running time of fastsin() :end_hi value:
0 // edx value after subtraction
Sum:0.434965534111230230000000
fastcos(x)
running time of fastcos() start_lo value :
1876461176
running time of fastcos(): start_hi value:
10d
running time of fastcos() :end_lo value:
LoDelta : 1084
running time of fastcos() :end_hi value:
HiDelta : 0
Sum: 0.852524522059505680000000
fasttan()
running time of fasttan() start_lo value :
1878359514
running time of fasttan(): start_hi value:
10d
running time of fasttan() :end_lo value:
LoDelta : 668
running time of fasttan() :end_hi value:
HiDelta : 0
Sum :0.483055065616578410000000
time measurement of Library sin()
Delta is: 36396 // LIBRARY CRT call
Result :0.434965534111230230000000
SergeyKostrov
Valued Contributor II
2,052 Views
Quoting iliyapolak
...that means ~2226 CPU cycles (my CPU runs at 2.226 GHz). In comparison, the sin(x) function written in C as a console app and measured with inline RDTSC achieved a speed of execution of ~862 (hex) cycles...

862 (hex) = 2146 (dec), and it is consistent with the 2226 number. In order to improve accuracy you can do the following:

- Do all tests on the 2nd CPU
- Switch the priority of the test process to Real-time ( that really improves the numbers by ~1% to ~2% )

Best regards,
Sergey
bronxzv
New Contributor II
2,052 Views
CPUID and RDTSC are very long latency instructions, so you are not really measuring the time to compute your polynomial at the moment; the actual time for 12 mul + 11 add must be much lower than what you report, something like 15-30 cycles with 100% cache hit

the best will be to follow Sergey's advice: call the polynomial approximation a lot of times (for ex. 1 million times) in between your CPUID+RDTSC calls; you must be careful that your compiler doesn't optimize it into a single call

if you insist on calling it only once you must remove a delta value from your timings; you can for example comment out all the code between the ASM blocks to do some calibration, and you will see how much "time" it takes to do nothing
Bernard
Valued Contributor I
2,052 Views


- Do all tests on the 2nd CPU
- Switch the priority of the test process to Real-time ( that really improves the numbers by ~1% to ~2% )


I ran the tests on the second CPU and also set the priority to 24 (Real-Time), and I did not observe any significant change in speed of execution. I measured the performance of CRT sin(x) with a 2^20-iteration loop, and the average value was 1919.0991 cycles.
I disassembled msvcrt.dll and inspected the sin() function. From what I have been able to understand, CRT sin() compares the argument to 0 in an if/else statement. If arg > 0, sin() jumps to a subprocedure which calculates the sine with the help of the built-in fsin instruction. AFAIK fsin has a throughput of ~40-100 cycles. I suppose that fsin also performs range reduction and overflow checking.
TimP
Honored Contributor III
2,055 Views
This situation would show FMA to best advantage, but I don't know of a current architecture in which FMA has a total latency of 2 clock cycles. On a Core i7, the total latency of a multiply and an add in series is more like 11+ cycles.
Bernard
Valued Contributor I
2,055 Views

bronxzv

if you insist on calling it only once you must remove a delta value from your timings; you can for example comment out all the code between the ASM blocks to do some calibration, and you will see how much "time" it takes to do nothing

I did exactly as you advised. I performed a few tests with the polynomial approximation commented out. The result oscillated around 400 cycles. RDTSC and CPUID together have a cost of ~224 cycles, and executing these instructions twice completely "destroyed" my measurement. I looked at the assembly and saw that the compiler used MOVSD, ADDSD, and MULSD instructions for the polynomial calculation. From the beginning I suspected that a handful of simple XMM instructions could not be responsible for those hundreds of cycles.
When I tested my assembly implementation of the Taylor expansion for sin(x), I ran my code (single term) inside a 1,000,000-iteration loop; the average value was ~4.25 cycles per 4 XMM instructions.
bronxzv
New Contributor II
2,055 Views
indeed, for the total latency my values are too optimistic; my estimates are more for the rcp throughput that one will get when processing polynomials in a loop
bronxzv
New Contributor II
2,055 Views
I encourage you to read this insightful paper to see how you can get the best from RDTSCP or RDTSC:
http://download.intel.com/embedded/software/IA/324264.pdf

please don't forget to tell us at the end how much faster the optimized version is than your first naive implementation