Hello!
While coding various series expansions of functions in assembly, I tried to optimize my code mainly by using the rcpps instruction instead of divps. When testing code which calculates the sine function by its Taylor series, I did a few measurements as explained here: http://software.intel.com/en-us/forums/showthread.php?t=52482 , and the result was between 10 and 12 cycles per first term of the sine expansion (I used 14 terms).
I would like to ask how I can rewrite this code in order to improve its speed of execution.
[bash]movups xmm0,argument
movups xmm1,argument
mulps  xmm1,xmm1        ; x^2
mulps  xmm1,xmm0        ; x^3
mov    ebx,OFFSET coef
movups xmm2,[ebx]       ; load the divisor (3!)
rcpps  xmm3,xmm2        ; approximate 1/3!, used instead of divps
mulps  xmm1,xmm3        ; x^3 * (1/3!)
subps  xmm0,xmm1        ; x - x^3/3![/bash]
"calls 1e6 times fastsin(), the result in milliseconds is 63"
So it's 63 ns per iteration, or ~120 clocks on your CPU; it doesn't match your previous reports IIRC.
If you keep only the polynomial (get rid of the strange domain check) you should begin to see timings nearer to mine.
I did a very interesting comparison of three methods of sin(x) calculation.
The first method is based entirely on on-the-fly coefficient calculation (nothing is pre-calculated): it uses a for loop, array accesses, and Math.pow calls to obtain each Taylor-series term x^n/n!.
There is a very large absolute error when the result is compared to the library Math.sin function. I suppose that the source of the error is the accumulated round-off of the divisions in the for loop. Time measurement of this method was based on Java's System.nanoTime(), which returns the time difference in nanoseconds. The result was a whopping 274415 nanoseconds. I suppose the reason for this is very inefficient code: two array accesses, for-loop overhead, many calls to the external library Math.pow, divisions, and an assignment inside the loop.
The second method was much improved compared to the first one. I simply pre-calculated the reciprocals of the factorials and did polynomial multiplication. The running time was 907 nanoseconds, compared to the Math.sin call (2264 nanoseconds). The absolute error was 0.0 when compared to library Math.sin.
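The pre-calculated-reciprocals method described above can be sketched in C as follows. This is a minimal illustration, not the poster's actual code; the term count (through x^15) and the Horner arrangement are my assumptions.

```c
#include <math.h>

/* sin(x) by a Taylor polynomial through the x^15 term, with the
 * reciprocal factorials pre-calculated as constants.  The loop,
 * array accesses and pow()-style calls of the naive method are
 * replaced by one Horner evaluation in x^2. */
double poly_sin(double x)
{
    static const double c[] = {
        -1.0 / 6.0,                 /* -1/3!  */
         1.0 / 120.0,               /*  1/5!  */
        -1.0 / 5040.0,              /* -1/7!  */
         1.0 / 362880.0,            /*  1/9!  */
        -1.0 / 39916800.0,          /* -1/11! */
         1.0 / 6227020800.0,        /*  1/13! */
        -1.0 / 1307674368000.0      /* -1/15! */
    };
    double x2 = x * x;
    double p = c[6];
    for (int i = 5; i >= 0; i--)
        p = c[i] + x2 * p;          /* Horner scheme in x^2 */
    return x + x * x2 * p;
}
```

For small arguments such as 0.45 the result agrees with the library sin() to near machine precision, which is consistent with the zero absolute error reported below.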
Here are the results of three sin(0.45)functions compared for speed of execution and accuracy:
value of sine inefficient implementation of taylor series
0.43724781952696220000000000000000
time of calculation of sine is: 274452 nanoseconds
Math.sin
0.43496553411123023000000000000000
time of calculation of Math.sin java library is : 2264 nanoseconds
poly fit taylor series
0.43496553411123023000000000000000
time of calculation polynomial fit of sin is : 906 nanoseconds
Absolute Error compared to Java Math.sin for inefficient method is : 0.002282285415731944
Absolute Error compared to Java Math.sin for polynomial fit method is :0.0
Note that there isn't the notion of precision control with SSEx/AVXx anymore: all SS/PS instructions work at 24-bit precision and SD/PD instructions at 53-bit precision. MXCSR has 2 rounding-mode bits but no FP precision control (only the precision exception flag and mask), unlike the x87 control word.
For high-precision calculation SSE cannot beat the scalar x87 FPU.
Please compare the results in post #65.
[bash]public class SineFunc {
    // sin(x) ~ x - x^3/3! + x^5/5! - x^7/7! + ...
    public static void main(String[] args) {
        double val = Double.parseDouble(args[0]);
        double sum = val;
        double[] array = new double[11];
        long start = System.nanoTime();
        for (int i = 0; i < array.length; i++) {
            int exp = (2 * (i + 1)) + 1;        // 3, 5, 7, ...
            int denom = 1;                      // int factorial: overflows for exp >= 13,
            for (int k = 2; k <= exp; k++) {    // a likely source of the large absolute error
                denom *= k;
            }
            array[i] = Math.pow(val, exp) / denom;
            sum += (i % 2 == 0) ? -array[i] : array[i];
        }
        long end = System.nanoTime();
        System.out.println("sin(" + val + ") = " + sum);
        System.out.println("time of calculation of sine is: " + (end - start) + " nanoseconds");
    }
}[/bash]
AE = 0.43496553411123023 - 0.43496553411123023 = 0.0
...
Did you get zero AE for the complete range of input values? For example, from -180 degrees to +180 degrees?
I can see from your result that expansion of the Taylor series to the 21st or 23rd term could be used in order to get a reference table of sine values.
In my case I know the source of errors, and it is related to the coefficients I currently use.
sin(0.1) , step 0.1
value of sine inefficient implementation of taylor series
0.09986005837615321000000000000000
time of calculation of sine is: 285774 nanoseconds
Math.sin
0.09983341664682815000000000000000
time of calculation of Math.sin java library is : 2717 nanoseconds
poly fit taylor series
0.09983341664682815000000000000000
time of calculation polynomial fit of sin is : 906 nanoseconds
Absolute Error compared to Java Math.sin for inefficient method is : 2.66417293250526E-5
Absolute Error compared to Java Math.sin for polynomial fit method is :0.0
sin(0.2)
value of sine inefficient implementation of taylor series
0.19888046700922576000000000000000
time of calculation of sine is: 264488 nanoseconds
Math.sin
0.19866933079506122000000000000000
time of calculation of Math.sin java library is : 2717 nanoseconds
poly fit taylor series
0.19866933079506122000000000000000
time of calculation polynomial fit of sin is : 906 nanoseconds
Absolute Error compared to Java Math.sin for inefficient method is : 2.1113621416454786E-4
Absolute Error compared to Java Math.sin for polynomial fit method is :0.0
sin(0.3)
value of sine inefficient implementation of taylor series
0.29622157615613700000000000000000
time of calculation of sine is: 265394 nanoseconds
Math.sin
0.29552020666133955000000000000000
time of calculation of Math.sin java library is : 2265 nanoseconds
poly fit taylor series
0.29552020666133955000000000000000
time of calculation polynomial fit of sin is : 906 nanoseconds
Absolute Error compared to Java Math.sin for inefficient method is : 7.013694947974325E-4
Absolute Error compared to Java Math.sin for polynomial fit method is :0.0
sin(0.4)
value of sine inefficient implementation of taylor series
0.39104373607380610000000000000000
time of calculation of sine is: 326082 nanoseconds
Math.sin
0.38941834230865050000000000000000
time of calculation of Math.sin java library is : 2717 nanoseconds
poly fit taylor series
0.38941834230865050000000000000000
time of calculation polynomial fit of sin is : 453 nanoseconds
Absolute Error compared to Java Math.sin for inefficient method is : 0.0016253937651555805
Absolute Error compared to Java Math.sin for polynomial fit method is :0.0
sin(0.5)
value of sine inefficient implementation of taylor series
0.48250729701915250000000000000000
time of calculation of sine is: 302532 nanoseconds
Math.sin
0.47942553860420300000000000000000
time of calculation of Math.sin java library is : 2265 nanoseconds
poly fit taylor series
0.47942553860420300000000000000000
time of calculation polynomial fit of sin is : 906 nanoseconds
Absolute Error compared to Java Math.sin for inefficient method is : 0.003081758414949509
Absolute Error compared to Java Math.sin for polynomial fit method is :0.0
sin(0.6)
value of sine inefficient implementation of taylor series
0.56977260924909550000000000000000
time of calculation of sine is: 360501 nanoseconds
Math.sin
0.56464247339503540000000000000000
time of calculation of Math.sin java library is : 2265 nanoseconds
poly fit taylor series
0.56464247339503540000000000000000
time of calculation polynomial fit of sin is : 906 nanoseconds
Absolute Error compared to Java Math.sin for inefficient method is : 0.0051301358540600805
Absolute Error compared to Java Math.sin for polynomial fit method is :0.0
sin(0.7)
value of sine inefficient implementation of taylor series
0.65200002302055430000000000000000
time of calculation of sine is: 329252 nanoseconds
Math.sin
0.64421768723769100000000000000000
time of calculation of Math.sin java library is : 2265 nanoseconds
poly fit taylor series
0.64421768723769100000000000000000
time of calculation polynomial fit of sin is : 906 nanoseconds
Absolute Error compared to Java Math.sin for inefficient method is : 0.007782335782863248
Absolute Error compared to Java Math.sin for polynomial fit method is :0.0
sin(0.8)
value of sine inefficient implementation of taylor series
0.72834988859044850000000000000000
time of calculation of sine is: 273546 nanoseconds
Math.sin
0.71735609089952280000000000000000
time of calculation of Math.sin java library is : 4529 nanoseconds
poly fit taylor series
0.71735609089952280000000000000000
time of calculation polynomial fit of sin is : 453 nanoseconds
Absolute Error compared to Java Math.sin for inefficient method is : 0.010993797690925677
Absolute Error compared to Java Math.sin for polynomial fit method is :0.0
sin(0.9)
value of sine inefficient implementation of taylor series
0.79798255621569720000000000000000
time of calculation of sine is: 323817 nanoseconds
Math.sin
0.78332690962748340000000000000000
time of calculation of Math.sin java library is : 3623 nanoseconds
poly fit taylor series
0.78332690962748340000000000000000
time of calculation polynomial fit of sin is : 906 nanoseconds
Absolute Error compared to Java Math.sin for inefficient method is : 0.014655646588213833
Absolute Error compared to Java Math.sin for polynomial fit method is :0.0
sin(1.0)
value of sine inefficient implementation of taylor series
0.86005837615321990000000000000000
time of calculation of sine is: 268111 nanoseconds
Math.sin
0.84147098480789650000000000000000
time of calculation of Math.sin java library is : 3170 nanoseconds
poly fit taylor series
0.84147098480789650000000000000000
time of calculation polynomial fit of sin is : 906 nanoseconds
Absolute Error compared to Java Math.sin for inefficient method is : 0.01858739134532339
Absolute Error compared to Java Math.sin for polynomial fit method is :0.0
sin(1.1)
value of sine inefficient implementation of taylor series
0.91373769865993570000000000000000
time of calculation of sine is: 366842 nanoseconds
Math.sin
0.89120736006143540000000000000000
time of calculation of Math.sin java library is : 3623 nanoseconds
poly fit taylor series
0.89120736006143540000000000000000
time of calculation polynomial fit of sin is : 1359 nanoseconds
Absolute Error compared to Java Math.sin for inefficient method is : 0.022530338598500288
Absolute Error compared to Java Math.sin for polynomial fit method is :0.0
sin(1.2)
value of sine inefficient implementation of taylor series
0.95818087399276370000000000000000
time of calculation of sine is: 420736 nanoseconds
Math.sin
0.93203908596722630000000000000000
time of calculation of Math.sin java library is : 3623 nanoseconds
poly fit taylor series
0.93203908596722630000000000000000
time of calculation polynomial fit of sin is : 905 nanoseconds
Absolute Error compared to Java Math.sin for inefficient method is : 0.02614178802553746
Absolute Error compared to Java Math.sin for polynomial fit method is :0.0
sin(1.3)
value of sine inefficient implementation of taylor series
0.99254825240862400000000000000000
time of calculation of sine is: 267658 nanoseconds
Math.sin
0.96355818541719300000000000000000
time of calculation of Math.sin java library is : 3623 nanoseconds
poly fit taylor series
0.96355818541719300000000000000000
time of calculation polynomial fit of sin is : 453 nanoseconds
Absolute Error compared to Java Math.sin for inefficient method is : 0.02899006699143103
Absolute Error compared to Java Math.sin for polynomial fit method is :0.0
sin(1.556) very close to Pi/2
value of sine inefficient implementation of taylor series
1.02879965351909490000000000000000
time of calculation of sine is: 379523 nanoseconds
Math.sin
0.99989053635379590000000000000000
time of calculation of Math.sin java library is : 3623 nanoseconds
poly fit taylor series
0.99989053635379600000000000000000
time of calculation polynomial fit of sin is : 906 nanoseconds
Absolute Error compared to Java Math.sin for inefficient method is : 0.02890911716529898
Absolute Error compared to Java Math.sin for polynomial fit method is :1.1102230246251565E-16
As you can see, only in the last test do we have a difference, albeit a small one. AFAIK Java Math.sin calls the Java StrictMath class, whose algorithms are based on the FDLIBM 5.3 library.
For sin calculation the FDLIBM library uses a polynomial of degree 13.
Please look also at the very long execution time of the inefficient method: the difference is more than 200 times. I think that the array accesses are responsible for this.
...needed to perform calculation...
The accumulated error can be detected and analyzed. In a very simple test, which I used some time ago, a constant 0.001 is added 1000 times to a single-precision variable.
The results are as follows, and you can see that errors started almost from the beginning:
In order to see the exact error at some step, the binary representations (in IEEE 754 format) of the current and true values have to be used. In other words, a bit-to-bit comparison must be done.
The same test with double-precision variables calculated the result as follows:
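The 0.001-added-1000-times test can be sketched in C (the function names are mine; the original code was not preserved in this post):

```c
#include <math.h>

/* Add 0.001 to an accumulator 1000 times.  The exact sum is 1.0,
 * so the return value is the accumulated rounding error. */
float accum_error_float(void)
{
    float s = 0.0f;
    for (int i = 0; i < 1000; i++)
        s += 0.001f;                 /* each add rounds to 24 bits */
    return (float)fabs(1.0 - (double)s);
}

double accum_error_double(void)
{
    double s = 0.0;
    for (int i = 0; i < 1000; i++)
        s += 0.001;                  /* each add rounds to 53 bits */
    return fabs(1.0 - s);
}
```

The single-precision error is orders of magnitude larger than the double-precision one, which is the effect described above.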
"The accumulated error can be detected and analyzed. In a very simple test, which I used some time ago, a constant 0.001 is added 1000 times to a single-precision variable."
Sergey,
You are right, but only when we take into account a simple calculation, be it the summation or the product of two inexactly represented binary numbers.
Please take a look at my code which calculates the sine function (the "inefficient method") and analyze the algorithm, which uses a for loop to calculate the sine series coefficients. As you can see, the AE is very large, and at the beginning this code involved only 4 loop cycles, yet the running time of this function is 100 times slower than the library function. It is not as easy as with simple summation to immediately spot the source of the error and explain the very slow running time.
Now think about some physical process which an algorithm tries to describe, with time resolution measured in microseconds and the calculation performed inside maybe-nested for loops with dozens of operations per cycle.
How can you effectively find and track the source of a calculation error and the various contributions to it, and calculate the error's propagation and accumulation? I think that this would be a very daunting and tedious task.
Btw, the last sentence was taken from the book "A First Course in Numerical Analysis" by A. Ralston and P. Rabinowitz.
"I think that this would be a very daunting and tedious task"
Not so tedious using interval arithmetic:
http://en.wikipedia.org/wiki/Interval_arithmetic
It's easy, for example, to design an interval arithmetic framework, best in a language that supports operator overloading like C++ (so that it's just as readable as classical arithmetic), and to see how the intervals become tighter or wider depending on the code arrangement. It works well even for very complex formulas and a lot of iterations.
A while ago I posted some results for a test case that multiplies 0.1 by 0.1. I just detected some inconsistency of results for the single-precision ( 'float' ) data type. Here is an example:
A test ( 0.1 * 0.1 ) executed from a DLL:
24-bit : [ 0.1 * 0.1 = 0.01000000070780516 ]
53-bit : [ 0.1 * 0.1 = 0.01000000029802323 ]
64-bit : [ 0.1 * 0.1 = 0.01000000029802323 ]
Default : [ 0.1 * 0.1 = 0.01000000029802323 ]
A test ( 0.1 * 0.1 ) executed from an EXE-program:
24-bit : [ 0.1 * 0.1 = 0.01000000070780516 ]
53-bit : [ 0.1 * 0.1 = 0.01000000070780516 ]
64-bit : [ 0.1 * 0.1 = 0.01000000070780516 ]
Default : [ 0.1 * 0.1 = 0.01000000070780516 ]
My question: is it related to the '_control87' or 'printf' CRT functions, or something else?
Best regards,
Sergey
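For what it's worth, both printed values can be reproduced in portable C, independent of any DLL/EXE difference (the function names are mine): the 24-bit figure is 0.1f squared in single precision, and the 53-bit figure is the same single-precision operand squared in double precision.

```c
/* The two products from the tables above.  In both cases the operand
 * is the 24-bit rounding of 0.1; what differs is the precision at
 * which the multiply itself is rounded. */
double prod_24bit(void)
{
    float x = 0.1f;
    float p = x * x;        /* assignment forces rounding to 24 bits */
    return (double)p;       /* 0.01000000070780516 */
}

double prod_53bit(void)
{
    float x = 0.1f;
    return (double)x * (double)x;   /* 0.01000000029802323 */
}
```

Which of the two a given build produces depends on the effective precision of the multiply, which is what the x87 precision-control settings in the tables change.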
Hi Sergey,
"My question: is it related to the '_control87' or 'printf' CRT functions, or something else?"
Hard to answer without a running code example. Does your DLL call the printf function, or are you running it under the VS debugger? If it does call printf, put a breakpoint on the printf call (printf is implemented in MSVCRT.DLL) and single-step into the function;
check the floating-point register values or the memory arguments passed to printf carefully, and look for the AE in the processed values. You can do the same procedure with your EXE code.
In order to fully track the printf call chain, a kernel debugger would have to be used to inspect the memory-mapped I/O or I/O ports used by the video card to read the values into the frame buffer and display them, but I think this would be overkill.
You can also write in assembly, or inline assembly, a simple procedure involving the 0.1 multiplication and look at the FPU register context directly, without formatting and displaying the output. By doing this you can narrow down your AE and eliminate any contribution (modification of the result) from the printf function.
I wrote my time-measurement code almost exactly as you did, but the difference calculated by the calls to the GetTickCount() function was 0. I read about the time resolution (granularity) of this function, and the MSDN site clearly states that the resolution is in milliseconds. Therefore I switched to using inline RDTSC for more accurate measurement. So how exactly did you calculate the time delta and achieve nanosecond accuracy?
"I wrote my time-measurement code almost exactly as you did, but the difference calculated by the calls to the GetTickCount() function was 0. I read about the time resolution (granularity) of this function, and the MSDN site clearly states that the resolution is in milliseconds."
[SergeyK] Yes, that is correct. I'm always measuring relative performance because many platforms are supported. I simply need to know the performance ratio between different functions executed on the same platform, or on different platforms.
Iliya, I'll do a test with the RDTSC instruction and will post my results as soon as they are ready.
"Therefore I switched to using inline RDTSC for more accurate measurement. So how exactly did you calculate the time delta and achieve nanosecond accuracy?"
I've executed the calls more than 4 million times ( 2^22 ). As I said, I didn't try to measure performance in nanoseconds.
A pseudo code looks like:
...
NUM_OF_TEST = 2^22
StartTime = ::GetTickCount();
for( i=0;i < NUM_OF_TEST; i++)
{
Call to a Sine Function
}
EndTime = ::GetTickCount();
Printf( "Delta: %d", ( EndTime - StartTime ) );
...
Best regards,
Sergey
I spent a few hours on performance testing of my elementary functions library (all functions implemented as polynomial fits). The performance of every function was measured with inline RDTSC. I did not calculate average values over many thousand runs (I will do that later).
For example, the same implementation of the sin(x) function written in Java and tested with the Java System.nanoTime() method returned, on average over 1000 runs, ~1000 nanoseconds; that means ~2226 CPU cycles (my CPU runs at 2.226 GHz). In comparison, the sin(x) function written in C as a console app and measured with inline RDTSC achieved an execution time of ~862 (hex) cycles; translated to nanoseconds that is ~964. Please bear in mind that the above calculations are not accurate and cannot be relied upon to draw conclusions, because it is well known that on multi-core CPUs with varying frequency (a consequence of SpeedStep technology) accurate time measurement with RDTSC cannot be guaranteed.
Here is my implementation of inline RDTSC.
The running time of CRT sin(x) was 36396 cycles in hex; after translation to decimal units I got a whopping ~99796 nanoseconds. This is the cost of the CRT library call.
[bash]double fastsin(double x){
    double sum = 0;
    double half_pi,zero_arg;
    half_pi = Pi/2;
    zero_arg = Zero;
    if(x > half_pi){
        return (x-x)/(x-x);   /* 0/0 = NaN signals an out-of-range argument */
    }else if (x < zero_arg){
        return (x-x)/(x-x);
    }else{
        int start_lo,start_hi,time_lo,time_hi;
        _asm{
            xor eax,eax
            xor edx,edx
            cpuid             ; serialize before reading the TSC
            rdtsc
            mov start_lo,eax
            mov start_hi,edx
        }
        double coef1,coef2,coef3,coef4,coef5,coef6,coef7,coef8,coef9,coef10,coef11,rad,sqr;
        coef1  = -0.16666666666666666666666666666667;     /* -1/3!  */
        coef2  =  0.00833333333333333333333333333333;     /*  1/5!  */
        coef3  = -1.984126984126984126984126984127e-4;    /* -1/7!  */
        coef4  =  2.7557319223985890652557319223986e-6;   /*  1/9!  */
        coef5  = -2.5052108385441718775052108385442e-8;   /* -1/11! */
        coef6  =  1.6059043836821614599392377170155e-10;  /*  1/13! */
        coef7  = -7.6471637318198164759011319857881e-13;  /* -1/15! */
        coef8  =  2.8114572543455207631989455830103e-15;  /*  1/17! */
        coef9  = -8.2206352466243297169559812368723e-18;  /* -1/19! */
        coef10 =  1.9572941063391261230847574373505e-20;  /*  1/21! */
        coef11 = -3.8681701706306840377169119315228e-23;  /* -1/23! */
        rad = x;
        sqr = x*x;  /* x^2 */
        sum = rad+rad*sqr*(coef1+sqr*(coef2+sqr*(coef3+sqr*(coef4+sqr*(coef5+sqr*(coef6+sqr*(coef7+sqr*(coef8+sqr*(coef9+sqr*(coef10+sqr*(coef11)))))))))));
        _asm{
            xor eax,eax
            xor edx,edx
            cpuid
            rdtsc
            sub eax,start_lo  ; low 32 bits of the delta
            mov time_lo,eax
            sub edx,start_hi
            mov time_hi,edx
        }
        printf("%s\n","running time of fastsin() start_lo value :");
        printf("%10d\n",start_lo);
        printf("%s\n","running time of fastsin(): start_hi value:");
        printf("%10d\n",start_hi);
        printf("%s\n"," running time of fastsin() :end_lo value:");
        printf("%10d\n",time_lo);
        printf("%s\n","running time of fastsin() :end_hi value:");
        printf("%10d\n",time_hi);
        printf(" Sum: %.24f\n", sum);
    }
    return sum;
}[/bash]
and here is console screen from one of the many tests
fastsin(x)
running time of fastsin() start_lo value :
1874052280 // starting value of eax
running time of fastsin(): start_hi value:
10d // starting value of edx (the literal "10d" was printed because '%' was missing from the printf format string)
running time of fastsin() :end_lo value:
788 // eax value after subtraction
running time of fastsin() :end_hi value:
0 // edx value after subtraction
Sum:0.434965534111230230000000
fastcos(x)
running time of fastcos() start_lo value :
1876461176
running time of fastcos(): start_hi value:
10d
running time of fastcos() :end_lo value:
LoDelta : 1084
running time of fastcos() :end_hi value:
HiDelta : 0
Sum: 0.852524522059505680000000
fasttan()
running time of fasttan() start_lo value :
1878359514
running time of fasttan(): start_hi value:
10d
running time of fasttan() :end_lo value:
LoDelta : 668
running time of fasttan() :end_hi value:
HiDelta : 0
Sum :0.483055065616578410000000
time measurement of Library sin()
Delta is: 36396 // LIBRARY CRT call
Result :0.434965534111230230000000
862 (hex) = 2146 (dec), and that is consistent with your 2226 number. In order to improve accuracy you can do the following:
- Do all tests on the 2nd CPU
- Switch the priority of the test process to Real-time ( that really shows better numbers, by ~1% to ~2% )
Best regards,
Sergey
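The hex-to-decimal and cycles-to-nanoseconds arithmetic used in this exchange can be checked with a small helper (a sketch; the 2.226 GHz clock is the value quoted earlier in the thread):

```c
#include <stdint.h>
#include <stdlib.h>

/* Convert a cycle count printed in hexadecimal to nanoseconds at
 * the given core clock (GHz is cycles per nanosecond). */
double hex_cycles_to_ns(const char *hex_cycles, double ghz)
{
    uint64_t cycles = strtoull(hex_cycles, NULL, 16);
    return (double)cycles / ghz;
}
```

For example, hex_cycles_to_ns("862", 2.226) gives ~964 ns, matching the figure quoted for the C version of sin(x).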
The best will be to follow Sergey's advice: call the polynomial approximation a lot of times (for example 1 million times) between your CPUID+RDTSC pairs; you must be careful that your compiler doesn't optimize it into a single call.
If you insist on calling it only once, you must remove a delta value from your timings. You can, for example, comment out all the code between the ASM blocks to do some calibration; you will see how much "time" it takes to do nothing.
- Do all tests on the 2nd CPU
- Switch the priority of the test process to Real-time ( that really shows better numbers, by ~1% to ~2% )
I ran the tests on the second CPU and also set the priority to 24 (Real-Time), and did not observe any significant change in speed of execution. I measured the performance of CRT sin(x) with a 2^20-iteration loop; the average value was 1919.0991 cycles.
I disassembled msvcrt.dll and inspected the sin() function. From what I have been able to understand, CRT sin() compares the argument to 0 in an if/else statement. If arg > 0, sin() jumps to a subprocedure which calculates sine with the help of the built-in instruction fsin. AFAIK fsin takes roughly 40-100 cycles. I suppose that fsin also performs range reduction and overflow checking.
I did exactly as you advised. I performed a few tests with the polynomial approximation commented out. The result oscillated around 400 cycles. RDTSC and CPUID together cost ~224 cycles, and executing these instructions twice completely "destroyed" my measurement. I looked at the assembly and saw that the compiler used MOVSD, ADDSD, and MULSD instructions for the polynomial calculation. From the beginning I suspected that a handful of simple XMM instructions could not be responsible for those hundreds of cycles.
bronxzv:
"if you insist on calling it only once, you must remove a delta value from your timings; you can for example comment out all the code between the ASM blocks to do some calibration, and you will see how much "time" it takes to do nothing"
When I tested my assembly implementation of the Taylor expansion for sin(x), I ran the code (a single term) inside a 1,000,000-iteration loop; the average value was ~4.25 cycles per 4 XMM instructions.
http://download.intel.com/embedded/software/IA/324264.pdf
Please don't forget to tell us at the end how much faster the optimized version is than your first naive implementation.
