Intel® ISA Extensions

Optimization of sine function's Taylor expansion

Bernard
Valued Contributor I
10,128 Views

Hello!
While coding various series expansions of many functions in assembly, I tried to optimize my code mainly by using the rcpps instruction instead of divps. When performing tests on code which calculates the sine function by its Taylor series, I did a few measurements as explained here: http://software.intel.com/en-us/forums/showthread.php?t=52482 and the result was between 10-12 cycles for the first term of the sine expansion (I used 14 terms).
I would like to ask how I can rewrite this code to improve its speed of execution.
[bash]movups xmm0,argument
movups xmm1,argument
mulps  xmm1,xmm1        ; x^2
mulps  xmm1,xmm0        ; x^3
mov    ebx,OFFSET coef
movups xmm2,[ebx]
rcpps  xmm3,xmm2        ; rcpps used instead of divps
mulps  xmm1,xmm3        ; x^3 * (1/coef)
subps  xmm0,xmm1        ; x - x^3/coef[/bash]

0 Kudos
1 Solution
bronxzv
New Contributor II
9,937 Views

calls fastsin() 1e6 times; the result in milliseconds is 63

so it's 63 ns per iteration, or ~120 clocks on your CPU; it doesn't match your previous reports IIRC

if you keep only the polynomial (get rid of the strange domain check) you should begin to see timings nearer to mine

View solution in original post

0 Kudos
342 Replies
Bernard
Valued Contributor I
538 Views

please don't forget to tell us at the end how much faster the optimized version is than your first naive implementation

bronxzv
Thanks for the PDF, it was a great read. On average, a few optimized methods (polynomial approximations and pre-calculated coefficients, in Java), when compared to the same methods based on a for-loop with library function calls and divisions, are 5-6 times faster and sometimes even 10 times faster. Now I have to optimize 4000 lines of code :).

Btw, when I must write code for some esoteric function like the Kelvin function ker(x), precalculating the coefficients is not such an easy task, so a procedure must be written to automate it based on the for-loop iterations.


0 Kudos
bronxzv
New Contributor II
539 Views
out of curiosity I have tested it; the polynomial discussed here is evaluated in 25.5 +/- 0.2 cycles on Ivy Bridge from data in the L1D$ with the scalar version shown below (vectorization avoided using /Qvec-)

with 4x unrolling it's down to 23.6 +/- 0.2 cycles
with vectorization and 4x unrolling we are at 6.0 +/- 0.1 cycles


ASM :

[plain].B51.6::                                                 ; Preds .B51.6 .B51.5 ; Infreq
        vmovsd xmm2, QWORD PTR [544+rbp+rax*8]                  ;480.47
        vmulsd xmm0, xmm2, xmm2                                 ;480.35
        vmulsd xmm3, xmm0, QWORD PTR [_2il0floatpacket.6991]    ;480.35
        vmulsd xmm1, xmm2, xmm0                                 ;480.35
        vaddsd xmm4, xmm3, QWORD PTR [_2il0floatpacket.6990]    ;480.35
        vmulsd xmm5, xmm0, xmm4                                 ;480.35
        vaddsd xmm11, xmm5, QWORD PTR [_2il0floatpacket.6989]   ;480.35
        vmulsd xmm12, xmm0, xmm11                               ;480.35
        vaddsd xmm13, xmm12, QWORD PTR [_2il0floatpacket.6988]  ;480.35
        vmulsd xmm14, xmm0, xmm13                               ;480.35
        vaddsd xmm15, xmm14, QWORD PTR [_2il0floatpacket.6987]  ;480.35
        vmulsd xmm3, xmm0, xmm15                                ;480.35
        vaddsd xmm4, xmm3, QWORD PTR [_2il0floatpacket.6986]    ;480.35
        vmulsd xmm5, xmm0, xmm4                                 ;480.35
        vaddsd xmm11, xmm5, QWORD PTR [_2il0floatpacket.6985]   ;480.35
        vmulsd xmm12, xmm0, xmm11                               ;480.35
        vaddsd xmm13, xmm12, QWORD PTR [_2il0floatpacket.6984]  ;480.35
        vmulsd xmm14, xmm0, xmm13                               ;480.35
        vaddsd xmm15, xmm14, QWORD PTR [_2il0floatpacket.6983]  ;480.35
        vmulsd xmm3, xmm0, xmm15                                ;480.35
        vaddsd xmm4, xmm3, QWORD PTR [_2il0floatpacket.6982]    ;480.35
        vmulsd xmm0, xmm0, xmm4                                 ;480.35
        vaddsd xmm0, xmm0, QWORD PTR [_2il0floatpacket.6981]    ;480.35
        vmulsd xmm1, xmm1, xmm0                                 ;480.35
        vaddsd xmm2, xmm2, xmm1                                 ;480.35
        vmovsd QWORD PTR [2592+rbp+rax*8], xmm2                 ;480.49
        inc rax                                                 ;480.35
        cmp rax, 256                                            ;480.35
        jl .B51.6                                               ; Prob 82% ;480.35[/plain]

source code :

[cpp]inline double fastsin(double x) {
  const double coef1 = -0.16666666666666666666666666666667,
               coef2 = 0.00833333333333333333333333333333,
               coef3 = -1.984126984126984126984126984127e-4,
               coef4 = 2.7557319223985890652557319223986e-6,
               coef5 = -2.5052108385441718775052108385442e-8,
               coef6 = 1.6059043836821614599392377170155e-10,
               coef7 = -7.6471637318198164759011319857881e-13,
               coef8 = 2.8114572543455207631989455830103e-15,
               coef9 = -8.2206352466243297169559812368723e-18,
               coef10 = 1.9572941063391261230847574373505e-20,
               coef11 = -3.8681701706306840377169119315228e-23;
  const double sqr = x*x; // x^2
  return x+x*sqr*(coef1+sqr*(coef2+sqr*(coef3+sqr*(coef4+sqr*(coef5+sqr*(coef6+sqr*(coef7+sqr*(coef8+sqr*(coef9+sqr*(coef10+sqr*(coef11)))))))))));
}

void FastSinTest(const double a[], double b[], int size) {
  // '<' signs below were stripped by the forum software and are restored
  for (int i=0; i<size; i++) b[i] = fastsin(a[i]);
}

// main:
const int size = 256, nrOfTests = 1000000;
double a[size], b[size];
for (int i=0; i<size; i++) a[i] = double(i)/double(size);
Chrono chrono("");
for (int i=0; i<nrOfTests; i++) FastSinTest(a, b, size);
double checkSum = 0.0;                       // keeps the results live (reconstructed)
for (int i=0; i<size; i++) checkSum += b[i];
DS << "checkSum = " << checkSum << "\n";[/cpp]
0 Kudos
Bernard
Valued Contributor I
539 Views
Almost the same code was produced by my compiler; only, because of the CPU architecture, VEX-prefixed instructions are not supported. Why did you inline the function? Try running the test without inlining.
How is the function chrono.getTime() implemented? Does it use inline RDTSC or Win32 GetTickCount()?
The results with vectorization switched off are almost as expected, in the range of 20-30 cycles.
0 Kudos
bronxzv
New Contributor II
538 Views
sorry but even without the "inline" keyword it's inlined by the Intel compiler; it's generally a good idea to inline polynomial evaluations since there is a long dependency chain and the compiler can schedule other unrelated instructions in between, like another polynomial evaluation for ex., and/or unroll evaluation of the same polynomial (don't try it by hand in assembly!)

Chrono is based on the Win32 QueryPerformanceFrequency / QueryPerformanceCounter
Chrono::getTime() returns a time in [ms] with better than 1e-3 ms accuracy

in this example total run time is around 1 sec so any wall-clock time based stopwatch is OK
you typically get better than single clock accuracy when you can execute the studied code a lot of times; RDTSC is better suited for more complex cases though
0 Kudos
Bernard
Valued Contributor I
539 Views

bronxzv
Today I ran another set of tests on my fastsin() function. The code between the blocks of RDTSC instructions was looped a million times and the delta was divided by the number of loop iterations.
On average the calculated result was 1660 cycles per iteration, that is 745.73 nanoseconds. When I measured only the polynomial approximation (coefficient assignment outside of the loop) I got 1075 cycles per iteration.
I also commented out the code and ran the tests; the result for the same settings was 117 cycles per iteration (probably the loop overhead itself). I performed those tests according to this post: http://software.intel.com/en-us/forums/showthread.php?t=52482.
My only explanation is that RDTSC and CPUID affect the accuracy of the time measurement for an arbitrary number of iterations. Probably the cumulative cost of these instructions is added to the polynomial execution time and simply distorts the true value.
I ran another set of tests where fastsin() was called 1 million times from main() and the delta measured by GetTickCount() was 78 milliseconds for 1 million iterations, so one iteration is
0.000078 millisec, which is 78 nanoseconds per function call. How were you guys able to achieve more accurate results?
Btw here is the response from one of the Intel's engineer:

"rdtsc
{a small number of instructions with a cumulative latency less than hundreds or thousands of clocks} // AS IN MY CASE
rdtsc
is very unlikely to yield a reliable result. The CPUID step is probably not needed either, although there are other opinions on that point.

The rdtsc instruction is serializing with respect to itself. It is not actually serializing."
Second answer from the same post:
2." Measurements with RDTSC are most credible if the number of clocks between the pair of RDTSC instructions is at least a few thousand, preferably tens of thousands".

0 Kudos
bronxzv
New Contributor II
539 Views
sorry but your timings make no sense at all, for ex. 117 cycles per iteration for the loop overhead

without seeing your source code I can't point to the error(s) in your methodology

concerning RDTSC the best will be to read the PDF I linked yesterday; it's more recent and far more insightful than the post you are referring to (btw using RDTSC is plain overkill for such a simple case with a single routine under study)

since I made the effort to test it and to post full source code I'll also suggest compiling my example in your environment; just replace the CPU frequency by yours and plug in your favorite stopwatch
0 Kudos
Bernard
Valued Contributor I
539 Views
I know, I must have some problems with my methodology :)
here is the snippet of code which calls fastsin() from the main()

The result of the calculation was 78 nanoseconds for one iteration.

[bash]double sine = 0;
unsigned int start,end,delta;
start = GetTickCount();
for(int i = 0;i<1000000;i++){
    sine = fastsin(0.45);
}
end = GetTickCount();
printf("start: %10d \n",start);
printf("end  : %10d \n",end);
printf("delta: %d \n",(end-start));
printf("sine : %0.24f \n",sine);[/bash]
0 Kudos
Bernard
Valued Contributor I
539 Views
Here is the fastsin() function with the polynomial approximation commented out in order to "measure" the RDTSC and CPUID overhead.
The result of the 1 million iteration loop is 117.514276 cycles: the delta is 7012024 in hex, 117514276 in dec, and 117514276/1000000 = 117.514276.

[bash]int start_lo,start_hi,time_lo,time_hi,counter;
counter = 1000000;
_asm{
    xor eax,eax
    xor edx,edx
    cpuid
    rdtsc
    mov start_lo,eax
    mov start_hi,edx
}
// the rest was truncated by the forum; restored from the description above
for(int i = 0;i<counter;i++){
    // polynomial approximation commented out
}
_asm{
    xor eax,eax
    xor edx,edx
    cpuid
    rdtsc
    mov time_lo,eax
    mov time_hi,edx
}[/bash]
0 Kudos
bronxzv
New Contributor II
539 Views
I'll be glad to also see your fastsin() implementation

since you call it with a constant the Intel compiler will probably resolve it at compile time so I can't use it as is; also it will probably not loop 1 million times, since the same variable is written over and over again and it can see that fastsin has no global side effect. I'll suggest basing future tests on my code instead
0 Kudos
bronxzv
New Contributor II
539 Views
I suggest forgetting about RDTSC entirely at the moment; get rid of all this _asm stuff and simply measure (with GetTickCount in the *outer loop*) what you are interested in: the polynomial approx.
0 Kudos
Bernard
Valued Contributor I
539 Views
Full implementation of fastsin() with for-loop to test polynomial block.

[bash]double fastsin(double x){
    double sum = 0;
    double half_pi,zero_arg;
    half_pi = Pi/2;
    zero_arg = Zero;
    int const Iter = 1000000;
    if(x > half_pi){
        return (x-x)/(x-x);
    }else if (x < zero_arg){
        return (x-x)/(x-x);
    }else{
        int start_lo,start_hi,time_lo,time_hi;
        _asm{
            xor eax,eax
            xor edx,edx
            cpuid
            rdtsc
            mov start_lo,eax
            mov start_hi,edx
        }
        // the rest was truncated by the forum; restored from context
        for(int i = 0;i<Iter;i++){
            // polynomial approximation block (same expression as the full
            // fastsin() listing later in the thread)
        }
        _asm{
            xor eax,eax
            xor edx,edx
            cpuid
            rdtsc
            mov time_lo,eax
            mov time_hi,edx
        }
    }
    return sum;
}[/bash]
0 Kudos
bronxzv
New Contributor II
539 Views
so it looks like you actually call the poly approx. 1e12 times: 1e6x from the outer loop, 1e6x in the inner loop; the inner loop is probably simplified by the compiler though, and you are mostly measuring the speed of printf + CPUID/RDTSC

I'll adapt my code for using GetTickCount and post it shortly so that we have some common testbed

0 Kudos
bronxzv
New Contributor II
539 Views
here is the code using GetTickCount; to use it in your context:
- replace "DS" by any ostream (for ex. cout)
- replace the 3.5e6 constant by yours (CPU cycles per ms)
note that I get the same timings as with the previous version, just more variation due to the poor resolution of GetTickCount vs QueryPerformanceCounter
[cpp]
inline double fastsin(double x) {
  const double coef1 = -0.16666666666666666666666666666667,
               coef2 = 0.00833333333333333333333333333333,
               coef3 = -1.984126984126984126984126984127e-4,
               coef4 = 2.7557319223985890652557319223986e-6,
               coef5 = -2.5052108385441718775052108385442e-8,
               coef6 = 1.6059043836821614599392377170155e-10,
               coef7 = -7.6471637318198164759011319857881e-13,
               coef8 = 2.8114572543455207631989455830103e-15,
               coef9 = -8.2206352466243297169559812368723e-18,
               coef10 = 1.9572941063391261230847574373505e-20,
               coef11 = -3.8681701706306840377169119315228e-23;
  const double sqr = x*x; // x^2
  return x+x*sqr*(coef1+sqr*(coef2+sqr*(coef3+sqr*(coef4+sqr*(coef5+sqr*(coef6+sqr*(coef7+sqr*(coef8+sqr*(coef9+sqr*(coef10+sqr*(coef11)))))))))));
}

void FastSinTest(const double a[], double b[], int size) {
#pragma unroll(4)
  // '<' signs below were stripped by the forum software and are restored
  for (int i=0; i<size; i++) b[i] = fastsin(a[i]);
}

// to plug into main:
const int size = 256, nrOfTests = 1000000;
double a[size], b[size];
for (int i=0; i<size; i++) a[i] = double(i)/double(size);
const DWORD start = GetTickCount();
for (int i=0; i<nrOfTests; i++) FastSinTest(a, b, size);
// cycles per value: elapsed ms * (CPU cycles per ms) / number of evaluations
DS << "clocks per value = "
   << double(GetTickCount()-start)*3.5e6/(double(nrOfTests)*size) << "\n";
double checkSum = 0.0;                       // keeps the results live (reconstructed)
for (int i=0; i<size; i++) checkSum += b[i];
DS << "checkSum = " << checkSum << "\n";
0 Kudos
Bernard
Valued Contributor I
539 Views

fastsin() is called once from main(), and inside the function the polynomial block is executed 1e6 times. I never coded it to be called 1e12 times. When I tested it from main the polynomial block was not inside the loop. I simply posted two different test methods and you wrongly assumed that it ran 1e12 times. No, that is not the case.
printf is called after the end of the _asm block, and the overhead of the full trip to kernel mode, multiplied by 1e6 values to be displayed, is a lot more than a few hundred cycles per iteration.

here is main(); in the last test I got 1700 cycles, which probably contains the RDTSC+CPUID overhead.

[bash]int main(void){
    printf("%15s \n","fastsin(x)");
    fastsin(0.45);
    printf("%15s \n","fastcos(x)");
    fastcos(0.55);
    printf("%15s \n","fasttan()");
    fasttan(0.45);
    fastexp(1.0);
}[/bash]
0 Kudos
bronxzv
New Contributor II
539 Views
I assumed you posted it in answer to my request here: http://software.intel.com/en-us/forums/showpost.php?p=187014

so, in other words, you have not posted a complete example using GetTickCount, so it's difficult to say what you do wrong; more generally, without full source code that all peers can compile we lose a lot of time in misunderstandings (read: I'm getting bored)

0 Kudos
Bernard
Valued Contributor I
539 Views
Here is the full implementation of fastsin() tested from main(), and the main() function which calls fastsin() 1e6 times; the result in milliseconds is 63

start value: 19839910
end value : 19839973
delta is : 63
sine is: 0.434965534111230230000000

code snippet


[bash]int main(void){
    unsigned int const num_Iter = 1000000;
    unsigned int end,start,delta,i;
    double sine;
    sine = 0;
    start = GetTickCount();
    // loop body and fastsin() prologue restored; the forum stripped this span
    for(i = 0;i<num_Iter;i++){
        sine = fastsin(0.45);
    }
    end = GetTickCount();
    delta = end-start;
    printf("start value: %10d \n",start);
    printf("end value  : %10d \n",end);
    printf("delta is   : %d \n",delta);
    printf("sine is    : %0.24f \n",sine);
    return 0;
}

double fastsin(double x){
    double sum = 0;
    double half_pi,zero_arg;
    half_pi = Pi/2;
    zero_arg = Zero;
    if(x > half_pi){
        return (x-x)/(x-x);
    }else if (x < zero_arg){
        return (x-x)/(x-x);
    }else{
        double coef1,coef2,coef3,coef4,coef5,coef6,coef7,coef8,coef9,coef10,coef11,rad,sqr;
        coef1 = -0.16666666666666666666666666666667;   // 1/3!
        coef2 = 0.00833333333333333333333333333333;    // 1/5!
        coef3 = -1.984126984126984126984126984127e-4;  // 1/7!
        coef4 = 2.7557319223985890652557319223986e-6;  // 1/9!
        coef5 = -2.5052108385441718775052108385442e-8; // 1/11!
        coef6 = 1.6059043836821614599392377170155e-10; // 1/13!
        coef7 = -7.6471637318198164759011319857881e-13;// 1/15!
        coef8 = 2.8114572543455207631989455830103e-15; // 1/17!
        coef9 = -8.2206352466243297169559812368723e-18;// 1/19!
        coef10 = 1.9572941063391261230847574373505e-20;// 1/21!
        coef11 = -3.8681701706306840377169119315228e-23;// 1/23!
        rad = x;
        sqr = x*x; // x^2
        sum = rad+rad*sqr*(coef1+sqr*(coef2+sqr*(coef3+sqr*(coef4+sqr*(coef5+sqr*(coef6+sqr*(coef7+sqr*(coef8+sqr*(coef9+sqr*(coef10+sqr*(coef11)))))))))));
    }
    return sum;
}[/bash]
0 Kudos
bronxzv
New Contributor II
9,938 Views

calls fastsin() 1e6 times; the result in milliseconds is 63

so it's 63 ns per iteration, or ~120 clocks on your CPU; it doesn't match your previous reports IIRC

if you keep only the polynomial (get rid of the strange domain check) you should begin to see timings nearer to mine
0 Kudos
SergeyKostrov
Valued Contributor II
580 Views
Hi everybody,

Iliya, your measured time, that is 63 milliseconds, for the 'fastsine' function doesn't take into account some
overhead for the 'for' loop.

Could you try the following scheme? This is a pseudo code:

...
OverheadOfForStatement = 0
FastSineSpeedInMs = 0
Start = 0
End = 0
NumOfIterations = 2^22

ChangeProcessPriorityToRealTime()

Start = ::GetTickCount()
for( i = 0; i < NumOfIterations; i++ ) // Verify that a C++ compiler keeps the empty loop!
{
}
End = ::GetTickCount()
OverheadOfForStatement = ( End - Start )

Start = ::GetTickCount()
for( i = 0; i < NumOfIterations; i++ )
{
    FastSine(...)
}
End = ::GetTickCount()
FastSineSpeedInMs = ( End - Start - OverheadOfForStatement ) / NumOfIterations

ChangeProcessPriorityToNormal()
...

Best regards,
Sergey
0 Kudos
Bernard
Valued Contributor I
580 Views
Hi Sergey
Tomorrow I will start to post more comprehensive results from my testing. I will compare Java methods for calculating various math functions against native code implementations of the same functions.



"Iliya, your measured time, that is 63 milliseconds, for the 'fastsine' function doesn't take into account some
overhead for the 'for' loop.

Could you try the following scheme? This is a pseudo code:"





Results of loop-overhead testing
As you can see, I cannot measure the loop overhead; moreover, I also checked with the debugger that the empty for-loop is executed. Priority was set with the help of Process Explorer. The assembly instructions can be counted, so the overhead is the sum of a few x86 instructions like "jae [target], add 1, cmp some_value" and should be no more than a few cycles per iteration.

[bash]start value of loop_overhead : 5600529
end value of loop_overhead   : 5600529
delta of loop_overhead is    : 0
start value of fastsin()     : 5600529
end value of fastsin()       : 5600607
delta of fastsin() and loop_overhead is : 78
sine is: 0.434965534111230230000000[/bash]
0 Kudos
Bernard
Valued Contributor I
580 Views
Hi everybody

Today I tested the execution time of the Java version of my fastsin() function. Two timing methods were used:

1. The polynomial approximation block's delta calculated directly with nanosecond resolution as returned by System.nanoTime(); the result was 906 nanosec.

2. A one million iteration for-loop with the fastsin() call inside the loop, the delta measured by System.currentTimeMillis(); the result was 63 millisec, exactly the same as the native code time delta.

The slow execution time of the first method is, I think, probably related to the possibility of System.nanoTime()
being binary translated by the JIT compiler to the RDTSC instruction.

How could it be possible that the native code fastsin() and the Java fastsin() both have the same speed of execution? Even if the Java JIT recognizes and optimizes hot-spots by caching binary translated code.
[bash]double value55 = 0;
// Java version test for the fastsin() method
long start1,end1;
final int max_Iter = 1000000;
start1 = System.currentTimeMillis();
for(int k = 0; k < max_Iter;k++){
    value55 = func.fastsin(0.45);
}
end1 = System.currentTimeMillis();
System.out.println("running time of fastsin() is :" + (end1-start1) + "\tmillisec");
System.out.printf("%15s \n", "fastsin()");
System.out.printf("%.24f \n",value55);
}[/bash]RESULT:

running time of fastsin() is :63 millisec

fastsin()

0.434965534111230230000000


0 Kudos
bronxzv
New Contributor II
580 Views

How could it be possible that the native code fastsin() and the Java fastsin() both have the same speed of execution? Even if the Java JIT recognizes and optimizes hot-spots by caching binary translated code.

I'll not rule out the fact that the JIT is able to do well on such a simple example

now, you are using a stopwatch with 1 ms resolution, so it's not precise enough to compare timings of around 63 ms (~3% max potential error); to improve the situation:

1) increase the iteration count so that it runs for roughly 2 s, i.e. ~30x more iterations, you'll have then 0.1% max error

and/or

2) use a more precise stopwatch based on QueryPerformanceCounter

also, as previously noted, I don't think that calling it with a constant is the best idea; instead I'll advise sampling the domain like in the example I provided the other day
0 Kudos