Hello!
While coding various series expansions of functions in assembly, I tried to optimize my code mainly by using the rcpps instruction instead of divps. When testing code that calculates the sine function by its Taylor series, I made a few measurements as explained here: http://software.intel.com/en-us/forums/showthread.php?t=52482 and the result was between 10-12 cycles per first term of the sine expansion (I used 14 terms).
I would like to ask how I can rewrite this code to improve its execution speed.
[bash]movups xmm0,argument
movups xmm1,argument
mulps  xmm1,xmm1        ;x^2
mulps  xmm1,xmm0        ;x^3
mov    ebx,OFFSET coef
movups xmm2,[ebx]
rcpps  xmm3,xmm2        ;rcpps used instead of divps
mulps  xmm1,xmm3        ;x^3/3!
subps  xmm0,xmm1        ;x - x^3/3![/bash]
>>calls 1e6 times fastsin() the result in millisecond is 63
so it's 63 ns per iteration, or ~120 clocks on your CPU; it doesn't match your previous reports IIRC
if you keep only the polynomial (get rid of the strange domain check) you should begin to see timings nearer to mine
[SergeyK] Because your computer is faster.
...I always measure a relative performance...
And here are results of a very special test:
Performance of C 'Sin' macros measured with the RDTSC instruction is as follows:
Normalized Taylor Series Sine ( 7 terms ) - 25 clock cycles
Normalized Taylor Series Sine ( 9 terms ) - 35 clock cycles
Normalized Taylor Series Sine ( 11 terms ) - 43 clock cycles
However, at the time of testing a Windows Update was in progress...
Best regards,
Sergey
>>Normalized Taylor Series Sine ( 7 terms ) - 25 clock cycles
>>Normalized Taylor Series Sine ( 9 terms ) - 35 clock cycles
>>Normalized Taylor Series Sine ( 11 terms ) - 43 clock cycles
These results are closer to the results achieved by me and by bronxzv. From a theoretical point of view it should be very fast: only a few additions and multiplications, plus one store of the accumulator register to memory.
>>Try to run your tests for 1e6 iterations.
Here are my results for 2^22 iterations of the unoptimized fastsin() source, heavily optimized by the Intel C++ compiler with full code inlining inside the loop and 10x loop unrolling:
start val of fastsin() 5566115
end val of fastsin() 5566178
running time of fastsin() release code is: 63 millisec
fastsin() is: 0.988566034786635070000000
63 ms / 2^22 = 0.0000150203704833984375 ms ~= 15 ns per iteration
Your code: 78 ms / 2^22 = 0.000018596649169921875 ms ~= 18.6 ns per iteration, probably due to less aggressive compiler optimization.
Btw my CPU is a kinda slow Core i3 at 2.226 GHz.
Yes, something could be wrong with this measurement. I think that the overhead in the worst-case scenario is a few cycles: a compare, a counter increment, and a jump back to the top of the loop. In real-world testing the CPU should execute such a loop in parallel with the statements inside it.
This is a follow-up on Posts #117 and #114. I think you need to disable ALL optimizations in order to measure the overhead of an empty 'for' statement. The Intel C++ compiler could easily "remove" it. Since it didn't, and your result was 0, something else was wrong. I'll take a look at it some time this week.
Yes, I was wrong. It is not only the different CPUs, but also a non-deterministic environment, uncontrollable by us, which can greatly affect our measurements.
- Value VA in ms was obtained on a computer A with CPU frequency FA
- Value VB in ms was obtained on a computer B with CPU frequency FB
- If value VB is less than value VA, computer B is faster than computer A
>>Let's say I've set a target to outperform some "reference function" from CRT library
The MSVCRT.DLL sin() function, after argument checking, calls the x87 fsin instruction. I suppose that fsin also performs range reduction and input checking; its CPI is ~40-100 cycles. You also have to add the overhead of the library wrapper and the argument checking, so the CRT sin() will always be slower than our fastsin() implementation. Even x87 fsin alone is slower on average than fastsin(), because of range reduction and, if I am not wrong, coefficient pre-calculations done on the fly. This also explains the accuracy of fastsin() compared to CRT sin().
Here is a snippet of _CIsin, disassembled by IDA Pro; it is the compiler-optimized entry point behind many sin() calls.
[bash]push edx
.text:6FF759A3 fstcw [esp+4+var_4]
.text:6FF759A7 jz    loc_6FF7E5AC
.text:6FF759AD cmp   [esp+4+var_4], 27Fh
.text:6FF759B3 jz    short loc_6FF759BB
.text:6FF759B5 fldcw ds:word_6FF5C0D6
.text:6FF759BB
.text:6FF759BB loc_6FF759BB:          ; CODE XREF: sub_6FF759A2+11j
.text:6FF759BB fsin                   ; x87 instruction call
.text:6FF759BD fstsw ax
.text:6FF759C0 sahf
.text:6FF759C1 jp    loc_6FF7E552
.text:6FF759C7
.text:6FF759C7 loc_6FF759C7:          ; CODE XREF: sub_6FF759A2+8BC4j
.text:6FF759C7 cmp   dword_6FFF50C4, 0
.text:6FF759CE jnz   loc_6FF6C003
.text:6FF759D4 mov   edx, 1Eh
.text:6FF759D9 lea   ecx, unk_6FFF5390
.text:6FF759DF jmp   loc_6FF5EE3C
.text:6FF759DF sub_6FF759A2 endp
[/bash]
>>In real world testing CPU should execute such a loop in parallel with inside the loop statements
indeed, that's why it's not sensible to measure the timing of an empty loop as an estimate of its impact on a real loop: there are typically no conflicts between the GPR increment/compare and the FADD/FMUL ports; moreover, the branch is macro-fused and has a high prediction hit rate, so all in all the actual overhead is almost 0
a great tool to see how several instructions are executed in parallel, and to study your critical path, is IACA, available here:
http://software.intel.com/en-us/articles/intel-architecture-code-analyzer/
The CPU's internal logic will easily optimize such a case; this is a classic case where integer calculations are executed in parallel with floating-point instructions. The situation will be slightly different when a floating-point-based for-loop is used: there the loop's FP code could be interleaved with the main calculation body and executed by Port0 and Port1 in parallel with the loop's body.
Btw. thanks for the link
>>Your code ran at 18.6 ns per iteration, probably due to less aggressive compiler optimization...
I always disable ALL optimizations during testing because I need to evaluate results for 5 different C++ compilers and 6 different platforms.
This is not always true: CRT 'sin' outperforms some of my sine functions with 9 terms ( see Post #183, based on a Chebyshev polynomial ). This is not a surprise for me because CRT 'sin' is implemented in assembler.
ALL my sine functions are implemented in C.
>>This is not always true and CRT 'sin' outperforms some of my sine functions with 9 terms ( see Post #183, based on Chebyshev polynomial )
It could be possible, because the CPI of fsin varies between 40-100 cycles per instruction. If fsin can reach 40-50 CPI and your custom Chebyshev-based implementation is not optimized aggressively by the compiler, then the CRT sin() could run faster.
>>This is not a surprise for me because CRT 'sin' is implemented in assembler
MSVCRT.DLL sin() has some overhead: argument checking, comparing to 0, and branching on the result to the location which in the end calls x87 fsin (you can clearly see the call to fsin in my post above). When CRT sin() is called many times, the compiler can simply inline the code in the for-loop block; because of the almost 100% predictable backward branching, the assembly code for that overhead will already be in the cache, decoded to uops, which can greatly speed up the execution of CRT sin().
Btw the JVM's Math.sin also uses fsin, but range reduction is performed at the bytecode level, so it is slower than the MSVCRT sine functions.
>>I always disable ALL optimizations during testing because I need to evaluate results for 5 different C++ compilers and 6 different platforms
Why don't you try to run fastsin(), or your other sine functions, heavily optimized by the Intel compiler?
It would be very interesting to see the results of such a measurement.
>>Why don't you try to run fastsin(), or your other sine functions, heavily optimized by the Intel compiler?
>>It would be very interesting to see the results of such a measurement.
Iliya,
I work on a project and I can't do everything I want. I tested your 'FastSin' function and, unfortunately,
I need to move on to other things. Your set of special functions implemented in Java is good ( Thank you again! ) but
it has no practical application for the project. Please take a look at this thread for more details:
http://software.intel.com/en-us/forums/showthread.php?t=106134&o=a&s=lr
Post #3
Best regards,
Sergey
>>Let's say I've set a target to outperform some "reference function" from CRT library
>>MSVCRT.DLL sin() function, after argument checking, calls the x87 fsin instruction. I suppose that fsin also performs range reduction...
The 'FSIN' instruction accepts any value in the range from -2^63 to +2^63. If an input value is outside of that range, a reduction of
the input value must be done, and Intel recommends using the 'FPREM' instruction.
Best regards,
Sergey
>>Your set of special functions implemented in Java is good ( Thank you again! ) but it has no practical application for the project
Thank you, Sergey. Maybe in the foreseeable future you could implement some parts of the special functions class; I would be very glad to see it :)
What is your main area of software development? Is it numerical-methods software?
>>'FSIN' instruction accepts any values in the range from -2^63 to +2^63
I was talking about the range reduction (or mapping) of large radian values (input in -2^63..+2^63) to small values of |x|.
For sine, Horner's scheme is as follows (the constants c_n = ±1/n!, with alternating signs, for the usual Taylor approximation):
x2 = x*x;
result = ((((((c13*x2+c11)*x2+c9)*x2+c7)*x2+c5)*x2+c3)*x2+c1)*x;
In assembler (x and result via [eax]):
movsd xmm0,[eax]
movsd xmm1,xmm0 ; x
mulsd xmm0,xmm0
movsd xmm2,xmm0 ; x2
mulsd xmm0,[c13]
addsd xmm0,[c11]
mulsd xmm0,xmm2
addsd xmm0,[c9]
mulsd xmm0,xmm2
addsd xmm0,[c7]
mulsd xmm0,xmm2
addsd xmm0,[c5]
mulsd xmm0,xmm2
addsd xmm0,[c3]
mulsd xmm0,xmm2
addsd xmm0,[c1]
mulsd xmm0,xmm1
movsd [eax],xmm0
A simplified variant of Estrin's scheme is
x2 = x*x;
x4 = x2*x2;
result =
( (((+c11)*x4+c7)*x4+c3)*x2+
(((+c13)*x4+c9)*x4+c5)*x4+c1 )*x;
In assembler (x and result via [eax]):
movsd xmm0,[eax]
movsd xmm1,xmm0 ; x
mulsd xmm0,xmm0
movsd xmm2,xmm0 ; x2
mulsd xmm0,xmm0
movsd xmm3,xmm0
movsd xmm4,xmm0 ; x4
mulsd xmm0,[c13]
mulsd xmm3,[c11]
addsd xmm0,[c9]
addsd xmm3,[c7]
mulsd xmm0,xmm4
mulsd xmm3,xmm4
addsd xmm0,[c5]
addsd xmm3,[c3]
mulsd xmm0,xmm4
mulsd xmm3,xmm2
addsd xmm0,[c1]
addsd xmm0,xmm3
mulsd xmm0,xmm1
movsd [eax],xmm0
At least on Atom N450 and Core i7 920 this is a gain.
Here is the Horner scheme for polynomial approximation of the tan() function. The first four terms of the tan() Taylor series are shown below. It is not optimal in terms of speed because of the stack accesses used to calculate the powers of the variable x.
[bash]main_loop:
addps  xmm5,step
movups xmm7,xmm5
movups xmm0,xmm7        ;xmm0 accumulator
mulps  xmm7,xmm7
movups xmm6,xmm7        ;copy x^2
mulps  xmm7,xmm0        ;x^3
movups [ebp-16],xmm7    ;store x^3
movups xmm1,coef1
mulps  xmm1,xmm7
addps  xmm0,xmm1        ;x+x^3/3
movups xmm7,[ebp-16]
mulps  xmm7,xmm6        ;x^5
movups [ebp-32],xmm7    ;store x^5
movups xmm1,coef2
mulps  xmm1,xmm7
addps  xmm0,xmm1        ;x+x^3/3+2*x^5/15
movups xmm7,[ebp-32]
mulps  xmm7,xmm6        ;x^7
movups [ebp-48],xmm7    ;store x^7
movups xmm1,coef3
mulps  xmm1,xmm7
addps  xmm0,xmm1        ;+17*x^7/315
movups xmm7,[ebp-48]
mulps  xmm7,xmm6        ;x^9
movups [ebp-64],xmm7    ;store x^9
movups xmm1,coef4
mulps  xmm1,xmm7
addps  xmm0,xmm1        ;+62*x^9/2835[/bash]
Why do you store the odd powers of x? You don't need to store them, and you don't even need to reload them.
BTW: you should try to avoid modifying a register that is needed by the very next instruction - always keep a producer and its consumer as far apart as possible, especially when the producer has a long latency.
Right, I was thinking about my other implementation of the tan() function, written in C, when I made that comment. It is a straightforward implementation of the Taylor expansion with pre-calculated coefficients. It is not the Horner scheme, which starts with the last coefficient (highest index).
Btw, polynomial evaluation can also start from the lowest coefficient; I have seen it many times, for example in the FDLIBM implementation of the sine function.
>>Why do you store the odd powers of x? You don't need to store them and you even need not reload them
Because when I wrote this code I did not know how to efficiently multiply the powers of x without the stack accesses. I simply implemented various Taylor expansions in assembly without even thinking about code optimization.
>>It could be possible because the cpi of fsin varies between 40-100 cycles per instruction
that's one more reason for testing it over a sub-domain instead of a single value