
Hello!

While coding various series expansions of many functions in assembly, I tried to optimize my code mainly by using the rcpps instruction instead of divps. When performing tests on code which calculates the sine function by its Taylor series, I did a few measurements as explained here: http://software.intel.com/en-us/forums/showthread.php?t=52482 and the result was between 10-12 cycles per first term of the sine expansion (I used 14 terms).

I would like to ask how I can rewrite this code in order to gain an improvement in execution speed.

[bash]
movups xmm0,argument
movups xmm1,argument
mulps  xmm1,xmm1      ; x^2
mulps  xmm1,xmm0      ; x^3
mov    ebx,OFFSET coef
movups xmm2,[ebx]     ; load coefficient
rcpps  xmm3,xmm2      ; rcpps used instead of divps
mulps  xmm1,xmm3      ; x^3 * (1/coef)
subps  xmm0,xmm1
[/bash]
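One caveat with this substitution: rcpps returns only an ~12-bit approximation of the reciprocal, so replacing divps with it trades accuracy for speed. A common remedy is one Newton-Raphson refinement step, r' = r*(2 - d*r), which roughly squares the relative error. A minimal C sketch with SSE intrinsics (the function names are mine, not from this thread):

```c
#include <xmmintrin.h>  /* SSE intrinsics: _mm_rcp_ps etc. */

/* Approximate 1/d with rcpps, then apply one Newton-Raphson step:
   r' = r * (2 - d*r). This lifts ~12-bit accuracy to roughly
   single-precision accuracy at the cost of two mulps and one subps. */
static inline __m128 refine_rcp(__m128 d)
{
    __m128 r   = _mm_rcp_ps(d);    /* ~12-bit approximation of 1/d */
    __m128 two = _mm_set1_ps(2.0f);
    return _mm_mul_ps(r, _mm_sub_ps(two, _mm_mul_ps(d, r)));
}

/* Scalar helper for testing one lane. */
float recip_refined(float x)
{
    float out;
    _mm_store_ss(&out, refine_rcp(_mm_set1_ps(x)));
    return out;
}
```

Since the polynomial coefficients are compile-time constants, an even cheaper option is to pre-compute their reciprocals in the data segment and use plain mulps, avoiding both divps and rcpps.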

---

*calls 1e6 times fastsin(); the result in milliseconds is 63*

So it's 63 ns per iteration, or ~120 clocks on your CPU; that doesn't match your previous reports IIRC.

If you keep only the polynomial (get rid of the strange domain check) you should begin to see timings nearer to mine.


---

*Quoting iliyapolak*

*...Why your results are so large almost as my slow implementation...*

[**SergeyK**] Because your computer is faster.

...I always measure a relative performance...

And here are results of a very special test:

Performance of **C 'Sin' Macros** measured with the **RDTSC** instruction is as follows:

Normalized Taylor Series Sine ( **7** terms ) - **25** clock cycles

Normalized Taylor Series Sine ( **9** terms ) - **35** clock cycles

Normalized Taylor Series Sine ( **11** terms ) - **43** clock cycles

However, at the time of testing a Windows Update was in progress...

Best regards,

Sergey

---

*Normalized Taylor Series Sine ( 7 terms ) - 25 clock cycles*

*Normalized Taylor Series Sine ( 9 terms ) - 35 clock cycles*

*Normalized Taylor Series Sine ( 11 terms ) - 43 clock cycles*

These results are closer to the results achieved by me and by bronxzv. From a theoretical point of view it should be very fast, because there are only a few additions and multiplications plus one store of the accumulator register to memory.

---

Try to run your tests for 1e6 iterations.

78 ms (ticks) / 2^22 = 0.000018596649169921875 ms ≈ 0.0000186 ms

Here are my results for 2^22 iterations of unoptimized fastsin(), but heavily optimized by the Intel C++ compiler with full code inlining inside the loop and 10x loop unrolling.

**start val of fastsin() 5566115**

end val of fastsin() 5566178

running time of fastsin() release code is: 63 millisec

fastsin() is: 0.988566034786635070000000

63 ms / 2^22 = 0.0000150203704833984375ms ~= 15 ns per iteration


Your code ran for 18.6 ns per iteration, probably due to less aggressive compiler optimization.

Btw my CPU is a kinda slow Core i3 at 2.226 GHz.
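The measurement described in this post can be sketched in C. The 2^22-iteration loop and the millisecond arithmetic follow the posts above; the fastsin() body below uses the standard Taylor coefficients, and the harness itself is my sketch, not the thread's exact code:

```c
#include <time.h>

/* 13th-degree Taylor sine evaluated with Horner's scheme; accurate
   only for arguments already reduced to a small range around 0. */
double fastsin(double x)
{
    const double c3  = -1.0/6.0,         c5  = 1.0/120.0,
                 c7  = -1.0/5040.0,      c9  = 1.0/362880.0,
                 c11 = -1.0/39916800.0,  c13 = 1.0/6227020800.0;
    double x2 = x * x;
    return ((((((c13*x2 + c11)*x2 + c9)*x2 + c7)*x2 + c5)*x2 + c3)*x2 + 1.0)*x;
}

/* Time `iters` calls and return elapsed milliseconds; the volatile
   sink keeps the compiler from deleting the loop as dead code. */
double bench_ms(long iters)
{
    volatile double sink = 0.0;
    clock_t start = clock();
    for (long i = 0; i < iters; ++i)
        sink += fastsin(0.5);
    (void)sink;
    return 1000.0 * (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

With `bench_ms(1L << 22)`, milliseconds divided by 2^22 gives the per-iteration time discussed above; note that with optimization enabled the compiler may still constant-fold the fixed argument, which is one reason measured numbers differ so much between setups.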

---

*Yes, something could be wrong with this measurement. I think that the overhead in the worst-case scenario is a few cycles: a compare, an increment of the counter, and a jmp to top_loop. In real-world testing the CPU should execute such a loop in parallel with the statements inside the loop.*

This is a follow-up on Posts #117 and #114. I think you need to disable ALL optimizations in order to measure the overhead of an empty 'for' statement. The Intel C++ compiler could easily "remove" it. Since it didn't, and your result was 0, something else was wrong. I'll take a look at it some time this week.

---

Yes, I was wrong. It is not only the different CPUs, but also the non-deterministic environment, uncontrollable by us, which can greatly affect our measurements.

- Value VA in ms was obtained on a computer A with CPU frequency FA
- Value VB in ms was obtained on a computer B with CPU frequency FB
- If value VB is less than value VA, then computer B is faster than computer A

---

*Let's say I've set a target to outperform some "reference function" from the CRT library.*

The MSVCRT.DLL sin() function, after argument checking, calls the x87 fsin instruction. I suppose that fsin also performs range reduction and input checking; its CPI is ~40-100 cycles. You also have to add the overhead of the library wrapper and argument checking, so CRT sin() will always be slower than our fastsin() implementation. Even x87 fsin on its own is slower on average than fastsin(), because of range reduction and, if I'm not wrong, coefficient pre-calculations done on the fly. This also explains the accuracy of fastsin() compared to CRT sin().

Here is a code snippet of _CIsin, disassembled by IDA Pro, which is the compiler-optimized version of many sin() implementations.

[bash]
                push    edx
.text:6FF759A3  fstcw   [esp+4+var_4]
.text:6FF759A7  jz      loc_6FF7E5AC
.text:6FF759AD  cmp     [esp+4+var_4], 27Fh
.text:6FF759B3  jz      short loc_6FF759BB
.text:6FF759B5  fldcw   ds:word_6FF5C0D6
.text:6FF759BB
.text:6FF759BB loc_6FF759BB:                 ; CODE XREF: sub_6FF759A2+11j
.text:6FF759BB  fsin                         ; x87 instruction call
.text:6FF759BD  fstsw   ax
.text:6FF759C0  sahf
.text:6FF759C1  jp      loc_6FF7E552
.text:6FF759C7
.text:6FF759C7 loc_6FF759C7:                 ; CODE XREF: sub_6FF759A2+8BC4j
.text:6FF759C7  cmp     dword_6FFF50C4, 0
.text:6FF759CE  jnz     loc_6FF6C003
.text:6FF759D4  mov     edx, 1Eh
.text:6FF759D9  lea     ecx, unk_6FFF5390
.text:6FF759DF  jmp     loc_6FF5EE3C
.text:6FF759DF sub_6FF759A2 endp
[/bash]

---

*In real-world testing the CPU should execute such a loop in parallel with the statements inside the loop.*

Indeed, that's why it's not sensible to measure the timing of an empty loop as an estimate of its impact on a real loop. There are typically no conflicts between the GPR increment/compare and the FADD/FMUL ports; moreover the branch is macro-fused and has a high prediction hit rate. All in all, the actual overhead is almost 0.

A great tool to see how several instructions are executed in parallel, and to study your critical path, is the IACA, available here:

http://software.intel.com/en-us/articles/intel-architecture-code-analyzer/

---

The CPU's internal logic will easily optimize such a case; this is a classic case where integer calculations are executed in parallel with floating-point instructions. The situation would be slightly different if a floating-point-based for-loop were used: there the loop's FP code could be interleaved with the main calculation body and executed by Port 0 and Port 1 in parallel with the loop's body.

Btw. thanks for the link

---

*...63 ms / 2^22 = 0.0000150203704833984375 ms ≈ 15 ns per iteration*

***Your code ran for 18.6 ns per iteration, probably due to less aggressive compiler optimization***...

I always disable ALL optimizations during testing because I need to evaluate results for 5 different C++ compilers and 6 different platforms.

---

*...CRT sin() will always be slower than our fastsin() implementation...*

This is not always true: CRT 'sin' outperforms some of my sine functions with 9 terms ( see Post #183, based on a Chebyshev polynomial ). This is not a surprise to me, because CRT 'sin' is implemented in assembler.

ALL my sine functions are implemented in C.

---

*This is not always true: CRT 'sin' outperforms some of my sine functions with 9 terms ( see Post #183, based on a Chebyshev polynomial ).*

It could be possible, because the CPI of fsin varies between 40 and 100 cycles per instruction. If fsin can reach 40-50 CPI and your custom Chebyshev-based implementation is not aggressively optimized by the compiler, it is possible that CRT sin() could run faster.

**>> This is not a surprise for me because CRT 'sin' is implemented in assembler**

MSVCRT.DLL sin() has some overhead: argument checking, a compare against 0, and a branch on the result to the location which in the end calls x87 fsin (you can clearly see the call to fsin in my post above). When CRT sin() is called many times, the compiler can simply inline the code in the for-loop block; because of the almost 100% predictable backward branching, the assembly code for the overhead will already be present in the cache, decoded to uops, which can greatly speed up the execution of CRT sin().

Btw, the JVM's Math.sin also uses fsin, but range reduction is performed at the bytecode level, so it is slower than the MSVCRT sine functions.

---

*I always disable ALL optimizations during testing because I need to evaluate results for 5 different C++ compilers and 6 different platforms.*

Why don't you try to run fastsin(), or your other sine functions, heavily optimized by the Intel compiler? It could be very interesting to see the results of such a measurement.

---

*Why don't you try to run fastsin() or your other sine functions heavily optimized by the Intel compiler? It can be very interesting to see the results of such a measurement.*

Iliya,

I work on a project and I can't do everything I want. I tested your 'FastSin' function and, unfortunately, I need to move on to other things. Your set of special functions implemented in Java is good ( Thank you again! ), but it has no practical application for the project. Please take a look at this thread for more details:

http://software.intel.com/en-us/forums/showthread.php?t=106134&o=a&s=lr

Post #3

Best regards,

Sergey

---

*Let's say I've set a target to outperform some "reference function" from the CRT library.*

*MSVCRT.DLL sin() function, after argument checking, calls the x87 fsin instruction.*

**I suppose that fsin also performs range reduction**

*...*

The '**FSIN**' instruction accepts any value in the range from -2^63 to +2^63. If an input value is outside of that range, a reduction of the input value must be done, and Intel recommends using the '**FPREM**' instruction.

Best regards,

Sergey

---

*Your set of special functions implemented in Java is good ( Thank you again! ), but it has no practical application for the project.*

Thank you, Sergey. Maybe in the foreseeable future you could implement some parts of the special functions class; I would be very glad to see it :)

What is your main area of software development? Is it numerical-methods software?

---

*The 'FSIN' instruction accepts any value in the range from -2^63 to +2^63.*

I was talking about the range reduction (or mapping) of large values in radians (input in -2^63..+2^63) to small values of |x|.

---

For sine, Horner's scheme is as follows (the constants are c_n = ±1/n!, with alternating signs, for the usual Taylor approximation):

x2 = x*x;

result = ((((((c13*x2+c11)*x2+c9)*x2+c7)*x2+c5)*x2+c3)*x2+c1)*x;

In assembler (x and result via [eax]):

[bash]
movsd xmm0,[eax]
movsd xmm1,xmm0   ; x
mulsd xmm0,xmm0
movsd xmm2,xmm0   ; x2
mulsd xmm0,[c13]
addsd xmm0,[c11]
mulsd xmm0,xmm2
addsd xmm0,[c9]
mulsd xmm0,xmm2
addsd xmm0,[c7]
mulsd xmm0,xmm2
addsd xmm0,[c5]
mulsd xmm0,xmm2
addsd xmm0,[c3]
mulsd xmm0,xmm2
addsd xmm0,[c1]
mulsd xmm0,xmm1
movsd [eax],xmm0
[/bash]

A simplified variant of Estrin's scheme is

x2 = x*x;

x4 = x2*x2;

result = ( (((+c11)*x4+c7)*x4+c3)*x2 +
           (((+c13)*x4+c9)*x4+c5)*x4 + c1 )*x;

In assembler (x and result via [eax]):

[bash]
movsd xmm0,[eax]
movsd xmm1,xmm0   ; x
mulsd xmm0,xmm0
movsd xmm2,xmm0   ; x2
mulsd xmm0,xmm0
movsd xmm3,xmm0
movsd xmm4,xmm0   ; x4
mulsd xmm0,[c13]
mulsd xmm3,[c11]
addsd xmm0,[c9]
addsd xmm3,[c7]
mulsd xmm0,xmm4
mulsd xmm3,xmm4
addsd xmm0,[c5]
addsd xmm3,[c3]
mulsd xmm0,xmm4
mulsd xmm3,xmm2
addsd xmm0,[c1]
addsd xmm0,xmm3
mulsd xmm0,xmm1
movsd [eax],xmm0
[/bash]

At least on Atom N450 and Core i7 920 this is a gain.
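As a cross-check, both schemes can be written in plain C and compared against the library sin(); the coefficients below are the standard alternating Taylor ones. This verification harness is my sketch, not code from the thread:

```c
#include <math.h>

static const double c1  = 1.0,              c3  = -1.0/6.0,
                    c5  = 1.0/120.0,        c7  = -1.0/5040.0,
                    c9  = 1.0/362880.0,     c11 = -1.0/39916800.0,
                    c13 = 1.0/6227020800.0;

/* Horner: a single long dependency chain of mul/add pairs. */
double sin_horner(double x)
{
    double x2 = x * x;
    return ((((((c13*x2 + c11)*x2 + c9)*x2 + c7)*x2 + c5)*x2 + c3)*x2 + c1)*x;
}

/* Simplified Estrin: two shorter chains in x4 that the CPU can
   execute in parallel, joined at the end. */
double sin_estrin(double x)
{
    double x2 = x * x;
    double x4 = x2 * x2;
    return ( ((c11*x4 + c7)*x4 + c3)*x2 +
             ((c13*x4 + c9)*x4 + c5)*x4 + c1 )*x;
}
```

Both evaluate the same degree-13 polynomial, so they agree to within a few ulps; the speed difference comes purely from the shape of the dependency chains.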

---

Here is the Horner scheme for a polynomial approximation of the tan() function; the first four correction terms of the tan() Taylor series are shown below. It is not optimal in terms of speed because of the stack accesses used to calculate the powers of the variable x.

[bash]
main_loop:
    addps  xmm5,step
    movups xmm7,xmm5
    movups xmm0,xmm7       ; xmm0 accumulator
    mulps  xmm7,xmm7
    movups xmm6,xmm7       ; copy x^2
    mulps  xmm7,xmm0       ; x^3
    movups [ebp-16],xmm7   ; store x^3
    movups xmm1,coef1
    mulps  xmm1,xmm7
    addps  xmm0,xmm1       ; x + x^3/3
    movups xmm7,[ebp-16]
    mulps  xmm7,xmm6       ; x^5
    movups [ebp-32],xmm7   ; store x^5
    movups xmm1,coef2
    mulps  xmm1,xmm7
    addps  xmm0,xmm1       ; x + x^3/3 + 2*x^5/15
    movups xmm7,[ebp-32]
    mulps  xmm7,xmm6       ; x^7
    movups [ebp-48],xmm7   ; store x^7
    movups xmm1,coef3
    mulps  xmm1,xmm7
    addps  xmm0,xmm1       ; + 17*x^7/315
    movups xmm7,[ebp-48]
    mulps  xmm7,xmm6       ; x^9
    movups [ebp-64],xmm7   ; store x^9
    movups xmm1,coef4
    mulps  xmm1,xmm7
    addps  xmm0,xmm1       ; + 62*x^9/2835
[/bash]
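The same five terms can be kept entirely in registers by factoring the series into a Horner form in x^2, avoiding the stack stores altogether. A C sketch of that rewrite (mine, not from the thread):

```c
/* tan(x) ~ x + x^3/3 + 2x^5/15 + 17x^7/315 + 62x^9/2835,
   evaluated as a Horner polynomial in x2 = x*x, so no power of x
   ever has to be spilled to memory. Only valid for small |x|. */
double tan_horner(double x)
{
    double x2 = x * x;
    return x * (1.0 + x2*(1.0/3.0 + x2*(2.0/15.0 +
                x2*(17.0/315.0 + x2*(62.0/2835.0)))));
}
```

A compiler (or a hand-written SSE version) keeps x2 and the running sum in two registers, which removes all of the [ebp-16]..[ebp-64] traffic in the listing above.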

---

Why do you store the odd powers of x? You don't need to store them, and you don't even need to reload them.

BTW: You should try to avoid modifying a register that is needed by the very next instruction; always keep a producer and its consumer as far apart as possible, especially when the producer has a long latency.

---

Right, I was thinking about my other implementation of tan() written in C when I made that comment. It is a straightforward implementation of the Taylor expansion with pre-calculated coefficients. This is not the Horner scheme, which starts with the last coefficient (highest index).

Btw, polynomial evaluation can also start from the lowest coefficient; I have seen it many times, for example in the FDLIBM implementation of the sine function.

**>> Why do you store the odd powers of x? You don't need to store them and you even need not reload them**

Because when I wrote this code I did not know how to efficiently multiply the powers of x without stack accesses. I simply implemented various Taylor expansions in assembly without even thinking about code optimization.

---

*It could be possible because the CPI of fsin varies between 40 and 100 cycles per instruction.*

That's one more reason for testing it over a sub-domain instead of a single value.
