Hello!
While coding various series expansions of functions in assembly, I tried to optimize my code mainly by using the rcpps instruction instead of divps. When testing code that calculates the sine function by its Taylor series, I made a few measurements as explained here: http://software.intel.com/en-us/forums/showthread.php?t=52482 and the result was between 10-12 cycles per first term of the sine expansion (I used 14 terms).
I would like to ask how I can rewrite this code to improve its execution speed.
[bash]movups xmm0,argument
movups xmm1,argument
mulps  xmm1,xmm1        ;x^2
mulps  xmm1,xmm0        ;x^3
mov    ebx,OFFSET coef
movups xmm2,[ebx]
rcpps  xmm3,xmm2        ;rcpps used instead of divps
mulps  xmm1,xmm3        ;x^3/3!
subps  xmm0,xmm1        ;x - x^3/3![/bash]
>>calls 1e6 times fastsin() the result in millisecond is 63
so it's 63 ns per iteration, or ~120 clocks on your CPU; it doesn't match your previous reports IIRC
if you keep only the polynomial (get rid of the strange domain check) you should begin to see timings nearer to mine
[SergeyK] Because your computer is faster.
...I always measure a relative performance...
And here are results of a very special test:
Performance of C 'Sin' macros measured with the RDTSC instruction is as follows:
Normalized Taylor Series Sine ( 7 terms ) - 25 clock cycles
Normalized Taylor Series Sine ( 9 terms ) - 35 clock cycles
Normalized Taylor Series Sine ( 11 terms ) - 43 clock cycles
However, at the time of testing a Windows Update was in progress...
Best regards,
Sergey
>>Normalized Taylor Series Sine ( 7 terms ) - 25 clock cycles
>>Normalized Taylor Series Sine ( 9 terms ) - 35 clock cycles
>>Normalized Taylor Series Sine ( 11 terms ) - 43 clock cycles
These results are closer to the results achieved by me and by bronxzv. From a theoretical point of view it should be very fast: only a few additions and multiplications, plus one store of the accumulator register to memory.
>>Try to run your tests for 1e6 iterations.
Here are my results for 2^22 iterations of the unoptimized fastsin() source, heavily optimized by the Intel C++ compiler with full code inlining inside the loop and 10x loop unrolling:
start val of fastsin() 5566115
end val of fastsin() 5566178
running time of fastsin() release code is: 63 millisec
fastsin() is: 0.988566034786635070000000
63 ms / 2^22 = 0.0000150203704833984375 ms ~= 15 ns per iteration
Your code: 78 ms / 2^22 = 0.000018596649169921875 ms ~= 18.6 ns per iteration, probably due to less aggressive compiler optimization.
Btw my CPU is a kinda slow Core i3 at 2.226 GHz.
Yes, something could be wrong with this measurement. I think that the overhead in the worst-case scenario is a few cycles: a compare, a counter increment, and a jump back to the top of the loop. In real-world testing the CPU should execute such a loop in parallel with the statements inside it.
This is a follow-up on Posts #117 and #114. I think you need to disable ALL optimizations in order to measure the overhead of an empty 'for' statement. The Intel C++ compiler could easily "remove" it. Since it didn't, and your result was 0, something else was wrong. I'll take a look at it some time this week.
Yes, I was wrong. It is not only the different CPUs, but also a non-deterministic environment, uncontrollable by us, which can greatly affect our measurements.
- Value VA in ms was obtained on a computer A with CPU frequency FA
- Value VB in ms was obtained on a computer B with CPU frequency FB
- If value VB is less than value VA, computer B is faster than computer A
>>Let's say I've set a target to outperform some "reference function" from CRT library
The MSVCRT.DLL sin() function, after argument checking, calls the x87 fsin instruction. I suppose that fsin also performs range reduction and input checking; its CPI is ~40-100 cycles. You also have to add the overhead of the library wrapper and the argument checking, so the CRT sin() will always be slower than our fastsin() implementation. Even x87 fsin alone is slower on average than fastsin(), because of range reduction and, if I am not wrong, coefficient pre-calculations done on the fly. This also explains the accuracy of fastsin() compared to CRT sin().
Here is a snippet of _CIsin, disassembled by IDA Pro; it is the compiler-optimized entry point behind many sin() calls.
[bash]push edx
.text:6FF759A3 fstcw [esp+4+var_4]
.text:6FF759A7 jz    loc_6FF7E5AC
.text:6FF759AD cmp   [esp+4+var_4], 27Fh
.text:6FF759B3 jz    short loc_6FF759BB
.text:6FF759B5 fldcw ds:word_6FF5C0D6
.text:6FF759BB
.text:6FF759BB loc_6FF759BB:          ; CODE XREF: sub_6FF759A2+11j
.text:6FF759BB fsin                   ; x87 instruction call
.text:6FF759BD fstsw ax
.text:6FF759C0 sahf
.text:6FF759C1 jp    loc_6FF7E552
.text:6FF759C7
.text:6FF759C7 loc_6FF759C7:          ; CODE XREF: sub_6FF759A2+8BC4j
.text:6FF759C7 cmp   dword_6FFF50C4, 0
.text:6FF759CE jnz   loc_6FF6C003
.text:6FF759D4 mov   edx, 1Eh
.text:6FF759D9 lea   ecx, unk_6FFF5390
.text:6FF759DF jmp   loc_6FF5EE3C
.text:6FF759DF sub_6FF759A2 endp
[/bash]
>>In real world testing CPU should execute such a loop in parallel with inside the loop statements
indeed, that's why it's not sensible to measure the timing of an empty loop as an estimate of its impact on a real loop: there are typically no conflicts between the GPR increment/compare and the FADD/FMUL ports; moreover, the branch is macro-fused and has a high prediction hit rate, so all in all the actual overhead is almost 0
a great tool to see how several instructions are executed in parallel, and to study your critical path, is IACA, available here:
http://software.intel.com/en-us/articles/intel-architecture-code-analyzer/
The CPU's internal logic will easily optimize such a case; this is a classic case where integer calculations are executed in parallel with floating-point instructions. The situation will be slightly different when a floating-point-based for-loop is used: there the loop's FP code could be interleaved with the main calculation body and executed by Port0 and Port1 in parallel with the loop's body.
Btw. thanks for the link
>>Your code ran at 18.6 ns per iteration, probably due to less aggressive compiler optimization...
I always disable ALL optimizations during testing because I need to evaluate results for 5 different C++ compilers and 6 different platforms.
This is not always true: CRT 'sin' outperforms some of my sine functions with 9 terms ( see Post #183, based on a Chebyshev polynomial ). This is not a surprise for me because CRT 'sin' is implemented in assembler.
ALL my sine functions are implemented in C.
>>This is not always true and CRT 'sin' outperforms some of my sine functions with 9 terms ( see Post #183, based on Chebyshev polynomial )
It could be possible, because the CPI of fsin varies between 40-100 cycles per instruction. If fsin can reach 40-50 CPI and your custom Chebyshev-based implementation is not optimized aggressively by the compiler, then the CRT sin() could run faster.
>>This is not a surprise for me because CRT 'sin' is implemented in assembler
MSVCRT.DLL sin() has some overhead: argument checking, comparing to 0, and branching on the result to the location which in the end calls x87 fsin (you can clearly see the call to fsin in my post above). When CRT sin() is called many times, the compiler can simply inline the code in the for-loop block; because of the almost 100% predictable backward branching, the assembly code for that overhead will already be in the cache, decoded to uops, which can greatly speed up the execution of CRT sin().
Btw the JVM's Math.sin also uses fsin, but range reduction is performed at the bytecode level, so it is slower than the MSVCRT sine functions.
>>I always disable ALL optimizations during testing because I need to evaluate results for 5 different C++ compilers and 6 different platforms
Why don't you try to run fastsin(), or your other sine functions, heavily optimized by the Intel compiler?
It would be very interesting to see the results of such a measurement.
>>Why don't you try to run fastsin(), or your other sine functions, heavily optimized by the Intel compiler?
>>It would be very interesting to see the results of such a measurement.
Iliya,
I work on a project and I can't do everything I want. I tested your 'FastSin' function and, unfortunately,
I need to move on to other things. Your set of special functions implemented in Java is good ( Thank you again! ) but
it has no practical application for the project. Please take a look at this thread for more details:
http://software.intel.com/en-us/forums/showthread.php?t=106134&o=a&s=lr
Post #3
Best regards,
Sergey
>>Let's say I've set a target to outperform some "reference function" from CRT library
>>MSVCRT.DLL sin() function, after argument checking, calls the x87 fsin instruction. I suppose that fsin also performs range reduction...
The 'FSIN' instruction accepts any value in the range from -2^63 to +2^63. If an input value is outside of that range, a reduction of
the input value must be done, and Intel recommends using the 'FPREM' instruction.
Best regards,
Sergey
>>Your set of special functions implemented in Java is good ( Thank you again! ) but it has no practical application for the project
Thank you, Sergey. Maybe in the foreseeable future you could implement some parts of the special functions class; I would be very glad to see it :)
What is your main area of software development? Is it numerical-methods software?
>>'FSIN' instruction accepts any values in the range from -2^63 to +2^63
I was talking about the range reduction (or mapping) of large radian values (input in -2^63..+2^63) to small values of |x|.
For sine, Horner's scheme is as follows (the constants c_n = ±1/n!, with alternating signs, for the usual Taylor approximation):
x2 = x*x;
result = ((((((c13*x2+c11)*x2+c9)*x2+c7)*x2+c5)*x2+c3)*x2+c1)*x;
In assembler (x and result via [eax]):
movsd xmm0,[eax]
movsd xmm1,xmm0 ; x
mulsd xmm0,xmm0
movsd xmm2,xmm0 ; x2
mulsd xmm0,[c13]
addsd xmm0,[c11]
mulsd xmm0,xmm2
addsd xmm0,[c9]
mulsd xmm0,xmm2
addsd xmm0,[c7]
mulsd xmm0,xmm2
addsd xmm0,[c5]
mulsd xmm0,xmm2
addsd xmm0,[c3]
mulsd xmm0,xmm2
addsd xmm0,[c1]
mulsd xmm0,xmm1
movsd [eax],xmm0
A simplified variant of Estrin's scheme is
x2 = x*x;
x4 = x2*x2;
result =
( (((+c11)*x4+c7)*x4+c3)*x2+
(((+c13)*x4+c9)*x4+c5)*x4+c1 )*x;
In assembler (x and result via [eax]):
movsd xmm0,[eax]
movsd xmm1,xmm0 ; x
mulsd xmm0,xmm0
movsd xmm2,xmm0 ; x2
mulsd xmm0,xmm0
movsd xmm3,xmm0
movsd xmm4,xmm0 ; x4
mulsd xmm0,[c13]
mulsd xmm3,[c11]
addsd xmm0,[c9]
addsd xmm3,[c7]
mulsd xmm0,xmm4
mulsd xmm3,xmm4
addsd xmm0,[c5]
addsd xmm3,[c3]
mulsd xmm0,xmm4
mulsd xmm3,xmm2
addsd xmm0,[c1]
addsd xmm0,xmm3
mulsd xmm0,xmm1
movsd [eax],xmm0
At least on Atom N450 and Core i7 920 this is a gain.
Here is the Horner scheme for polynomial approximation of the tan() function. The first four terms of the tan() Taylor series are shown below. It is not optimal in terms of speed because of the stack accesses used to calculate the powers of the variable x.
[bash]main_loop:
addps  xmm5,step
movups xmm7,xmm5
movups xmm0,xmm7        ;xmm0 accumulator
mulps  xmm7,xmm7
movups xmm6,xmm7        ;copy x^2
mulps  xmm7,xmm0        ;x^3
movups [ebp-16],xmm7    ;store x^3
movups xmm1,coef1
mulps  xmm1,xmm7
addps  xmm0,xmm1        ;x+x^3/3
movups xmm7,[ebp-16]
mulps  xmm7,xmm6        ;x^5
movups [ebp-32],xmm7    ;store x^5
movups xmm1,coef2
mulps  xmm1,xmm7
addps  xmm0,xmm1        ;x+x^3/3+2*x^5/15
movups xmm7,[ebp-32]
mulps  xmm7,xmm6        ;x^7
movups [ebp-48],xmm7    ;store x^7
movups xmm1,coef3
mulps  xmm1,xmm7
addps  xmm0,xmm1        ;+17*x^7/315
movups xmm7,[ebp-48]
mulps  xmm7,xmm6        ;x^9
movups [ebp-64],xmm7    ;store x^9
movups xmm1,coef4
mulps  xmm1,xmm7
addps  xmm0,xmm1        ;+62*x^9/2835[/bash]
Why do you store the odd powers of x? You don't need to store them, and you don't even need to reload them.
BTW: you should try to avoid modifying a register that is needed by the very next instruction - always keep a producer and its consumer as far apart as possible, especially when the producer has a long latency.
Right, I was thinking about my other implementation of the tan() function, written in C, when I made that comment. It is a straightforward implementation of the Taylor expansion with pre-calculated coefficients. It is not the Horner scheme, which starts with the last coefficient (highest index).
Btw, polynomial evaluation can also start from the lowest coefficient; I have seen it many times, for example in the FDLIBM implementation of the sine function.
>>Why do you store the odd powers of x? You don't need to store them and you even need not reload them
Because when I wrote this code I did not know how to efficiently multiply the powers of x without the stack accesses. I simply implemented various Taylor expansions in assembly without even thinking about code optimization.
>>It could be possible because the cpi of fsin varies between 40-100 cycles per instruction
that's one more reason for testing it over a sub-domain instead of a single value