- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hello!
While coding in assembly various series expansions of many functions i tried to optimize my code mainly by using rcpps instruction instead of divaps.When performing tests oncode which calculates sine function by taylor series i did a few measurements as explained here :"http://software.intel.com/en-us/forums/showthread.php?t=52482" and the result was between 10-12 cycles per first term of sine expansion(i used 14 terms).
I would like to ask you how can i rewrite this code in order to gain speed of execution improvment.
[bash]movups xmm0,argument movups xmm1,argument mulps xmm1,xmm1 mulps xmm1,xmm0 mov ebx,OFFSET coef movups xmm2,[ebx] rcpps xmm3,xmm2 ;rcpps used instead of divaps mulps xmm1,xmm3 subps xmm0,xmm1[/bash]
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
so it's 63 ns per iteration or ~ 120 clocks on your CPU, it does't match your previous reports IIRCcalls 1e6 times fastsin() the result in millisecond is 63
if you keep only the polynomial (get rid of the strange domain check) you should begin to see timings nearer than mine
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I'll provide you with performance numbers, compared to CRT-function 'sin', for:
 Normalized Series ( up to 7th term )
 Normalized Series ( up to 9th term )
 Chebyshev Polynomial ( up to 7th term )
 Chebyshev Polynomial ( up to 9th term )
 Normalized Chebyshev Polynomial ( up to 7th term )
 Normalized Chebyshev Polynomial ( up to 9th term )
 Normalized Taylor Series ( up to 7th term )
 Normalized Taylor Series ( up to 9th term )
MyAPI isimplemented in C as functions and macros.
Best regards,
Sergey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
As far i know CRT- sin uses taylor approximation(polynomial multiplication)up to 14th term combined with excessive range reduction and overflow checking so we have a lot of branches in that code.
Sadly my implementation of sin(written in java)is based on taylor series it is simply for loop with pow library function called from inside the loop so it is not so fast as others algorithms I need to rewrite it.
Sergey while comparing various approximations i think that good idea is to code it in assembly language in order to remove compiler-generated code when compiled from high-level language.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
AFAIK Taylor seriesaremuch worse thanother approximations such as Legendre/Chebyshev/Jacobi polynomials, or best of allto minimize the absolute error: Minimax approximations
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
x + x**3 * (a3 + x**2 * (a5 + .....
for the case a0==0 a1=1 a2==0 ....
As others indicated, Cheybyshev or mini-max economization of the polynomial gives you better convergence (and accuracy).
You've demonstrated fully that beginning by concentrating on intrinsics without getting the algorithm right is pointless.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I wonder how is it possible?All powers of the Taylorseries forsin(X)are odd:
sin(X) ~= X - X^3/3! + X^5/5! - X^7/7! + X^9/9! - X^11/11! + X^13/13! - X^15/15! + X^17/17! - ...
Unless it is partially normalized:
sin(X) ~= X* (1 - X^2/3! + X^4/5! - X^6/7! + X^8/9! - X^10/11! + X^12/13! - X^14/15! )
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I'll provide you with performance numbers, compared to CRT-function 'sin', for:
 Normalized Series ( up to 7th term )
 Normalized Series ( up to 9th term )
 Chebyshev Polynomial ( up to 7th term )
 Chebyshev Polynomial ( up to 9th term )
 Normalized Chebyshev Polynomial ( up to 7th term )
 Normalized Chebyshev Polynomial ( up to 9th term )
 Normalized Taylor Series ( up to 7th term )
 Normalized Taylor Series ( up to 9th term )
MyAPI isimplemented in C as functions and macros.
Here are results for C functions:
Notes:
1. Visual Studio 2005 / Microsoft C++ compiler / Release configuration / All Optimizations Disabled
2. All tests for a Double-Precision data type ( 'double' ) / FPU precision was set to 53-bits
3. Every test was executed 4,194,304 ( 2^22 ) times with a Process priority set to Normal
4. These functions without verifications ofinput parameters
5. A '7t'means up to 7th term
6. Interpolated Sine function uses a 1.00 degree step for a LUT of Sine Values generated by a CRT-function 'sin'
7. AE ( Absolute Error )= User-function-Value - CRT-function-Value
8. BT - Best Time / BA - Best Accuracy
Best regards,
Sergey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi Iliya,
Take into account that accuracy of approximation if sine valuesvaries. This is how it looks like for a range from 0 to 90 degrees
for a Normalized Chebyshev Polynomial up to 9th term:
[cpp]... Normalized Chebyshev Polynomial 9t CRT-function 'sin' Sin( 0) = 0.00000000000000000000 = 0.00000000000000000000 - AE = 0.000000000000000000 Sin( 1) = 0.01745240642776235700 = 0.01745240643728351200 - AE = -0.000000000009521155 Sin( 2) = 0.03489949668347069200 = 0.03489949670250096900 - AE = -0.000000000019030277 Sin( 3) = 0.05233595621442847100 = 0.05233595624294383500 - AE = -0.000000000028515364 Sin( 4) = 0.06975647370616093500 = 0.06975647374412530200 - AE = -0.000000000037964368 Sin( 5) = 0.08715574270029286000 = 0.08715574274765816600 - AE = -0.000000000047365306 Sin( 6) = 0.10452846321094722000 = 0.10452846326765347000 - AE = -0.000000000056706251 Sin( 7) = 0.12186934333917229000 = 0.12186934340514748000 - AE = -0.000000000065975184 Sin( 8) = 0.13917310088490520000 = 0.13917310096006544000 - AE = -0.000000000075160239 Sin( 9) = 0.15643446495598134000 = 0.15643446504023087000 - AE = -0.000000000084249524 Sin(10) = 0.17364817757369913000 = 0.17364817766693033000 - AE = -0.000000000093231201 Sin(11) = 0.19080899527445139000 = 0.19080899537654480000 - AE = -0.000000000102093417 Sin(12) = 0.20791169070693508000 = 0.20791169081775934000 - AE = -0.000000000110824266 Sin(13) = 0.22495105422445330000 = 0.22495105434386500000 - AE = -0.000000000119411703 Sin(14) = 0.24192189547182422000 = 0.24192189559966773000 - AE = -0.000000000127843514 Sin(15) = 0.25881904496641395000 = 0.25881904510252074000 - AE = -0.000000000136106793 Sin(16) = 0.27563735567281122000 = 0.27563735581699916000 - AE = -0.000000000144187939 Sin(17) = 0.29237170457066503000 = 0.29237170472273677000 - AE = -0.000000000152071744 Sin(18) = 0.30901699421520712000 = 0.30901699437494740000 - AE = -0.000000000159740277 Sin(19) = 0.32556815428998409000 = 0.32556815445715670000 - AE = -0.000000000167172609 Sin(20) = 0.34202014315132739000 = 0.34202014332566871000 - AE = -0.000000000174341319 Sin(21) = 0.35836794936408850000 = 0.35836794954530027000 - AE = -0.000000000181211768 Sin(22) = 0.37460659322817519000 = 0.37460659341591201000 - AE = -0.000000000187736826 Sin(23) = 0.39073112829542028000 = 0.39073112848927377000 - AE = -0.000000000193853489 Sin(24) = 0.40673664287632461000 = 0.40673664307580021000 - AE = -0.000000000199475603 Sin(25) = 0.42261826153621362000 = 0.42261826174069944000 - AE = -0.000000000204485817 Sin(26) = 0.43837114658035231000 = 0.43837114678907740000 - AE = -0.000000000208725093 Sin(27) = 0.45399049952756826000 = 0.45399049973954675000 - AE = -0.000000000211978490 Sin(28) = 0.46947156257193301000 = 0.46947156278589081000 - AE = -0.000000000213957796 Sin(29) = 0.48480962003205796000 = 0.48480962024633706000 - AE = -0.000000000214279094 Sin(30) = 0.49999999978756438000 = 0.49999999999999994000 - AE = -0.000000000212435569 Sin(31) = 0.51503807470229079000 = 0.51503807491005416000 - AE = -0.000000000207763362 Sin(32) = 0.52991926403380607000 = 0.52991926423320490000 - AE = -0.000000000199398831 Sin(33) = 0.54463903482879883000 = 0.54463903501502708000 - AE = -0.000000000186228255 Sin(34) = 0.55919290330392202000 = 0.55919290347074690000 - AE = -0.000000000166824887 Sin(35) = 0.57357643621167165000 = 0.57357643635104605000 - AE = -0.000000000139374401 Sin(36) = 0.58778525219088762000 = 0.58778525229247314000 - AE = -0.000000000101585518 Sin(37) = 0.60181502310146640000 = 0.60181502315204827000 - AE = -0.000000000050581872 Sin(38) = 0.61566147534288251000 = 0.61566147532565829000 - AE = 0.000000000017224222 Sin(39) = 0.62932039115611915000 = 0.62932039104983739000 - AE = 0.000000000106281761 Sin(40) = 0.64278760990861683000 = 0.64278760968653925000 - AE = 0.000000000222077579 Sin(41) = 0.65605902936184934000 = 0.65605902899050728000 - AE = 0.000000000371342068 Sin(42) = 0.66913060692114923000 = 0.66913060635885824000 - AE = 0.000000000562290992 Sin(43) = 0.68199836086740095000 = 0.68199836006249848000 - AE = 0.000000000804902478 Sin(44) = 0.69465837157023880000 = 0.69465837045899725000 - AE = 0.000000001111241543 Sin(45) = 0.70710678268237914000 = 0.70710678118654746000 - AE = 0.000000001495831681 Sin(46) = 0.71933980231473449000 = 0.71933980033865108000 - AE = 0.000000001976083408 Sin(47) = 0.73135370419195334000 = 0.73135370161917046000 - AE = 0.000000002572782876 Sin(48) = 0.74314482878804677000 = 0.74314482547739424000 - AE = 0.000000003310652530 Sin(49) = 0.75470958444175995000 = 0.75470958022277201000 - AE = 0.000000004218987937 Sin(50) = 0.76604444845135888000 = 0.76604444311897801000 - AE = 0.000000005332380870 Sin(51) = 0.77714596814850889000 = 0.77714596145697090000 - AE = 0.000000006691537990 Sin(52) = 0.78801076195092989000 = 0.78801075360672201000 - AE = 0.000000008344207880 Sin(53) = 0.79863552039351415000 = 0.79863551004729283000 - AE = 0.000000010346221324 Sin(54) = 0.80901700713761193000 = 0.80901699437494745000 - AE = 0.000000012762664481 Sin(55) = 0.81915205995818574000 = 0.81915204428899180000 - AE = 0.000000015669193942 Sin(56) = 0.82903759170854807000 = 0.82903757255504174000 - AE = 0.000000019153506337 Sin(57) = 0.83867059126240617000 = 0.83867056794542405000 - AE = 0.000000023316982123 Sin(58) = 0.84804812443294242000 = 0.84804809615642596000 - AE = 0.000000028276516462 Sin(59) = 0.85716733486866747000 = 0.85716730070211233000 - AE = 0.000000034166555141 Sin(60) = 0.86602544492579625000 = 0.86602540378443860000 - AE = 0.000000041141357654 Sin(61) = 0.87461975651689605000 = 0.87461970713939574000 - AE = 0.000000049377500311 Sin(62) = 0.88294765193557789000 = 0.88294759285892688000 - AE = 0.000000059076651016 Sin(63) = 0.89100659465699772000 = 0.89100652418836779000 - AE = 0.000000070468629931 Sin(64) = 0.89879413011395126000 = 0.89879404629916704000 - AE = 0.000000083814784224 Sin(65) = 0.90630788644835647000 = 0.90630778703664994000 - AE = 0.000000099411706533 Sin(66) = 0.91354557523791691000 = 0.91354545764260087000 - AE = 0.000000117595316040 Sin(67) = 0.92050499219778437000 = 0.92050485345244037000 - AE = 0.000000138745343992 Sin(68) = 0.92718401785703108000 = 0.92718385456678742000 - AE = 0.000000163290243660 Sin(69) = 0.93358061820976956000 = 0.93358042649720174000 - AE = 0.000000191712567821 Sin(70) = 0.93969284534074993000 = 0.93969262078590832000 - AE = 0.000000224554841610 Sin(71) = 0.94551883802529291000 = 0.94551857559931674000 - AE = 0.000000262425976172 Sin(72) = 0.95105682230340971000 = 0.95105651629515353000 - AE = 0.000000306008256179 Sin(73) = 0.95630511202798463000 = 0.95630475596303544000 - AE = 0.000000356064949192 Sin(74) = 0.96126210938689549000 = 0.96126169593831889000 - AE = 0.000000413448576597 Sin(75) = 0.96592630539896773000 = 0.96592582628906831000 - AE = 0.000000479109899421 Sin(76) = 0.97029628038365956000 = 0.97029572627599647000 - AE = 0.000000554107663087 Sin(77) = 0.97437070440439277000 = 0.97437006478523525000 - AE = 0.000000639619157528 Sin(78) = 0.97814833768545340000 = 0.97814760073380558000 - AE = 0.000000736951647817 Sin(79) = 0.98162803100239626000 = 0.98162718344766398000 - AE = 0.000000847554732286 Sin(80) = 0.98480872604589964000 = 0.98480775301220802000 - AE = 0.000000973033691620 Sin(81) = 0.98768945575902911000 = 0.98768834059513777000 - AE = 0.000001115163891341 Sin(82) = 0.99026934464787952000 = 0.99026806874157036000 - AE = 0.000001275906309162 Sin(83) = 0.99254760906557782000 = 0.99254615164132198000 - AE = 0.000001457424255835 Sin(84) = 0.99452355746963761000 = 0.99452189536827329000 - AE = 0.000001662101364319 Sin(85) = 0.99619659065267530000 = 0.99619469809174555000 - AE = 0.000001892560929750 Sin(86) = 0.99756620194650159000 = 0.99756405025982420000 - AE = 0.000002151686677387 Sin(87) = 0.99863197739962362000 = 0.99862953475457383000 - AE = 0.000002442645049783 Sin(88) = 0.99939359592819632000 = 0.99939082701909576000 - AE = 0.000002768909100559 Sin(89) = 0.99985082944048176000 = 0.99984769515639127000 - AE = 0.000003134284090489 Sin(90) = 1.00000354293488590000 = 1.00000000000000000000 - AE = 0.000003542934885914 ... [/cpp]
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Precision Control of FPU also needs to be taken into account and here is example ( for 'double' data type ):
[cpp] 24-bit accuracy CRT Sin( 45.0 ) = 0.7071067966408575200000 Normalized: Chebyshev Polynomial 9t Sin( 45.0 ) = 0.7071067690849304200000 53-bit accuracy CRT Sin( 45.0 ) = 0.7071067811865474600000 Normalized: Chebyshev Polynomial 9t Sin( 45.0 ) = 0.7071067826823791400000 64-bit accuracy CRT Sin( 45.0 ) = 0.7071067811865474600000 Normalized: Chebyshev Polynomial 9t Sin( 45.0 ) = 0.7071067826823791400000 Default accuracy CRT Sin( 45.0 ) = 0.7071067811865474600000 Normalized: Chebyshev Polynomial 9t Sin( 45.0 ) = 0.7071067826823791400000 [/cpp]
As you can see 53-bit, 64-bit and Default results are identical. I always recommend to useDefault settings
for FPU and in that case results on different platforms with different C/C++ compilers will be highly consistent!
Best regards,
Sergey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
The hardware default for x87 FPU is 64-bit precision mode, but Windows and Intel compiler practice is to set (by "default") 53-bit precision (which gives same results as SSE2 double, aside from exponent range). Microsoft libraries don't support the Intel options /Qlong-double -pc80, so their results aren't predictable for that case; printf will see only 64-bit doubles, never 80-bit long doubles.
gnu compilers don't set 53-bit precision mode in their run-time initialization, so there you get 64-bit precision mode. linux glibc libraries have full support for long-double 64-bit precision mode. mingw compilers rely on Microsoft libraries, thus no library support for long double.
When printf formats a double, IEEE standards support zeroing out the insignificant decimal digits, but nothing requires that various compilers make this same choice.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I mustsay that you have made a great job thanks for posting that. I have a few question regarding your methodology.
How did you make a running-time measurement? Did you use KeGetTickCount function,or maybe an inline rdtsc?
For interpolated sine what interpolation did you use linear interpolation or lagrange interpolation(quadratic or cubic)?
For normalized Taylor Series did you use polynomial fit or maybe for loop and how you calculated powers of an rad argument (many times i used calls to library function pow , but it is non efficient method ).
Were chebyshev polynomial's coefficients pre-calculated , or did your measuremant take into account calculation of chebyshev coefficients ( many calls to library cos).
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
CRT sin implementation is more than two times slower than your library normalized taylor series sin.I think that extensive range reduction ,input checking and overflow checking are responsible for that.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Take into account that accuracy of approximation if sine valuesvaries. This is how it looks like for a range from 0 to 90 degrees
for a Normalized Chebyshev Polynomial up to 9th term
If I 'am not wrong CRT sin function uses or is based on FDLIBM implementation.I can see that AE rises when an angle is close to 90 degrees.When I waschecking my librarysine function for an accuracy I also saw
such a behaviour when an argument was closer toPi/2 error was largerwhen compared tosmaller angles closer to 0.
I suppose that this is taylor series inducedinaccuracy.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
For quick and dirty usage, a rational approximation may be better if it is to cover the entire range |x| < Pi/2. As tan() lends itself to a rational approximation (with odd polynomial in numerator and even in denominator), the trigonometric formulae can be used to deduce the appropriate form for cos() sin()....
Minimax economization of numerator and denominator polynomials is needed to make this competitive.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
inacurracies.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
TimP
Slightly off-topic question.Is it possible for Intel profiler (the one that is part of Parallel Studio) to profile java class file?I 'am interested in finding hot-spots ,function calls count and cpu-time spent while calculating various pieces of code.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
[cpp]inline double EvalPoly5(double a0, double a1, double a2, double a3, double a4, double a5, double x) // Returns a5*x^5 + a4*x^4 + a3*x^3 + a2*x^2 + a1*x + a0 { return ((((a5*x+a4)*x+a3)*x+a2)*x+a1)*x+a0; } inline double SinApprox(double x) // 0 <= x <= Pi/2 { const double a0 = 1.0, a2 = -.1666666664, a4 = 0.0083333315, a6 = -.0001984090, a8 = .0000027526, a10 = -.0000000239; return EvalPoly5(a0,a2,a4,a6,a8,a10,x*x)*x; }[/cpp]
as you can see this kind ofcode is straightforward to vectorize and the polynomial evaluation is a perfect fit for FMA
 
					
				
				
			
		
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page