Hello!
While coding series expansions of various functions in assembly, I tried to optimize my code mainly by using the rcpps instruction instead of divps. When testing code which calculates the sine function by its Taylor series, I did a few measurements as explained here: http://software.intel.com/en-us/forums/showthread.php?t=52482 and the result was between 10-12 cycles for the first term of the sine expansion (I used 14 terms).
I would like to ask how I can rewrite this code to improve its speed of execution.
[bash]movups xmm0,argument
movups xmm1,argument
mulps xmm1,xmm1      ; x^2
mulps xmm1,xmm0      ; x^3
mov ebx,OFFSET coef
movups xmm2,[ebx]
rcpps xmm3,xmm2      ; rcpps used instead of divps
mulps xmm1,xmm3      ; x^3 * (1/3!)
subps xmm0,xmm1      ; x - x^3/3!
[/bash]
"calls fastsin() 1e6 times; the result in milliseconds is 63"
So it's 63 ns per iteration, or ~120 clocks on your CPU; that doesn't match your previous reports IIRC.
If you keep only the polynomial (get rid of the strange domain check) you should begin to see timings nearer to mine.
1e6 iterations called with a loop counter.
start value of fastexp() is: 20642489
end value of fastexp() is: 20642520
delta of fastexp() is: 31 millisec
Iliya, I just completed reading all the latest posts from Page #9 and I'll try to do some testing with your 'FastSin' function
implemented in C. So, in which thread do you want to continue the discussion? At the moment you already have two threads...
Best regards,
Sergey
Glad to see you back. :)
I prefer to continue the discussion in this thread because it is already known and the most viewed.
I'm uploading a very interesting document named "K5 Transcendental Functions".
This PDF describes the testing and hardware implementation (in the AMD K5 CPU) of transcendental functions.
From this doc you can clearly see that the engineers used the Mathematica program as a reference model to test their fixed-precision implementation against. You can also learn that a Taylor expansion with pre-calculated coefficients was used.
Actually they say the coefficients were generated on the fly (page 165), hence the choice of Taylor series, which is easy and fast to compute.
"Also you can learn that taylor expansion with pre-calculated coefficients was used"
AFAIK modern methods use precomputed coefficients with minimax polynomials computed offline; for a software implementation the memory used by the coefficients is a non-issue anyway.
Also note that the range reduction algorithm they report isn't easily amenable to a software implementation since it works only with multiple-precision arithmetic. See Elementary Functions: Algorithms and Implementation, 2nd edition (Jean-Michel Muller, Birkhäuser), pages 173-191, for other range reduction methods like Cody and Waite or Payne and Hanek.
Another set of tests, this time the Stirling approximation of the gamma() function.
There are two implementations.
One of them is based on 1-D arrays, library pow() calls, and division inside a for-loop
to calculate the coefficients and to sum the series expansion. Both functions use 14 terms of the Stirling expansion.
The second method is faster because the filling of the coefficient arrays and the summation loop are eliminated,
and pow() library calls are not used. Instead the function relies on a simple calculation of
consecutive powers of the argument x, with each constant numerator divided by the corresponding power times its constant denominator.
Pre-calculation of the coefficients was not possible because the denominator depends on the argument x.
Accuracy when compared to Mathematica code is between 15-16 decimal places.
Code for the C implementation of the Stirling gamma function (slower version):
[bash]inline double gamma(double x){
  double sum,result;
  sum = 0;
  result = 0;
  if( x > gamma_huge){
    return (x-x)/(x-x); //NaN
  }else if( x < one){
    return (x-x)/(x-x);
  }else{
    double ln,power,pi_sqrt,two_pi,arg;
    two_pi = 2*Pi;
    double coef1[] = {1.0,1.0,139.0,571.0,163879.0,
      5246819.0,534703531.0,4483131259.0,
      432261921612371.0, 6232523202521089.0,
      25834629665134204969.0,1579029138854919086429.0,
      746590869962651602203151.0,1511513601028097903631961.0};
    double coef2[] = {12.0,288.0,51840.0,2488320.0,209018880.0,
      75246796800.0, 902961561600.0,86684309913600.0,
      514904800886784000.0, 86504006548979712000.0,
      13494625021640835072000.0,9716130015581401251840000.0,
      116593560186976815022080000.0,2798245444487443560529920000.0};
    int const size = 14;
    double temp[size];
    double temp2[size];
    double temp3[size];
    int i,j;
    long k;
    k = 0;
    for(i = 0;i
[/bash]
Results of 1e6 iterations of gamma() called with an argument incremented by 0.000001. Microsoft optimizing compiler:
gamma() start value is: 13214329
gamma() end value is: 13214688
execution time of gamma() 1e6 iterations is: 359 millisec
gamma 1.999999997453735200000000
Intel Compiler optimized for speed
gamma() start value is: 24295767
gamma() end value is: 24296016
execution time of gamma() 1e6 iterations is: 249 millisec
gamma 1.999999997453735200000000
Intel compiler settings
/Zi /nologo /W3 /MP /O2 /Ob2 /Oi /Ot /Oy /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /EHsc /GS- /Gy /arch:SSE2 /fp:precise /QxHost /Zc:wchar_t /Zc:forScope /Fp"Release\inline.c.pch" /FAcs /Fa"Release" /Fo"Release" /Fd"Release\vc100.pdb" /Gd /TP
Tested fastsin() today, 1e6 iterations, and the result was 15 millisec, i.e. ~33.39 cycles per iteration on my CPU. Finally the results are close to yours.
results
start val of fastsin() 13214314
end val of fastsin() 13214329
running time of fastsin() release code is: 15 millisec
fastsin() is: 0.891207360591512180000000
bronxzv
"AFAIK modern methods use precomputed coefficients with minimax polynomials computed offline, for a software implementation the memory used by the coefficients is a non-issue anyway."
When I studied the book "Numerical Recipes in C" I saw that many of its implementations are based on
for-loop coefficient calculation. In order to optimize the speed of execution the authors do not use library pow() calls for the powers. A few times I also saw division used instead of multiplication by a reciprocal.
Code for the faster version:
[bash]inline double fastgamma(double x){
  double sum,result;
  sum = 0;
  result = 0;
  if( x > gamma_huge){
    return (x-x)/(x-x); //NaN
  }else if( x < one){
    return (x-x)/(x-x);
  }else{
    double ln,power,pi_sqrt,two_pi,arg;
    two_pi = 2*Pi;
    double nom1,nom2,nom3,nom4,nom5,nom6,nom7,nom8,nom9,nom10;
    double denom1,denom2,denom3,denom4,denom5,denom6,denom7,denom8,denom9,denom10;
    double coef1,coef2,coef3,coef4,coef5,coef6,coef7,coef8,coef9,coef10;
    double z1,z2,z3,z4,z5,z6,z7,z8,z9,z10;
    nom1 = 1.0;
    nom2 = 1.0;
    nom3 = 139.0;
    nom4 = 571.0;
    nom5 = 163879.0;
    nom6 = 5246819.0;
    nom7 = 534703531.0;
    nom8 = 4483131259.0;
    nom9 = 432261921612371.0;
    nom10 = 6232523202521089.0;
    denom1 = 12.0;
    denom2 = 288.0;
    denom3 = 51840.0;
    denom4 = 2488320.0;
    denom5 = 209018880.0;
    denom6 = 75246796800.0;
    denom7 = 902961561600.0;
    denom8 = 86684309913600.0;
    denom9 = 514904800886784000.0;
    denom10 = 86504006548979712000.0;
    z1 = x;
    z2 = z1*z1;   //x^2
    z3 = z2*z1;   //x^3
    z4 = z3*z1;   //x^4
    z5 = z4*z1;   //x^5
    z6 = z5*z1;   //x^6
    z7 = z6*z1;   //x^7
    z8 = z7*z1;   //x^8
    z9 = z8*z1;   //x^9
    z10 = z9*z1;  //x^10
    ln = exp(-x);
    arg = x-0.5;
    power = pow(x,arg);
    pi_sqrt = sqrt(two_pi);
    sum = ln*power*pi_sqrt;
    coef1 = nom1/(z1*denom1);
    coef2 = nom2/(z2*denom2);
    coef3 = nom3/(z3*denom3);
    coef4 = nom4/(z4*denom4);
    coef5 = nom5/(z5*denom5);
    coef6 = nom6/(z6*denom6);
    coef7 = nom7/(z7*denom7);
    coef8 = nom8/(z8*denom8);
    coef9 = nom9/(z9*denom9);
    coef10 = nom10/(z10*denom10);
    result = one+coef1+coef2-coef3-coef4+coef5+coef6-coef7-coef8+coef9+coef10;
  }
  return sum*result;
}
[/bash]
Results of 1e6 iterations of fastgamma() called with argument 2.0, incremented by 0.000001:
fastgamma() start value is 13214688
fastgamma() end value is 13214875
execution time of fastgamma() 1e6 iterations is: 187 millisec
fastgamma 2.000000016396600500000000
The same test compiled with the Intel compiler:
fastgamma() start value is 24296016
fastgamma() end value is 24296157
execution time of fastgamma() 1e6 iterations is: 141 millisec
fastgamma 2.000000016396600500000000
Compiler settings
/Zi /nologo /W3 /MP /O2 /Ob2 /Oi /Ot /Oy /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /EHsc /GS- /Gy /arch:SSE2 /fp:precise /QxHost /Zc:wchar_t /Zc:forScope /Fp"Release\inline.c.pch" /FAcs /Fa"Release" /Fo"Release" /Fd"Release\vc100.pdb" /Gd /TP
I found a couple of notes related to some concerns about Performance and Security Cookie Checking ( or Buffer Security Checks )...
1. In Post #132 ( current thread ):
// ...Still as you can see native code is slower, maybe because of the overhead of the security cookie being checked on the stack...
2. In Post #142 ( current thread ):
// ...indeed starting a function with a useless "rep stosd" is not the best thing to do if you want fast code...
3. In the thread 'Java polynomial approx. faster than C code polynomial approx.':
http://software.intel.com/en-us/forums/showthread.php?t=105973&o=a&s=lr
In Post #5:
/Zi /nologo /W3 /WX- /O2 /Ob2 /Oi /Ot /Oy /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE"
/Gm- /EHsc /GS- /Gy /arch:SSE2 /fp:precise /Zc:wchar_t /Zc:forScope
/Fp"Release\inline.c.pch" /FAcs /Fa"Release" /Fo"Release" /Fd"Release\vc100.pdb" /Gd /analyze- /errorReport:queue
I see that the compiler option Buffer Security Check is disabled: '/GS-'.
Best regards,
Sergey
Hi Iliya,
Wouldn't it be better to have a different thread for it? For example, all discussions could be separated as follows:
'FastSin' in a Thread A
'FastGamma' in a Thread B
'FastExp' in a Thread C
etc.
I think in that case the content of the threads in the forum will be more consistent and useful for everybody.
Best regards,
Sergey
Sergey!
"Wouldn't it better to have a different thread for it? For example, all discussions could be separated as follows: 'FastSin' in a Thread A, 'FastGamma' in a Thread B, 'FastExp' in a Thread C"
I think that separating the discussion into different threads is a good idea.
Now I'm starting with the gamma function optimization and speed-of-execution testing.
I created a new thread on the gamma Stirling approximation. Can you try to test those functions?
I've made a comment regarding Buffer Security Checks ( compiler option /GS ) in Post #160. Please take a look.
The code was compiled in release mode with buffer security checks disabled. The earlier compilations were in Debug mode.
I'd like to make a comment regarding your method of error processing, that is:
...
return ( x-x ) / ( x-x );
...
For any 'x' the expression ( x-x ) / ( x-x ) evaluates to -1.#IND00000000000 ( a NaN ). Here is a Test-Case:
...
double x = -25.0L;
double y = ( x-x ) / ( x-x );
printf( "%.15f\n", y );
...
Output:
-1.#IND00000000000
CRT trigonometric functions 'sin' or 'cos' always return some value because they do a range reduction. Why did you decide
to use ( x-x ) / ( x-x )?
Best regards,
Sergey
P.S.
In the fastsin() function posted here the input checking is wrongly written. It should be:
if(x > (Pi/2)){
  return (x-x)/(x-x); //return NaN
}else if(x < (-Pi/2)){
  return (x-x)/(x-x); //return NaN
}else{
  // fastsin() code here ...
}
[SergeyK] "Almost the same ( 75% ) for a trigonometric API is in my case. The function simply returns a NaN value when an input is out of range..."
Hi Iliya,
I don't know if you have some requirements for your API or whether you're doing some R&D. So, I simply would like to express
my opinion, and it is as follows:
Let me mention the well-known fact that the sine function repeats itself every 2*PI radians. In the case of
large angles, a multiple of 2*PI can be subtracted so that the new value of the angle lies
between 0 and 2*PI. Then let's think about practical applications of some API. What if somebody needs to calculate
the sine of 750 degrees ( which reduces to PI/6, or 30 degrees )? It is possible to make a note that it is mandatory for a user
to do their own range reduction, but the CRT function 'sin' would be very impractical if it didn't do range reduction and
everybody were forced to do it themselves. It would create a "perfect" case for inconsistencies of FP operations
with trigonometric functions, and everybody would complain about it.
Best regards,
Sergey
I decided to apply a more flexible approach when working on the implementation of the trigonometric API:
- For a set of macros:
  input value is in radians
  there is no range reduction
  always returns some value
  speed rating ( ***** )
- For a set of methods in a template class:
  input value could be in radians or in degrees
  there is no range reduction
  always returns some value
  speed rating ( **** )
- For a set of non-template external functions:
  input value could be in radians or in degrees
  there is a range reduction
  always returns some value
  speed rating ( *** )
All these sets are implemented in C, absolutely portable, faster than the standard CRT functions ( speed rating *** ),
and less accurate than the standard CRT functions.
Best regards,
Sergey
And one more comment regarding performance...
Performance of different sine functions compared to the CRT function 'sin' when tested on different platforms:
- A normalized Taylor series up to the 7th term is the fastest implementation
for Windows Desktop platforms
- A normalized series up to the 7th term is the fastest implementation
for Windows CE platforms
- A normalized Chebyshev polynomial up to the 9th term is the fastest implementation
for 16-bit platforms ( like MS-DOS )