Hello!
While coding various series expansions of many functions in assembly, I tried to optimize my code mainly by using the RCPPS instruction instead of DIVPS. When performing tests on code which calculates the sine function by its Taylor series, I did a few measurements as explained here: http://software.intel.com/en-us/forums/showthread.php?t=52482, and the result was between 10 and 12 cycles for the first term of the sine expansion ( I used 14 terms ).
I would like to ask how I can rewrite this code to improve its execution speed.
[bash]movups xmm0, argument
movups xmm1, argument
mulps  xmm1, xmm1          ; x^2
mulps  xmm1, xmm0          ; x^3
mov    ebx, OFFSET coef
movups xmm2, [ebx]         ; load the 3! coefficients
rcpps  xmm3, xmm2          ; rcpps used instead of divps
mulps  xmm1, xmm3          ; x^3 * ( 1/3! )
subps  xmm0, xmm1          ; x - x^3/3!
[/bash]
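Since RCPPS returns only an ~12-bit approximation of the reciprocal, the usual trick is to refine it with one Newton-Raphson step, which recovers nearly full single precision while still avoiding DIVPS. A minimal sketch with SSE intrinsics ( my illustration, not the code above; `x` stands for the divisor vector ):

[cpp]#include <xmmintrin.h>

// Refine the ~12-bit RCPPS estimate with one Newton-Raphson step:
// r' = r * ( 2 - x * r ), which gives ~23 correct bits of 1/x.
static inline __m128 reciprocal_nr( __m128 x )
{
    __m128 r = _mm_rcp_ps( x );    // ~12-bit estimate of 1/x
    return _mm_mul_ps( r,
           _mm_sub_ps( _mm_set1_ps( 2.0f ),
                       _mm_mul_ps( x, r ) ) );
}[/cpp]

Better still, since the series coefficients 1/3!, 1/5!, ... are constants, their reciprocals can simply be precomputed and stored, removing both DIVPS and RCPPS from the inner loop entirely.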
>>...calls 1e6 times fastsin() the result in millisecond is 63...
So it's 63 ns per iteration, or ~120 clocks on your CPU; it doesn't match your previous reports, IIRC.
If you keep only the polynomial ( get rid of the strange domain check ) you should begin to see timings nearer to mine.
I used a for loop with a compound statement calculating factorials, and Math.pow library calls to calculate the powers of the argument. The results were summed in an array. As I said, very slow compared to your method.
For example, here is code for the Gamma Stirling approximation ( formula and coefficients taken from Abramowitz and Stegun ), written in Java.
[bash]public double GammaStirlingApprox(double x) {
    if (Double.isNaN(x) || Double.isInfinite(x)) {
        return Double.NaN;
    }
    double two_pi = 2 * Math.PI;
    // Denominators and numerators of the correction terms ( A&S 6.1.37 ).
    // The tail of this method was cut off in the original post, so the
    // summation loop below is a reconstruction of the same series.
    double[] coef  = { 12.0, 288.0, 51840.0, 2488320.0, 209018880.0,
                       75246796800.0, 902961561600.0, 86684309913600.0 };
    double[] coef2 = { 1.0, 1.0, 139.0, 571.0, 163879.0, 5246819.0,
                       534703531.0, 4483131259.0 };
    // The signs of the series alternate in pairs: + + - - + + - -
    double[] sign  = { 1.0, 1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0 };
    double sum  = 1.0;
    double xpow = x;
    for (int i = 0; i < coef.length; i++) {
        sum  += sign[i] * coef2[i] / (coef[i] * xpow);  // c_n / ( d_n * x^(n+1) )
        xpow *= x;
    }
    return Math.sqrt(two_pi / x) * Math.pow(x / Math.E, x) * sum;
}[/bash]
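As a quick sanity check, GammaStirlingApprox(5.0) should return approximately 24.0 ( Gamma(5) = 4! = 24 ).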
Hi everybody. This is my consolidated response to posts from Iliyapolak:
>>...How did you make a running-time measurement?
In code it looks like this:
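( A minimal sketch of a GetTickCount-based measurement, illustration only; sin() stands in for the function under test: )

[cpp]#include <windows.h>
#include <math.h>
#include <stdio.h>

int main( void )
{
    const int iterations = ( 1 << 22 );   // 2^22 calls, as used in the tests
    volatile double r = 0.0;              // defeats optimizing the calls away
    DWORD start = GetTickCount();
    for( int i = 0; i < iterations; i++ )
        r += sin( 0.000001 * i );         // stand-in for the function under test
    DWORD diff = GetTickCount() - start;
    printf( "Completed in %lu ms ( r = %f )\n", ( unsigned long )diff, ( double )r );
    return 0;
}[/cpp]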
>>...Did you use KeGetTickCount function, or maybe an inline rdtsc?
Please see above. I use a Win32 API function 'GetTickCount'.
>>...For interpolated sine, what interpolation did you use? Linear interpolation?..
Yes.
>>Lagrange interpolation?..
No, unfortunately, it is on hold for a long time.
>>...For Normalized Taylor Series did you use polynomial fit or maybe for loop and how you calculated powers...
In code the powers are produced by a nested ( Horner ) form rather than by Math.pow; see the macro sketch below.
>>...Were Chebyshev polynomial's coefficients pre-calculated...
Yes.
>>...did your measurement take into account calculation of the Chebyshev coefficients...
No, because all Chebyshev coefficients are pre-calculated and saved in constants.
>>...( many calls to library cos)...
No, a different method ( Telescoping a Series ) was used.
>>...CRT sin implementation is more than two times slower than your library Normalized Taylor series sin
>>I think that extensive range reduction, input checking and overflow checking are responsible for that...
Yes, that's correct. A special set of macros that calculate sine, cosine, etc., is even faster, because all
overheads ( making the call, creating the stack frame, cleaning up, returning ) are minimized.
>>...I suppose that Sergey did some averaging of error values. He wrote that he ran his tests 2^22 times,
>>but he did not say how many of the 2^22 test values were averaged...
No. For example, for all calls like 'CalcNTS9(45)' the Absolute Error is the same.
>>...I used for loop with compound statement calculating factorials and Math.pow library calls to calculate
>>powers of an argument. The results were summed in array. As I said very slow compared to your method...
I agree that this affects the performance of your calculations. As another example, this is how such a macro declaration
could look:
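( A minimal sketch, assuming the normalized-series coefficients shown later in this thread and an argument given in degrees; my reconstruction, not the library's actual macro: )

[cpp]// Illustrative macro: expands the Normalized Taylor Series for sine
// ( up to the 9th term ) inline, so there is no call/stack/return overhead.
// The argument is normalized as t = degrees / 90.
#define CALC_SIN_NTS9( degrees, result )                                    \
{                                                                           \
    double t  = ( degrees ) / 90.0;                                         \
    double t2 = t * t;                                                      \
    ( result ) = t * ( 1.57079632679489662 - t2 * ( 0.64596409750624625    \
               - t2 * ( 0.07969262624616705 - t2 * ( 0.00468175413531869   \
               - t2 *   0.00016044118478736 ) ) ) );                        \
}[/cpp]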
>>...We can see that the error vary wildly within a range of values, that's why when comparing
>>algorithms the standard methods are to compare the maximum absolute error or the average
>>absolute error over the domain...
It could be a set of different statistical methods like:
- Min error
- Max error
- Average error
- Standard Deviation of all errors
and it depends on how important that analysis is for a given project. There is no question that for mission-critical
applications it must be done.
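As a sketch of such an analysis over [ 0.0; 90.0 ] degrees ( my illustration; single-precision sinf() plays the role of the approximation and double-precision sin() the reference ):

[cpp]#include <math.h>
#include <stdio.h>

int main( void )
{
    const int n = 9001;                   // 0.01-degree steps over [0, 90]
    double minAE = 1.0e300, maxAE = 0.0, sum = 0.0, sumSq = 0.0;
    for( int i = 0; i < n; i++ )
    {
        double rad = ( i * 0.01 ) * 3.14159265358979323846 / 180.0;
        double ae  = fabs( ( double )sinf( ( float )rad ) - sin( rad ) );
        if( ae < minAE ) minAE = ae;
        if( ae > maxAE ) maxAE = ae;
        sum   += ae;
        sumSq += ae * ae;
    }
    double avg = sum / n;
    printf( "Min AE = %.17e\nMax AE = %.17e\nAverage AE = %.17e\nStd Dev = %.17e\n",
            minAE, maxAE, avg, sqrt( sumSq / n - avg * avg ) );
    return 0;
}[/cpp]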
>>...I can't get much information from what you just said. I don't know whether you mean hardware or
>>software default, in the case of x87 FPU...
I control precision with the CRT function '_control87', because '_control87' changes the control words
for both the x87 and the SSE2 units. In code it looks like this:
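( A minimal sketch of such a call; _control87 is declared in <float.h> in the MSVC CRT: )

[cpp]#include <float.h>

void SetPrecision53( void )
{
    // _PC_53 selects 53-bit ( double ) precision; _MCW_PC is the
    // precision-control mask. The return value is the previous
    // control word, which can be passed back later to restore it.
    unsigned int prev = _control87( _PC_53, _MCW_PC );
    ( void )prev;
}[/cpp]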
>>...MinGW compilers rely on Microsoft libraries...
Please take a look at a case with some inconsistency... I missed it about 1.5 years ago during intensive testing and
detected it only recently.
I have a couple of tangent methods in a template class and here are results when a 'float' data type
is used:
Summary is as follows:
A Win32 console application that uses MSVCR80.DLL calculated Tan(45) using Normalized Taylor Series ( up to 9th term ) method:
msF.TanNTS9(45) = 0.9999999403953552200000
Compiled by Microsoft C++ compiler on Windows XP Desktop 32-bit platform.
A Win32 console application that uses MSVCRT.DLL calculated Tan(45) using Normalized Taylor Series ( up to 9th term ) method:
msF.TanNTS9(45) = 1.0000000000000000000000
Compiled by MinGW C++ compiler on Windows XP Desktop 32-bit platform.
At the moment I can't explain why the result of Tan(45) is not equal to '1.0'. I suppose this is due to some differences in
the CRT function '_control87' in these two DLLs, but I could be wrong.
Does your library also contain various implementations of special functions?
For example, here is my code which calculates the Bessel function of the first kind, J0(x). The polynomial approximation is taken from Abramowitz and Stegun, p. 369, 9.4.1.
The first part of the code contains for loops with array multiplications, but the second part is based on polynomial multiplications. I would like to ask you how to accurately measure the running time of these two parts of the code.
[bash]public double BesselJ0(double x) {
    // Polynomial approximation from Abramowitz and Stegun 9.4.1,
    // valid for -3 <= x <= 3 with |error| < 5e-8. The tail of this
    // method was cut off in the original post; the summation loop
    // below is a reconstruction of the same polynomial.
    if (x >= -3 && x <= 3) {
        double third = x / 3;
        double t2 = third * third;
        double[] coef = { -2.2499997, 1.2656208, -0.3163866,
                           0.0444479, -0.0039444, 0.0002100 };
        double sum  = 1.0;
        double tpow = t2;
        for (int i = 0; i < coef.length; i++) {
            sum  += coef[i] * tpow;   // coef[i] * (x/3)^( 2*(i+1) )
            tpow *= t2;
        }
        return sum;
    }
    return Double.NaN;                // outside the range of 9.4.1
}[/bash]
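As a quick check, BesselJ0(0.0) returns 1.0, and BesselJ0(2.4048) should be very close to 0.0, since the first zero of J0 lies at x ≈ 2.40483.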
Your first quoted result looks as if it had been calculated in float (24-bit precision). There is a _control87 setting which does that.
No.
>>...I would like to ask you how to accurately measure the running time of these two parts of the code...
You could consider two versions; here is pseudocode for both:
Version 1 - Uses Win32 API function GetTickCount
...
Declare Start, End, Diff1, Diff2 variables ( int32 )
Function()
{
Start = GetTickCount()
//Codes for Part1...
End = GetTickCount()
Diff1 = End - Start
Start = GetTickCount()
//Codes for Part2...
End = GetTickCount()
Diff2 = End - Start
return
}
...
Version 2 - Uses RDTSC instruction ( more accurate measurements )
...
Declare Start, End, Diff1, Diff2 variables ( int64 )
Function()
{
Start = RDTSC
//Codes for Part1...
End = RDTSC
Diff1 = End - Start
Start = RDTSC
//Codes for Part2...
End = RDTSC
Diff2 = End - Start
return
}
...
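For reference, here is Version 2 as compilable MSVC code, using the __rdtsc() intrinsic ( a sketch; the two '...Part...' comments stand for the code under test ):

[cpp]#include <intrin.h>
#include <stdio.h>

int main( void )
{
    unsigned __int64 start, diff1, diff2;

    start = __rdtsc();
    // ...codes for Part 1...
    diff1 = __rdtsc() - start;

    start = __rdtsc();
    // ...codes for Part 2...
    diff2 = __rdtsc() - start;

    printf( "Part 1: %llu clocks\nPart 2: %llu clocks\n", diff1, diff2 );
    return 0;
}[/cpp]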
Sergey!
I cannot directly use those functions mentioned by you because my code is written in Java, but the time-difference calculation can be performed exactly as you wrote. AFAIK the Java system library contains a high-precision time-measurement method ( System.nanoTime() ), which I think is translated by jvm.dll directly to RDTSC; the overhead of such a translation could be significant only if the JVM JIT compiler does not cache the translation ( does not recognize it as hot-spot code ).
>>...I suppose that this is Taylor series induced inaccuracy...
I would also mention these two problems:
- Limitations of the IEEE 754 standard
- These series and polynomials only approximate the curve of some function
Consider an equation for the Normalized Series up to the 9th term that calculates sine:
Sin(x) ~= x*( a - x^2*( b - x^2*( c - x^2*( d - x^2*e )))) ( S )
All coefficients are related to PI, that is, a = PI/2, b = ((PI/2)^3)/3!, and so on.
In equation ( S ) two sets of coefficients could be used. For example:
Set of Coefficients 1          Set of Coefficients 2
less accurate ( SC1 )          very accurate ( SC2 )
a = 1.5707963290000            a = 1.57079632679489661923132169163980000
b = 0.6459640960000            b = 0.64596409750624625365575656389795000
c = 0.0796926259900            c = 0.07969262624616704512050554949047800
d = 0.0046817541020            d = 0.00468175413531868810068546393395340
e = 0.0001604411842            e = 0.00016044118478735982187266087016347
Then, for a double-precision data type and 53-bit FPU precision, the sine of 30 degrees is:
Normalized Series up to 9th term with SC1 Sin(30) = 0.5000000008100623500000 ( V1 )
Normalized Series up to 9th term with SC2 Sin(30) = 0.5000000000202800000000 ( V2 )
The true value for the sine of 30 degrees is 0.5:
AE1 = V1 - 0.5 = 0.00000000081006235
AE2 = V2 - 0.5 = 0.00000000002028000
You can see that Absolute Error AE2 is ~40 times smaller than AE1:
AE1 / AE2 = 0.00000000081006235 / 0.00000000002028000 = 39.94
Normalized Series up to 11th term with a Set of Coefficients ( SC3 )
a = 1.57079632679489661923132169163980000
b = 0.64596409750624625365575656389795000
c = 0.07969262624616704512050554949047800
d = 0.00468175413531868810068546393395340
e = 0.00016044118478735982187266087016347
f = 0.00000359253000000000000000000000000
works like magic:
Sin(30) = 0.5000000000000000000000
and I verified it with all C/C++ compilers that I use.
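A minimal sketch that reproduces this result ( my illustration of equation ( S ) extended to the 11th term; the argument is normalized as t = degrees / 90 ):

[cpp]#include <stdio.h>

// Normalized Series up to the 11th term with the SC3 coefficients.
static double SinNS11( double degrees )
{
    const double a = 1.57079632679489661923132169163980000;
    const double b = 0.64596409750624625365575656389795000;
    const double c = 0.07969262624616704512050554949047800;
    const double d = 0.00468175413531868810068546393395340;
    const double e = 0.00016044118478735982187266087016347;
    const double f = 0.00000359253000000000000000000000000;
    double t  = degrees / 90.0;
    double t2 = t * t;
    return t * ( a - t2 * ( b - t2 * ( c - t2 * ( d - t2 * ( e - t2 * f ) ) ) ) );
}

int main( void )
{
    printf( "Sin( 30.0 ) = %.22f\n", SinNS11( 30.0 ) );
    return 0;
}[/cpp]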
Best regards,
Sergey
>>...It could be a set of different statistical methods like:
>>- Min error
>>- Max error
>>- Average error
>>- Standard Deviation of all errors
Sure, but I don't really see a real-world use for the min error; for example, a very rough LUT-based solution may have several exact results ( up to the LSB ) over the domain ( i.e. min error = 0.0 ), while a good minimax polynomial has no single exact result ( i.e. min error > 0.0 ). I'll be interested to see your error analysis here: http://software.intel.com/en-us/forums/showpost.php?p=185933 displayed as max AE and/or average AE over [ 0.0; 90.0 ] deg, if it's a simple option of your software.
>>...formula when implemented in software failed to produce the exact value. I blamed the Java compiler's implementation
>>of the IEEE 754 standard more than the formula's accuracy...
I recommend running a set of very simple tests in order to verify the quality of the IEEE 754 implementation in Java, like:
[cpp]...
Verification: ( 0.1f*0.1f ) for 'float':
   24-bit  : [ 0.1 * 0.1 = 0.01000000070780516 ]
   Default : [ 0.1 * 0.1 = 0.01000000029802323 ]

Verification: ( 0.1L*0.1L ) for 'double':
   24-bit  : [ 0.1 * 0.1 = 0.00999999977648258 ]
   53-bit  : [ 0.1 * 0.1 = 0.01000000000000000 ]
   64-bit  : [ 0.1 * 0.1 = 0.01000000000000000 ]
   Default : [ 0.1 * 0.1 = 0.01000000000000000 ]
...[/cpp]
Can you reproduce these numbers?
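For reference, a sketch of how such numbers can be produced on the C/C++ side ( my illustration; MSVC targeting 32-bit x86, where _control87 affects the x87 unit ):

[cpp]#include <float.h>
#include <stdio.h>

// The exact output depends on whether the compiler emits x87 or SSE2
// code for the multiplication; build for x86 with /arch:IA32 to see
// the effect of the precision-control setting.
static void Verify( unsigned int pc, const char *name )
{
    unsigned int prev = _control87( pc, _MCW_PC );
    volatile double x = 0.1, y = 0.1;   // volatile defeats constant folding
    printf( "%s : [ 0.1 * 0.1 = %.17f ]\n", name, x * y );
    _control87( prev, _MCW_PC );
}

int main( void )
{
    Verify( _PC_24, "24-bit" );
    Verify( _PC_53, "53-bit" );
    Verify( _PC_64, "64-bit" );
    return 0;
}[/cpp]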
Discussions about consistency of floating-point results will never stop. Please take a look at a very good article regarding
that subject:
Consistency of Floating-Point Results using the Intel Compiler
by Dr. Martyn J. Corden & David Kreitzer
Best regards,
Sergey
PS: I have very consistent results across different platforms and C/C++ compilers.
>>...Can you reproduce these numbers?
Here are my results for the multiplication of x = 0.1f and y = 0.1f, both declared as float values:
x = 0.10000000149011612000000000000000
y = 0.10000000149011612000000000000000
Here is the multiplication result:
product = 0.01000000070780515700000000000000
Here are the results for the multiplication of the Java double values x1 = 0.1d and y1 = 0.1d:
x1 = 0.10000000000000000000000000000000
y1 = 0.10000000000000000000000000000000
product1 = 0.01000000000000000200000000000000
Sergey, how can I use 53-bit precision if Java only supports standard IEEE 754 float and double values?
Please read this PDF written by W. Kahan about the awful floating-point implementation in Java:
www.cs.berkeley.edu/~wkahan/JAVAhurt.pdf
I see that your result ( the product 0.1*0.1 in single precision ) is consistent with my result in Post #56:
[cpp]Java           : 0.01000000070780515700000000000000
Java (rounded) : 0.01000000070780516
C/C++          : 0.01000000070780516[/cpp]
and the same applies to the double-precision data type.
I could assume that the Java VM initialized the FPU to 24-bit precision in the 1st case, and to 53-bit ( default ) precision
in the 2nd case.
So, in both cases the results are consistent.
My concern is as follows: in the 1st case I would expect initialization of the FPU to Default settings instead of 24-bit.
Do you need another test ( a matrix multiplication ) to verify Java's floating point support?
>>...How can I use 53-bit precision if Java only supports standard IEEE 754 float and double values...
In C/C++ applications a developer can control precision with the functions '_control87' or
'_controlfp'. Microsoft added several extensions, like '_control87_2' or '_controlfp_s', but
these two functions are Microsoft specific.
In the Java API you should look for some method similar to the '_control87' function. If Java
doesn't have a method that controls the settings of the FPU, I would consider a Java Native Interface ( JNI ) DLL
that implements a wrapper around the '_control87' function.
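A minimal sketch of such a JNI wrapper ( hypothetical Java class and method names; MSVC, <float.h> ):

[cpp]#include <jni.h>
#include <float.h>

// Hypothetical native method for a Java class 'FpuControl':
//     public static native int setPrecision53();
// Returns the previous control word so the caller can restore it later.
extern "C" JNIEXPORT jint JNICALL
Java_FpuControl_setPrecision53( JNIEnv *env, jclass cls )
{
    return ( jint )_control87( _PC_53, _MCW_PC );
}[/cpp]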
Note: In the case of using inline assembler with SSE2 instructions in a JNI DLL, the precision could
be controlled with a set of macros, like '_MM_SET_ROUNDING_MODE', or
intrinsic functions, like '_mm_setcsr'. Take a look at the attached 'xmmintrin.h' header file.
Remember that the '_control87' function controls both units ( x87 and SSE2 ).
Best regards,
Sergey
Two files are enclosed and this is a short example:
[cpp]Microsoft C++ compiler
Normalized Series up to 11 term sine vs. CRT sine
53-bit precision
                    NS11t                CRT                    Absolute Error
...
Sin( 29.0 ) = 0.4848096202463386 = 0.4848096202463371 - AE =  0.0000000000000016
Sin( 30.0 ) = 0.5000000000000000 = 0.4999999999999999 - AE =  0.0000000000000001
Sin( 31.0 ) = 0.5150380749100507 = 0.5150380749100542 - AE = -0.0000000000000034
...

Borland C++ compiler
Normalized Series up to 11 term sine vs. CRT sine
53-bit precision
                    NS11t                CRT                    Absolute Error
...
Sin( 29.0 ) = 0.4848096202463387 = 0.4848096202463371 - AE =  0.0000000000000016
Sin( 30.0 ) = 0.5000000000000000 = 0.4999999999999999 - AE =  0.0000000000000001
Sin( 31.0 ) = 0.5150380749100507 = 0.5150380749100542 - AE = -0.0000000000000034
...[/cpp]
Sergey!
I think the FPU unit's initialization is performed by the JVM automatically ( maybe hardcoded and done at jvm.dll startup, or when compiling FP code; I would need to disassemble jvm.dll and look for instructions which access the MXCSR register ), and the programmer is not given the option to manipulate the settings as is done by the Microsoft _control87 function.
Soon I will switch from Java to C with inline assembly for my math library.
>>...I would consider a Java Native Interface ( JNI ) DLL...
The better option is to write it in the C language.
>>...Do you need another test ( a matrix multiplication ) to verify Java's floating point support...
Yes.
>>...My concern is as follows: in the 1st case I would expect initialization of the FPU to Default settings instead of 24-bit...
Because the results obtained from the C/C++ and Java calculations are the same when rounded, I suppose that the CPU's rounding algorithm ( both of us have Intel CPUs ) is responsible for this.
>>...SSE2 instructions in a JNI DLL a precision could be controlled...
Note that there is no notion of precision control with SSEx/AVXx anymore: all SS/PS instructions work at 24-bit precision and SD/PD instructions at 53-bit precision. MXCSR has 2 rounding-mode bits but, unlike the x87 control word, no FP precision control ( only the precision exception flag and mask ).
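For example, the rounding mode ( though not the precision ) can still be selected through MXCSR:

[cpp]#include <xmmintrin.h>

// MXCSR exposes only a rounding-mode field; there is no
// precision-control field as there is in the x87 control word.
void SetRoundTowardZero( void )
{
    _MM_SET_ROUNDING_MODE( _MM_ROUND_TOWARD_ZERO );
}[/cpp]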