Intel® ISA Extensions

Optimization of the sine function's Taylor expansion

Bernard
Valued Contributor I

Hello!
While coding various series expansions of functions in assembly, I tried to optimize my code mainly by using the rcpps instruction instead of divps. When testing code that calculates the sine function by its Taylor series, I did a few measurements as explained here: http://software.intel.com/en-us/forums/showthread.php?t=52482 and the result was between 10-12 cycles for the first term of the sine expansion (I used 14 terms).
I would like to ask how I can rewrite this code in order to improve its execution speed.
[bash]movups xmm0,argument
movups xmm1,argument
mulps  xmm1,xmm1
mulps  xmm1,xmm0
mov    ebx,OFFSET coef
movups xmm2,[ebx]
rcpps  xmm3,xmm2      ; rcpps used instead of divps
mulps  xmm1,xmm3
subps  xmm0,xmm1[/bash]
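For comparison, here is a minimal sketch of the same idea written with SSE intrinsics (the function name and coefficient are illustrative, not the poster's actual code). Since the term's coefficient is a compile-time constant, its reciprocal can simply be precomputed, so neither divps nor rcpps is needed at run time:

[bash]#include <xmmintrin.h>

/* Hypothetical first correction term of the sine series: x - x^3/3!.     */
/* The divisor is a constant, so we multiply by a precomputed reciprocal  */
/* instead of issuing divps or approximating it with rcpps.               */
static __m128 sine_first_term(__m128 x)
{
    __m128 x3 = _mm_mul_ps(_mm_mul_ps(x, x), x);      /* x^3        */
    const __m128 inv_coef = _mm_set1_ps(1.0f / 6.0f); /* 1/3!       */
    return _mm_sub_ps(x, _mm_mul_ps(x3, inv_coef));   /* x - x^3/6  */
}[/bash]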

1 Solution
bronxzv
New Contributor II

Calling fastsin() 1e6 times gives a result of 63 milliseconds,

so it's 63 ns per iteration, or ~120 clocks on your CPU; that doesn't match your previous reports, IIRC.

If you keep only the polynomial (i.e., get rid of the strange domain check) you should start to see timings closer to mine.
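For illustration, a minimal scalar sketch of the polynomial-only idea suggested here (plain C, Horner form, truncated Taylor coefficients; this is not bronxzv's actual code):

[bash]/* Polynomial-only sine approximation for x roughly in [-pi/2, pi/2].     */
/* Horner evaluation of x - x^3/3! + x^5/5! - ... with no domain check.   */
static double fastsin_poly(double x)
{
    const double c3  = -1.0 / 6.0;
    const double c5  =  1.0 / 120.0;
    const double c7  = -1.0 / 5040.0;
    const double c9  =  1.0 / 362880.0;
    const double c11 = -1.0 / 39916800.0;

    double x2 = x * x;
    return x * (1.0 + x2 * (c3 + x2 * (c5 + x2 * (c7 + x2 * (c9 + x2 * c11)))));
}[/bash]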

342 Replies
SergeyKostrov
Valued Contributor II
>>...Ok, I agree with you. Someone who is interested will look through all our posts and find the relevant info.
>>Regarding the title: yes, please change it. I think that the proper word is optimization.

Unfortunately, I can't do this since you're the author ( owner ) of the thread.
Bernard
Valued Contributor I
@Sergey Does Intel have its own BigDecimal-like library?
SergeyKostrov
Valued Contributor II
Hi Iliya,

>>...Does Intel have its own BigDecimal-like library?

Is that what you're looking for? http://software.intel.com/en-us/articles/intel-decimal-floating-point-math-library/
SergeyKostrov
Valued Contributor II
>>...the result reached almost 1025 decimal places, but was accurate only to four decimal places...

Interesting results! I think you need to "step down" and try to calculate just 4 or 6 decimals in order to understand what is wrong.
Bernard
Valued Contributor I
>>>Is that what you're looking for? http://software.intel.com/en-us/articles/intel-decimal-floating-point-ma...>>>

No, it does not offer precision greater than 128-bit types.
Bernard
Valued Contributor I
>>>Interesting results! I think you need to "step down" and try to calculate just 4 or 6 decimals in order to understand what is wrong.>>>

Yes, I know that something went wrong during the calculation. The coefficients were calculated by Mathematica 8 and can be trusted as very accurate, and my double-precision fastsin() was accurate up to 15 decimal places. I think I can blame the BigDecimal library's string-parsing methods for mishandling those very large arguments passed to the constructors. Another explanation could be some rounding and truncation performed by the library's algorithms. Later I will test a single-threaded low-precision version and post the results.
SergeyKostrov
Valued Contributor II
>>...Later I will test a single-threaded low-precision version and post the results...

That's the right decision, Iliya. It always makes sense to simplify a test case when additional investigation is needed.
Bernard
Valued Contributor I
>>>...That's the right decision, Iliya. It always makes sense to simplify a test case when additional investigation is needed...>>>

I used a global synchronized method to prevent a race condition, so only one thread at a time is allowed to calculate the Horner polynomial. The lack of precision is therefore very strange to me.
Bernard
Valued Contributor I
Hi Sergey,

>>>...That's the right decision, Iliya. It always makes sense to simplify a test case when additional investigation is needed...>>>

I did an extensive set of tests on my BigDecimal version of fastsin; sadly, the accuracy issue still persists and even got worse. Up to an argument of 0.30 radian the tested method can achieve an accuracy of 4 decimal places. For larger arguments the value of the arbitrary-precision version of fastsin exceeds 1.0. I tried various combinations of precision truncation and rounding modes, and nothing helps. I suppose that somehow the BigDecimal arithmetic methods, which work on objects, are responsible for such a divergence.
SergeyKostrov
Valued Contributor II
>>...I did an extensive set of tests on my BigDecimal version of fastsin; sadly, the accuracy issue still persists and even got worse...

Hi Iliya, take a look at http://software.intel.com/en-us/articles/haswell-support-in-intel-ipp and here is a quote:

...Floating Point Multiply Accumulate - Our floating-point multiply accumulate significantly increases peak flops and provides improved precision to further improve transcendental mathematics. They are broadly usable in high performance computing, professional quality imaging, and face detection. They operate on scalars, 128-bit packed single and double precision data types, and 256-bit packed single and double-precision data types. [These instructions were described previously, in the initial Intel AVX specification]...

Is it similar to FMA ( Fused Multiply-Add ) or something else?
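For context, one Horner step of such a polynomial maps directly onto a fused multiply-add. A minimal sketch with the FMA3 intrinsic (assumes a Haswell-class CPU and a compiler with FMA enabled; the function name is illustrative):

[bash]#include <immintrin.h>

/* acc = acc * x2 + c as a single fused multiply-add: the product is not  */
/* rounded before the addition, which is where the extra precision for    */
/* transcendental polynomials comes from.                                 */
static __m128 horner_step(__m128 acc, __m128 x2, __m128 c)
{
    return _mm_fmadd_ps(acc, x2, c);   /* FMA3, available from Haswell on */
}[/bash]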
Bernard
Valued Contributor I
>>>Hi Iliya, take a look at: . http://software.intel.com/en-us/articles/haswell-support-in-intel-ipp>>> Thanks for link and info. After reading through this article I came to conclusion that in my case it will not resolve my problem. Hardware implemented FMA specially tailored for transcendentals calculation should be helpful in my case, but my code uses Java BigDecimal types which are binary translated by JIT compiler to x87 FPU code.I'm not sure if JIT translates even to SSE3 and/or SSE4 so as you can see it all depends on Oracle Java JIT developers. If you are interested i rewrote arbitrary precision fastsin() method in its new form fastsin() uses iterative calculation and polynomial summation in order to achieve a convergence and I was able to achieve a great accuracy up to dozens of decimal places when I run a tests against arbitrary precision sin() implemented by Mathematica 8. sine function test1 for sin(0.1), arguments in radian Mathematica 8 result , precision of result 1000 decimal places. BigDecimal MathematicaResult = new BigDecimal("0.09983341664682815230681419841062202698991538801798225999276686156165"+ "1744283292427609662443804063036267832503180935989035450807237470459378"+ "8733561019849184104968347730506328324943597890052224240860814227296674"+ "3717361429690723851578245480827955359733878534633879232333616788674249"+ "0732984534800581449739604751862203867889743928811306547167725769195052"+ "7939226260748108333691856379033824884726830161865556889346698245411612"+ "7350321922213400495628460096652233220746111279082638889248599762600486"+ "7369921168342258220078495944896623421021701909153051750727014383748742"+ "8538739455042219898916646878947958333218732148928243147445744176306144"+ "8319041883501260441677212938344509031562810353361999685976614314713635"+ "2376471332757607248532041985898211422170648603374528422580789056073829"+ "9833912269838132246283976008123694667742027949047688125913753569293073"+ "4542900571839934963830844812380293707930735198964874164274114958462665"+ "4721689231998075453149763755566049310161914743340176805534440027234162"+ "60117444319296284058355"); My own implementation BigDecimal type precision of result 1000 decimal places. 
BigDecimal MyResult = new BigDecimal("0.09983341664682815230681419841062202698991538801798225999276691183645306851605639339348255072848269607650126301750623266285915014777363849429882494172304175818564569878093779910005002435262146480554144164068634503196275439405716427829467718846989916097461274251683687505626691144955116757431195882281097332046230322367528010330012325509400263897859673738501094946240128179295914379177052294570998660870974914229353299674791489086469695092248956464490776701589133882672522224862317928978624183262578915405699876841355904181868741141987361769147329635241904812103650693742736260834678043920025858501656467334341308687703560216612464405573345205965242336552029461830301279503719798956900959047668640280681892448507477838575486397731923600164078104802726389290258429170226935933291644091950079513081214487731344803942215805520434642873011293542131331139589003591171761381808368667811893808151688806855063014066906054722182073878502847520332849760979231929048065256809912370394113984135840931628730"+ "11562022880642264046740491556892673469828818826945815853664137950845961480297893453241670066891939"); Absolute error = -5.02748013242327639657838201069244196598086687598365702436274083429103031791154254688397381233476888109257303049666056902353473990407194600189110189078583484574868186484958398689089163018221892664037245135388883801689588181830395113773730713685686000993557855361744888167079262768697835439434623848746515679126073212404092820502626938000530208144724689918354744116716425629459662049636783854348084302260358988333561371816637813744627046208117987786899176141451952663897179644646397225693380352993781826329919231406184551157513054627516946914584505887721324611845355101608653146170470480065933909111732126971669336643695721498243335392664544424531523171711372228219914210245794701324446527354703217952356284393759416894571062761667332113325078466738366272532863917550533731836280758843056774906862734091656942807495988115860918681464258094805560424058255325887407095753254395729315604821573757373123912935892639139056787572348730943383156187380642264046740491556892673469828818826 9458158536641379508460E-62
SergeyKostrov
Valued Contributor II
>>...for sin(0.1), arguments in radians

Mathematica 8 result:
sin( 0.1 ) = 0.099833416646828152306814198410622026989915388017982259992766861...

Java BigDecimal result ( identical to Mathematica 8 up to 60 digits ):
sin( 0.1 ) = 0.099833416646828152306814198410622026989915388017982259992766911...

Windows Calculator ( up to 33 digits ):
sin( 0.1 ) = 0.099833416646828152306814198410622

Your results look impressive.
Bernard
Valued Contributor I
>>>Your results look impressive.>>>

Thank you. I have already written a few large-precision methods and I will post the results and source code soon. Why don't you try the Java BigDecimal class?
SergeyKostrov
Valued Contributor II
>>...Why don't you try the Java BigDecimal class?

I don't do Java programming, but an R&D project for Android OS will finally "force" me...
Bernard
Valued Contributor I
>>>I don't do Java programming, but an R&D project for Android OS will finally "force" me...>>>

You can think of Java programming as a multiplier of your programming knowledge :)
Bernard
Valued Contributor I
Hi Sergey!

Here I'm posting the vectorised version of my fastsin() function. As you can see, it runs to more than 200 lines of code. I still have not done extensive tests for accuracy, precision and speed of execution; later I will post the results. By studying the code you can see that I have to declare and define 10 structures and use them as scalar coefficients. Inside the _asm{} block I cannot declare MASM REAL types, and trying to load XMM registers with coefficients one at a time does not work, so I'm forced to produce such monstrous code. But I am able to do the vector calculation directly on XMM registers.

Source code:

[bash]inline struct vec4D fastsin4D(struct Argument4D arg1){

    if(arg1.f1 >= HALF_PI_FLT || arg1.f2 >= HALF_PI_FLT || arg1.f3 >= HALF_PI_FLT || arg1.f4 >= HALF_PI_FLT){
        // out of range on the positive side
        vec4D vec;
        vec.f1 = arg1.f1;
        vec.f2 = arg1.f2;
        vec.f3 = arg1.f3;
        vec.f4 = arg1.f4;
        vec.f1 = (vec.f1 - vec.f1)/(vec.f1 - vec.f1);
        vec.f2 = (vec.f2 - vec.f2)/(vec.f2 - vec.f2);
        vec.f3 = (vec.f3 - vec.f3)/(vec.f3 - vec.f3);
        vec.f4 = (vec.f4 - vec.f4)/(vec.f4 - vec.f4);
        return vec;
    }
    else if(arg1.f1 <= NEG_HALF_PI_FLT || arg1.f2 <= NEG_HALF_PI_FLT || arg1.f3 <= NEG_HALF_PI_FLT || arg1.f4 <= NEG_HALF_PI_FLT){
        // out of range on the negative side
        vec4D vec1;
        vec1.f1 = arg1.f1;
        vec1.f2 = arg1.f2;
        vec1.f3 = arg1.f3;
        vec1.f4 = arg1.f4;
        vec1.f1 = (vec1.f1 + vec1.f1)/(vec1.f1 + vec1.f1);
        vec1.f2 = (vec1.f2 + vec1.f2)/(vec1.f2 + vec1.f2);
        vec1.f3 = (vec1.f3 + vec1.f3)/(vec1.f3 + vec1.f3);
        vec1.f4 = (vec1.f4 + vec1.f4)/(vec1.f4 + vec1.f4);
        return vec1;
    }
    else{
        // ten 4-wide coefficient "vectors", one structure per Taylor term
        struct CoeffFlt1  { float c1,c2,c3,c4; } coeffflt1;
        struct CoeffFlt2  { float c1,c2,c3,c4; } coeffflt2;
        struct CoeffFlt3  { float c1,c2,c3,c4; } coeffflt3;
        struct CoeffFlt4  { float c1,c2,c3,c4; } coeffflt4;
        struct CoeffFlt5  { float c1,c2,c3,c4; } coeffflt5;
        struct CoeffFlt6  { float c1,c2,c3,c4; } coeffflt6;
        struct CoeffFlt7  { float c1,c2,c3,c4; } coeffflt7;
        struct CoeffFlt8  { float c1,c2,c3,c4; } coeffflt8;
        struct CoeffFlt9  { float c1,c2,c3,c4; } coeffflt9;
        struct CoeffFlt10 { float c1,c2,c3,c4; } coeffflt10;

        coeffflt1.c1 = -0.1666666;     coeffflt1.c2 = -0.1666666;     coeffflt1.c3 = -0.1666666;     coeffflt1.c4 = -0.1666666;     // -1/3!
        coeffflt2.c1 = 0.0083333;      coeffflt2.c2 = 0.0083333;      coeffflt2.c3 = 0.0083333;      coeffflt2.c4 = 0.0083333;      //  1/5!
        coeffflt3.c1 = -1.9841269e-4;  coeffflt3.c2 = -1.9841269e-4;  coeffflt3.c3 = -1.9841269e-4;  coeffflt3.c4 = -1.9841269e-4;  // -1/7!
        coeffflt4.c1 = 2.7557319e-6;   coeffflt4.c2 = 2.7557319e-6;   coeffflt4.c3 = 2.7557319e-6;   coeffflt4.c4 = 2.7557319e-6;   //  1/9!
        coeffflt5.c1 = -2.5052108e-8;  coeffflt5.c2 = -2.5052108e-8;  coeffflt5.c3 = -2.5052108e-8;  coeffflt5.c4 = -2.5052108e-8;  // -1/11!
        coeffflt6.c1 = 1.6059043e-10;  coeffflt6.c2 = 1.6059043e-10;  coeffflt6.c3 = 1.6059043e-10;  coeffflt6.c4 = 1.6059043e-10;  //  1/13!
        coeffflt7.c1 = -7.6471637e-13; coeffflt7.c2 = -7.6471637e-13; coeffflt7.c3 = -7.6471637e-13; coeffflt7.c4 = -7.6471637e-13; // -1/15!
        coeffflt8.c1 = 2.8114572e-15;  coeffflt8.c2 = 2.8114572e-15;  coeffflt8.c3 = 2.8114572e-15;  coeffflt8.c4 = 2.8114572e-15;  //  1/17!
        coeffflt9.c1 = -8.2206352e-18; coeffflt9.c2 = -8.2206352e-18; coeffflt9.c3 = -8.2206352e-18; coeffflt9.c4 = -8.2206352e-18; // -1/19!
        coeffflt10.c1 = 1.9572941e-20; coeffflt10.c2 = 1.9572941e-20; coeffflt10.c3 = 1.9572941e-20; coeffflt10.c4 = 1.9572941e-20; //  1/21!
        // -3.8681701e-23 ( -1/23! ) is not used: the asm block below consumes only ten coefficients

        vec4D vec0;
        vec0.f1 = 0.0f;
        vec0.f2 = 0.0f;
        vec0.f3 = 0.0f;
        vec0.f4 = 0.0f;

        _asm{
            xorps  xmm0,xmm0
            xorps  xmm1,xmm1
            xorps  xmm6,xmm6
            xorps  xmm7,xmm7
            xorps  xmm5,xmm5
            movups xmm0,arg1        // sine x,y,z,w
            movups xmm5,xmm0        // copy of arg1, used to multiply up the powers
            movups xmm7,xmm0        // copy arg to accumulator
            mulps  xmm0,xmm5        // x^2
            movups xmm6,xmm0        // copy of x^2
            mulps  xmm0,xmm5        // x^3
            movups xmm1,coeffflt1
            mulps  xmm1,xmm0
            addps  xmm7,xmm1        // xmm7 accumulator
            movups xmm1,coeffflt2
            mulps  xmm0,xmm6        // x^5
            mulps  xmm1,xmm0
            addps  xmm7,xmm1
            movups xmm1,coeffflt3
            mulps  xmm0,xmm6        // x^7
            mulps  xmm1,xmm0
            addps  xmm7,xmm1
            movups xmm1,coeffflt4
            mulps  xmm0,xmm6        // x^9
            mulps  xmm1,xmm0
            addps  xmm7,xmm1
            movups xmm1,coeffflt5
            mulps  xmm0,xmm6        // x^11
            mulps  xmm1,xmm0
            addps  xmm7,xmm1
            movups xmm1,coeffflt6
            mulps  xmm0,xmm6        // x^13
            mulps  xmm1,xmm0
            addps  xmm7,xmm1
            movups xmm1,coeffflt7
            mulps  xmm0,xmm6        // x^15
            mulps  xmm1,xmm0
            addps  xmm7,xmm1
            movups xmm1,coeffflt8
            mulps  xmm0,xmm6        // x^17
            mulps  xmm1,xmm0
            addps  xmm7,xmm1
            movups xmm1,coeffflt9
            mulps  xmm0,xmm6        // x^19
            mulps  xmm1,xmm0
            addps  xmm7,xmm1
            movups xmm1,coeffflt10
            mulps  xmm0,xmm6        // x^21
            mulps  xmm1,xmm0
            addps  xmm7,xmm1
            movups vec0,xmm7        // store the four results
        }
        return vec0;
    }
}[/bash]
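As an aside (a sketch only, not a drop-in replacement for the routine above): if the only purpose of the ten structures is to get the coefficients into XMM registers, SSE intrinsics can splat each scalar coefficient with _mm_set1_ps and let the compiler allocate the registers, which removes the boilerplate while producing essentially the same mulps/addps sequence:

[bash]#include <xmmintrin.h>

/* Sketch: the same odd-power series for four floats at once, written with */
/* intrinsics instead of _asm and per-coefficient structs. The coefficient */
/* list matches the ten terms used above ( -1/3! ... 1/21! ).              */
static __m128 fastsin4_intrin(__m128 x)
{
    static const float coef[10] = {
        -0.1666666f,     0.0083333f,     -1.9841269e-4f,  2.7557319e-6f,
        -2.5052108e-8f,  1.6059043e-10f, -7.6471637e-13f, 2.8114572e-15f,
        -8.2206352e-18f, 1.9572941e-20f
    };

    __m128 x2  = _mm_mul_ps(x, x);       /* x^2                            */
    __m128 pwr = _mm_mul_ps(x2, x);      /* running odd power, starts at x^3 */
    __m128 acc = x;                      /* accumulator, starts at x         */
    int i;

    for (i = 0; i < 10; ++i) {
        __m128 c = _mm_set1_ps(coef[i]); /* broadcast scalar coefficient   */
        acc = _mm_add_ps(acc, _mm_mul_ps(c, pwr));
        pwr = _mm_mul_ps(pwr, x2);       /* next odd power                 */
    }
    return acc;
}[/bash]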
SergeyKostrov
Valued Contributor II
Hi Iliya,

I see what your problem looks like... Could you also post the declarations of HALF_PI_FLT and NEG_HALF_PI_FLT? Did you declare these two constants as '#define ...' or as 'const float ...'?
SergeyKostrov
Valued Contributor II
Declarations of 'vec4D' and 'Argument4D' are also needed.
Bernard
Valued Contributor I
>>>Hi Iliya, I see what your problem looks like... Could you also post the declarations of HALF_PI_FLT and NEG_HALF_PI_FLT? Did you declare these two constants as '#define ...' or as 'const float ...'?>>>

Yes, it is going to be a serious problem. For example, if such a simple function as fastsin() needs more than 200 lines of code, what will be needed for a gamma or Bessel function? HALF_PI_FLT is the float value of Pi/2 and NEG_HALF_PI_FLT is -Pi/2; both of them are declared as #define constants. The problem is the heavy use of structure declarations and definitions needed for the polynomial calculation; the bulk of the code is due to this. I have tried various combinations of indirect addressing with [eax + float size], but it does not work. Also, as expected, none of the MASM directives and primitive types work, so I'm left with those long struct declarations and definitions. I would also use fewer coefficients because of float-type precision, but I need to test it.
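For what it's worth, one possible way around the per-coefficient structures in MSVC inline asm (an untested sketch, assuming 32-bit MSVC and 16-byte alignment) is to replicate each coefficient four-wide in a single static table and address it with a constant 16-byte stride:

[bash]/* Sketch: coefficients replicated four-wide in one static table and      */
/* loaded from inline asm with a fixed displacement per term.              */
__declspec(align(16)) static const float sin_coef[3][4] = {
    { -0.1666666f,    -0.1666666f,    -0.1666666f,    -0.1666666f    }, /* -1/3! */
    {  0.0083333f,     0.0083333f,     0.0083333f,     0.0083333f    }, /*  1/5! */
    { -1.9841269e-4f, -1.9841269e-4f, -1.9841269e-4f, -1.9841269e-4f }  /* -1/7! */
    /* ...remaining terms follow the same pattern...                      */
};

void load_coefficients_example(void)
{
    __asm {
        mov    eax, OFFSET sin_coef
        movaps xmm1, [eax]        // first coefficient, four-wide
        movaps xmm2, [eax + 16]   // second coefficient
        movaps xmm3, [eax + 32]   // third coefficient
    }
}[/bash]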
Bernard
Valued Contributor I
>>>Declarations of 'vec4D' and 'Argument4D' are also needed.>>>

[bash]struct vec4D { float c1,c2,c3,c4; } vec0;
struct Argument4D { float a1,a2,a3,a4; } arg1;[/bash]
SergeyKostrov
Valued Contributor II
Hi Iliya,

Thanks for the update. My first impression is as follows: overall, it looks good and can be optimized. But you're trying to use a Java-like style of programming, especially when it comes to the declarations of different helper types, like structures, and the initialization of their members. I'll take a look at your function tomorrow.

Best regards, Sergey