Intel® ISA Extensions

Optimization of sine function's taylor expansion

Bernard
Valued Contributor I
8,929 Views

Hello!
While coding series expansions of various functions in assembly, I tried to optimize my code mainly by using the rcpps instruction instead of divps. When testing code that calculates the sine function by Taylor series, I took a few measurements as explained here: "http://software.intel.com/en-us/forums/showthread.php?t=52482", and the result was between 10 and 12 cycles for the first term of the sine expansion (I used 14 terms).
I would like to ask how I can rewrite this code to improve its execution speed.
[bash]
movups xmm0,argument
movups xmm1,argument
mulps  xmm1,xmm1
mulps  xmm1,xmm0
mov    ebx,OFFSET coef
movups xmm2,[ebx]
rcpps  xmm3,xmm2      ;rcpps used instead of divps
mulps  xmm1,xmm3
subps  xmm0,xmm1
[/bash]
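For comparison, here is a minimal C sketch of the same idea taken one step further. If the coefficient table stores the reciprocals 1/3!, 1/5!, ... directly (the table in the post presumably holds the factorials themselves, since the code takes a reciprocal of each entry), then neither divps nor rcpps is needed at run time, and the limited precision of rcpps (about 12 bits) never enters the result. The names `sin_taylor` and `SIN_COEF`, the use of double precision, and the choice of eight terms are assumptions for illustration, not code from this thread; the polynomial is evaluated in Horner form.

```c
#include <stddef.h>

/* Coefficients (-1)^k / (2k+1)! for k = 7..1, highest degree first.
   The divisions are constant expressions, folded at compile time. */
static const double SIN_COEF[] = {
    -1.0 / 1307674368000.0, /* x^15 */
     1.0 / 6227020800.0,    /* x^13 */
    -1.0 / 39916800.0,      /* x^11 */
     1.0 / 362880.0,        /* x^9  */
    -1.0 / 5040.0,          /* x^7  */
     1.0 / 120.0,           /* x^5  */
    -1.0 / 6.0              /* x^3  */
};

/* Hypothetical helper: Taylor polynomial of sin(x) up to x^15.
   Intended for small |x|; no range reduction is performed. */
double sin_taylor(double x)
{
    const double x2 = x * x;
    double p = 0.0;

    /* Horner scheme in x^2: one multiply and one add per term. */
    for (size_t i = 0; i < sizeof SIN_COEF / sizeof SIN_COEF[0]; ++i)
        p = (p + SIN_COEF[i]) * x2;

    /* p now equals c3*x^2 + c5*x^4 + ... + c15*x^14. */
    return x + x * p;
}
```

The same structure maps onto SSE as one mulps and one addps per term, with the coefficients broadcast into registers ahead of the loop.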

1 Solution
bronxzv
New Contributor II
8,738 Views

calls fastsin() 1e6 times; the result in milliseconds is 63

so it's 63 ns per iteration or ~120 clocks on your CPU; it doesn't match your previous reports IIRC

if you keep only the polynomial (get rid of the strange domain check) you should begin to see timings nearer to mine
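The conversion behind those numbers can be written out as a small C sketch. The helper names `ns_per_call` and `ns_to_cycles` are hypothetical, and the 1.9 GHz clock used in the usage note is only an inference from the two figures quoted (120 clocks / 63 ns), not a frequency stated in the thread.

```c
/* Average time per call in nanoseconds, given a total in milliseconds.
   1 ms = 1e6 ns, so 63 ms over 1e6 calls is 63 ns per call. */
double ns_per_call(double total_ms, double calls)
{
    return total_ms * 1e6 / calls;
}

/* Convert nanoseconds to core clock cycles for a frequency in GHz
   (1 GHz = 1 cycle per nanosecond). */
double ns_to_cycles(double ns, double ghz)
{
    return ns * ghz;
}
```

With an assumed 1.9 GHz clock, `ns_to_cycles(ns_per_call(63.0, 1e6), 1.9)` comes out just under 120 cycles per call.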

342 Replies
SergeyKostrov
Valued Contributor II
383 Views
Iliya, please take a look at www.monster.com. It should have a section related to interviews, etc. Good Luck!
Bernard
Valued Contributor I
383 Views
>>>Good Luck!>>> Thank you very much. Regarding monster.com, I found a few good sites with very helpful interview preparation advice.
Bernard
Valued Contributor I
383 Views
Hi Sergey! I attended a programming job interview today. I was asked a few questions about the Spring and Hibernate frameworks; sadly, these frameworks are unknown to me. I talked to them about my projects, but the interviewers require knowledge of web programming with the Spring framework and of UI programming. I must wait for their decision until Wednesday next week.
Bernard
Valued Contributor I
383 Views
Hi Sergey, I'm posting a few test cases of my "intrinsic" library; more will follow soon.
_VecLibDouble (lib typedef of double) arbitrary-size array addition:
_VecLibDouble *Vector_Add_d(_VecLibDouble a[],_VecLibDouble b[],int len){
    if(a == NULL || b == NULL)
        return (_VecLibDouble*)NULL;
    if((sizeof(a)/sizeof(a[0]) - sizeof(b)/sizeof(b[0])) > 0)
        return (_VecLibDouble*)NULL;
    _asm{
        xor eax,eax
        xor ebx,ebx
        xor ecx,ecx
        xorpd xmm0,xmm0
        xorpd xmm1,xmm1
        mov ecx,len
        mov eax,
        mov ebx,
    Back:
        movupd xmm0,[eax]
        movupd xmm1,[ebx]
        addpd xmm0,xmm1
        movupd [eax],xmm0
        add eax,16
        dec ecx
        jnz Back
    }
    return a;
}
main() arguments and call:
_VecLibDouble testar2[10],*testarptr;
_VecLibDouble testar3[10];
for(int i = 0;i < 10;i++){
    testar2[i] = (double)i;
    testar3[i] = 100.0;
}
testarpt...
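A few things in the routine as posted look worth checking (allowing for the text lost in the post itself). Inside a function, `sizeof(a)` on an array parameter yields the size of a pointer, so the length comparison always evaluates to zero and never rejects anything. The loop advances eax by 16 bytes but never advances ebx, so every iteration adds the same two elements of b. And ecx is loaded with len although each addpd handles two doubles, so the loop would run past the end of the arrays. Below is a minimal sketch of one way to address these, using SSE2 intrinsics rather than inline assembly. The name `vector_add_d` and the `size_t` length parameter are assumptions for illustration, not part of the VecLib code, and the intrinsic path is guarded so the function falls back to plain C where `__SSE2__` is not defined.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical counterpart of Vector_Add_d: a[i] += b[i] for i in [0, len).
   The caller passes the element count; it cannot be recovered from the
   pointers inside the function. */
double *vector_add_d(double *a, const double *b, size_t len)
{
    if (a == NULL || b == NULL)
        return NULL;

    size_t i = 0;
#if defined(__SSE2__)
    /* Two doubles per 128-bit register; both arrays advance together. */
    for (; i + 2 <= len; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);
        __m128d vb = _mm_loadu_pd(b + i);
        _mm_storeu_pd(a + i, _mm_add_pd(va, vb));
    }
#endif
    /* Scalar tail for an odd length (or the whole array without SSE2). */
    for (; i < len; ++i)
        a[i] += b[i];

    return a;
}
```

Called as `vector_add_d(testar2, testar3, 10)` with the arrays from the post, each element of testar2 would become `i + 100.0`.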
SergeyKostrov
Valued Contributor II
381 Views
Thanks for the update and I'll take a look some time later ( there will be a couple of very busy weeks for me ). Best regards, Sergey
Bernard
Valued Contributor I
381 Views
>>>Thanks for the update and I'll take a look some time later >>> OK, thank you for your help and very insightful tips.
Bernard
Valued Contributor I
381 Views
Hi Sergey! I'm posting an update on my VecLib project. While testing a slightly optimized version of the sine function, in which the series is summed with SSE inline assembly, I ran into a problem. I eliminated the one instruction that explicitly multiplied the argument by x^2, so the total count was three instructions per term, but the accuracy dropped sharply, to 2-3 decimal places. Both double-precision and single-precision primitives were used, so truncation cannot be blamed for the inaccurate result. I suspect that combining the multiplication by the pre-calculated coefficient with the exponentiation of the argument, all performed in the same register xmm0, which also served as the accumulator, caused the loss of accuracy. I rewrote the inline asm block, taking that work off xmm0 by adding back a separate instruction that multiplies the argument by x^2, and the problem disappeared. Please look at the inline assembly block of the Vec_Sin_f() function and the inline assembly block of Vec_Cos_d(). Thanks in advance.
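One way the symptom described above could arise is sketched here in C. This is an assumption about the mechanism, since the Vec_Sin_f() assembly itself is not shown in this post: if the running power of x and the coefficient product share one register, each new term inherits every earlier coefficient. The names `sin_separate` and `sin_folded`, the table `COEF`, and the six-term length are hypothetical.

```c
#include <stddef.h>

/* Coefficients (-1)^k / (2k+1)! for k = 1..6 (x^3 through x^13). */
static const double COEF[] = {
    -1.0 / 6.0,        1.0 / 120.0,      -1.0 / 5040.0,
     1.0 / 362880.0,  -1.0 / 39916800.0,  1.0 / 6227020800.0
};
#define NCOEF (sizeof COEF / sizeof COEF[0])

/* Power and partial sum kept in separate variables. */
double sin_separate(double x)
{
    const double x2 = x * x;
    double power = x;              /* holds x^(2k+1), nothing else */
    double sum = x;
    for (size_t k = 0; k < NCOEF; ++k) {
        power *= x2;               /* next odd power of x */
        sum += COEF[k] * power;    /* coefficient applied to a copy only */
    }
    return sum;
}

/* Coefficient folded into the same variable that carries the power:
   term k ends up scaled by COEF[0] * COEF[1] * ... * COEF[k]. */
double sin_folded(double x)
{
    const double x2 = x * x;
    double power = x;
    double sum = x;
    for (size_t k = 0; k < NCOEF; ++k) {
        power *= x2;
        power *= COEF[k];          /* corrupts the running power */
        sum += power;
    }
    return sum;
}
```

For x = 1.0 the folded variant agrees with the true sine only to about two decimal places, while the separate variant is limited just by the truncation of the series, which is at least consistent with the 2-3 decimal places reported.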