Intel® ISA Extensions

Optimization of sine function's Taylor expansion

Valued Contributor I

While coding various series expansions of many functions in assembly, I tried to optimize my code mainly by using the rcpps instruction instead of divps. When testing code that calculates the sine function by Taylor series, I did a few measurements as explained here :"" and the result was between 10-12 cycles for the first term of the sine expansion (I used 14 terms).
I would like to ask how I can rewrite this code to improve its execution speed.
[bash]
movups xmm0,argument
movups xmm1,argument
mulps xmm1,xmm1
mulps xmm1,xmm0
mov ebx,OFFSET coef
movups xmm2,[ebx]
rcpps xmm3,xmm2 ;rcpps used instead of divps
mulps xmm1,xmm3
subps xmm0,xmm1
[/bash]
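One common way to speed up a series like this is to store the reciprocal factorials as constants and evaluate the polynomial in Horner form, so the hot path contains neither divps nor rcpps. This matters for accuracy as well as speed: rcpps is only an approximation, with a documented relative error of up to 1.5 * 2^-12. The sketch below is an illustration in C with SSE intrinsics, not the code from this thread. The names sin_poly_ps and SIN_COEF are hypothetical, it uses six terms (through x^11) instead of 14, and it assumes the argument has already been reduced to roughly [-pi/2, pi/2].

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Signed reciprocal factorials, precomputed at compile time so that
   no division or reciprocal approximation happens at run time.
   SIN_COEF is a hypothetical name, not taken from the thread. */
static const float SIN_COEF[] = {
    -1.0f / 6.0f,          /* -1/3!  */
     1.0f / 120.0f,        /* +1/5!  */
    -1.0f / 5040.0f,       /* -1/7!  */
     1.0f / 362880.0f,     /* +1/9!  */
    -1.0f / 39916800.0f    /* -1/11! */
};
#define SIN_TERMS 5

/* Approximates sin(x) for four packed floats using
   sin(x) ~= x + x * x2 * (c0 + x2 * (c1 + x2 * (...))), x2 = x*x.
   Assumes |x| is already reduced to about pi/2 or less. */
static __m128 sin_poly_ps(__m128 x)
{
    __m128 x2 = _mm_mul_ps(x, x);
    __m128 p = _mm_set1_ps(SIN_COEF[SIN_TERMS - 1]);
    for (int k = SIN_TERMS - 2; k >= 0; --k) {
        /* Horner step: one multiply and one add per coefficient. */
        p = _mm_add_ps(_mm_mul_ps(p, x2), _mm_set1_ps(SIN_COEF[k]));
    }
    p = _mm_mul_ps(p, x2);                   /* x2 * (c0 + ...) */
    return _mm_add_ps(x, _mm_mul_ps(x, p));  /* x + x * x2 * (...) */
}
```

Each coefficient costs one multiply and one add here, and the additions accumulate from the smallest term upward, which tends to help rounding. Whether this beats a given hand-written assembly loop would have to be measured on the target CPU.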

1 Solution
New Contributor II

calls 1e6 times fastsin() the result in millisecond is 63

so it's 63 ns per iteration or ~120 clocks on your CPU, which doesn't match your previous reports IIRC

if you keep only the polynomial (get rid of the strange domain check) you should begin to see timings nearer to mine
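The arithmetic behind those figures can be sketched in C as follows. The helper names are hypothetical, and the 1.9 GHz clock used in the usage note is an assumption inferred from the ~120-clock estimate; the thread does not state the CPU frequency at this point.

```c
/* Average time per call in nanoseconds, given the total run time in
   milliseconds and the number of calls (1 ms = 1e6 ns). */
static double ns_per_call(double total_ms, double calls)
{
    return total_ms * 1.0e6 / calls;
}

/* Converts nanoseconds to clock cycles. ghz is the core clock in GHz,
   which is an assumed input, not a value measured by this code. */
static double ns_to_cycles(double ns, double ghz)
{
    return ns * ghz;
}
```

With 63 ms for 1e6 calls, ns_per_call(63.0, 1.0e6) gives 63 ns, and ns_to_cycles(63.0, 1.9) gives about 120 cycles under the assumed clock.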

342 Replies
Valued Contributor II
Iliya, please take a look at It should have a section related to interviews, etc. Good Luck!
Valued Contributor I
>>>Good Luck!>>> Thank you very much. Regarding I found a few good sites with very helpful interview preparation advice.
Valued Contributor I
Hi Sergey! I attended a programming job interview today. I was asked a few questions about the Spring and Hibernate frameworks; sadly, these frameworks are unknown to me. I talked to them about my projects, but the interviewers require knowledge of web programming with the Spring framework and of UI programming. I must wait for their decision until Wednesday next week.
Valued Contributor I
Hi Sergey, I'm posting a few test cases of my "intrinsic" library; more will follow soon.

_VecLibDouble (lib typedef of double) arbitrary size array addition.

_VecLibDouble *Vector_Add_d(_VecLibDouble a[],_VecLibDouble b[],int len){
    if(a == NULL || b == NULL)
        return (_VecLibDouble*)NULL;
    if((sizeof(a)/sizeof(a[0]) - sizeof(b)/sizeof(b[0])) > 0)
        return (_VecLibDouble*)NULL;
    _asm{
        xor eax,eax
        xor ebx,ebx
        xor ecx,ecx
        xorpd xmm0,xmm0
        xorpd xmm1,xmm1
        mov ecx,len
        mov eax,
        mov ebx,
    Back:
        movupd xmm0,[eax]
        movupd xmm1,[ebx]
        addpd xmm0,xmm1
        movupd [eax],xmm0
        add eax,16
        dec ecx
        jnz Back
    }
    return a;
}

main() arguments and call

_VecLibDouble testar2[10],*testarptr;
_VecLibDouble testar3[10];
for(int i = 0;i < 10;i++){
    testar2 = (double)i;
    testar3 = 100.0;
}
testarpt...
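A few things in the routine as posted look worth checking. Inside the function, a and b are pointers, so sizeof(a)/sizeof(a[0]) does not give the array length. The loop advances only eax, so the same two elements of b are added every time. It also moves 16 bytes (two doubles) per pass while counting len passes, which would run past the end of a 10-element array. The following is a minimal sketch of the same in-place addition using SSE2 intrinsics in C. The name vector_add_d is hypothetical and this is not the VecLib implementation.

```c
#include <stddef.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Adds b[i] to a[i] in place for i in [0, len) and returns a.
   Returns NULL for a null pointer or a negative length. The caller
   must pass the true element count; it cannot be recovered from the
   pointers. vector_add_d is a hypothetical name. */
static double *vector_add_d(double *a, const double *b, int len)
{
    if (a == NULL || b == NULL || len < 0)
        return NULL;

    int i = 0;
    /* Two doubles fit in one 128-bit register, so step by 2 and
       advance the index into both arrays together. */
    for (; i + 1 < len; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);
        __m128d vb = _mm_loadu_pd(b + i);
        _mm_storeu_pd(a + i, _mm_add_pd(va, vb));
    }
    /* Scalar tail for an odd length. */
    for (; i < len; ++i)
        a[i] += b[i];
    return a;
}
```

Unaligned loads and stores are used to match the movupd instructions in the post; aligned variants would need 16-byte aligned arrays.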
Valued Contributor II
Thanks for the update and I'll take a look some time later ( there will be a couple of very busy weeks for me ). Best regards, Sergey
Valued Contributor I
>>>Thanks for the update and I'll take a look some time later >>> OK, thank you for your help and very insightful tips.
Valued Contributor I
Hi Sergey! I'm posting an update on my VecLib project. While testing a slightly optimized version of the sine function, where the sine series is evaluated with SSE inline assembly, I ran into a problem. I eliminated the one instruction that explicitly multiplied the argument by x^2, so the instruction count per term dropped to three, but the accuracy fell to only 2-3 decimal places. Both double-precision and single-precision primitives were used, so truncation cannot be blamed for the inaccurate result. I suspect that combining the multiplication of the argument by a pre-calculated coefficient with the exponentiation of the argument, all in the same register xmm0, which also served as the accumulator, caused the loss of accuracy. I rewrote the inline asm block to take that load off xmm0 by adding another instruction that multiplies the argument by x^2, and the problem disappeared. Please look at the Vec_Sin_f() inline assembly block and the Vec_Cos_d() inline assembly block. Thanks in advance.
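The Vec_Sin_f() and Vec_Cos_d() blocks are not included in the post, so the structure of the working version can only be guessed at. A scalar C sketch of one plausible reading follows: the running power x^(2k+1) lives in its own variable, is multiplied by x^2 once per term, and is kept apart from the accumulator. The name sin_series is hypothetical, and the reciprocal factorial is computed inline only to keep the sketch short; the post describes reading it from pre-calculated coefficients.

```c
/* Sums the first `terms` terms of the sine Taylor series.
   `power` holds x^(2k+1) and is updated by an explicit multiply by x2
   every pass; `sum` is a separate accumulator and never holds the power.
   sin_series is a hypothetical name, not a VecLib function. */
static double sin_series(double x, int terms)
{
    const double x2 = x * x;
    double power = x;       /* x^1 */
    double inv_fact = 1.0;  /* 1/1! */
    double sum = x;         /* first term of the series */

    for (int k = 1; k < terms; ++k) {
        power *= x2;                                  /* x^(2k+1) */
        inv_fact /= (double)((2 * k) * (2 * k + 1));  /* 1/(2k+1)! */
        double term = power * inv_fact;
        sum += (k % 2) ? -term : term;                /* signs alternate */
    }
    return sum;
}
```

If the power update is dropped or folded into the accumulator, every later term is built from the wrong power of x, which would be consistent with an error far larger than rounding alone could explain.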