Intel® ISA Extensions

Optimization of the sine function's Taylor expansion

Bernard

Hello!
While coding series expansions of various functions in assembly, I tried to optimize my code mainly by using the rcpps instruction instead of divps. When testing code that calculates the sine function by Taylor series, I took a few measurements as explained here: "http://software.intel.com/en-us/forums/showthread.php?t=52482" and the result was 10-12 cycles for the first term of the sine expansion (I used 14 terms).
I would like to ask how I can rewrite this code to improve its execution speed.
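[bash]
movups xmm0,argument      ; xmm0 = x
movups xmm1,argument      ; xmm1 = x
mulps  xmm1,xmm1          ; xmm1 = x^2
mulps  xmm1,xmm0          ; xmm1 = x^3
mov    ebx,OFFSET coef
movups xmm2,[ebx]         ; xmm2 = coefficient (the divisor)
rcpps  xmm3,xmm2          ; rcpps used instead of divps
mulps  xmm1,xmm3          ; xmm1 = x^3 * (1/coef)
subps  xmm0,xmm1          ; xmm0 = x - x^3/coef
[/bash]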

1 Solution
bronxzv

calls fastsin() 1e6 times; the result in milliseconds is 63

so it's 63 ns per iteration, or ~120 clocks on your CPU; it doesn't match your previous reports, IIRC

if you keep only the polynomial (get rid of the strange domain check) you should begin to see timings nearer to mine


342 Replies
Bernard
>>> But, you're trying to use a Java-like style of programming especially when it comes to declarations of different helper types, like structures, and initialization of members of these helper types. I'll take a look at your function tomorrow.>>>
Thanks for the answer. Structure types can be accessed through pointers or with the 'dot' operator; it is a question of programming style. Today I will switch to 2D, 3D and 4D array types and try to eliminate those pesky structures. If it works, I can save more than 100 lines of code. I will post the update.
Update on the vectorised fastsin() with array-type coefficients: sadly, I still cannot load XMM registers with array-type coefficients. As I told you earlier, only passing a structure reference to an XMM register works. I will investigate this issue with the debugger today.
Bernard
I did preliminary tests on the execution speed of the _asm{} block. The tests were made with RDTSC instructions executed inside the _asm{} block. After a few tests I got ~1567 cycles. Even if the latency of accessing and zeroing the eax and edx registers and of the RDTSC and CPUID instructions is included, executing the whole block should not take more than ~300-400 cycles. Inline assembly code should be very fast, and the latency per Horner-scheme polynomial term is not more than 4-5 cycles, so the total calculation time for 10 terms should be ~40-50 cycles, excluding xoring the xmm registers and copying the x^2 value. Could you run a set of tests on this?
SergeyKostrov
>> Tip #1 <<
Try to use 'typedef's for declaration of structures instead of 'struct':
[cpp]
typedef struct tagSOMESTRUCT
{
} SOMESTRUCT;
[/cpp]
>> Tip #2 <<
Lots of unnecessary things in that piece of code:
[cpp]
...
if( arg1.f1 >= HALF_PI_FLT || arg1.f2 >= HALF_PI_FLT || arg1.f3 >= HALF_PI_FLT || arg1.f4 >= HALF_PI_FLT )
{
    vec4D vec;
    vec.f1 = arg1.f1;
    vec.f2 = arg1.f2;
    vec.f3 = arg1.f3;
    vec.f4 = arg1.f4;
    vec.f1 = (vec.f1 - vec.f1)/(vec.f1 - vec.f1);
    vec.f2 = (vec.f2 - vec.f2)/(vec.f2 - vec.f2);
    vec.f3 = (vec.f3 - vec.f3)/(vec.f3 - vec.f3);
    vec.f4 = (vec.f4 - vec.f4)/(vec.f4 - vec.f4);
    return vec;
}
...
[/cpp]
Why don't you want to return just NULL:
[cpp]
...
if( arg1.f1 >= HALF_PI_FLT || arg1.f2 >= HALF_PI_FLT || arg1.f3 >= HALF_PI_FLT || arg1.f4 >= HALF_PI_FLT )
{
    return (vec &)NULL;
}
...
[/cpp]
If there is an error with one of the input parameters, I try to leave a function / method as soon as possible without doing any calculations.
SergeyKostrov
>> Tip #3 <<
[cpp]
inline struct vec4D fastsin4D( struct Argument4D arg1 )
{
    ...
}
[/cpp]
This is a C-like, that is very obsolete, way of declaring a function. Also, why don't you want to have a class that wraps and consolidates all common things, variables, declarations, etc?
SergeyKostrov
>> Tip #4 <<
There are lots of declarations of some types and initializations of some constants inside of the function:
[cpp]
inline struct vec4D fastsin4D( struct Argument4D arg1 )
{
    ...
    struct CoeffFlt1 { float c1,c2,c3,c4; } coeffflt1;
    ...
    struct CoeffFlt10 { float c1,c2,c3,c4; } coeffflt10;
    ...
    coeffflt1.c1 = -0.1666666;
    coeffflt1.c2 = -0.1666666;
    coeffflt1.c3 = -0.1666666;
    coeffflt1.c4 = -0.1666666;
    ...
    coeffflt10.c1 = -3.8681701e-23;
    coeffflt10.c2 = -3.8681701e-23;
    coeffflt10.c3 = -3.8681701e-23;
    coeffflt10.c4 = -3.8681701e-23;
    ...
}
[/cpp]
- Declarations of some types should be done only once even if it is a compile time thing
- Initializations of constants could be done only once during initialization of your library
SergeyKostrov
>> Tip #5 <<
Think about applications and use cases of your API. Will a developer X be happy with your function(s) / class(es) or not?
>> Tip #6 <<
Have a test-case(s) for your API from the very beginning.
>> Tip #7 <<
Evaluate performance of your API from the very beginning. It will save time later and you won't need to do a re-design if some function doesn't work fast.
Bernard
>>>- Declarations of some types should be done only once even if it is a compile time thing - Initializations of constants could be done only once during initialization of your library>>>
Thank you very much for your help. I'm still learning, and switching between 3 programming languages sometimes confuses me :) Regarding the declaration of those sine coefficients and their structures, I will "typedef" them and move them to a header file. Is this a good solution to this problem?
Bernard
>>>This is a C-like, that is very obsolete, way of declaring a function. Also, why don't you want to have a class that wraps and consolidates all common things, variables, declarations, etc?>>>
Yes, I know this, and I'm only at an early stage of the library design. Later I will create a static library wrapped in classes with static member functions. For now I have only tested a few possible vectorizations of the input argument.
Bernard
I think that I will switch to intrinsics.
Bernard
>>> Tip #5: Think about applications and use cases of your API. Will a developer X be happy with your function(s) / class(es) or not? Tip #6: Have a test-case(s) for your API from the very beginning. Tip #7: Evaluate performance of your API from the very beginning. It will save time later and you won't need to do a re-design if some function doesn't work fast.>>>
I will proceed with the design of the library according to your advice. Today I tested a large composite structure with four inner 4D structures, put simply a matrix of four 4D vectors. I was able to load them into XMM registers and perform various vector-wise and scalar-wise operations. I could try to calculate 16 scalar sine values and return a pointer to such a matrix.
Bernard
>>>Try to use 'typedef's for declaration of structures instead of 'struct':>>>
Yes, I will do this. My intention is to wrap primitive types and data types in my own "typedefs". Btw, is it recommended to declare and initialize a structure in a header file? I tried to do this, and even after using include guards I got compile errors.
>>>Lots of unnecessary things in that piece of code:>>>
That piece of code simply fills and returns a structure with NaN values, but I agree with you that this adds unnecessary clutter. I could also clamp the values to, for example, HALF_PI and make a recursive call, but the better option is to simply return NULL.
SergeyKostrov
>>>>Try to use 'typedef's for declaration of structures instead of 'struct':
>>
>>Yes I will do this. My intention is to wrap primitive types and data types in my own "typedefs".
>>Btw Is it recommended to declare and initialize structure in header file?
- Declaration could be done in both, that is, in a header file or in a cpp-file
- Initialization has to be done in a cpp-file
Note: If you do initialization in a header file you can get a compilation error like 'A variable is already declared' or something like that ( every C++ compiler will display a different error ).
>>I tried to do this and even after using include guards I got compile errors.
Could you post the text of the compiler error? What about a really small reproducer?
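The split described above can be sketched as follows. Include guards only stop a header from being included twice in the same .cpp file; they do not stop two different .cpp files from each defining the same variable, which is a likely cause of the error mentioned. The file names, `SineCoeffs` and `g_sineCoeffs1` are hypothetical, and both "files" are shown in one listing for brevity.

```cpp
// ---- sine_coeffs.h (hypothetical header) ----
#ifndef SINE_COEFFS_H
#define SINE_COEFFS_H

struct SineCoeffs { float c1, c2, c3, c4; };  // type declared once

// Declaration only: 'extern' promises a definition elsewhere, so including
// this header from several .cpp files does not create duplicate objects.
extern const SineCoeffs g_sineCoeffs1;

#endif  // SINE_COEFFS_H

// ---- sine_coeffs.cpp (hypothetical source file) ----
// #include "sine_coeffs.h"
// The single definition, with its initializer.
const SineCoeffs g_sineCoeffs1 = { -0.1666666f, -0.1666666f, -0.1666666f, -0.1666666f };
```

Any function that includes the header can then read `g_sineCoeffs1` without declaring or initializing it again.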
SergeyKostrov
Hi Iliya,
Here are a couple more tips:
- Use of macros helps in reducing the overall size of sources ( Note: macros can't be debugged! )
- Try to use Hungarian Notation when declaring variables or members of a class
- Consider C++ templates ( one implementation will cover many different types )
- If possible, consider portability ( Portability is always a challenge! )
I could provide some little examples; just let me know what you need.
Best regards,
Sergey
SergeyKostrov
>>Regarding declaration of those sine coefficients and their structures I will "typedef" them and move them to header file. Is this
>>good solution to this problem?
Here is an example:
[cpp]
template < class T > class TSomeClass
{
public:
    TSomeClass( void )
    {
        InitData();
    };
    virtual ~TSomeClass( void )
    {
    };
private:
    inline void InitData( void )
    {
        m_tNF2 = ( T )1.0 / ( T )2.0;
        m_tNF3 = ( T )1.0 / ( T )6.0;
        m_tNF4 = ( T )1.0 / ( T )24.0;
        m_tNF5 = ( T )1.0 / ( T )120.0;
        ...
    };
private:
    T m_tNF2;
    T m_tNF3;
    T m_tNF4;
    T m_tNF5;
    ...
};
[/cpp]
SergeyKostrov
This is a test:
[cpp]
#include <stdio.h>
void main( void )
{
    printf("Pasted from VS 2005");
}
[/cpp]
Bernard
>>>- Use of macros help in reducing the overall size of sources ( Note: macros can't be debugged! )>>>
I do not like macro definitions. Usually I put various constants and function declarations in a header file; the length of the source code does not matter to me :)
>>>Try to use a Hungarian Notation when declaring variables or members of a class>>>
I hate it, but I will use it in order to prevent name conflicts.
>>>- If possible, consider portability ( Portability is always a challenge! )>>>
What do you mean by this? Are you talking about hiding the differences between various OSes or between various compilers? I think the portability question is still too complicated for me. I have no more than 10 months of programming experience :)
Bernard
>>>- Declaration could be done in both. That is, in a header file or in a cpp-file - Initialization has to be done in a cpp-file Note: If you do initialization in a header file you can get a compilation error like 'A variable is already declared' or something like that ( every C++ compiler will display a different error ).>>>
I could declare and initialize my own typedefs for the pre-calculated coefficients (primitive types) of various trig and special functions and put them in a .h file, but I would not be able to load them into XMM registers inside the asm block. This is my main problem. Let's look at this from a different perspective: if I'm forced to use struct types, I can declare them inside an #include-guarded "region", but I still need to initialize them in a .cpp file, probably as global constants, so that each function needs fewer lines of code. Or I can simply keep initializing those structures inside the function. As I wrote earlier, I was not able to initialize them even inside an #ifdef block.
Bernard
>>>Here is an example:>>>
This is a good design; I will use it as a template. My design will be based on static functions inside my own namespace. For such a library I still do not want to go fully object-oriented. I still love the structured programming paradigm :)
SergeyKostrov
>>...Are you talking about the hiding differences between various OS or between various compilers?..
The 2nd part of your question, '...between various compilers...'. For example, both the Intel and MSC C++ compilers must compile the sources, and the results have to be the same ( or almost the same ).
SergeyKostrov
>>...I do not like macro definition.Usually I put in header file various constants and function declaration...
I was talking about macros like this:
[cpp]
...
#define HrtExchange( Value1, Value2, Unused ) \
{ \
    _asm MOV eax, [##Value1] \
    _asm MOV edx, [##Value2] \
    _asm MOV [##Value1], edx \
    _asm MOV [##Value2], eax \
}
...
[/cpp]
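For comparison with the macro, and in line with the earlier tip about templates, the same exchange can be written as a function template. This is only a sketch of an alternative, not part of the code discussed in the thread; `Exchange` is a hypothetical name. Unlike a macro, it can be stepped through in a debugger and its arguments are type-checked.

```cpp
// Hypothetical alternative to a swap macro: exchanges two values of any
// copyable type without inline assembly.
template <typename T>
inline void Exchange(T& a, T& b) {
    T tmp = a;  // keep the old value of a
    a = b;
    b = tmp;
}
```

A call such as `Exchange(x, y)` works for int, float or user-defined types alike, and the compiler is free to inline it.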
Bernard
>>>I was talking about macros like this:
[cpp]
...
#define HrtExchange( Value1, Value2, Unused ) \
{ \
    _asm MOV eax, [##Value1] \
    _asm MOV edx, [##Value2] \
    _asm MOV [##Value1], edx \
    _asm MOV [##Value2], eax \
}
[/cpp]
>>>
Theoretically I could move the simple scalar double-argument trig function computations to a header file and define them as macros, but what about the 4D double vectors and the complex multi-vector structures? I studied the design of the xnamath library, which is fully implemented inside a header file. What do you think about such a design? Such macros are good for removing the burden of coefficient calculation/initialization, but my intention is to vectorize the calculation, not to speed it up.
Regarding initializing the coefficient structures inside a header file, I abandoned that and will only declare them in the .h file. Inside the trig function I will initialize/instantiate those structures; with such a design it could be possible to shorten such a function by more than 100 lines of code. Now I'm testing a large 4x4D structure with 16 scalar sine arguments; so far I have been able to do simple vector computations.