Solved: Optimization of sine function's Taylor expansion - Page 15

Bernard · ‎05-24-2012

Hello!
While coding various series expansions of many functions in assembly, I tried to optimize my code mainly by using the rcpps instruction instead of divps. When testing code that calculates the sine function by Taylor series, I did a few measurements as explained here: "http://software.intel.com/en-us/forums/showthread.php?t=52482", and the result was between 10 and 12 cycles for the first term of the sine expansion (I used 14 terms).
I would like to ask how I can rewrite this code to improve its execution speed.
[bash]
movups xmm0,argument
movups xmm1,argument
mulps xmm1,xmm1
mulps xmm1,xmm0
mov ebx,OFFSET coef
movups xmm2,[ebx]
rcpps xmm3,xmm2 ;rcpps used instead of divps
mulps xmm1,xmm3
subps xmm0,xmm1
[/bash]

bronxzv · ‎06-08-2012

calls fastsin() 1e6 times; the result in milliseconds is 63

so it's 63 ns per iteration, or ~120 clocks on your CPU; it doesn't match your previous reports IIRC

if you keep only the polynomial (get rid of the strange domain check) you should begin to see timings nearer to mine

Bernard · ‎11-08-2012

>>>But, you're trying to use a Java-like style of programming especially when it comes to declarations of different helper types, like structures, and initialization of members of these helper types. I'll take a look at your function tomorrow.>>>
Thanks for the answer. Structure types can be accessed with pointers or with the 'dot' operator; it is a question of programming style. Today I will switch to 2D, 3D and 4D array types and try to eliminate those pesky structures. If it works, I can save more than 100 lines of code. I will post the update.
Update on the vectorised fastsin() with array-type coefficients: sadly, I still cannot load XMM registers with array-type coefficients. As I told you earlier, only passing a structure reference to an XMM register works. I will investigate this issue with the debugger today.

Bernard · ‎11-08-2012

I did preliminary tests on the execution speed of the _asm{} block. The tests were made with RDTSC instructions executed inside the _asm{} block. After a few tests I got ~1567 cycles. Even if the latency of accessing and zeroing the eax and edx registers and of the RDTSC and CPUID instructions is included, executing the whole block should not take more than ~300-400 cycles. Inline assembly code should be very fast, and the total latency per Horner-scheme polynomial term is not more than 4-5 cycles. So the total calculation time for 10 terms should be ~40-50 cycles, excluding xoring the XMM registers and copying the x^2 value. Could you run a set of tests on this?

SergeyKostrov · ‎11-08-2012

>> Tip #1 <<
Try to use 'typedef's for declaration of structures instead of 'struct':

    typedef struct tagSOMESTRUCT { } SOMESTRUCT;

>> Tip #2 <<
Lots of unnecessary things in that piece of code:

    ...
    if(arg1.f1 >= HALF_PI_FLT || arg1.f2 >= HALF_PI_FLT || arg1.f3 >= HALF_PI_FLT || arg1.f4 >= HALF_PI_FLT)
    {
        vec4D vec;
        vec.f1 = arg1.f1;
        vec.f2 = arg1.f2;
        vec.f3 = arg1.f3;
        vec.f4 = arg1.f4;
        vec.f1 = (vec.f1 - vec.f1)/(vec.f1 - vec.f1);
        vec.f2 = (vec.f2 - vec.f2)/(vec.f2 - vec.f2);
        vec.f3 = (vec.f3 - vec.f3)/(vec.f3 - vec.f3);
        vec.f4 = (vec.f4 - vec.f4)/(vec.f4 - vec.f4);
        return vec;
    }
    ...

Why don't you want to return just NULL:

    ...
    if(arg1.f1 >= HALF_PI_FLT || arg1.f2 >= HALF_PI_FLT || arg1.f3 >= HALF_PI_FLT || arg1.f4 >= HALF_PI_FLT)
    {
        return (vec &)NULL;
    }
    ...

If there is an error with one of the input parameters I try to leave a function / method as soon as possible without doing any calculations.
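One caution on the second snippet: a C++ reference cannot legitimately refer to a null object, so `return (vec &)NULL;` is not well defined for a function that returns `vec4D` by value. A sketch of an early exit that still leaves immediately, but returns well-defined quiet NaNs, could look like this. The field names `f1`..`f4` and `HALF_PI_FLT` follow the thread; the struct layouts, the value given to `HALF_PI_FLT`, and the helper names `out_of_domain` and `nan_vec4D` are assumptions.

```cpp
#include <limits>

const float HALF_PI_FLT = 1.57079632679f;     // ASSUMED value of the thread's constant

struct vec4D      { float f1, f2, f3, f4; };  // assumed layout
struct Argument4D { float f1, f2, f3, f4; };  // assumed layout

// True if any lane is at or above the upper bound (same test as in the thread).
inline bool out_of_domain(const Argument4D& a) {
    return a.f1 >= HALF_PI_FLT || a.f2 >= HALF_PI_FLT ||
           a.f3 >= HALF_PI_FLT || a.f4 >= HALF_PI_FLT;
}

// A vec4D filled with quiet NaNs, without computing 0/0 at run time.
inline vec4D nan_vec4D() {
    const float qnan = std::numeric_limits<float>::quiet_NaN();
    vec4D v = { qnan, qnan, qnan, qnan };
    return v;
}
```

Inside `fastsin4D` the check would then reduce to `if (out_of_domain(arg1)) return nan_vec4D();`.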

SergeyKostrov · ‎11-08-2012

>> Tip #3 <<

    inline struct vec4D fastsin4D( struct Argument4D arg1 ) { ... }

This is a C-like, that is very obsolete, way of declaring a function. Also, why don't you want to have a class that wraps and consolidates all common things, variables, declarations, etc?

SergeyKostrov · ‎11-08-2012

>> Tip #4 <<
There are lots of declarations of some types and initializations of some constants inside of the function:

    inline struct vec4D fastsin4D( struct Argument4D arg1 )
    {
        ...
        struct CoeffFlt1 { float c1,c2,c3,c4; } coeffflt1;
        ...
        struct CoeffFlt10 { float c1,c2,c3,c4; } coeffflt10;
        ...
        coeffflt1.c1 = -0.1666666;
        coeffflt1.c2 = -0.1666666;
        coeffflt1.c3 = -0.1666666;
        coeffflt1.c4 = -0.1666666;
        ...
        coeffflt10.c1 = -3.8681701e-23;
        coeffflt10.c2 = -3.8681701e-23;
        coeffflt10.c3 = -3.8681701e-23;
        coeffflt10.c4 = -3.8681701e-23;
        ...
    }

- Declarations of some types should be done only once even if it is a compile time thing
- Initializations of constants could be done only once during initialization of your library

SergeyKostrov · ‎11-08-2012

>> Tip #5 <<
Think about applications and use cases of your API. Will a developer X be happy with your function(s) / class(es) or not?

>> Tip #6 <<
Have a test-case(s) for your API from the very beginning.

>> Tip #7 <<
Evaluate performance of your API from the very beginning. It will save time later and you won't need to do a re-design if some function doesn't work fast.

Bernard · ‎11-08-2012

>>>- Declarations of some types should be done only once even if it is a compile time thing - Initializations of constants could be done only once during initialization of your library>>>
Thank you very much for your help. I'm still learning, and switching between 3 programming languages sometimes confuses me :) Regarding the declaration of those sine coefficients and their structures, I will "typedef" them and move them to a header file. Is this a good solution to this problem?

Bernard · ‎11-08-2012

>>>This is a C-like, that is very obsolete, way of declaring a function. Also, why don't you want to have a class that wraps and consolidates all common things, variables, declarations, etc?>>>
Yes, I know this, and I'm only at an early stage of the library design. Later I will create a static library wrapped in classes with static member functions. For now I have only tested a few possible ways of vectorizing the input argument.

Bernard · ‎11-08-2012

I think that I will switch to intrinsics.

Bernard · ‎11-08-2012

>>> Tip #5: Think about applications and use cases of your API. Will a developer X be happy with your function(s) / class(es) or not? Tip #6: Have a test-case(s) for your API from the very beginning. Tip #7: Evaluate performance of your API from the very beginning. It will save time later and you won't need to do a re-design if some function doesn't work fast.>>>
I will proceed with the design of the library according to your advice. Today I tested a large composite structure with four 4D inner structures, put simply a matrix of 4x4D vectors. I was able to load them into XMM registers and perform various vector-wise and scalar-wise operations. I could try to calculate 16 scalar sine values and return a pointer to such a matrix.

Bernard · ‎11-08-2012

>>>Try to use 'typedef's for declaration of structures instead of 'struct':>>>
Yes, I will do this. My intention is to wrap primitive types and data types in my own "typedefs". By the way, is it recommended to declare and initialize a structure in a header file? I tried to do this, and even after using include guards I got compile errors.
>>>Lots of unnecessary things in that piece of code:>>>
That piece of code simply fills and returns a structure with NaN values, but I agree with you that this adds unnecessary clutter. I could also fix the values at, for example, HALF_PI and make a recursive call, but the better option is simply to return NULL.

SergeyKostrov · ‎11-08-2012

>>>>Try to use 'typedef's for declaration of structures instead of 'struct':
>>
>>Yes I will do this. My intention is to wrap primitive types and data types in my own "typedefs".
>>Btw Is it recommended to declare and initialize structure in header file?

- Declaration could be done in both. That is, in a header file or in a cpp-file
- Initialization has to be done in a cpp-file

Note: If you do initialization in a header file you can get a compilation error like 'A variable is already declared' or something like that ( every C++ compiler will display a different error ).

>>I tried to do this and even after using include guards I got compile errors.

Could you post a text of the compiler error? What about a really small reproducer?
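The split described above can be sketched as follows. Include guards only prevent a header from being included twice in the same translation unit; they do not stop a variable defined in a header from being defined again in every .cpp file that includes it, which is the usual source of "already defined" errors. The file names and the identifiers `COEFF4` and `g_sinCoeff1` are hypothetical, and both parts are shown in one listing so that it compiles as a single file.

```cpp
// ---- sincoeffs.h (hypothetical): type and declaration only ----
#ifndef SINCOEFFS_H
#define SINCOEFFS_H

typedef struct tagCOEFF4 { float c1, c2, c3, c4; } COEFF4;

extern const COEFF4 g_sinCoeff1;  // declared here, defined in exactly one .cpp

#endif  // SINCOEFFS_H

// ---- sincoeffs.cpp (hypothetical): the single definition ----
const COEFF4 g_sinCoeff1 = { -0.1666666f, -0.1666666f, -0.1666666f, -0.1666666f };
```

Any other .cpp file that includes the header sees only the `extern` declaration, and the linker resolves it to the one definition.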

SergeyKostrov · ‎11-08-2012

Hi Iliya,

Here are a couple more tips:
- Use of macros helps in reducing the overall size of sources ( Note: macros can't be debugged! )
- Try to use Hungarian Notation when declaring variables or members of a class
- Consider C++ templates ( one implementation will cover many different types )
- If possible, consider portability ( Portability is always a challenge! )

I could provide some little examples; just let me know what you need.

Best regards,
Sergey
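As a small illustration of the template tip, here is a sketch of a Horner evaluation written once and usable for both `float` and `double` coefficient arrays. The name `HornerEval` is hypothetical and not taken from the library under discussion.

```cpp
#include <cstddef>

// Evaluates c[0] + c[1]*x + ... + c[N-1]*x^(N-1) by Horner's scheme.
template <class T, std::size_t N>
inline T HornerEval(const T (&c)[N], T x) {
    T p = c[N - 1];
    for (std::size_t i = N - 1; i-- > 0; ) {
        p = p * x + c[i];  // fold in the next lower-order coefficient
    }
    return p;
}
```

The array length `N` and the element type `T` are deduced from the argument, so the same source serves single and double precision.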

SergeyKostrov · ‎11-08-2012

>>Regarding declaration of those sine coefficients and their structures I will "typedef" them and move them to header file. Is this
>>good solution to this problem?

Here is an example:
[cpp]
template < class T > class TSomeClass
{
public:
    TSomeClass( void ) { InitData(); };
    virtual ~TSomeClass( void ) { };

private:
    inline void InitData( void )
    {
        m_tNF2 = ( T )1.0 / ( T )2.0;
        m_tNF3 = ( T )1.0 / ( T )6.0;
        m_tNF4 = ( T )1.0 / ( T )24.0;
        m_tNF5 = ( T )1.0 / ( T )120.0;
        ...
    };

private:
    T m_tNF2;
    T m_tNF3;
    T m_tNF4;
    T m_tNF5;
    ...
};
[/cpp]

SergeyKostrov · ‎11-08-2012

This is a test:

    #include <stdio.h>

    int main( void )
    {
        printf("Pasted from VS 2005");
        return 0;
    }

Bernard · ‎11-08-2012

>>>- Use of macros help in reducing the overall size of sources ( Note: macros can't be debugged! )>>>
I do not like macro definitions. Usually I put various constants and function declarations in a header file; the length of the source code does not matter to me :)
>>>Try to use a Hungarian Notation when declaring variables or members of a class>>>
I hate it, but I will use it in order to prevent name conflicts.
>>>- If possible, consider portability ( Portability is always a challenge! )>>>
What do you mean by this? Are you talking about hiding the differences between various OSes or between various compilers? I think the portability question is still too complicated for me. I have no more than 10 months of programming experience :)

Bernard · ‎11-09-2012

>>>- Declaration could be done in both. That is, in a header file or in a cpp-file - Initialization has to be done in a cpp-file Note: If you do initialization in a header file you can get a compilation error like 'A variable is already declared' or something like that ( every C++ compiler will display a different error ).>>>
I could declare and initialize my own typedefs for the pre-calculated coefficients (primitive types) of various trigonometric and special functions and put them in a .h file, but I won't be able to load them into XMM registers inside the asm block. This is my main problem. Let's look at this from a different perspective: if I'm forced to use struct types, I can declare them inside an #include-guarded "region", but I still need to initialize them in a .cpp file, probably as global constants, so that fewer lines of code are spent in every function. Or I can simply keep initializing those structures inside the function. As I wrote earlier, I was not able to initialize them even inside an #ifdef block.

Bernard · ‎11-09-2012

>>>Here is an example:>>>
This is a good design; I will use it as a template. My design will be based on static functions inside my own namespace. For such a library I still do not want to go fully object-oriented. I still love the structured programming paradigm :)

SergeyKostrov · ‎11-09-2012

>>...Are you talking about the hiding differences between various OS or between various compilers?..

The 2nd part of your question, '...between various compilers...'. For example, both the Intel and MSC C++ compilers must compile the sources, and the results have to be the same ( or almost the same ).

SergeyKostrov · ‎11-09-2012

>>...I do not like macro definition.Usually I put in header file various constants and function declaration...

I was talking about macros like this:

    ...
    #define HrtExchange( Value1, Value2, Unused ) \
    { \
        _asm MOV eax, [##Value1] \
        _asm MOV edx, [##Value2] \
        _asm MOV [##Value1], edx \
        _asm MOV [##Value2], eax \
    }
    ...

Bernard · ‎11-09-2012

>>>I was talking about macros like this:

    ...
    #define HrtExchange( Value1, Value2, Unused ) \
    { \
        _asm MOV eax, [##Value1] \
        _asm MOV edx, [##Value2] \
        _asm MOV [##Value1], edx \
        _asm MOV [##Value2], eax \
    }
>>>
Theoretically I could move the computation of simple trigonometric functions with a scalar double argument to a header file and define them as macros, but what about the 4D double vectors and complex multi-vector structures? I studied the design of the xnamath library, which is fully implemented inside a header file. What do you think about such a design? Such macros are good for removing the burden of coefficient calculation/initialization, but my intention is to vectorize the calculation, not to speed it up.
Regarding initialization of the coefficient structures inside a header file, I abandoned it and will simply declare them in the .h file. Inside the trigonometric function I will initialize/instantiate those structures; with such a design it could be possible to reduce such a function by more than 100 lines of code. Now I'm testing a large 4x4D structure with 16 scalar sine arguments; so far I have been able to do simple vector computations.