
Bernard · ‎05-24-2012

Hello!
While coding various series expansions of many functions in assembly, I tried to optimize my code mainly by using the rcpps instruction instead of divps. When testing code which calculates the sine function by Taylor series, I did a few measurements as explained here: "http://software.intel.com/en-us/forums/showthread.php?t=52482" and the result was between 10-12 cycles per first term of the sine expansion (I used 14 terms).
I would like to ask how I can rewrite this code to make it execute faster.
[bash]
movups xmm0,argument
movups xmm1,argument
mulps xmm1,xmm1
mulps xmm1,xmm0
mov ebx,OFFSET coef
movups xmm2,[ebx]
rcpps xmm3,xmm2 ;rcpps used instead of divps
mulps xmm1,xmm3
subps xmm0,xmm1
[/bash]
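Because the Taylor coefficients are constants, one option is to store their reciprocals (1/3!, 1/5!, ...) directly, so that neither a division nor rcpps is needed at run time. rcpps also returns only an approximation (relative error up to 1.5 * 2^-12), which would limit the accuracy of every term. The snippet below is a minimal portable sketch of that idea in scalar C++, not the poster's assembly; the names `kInvOddFactorial`, `inv_odd_factorial` and `sin_two_terms` are assumptions made for illustration.

```cpp
#include <cassert>

// Reciprocals of the odd factorials 3!, 5!, 7!, 9!, stored as constants so
// that no division (and no rcpps approximation) happens at run time.
// The table name and its length are assumptions made for this sketch.
static const float kInvOddFactorial[4] = {
    1.0f / 6.0f,      // 1/3!
    1.0f / 120.0f,    // 1/5!
    1.0f / 5040.0f,   // 1/7!
    1.0f / 362880.0f  // 1/9!
};

// Returns 1/(2k+3)! for k = 0..3.
inline float inv_odd_factorial(int k) {
    assert(k >= 0 && k < 4);
    return kInvOddFactorial[k];
}

// First two terms of the sine series, x - x^3/3!, as in the snippet above,
// but with a multiplication by a stored constant instead of a reciprocal.
inline float sin_two_terms(float x) {
    return x - x * x * x * inv_odd_factorial(0);
}
```

In an SSE version the same table would simply hold each constant replicated across the four lanes.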

bronxzv · ‎06-08-2012

calls 1e6 times fastsin() the result in millisecond is 63

so it's 63 ns per iteration or ~ 120 clocks on your CPU, it doesn't match your previous reports IIRC

if you keep only the polynomial (get rid of the strange domain check) you should begin to see timings nearer to mine
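The arithmetic behind these figures can be written out as a small sketch. The clock rate is not stated in this part of the thread; a value near 1.9 GHz is an assumption chosen only because it reproduces the quoted ~120 clocks. The function names are hypothetical.

```cpp
#include <cassert>

// Average time per call in nanoseconds, given the total elapsed time in
// milliseconds and the number of calls (1 ms = 1.0e6 ns).
inline double ns_per_call(double total_ms, double calls) {
    assert(calls > 0.0);
    return total_ms * 1.0e6 / calls;
}

// Approximate clock cycles for a duration, given the clock rate in GHz
// (1 GHz = 1 cycle per nanosecond). The rate is an input, not a measurement.
inline double cycles_for(double ns, double clock_ghz) {
    return ns * clock_ghz;
}
```

With 63 ms for 1e6 calls this gives 63 ns per call, and 63 ns at the assumed 1.9 GHz is about 120 cycles.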


Bernard · ‎11-09-2012

>>>The 2nd part of your question '...between various compilers...'. For example, Intel and MSC C++ compilers must compile the sources and results have to be the same ( or almost the same ).>>> How can you design such a library? Do you perform extensive testing of your library on various compilers and, by observing the results, develop some form of abstraction which hides the differences between the compilers? For example, when your library function uses long double for the sin calculation and this function must be compiled by the Microsoft compiler, will it by design switch to using the double primitive type?
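One possible way to hide the long double difference asked about here, sketched under assumptions: select the library's floating-point type at compile time from `std::numeric_limits`. With the Microsoft compiler long double has the same 64-bit format as double, so the selection below falls back to double there. The names `lib_real_t` and `lib_real_digits` are hypothetical and are not taken from Sergey's library.

```cpp
#include <limits>
#include <type_traits>

// Hypothetical library typedef: use long double only where it carries more
// mantissa bits than double; otherwise fall back to plain double.
typedef std::conditional<
    (std::numeric_limits<long double>::digits >
     std::numeric_limits<double>::digits),
    long double,
    double>::type lib_real_t;

// Number of mantissa bits the selected type provides on this compiler.
inline int lib_real_digits() {
    return std::numeric_limits<lib_real_t>::digits;
}
```

Library code written against `lib_real_t` then compiles unchanged on both kinds of compiler, although results can still differ in the last bits where the wider type is available.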

Bernard · ‎11-09-2012

@Sergey How can I insert formatted code in a message like your template class example?

SergeyKostrov · ‎11-09-2012

>>...it could be possible to reduce such a function to more than 100 lines of code... You need to decide what is more important for you. That is, some number of extra lines ( could be hundreds and more ) in source codes, or performance.

SergeyKostrov · ‎11-09-2012

>>...How can I insert formatted code in message like your template class example?.. Have you read that comment: ... To enable syntax highlighting, surround the language with brackets, where language is one of the following languages: bash, csharp, cpp, css, fortran, jscript, java, perl, php, plain, python, r, ruby, sql, xml, html, javascript, s, splus. For example, *php*foo*/php* for PHP code. NOTE: I replaced square brackets with a character * Here is an example for *cpp* ... */cpp*:
[cpp]
#include *stdio.h*
void main( void )
{
    printf( "Hello, formatted source codes!\n" );
}
[/cpp]
NOTE: Problems with arrow-left and arrow-right characters after #include directive and extra lines are still not fixed. Well, Intel Software Developers... Is it going to take another couple of years?

SergeyKostrov · ‎11-09-2012

>>...How can you design such a library? Iliya, you need to create a prototype of your library with some number of stubs, use cases, test cases, etc. from the very beginning. Actually, this is a very big subject to discuss and it is all about Right Software Engineering. If some operating system, or a platform, or a CPU, or a C/C++ compiler is not considered from the beginning, it is very hard to add and provide support for it later, and it involves re-design, re-implementation, re-testing etc. An example of a poorly designed library is MFC. An example of a really good design is OWL ( Object Windows Library ), which was implemented by Borland Corporation and which many modern software developers don't know about.

Bernard · ‎11-09-2012

>>>You need to decide what is more important for you. That is, some number of extra lines ( could be hundreds and more ) in source codes, or performance.>>> I decided not to optimize for performance for now. Structures will be declared in a header file and initialized in a cpp file; array-like syntax will be used for initialization because of the small number of struct members. Later I will try to optimize for performance by using macros.

Bernard · ‎11-10-2012

Hi Sergey! I completely redesigned the vectorized fastsin() function. I reduced the size of the code by more than 70% (140 lines of code removed) and improved the inline SSE Horner scheme evaluation. Inside the _asm block I removed the even-power multiplication of the xmm register, and the total count of instructions executed per polynomial term is 3 (one mov, one mul, one add). As you can see from the code below, array-like initialization of structures was used. Here is the improved version:
[cpp]
inline struct SinVector *fastsinVec4D(struct Test1 *test1ptr1){
    if(test1ptr1 == NULL){
        return NULL;
    }else if(test1ptr1->c1 >= HALF_PI_FLT || test1ptr1->c2 >= HALF_PI_FLT || test1ptr1->c3 >= HALF_PI_FLT || test1ptr1->c4 >= HALF_PI_FLT){
        return NULL;
    }else if(test1ptr1->c1 <= NEG_HALF_PI_FLT || test1ptr1->c2 <= NEG_HALF_PI_FLT || test1ptr1->c3 <= NEG_HALF_PI_FLT || test1ptr1->c4 <= NEG_HALF_PI_FLT){
        return NULL;
    }else{
        SinVector sinvec1 = {-0.1666666,-0.1666666,-0.1666666,-0.1666666},*sinvec1ptr;
        sinvec1ptr = &sinvec1;
        SinVector sinvec2 = {0.0083333,0.0083333,0.0083333,0.0083333},*sinvec2ptr;
        sinvec2ptr = &sinvec2;
        SinVector sinvec3 = {-1.9841269e-4,-1.9841269e-4,-1.9841269e-4,-1.9841269e-4},*sinvec3ptr;
        sinvec3ptr = &sinvec3;
        SinVector sinvec4 = {2.7557319e-6,2.7557319e-6,2.7557319e-6,2.7557319e-6},*sinvec4ptr;
        sinvec4ptr = &sinvec4;
        SinVector sinvec5 = {-2.5052108e-8,-2.5052108e-8,-2.5052108e-8,-2.5052108e-8},*sinvec5ptr;
        sinvec5ptr = &sinvec5;
        SinVector sinvec6 = { 1.6059043e-10, 1.6059043e-10, 1.6059043e-10, 1.6059043e-10},*sinvec6ptr;
        sinvec6ptr = &sinvec6;
        SinVector sinvec7 = {-7.6471637e-13,-7.6471637e-13,-7.6471637e-13,-7.6471637e-13},*sinvec7ptr;
        sinvec7ptr = &sinvec7;
        SinVector sinvec8 = {2.8114572e-15,2.8114572e-15,2.8114572e-15,2.8114572e-15},*sinvec8ptr;
        sinvec8ptr = &sinvec8;
        SinVector sinvec9 = {-8.2206352e-18,-8.2206352e-18,-8.2206352e-18,-8.2206352e-18},*sinvec9ptr;
        sinvec9ptr = &sinvec9;
        SinVector sinvec10 = {1.9572941e-20,1.9572941e-20,1.9572941e-20,1.9572941e-20},*sinvec10ptr;
        sinvec10ptr = &sinvec10;
        SinVector sinvec11 = {-3.8681701e-23,-3.8681701e-23,-3.8681701e-23,-3.8681701e-23},*sinvec11ptr;
        sinvec11ptr = &sinvec11;
        SinVector result = {0.0f,0.0f,0.0f,0.0f},*resultptr;
        resultptr = &result;
        _asm{
            xorps xmm0,xmm0
            xorps xmm1,xmm1
            xorps xmm6,xmm6
            xorps xmm7,xmm7
            xorps xmm5,xmm5
            movups xmm0,test1 //arg x,y,z,w
            movups xmm7,xmm0 // copy of arg xmm7 accumulator
            mulps xmm0,test1 //x^2
            mulps xmm0,test1 //x^3
            movups xmm1,sinvec1
            mulps xmm0,xmm1
            addps xmm7,xmm0
            movups xmm1,sinvec2
            mulps xmm0,xmm1
            addps xmm7,xmm0
            movups xmm1,sinvec3
            mulps xmm0,xmm1
            addps xmm7,xmm0
            movups xmm1,sinvec4
            mulps xmm0,xmm1
            addps xmm7,xmm0
            movups xmm1,sinvec5
            mulps xmm0,xmm1
            addps xmm7,xmm0
            movups xmm1,sinvec6
            mulps xmm0,xmm1
            addps xmm7,xmm0
            movups xmm1,sinvec7
            mulps xmm0,xmm1
            addps xmm7,xmm0
            movups xmm1,sinvec8
            mulps xmm0,xmm1
            addps xmm7,xmm0
            movups xmm1,sinvec9
            mulps xmm0,xmm1
            addps xmm7,xmm0
            movups xmm1,sinvec10
            mulps xmm0,xmm1
            addps xmm7,xmm0
            movups xmm1,sinvec11
            mulps xmm0,xmm1
            addps xmm7,xmm0
            movups result,xmm7
        }
        return resultptr;
    }
}
[/cpp]
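For reference, a Horner scheme keeps one running value and multiplies it by x^2 before adding each next coefficient, so every term still costs one multiply and one add, but the powers of x are carried along. The following is a portable scalar sketch of that scheme over four lanes, not the _asm version posted above. The names `Lanes4`, `kSinCoef` and `sin_horner4` are assumptions for illustration, only the first five coefficients are used, and the result is returned by value rather than through a pointer to a local.

```cpp
#include <cstddef>

// Four-lane value mirroring the x, y, z, w layout used in the post.
struct Lanes4 { float v[4]; };

// Taylor coefficients of x^3 .. x^11: -1/3!, 1/5!, -1/7!, 1/9!, -1/11!.
static const float kSinCoef[] = {
    -1.6666667e-1f, 8.3333333e-3f, -1.9841270e-4f,
    2.7557319e-6f, -2.5052108e-8f
};
static const std::size_t kSinCoefCount = sizeof(kSinCoef) / sizeof(kSinCoef[0]);

// sin(x) ~ x * (1 + x2*(c0 + x2*(c1 + x2*(c2 + ...)))) with x2 = x*x.
// Intended for arguments in (-pi/2, pi/2), like the domain check above.
inline Lanes4 sin_horner4(const Lanes4& x) {
    Lanes4 r;
    for (int lane = 0; lane < 4; ++lane) {
        const float xv = x.v[lane];
        const float x2 = xv * xv;
        // Start from the highest-order coefficient and work downwards.
        float acc = kSinCoef[kSinCoefCount - 1];
        for (std::size_t i = kSinCoefCount - 1; i-- > 0; ) {
            acc = acc * x2 + kSinCoef[i];   // one multiply, one add per term
        }
        r.v[lane] = xv * (1.0f + x2 * acc);
    }
    return r;
}
```

In an SSE version the inner loop body maps onto one mulps and one addps per coefficient, with x2 held in a register for the whole evaluation.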

Bernard · ‎11-10-2012

>>>An example of a poorly designed library is MFC.>>> I'm not surprised reading this sentence. >>>Iliya, you need to create a prototype of your library with some numbers of stubs, use cases, test cases, etc from the very beginning.>>> I was interested mainly in the inner workings of such a library. For example, do your library functions, or maybe macros, perform some tests to check the exact CPU version by wrapping the CPUID instruction? Another check could be run against various OS's and compilers.

SergeyKostrov · ‎11-10-2012

>>For example does your library function or maybe macros perform some tests to check exact CPU version by wrapping CPUID instruction. Not at the moment. A small R&D test was done, however. Unfortunately, there are so many CPUs with different instruction sets that it's very hard to support all of them. Identifying a CPU is not the problem; once again, supporting many CPUs is the problem. >>Another check could be run against various OS's and compilers. Yes and I'll post some codes ( examples ) later.
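A check against compilers and operating systems can be done at compile time with the macros each compiler predefines. The sketch below is only an illustration of that idea and not Sergey's code; the function names `lib_compiler_name` and `lib_os_name` are hypothetical. The Intel and Clang macros are tested first because those compilers may also define `_MSC_VER` or `__GNUC__` for compatibility.

```cpp
// Name of the compiler, selected at compile time from predefined macros.
inline const char* lib_compiler_name() {
#if defined(__INTEL_COMPILER)
    return "Intel C++";
#elif defined(__clang__)
    return "Clang";
#elif defined(_MSC_VER)
    return "Microsoft C++";
#elif defined(__GNUC__)
    return "GCC";
#else
    return "unknown";
#endif
}

// Name of the target operating system family, selected the same way.
inline const char* lib_os_name() {
#if defined(_WIN32)
    return "Windows";
#elif defined(__linux__)
    return "Linux";
#elif defined(__APPLE__)
    return "Apple";
#else
    return "unknown";
#endif
}
```

A library would normally use the same macros to select typedefs or code paths rather than strings; the strings just make the selection visible in a test log.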

Bernard · ‎11-10-2012

>>>Yes and I'll post some codes ( examples ) later.>>> Thanks, it would be very interesting to gain some insight into the inner workings of such a library. Now I'm planning to implement my own library of "primitive" intrinsics. My intention is to wrap various SSEn instructions in C++ static methods which will perform register loading, vector and scalar calculation, and some data conversion, for example converting from a "some_type" array where 0<3 to a "some_type" structure. What is your opinion about this?
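The shape of such a wrapper can be sketched in portable C++ before any SSE instruction is involved. Everything below is hypothetical: the names `Vec4` and `VecLib`, the choice of float, and the rule that missing lanes are filled with zero are assumptions made for illustration, not the poster's design.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical 4-component structure that an array is converted into.
struct Vec4 { float x, y, z, w; };

// Hypothetical wrapper class: static methods for load, add and store.
struct VecLib {
    // Convert an array of up to four floats to a Vec4; missing lanes are 0.
    static Vec4 load(const float* src, std::size_t count) {
        assert(src != 0 && count <= 4);
        float lanes[4] = {0.0f, 0.0f, 0.0f, 0.0f};
        for (std::size_t i = 0; i < count; ++i) lanes[i] = src[i];
        Vec4 v = {lanes[0], lanes[1], lanes[2], lanes[3]};
        return v;
    }

    // Component-wise addition (the place where addps would go).
    static Vec4 add(const Vec4& a, const Vec4& b) {
        Vec4 r = {a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w};
        return r;
    }

    // Convert a Vec4 back to a caller-provided array of four floats.
    static void store(const Vec4& v, float* dst) {
        assert(dst != 0);
        dst[0] = v.x; dst[1] = v.y; dst[2] = v.z; dst[3] = v.w;
    }
};
```

Keeping the interface fixed like this lets the body of each method be replaced by an intrinsic or an _asm block later without touching callers.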

SergeyKostrov · ‎11-11-2012

>>... What is your opinion about this?... Please post a prototype of some function that does what you've described for a code review. Everything is possible but it's hard to tell anything now.

Bernard · ‎11-11-2012

>>>Please post a prototype of some function that does what you've described for a code review. Everything is possible but it's hard to tell anything now.>>> You are right, how could I have forgotten :) _VecLibDouble is a typedef of the double primitive type. Here is the example, which has already been tested:
[cpp]
_VecLibDouble *Vec_Add2D_d(_VecLibDouble a[],_VecLibDouble b[]){
    _VecLibDouble sum[] = {0.0,0.0};
    if(a == NULL || b == NULL){
        return NULL;
    }else if(sizeof(a)/sizeof(a[0]) > 2 || sizeof(b)/sizeof(b[0]) > 2){
        return NULL;
    }else{
        _asm {
            xorpd xmm0,xmm0
            xorpd xmm1,xmm1
            xor eax,eax
            xor edx,edx
            mov eax,
            mov edx,
            movupd xmm0,[eax]
            movupd xmm1,[edx]
            addpd xmm1,xmm0
            movupd sum,xmm1
        }
        return sum;
    }
}
[/cpp]
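Two details of this prototype are worth a note. In C and C++ a parameter declared as `a[]` is adjusted to a pointer, so `sizeof(a)/sizeof(a[0])` inside the function compares the pointer size with the element size and cannot detect an over-long array. Also, `sum` is a local array, so a pointer to it is no longer valid after the function returns. A minimal sketch of one alternative follows, with explicit lengths and a caller-provided output buffer; the name `vec_add2d` and its signature are hypothetical, and plain C++ addition stands in for the _asm block.

```cpp
#include <cstddef>

// Hypothetical variant of Vec_Add2D_d: lengths are passed explicitly and the
// caller supplies the output buffer, so no pointer to a local array escapes.
// Returns false if an argument is null or a length is not exactly 2.
inline bool vec_add2d(const double* a, std::size_t a_len,
                      const double* b, std::size_t b_len,
                      double* out) {
    if (a == 0 || b == 0 || out == 0) {
        return false;
    }
    if (a_len != 2 || b_len != 2) {
        return false;
    }
    out[0] = a[0] + b[0];   // an addpd on both lanes would go here
    out[1] = a[1] + b[1];
    return true;
}
```

A caller would pass `sizeof(arr)/sizeof(arr[0])` computed at the point where `arr` is still a real array.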

SergeyKostrov · ‎11-12-2012

Hi Iliya, Here is my feedback:
[cpp]
_VecLibDouble * Vec_Add2D_d( _VecLibDouble a[], _VecLibDouble b[] )
{
    if( a == NULL || b == NULL )
    {
        return ( _VecLibDouble * )NULL;
    }
    else if( sizeof(a)/sizeof(a[0]) > 2 || sizeof(b)/sizeof(b[0]) > 2 )
    {
        return ( _VecLibDouble * )NULL;
    }
    else
    {
        _VecLibDouble sum[] = { 0.0L, 0.0L };
        _asm {
            xorpd xmm0,xmm0
            xorpd xmm1,xmm1
            xor eax,eax
            xor edx,edx
            mov eax,
            mov edx,
            movupd xmm0,[eax]
            movupd xmm1,[edx]
            addpd xmm1,xmm0
            movupd sum,xmm1
        }
        return ( _VecLibDouble * )sum;
    }
}
[/cpp]
I always follow very strict rules:
- Don't initialize anything before all input parameters are verified
- I know that it looks like useless: 'return ( _VecLibDouble * )sum' ( it is my personal style for many years already )
- When i...

Bernard · ‎11-12-2012

>>>I always follow very strict rules: - Don't initialize anything before all input parameters are verified - I know that it looks like useless: 'return ( _VecLibDouble * )sum' ( it is my personal style for many years already ) - When it comes to 'double's use 'L' suffix - What about a use case on how the function will be used?>>> Thank you for your programming tips, I really appreciate them. >>>I know that it looks like useless: 'return ( _VecLibDouble * )sum' ( it is my personal style for many years already )>>> Is it necessary to cast the sum pointer to the _VecLibDouble type? I think that this is needed to ensure the proper return pointer type. >>>- When it comes to 'double's use 'L' suffix>>> Are you referring here to the long integer type? This rule probably forces the compiler to allocate 8 bytes for double type storage. >>>- What about a use case on how the function will be used?>>> The simplest use case could be passing two 2-component [x,y] 1D vectors and adding them by using SIMD XMM registers. The result is accumulated in xmm1 and passed to the sum array. I think that inside the _asm block I won't use the eax register for passing the address of a function argument. I have also created a typedef struct _Vec128_f which contains 4 floats aligned on a 16-byte boundary. Tomorrow I will post more test-cases.
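On the suffix question: for a floating-point literal the `L` suffix selects long double, not a long integer, and an unsuffixed literal such as `0.0` is already a double. The compile-time checks below are a small sketch that shows this with standard C++11 facilities; the helper function name is made up for the example.

```cpp
#include <type_traits>

// Types of floating-point literals, checked at compile time.
static_assert(std::is_same<decltype(0.0), double>::value,
              "an unsuffixed floating literal is a double");
static_assert(std::is_same<decltype(0.0L), long double>::value,
              "the L suffix makes a long double literal");
static_assert(std::is_same<decltype(0.0f), float>::value,
              "the f suffix makes a float literal");

// A long double literal used to initialise a double is converted to double;
// the storage size of 'd' is sizeof(double) either way.
inline double zero_from_long_double_literal() {
    double d = 0.0L;
    return d;
}
```

So `{ 0.0L, 0.0L }` and `{ 0.0, 0.0 }` initialise a double array to the same values, and the suffix does not change how many bytes the array occupies.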

SergeyKostrov · ‎11-12-2012

>>>>I know that it looks like useless: 'return ( _VecLibDouble * )sum' ( it is my personal style for many years already ) >> >>Is it necessesery to cast sum pointer to _VecLibDouble type?I think that this is needed to reassure proper return type pointer. No. As I said, this is simply my personal style. A C/C++ compiler must do it for you by default. But when I do a code review I prefer to see the type of a return value explicitly cast to some type.

SergeyKostrov · ‎11-12-2012

Hi Iliya, Here is a note about junk-codes. Even on a small project they accumulate pretty quickly. After a couple of weeks a team could have hundreds of junk-codes. After a couple of months it could already be as many as thousands. An example is below:
[cpp]
_VecLibDouble * Vec_Add2D_d( _VecLibDouble a[], _VecLibDouble b[] )
{
    if( a == NULL || b == NULL )
    { // Do you need a bracket here? That depends on a style and keep it if you like it. Some companies could require the bracket (!).
        return ( _VecLibDouble * )NULL;
    } //...
    else if( sizeof(a)/sizeof(a[0]) > 2 || sizeof(b)/sizeof(b[0]) > 2 )
    { // Do you need a bracket here? That depends on a style and keep it if you like it.
        return ( _VecLibDouble * )NULL;
    } //...
    else // You don't need 'else' here. Note: In that case it could be your style of programming and that's OK.
    {
        _VecLibDouble sum[] = { 0.0L, 0.0L };
        ...
        return ( _VecLibDouble * )sum;
    }
}
[/cpp]
and a modified function could look like:
[cpp]
_VecLibDouble * Vec_Add2D_d( _VecLibDouble a[], _VecLibDouble b[] )
{
    if( a == NULL || b == NULL )
        return ( _VecLibDouble * )NULL;
    else if( sizeof(a)/sizeof(a[0]) > 2 || sizeof(b)/sizeof(b[0]) > 2 )
        return ( _VecLibDouble * )NULL;

    _VecLibDouble sum[] = { 0.0L, 0.0L };
    ...
    return ( _VecLibDouble * )sum;
}
[/cpp]

Bernard · ‎11-12-2012

>>>Hi Iliya, Here is a note about junk-codes. Even on a small project they are accumulating pretty quickly. After a couple of weeks a team could have hundreds of junk-codes. After a couple of months it could be already as many as thousands. An example is below:>>> Yes, I know that pretty well, but as you have deduced, using if-else with block statement brackets is my style. It will be changed in the release build as an attempt to optimize my library. I would like to ask you how an XMM register-loading intrinsic could be implemented. I mean, should I design such functions with a void return type, or is the better option to load the xmm register with one of my "typedefs" and return a pointer to the structure? I studied the Intel load intrinsics and they return the __m128 type when passed a float or double type array as an argument.

SergeyKostrov · ‎11-13-2012

>>...should I design such a functions with void return type or maybe better option is to load xmm register with one of my "typedefs" and >>return pointer to structure... I would consider a solution with as little overhead as possible in terms of performance. So, you could try both versions, evaluate performance and then select the fastest. It always makes sense to create as many versions as possible of some new software subsystem, or a function, in order to select the best solution.
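The "try both versions and measure" advice can be sketched as two candidate load functions plus a small timing helper. This is an illustration under assumptions only: `Loaded4`, `load_by_value`, `load_to_out` and `time_ns` are hypothetical names, plain copies stand in for the SSE load, and an optimizing compiler may remove work whose result is unused, so real measurements need more care than this.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical 4-float structure standing in for one of the "typedefs".
struct Loaded4 { float v[4]; };

// Version 1: the loaded value is returned by value.
inline Loaded4 load_by_value(const float* src) {
    Loaded4 r;
    for (int i = 0; i < 4; ++i) r.v[i] = src[i];
    return r;
}

// Version 2: void return type, the result is written through a pointer.
inline void load_to_out(const float* src, Loaded4* out) {
    for (int i = 0; i < 4; ++i) out->v[i] = src[i];
}

// Calls f() 'iterations' times and returns the elapsed time in nanoseconds.
template <typename F>
inline long long time_ns(F f, std::size_t iterations) {
    const std::chrono::steady_clock::time_point start =
        std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        f();
    }
    const std::chrono::steady_clock::time_point stop =
        std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(
               stop - start).count();
}
```

Both versions would be timed with the same input and iteration count, and the comparison repeated for each compiler and optimization level of interest, since inlining can make the two forms compile to the same code.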

Bernard · ‎11-13-2012

>>>I would consider a solution with as Less As Possible Overhead in terms of performance. So, you could try both versions, evaluate performance and then select the fastest. It always makes sense to create as many as possible versions of some new software subsystem, or a function, in order to select the best solution.>>> I'm designing my library exactly as you advised me. Two primitive vector types were created: one is based on structures and the second is based on arrays. Also, more than a dozen various load, convert and store functions are planned to be written and tested soon. I will constantly post various small test-cases.

SergeyKostrov · ‎11-13-2012

>>...I will constantly post various small test-cases... Hi Iliya, Thank you.

Bernard · ‎11-13-2012

>>>Hi Iliya, Thank you.>>> I will start posting test-cases on Friday. I need to prepare myself for a Java programming related job interview. I would like to ask you what kind of questions/tests I need to be prepared for. What could I be asked by the reviewer? I am asking this here because I cannot send you a private message (the system is rejecting my attempts).

Optimization of sine function's Taylor expansion