Intel® ISA Extensions
Use hardware-based isolation and memory encryption to provide more code protection in your solutions.
1135 Discussions

Optimization of sine function's taylor expansion

Bernard
Valued Contributor I
37,624 Views

Hello!
While coding various series expansions of many functions in assembly, I tried to optimize my code mainly by using the rcpps instruction instead of divps. When testing code which calculates the sine function by its Taylor series, I did a few measurements as explained here: "http://software.intel.com/en-us/forums/showthread.php?t=52482", and the result was between 10-12 cycles per first term of the sine expansion (I used 14 terms).
I would like to ask how I can rewrite this code to improve its speed of execution.
[bash]
movups xmm0,argument
movups xmm1,argument
mulps  xmm1,xmm1
mulps  xmm1,xmm0
mov    ebx,OFFSET coef
movups xmm2,[ebx]
rcpps  xmm3,xmm2    ; rcpps used instead of divps
mulps  xmm1,xmm3
subps  xmm0,xmm1
[/bash]

0 Kudos
1 Solution
bronxzv
New Contributor II
37,433 Views

calling fastsin() 1e6 times gives a result of 63 milliseconds

so it's 63 ns per iteration, or ~120 clocks on your CPU; it doesn't match your previous reports IIRC

if you keep only the polynomial (get rid of the strange domain check) you should begin to see timings nearer to mine

View solution in original post

0 Kudos
342 Replies
SergeyKostrov
Valued Contributor II
2,267 Views
Quoting iliyapolak
Sergey, could you explain how a Run-Time HAL is implemented in software? I think that a possible implementation would be to wrap inline assembly in custom library functions.


Hi Iliya,

There is a technique in C/C++ that I would call 'macro substitution'. Actually, it is as old as the C language, and here is
a small example:

#include <stdio.h>

int FunctionA( int iParamA );
int FunctionB( int iParamB );

int FunctionA( int iParamA )
{
    printf( "FunctionA called\n" );
    return iParamA;
}

int FunctionB( int iParamB )
{
    printf( "FunctionB called\n" );
    return iParamB;
}

#define _USE_FUNCTION_A
//#define _USE_FUNCTION_B

#ifdef _USE_FUNCTION_A
#define Function FunctionA
#endif
#ifdef _USE_FUNCTION_B // I always use explicit #ifdefs and I never use #ifdef-#else-#endif
#define Function FunctionB
#endif

int main( void )
{
    Function( 1 );
    return 0;
}

In essence, the HRT AL in the ScaLib library is based on that. Remember that this is a compile-time technique. If you need
a similar substitution at run time, it can also be done.

Best regards,
Sergey

0 Kudos
Bernard
Valued Contributor I
2,267 Views
Thanks for explaining. You know, when I first saw the word "HAL" I thought that it could be some kind of wrapper library which accesses hardware, with software running on top of it that gets access to the underlying hardware through the library's wrapper functions. All this access would be performed by a custom driver which reads the Device Object and Driver Object data structures, with the hardware accesses implemented in inline assembly.
0 Kudos
SergeyKostrov
Valued Contributor II
2,267 Views
Here is a Test-Case for a run-time substitution:

#include <stdio.h>

int ( *g_pFunction[2] )( int iParam );

int FunctionA( int iParamA );
int FunctionB( int iParamB );

int FunctionA( int iParamA )
{
    printf( "FunctionA called\n" );
    return iParamA;
}

int FunctionB( int iParamB )
{
    printf( "FunctionB called\n" );
    return iParamB;
}

int main( void )
{
    int iCondition = 0;

    g_pFunction[0] = FunctionA;
    g_pFunction[1] = FunctionB;

    if( iCondition == 0 )
        g_pFunction[0]( 0 );  // FunctionA will be executed
    else
        g_pFunction[1]( 1 );

    return 0;
}

0 Kudos
SergeyKostrov
Valued Contributor II
2,267 Views

...
int main( void )
{
    int iCondition = 0;

    g_pFunction[0] = FunctionA;
    g_pFunction[1] = FunctionB;

    if( iCondition == 0 )
        g_pFunction[0]( 0 );  // FunctionA will be executed
    else
        g_pFunction[1]( 1 );

    return 0;
}


Or
...
int main( void )
{
    int iCondition = 0;

    g_pFunction[0] = FunctionA;
    g_pFunction[1] = FunctionB;

    g_pFunction[iCondition]( 0 );  // FunctionA will be executed

    return 0;
}
...

0 Kudos
SergeyKostrov
Valued Contributor II
2,267 Views
Quoting TimP (Intel)
...
I apologize for getting confused about the sequence of posts and reverting back to the rcpps subject where the thread started...


This is a very long and interesting thread. Unfortunately, we're deviating all the time from the main subject...

Best regards,
Sergey

0 Kudos
SergeyKostrov
Valued Contributor II
2,267 Views

Here is a Template-based API substitution technique:

...
#include <stdio.h>

class CApiSet1
{
public:
    CApiSet1(){}
    ~CApiSet1(){}
    void FunctionA(){ printf("CApiSet1::FunctionA\n"); }
    void FunctionB(){ printf("CApiSet1::FunctionB\n"); }
};

class CApiSet2
{
public:
    CApiSet2(){}
    ~CApiSet2(){}
    void FunctionA(){ printf("CApiSet2::FunctionA\n"); }
    void FunctionB(){ printf("CApiSet2::FunctionB\n"); }
};

template < class APISET > void ExecuteApiSet( APISET &as );

template < class APISET > void ExecuteApiSet( APISET &as )
{
    as.FunctionA();
    as.FunctionB();
}

int main( void )
{
    CApiSet1 as1;
    CApiSet2 as2;

    ExecuteApiSet( as1 );
    ExecuteApiSet( as2 );

    return 0;
}
...

Output is as follows:

CApiSet1::FunctionA
CApiSet1::FunctionB
CApiSet2::FunctionA
CApiSet2::FunctionB

0 Kudos
Bernard
Valued Contributor I
2,267 Views

The Intel compilers have default options -no-prec-div -no-prec-sqrt where, in some contexts, rather than using an IEEE accurate hardware instruction (or in some cases library function), there is explicit iteration in the generated code, starting with rcpps or rsqrtps.

Tim!
If I understood properly, the rcpps instruction can be used as a fallback option to calculate, possibly on the fly, Taylor series coefficients.
0 Kudos
Bernard
Valued Contributor I
2,267 Views

This is a very long and interesting thread

This thread is the most viewed thread of the last month. But sadly it has only a few active participants.
0 Kudos
Bernard
Valued Contributor I
2,267 Views

That would be awesome if you share your work

Sergey!
If you are interested in my other classes, I can share my work. I have two Java classes. The first is based on complex numbers and complex functions implemented as objects; this project contains only trigonometric and exponential functions in the complex domain. The second deals with various bit manipulations and is based mostly on the book "Hacker's Delight"; it is implemented as static methods.
0 Kudos
SergeyKostrov
Valued Contributor II
2,267 Views
Here is another set of performance numbers:

Application - ScaLibTestApp - WIN32_MSC
...

Hi Iliya,

I completed another set of tests and I'll post results soon. Two tests for the Intel instruction 'fsin' were added:

one is in ticks and the other is in clock cycles.

Best regards,
Sergey
0 Kudos
Bernard
Valued Contributor I
2,268 Views

I completed another set of tests and I'll post results soon. Two tests for the Intel instruction 'fsin' were added:

one is in ticks and the other is in clock cycles

That would be great.
AFAIK 'fsin' latency is around 40-100 clock cycles, compared to the Intel vectorized library (VML) sin function, which achieves a latency of ~23.90 cycles per element.
0 Kudos
SergeyKostrov
Valued Contributor II
2,268 Views

Set of Tests A:

[cpp]
Normalized Taylor Series 7t Sin(30.0) = 0.4999999918690232200000 - C Macro
Best time: 62 ticks
Normalized Taylor Series 9t Sin(30.0) = 0.5000000000202798900000 - C Macro
Best time: 74 ticks
Normalized Taylor Series 11t Sin(30.0) = 0.4999999999999643100000 - C Macro
Best time: 93 ticks
Normalized Series 7t Sin(30.0) = 0.4999999918690232700000
Best time: 203 ticks
Normalized Chebyshev Polynomial 7t Sin(30.0) = 0.4999999476616694400000
Best time: 203 ticks
Normalized Taylor Series 7t Sin(30.0) = 0.4999999918690232200000
Best time: 203 ticks
Normalized Chebyshev Polynomial 9t Sin(30.0) = 0.4999999997875643800000
Best time: 218 ticks
Normalized Taylor Series 9t Sin(30.0) = 0.5000000000202798900000
Best time: 218 ticks
Normalized Series 9t Sin(30.0) = 0.5000000000202800000000
Best time: 234 ticks
Normalized Series 11t Sin(30.0) = 0.5000000000000000000000
Best time: 265 ticks
Chebyshev Polynomial 7t Sin(30.0) = 0.4999999476616695500000
Best time: 266 ticks
CRT Sin(30.0) = 0.4999999999999999400000 - SSE2 support On
Best time: 281 ticks
Chebyshev Polynomial 9t Sin(30.0) = 0.4999999997875643800000
Best time: 312 ticks
Normalized Taylor Series 23t Sin(30.0) = 0.4999999999999999400000 - FastSinV3 - Optimized
Best time: 343 ticks
Intel instruction FSIN(30.0) = 0.4999999999999999400000 - C Macro
Best time: 359 ticks
Normalized Taylor Series 23t Sin(30.0) = 0.4999999999999999400000 - FastSinV2 - Optimized
Best time: 406 ticks
Normalized Taylor Series 23t Sin(30.0) = 0.4999999999999999400000 - FastSinV1 - Not Optimized
Best time: 453 ticks
CRT Sin(30.0) = 0.4999999999999999400000 - SSE2 support Off
Best time: 484 ticks
[/cpp]


Set of Tests B:

[bash]
Normalized Taylor Series 7t Sin(30.0) = 0.4999999918690232200000 - C Macro
Best time: 24 clock cycles
Normalized Taylor Series 9t Sin(30.0) = 0.5000000000202798900000 - C Macro
Best time: 34 clock cycles
Normalized Taylor Series 11t Sin(30.0) = 0.4999999999999643100000 - C Macro
Best time: 42 clock cycles
Intel instruction FSIN(30.0) = 0.4999999999999999400000 - C Macro
Best time: 140 clock cycles
[/bash]

Technical Details:

- Windows XP 32-bit / Visual Studio 2005
- All compiler optimizations are disabled / Release configuration / Platform Win32
- Double data type ( 53-bit precision )
- Number of calls for every test is 4194304 ( 2^22 )
- Process priority Normal
- All implementations are portable / N/A for cases: CRT 'sin' and Intel 'fsin' instruction
- CRT function 'sin' tested with SSE2 support On ( faster ) and Off ( slower )
- C Macro means it is implemented as a macro ( inline / no call overhead )
- Ordered from Fastest to Slowest
- Values in Ticks measured using GetTickCount Win32 API function
- Values in Clock Cycles measured using RDTSC instruction

0 Kudos
SergeyKostrov
Valued Contributor II
2,268 Views
>>...
>> Intel instruction FSIN(30.0) = 0.4999999999999999400000 - C Macro
>> Best time: 140 clock cycles
>>...


A note regarding 140 clock cycles for the Intel 'FSIN' instruction. The result of my test is very
consistent with what I found in Intel documentation. Here are details from the Intel 64 and IA-32 Architectures
Optimization Reference Manual
( Nov 2007 edition ). Topic: Instruction Latency and Throughput

Summary for the 'FSIN' instruction from Tables C-11 and C-11a, Pages C-25 and C-26:

...
Latency for a CPU with CPUID 0F_2H : 160 - 180
Latency for a CPU with CPUID 0F_3H : 160 - 200
Latency for a CPU with CPUID 06_17H: 82
Latency for a CPU with CPUID 06_0EH: 119
Latency for a CPU with CPUID 06_0DH: 119
...
Throughput for a CPU with CPUID 0F_2H : 130
Throughput for a CPU with CPUID 06_0EH: 116
Throughput for a CPU with CPUID 06_0DH: 116
...

Also, Footnote #4 says:

...
4. Latency and Throughput of transcendental instructions can vary substantially in a
dynamic execution environment. Only an approximate value or a range of values
are given for these instructions.
...

0 Kudos
Bernard
Valued Contributor I
2,268 Views
Normalized Taylor Series 23t Sin(30.0) = 0.4999999999999999400000 - FastSinV1 - Not Optimized
Best time: 453 ticks


Sergey, try to pass a random value as an argument to your functions.
I see that fastsin() is the slowest function, probably because of the 11 terms involved in the calculation.
The Chebyshev polynomial approximation suffers from the lowest accuracy.
0 Kudos
Bernard
Valued Contributor I
2,268 Views

4. Latency and Throughput of transcendental instructions can vary substantially in a
dynamic execution environment. Only an approximate value or a range of values
are given for these instructions.


It can also depend on the value passed as an argument to 'fsin'.
0 Kudos
SergeyKostrov
Valued Contributor II
2,268 Views
Quoting iliyapolak

4. Latency and Throughput of transcendental instructions can vary substantially in a
dynamic execution environment. Only an approximate value or a range of values
are given for these instructions.


It can also depend on the value passed as an argument to 'fsin'.


That's a good idea to verify some time later. Thank you, Iliya!

0 Kudos
Bernard
Valued Contributor I
2,268 Views

That's a good idea to verify some time later

While testing 'fsin', try passing 2^22 random values to the instruction.
0 Kudos
SergeyKostrov
Valued Contributor II
2,268 Views
Quoting iliyapolak
...Compared to Intel vectorized Library VML sin function which achieved latency of ~23.90 cycles per element.


Hi Iliya,

I'd like to verify our sources of data... Where did you see that number? I've found a link:

http://software.intel.com/sites/products/documentation/hpc/mkl/vml/functions/_performanceall.html

and I don't see the 23.90 number for the sine function ( real ). Here is a screenshot:

0 Kudos
SergeyKostrov
Valued Contributor II
2,268 Views
...
Normalized Taylor Series 7t Sin(30.0) = 0.4999999918690232200000 - C Macro
Best time: 24 clock cycles
...

[SergeyK] I'd like to note that it is just a two-line C macro and there is no range reduction.


Here is your comment from Post #278:

>>...Compared to Intel vectorized Library VML sin function which achieved latency of ~23.90 cycles per element...

I think it is unfair to compare highly optimized trigonometric functions that use the latest sets of Intel instructions ( SSE4 / AVX / etc )
with highly portable C implementations of trigonometric functions.

0 Kudos
Bernard
Valued Contributor I
2,268 Views
@Sergey Sorry, I did not remember the correct value when I wrote my post. The correct value is 20.96 cycles per element for the HA double argument. Btw, my source is the same as yours.
0 Kudos
Bernard
Valued Contributor I
2,268 Views


Here is your comment from Post #278:

>>...Compared to Intel vectorized Library VML sin function which achieved latency of ~23.90 cycles per element...

I think it is unfair to compare highly optimized trigonometric functions


Completely agree with you.
0 Kudos
Reply