Intel® ISA Extensions
Use hardware-based isolation and memory encryption to provide more code protection in your solutions.
1135 讨论

Optimization of sine function's taylor expansion

Bernard
重要分销商 I
37,667 次查看

Hello!
While coding in assembly various series expansions of many functions i tried to optimize my code mainly by using rcpps instruction instead of divaps.When performing tests oncode which calculates sine function by taylor series i did a few measurements as explained here :"http://software.intel.com/en-us/forums/showthread.php?t=52482" and the result was between 10-12 cycles per first term of sine expansion(i used 14 terms).
I would like to ask you how can i rewrite this code in order to gain speed of execution improvment.
[bash]movups xmm0,argument movups xmm1,argument mulps xmm1,xmm1 mulps xmm1,xmm0 mov ebx,OFFSET coef movups xmm2,[ebx] rcpps xmm3,xmm2 ;rcpps used instead of divaps mulps xmm1,xmm3 subps xmm0,xmm1[/bash]

0 项奖励
1 解答
bronxzv
新分销商 II
37,476 次查看

calls 1e6 times fastsin() the result in millisecond is 63

so it's 63 ns per iteration or ~ 120 clocks on your CPU, it does't match your previous reports IIRC

if you keep only the polynomial (get rid of the strange domain check) you should begin to see timings nearer than mine

在原帖中查看解决方案

0 项奖励
342 回复数
SergeyKostrov
重要分销商 II
2,271 次查看
Quoting iliyapolak
Sergey Could you explain how is implemented in software Run-Time HAL.I thin that the possible implementation would be to wrap inline assembly in custom library functions.


Hi Iliya,

There is atechnique in C/C++ and I would call it as a 'macro substitution'. Actually, it is as old as C language andhere is
a small example:

int FunctionA( int iParamA );
int FunctionB( int iParamB );

int FunctionA( int iParamA )
{
printf( "FunctionA called\n" );
}

int FunctionB( int iParamB )
{
printf( "FunctionB called\n" );
}

#define _USE_FUNCTION_A
//#define _USE_FUNCTION_B

#ifdef _USE_FUNCTION_A
#define Function FunctionA
#endif
#ifdef _USE_FUNCTION_B // I always use explicit #ifdefs and Inever use #ifdef-#else-#endif
#define Function FunctionB
#endif

void main( void )
{
Function( 1 );
}

In essence, HRT AL in the ScaLib library is based on that.Remember, that this is a compile time technique. If you need
a similar substitution at arun-time it could be alsodone.

Best regards,
Sergey

0 项奖励
Bernard
重要分销商 I
2,271 次查看
Thanks for explaining. You know when I saw first time the word "HAL" I thought that this could be some kind of wrapper library which accesses hardware and on top of this library runs software which can get access to the underlying hardwars by library wrapper functions.And all this is access is performed by custom driver which reads Device Object and Driver Object data structures and hardware accesses are implemnted inn inline assembly.
0 项奖励
SergeyKostrov
重要分销商 II
2,271 次查看
Here is a Test-Case for a run-time substitution:

int ( *g_pFunction[2] )( int iParam );

int FunctionA( int iParamA );
int FunctionB( int iParamB );

int FunctionA( int iParamA )
{
printf( "FunctionA called\n" );
}

int FunctionB( int iParamB )
{
printf( "FunctionB called\n" );
}

void main( void )
{
int iCondition = 0;

g_pFunction[0] = FunctionA;
g_pFunction[1] = FunctionB;

if( iCondition == 0)
g_pFunction[0]( 0 ); // FunctionA will be executed
else
g_pFunction[1]( 1 );
}

0 项奖励
SergeyKostrov
重要分销商 II
2,271 次查看

...
void main( void )
{
int iCondition = 0;

g_pFunction[0] = FunctionA;
g_pFunction[1] = FunctionB;

if( iCondition == 0)
g_pFunction[0]( 0 ); // FunctionA will be executed
else
g_pFunction[1]( 1 );
}


Or
...
void main( void )
{
int iCondition = 0;

g_pFunction[0] = FunctionA;
g_pFunction[1] = FunctionB;

g_pFunction[iCondition]( 0 ); // FunctionA will be executed
}
...

0 项奖励
SergeyKostrov
重要分销商 II
2,271 次查看
Quoting TimP (Intel)
...
I apologize for getting confused about the sequence of posts and reverting back to the rcpps subject where the thread started...


This is a very long and interesting thread. Unfortunately, we're deviating all the time from the main subject...

Best regards,
Sergey

0 项奖励
SergeyKostrov
重要分销商 II
2,271 次查看

Here is a Template based API substitution technique:

...
class CApiSet1
{
public:
CApiSet1(){};
~CApiSet1(){};
void FunctionA(){ printf("CApiSet1::FunctionA\n"); };
void FunctionB(){ printf("CApiSet1::FunctionB\n"); };
};

class CApiSet2
{
public:
CApiSet2(){};
~CApiSet2(){};
void FunctionA(){ printf("CApiSet2::FunctionA\n"); };
void FunctionB(){ printf("CApiSet2::FunctionB\n"); };
};

template < class APISET > void ExecuteApiSet( APISET &as );

template < class APISET > void ExecuteApiSet( APISET &as )
{
as.FunctionA();
as.FunctionB();
};

void main( void )
{
CApiSet1 as1;
CApiSet2 as2;

ExecuteApiSet( as1 );
ExecuteApiSet( as2 );
}
...

Output is as follows:

CApiSet1::FunctionA
CApiSet1::FunctionB
CApiSet2::FunctionA
CApiSet2::FunctionB

0 项奖励
Bernard
重要分销商 I
2,271 次查看

The Intel compilers have default options -no-prec-div -no-prec-sqrt where, in some contexts, rather than using an IEEE accurate hardware instruction (or in some cases library function), there is explicit iteration in the generated code, starting with rcpps or rsqrtps.

Tim!
As I understood properly rcpps instruction can be used as a some fallback optionsto calculate possibly on the fly Taylor Series coefficents.
0 项奖励
Bernard
重要分销商 I
2,271 次查看

This is a very long and interesting thread

This thread is most viewed thread for the last month.Butsadly it has only afew active participants.
0 项奖励
Bernard
重要分销商 I
2,271 次查看

That would be awesome if you share your work

Sergey!
If you are interested in my other classes.Ican share my work.I have twoJava classes.First class is based on Complex numbers and Complex functions implemented as a objects. This project contains only trigonometric and exponential functions in complex domain.Second classdeals with various bit-manipulations and is based mostlyon the book "Hacker's Delight" class is implemented as a static methods.
0 项奖励
SergeyKostrov
重要分销商 II
2,271 次查看
Here is another set of performance numbers:

Application - ScaLibTestApp - WIN32_MSC
...

Hi Iliya,

I completed another set of tests and I'll post results soon. Two tests for Intel instruction 'fsin' added:

one is in ticks and another one isin clock cycles.

Best regards,
Sergey
0 项奖励
Bernard
重要分销商 I
2,272 次查看

I completed another set of tests and I'll post results soon. Two tests for Intel instruction 'fsin' added:

one is in ticks and another one isin clock cycles

That's would be great.
Afaik 'fsin' latency is around 40-100 cpi.Compared to Intel vectorized Library VML sin function which achieved latency of ~23.90 cycles per element.
0 项奖励
SergeyKostrov
重要分销商 II
2,272 次查看

Set of Tests A:

[cpp] Normalized Taylor Series 7t Sin(30.0) = 0.4999999918690232200000 - C Macro Best time: 62 ticks Normalized Taylor Series 9t Sin(30.0) = 0.5000000000202798900000 - C Macro Best time: 74 ticks Normalized Taylor Series 11t Sin(30.0) = 0.4999999999999643100000 - C Macro Best time: 93 ticks Normalized Series 7t Sin(30.0) = 0.4999999918690232700000 Best time: 203 ticks Normalized Chebyshev Polynomial 7t Sin(30.0) = 0.4999999476616694400000 Best time: 203 ticks Normalized Taylor Series 7t Sin(30.0) = 0.4999999918690232200000 Best time: 203 ticks Normalized Chebyshev Polynomial 9t Sin(30.0) = 0.4999999997875643800000 Best time: 218 ticks Normalized Taylor Series 9t Sin(30.0) = 0.5000000000202798900000 Best time: 218 ticks Normalized Series 9t Sin(30.0) = 0.5000000000202800000000 Best time: 234 ticks Normalized Series 11t Sin(30.0) = 0.5000000000000000000000 Best time: 265 ticks Chebyshev Polynomial 7t Sin(30.0) = 0.4999999476616695500000 Best time: 266 ticks CRT Sin(30.0) = 0.4999999999999999400000 - SSE2 support On Best time: 281 ticks Chebyshev Polynomial 9t Sin(30.0) = 0.4999999997875643800000 Best time: 312 ticks Normalized Taylor Series 23t Sin(30.0) = 0.4999999999999999400000 - FastSinV3 - Optimized Best time: 343 ticks Intel instruction FSIN(30.0) = 0.4999999999999999400000 - C Macro Best time: 359 ticks Normalized Taylor Series 23t Sin(30.0) = 0.4999999999999999400000 - FastSinV2 - Optimized Best time: 406 ticks Normalized Taylor Series 23t Sin(30.0) = 0.4999999999999999400000 - FastSinV1 - Not Optimized Best time: 453 ticks CRT Sin(30.0) = 0.4999999999999999400000 - SSE2 support Off Best time: 484 ticks [/cpp]


Set of Tests B:

[bash] Normalized Taylor Series 7t Sin(30.0) = 0.4999999918690232200000 - C Macro Best time: 24 clock cycles Normalized Taylor Series 9t Sin(30.0) = 0.5000000000202798900000 - C Macro Best time: 34 clock cycles Normalized Taylor Series 11t Sin(30.0 )= 0.4999999999999643100000 - C Macro Best time: 42 clock cycles Intel instruction FSIN(30.0 )= 0.4999999999999999400000 - C Macro Best time: 140 clock cycles[/bash]

Technical Details:

- Windows XP 32-bit / Visual Studio 2005
- All compiler optimizations are disabled / Release configuration / Platform Win32
- Double data type ( 53-bit precision )
- Number of calls for every test is 4194304 ( 2^22 )
- Process priority Normal
- All implementations are portable / N/A for cases: CRT 'sin' and Intel 'fsin' instruction
- CRT function 'sin' tested with SSE2 support On ( faster ) and Off ( slower )
- C Macro means it is implemented as a macro ( inline / no call overhead )
- Ordered from Fastest to Slowest
- Values in Ticks measured using GetTickCount Win32 API function
- Values in Clock Cycles measured using RDTSC instruction

0 项奖励
SergeyKostrov
重要分销商 II
2,272 次查看
>>...
>> Intel instruction FSIN(30.0 )= 0.4999999999999999400000 - C Macro
>> Best time: 140 clock cycles
>>...


A note regarding 140 clock sycles for Intel 'FSIN' instruction. The result of my test is very
consistent with what I found in Intel documentation. Here are details from Intel 64 and IA-32 Architectures
Optimization Reference Manual
( Nov 2007 edition ). Topic: Instruction Latency and Throughput

Summary for 'FSIN' instruction from Tables C-11 and C-11a from Pages C-25 and C-26:

...
Latency for a CPU with CPUID 0F_2H : 160 - 180
Latency for a CPU with CPUID 0F_3H : 160 - 200
Latency for a CPU with CPUID 06_17H: 82
Latency for a CPU with CPUID 06_0EH: 119
Latency for a CPU with CPUID 06_0DH: 119
...
Throughput for a CPU with CPUID 0F_2H : 130
Throughput for a CPU with CPUID 06_0EH: 116
Throughput for a CPU with CPUID 06_0DH: 116
...

Also, a Footnote #4 says:

...
4. Latency and Throughput of transcendental instructions can vary substantially in a
dynamic execution environment. Only an approximate value or a range of values
are given for these instructions.
...

0 项奖励
Bernard
重要分销商 I
2,272 次查看
  • NormalizedTaylorSeries23tSin(30.0)=0.4999999999999999400000-FastSinV1-NotOptimized
  • Besttime:453ticks

  • Sergey try to pass a random value as an argument to your functions.
    I see that fastsin() is the slowest function probably because of 11 terms involved in calculation.
    Chebyshev polynomial approximation suffers from the lowest accuracy.
    0 项奖励
    Bernard
    重要分销商 I
    2,272 次查看

    4. Latency and Throughput of transcendental instructions can vary substantially in a
    dynamic execution environment. Only an approximate value or a range of values
    are given for these instructions


    It can also depend on the value passed as an argument to 'fsin'.
    0 项奖励
    SergeyKostrov
    重要分销商 II
    2,272 次查看
    Quoting iliyapolak

    4. Latency and Throughput of transcendental instructions can vary substantially in a
    dynamic execution environment. Only an approximate value or a range of values
    are given for these instructions


    It can also depend on the value passed as an argument to 'fsin'.


    That's a good idea to verifysome time later. Thank you, Iliya!

    0 项奖励
    Bernard
    重要分销商 I
    2,272 次查看

    That's a good idea to verifysome time later


    While testing 'fsin' try to pass 2^22 random values to instruction.
    0 项奖励
    SergeyKostrov
    重要分销商 II
    2,272 次查看
    Quoting iliyapolak
    ...Compared to Intel vectorized Library VML sin function which achieved latency of ~23.90 cycles per element.


    Hi Iliya,

    I'd like to verify our sources of data... Where did youseethat number? I've found a link:

    http://software.intel.com/sites/products/documentation/hpc/mkl/vml/functions/_performanceall.html

    and I don't seethe23.90 number for the sinefunction ( real ). Here is a screenshot:

    0 项奖励
    SergeyKostrov
    重要分销商 II
    2,272 次查看
    ...
    NormalizedTaylorSeries7tSin(30.0)=0.4999999918690232200000-CMacro
    Besttime:24clockcycles
    ...

    [SergeyK] I'd like to note thatit is just twocode lines C macroandthere is noa range reduction.


    Here is your comment from aPost #278:

    >>...Compared to Intel vectorized Library VML sin function which achieved latency of ~23.90 cycles per element...

    I think it is unfair to compare highly optimized trigonometric functions that use latest sets of Intel instructions ( SSE4 / AVX / etc )
    withhighly portable C implementations of trigonometric functions.

    0 项奖励
    Bernard
    重要分销商 I
    2,272 次查看
    @Sergey Sorry I did not remember the correct value when I have written my post. The correct value is 20.96 cycles per element for HA double argument. Btw my source is the same as yours.
    0 项奖励
    Bernard
    重要分销商 I
    2,272 次查看


    Here is your comment from aPost #278:

    >>...Compared to Intel vectorized Library VML sin function which achieved latency of ~23.90 cycles per element...

    I think it is unfair to compare highly optimized trigonometric functions


    Completely agree with you.
    0 项奖励
    回复