- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hello!
While coding in assembly various series expansions of many functions i tried to optimize my code mainly by using rcpps instruction instead of divaps.When performing tests oncode which calculates sine function by taylor series i did a few measurements as explained here :"http://software.intel.com/en-us/forums/showthread.php?t=52482" and the result was between 10-12 cycles per first term of sine expansion(i used 14 terms).
I would like to ask you how can i rewrite this code in order to gain speed of execution improvment.
[bash]movups xmm0,argument movups xmm1,argument mulps xmm1,xmm1 mulps xmm1,xmm0 mov ebx,OFFSET coef movups xmm2,[ebx] rcpps xmm3,xmm2 ;rcpps used instead of divaps mulps xmm1,xmm3 subps xmm0,xmm1[/bash]
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
so it's 63 ns per iteration or ~ 120 clocks on your CPU, it does't match your previous reports IIRCcalls 1e6 times fastsin() the result in millisecond is 63
if you keep only the polynomial (get rid of the strange domain check) you should begin to see timings nearer than mine
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi Iliya,
There is atechnique in C/C++ and I would call it as a 'macro substitution'. Actually, it is as old as C language andhere is
a small example:
int FunctionA( int iParamA );
int FunctionB( int iParamB );
int FunctionA( int iParamA )
{
printf( "FunctionA called\n" );
}
int FunctionB( int iParamB )
{
printf( "FunctionB called\n" );
}
#define _USE_FUNCTION_A
//#define _USE_FUNCTION_B
#ifdef _USE_FUNCTION_A
#define Function FunctionA
#endif
#ifdef _USE_FUNCTION_B // I always use explicit #ifdefs and Inever use #ifdef-#else-#endif
#define Function FunctionB
#endif
void main( void )
{
Function( 1 );
}
In essence, HRT AL in the ScaLib library is based on that.Remember, that this is a compile time technique. If you need
a similar substitution at arun-time it could be alsodone.
Best regards,
Sergey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
int ( *g_pFunction[2] )( int iParam );
int FunctionA( int iParamA );
int FunctionB( int iParamB );
int FunctionA( int iParamA )
{
printf( "FunctionA called\n" );
}
int FunctionB( int iParamB )
{
printf( "FunctionB called\n" );
}
void main( void )
{
int iCondition = 0;
g_pFunction[0] = FunctionA;
g_pFunction[1] = FunctionB;
if( iCondition == 0)
g_pFunction[0]( 0 ); // FunctionA will be executed
else
g_pFunction[1]( 1 );
}
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
...
void main( void )
{
int iCondition = 0;
g_pFunction[0] = FunctionA;
g_pFunction[1] = FunctionB;
if( iCondition == 0)
g_pFunction[0]( 0 ); // FunctionA will be executed
else
g_pFunction[1]( 1 );
}
Or
...
void main( void )
{
int iCondition = 0;
g_pFunction[0] = FunctionA;
g_pFunction[1] = FunctionB;
g_pFunction[iCondition]( 0 ); // FunctionA will be executed
}
...
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I apologize for getting confused about the sequence of posts and reverting back to the rcpps subject where the thread started...
This is a very long and interesting thread. Unfortunately, we're deviating all the time from the main subject...
Best regards,
Sergey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Here is a Template based API substitution technique:
...
class CApiSet1
{
public:
CApiSet1(){};
~CApiSet1(){};
void FunctionA(){ printf("CApiSet1::FunctionA\n"); };
void FunctionB(){ printf("CApiSet1::FunctionB\n"); };
};
class CApiSet2
{
public:
CApiSet2(){};
~CApiSet2(){};
void FunctionA(){ printf("CApiSet2::FunctionA\n"); };
void FunctionB(){ printf("CApiSet2::FunctionB\n"); };
};
template < class APISET > void ExecuteApiSet( APISET &as );
template < class APISET > void ExecuteApiSet( APISET &as )
{
as.FunctionA();
as.FunctionB();
};
void main( void )
{
CApiSet1 as1;
CApiSet2 as2;
ExecuteApiSet( as1 );
ExecuteApiSet( as2 );
}
...
Output is as follows:
CApiSet1::FunctionA
CApiSet1::FunctionB
CApiSet2::FunctionA
CApiSet2::FunctionB
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Tim!The Intel compilers have default options -no-prec-div -no-prec-sqrt where, in some contexts, rather than using an IEEE accurate hardware instruction (or in some cases library function), there is explicit iteration in the generated code, starting with rcpps or rsqrtps.
As I understood properly rcpps instruction can be used as a some fallback optionsto calculate possibly on the fly Taylor Series coefficents.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
This thread is most viewed thread for the last month.Butsadly it has only afew active participants.This is a very long and interesting thread
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Sergey!That would be awesome if you share your work
If you are interested in my other classes.Ican share my work.I have twoJava classes.First class is based on Complex numbers and Complex functions implemented as a objects. This project contains only trigonometric and exponential functions in complex domain.Second classdeals with various bit-manipulations and is based mostlyon the book "Hacker's Delight" class is implemented as a static methods.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Application - ScaLibTestApp - WIN32_MSC
...
Hi Iliya,
I completed another set of tests and I'll post results soon. Two tests for Intel instruction 'fsin' added:
one is in ticks and another one isin clock cycles.
Best regards,
Sergey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
That's would be great.I completed another set of tests and I'll post results soon. Two tests for Intel instruction 'fsin' added:
one is in ticks and another one isin clock cycles
Afaik 'fsin' latency is around 40-100 cpi.Compared to Intel vectorized Library VML sin function which achieved latency of ~23.90 cycles per element.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Set of Tests A:
[cpp] Normalized Taylor Series 7t Sin(30.0) = 0.4999999918690232200000 - C Macro Best time: 62 ticks Normalized Taylor Series 9t Sin(30.0) = 0.5000000000202798900000 - C Macro Best time: 74 ticks Normalized Taylor Series 11t Sin(30.0) = 0.4999999999999643100000 - C Macro Best time: 93 ticks Normalized Series 7t Sin(30.0) = 0.4999999918690232700000 Best time: 203 ticks Normalized Chebyshev Polynomial 7t Sin(30.0) = 0.4999999476616694400000 Best time: 203 ticks Normalized Taylor Series 7t Sin(30.0) = 0.4999999918690232200000 Best time: 203 ticks Normalized Chebyshev Polynomial 9t Sin(30.0) = 0.4999999997875643800000 Best time: 218 ticks Normalized Taylor Series 9t Sin(30.0) = 0.5000000000202798900000 Best time: 218 ticks Normalized Series 9t Sin(30.0) = 0.5000000000202800000000 Best time: 234 ticks Normalized Series 11t Sin(30.0) = 0.5000000000000000000000 Best time: 265 ticks Chebyshev Polynomial 7t Sin(30.0) = 0.4999999476616695500000 Best time: 266 ticks CRT Sin(30.0) = 0.4999999999999999400000 - SSE2 support On Best time: 281 ticks Chebyshev Polynomial 9t Sin(30.0) = 0.4999999997875643800000 Best time: 312 ticks Normalized Taylor Series 23t Sin(30.0) = 0.4999999999999999400000 - FastSinV3 - Optimized Best time: 343 ticks Intel instruction FSIN(30.0) = 0.4999999999999999400000 - C Macro Best time: 359 ticks Normalized Taylor Series 23t Sin(30.0) = 0.4999999999999999400000 - FastSinV2 - Optimized Best time: 406 ticks Normalized Taylor Series 23t Sin(30.0) = 0.4999999999999999400000 - FastSinV1 - Not Optimized Best time: 453 ticks CRT Sin(30.0) = 0.4999999999999999400000 - SSE2 support Off Best time: 484 ticks [/cpp]
Set of Tests B:
Technical Details:
- Windows XP 32-bit / Visual Studio 2005
- All compiler optimizations are disabled / Release configuration / Platform Win32
- Double data type ( 53-bit precision )
- Number of calls for every test is 4194304 ( 2^22 )
- Process priority Normal
- All implementations are portable / N/A for cases: CRT 'sin' and Intel 'fsin' instruction
- CRT function 'sin' tested with SSE2 support On ( faster ) and Off ( slower )
- C Macro means it is implemented as a macro ( inline / no call overhead )
- Ordered from Fastest to Slowest
- Values in Ticks measured using GetTickCount Win32 API function
- Values in Clock Cycles measured using RDTSC instruction
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
>> Intel instruction FSIN(30.0 )= 0.4999999999999999400000 - C Macro
>> Best time: 140 clock cycles
>>...
A note regarding 140 clock sycles for Intel 'FSIN' instruction. The result of my test is very
consistent with what I found in Intel documentation. Here are details from Intel 64 and IA-32 Architectures
Optimization Reference Manual ( Nov 2007 edition ). Topic: Instruction Latency and Throughput
Summary for 'FSIN' instruction from Tables C-11 and C-11a from Pages C-25 and C-26:
...
Latency for a CPU with CPUID 0F_2H : 160 - 180
Latency for a CPU with CPUID 0F_3H : 160 - 200
Latency for a CPU with CPUID 06_17H: 82
Latency for a CPU with CPUID 06_0EH: 119
Latency for a CPU with CPUID 06_0DH: 119
...
Throughput for a CPU with CPUID 0F_2H : 130
Throughput for a CPU with CPUID 06_0EH: 116
Throughput for a CPU with CPUID 06_0DH: 116
...
Also, a Footnote #4 says:
...
4. Latency and Throughput of transcendental instructions can vary substantially in a
dynamic execution environment. Only an approximate value or a range of values
are given for these instructions.
...
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
NormalizedTaylorSeries23tSin(30.0)=0.4999999999999999400000-FastSinV1-NotOptimized Besttime:453ticks
Sergey try to pass a random value as an argument to your functions.
I see that fastsin() is the slowest function probably because of 11 terms involved in calculation.
Chebyshev polynomial approximation suffers from the lowest accuracy.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
4. Latency and Throughput of transcendental instructions can vary substantially in a
dynamic execution environment. Only an approximate value or a range of values
are given for these instructions
It can also depend on the value passed as an argument to 'fsin'.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
4. Latency and Throughput of transcendental instructions can vary substantially in a
dynamic execution environment. Only an approximate value or a range of values
are given for these instructions
It can also depend on the value passed as an argument to 'fsin'.
That's a good idea to verifysome time later. Thank you, Iliya!
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
That's a good idea to verifysome time later
While testing 'fsin' try to pass 2^22 random values to instruction.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi Iliya,
I'd like to verify our sources of data... Where did youseethat number? I've found a link:
http://software.intel.com/sites/products/documentation/hpc/mkl/vml/functions/_performanceall.html
and I don't seethe23.90 number for the sinefunction ( real ). Here is a screenshot:
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
NormalizedTaylorSeries7tSin(30.0)=0.4999999918690232200000-CMacro
Besttime:24clockcycles
...
[SergeyK] I'd like to note thatit is just twocode lines C macroandthere is noa range reduction.
Here is your comment from aPost #278:
>>...Compared to Intel vectorized Library VML sin function which achieved latency of ~23.90 cycles per element...
I think it is unfair to compare highly optimized trigonometric functions that use latest sets of Intel instructions ( SSE4 / AVX / etc )
withhighly portable C implementations of trigonometric functions.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Here is your comment from aPost #278:
>>...Compared to Intel vectorized Library VML sin function which achieved latency of ~23.90 cycles per element...
I think it is unfair to compare highly optimized trigonometric functions
Completely agree with you.

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page