vectorization of sin/cos results in wrong values - Page 3

evitan · ‎02-08-2012

if sin is autovectorized there seems to be a signbit issue.
For example the 2nd phase (from PIHALF to PI just flips the sign.)
Here is the code:
__declspec(align(16)) float input[12] = {0.f,0.5f,1.f,1.5f,2.f,2.5f,3.0f,3.5f,4.f,4.5f,5.f,5.5f};
__declspec(align(16)) float output[12] ;

for(int i = 0 ;i < 12 ; i++)
{
output = sin(input);
}

for(int i = 0 ;i < 12 ; i++)
{
char txt[32];
sprintf(txt,"%i %f %f\\n",i,input,output);
OutputDebugString(txt);
}

this outputs:
0 0.000000 0.000000
1 0.500000 0.479426
2 1.000000 0.841471
3 1.500000 0.997495
4 2.000000 -0.909297 5 2.500000 -0.598472
6 3.000000 -0.141120
7 3.500000 -0.3507838 4.000000 -0.756802
9 4.500000 -0.977530
10 5.000000 0.95892411 5.500000 0.705540

cos seems ok on 32 bit but also has a sign problem in 64 bit.
Compiler XE 12.1.2.278 on Windows with VS2008.
Same error when using double.
If i turn off autovectorization everything is fine.
Since the performance difference is huge between vectorized and nonvectorized version this is a big issue.

Evitan

SergeyKostrov · ‎07-26-2012

Quoting Sergey Kostrov

...I've done a couple of simpleverifications for a test-case used in the article (sin( 10^22 * OneRadian ) )...

A short follow up... Here are results forMicrosoft C++ compiler ( Visual Studio 2005 ):

Expected Value:
Sine of 10^22 radians: -0.85220084976718880059522602157523

Calculated Values:
Sine of 10^22 radians: 0.41214336710708466000000000000000 - Range Reduction: No
Sine of 10^22 radians: -0.85220084976718891000000000000000 - Range Reduction: Yes

Bernard · ‎07-26-2012

Windows Calculator was used:a top result iswithout a range reduction and bottom results is with a "simulated" range reduction

Hi Sergey!
How were you able to disable a range reduction in Windows Calculator?
I need to check Calc.exe imports , but it probably calls one of MSVCRT.DLL sin() functions which in turn calls x87 fsin.So disabling a range reduction is not possible because fsin performs in range reduction in the hardware.By disabling a range-reduction I mean donotmap large periodic radianvaluesto the range [-Pi/4,Pi/4] which is suitable for the accurate calculation of the sine function on the digital computer.

Out of curiosity I tested my own sine implementation "fastsin()" which does not have any range reduction algorithms.
From pure mathematical point of view the Taylor expansion of the sin function is infinitely convergeable, but when executed on the digital computer without any range reduction algorithm the result returned by the function is a disaster.

Here is the result of fastsin() called with an argument of10^22 radians.

Range reduction is not implemented.I mean mapping of large periodic values of radian argument to the range suitable for the calculation [-Pi/4,Pi/4].

radian_arg 10000000000000000000000.000000000000000000000000
fastsin called with an argument of(10000000000000000000000.000000) radians, the
result is -1.#INF00000000000000000000

Here is the result of fastsin() called with an argument of 10^2 radians.

radian_arg 100.000000000000000000000000
fastsin called with an argument of(100.000000) radians, the result is -36803877
3856910700000000.000000000000000000000000

Here is the result of fastsin() called with an argument of 8.0 radians.

radian_arg 8.000000000000000000000000
fastsin called with an argument of( 8.000000) radians, the result is 0.9871283
39487625790000000 << At 8.0 radians the result starts to diverge from the actual value.
Correct valueis 0.98935824662338177780812359824529

Here is the result of fastsin() called with an argument of 9.0 radians.

radian_arg 9.000000000000000000000000
fastsin called with an argument of( 9.000000) radians, the result is 0.3706866
03602516480000000 << At 9.0 radians the error grows.
Correctrange-reduced value is0.41211848524175656975627256635244.

Bernard · ‎07-26-2012

Here are results from Windows Calculatorfor sin( 10^1024 * OneRadian

How were you able to calculate 'sin' for such a large argument?
When I tried to dothis on my pc Calc.exe hangs.

Bernard · ‎07-27-2012

Calculated Values:
Sine of 10^22 radians: 0.41214336710708466000000000000000 - Range Reduction: No

Is this possible to disable range reduction of the periodic functions in VS C++compiler.

TimP · ‎07-27-2012

I suppose you can perform range reduction yourself before calling the library function, so as to skip the library range reduction (if you know enough about its working). Besides reducing range to |x| < Pi/4, library implementations likely use some kind of table of reductions to much smaller range segments.
I suppose one of the reasons why libraries hide such details of their implementation is to avoid encouraging end users to make applications which depend on the details. You could profile or disassemble to see whether the library function uses fsin or fsincos so as to be able to run without SSE2, or to see whether /arch:SSE2 chooses a different function from the library (if you are concerned with the 32-bit compiler).

SergeyKostrov · ‎07-27-2012

Quoting iliyapolak

Here are results from Windows Calculatorfor sin( 10^1024 * OneRadian
How were you able to calculate 'sin' for such a large argument?

Hi Iliya, Here are details on how I've tested Windows Calculator:

Task : Calculate sin( 10^1024 * OneRadian ) Select 'Degrees' radio button

Test 1: Without arange reduction

1 8 0 / pi = MS 1 0 x^y 1 0 2 4 = * MR = sin
Result = -0.98684945250044044483447009247958

Test 2: With arange reduction

1 8 0 / pi = MS 1 0 x^y 1 0 2 4 = * MR = Mod 3 6 0 = sin
Result = -0.98684945250044044483447009273862

Notes:
MS - store the displayed number
MR - recall the stored number

SergeyKostrov · ‎07-27-2012

This is an example that demostrates limits of Windows Calculator:

Task : Calculate sin( 10^1025 * OneRadian ) Select 'Degrees' radio button

Test 1: Without arange reduction

1 8 0 / pi = MS 1 0 x^y 1 0 2 5 = * MR = sin
Result = 1

Test 2: With arange reduction

1 8 0 / pi = MS 1 0 x^y 1 0 2 5 = * MR = Mod 3 6 0 = sin
Result = -0.9986091762783810016791764873118

As you can see there is a significant drop in accuracy in Test 1 without a range reduction.

Notes:
MS - store the displayed number
MR - recall the stored number

SergeyKostrov · ‎07-27-2012

Quoting iliyapolak

Here are results from Windows Calculatorfor sin( 10^1024 * OneRadian

How were you able to calculate 'sin' for such a large argument?
When I tried to dothis on my pc Calc.exe hangs.

What Windows OS did you use? Here is a screenshot with a version of Windows Calculator I used:

I've done a couple of more tests and here short results:

sin( 10^1024 * OneRadian ) - Calculated
sin( 10^1025 * OneRadian ) - Calculated ( with error )
sin( 10^1026 * OneRadian ) - Couldn't Calculate ( hangs )
sin( 10^2048 * OneRadian ) - Couldn't Calculate( hangs )
sin( 10^4096 * OneRadian ) - Couldn't Calculate( hangs )
sin( 10^8192 * OneRadian ) - Couldn't Calculate( hangs )

Here is a screenshot:

I pressed 'Continue' and it hangs, some time later the message is displayed again... I pressed 'Continue' and
it hangs... I repeatedthat 5 times and then"killed" the application.

SergeyKostrov · ‎07-27-2012

Quoting iliyapolak

Calculated Values:
Sine of 10^22 radians: 0.41214336710708466000000000000000 - Range Reduction: No

Is this possible to disable range reduction of the periodic functions in VS C++compiler.

I've spent some time today investigating it and my answer is No. However, I've discovered a very interesting
feature ofCRTmath functions from Microsoft:

- There is a CRT function '_set_SSE2_enable' and it allows to enable or disable SSE2 support in math functions

- By default SSE2 support is in enabled state and, for example,'sin' function uses SSE2 instructions to do all calculations;
I'd like to stress, that it doesn't use Intel instruction 'fsin' in that case;

-If SSE2 support is indisabled state then, for example, 'sin' function doesn't use SSE2 instructions to do all calculations;
I'd like to stress, that it uses Intel instruction 'fsin' in that case;

- In both cases there is a range reduction for input argument

Notes:
- Call _set_SSE2_enable( 1 )to enable SSE2 support
- Call _set_SSE2_enable( 0 )todisable SSE2 support

SergeyKostrov · ‎07-27-2012

Quoting iliyapolak

Windows Calculator was used:a top result iswithout a range reduction and bottom results is with a "simulated" range reduction

Hi Sergey!
How were you able to disable a range reduction in Windows Calculator?
I need to check Calc.exe imports...

Please check it andlet us knowif you find something interesting. To be honest, I'm very impressed with capabilities
of Windows Calculator andI finally understood why it is so accurate. Please take a look at its help:

WinCalc Help -> Index -> precision

and this is what it displays:

Understanding Extended Precision [SergeyK] Is that a128-bit precision?

Extended Precision, a feature of Calculator, means that all operations are accurate to
at least 32 digits. Calculator also stores rational numbers as fractions to retain accuracy.
For example, 1/3 is stored as 1/3, rather than .333. However, errors accumulate during
repeated operations on irrational numbers. For example, Calculator will truncate pi to 32 digits,
so repeated operations on pi will lose accuracy as the number of operations increases.

SergeyKostrov · ‎07-27-2012

Quoting TimP (Intel)

...I suppose you can perform range reduction yourself before calling the library function...

There is a small problem here. It is impossible to use '%' on any FP-data types.A call to 'fmod' CRT function
creates additional overhead. An example of acompilation error with '%' operator is below:

...
float fValue1 = 765.0f;
float fValue2 = fValue1 % ( int )360;
...

Compilation error:
...
..\Test\DspSetTest.cpp(824) : error C2296: '%' : illegal, left operand has type 'float'
...

Quoting TimP (Intel)
...You could profile or disassemble to see whether the library function uses fsin or fsincos...

It uses SSE2 by default ( SSE2 support is enabled ),or Intel instruction'fsin' when SSE2 support is disabled.
Please take a look at http://software.intel.com/en-us/forums/showpost.php?p=190884

SergeyKostrov · ‎07-27-2012

Quoting iliyapolak

...sin() functions which in turn calls x87 fsin. So disabling a range reduction is not possible because fsin
performs in range reduction in the hardware...

I hope that you've already looked at:

http://software.intel.com/en-us/forums/showpost.php?p=190884
and
http://software.intel.com/en-us/forums/showpost.php?p=190885

Microsoft software developers implemented CRT'sin'functionin a different way. I'll provide technicaldetails later.

Best regards,
Sergey

Bernard · ‎07-27-2012

Hi Sergey! I have also investigated MSVCRT.DLL dynamic library which is responsible for runtime math function. I disassembled this library and counted at least three sin function.One of them uses x87 fsin to perform calculation.Later I will post code snippets.

TimP · ‎07-28-2012

My search engine skills aren't up to the task of researching this, but surely after all the discussion about how fsin does range reduction you don't expect a single standard library function or x87 intrinsic (fprem) to do the job. If it did, we wouldn't have to settle for these things being buried in proprietary code, or obscured beyond belief in open source code.
I didn't think you'd take my suggestion of performing your own range reduction and then calling your compiler's library function as a way to increase speed. If I wasn't clear, I meant it as a step to comparing the accuracy of your method against the one in the library function.
Did you look at any posted code examples such as Moshier's cephes library on netlib?
You could look in glibc or newlib, but those seem to set a priority on obscurity.
Plauger's "The Standard C Library" gives the shortest useful explanation I've seen of a typical compromise between speed and accuracy.
For example, my quick and dirty method for 128-bit long double involves these steps:
calculate how many multiples of Pi/2 you must subtract to produce |x| < Pi/4.
Take a value of Pi/2 correctly rounded to double precision (best double approximation to Pi/2); working in long double, subtract the multiple of the double value, then subtract the multiple of the difference between Pi/2 and your double rounded value of Pi/2. You must use compiler options which follow the standards on associativity with parentheses (not icc default). In effect, you are using a value of Pi/2 which is correct to the best of your ability to double plus long double precision. Long double and a half, in the best case.
There have been published implemented algorithms for numerically exact range reduction; but I haven't got my hands on them.
I don't think Bill Plauger got rich trying to explain this, so I'm not expecting much.

Bernard · ‎07-28-2012

What Windows OS did you use? Here is a screenshot with a version of Windows Calculator I used

I have Windows 7.
I tried three times to calculate sine of 10^1024 radians ,but unfortunately Calc.exe always has been in "hang" state.

Bernard · ‎07-28-2012

I hope that you've already looked at:

http://software.intel.com/en-us/forums/showpost.php?p=190884
and
http://software.intel.com/en-us/forums/showpost.php?p=190885

Microsoft software developers implemented CRT'sin'functionin a different way. I'll provide technicaldetails later

Here is the disassembled code from one of the CRT library sin() functions.You can clearly see the x87 'fsin' call.
The investigated CRT sine implementation is called
_CIsin which is compiler optimized version of many sin() implementations.Please read this post:
http://software.intel.com/en-us/forums/showthread.php?t=105474post #201

[bash]               push    edx
.text:6FF759A3                 fstcw   [esp+4+var_4]
.text:6FF759A7                 jz      loc_6FF7E5AC
.text:6FF759AD                 cmp     [esp+4+var_4], 27Fh
.text:6FF759B3                 jz      short loc_6FF759BB
.text:6FF759B5                 fldcw   ds:word_6FF5C0D6
.text:6FF759BB
.text:6FF759BB loc_6FF759BB:                           ; CODE XREF: sub_6FF759A2+11j
.text:6FF759BB                 fsin ; x87 instruction call .text:6FF759BD                 fstsw   ax
.text:6FF759C0                 sahf
.text:6FF759C1                 jp      loc_6FF7E552
.text:6FF759C7
.text:6FF759C7 loc_6FF759C7:                           ; CODE XREF: sub_6FF759A2+8BC4j
.text:6FF759C7                 cmp     dword_6FFF50C4, 0
.text:6FF759CE                 jnz     loc_6FF6C003
.text:6FF759D4                 mov     edx, 1Eh
.text:6FF759D9                 lea     ecx, unk_6FFF5390
.text:6FF759DF                 jmp     loc_6FF5EE3C
.text:6FF759DF sub_6FF759A2    endp
[/bash]

Bernard · ‎07-28-2012

I've spent some time today investigating it and my answer is No. However, I've discovered a very interesting
feature ofCRTmath functions from Microsoft

MSVCRT.DLL has three sine functions:

1.__libm_sse2_sin >> This function uses SSE instructions which are part of the Horner scheme approximation of the sine function.

2._CIsin>> which is compiler optimized version of many sin() implementations please look here:http://software.intel.com/en-us/forums/showthread.php?t=105474post #201

3.sin function>> asfar as I was able to understand assembly code this function also calls 'fsin'.

Bernard · ‎07-28-2012

Did you look at any posted code examples such as Moshier's cephes library on netlib?
You could look in glibc or newlib, but those seem to set a priority on obscurity

You can also look at FDLIBM 5 free math library which implements range-reduction for trigonometric functions.
My method is slightly different.
I start with a large radian argument and divide it by 2*Pi radians.Next I have a value which represents a number of full 2*Pi cycles and the decimal part of this number I isolate and extract.
Next this decimal part is multiplied by 2*Pi in order to obtain the proper quadrant from this value I extract decimal part which is multiplied by Pi/2 and the result is subtracted from Pi/2.By knowing the exact part of the full sine cycle which is obtained from the first decimal value I use if- statement which returns a proper quadrant sign.

Bernard · ‎07-28-2012

Understanding Extended Precision [SergeyK] Is that a128-bit precision?

It is hard to tell without seeing the source code or doing the full disassembling.
It seems to me that probably when the "Extended Precision" mode is enabled Calc.exe could possibly switch to some kind of "poor's man" arbitrary precision arithmetics.
Afaik some of the arbitrary precision implementations storerational numbers as a fractions.

Bernard · ‎07-28-2012

Here is link to Microsoft page describing '_set_SSE2_enable' : http://msdn.microsoft.com/en-us/library/bthd138d.aspx
Interesting that there is no support for trigonometric functions beside the 'atan'.
From my own investigation I know that run-time library implements SSE - coded sin() function.

SergeyKostrov · ‎07-28-2012

Quoting TimP (Intel)

My search engine skills aren't up to the task of researching this, but surely after all the discussion about how fsin does range reduction you don't expect a single standard library function or x87 intrinsic (fprem) to do the job...

[SergeyK] IfI will decide to implementmy own Sine based on Intel fsin instruction I won't do a range
reduction at the beginning. There is a high probability ( let's say 95% )that an input argument in radians
will be in a range from -2^63 to +2^63. So, simply try to use fsin and check for anerror. If there is
some error than do a range reduction and call fsin again. If there is no error afterthe 1st call to fsinthen return a result.

For example, my quick and dirty method for 128-bit long double...

[SergeyK] Unfortunately,I don't use a'long double' FP-type. I've done a couple of tests yesterday
for a 'double' FP-type ( 53-bit precision ) and my range reduction codeworks well up to 10^12*Radians.
At 10^14*Radians accuracy drops ( just 2 or 3 digits are correct depending on a value of an angle ),
and after 10^16*Radians all resultsare unacceptable.