Software Archive
Read-only legacy content

IEEE 754 Standard Compliance: CPU vs GPU or, a War between Intel and NVIDIA

SergeyKostrov
Valued Contributor II
2,430 Views
*** IEEE 754 Standard Compliance: CPU vs GPU or, a War between Intel and NVIDIA *** [ Abstract ] In 2011 NVIDIA published an article about the compliance of floating-point arithmetic on GPUs and compared it with floating-point arithmetic on CPUs. The article is very good, but it has many errors, "crafted" test cases designed to show that CPUs have issues while GPUs do not, and some obsolete technical information. Even though the article was last updated in 2015, none of the errors I've found have been fixed. My review of the article will be submitted later on.
0 Kudos
51 Replies
SergeyKostrov
Valued Contributor II
1,501 Views
Floating Point and IEEE 754 Compliance for NVIDIA GPUs. http://docs.nvidia.com/cuda/floating-point/index.html Last updated September 1, 2015. Even though the article was last updated in 2015, a couple of errors remain unfixed.
0 Kudos
SergeyKostrov
Valued Contributor II
1,501 Views
[ ...For x = 1.0008, the correct mathematical result is x^2 - 1 = 1.60064 x 10^-4... ] Here is my comment: The correct mathematical result is actually 1.60064 x 10^-3.
0 Kudos
SergeyKostrov
Valued Contributor II
1,501 Views
[ FMA related inaccuracy ] ... ...At the time this paper is written (Spring 2011) there are no commercially available x86 CPUs which offer hardware FMA... ... Here is my comment: This statement became obsolete very quickly. AMD announced support for FMA in 2007 and Intel in 2008; AMD shipped x86 CPUs with hardware FMA4 in late 2011, only months after the paper was written, and Intel added FMA3 to its AVX family of instruction sets soon after.
0 Kudos
SergeyKostrov
Valued Contributor II
1,501 Views
[ Related to methods of computing a vector dot product ]
Serial method to compute a vector dot product:
((((a1 x b1) + (a2 x b2)) + (a3 x b3)) + (a4 x b4))
FMA method to compute a vector dot product:
(a4 x b4 + (a3 x b3 + (a2 x b2 + (a1 x b1 + 0))))
Parallel method (pairwise reduction of the individual products):
((a1 x b1) + (a2 x b2)) + ((a3 x b3) + (a4 x b4))
0 Kudos
SergeyKostrov
Valued Contributor II
1,501 Views
[ GCC compiler option -lm ] ... gcc test.c -lm -m64 ... gcc test.c -lm -m32 ... The -lm option is actually a linker flag: it links against libm, the standard math library, which on Linux provides cos() and the other math functions. MinGW, the port of GCC for Windows platforms that I have used for many years, does not need it because the math functions come from the Microsoft C runtime, which is why the option may look unfamiliar there.
0 Kudos
SergeyKostrov
Valued Contributor II
1,501 Views
[ Accuracy of computations of 32-bit codes vs 64-bit codes - 1 ]

...This shows that the result of computing cos( 5992555.0 ) using a common library differs depending on whether the code is compiled in 32-bit mode or 64-bit mode...
...
volatile float x = 5992555.0;
printf("cos(%f): %.10g\n", x, cos(x));
...
gcc test.c -lm -m64
cos( 5992555.000000 ): 3.320904615e-07
gcc test.c -lm -m32
cos( 5992555.000000 ): 3.320904692e-07

...The consequence is that different math libraries cannot be expected to compute exactly the same result for a given input...

Here is my comment: These results are absolutely correct! This is because the floating-point environment is initialized differently in 32-bit and 64-bit modes. I would consider it a legacy issue, and if you look at the float.h header file you will see the following:

[ float.h ] This is how _CW_DEFAULT is defined in the Microsoft C++ compiler:
...
#if defined(_M_IX86)
#define _CW_DEFAULT ( _RC_NEAR + _PC_53 + _EM_INVALID + _EM_ZERODIVIDE + _EM_OVERFLOW + _EM_UNDERFLOW + _EM_INEXACT + _EM_DENORMAL)
#elif defined(_M_IA64)
#define _CW_DEFAULT ( _RC_NEAR + _PC_64 + _EM_INVALID + _EM_ZERODIVIDE + _EM_OVERFLOW + _EM_UNDERFLOW + _EM_INEXACT + _EM_DENORMAL)
#elif defined(_M_AMD64)
#define _CW_DEFAULT ( _RC_NEAR + _EM_INVALID + _EM_ZERODIVIDE + _EM_OVERFLOW + _EM_UNDERFLOW + _EM_INEXACT + _EM_DENORMAL)
#endif
...

This is how _CW_DEFAULT is defined in the MinGW C++ compiler:
...
#if defined(_M_IX86)
#define _CW_DEFAULT (_RC_NEAR+_PC_53+_EM_INVALID+_EM_ZERODIVIDE+_EM_OVERFLOW+_EM_UNDERFLOW+_EM_INEXACT+_EM_DENORMAL)
#elif defined(_M_IA64)
#define _CW_DEFAULT (_RC_NEAR+_PC_64+_EM_INVALID+_EM_ZERODIVIDE+_EM_OVERFLOW+_EM_UNDERFLOW+_EM_INEXACT+_EM_DENORMAL)
#elif defined(_M_AMD64)
#define _CW_DEFAULT (_RC_NEAR+_EM_INVALID+_EM_ZERODIVIDE+_EM_OVERFLOW+_EM_UNDERFLOW+_EM_INEXACT+_EM_DENORMAL)
#endif
...

As you can see, 32-bit x86 code ( _M_IX86 ) sets the x87 FPU to 53-bit precision ( _PC_53 ), while 64-bit precision ( _PC_64 ) applies only to Itanium ( _M_IA64 ). For 64-bit x86 ( _M_AMD64 ) there is no precision-control flag at all, because 64-bit code uses SSE2 instructions, which have no x87-style precision control. So a 32-bit build typically evaluates cos() through the x87 FPU, while a 64-bit build uses SSE2, and that difference in evaluation paths is what makes the two builds print slightly different values.
0 Kudos
SergeyKostrov
Valued Contributor II
1,501 Views
[ Accuracy of computations of 32-bit codes vs 64-bit codes - 2 ]
Note 1: cos( 5992555.0 rads ) = cos( 1.570795994704435107627171818956 rads ) = cos( 89.999980972618138833377959990966 degs )
Note 2: ~57.29 degs (1 rad) * 200,000 = ~11,458,000
Also, Intel recommends in the Software Developer's Manual performing argument reduction before passing large values to the trigonometric instructions.
0 Kudos
SergeyKostrov
Valued Contributor II
1,501 Views
Take a look at the Intel 64 and IA-32 Architectures Software Developer's Manual ( Vol. 1 ), Topic 8.3.10, Transcendental Instruction Accuracy, for more technical details.
0 Kudos
SergeyKostrov
Valued Contributor II
1,501 Views
[ Accuracy of computations when an argument needs to be reduced by a multiple of 2*PI ]
Example A - Argument in radians; reduction by a multiple of 2*PI was applied:
cos( 1.570795994704435107627171818956 rads )                    = 3.3209046151159804582447704191223e-7
cos( 1.570795994704435107627171818956 rads + 2*PI * 10 )         = 3.3209046151159804582447704191159e-7
cos( 1.570795994704435107627171818956 rads + 2*PI * 100 )        = 3.3209046151159804582447704191489e-7
cos( 1.570795994704435107627171818956 rads + 2*PI * 1000 )       = 3.3209046151159804582447704185331e-7
cos( 1.570795994704435107627171818956 rads + 2*PI * 10000 )      = 3.3209046151159804582447704217814e-7
cos( 1.570795994704435107627171818956 rads + 2*PI * 100000 )     = 3.3209046151159804582447703554761e-7
cos( 1.570795994704435107627171818956 rads + 2*PI * 1000000 )    = 3.3209046151159804582447697826605e-7
cos( 1.570795994704435107627171818956 rads + 2*PI * 10000000 )   = 3.3209046151159804582447640545038e-7
cos( 1.570795994704435107627171818956 rads + 2*PI * 100000000 )  = 3.3209046151159804582447067729372e-7
cos( 1.570795994704435107627171818956 rads + 2*PI * 1000000000 ) = 3.3209046151159804582441339572710e-7
Precision is 31 digits after the decimal point, and only 21 digits are the same across the series.
0 Kudos
SergeyKostrov
Valued Contributor II
1,501 Views
[ Accuracy of computations when an argument needs to be reduced by a multiple of 360 ]
Example B - Argument in degrees; reduction by a multiple of 360 was applied:
cos( 89.999980972618138833377959990966 degs )                    = 3.3209046151159804582447713995504e-7
cos( 89.999980972618138833377959990966 degs + 360 * 10 )         = 3.3209046151159804582447713995504e-7
cos( 89.999980972618138833377959990966 degs + 360 * 100 )        = 3.3209046151159804582447713995504e-7
cos( 89.999980972618138833377959990966 degs + 360 * 1000 )       = 3.3209046151159804582447713995504e-7
cos( 89.999980972618138833377959990966 degs + 360 * 10000 )      = 3.3209046151159804582447713995504e-7
cos( 89.999980972618138833377959990966 degs + 360 * 100000 )     = 3.3209046151159804582447713995500e-7
cos( 89.999980972618138833377959990966 degs + 360 * 1000000 )    = 3.3209046151159804582447713995503e-7
cos( 89.999980972618138833377959990966 degs + 360 * 10000000 )   = 3.3209046151159804582447713995505e-7
cos( 89.999980972618138833377959990966 degs + 360 * 100000000 )  = 3.3209046151159804582447713995504e-7
cos( 89.999980972618138833377959990966 degs + 360 * 1000000000 ) = 3.3209046151159804582447713995504e-7
Precision is 31 digits after the decimal point, and 30 digits are the same across the series.
0 Kudos
SergeyKostrov
Valued Contributor II
1,501 Views
[ NVIDIA Test Case - C Source codes ]
...
union { float f; unsigned int i; } a, b;
float r;
a.i = 0x3F800001;
b.i = 0xBF800002;
r = a.f * a.f + b.f;
printf( "a %.8g\n", a.f );
printf( "b %.8g\n", b.f );
printf( "r %.8g\n", r );
...
It is not clear to me why such a complicated method of initializing the union members was chosen, because
...
a.i = 0x3F800001;
b.i = 0xBF800002;
...
is equivalent to
a.f = 1.0000001f;
b.f = -1.0000002f;
(the hexadecimal patterns are simply the bit-exact encodings of those two single-precision values).
0 Kudos
SergeyKostrov
Valued Contributor II
1,501 Views
[ NVIDIA Test Case - Outputs ]
// x86-64 output:
a: 1.0000001
b: -1.0000002
r: 0
// NVIDIA Fermi output:
a: 1.0000001
b: -1.0000002
r: 1.4210855e-14
Here is my comment: These results are absolutely correct! In the x86-64 case the result equals 0 because the Fast floating-point model was used, so the product is rounded to single precision before the addition and the tiny residual is lost. If the Precise floating-point model is used, then the result will also be 1.4210855e-14. Floating-point models can be controlled with a C++ compiler option or at run time with the _control87 CRT function.
0 Kudos
SergeyKostrov
Valued Contributor II
1,501 Views
[ Representation of 1.4210855e-14 according to IEEE 754 Standard, Single precision, 32-bit ]
...
// NVIDIA Fermi output:
...
r: 1.4210855e-14
...
FP-value : 1.4210855e-14
Note : Most accurate representation = 1.42108547152020037174224853516E-14
Binary-value : 0x28800000 = 00101000 10000000 00000000 00000000
IEEE 754 parts: 0 (sign) 01010001 (exponent) 00000000000000000000000 (mantissa)
That is, the biased exponent is 81, so the value is exactly 2^(81-127) = 2^-46.
0 Kudos
SergeyKostrov
Valued Contributor II
1,501 Views
[ Modified Test Case 1 - C Source codes ]

RTuint uiControlWordx87 = 0U;
uiControlWordx87 = CrtControl87( _RTFPU_CW_DEFAULT, _RTFPU_CW_ALLBITSON );
// uiControlWordx87 = CrtControl87( _RTFPU_PC_24, _RTFPU_MCW_PC );
// uiControlWordx87 = CrtControl87( _RTFPU_PC_53, _RTFPU_MCW_PC );

// Verification 1
{
    _RTunion { RTfloat f; RTuint i; } a1, b1, r1;
    a1.i = 0x3F800001;
    b1.i = 0xBF800002;
    r1.f = 0.0f;
    r1.f = a1.f * a1.f + b1.f;
    CrtPrintf( RTU("Verification 1.1 of IEEE-754 Standard for SP ( 24-bit ) arithmetic:\n") );
    CrtPrintf( RTU("\ta1=% .7f\n"), a1.f );
    CrtPrintf( RTU("\tb1=% .7f\n"), b1.f );
    CrtPrintf( RTU("\tr1=% .21f\n"), r1.f );
}

// Verification 2
{
    _RTunion { RTfloat f; RTuint i; } a2, b2, r2;
    a2.f = 1.0000001f;
    b2.f = -1.0000002f;
    r2.f = 0.0f;
    r2.f = a2.f * a2.f + b2.f;
    CrtPrintf( RTU("Verification 2.1 of IEEE-754 Standard for SP ( 24-bit ) arithmetic:\n") );
    CrtPrintf( RTU("\ta2=% .7f\n"), a2.f );
    CrtPrintf( RTU("\tb2=% .7f\n"), b2.f );
    CrtPrintf( RTU("\tr2=% .21f\n"), r2.f );
}
0 Kudos
SergeyKostrov
Valued Contributor II
1,501 Views
[ Verification was done using six C++ compilers ]
Microsoft C++ compiler ( VS2005 PE ) 32-bit
Borland C++ compiler v5.5.1 32-bit
Intel C++ compiler v12.1.7 ( u371 ) 32-bit
MinGW C++ compiler v5.1.0 32-bit
Watcom C++ compiler v2.0.0 32-bit
Turbo C++ compiler v3.0.0 16-bit
0 Kudos
SergeyKostrov
Valued Contributor II
1,501 Views
[ Microsoft C++ compiler ( VS2005 PE ) 32-bit - Debug ]
...
Verification 1.1 of IEEE-754 Standard for SP ( 24-bit ) arithmetic:
    a1= 1.0000001
    b1=-1.0000002
    r1= 0.000000000000014210855
Verification 2.1 of IEEE-754 Standard for SP ( 24-bit ) arithmetic:
    a2= 1.0000001
    b2=-1.0000002
    r2= 0.000000000000014210855
...
Note: Floating Point Model option: /fp:precise
0 Kudos
SergeyKostrov
Valued Contributor II
1,501 Views
[ Microsoft C++ compiler ( VS2005 PE ) 32-bit - Release ]
...
Verification 1.1 of IEEE-754 Standard for SP ( 24-bit ) arithmetic:
    a1= 1.0000001
    b1=-1.0000002
    r1= 0.000000000000000000000
Verification 2.1 of IEEE-754 Standard for SP ( 24-bit ) arithmetic:
    a2= 1.0000001
    b2=-1.0000002
    r2= 0.000000000000014210855
...
Note: Floating Point Model option: /fp:fast
0 Kudos
SergeyKostrov
Valued Contributor II
1,501 Views
[ Borland C++ compiler v5.5.1 32-bit - Debug ]
...
Verification 1.1 of IEEE-754 Standard for SP ( 24-bit ) arithmetic:
    a1= 1.0000001
    b1=-1.0000002
    r1= 0.000000000000014210855
Verification 2.1 of IEEE-754 Standard for SP ( 24-bit ) arithmetic:
    a2= 1.0000001
    b2=-1.0000002
    r2= 0.000000000000014210855
...
Note: Floating Point Model option: Default
0 Kudos
SergeyKostrov
Valued Contributor II
1,501 Views
[ Borland C++ compiler v5.5.1 32-bit - Release ]
...
Verification 1.1 of IEEE-754 Standard for SP ( 24-bit ) arithmetic:
    a1= 1.0000001
    b1=-1.0000002
    r1= 0.000000000000014210855
Verification 2.1 of IEEE-754 Standard for SP ( 24-bit ) arithmetic:
    a2= 1.0000001
    b2=-1.0000002
    r2= 0.000000000000014210855
...
Note: Floating Point Model option: Default
0 Kudos
SergeyKostrov
Valued Contributor II
1,376 Views
[ Intel C++ compiler v12.1.7 ( u371 ) 32-bit - Debug ]
...
Verification 1.1 of IEEE-754 Standard for SP ( 24-bit ) arithmetic:
    a1= 1.0000001
    b1=-1.0000002
    r1= 0.000000000000014210855
Verification 2.1 of IEEE-754 Standard for SP ( 24-bit ) arithmetic:
    a2= 1.0000001
    b2=-1.0000002
    r2= 0.000000000000014210855
...
Note: Floating Point Model option: /fp:precise
0 Kudos