*** IEEE 754 Standard Compliance: CPU vs GPU or, a War between Intel and NVIDIA ***
[ Abstract ]
In 2011 NVIDIA published an article about the compliance of floating-point arithmetic on GPUs with IEEE 754, and compared it
with floating-point arithmetic on CPUs. The article is very good, but it has a number of errors, test cases "crafted" to
demonstrate that CPUs have some issues and GPUs do not, and some technical information that is obsolete.
Although the article was last updated in 2015, none of the errors I've found have been fixed.
My review of the article will be submitted later on.
Floating Point and IEEE 754 Compliance for NVIDIA GPUs
http://docs.nvidia.com/cuda/floating-point/index.html
Last updated September 1, 2015
Although the article was last updated in 2015, it still contains a couple of errors that have not been fixed.
[ ...For x = 1.0008, the correct mathematical result is x^2 - 1 = 1.60064 x 10^-4... ]
Here is my comment: The correct mathematical result is actually 1.60064 x 10^-3, because 1.0008^2 = 1.00160064.
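The arithmetic is easy to check: 1.0008 x 1.0008 = 1.00160064, so x^2 - 1 = 0.00160064. A minimal C sketch of the same check in double precision follows; the function names are mine and are not taken from the NVIDIA article.

```c
#include <stdio.h>

/* Computes x*x - 1 in double precision. */
double square_minus_one(double x)
{
    return x * x - 1.0;
}

/* Prints the value in scientific notation so the power of ten is visible. */
void print_square_minus_one(double x)
{
    printf("x = %.4f, x^2 - 1 = %.5e\n", x, square_minus_one(x));
}
```

Calling print_square_minus_one( 1.0008 ) is enough to compare the exponent with the one quoted from the article.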
[ FMA related inaccuracy ]
...
...At the time this paper is written (Spring 2011) there are no commercially available x86 CPUs which offer hardware FMA...
...
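Here is my comment: AMD announced support for FMA in 2007, and Intel in 2008. In Q1 of 2011 Intel released its first CPUs with the AVX Instruction Set, but those CPUs did not yet implement the FMA instructions; hardware FMA on x86 CPUs arrived later. So the statement was accurate when it was written, but it is obsolete now.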
[ Related to methods of computing a vector dot product ]
Serial Method to Compute Vector Dot Product
((((a1 x b1) + (a2 x b2)) + (a3 x b3)) + (a4 x b4))
FMA Method to Compute Vector Dot Product
(a4 x b4 + (a3 x b3 + (a2 x b2 + (a1 x b1 + 0))))
Parallel Method ( Pairwise Reduction of the Individual Element Products ) to Compute Vector Dot Product
((a1 x b1) + (a2 x b2)) + ((a3 x b3) + (a4 x b4))
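The three orderings can be sketched in C as shown below. This is only an illustration under stated assumptions: the function names are hypothetical, and the fused step is emulated in double precision (where the product of two floats is exact) rather than with a hardware FMA or fmaf() from math.h, so in rare cases its rounding can differ from a true FMA.

```c
#include <stddef.h>

/* Serial method: every product is rounded to float, then accumulated left to right. */
float dot_serial(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        volatile float p = a[i] * b[i];  /* volatile forces rounding to float */
        sum += p;
    }
    return sum;
}

/* FMA-style method: sum = a[i]*b[i] + sum, with the product left unrounded.
   Assumption: emulated in double; the only float rounding is the final cast. */
float dot_fma_style(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum = (float)((double)a[i] * (double)b[i] + (double)sum);
    return sum;
}

/* Parallel method: split the vector in halves and add the two partial sums. */
float dot_pairwise(const float *a, const float *b, size_t n)
{
    if (n == 0)
        return 0.0f;
    if (n == 1) {
        volatile float p = a[0] * b[0];
        return p;
    }
    size_t half = n / 2;
    volatile float left  = dot_pairwise(a, b, half);
    volatile float right = dot_pairwise(a + half, b + half, n - half);
    return left + right;
}
```

For small integer inputs all three functions return the same value. They start to disagree only when a rounded product loses bits that a later addition would have needed, which is what the test case in the article relies on.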
[ GCC compiler option -lm ]
...
gcc test.c -lm -m64
...
gcc test.c -lm -m32
...
I've been using MinGW C++ compilers, a port of GCC for Windows platforms, for many years and
had not come across this option before. It is not a compiler option as such: -l<library> is the standard GCC link option,
and -lm links the program against the math library ( libm ), which provides cos().
[ Accuracy of computations of 32-bit codes vs 64-bit codes - 1 ]
...This shows that the result of computing cos( 5992555.0 ) using a common library
differs depending on whether the code is compiled in 32-bit mode or 64-bit mode...
...
volatile float x = 5992555.0;
printf("cos(%f): %.10g\n", x, cos(x));
...
gcc test.c -lm -m64
cos( 5992555.000000 ): 3.320904615e-07
gcc test.c -lm -m32
cos( 5992555.000000 ): 3.320904692e-07
...The consequence is that different math libraries cannot be expected to compute
exactly the same result for a given input...
Here is my comment: These results are absolutely correct! The difference arises because the FPU control word
is initialized differently in 32-bit and 64-bit modes. I would consider it a legacy
issue, and if you look at the float.h header file you will see the following:
[ float.h ]
This is how _CW_DEFAULT is defined in the Microsoft C++ compiler:
...
#if defined(_M_IX86)
#define _CW_DEFAULT ( _RC_NEAR + _PC_53 + _EM_INVALID + _EM_ZERODIVIDE + _EM_OVERFLOW + _EM_UNDERFLOW + _EM_INEXACT + _EM_DENORMAL)
#elif defined(_M_IA64)
#define _CW_DEFAULT ( _RC_NEAR + _PC_64 + _EM_INVALID + _EM_ZERODIVIDE + _EM_OVERFLOW + _EM_UNDERFLOW + _EM_INEXACT + _EM_DENORMAL)
#elif defined(_M_AMD64)
#define _CW_DEFAULT ( _RC_NEAR + _EM_INVALID + _EM_ZERODIVIDE + _EM_OVERFLOW + _EM_UNDERFLOW + _EM_INEXACT + _EM_DENORMAL)
#endif
...
This is how _CW_DEFAULT is defined in the MinGW C++ compiler:
...
#if defined(_M_IX86)
#define _CW_DEFAULT (_RC_NEAR+_PC_53+_EM_INVALID+_EM_ZERODIVIDE+_EM_OVERFLOW+_EM_UNDERFLOW+_EM_INEXACT+_EM_DENORMAL)
#elif defined(_M_IA64)
#define _CW_DEFAULT (_RC_NEAR+_PC_64+_EM_INVALID+_EM_ZERODIVIDE+_EM_OVERFLOW+_EM_UNDERFLOW+_EM_INEXACT+_EM_DENORMAL)
#elif defined(_M_AMD64)
#define _CW_DEFAULT (_RC_NEAR+_EM_INVALID+_EM_ZERODIVIDE+_EM_OVERFLOW+_EM_UNDERFLOW+_EM_INEXACT+_EM_DENORMAL)
#endif
...
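As you can see, for 32-bit x86 ( _M_IX86 ) a 53-bit precision ( _PC_53 ) is set, and
a 64-bit precision ( _PC_64 ) is set for Itanium ( _M_IA64 ). For x64 ( _M_AMD64 ) no precision-control value is specified at all.
It means that the default precision of intermediate x87 computations depends on the target platform, and with the 64-bit precision
setting the accuracy of computations will be better. Note that these are the defaults of the Windows C runtime; the gcc test
quoted above may run with different defaults.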
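The _CW_DEFAULT macro is specific to the Windows C runtime. A portable way to see how a given compiler evaluates intermediate floating-point results is the C99 macro FLT_EVAL_METHOD from float.h. A minimal sketch follows; the function names are hypothetical. Builds that use the x87 FPU commonly report 2 and builds that use SSE commonly report 0, but that is an assumption to verify on the actual toolchain.

```c
#include <float.h>
#include <stdio.h>

/* Returns FLT_EVAL_METHOD, or -1 if the compiler does not define it. */
int eval_method(void)
{
#ifdef FLT_EVAL_METHOD
    return FLT_EVAL_METHOD;
#else
    return -1;
#endif
}

/* Maps the standard values of FLT_EVAL_METHOD to a short description. */
const char *eval_method_name(int m)
{
    switch (m) {
    case 0:  return "each operation evaluated in its own type";
    case 1:  return "float and double evaluated as double";
    case 2:  return "all operations evaluated as long double";
    default: return "indeterminable or implementation-defined";
    }
}

/* Prints the evaluation method of the current build. */
void print_eval_method(void)
{
    int m = eval_method();
    printf("FLT_EVAL_METHOD = %d ( %s )\n", m, eval_method_name(m));
}
```

Comparing the output of print_eval_method() for a 32-bit build and a 64-bit build of the same source shows whether the two builds evaluate intermediate results differently.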
[ Accuracy of computations of 32-bit codes vs 64-bit codes - 2 ]
Note 1:
cos( 5992555.0 rads ) = cos( 1.570795994704435107627171818956 rads ) =
cos( 89.999980972618138833377959990966 degs )
Note 2: ~57.29 degs ( 1 rad ) * 200,000 = ~11,458,000 degs. Also, the Intel Software Developer's Manual recommends performing argument reduction before passing a value to any trigonometric function.
For more technical details take a look at topic 8.3.10 Transcendental Instruction Accuracy in the
Intel 64 and IA-32 Architectures Software Developer's Manual ( Vol 1 ).
[ Accuracy of computations when an argument needs to be reduced by a multiple of 2*PI ]
Example A - Argument in radians; a multiple of 2*PI is added to the argument:
cos( 1.570795994704435107627171818956 rads ) = 3.3209046151159804582447704191223e-7
cos( 1.570795994704435107627171818956 rads + 2*PI * 10 ) = 3.3209046151159804582447704191159e-7
cos( 1.570795994704435107627171818956 rads + 2*PI * 100 ) = 3.3209046151159804582447704191489e-7
cos( 1.570795994704435107627171818956 rads + 2*PI * 1000 ) = 3.3209046151159804582447704185331e-7
cos( 1.570795994704435107627171818956 rads + 2*PI * 10000 ) = 3.3209046151159804582447704217814e-7
cos( 1.570795994704435107627171818956 rads + 2*PI * 100000 ) = 3.3209046151159804582447703554761e-7
cos( 1.570795994704435107627171818956 rads + 2*PI * 1000000 ) = 3.3209046151159804582447697826605e-7
cos( 1.570795994704435107627171818956 rads + 2*PI * 10000000 ) = 3.3209046151159804582447640545038e-7
cos( 1.570795994704435107627171818956 rads + 2*PI * 100000000 ) = 3.3209046151159804582447067729372e-7
cos( 1.570795994704435107627171818956 rads + 2*PI * 1000000000 ) = 3.3209046151159804582441339572710e-7
Precision is 31 digits after the decimal point, and only the first 21 of them are the same in all results. This is how the precision changes:
... = 3.3209046151159804582447704191223e-7
... = 3.3209046151159804582447704191159e-7
... = 3.3209046151159804582447704191489e-7
... = 3.3209046151159804582447704185331e-7
... = 3.3209046151159804582447704217814e-7
... = 3.3209046151159804582447703554761e-7
... = 3.3209046151159804582447697826605e-7
... = 3.3209046151159804582447640545038e-7
... = 3.3209046151159804582447067729372e-7
... = 3.3209046151159804582441339572710e-7
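The loss of digits is consistent with the fact that 2*PI has no exact binary representation, so every subtracted period carries a small error. A minimal sketch of a naive reduction in double precision follows; the names are hypothetical and this is not how a production math library reduces arguments. For x = 5992555.0 it returns roughly 4.7123893 ( about 3*PI/2 + 3.3e-7 ), an angle whose cosine is the same as in Note 1, with an absolute error of roughly 1e-9 that already affects the leading digits of a result as small as 3.3e-7.

```c
/* 2*PI rounded to double; the rounding error of this constant is multiplied
   by the number of periods that are removed. */
static const double TWO_PI = 6.283185307179586476925286766559;

/* Naive reduction of a non-negative angle to the range [0, 2*PI).
   Assumption: x >= 0 and x / TWO_PI fits in a long long. */
double reduce_radians_naive(double x)
{
    long long k = (long long)(x / TWO_PI);   /* number of whole periods */
    return x - (double)k * TWO_PI;           /* k * TWO_PI is rounded again here */
}
```

A careful implementation would carry extra bits of PI through the reduction, which is the reason for the recommendation to reduce arguments before calling trigonometric functions.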
[ Accuracy of computations when an argument needs to be reduced by a multiple of 360 ]
Example B - Argument in degrees; a multiple of 360 is added to the argument:
cos( 89.999980972618138833377959990966 degs ) = 3.3209046151159804582447713995504e-7
cos( 89.999980972618138833377959990966 degs + 360 * 10 ) = 3.3209046151159804582447713995504e-7
cos( 89.999980972618138833377959990966 degs + 360 * 100 ) = 3.3209046151159804582447713995504e-7
cos( 89.999980972618138833377959990966 degs + 360 * 1000 ) = 3.3209046151159804582447713995504e-7
cos( 89.999980972618138833377959990966 degs + 360 * 10000 ) = 3.3209046151159804582447713995504e-7
cos( 89.999980972618138833377959990966 degs + 360 * 100000 ) = 3.3209046151159804582447713995500e-7
cos( 89.999980972618138833377959990966 degs + 360 * 1000000 ) = 3.3209046151159804582447713995503e-7
cos( 89.999980972618138833377959990966 degs + 360 * 10000000 ) = 3.3209046151159804582447713995505e-7
cos( 89.999980972618138833377959990966 degs + 360 * 100000000 ) = 3.3209046151159804582447713995504e-7
cos( 89.999980972618138833377959990966 degs + 360 * 1000000000 ) = 3.3209046151159804582447713995504e-7
Precision is 31 digits after the decimal point, and the first 30 of them are the same in all results.
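One plausible explanation for the much better agreement in degrees is that the period 360 is exactly representable in binary floating point, so removing whole periods introduces no error of its own. A minimal sketch in double precision follows; the function name is hypothetical and the inputs are chosen to be exactly representable.

```c
/* Reduction of a non-negative angle in degrees to the range [0, 360).
   Assumption: x >= 0 and x / 360 fits in a long long. */
double reduce_degrees(double x)
{
    long long k = (long long)(x / 360.0);   /* number of whole turns */
    return x - (double)k * 360.0;           /* exact while k * 360 is representable */
}
```

For example, reduce_degrees( 89.5 + 360.0 * 1000000.0 ) returns exactly 89.5, because every intermediate value in that computation is exactly representable as a double.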
[ NVIDIA Test Case - C Source codes ]
...
union
{
    float f;
    unsigned int i;
} a, b;
float r;
a.i = 0x3F800001;
b.i = 0xBF800002;
r = a.f * a.f + b.f;
printf( "a %.8g\n", a.f );
printf( "b %.8g\n", b.f );
printf( "r %.8g\n", r );
...
It is not clear to me why such a complicated method of initializing the union members was selected, because
...
a.i = 0x3F800001;
b.i = 0xBF800002;
...
is equivalent to
a.f = 1.0000001f;
b.f = -1.0000002f;
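The equivalence of the two initializations can be checked by comparing bit patterns. A minimal sketch follows, assuming that float is a 32-bit IEEE 754 single-precision type; the helper names are hypothetical. The hexadecimal form has one advantage: it names the bit pattern exactly and does not depend on how the compiler rounds a decimal literal.

```c
#include <stdint.h>
#include <string.h>

/* Reinterprets a 32-bit pattern as a float ( assumes a 32-bit IEEE 754 float ). */
float float_from_bits(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);   /* well-defined alternative to the union trick */
    return f;
}

/* Returns the 32-bit pattern that represents a float. */
uint32_t bits_from_float(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

With these helpers, bits_from_float( 1.0000001f ) can be compared with 0x3F800001 and bits_from_float( -1.0000002f ) with 0xBF800002.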
[ NVIDIA Test Case - Outputs ]
// x86-64 output:
a: 1.0000001
b: -1.0000002
r: 0
// NVIDIA Fermi output:
a: 1.0000001
b: -1.0000002
r: 1.4210855e-14
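Here is my comment: These results are absolutely correct! In the case of the x86-64 output the result equals
0 because the Fast Floating Point model was used. If the Precise Floating Point model is used, then
the result will be equal to 1.4210855e-14. Floating Point models can be controlled with a C++ compiler option or with the
_control87 CRT function. The exact value of a*a + b is 2^-46 ( ~1.4210855e-14 ): it is returned when the product a*a is not
rounded to single precision before the addition, and 0 is returned when it is.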
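Both outputs can be reproduced on one machine by controlling where the product is rounded. A minimal sketch follows, with hypothetical function names: the first function forces a*a to be rounded to float before the addition, and the second keeps the product unrounded by computing it in double, where the product of two floats is exact. The second function only emulates the effect of an FMA or of a wider intermediate precision; it is not the code used in the article.

```c
/* a*a + b with the product rounded to single precision before the addition. */
float mul_add_rounded(float a, float b)
{
    volatile float p = a * a;   /* volatile forces the store and the rounding to float */
    return p + b;
}

/* a*a + b with the product kept unrounded until the final conversion to float.
   Assumption: double has at least 48 significand bits, so the product is exact. */
float mul_add_unrounded(float a, float b)
{
    return (float)((double)a * (double)a + (double)b);
}
```

For a = 1.0000001f and b = -1.0000002f the first function returns 0 and the second returns 2^-46, because a*a = 1 + 2^-22 + 2^-46 and the last term is lost when the product is rounded to 24 bits.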
[ Representation of 1.4210855e-14 according to IEEE 754 Standard Single precision 32-bit ]
...
// NVIDIA Fermi output:
...
r: 1.4210855e-14
...
FP-value : 1.4210855e-14
Note : Most accurate representation = 1.42108547152020037174224853516E-14
Binary-value : 0x28800000 = 00101000 10000000 00000000 00000000
IEEE 754 Parts: 0 (sign) 01010001 (exponent) 00000000000000000000000 (mantissa)
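The three parts can be extracted from the bit pattern with shifts and masks. A minimal sketch follows; the type and function names are hypothetical. For 0x28800000 the biased exponent is 01010001 = 81, so the unbiased exponent is 81 - 127 = -46, and with a zero mantissa the value is exactly 2^-46.

```c
#include <stdint.h>

/* The three fields of an IEEE 754 single-precision value. */
typedef struct {
    unsigned sign;       /* 1 bit                   */
    unsigned exponent;   /* 8 bits, biased by 127   */
    uint32_t mantissa;   /* 23 bits ( fraction )    */
} sp_fields;

/* Splits a 32-bit pattern into sign, biased exponent and mantissa. */
sp_fields sp_decode(uint32_t bits)
{
    sp_fields f;
    f.sign     = (unsigned)(bits >> 31);
    f.exponent = (unsigned)((bits >> 23) & 0xFFu);
    f.mantissa = bits & 0x7FFFFFu;
    return f;
}
```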
[ Modified Test Case 1 - C Source codes ]
RTuint uiControlWordx87 = 0U;
uiControlWordx87 = CrtControl87( _RTFPU_CW_DEFAULT, _RTFPU_CW_ALLBITSON );
// uiControlWordx87 = CrtControl87( _RTFPU_PC_24, _RTFPU_MCW_PC );
// uiControlWordx87 = CrtControl87( _RTFPU_PC_53, _RTFPU_MCW_PC );
// Verification 1
{
    _RTunion
    {
        RTfloat f;
        RTuint i;
    } a1, b1, r1;
    a1.i = 0x3F800001;
    b1.i = 0xBF800002;
    r1.f = 0.0f;
    r1.f = a1.f * a1.f + b1.f;
    CrtPrintf( RTU("Verification 1.1 of IEEE-754 Standard for SP ( 24-bit ) arithmetic:\n") );
    CrtPrintf( RTU("\ta1=% .7f\n"), a1.f );
    CrtPrintf( RTU("\tb1=% .7f\n"), b1.f );
    CrtPrintf( RTU("\tr1=% .21f\n"), r1.f );
}
// Verification 2
{
    _RTunion
    {
        RTfloat f;
        RTuint i;
    } a2, b2, r2;
    a2.f = 1.0000001f;
    b2.f = -1.0000002f;
    r2.f = 0.0f;
    r2.f = a2.f * a2.f + b2.f;
    CrtPrintf( RTU("Verification 2.1 of IEEE-754 Standard for SP ( 24-bit ) arithmetic:\n") );
    CrtPrintf( RTU("\ta2=% .7f\n"), a2.f );
    CrtPrintf( RTU("\tb2=% .7f\n"), b2.f );
    CrtPrintf( RTU("\tr2=% .21f\n"), r2.f );
}
[ Verification was done using six C++ compilers ]
Microsoft C++ compiler ( VS2005 PE ) 32-bit
Borland C++ compiler v5.5.1 32-bit
Intel C++ compiler v12.1.7 ( u371 ) 32-bit
MinGW C++ compiler v5.1.0 32-bit
Watcom C++ compiler v2.0.0 32-bit
Turbo C++ compiler v3.0.0 16-bit
[ Microsoft C++ compiler ( VS2005 PE ) 32-bit - Debug ]
...
Verification 1.1 of IEEE-754 Standard for SP ( 24-bit ) arithmetic:
a1= 1.0000001
b1=-1.0000002
r1= 0.000000000000014210855
Verification 2.1 of IEEE-754 Standard for SP ( 24-bit ) arithmetic:
a2= 1.0000001
b2=-1.0000002
r2= 0.000000000000014210855
...
Note: Floating Point Model option: /fp:precise
[ Microsoft C++ compiler ( VS2005 PE ) 32-bit - Release ]
...
Verification 1.1 of IEEE-754 Standard for SP ( 24-bit ) arithmetic:
a1= 1.0000001
b1=-1.0000002
r1= 0.000000000000000000000
Verification 2.1 of IEEE-754 Standard for SP ( 24-bit ) arithmetic:
a2= 1.0000001
b2=-1.0000002
r2= 0.000000000000014210855
...
Note: Floating Point Model option: /fp:fast
[ Borland C++ compiler v5.5.1 32-bit - Debug ]
...
Verification 1.1 of IEEE-754 Standard for SP ( 24-bit ) arithmetic:
a1= 1.0000001
b1=-1.0000002
r1= 0.000000000000014210855
Verification 2.1 of IEEE-754 Standard for SP ( 24-bit ) arithmetic:
a2= 1.0000001
b2=-1.0000002
r2= 0.000000000000014210855
...
Note: Floating Point Model option: Default
[ Borland C++ compiler v5.5.1 32-bit - Release ]
...
Verification 1.1 of IEEE-754 Standard for SP ( 24-bit ) arithmetic:
a1= 1.0000001
b1=-1.0000002
r1= 0.000000000000014210855
Verification 2.1 of IEEE-754 Standard for SP ( 24-bit ) arithmetic:
a2= 1.0000001
b2=-1.0000002
r2= 0.000000000000014210855
...
Note: Floating Point Model option: Default
[ Intel C++ compiler v12.1.7 ( u371 ) 32-bit - Debug ]
...
Verification 1.1 of IEEE-754 Standard for SP ( 24-bit ) arithmetic:
a1= 1.0000001
b1=-1.0000002
r1= 0.000000000000014210855
Verification 2.1 of IEEE-754 Standard for SP ( 24-bit ) arithmetic:
a2= 1.0000001
b2=-1.0000002
r2= 0.000000000000014210855
...
Note: Floating Point Model option: /fp:precise