I'm developing a financial application and want to use IEEE 754-2008 types to make sure that calculations are correct and free of binary calculation problems. Unfortunately, I could not find much documentation or samples on how to use BFP numbers. I also found a header for DFP (dfp754.h), but this does not seem to work with C++ and there is no documentation at all ;(
How can I get more information on how to use these IEEE 754 libraries in my C++ application?
Daniel
Regarding IEEE 754 specs and FP-precision related APIs, I would definitely recommend looking at:
- en.wikipedia.org/wiki/Single_precision
- www.binaryconvert.com and www.binaryconvert.com/convert_float.html
- MSDN - CRT functions that control precision of FPU: _control87, _controlfp, _control87_2
FPU - Floating-Point Unit
- and, of course, "float.h" header file
Best regards,
Sergey
Thanks for the links, Sergey. I already know most of them. My point here is NOT to use float or double, since they cannot calculate 3.05 + 0.05 without problems. If I use IEEE 754-2008 decimal-coded values, the computer should make this calculation without problems! So I'm looking for a good and useful implementation of this spec. Intel provides one, but the documentation is not very rich. They have something like _Decimal32 (64/128) (my guess is that the _Decimal32 types are NOT available for C++, only for pure C apps) and also bid64, but again, the documentation does not really show how to use them.
>>...3.05 + 0.05 without problems...
Could you provide more details regarding the problems with your 3.05 + 0.05 test case? What was wrong?
For the float data type (single precision) and a 24-bit precision setup in the FPU, a loss of accuracy is expected if an integer value is greater than 2^24 = 16777216, because the mantissa holds only 24 bits. Here is an example:
16968000(Base10) => 0 10010111 00000010111010010100000(Base2\IEEE754)
16968001(Base10) => 0 10010111 00000010111010010100000(Base2\IEEE754)
16968002(Base10) => 0 10010111 00000010111010010100001(Base2\IEEE754)
16968003(Base10) => 0 10010111 00000010111010010100010(Base2\IEEE754)
16968004(Base10) => 0 10010111 00000010111010010100010(Base2\IEEE754)
16968005(Base10) => 0 10010111 00000010111010010100010(Base2\IEEE754)
16968006(Base10) => 0 10010111 00000010111010010100011(Base2\IEEE754)
16968007(Base10) => 0 10010111 00000010111010010100100(Base2\IEEE754)
16968008(Base10) => 0 10010111 00000010111010010100100(Base2\IEEE754)
16968009(Base10) => 0 10010111 00000010111010010100100(Base2\IEEE754)
16968010(Base10) => 0 10010111 00000010111010010100101(Base2\IEEE754)
Can you see that up to three different numbers have the same binary representation in IEEE 754 format? If I need better precision, I use the double or long double data types.
I understand that you want to use an external library to do all FP-based calculations. Would you be able to upload docs, headers and libs for what you have?
>>...Decimal32 types are NOT available for C++!..
Why?..
>>...Decimal32 types are NOT available for C++!..
>> Why?..
Decimal Floating Point is specified by ISO/IEC TR 24732, which was a technical report from the C standards committee. The C++ standards committee has not issued a technical report on Decimal Floating Point, so there is no description of how it should be implemented in C++.
double l_dbfirst = 3.05;
double l_dbSecond = 0.05;
double l_dbSum = l_dbfirst + l_dbSecond;
BOOL l_fIsCorrect = l_dbfirst + l_dbSecond == 3.1;
Do you expect l_fIsCorrect to be TRUE (not equal to zero)? Or l_dbSum to be 3.1? You can expect l_fIsCorrect to be 0, and l_dbSum will be something like 3.099999999999...etc. That is my reason for using IEEE 754 decimal numbers; it is not about the number of decimal places. Even using long double here will not make any difference in the result.
I think using the IEEE 754 decimal types will solve this problem. On the Intel website I read that there is some support for this data type. Please see this:

This was my reason to try the compiler, but there is not much documentation for IEEE 754 Binary and Decimal FP ;(
Thank you for your help anyway.
Daniel
But this is a common problem whenever there is a question like: Can I trust the data?
The concept of Epsilon partially resolves it. Look, here are some consolidated results of my investigation into how different libraries and compilers declare an Epsilon:
...
Epsilon for Floats - smallest such that 1.0+FLT_EPSILON != 1.0
Epsilon for Doubles - smallest such that 1.0+DBL_EPSILON != 1.0
Epsilon for Long Doubles - smallest such that 1.0+LDBL_EPSILON != 1.0
// Intel IPL - Doesn't specify DBL, FLT or LDBL
#define IPL_EPS 1.0E-12
// Intel IPP - Nothing for LDBL
#define IPP_EPS_32F 1.192092890e-07f
#define IPP_EPS_64F 2.2204460492503131e-016
// STL - Uses default DBL_EPSILON, FLT_EPSILON and LDBL_EPSILON values defined by a C/C++ compiler
// OpenGL - Nothing!
// NVIDIA SDK - No fractions and nothing for DBL and LDBL
#define GLH_REAL float
#define GLH_EPSILON GLH_REAL(10e-6)
// Microsoft C++ compiler - Desktop
#define DBL_EPSILON 2.2204460492503131e-016
#define FLT_EPSILON 1.192092896e-07F
#define LDBL_EPSILON DBL_EPSILON
// Microsoft C++ compiler - Mobile
#define DBL_EPSILON 2.2204460492503131e-016
#define FLT_EPSILON 1.192092896e-07F
#define LDBL_EPSILON DBL_EPSILON
// Borland C++ v5.x.x compiler
#define DBL_EPSILON 2.2204460492503131E-16
#define FLT_EPSILON 1.19209290E-07F
#define LDBL_EPSILON 1.084202172485504434e-019L
// Turbo C++ v3.x.x compiler
#define DBL_EPSILON 2.2204460492503131E-16
#define FLT_EPSILON 1.19209290E-07F
#define LDBL_EPSILON 1.084202172485504E-19
// Turbo C++ v1.x.x compiler
#define DBL_EPSILON 2.2204460492503131E-16
#define FLT_EPSILON 1.19209290E-07F
#define LDBL_EPSILON 1.084202172485504E-19
// MinGW v3.4.x compiler - Uses magic __DBL_EPSILON__, __FLT_EPSILON__ and __LDBL_EPSILON__
Could be verified with a simple piece of code:
...
printf( "%.48f\n", ( float )__FLT_EPSILON__ );
printf( "%.48f\n", ( double )__DBL_EPSILON__ );
printf( "%.48f\n", ( long double )__LDBL_EPSILON__ );
...
Output is:
0.000000119209289550781250000000000000000000000000 - Close to Microsoft's values
0.000000000000000222044604925031310000000000000000 - Exact match with everybody
0.000000000000000000000000000000000000000000000000 - Oops!
For scientific and financial applications where high performance is an overriding requirement, I am afraid there will never be a way around the fact that floating-point arithmetic is imprecise and non-associative. One has to absorb this fact and learn to live with it... the same way as a student of quantum physics needs to spend at least a semester absorbing the fact that various physical quantities and objects are not infinitely divisible.
For safe conversions from double to int (or to dollars and cents), I suggest you
#define ONEPLUS (1.0+10.0*DBL_EPSILON)
Then,
float a,b;
...
int n = a*b*ONEPLUS;
Daniel
