
I'm developing a financial application and want to use IEEE 754-2008 types to make sure that calculations are correct and free of binary rounding problems. Unfortunately I could not find much documentation or samples on how to use BFP numbers. I also found a header for DFP (dfp754.h), but this does not seem to work with C++ and there is no documentation at all ;(

How can I get more information on how to use these IEEE 754 libraries in my C++ application?

Daniel



Regarding IEEE 754 specs and FP-precision related APIs, I would definitely recommend looking at:

- en.wikipedia.org/wiki/Single_precision

- www.binaryconvert.com and www.binaryconvert.com/convert_float.html

- MSDN - CRT functions that control precision of FPU: _control87, _controlfp, _control87_2

FPU - Floating-Point Unit

- and, of course, "float.h" header file

Best regards,

Sergey


Thanks for the links, Sergey. Most of them I already know. My point here is NOT to use float or double, since they cannot calculate 3.05 + 0.05 without problems. If I use IEEE 754 coded values the computer should make this calculation without problems! So I'm looking for a good and useful implementation of these specs. Intel provides this, but the documentation is not very rich. They have something like _Decimal32 (64/128) (my guess is that Decimal32 types are NOT available for C++, only for pure C apps) and also bid64, but again, the documentation does not really show how to use them.


**>>**...3.05 + 0.05 without problems...

Could you provide more details regarding problems with your 3.05 + 0.05 test case? What was wrong?

With the **float** data type (single precision) and a **24-bit precision** setup in the FPU, a loss of accuracy is expected once an integer value exceeds 2^24 = 16777216, because it no longer fits in the 24-bit significand. Here is an example:

16968000(Base10) => 0 10010111 00000010111010010100000(Base2\IEEE754)

16968001(Base10) => 0 10010111 00000010111010010100000(Base2\IEEE754)

16968002(Base10) => 0 10010111 00000010111010010100001(Base2\IEEE754)

**16968003**(Base10) => **0 10010111 00000010111010010100010**(Base2\IEEE754)

**16968004**(Base10) => **0 10010111 00000010111010010100010**(Base2\IEEE754)

**16968005**(Base10) => **0 10010111 00000010111010010100010**(Base2\IEEE754)

16968006(Base10) => 0 10010111 00000010111010010100011(Base2\IEEE754)

16968007(Base10) => 0 10010111 00000010111010010100100(Base2\IEEE754)

16968008(Base10) => 0 10010111 00000010111010010100100(Base2\IEEE754)

16968009(Base10) => 0 10010111 00000010111010010100100(Base2\IEEE754)

16968010(Base10) => 0 10010111 00000010111010010100101(Base2\IEEE754)

Can you see that three different numbers have the same binary representation in IEEE 754 format? If I need better precision I use the double or long double data types.
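The rounding in the table above can be reproduced with a short C++ sketch. It assumes that `float` is IEEE 754 binary32, as it is on mainstream x86 and ARM compilers; the helper names `float_bits` and `to_float` are made up for illustration.

```cpp
#include <bitset>
#include <cstdint>
#include <cstring>
#include <string>

// Returns the 32 raw bits of a float as "sign exponent fraction".
// Assumption: float is IEEE 754 binary32.
std::string float_bits(float value) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "binary32 expected");
    std::uint32_t raw = 0;
    std::memcpy(&raw, &value, sizeof raw);  // well-defined way to read the bits
    const std::string bits = std::bitset<32>(raw).to_string();
    return bits.substr(0, 1) + " " + bits.substr(1, 8) + " " + bits.substr(9);
}

// Converts an integer to float; values above 2^24 may be rounded
// to the nearest representable neighbour.
float to_float(std::int32_t n) {
    return static_cast<float>(n);
}
```

For example, `float_bits(to_float(16968003))` and `float_bits(to_float(16968005))` return the same string, because both integers round to 16968004 when stored in a float.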

I understand that you want to use an external library to do all FP-based calculations. Would you be able to upload docs, headers and libs for what you have?

**>>**...Decimal32 types are NOT available for C++!..

Why?..


**>>**...Decimal32 types are NOT available for C++!..

>> Why?..

Decimal Floating Point is specified by ISO/IEC TR 24732, which was a technical report from the C standards committee. The C++ standards committee has not issued a technical report on Decimal Floating Point, so there is no description of how it should be implemented in C++.


double l_dbfirst = 3.05;

double l_dbSecond = 0.05;

double l_dbSum = l_dbfirst + l_dbSecond;

BOOL l_fIsCorrect = l_dbfirst + l_dbSecond == 3.1;

Do you expect l_fIsCorrect to be TRUE (not equal to zero)? Or l_dbSum to be 3.1? You can expect l_fIsCorrect to be 0, and l_dbSum will be something like 3.099999999999... That is my point in using IEEE 754 numbers, not the count of decimal places. Even using long double here will not make any difference in the result.

I think that using IEEE 754 will solve this problem. On the Intel website I read that there is some support for this data type. Please see this:

This was my reason to try the compiler, but there is not much documentation for IEEE 754 Binary and Decimal FP ;(

Thank you for your help anyway.

Daniel
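For reference, the comparison from the snippet in this post can be wrapped in a function and set against one common workaround: keeping amounts as an integer number of cents. This is only a sketch. The `Cents` type and the `add` helper are hypothetical names, and this is plain integer arithmetic, not the IEEE 754-2008 decimal arithmetic asked about here.

```cpp
#include <cstdint>

// Binary doubles: 3.05, 0.05 and 3.1 are all stored as nearby binary
// fractions, so the rounded sum need not equal the rounded literal.
bool double_sum_is_exact(double first, double second, double expected) {
    return first + second == expected;
}

// Hypothetical alternative: store money as a whole number of cents,
// so that addition is exact integer arithmetic.
struct Cents {
    std::int64_t value;
};

Cents add(Cents a, Cents b) {
    return Cents{a.value + b.value};
}

bool operator==(Cents a, Cents b) {
    return a.value == b.value;
}
```

With IEEE 754 doubles, `double_sum_is_exact(3.05, 0.05, 3.1)` is false, while `add(Cents{305}, Cents{5}) == Cents{310}` holds.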


But this is a common problem when there is a question like: Could I trust the data?

A concept of Epsilon partially resolves it. Look, here are some consolidated results of my investigation into how different libraries and compilers declare an Epsilon:

...

Epsilon for Floats - smallest such that 1.0+FLT_EPSILON != 1.0

Epsilon for Doubles - smallest such that 1.0+DBL_EPSILON != 1.0

Epsilon for Long Doubles - smallest such that 1.0+LDBL_EPSILON != 1.0
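One way to put such an epsilon to work is a relative comparison. A minimal sketch, assuming IEEE 754 doubles; `nearly_equal` and its `ulps` parameter are hypothetical names, and the default factor of 4 is an arbitrary choice for illustration:

```cpp
#include <algorithm>
#include <cfloat>
#include <cmath>

// Treats a and b as equal when they differ by at most `ulps` times
// DBL_EPSILON, scaled to the larger of the two magnitudes.
bool nearly_equal(double a, double b, double ulps = 4.0) {
    const double scale = std::max(std::fabs(a), std::fabs(b));
    return std::fabs(a - b) <= ulps * DBL_EPSILON * scale;
}
```

With this helper, `nearly_equal(3.05 + 0.05, 3.1)` is true even though `==` is not. A purely relative test like this breaks down when both values are very close to zero; an additional absolute tolerance is usually needed there.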

**// Intel IPL** - Doesn't specify DBL, FLT or LDBL

#define IPL_EPS 1.0E-12

**// Intel IPP** - Nothing for LDBL

#define IPP_EPS_32F 1.192092890e-07f

#define IPP_EPS_64F 2.2204460492503131e-016

**// STL** - Uses default DBL_EPSILON, FLT_EPSILON and LDBL_EPSILON values defined by a C/C++ compiler

**// OpenGL** - Nothing!

**// NVIDIA SDK**

#define GLH_REAL float - No fractions and nothing for DBL and LDBL

#define GLH_EPSILON GLH_REAL(10e-6)

**// Microsoft C++ compiler - Desktop**

#define DBL_EPSILON 2.2204460492503131e-016

#define FLT_EPSILON 1.192092896e-07F

#define LDBL_EPSILON DBL_EPSILON

**// Microsoft C++ compiler - Mobile**

#define DBL_EPSILON 2.2204460492503131e-016

#define FLT_EPSILON 1.192092896e-07F

#define LDBL_EPSILON DBL_EPSILON

**// Borland C++ v5.x.x compiler**

#define DBL_EPSILON 2.2204460492503131E-16

#define FLT_EPSILON 1.19209290E-07F

#define LDBL_EPSILON 1.084202172485504434e-019L

**// Turbo C++ v3.x.x compiler**

#define DBL_EPSILON 2.2204460492503131E-16

#define FLT_EPSILON 1.19209290E-07F

#define LDBL_EPSILON 1.084202172485504E-19

**// Turbo C++ v1.x.x compiler**

#define DBL_EPSILON 2.2204460492503131E-16

#define FLT_EPSILON 1.19209290E-07F

#define LDBL_EPSILON 1.084202172485504E-19

**// MinGW v3.4.x compiler** - Uses magic __DBL_EPSILON__, __FLT_EPSILON__ and __LDBL_EPSILON__

Could be verified with a simple piece of code:

...

printf( "%.48f\n", ( float )__FLT_EPSILON__ );

printf( "%.48f\n", ( double )__DBL_EPSILON__ );

printf( "%.48f\n", ( long double )__LDBL_EPSILON__ );

...

Output is:

0.000000119209289550781250000000000000000000000000 - Close to Microsoft's values

0.000000000000000222044604925031310000000000000000 - Exact match with everybody

0.000000000000000000000000000000000000000000000000 - Oops!
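The zero on the last line is most likely an artifact of the format string rather than of `__LDBL_EPSILON__` itself: `%f` expects a `double`, so a `long double` argument needs the `L` length modifier (`%Lf`, `%Lg`). A hedged sketch that uses `std::numeric_limits` instead of the compiler-specific macros; `format_long_double` and `epsilon_changes_one` are hypothetical helper names:

```cpp
#include <cstdio>
#include <limits>
#include <string>

// Formats a long double using the matching "L" length modifier.
std::string format_long_double(long double value) {
    char buffer[64];
    std::snprintf(buffer, sizeof buffer, "%.21Lg", value);
    return buffer;
}

// Checks the defining property of epsilon for type T: 1 + epsilon != 1.
// The volatile store forces the sum to be rounded to type T.
template <typename T>
bool epsilon_changes_one() {
    const T one = 1;
    volatile T sum = one + std::numeric_limits<T>::epsilon();
    return sum != one;
}
```

Calling `format_long_double(std::numeric_limits<long double>::epsilon())` should then give a non-zero value. On MinGW the Microsoft C runtime may still misprint GCC's 80-bit `long double` even with `%Lg`; defining `__USE_MINGW_ANSI_STDIO` before including `<cstdio>` is the usual remedy, but check this against your toolchain.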


For scientific and financial applications where high performance is an overriding requirement, I am afraid there will never be a way around the fact that floating-point arithmetic is imprecise and non-associative. One has to absorb this fact and learn to live with it... the same way as a student of quantum physics needs to spend at least a semester absorbing the fact that various physical quantities and objects are not infinitely divisible.

For safe conversions from double to int (or to dollars and cents), I suggest you

#define ONEPLUS (1.0+10.0*DBL_EPSILON)

Then,

float a,b;

...

int n = a*b*ONEPLUS;
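A small sketch of how such a factor is meant to be used. It takes `double` operands rather than the `float` ones above, and the helper names are hypothetical. The macro is parenthesized here so that `a*b*ONEPLUS` multiplies by the whole factor instead of expanding to `a*b*1.0 + 10.0*DBL_EPSILON`.

```cpp
#include <cfloat>

// Parenthesized so the macro acts as a single factor inside expressions.
#define ONEPLUS (1.0 + 10.0 * DBL_EPSILON)

// Plain truncation: a product that lands just below a whole number
// loses a full unit when converted to int.
int truncate_to_int(double a, double b) {
    return static_cast<int>(a * b);
}

// Nudges the product up by a few units in the last place before truncating.
int truncate_with_oneplus(double a, double b) {
    return static_cast<int>(a * b * ONEPLUS);
}
```

With IEEE 754 doubles, `truncate_to_int(0.29, 100.0)` yields 28 because the product lands just below 29, while `truncate_with_oneplus(0.29, 100.0)` yields 29.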


Daniel
