Hi, I want to understand the IEEE 754 format (I chose the addition operation as an example), but I have a problem: the standard does not describe precisely how the status bits are formed. For example, I found an algorithm that extends the mantissa of the result by three bits on the right in order to detect an inexact result (rounding mode: to nearest). I decided to simulate the algorithm and compare its results against my Intel i7 processor (control register "cwr"), and I get different results. Is there a standard (or algorithm) in the public domain that describes how Intel generates the "inexact" bit?
P.S.: I used the article "What Every Computer Scientist Should Know About Floating-Point Arithmetic" by David Goldberg.
In my code, all of the exceptions occur. I understand how to form every exception except "inexact". I would like to understand it better, since the phrase "inexact value" is too general. Ideally, given only the two input arguments, I would be able to determine whether to set this bit.
Here is a description of the Inexact Exception: http://docs.oracle.com/cd/E19422-01/819-3693/ncg_handle.html
Put simply, the rounded, approximated result differs from the infinitely precise result. Think about the approximation of a value like 0.3.
That is exactly the kind of general phrase I was talking about :) I was hoping to find an algorithm that precisely describes how this bit is formed.
For example, I add the two numbers 0xe39d413c6f4d7d9f and 0xe39ff6e30bcff322. Based on the material from the article I mentioned, I believe the "inexact" bit should not be set. My CPU "thinks" differently, and I wanted to understand why.
I think that you are trying to add the smallest double-precision numbers, so the CPU probably signals the inexact exception because rounding is performed in order to fit the result into the destination, although I am not sure if this is the case.
I am not an expert on these issues, but I believe that the "inexact" status is raised whenever rounding causes any bits to be dropped.
Using the online conversion tool at http://babbage.cs.qc.cuny.edu/IEEE-754.old/64bit.html, I see that for the two values above, the exponents are the same and the fractional parts (with the implicit leading bit included) are:
0xe39d413c6f4d7d9f --> 11101010000010011110001101111010011010111110110011111
0xe39ff6e30bcff322 --> 11111111101101110001100001011110011111111001100100010
The sum of the fractional parts is --> 111101001110000001111101111011000111010111000011000001
Adding the fractional parts produces a carry, which means that the lowest-order bit of the sum must be handled by rounding when the result is normalized. Since the value of that lowest-order bit is "1", rounding either up or down is clearly "inexact".
Presumably the inexact status would not be raised if all of the bits that need to be dropped in the normalization step are zero.
The algorithm for setting the inexact bit is not discussed in the IEEE-754 standard, but the definition is certainly clear -- a result is "inexact" if it differs from the result that would be obtained with an unbounded exponent field and an unbounded fraction field.
An algorithm that might work: if any non-zero bits are dropped while shifting the input values, or while normalizing the output value, then presume that the result does not match the infinite-precision result and raise the inexact exception.
This may not be sufficiently precise. It might be possible for the shift of the input value to drop bits in a way that exactly counteracts the effect of normalizing the output value, leading to a "false positive" under the algorithm above. I am sure that smart people have figured out a robust way of setting this bit that does not require actually having the infinite-precision result, but it is hard to get very excited about it -- the class of FP operations that *do not* produce inexact results is small enough that the ability to trap on the exception is rarely useful.
I wonder how a CPU can approximate the infinitely precise "exact" result?
I think that, in the case of double-precision FP, a 56-bit fractional part can represent the exact result to some degree, and when a loss of significand digits is recorded during the calculation, the inexact exception can be raised.
Not surprisingly, people have figured out how to avoid the possible "false positive" case that I mentioned above.
A readable but reasonably thorough reference is available at http://www.cs.ucla.edu/digital_arithmetic/files/ch8.pdf
This reference shows that keeping three extra bits of precision is sufficient to guarantee that all IEEE 754 rounding modes can be performed correctly and that the inexact exception can be detected unambiguously. The trick is that the 3rd extra bit (called the "sticky bit") must be the logical OR of all of the additional bits of the intermediate computation.
- For add/subtract operations the number of "additional bits" depends on how many bits the smaller argument must be shifted to the right before the operands are properly aligned for the addition. If any of the bits being shifted "off the end" are non-zero, then the "sticky bit" will be set for use in the rounding and inexact flag setting steps.
- For multiplication operations the number of "additional bits" is equal to the number of bits of each operand -- i.e., multiplying two values with "m" bit fractions will produce a "2m" bit intermediate result, but (except for the Fused Multiply-Add operation) only three extra bits need to be kept (provided that the third one is the "sticky bit").
Using the notation of the reference above, the inexact exception is raised if G+R+T=1, where G and R are the two bits of the intermediate result immediately below the low-order bit of the final (normalized) result and T is the "sticky bit" (the logical "OR" of all additional bits of the intermediate result).