By a floating point integer I mean a normalised floating point number whose absolute value is less than (in the case of double) 2^53 and for which
double int_part, fractional_part; fractional_part = std::modf(z, &int_part);
will always yield zero for fractional_part and equality between z and int_part (either condition implies the other). One makes the obvious changes (53 becomes 24, etc.) for float, ...
My basic question is this: if I do arithmetic with this subclass of floating point numbers (-, +, *, FMA), can I be completely assured that, provided no intermediate result is too large in absolute value, the answers will always be predictable and identical, and will correspond to the answers of the usual commutative and associative integer arithmetic, regardless of any compiler optimisation switches or the use of scalar versus vector operations?
I assume the answer is yes, but I cannot find a reference that guarantees the results in an optimised setting. Are there explicit statements as to when operations are, or are not, consistent across the different compiler options or machines? Please note I am asking whether the binary (or, for FMA, ternary) hardware operations are consistent, and whether, as a result, the variations that arise from the lack of associativity and identity of these operations, which influence how the compiler reorders expressions, disappear. Of course I am not asking about division.
After answering this question it becomes interesting to consider the same question with 0 included as an integer (with sign chosen so that 0 - 0 == 0), which is +0 unless round-down is enabled (https://en.wikipedia.org/wiki/Signed_zero), in which case it is -0.
If you want to reproduce consistent results from one platform to another, or across different optimizations, then please use the -fp-model consistent option.
Maybe my question was too complicated. I want to produce arithmetically correct answers for integer arithmetic carried out in floating point numbers. Is 7. == 3. * 2. + 1. true in all compiler contexts, given that there is no rounding to be done? Is it true that hardware multiplications never round when a result is already exact and no rounding is required?
Following up your suggestion, I looked at the link below.
It gives examples such as
(float)((((long double)4.0f + (long double)0.1f) + (long double)t1) + (long double)t2);
If the compiler only ever uses the hardware instructions and rearranges expressions in ways that remain valid expressions, like the one above, then what I hope for is true.
0.1f and 0.1 cannot be exactly represented in floating point: expressed in binary floating point format, they would require an infinite number of bits to reach the exact value. Promoting 0.1 (or 4.1) to (long double) on architectures supporting floating point values wider than 64 bits will use a better approximation of 0.1.
Notes: long double when targeting IA32 may use the x87 10-byte (as stored) extended precision format (wider than the 8-byte double), whereas an Intel64 build may use the 8-byte double, or it may use something larger. For MS VC++, (long double) == (double); for GNU C on the x87 instruction set, long double is the 80-bit format (typically stored in 96 bits for alignment). There is also double-double for a 128-bit floating point format, which is currently supported only through emulation.
You specified "arithmetically correct answers for integer arithmetic in floating point numbers" yet the example you chose from the article includes values with fractions.
When the multiplication of integers (or addition/subtraction) produces a result whose binary representation (binary exponent together with binary mantissa) can be exactly represented in the chosen format (float, double, ...), then no error is produced. Note, however, that division of integers can, and often will, produce fractional values.
If you can rearrange expressions such as
A = 9 * (7/3)
into
A = (9/3) * 7
and use compiler options that respect your parentheses, then you will get an answer that is produced without fractional components.
Thanks for your comments. In answer to your question: the manipulations I need require me to store the (billions of) integers in a way that instantly tells me their binary exponent, as if they were stored in IEEE normal floating point form. I could do this
a) by storing the exponents separately,
b) by using the incredibly nonportable clz-type operations to extract the number of leading zeros,
c) by leveraging the heavily optimised floating point algorithms that do this automatically for me.
This last approach seems far easier, and I have code that does this well. I only get 53 bits instead of 64, etc., but the gains in efficiency and portability are well worth it to me. This is particularly true because I am mainly performing operations like ax+b, and as we know, FMA is very well optimised for floating point and vectorisation!
I am sorry if I confused you with the link. I do indeed seek documentation about "arithmetically correct answers for integer arithmetic in floating point numbers", and yet I pointed to an example which includes values with fractions. My reason was that all I need to know is that any optimisation of + and * floating point operations will always be equivalent to some expression built out of basic IEEE operations (possibly rounding to a different accuracy, as you allude to above), because in that case I can prove that the integers will be correctly computed regardless of the order the optimising compiler chooses. The documentation I pointed to seems to suggest that this is the case.
You say: "When the multiplication of integers (or addition/subtraction) produces a result where the binary result (binary exponent together with binary mantissa) can be exactly represented in the chosen format (float, double, ...) then no error is produced". What I need is the documentation for this fact! I would like to know that what I do, which works in practice, is not undefined behaviour that merely happens to work on every machine I have ever tried. You will see in
that similar questions have been asked before, and people are confident "you will get exact integer arithmetic, provided that you never have intermediate results that fall outside that range. I've encountered software that took advantage of this".
For IEEE, this is clear; the issue is whether there is any optimization that interferes with that truth. I suspect not, because I suspect the default hardware respects IEEE, perhaps with modified precision for the different scalar and vector options, and optimization uses this hardware.
Thanks for interest.
You cannot exactly multiply a 52-bit-span integer contained in a double by another 52-bit-span integer contained in a double if the resultant product requires more than 52 spanned bits, as this would require up to 104 spanned bits (or thereabouts). The result can be exact only if the product itself spans no more than the available bits, for example when one factor is a power of 2.
In the examples at the link you used, the writer mistakenly chose a 2-bit-span binary value (3) and multiplied it by a power of 2. In other words, with a slightly lesser power of 2 (say 2^100) involved, the product of that number with, say, 255 would require a 10-bit span in the mantissa. I.e. the chosen numbers will give you false assurances.
(Excepting denormal numbers) the IEEE 754 number format has the mantissa (the integer, in your case) left-shifted until the left-most (msb) 1 bit falls out of the mantissa field; this becomes an implied 1. This bit, plus all bits to its right (towards the lsb of the left-shifted number) down to the last/least 1 bit shall, for the purpose of this posting, be defined as the spanned bits.

Any product of two spanned-bit numbers can require up to the sum of the two spans (inclusive of implied bits), less one bit for the implied bit in the result. Note that the sign bit has nothing to do with the product of the mantissa bits in this example, provided the product neither requires more than 52 bits nor an exponent exceeding that of the selected precision. If you can assure that no product or sum will exceed 52 bits of mantissa, and that division is not used, then you should not experience issues with CPUs fully supporting IEEE 754 floating point formats.
I am not affiliated with any CPU manufacturer, so it would behoove you to run an extreme-limits test with numbers that push, but do not exceed, the expected bit support.
Note 2) You may also need to explicitly use compiler options that stop the compiler from rearranging expressions you have deliberately constructed to avoid overflowing the bits available in the mantissa (e.g. tell the compiler to adhere to your use of parentheses).