in my code I sum up the array values i and i-1 in a manner that in the end the result should be 0. Like the following
integer,parameter :: ie=1000000 real,dimension(0:ie) :: A real :: sum cccc set values for A A(0)=A(ie) sum=0. do i=1,ie sum=sum + A(i)-A(i-1) enddo
I use the intel fortran compiler ifort and openmpi compiled with the intel compilers (https://software.intel.com/en-us/articles/performance-tools-for-software-developers-building-open-mpi-with-the-intel-compilers).
If I write the result there are cases with a sum of order of magnitude 1E-1 if I take the same code and compile it with the GNU gfortran and openmpi compiled with GNU compilers I get 1E-5. Is there a way to increase this accuracy without changing to double precision?
- Parallel Computing
In the above code, the array A is uninitialized prior to line 7.
Line 7 sets A(0) to the uninitialized value of A(ie)
Do you have unreasonable expectations regarding uninitialized data?
If your question is about how to suppress simd optimization of sum reduction, 'ifort -fp-model source -assume protect-parens ' would do it. Usually, the optimized result will be more accurate. As you suggest, double precision sum reduction would be a way to check this. 'gfortran -O3 -ffast-math' is closer to ifort default setting, as it implies full simd optimization (but also sets complex limited range). I'm guessing you may not have studied which compilers flags are suitable.
Many compiler optimizations change results (at the level of the round-off error) by re-ordering operations, but these typically don't have a systematic impact on accuracy. (The optimizations often get in the way of attempts to manually organize the order of operations for very carefully analyzed systems, but for "normal" code the results of re-ordering are typically just "different", not "worse".)
On the other hand, for sum reduction, the "optimizations" used to improve vectorization and pipelining are almost always beneficial to accuracy. By increasing the number of partial sums, each partial sum tends to be smaller, so the difference in magnitude between the partial sum and the new values being added are reduced, which reduces the loss of precision in the addition operations. The final reduction of the partial sums to a single sum is typically done in a binary tree, which also has the property of combining terms of similar size and minimizing loss of accuracy.
However, if the data values depend on location in a way that is strongly correlated with the way that the different locations are assigned to different partial sums, then the result could be an unexpected loss in accuracy. My guess is that this is what is happening here. If you were to run a series of experiments using random permutations of the values in the array A, I suspect that the overall accuracy would be better with the "optimized" (vectorized + multiple vector partial sums) code.
For non-integer rational numbers, when the magnitude of the numbers differ near to or more than the precision of the number then this aggravates the round off error. Going from REAL(4) to REAL(8) may help, but for certain data it may be of little benefit.
Though it may be unworkable to do so, but the absolute best you can do is to sort the array by absolute value of the entries, then partition the array by relative magnitude of the absolute values. You would then perform partial sums on each partition from the lowest index (lowest absolute value of the partition) to highest index. Then sort the partial sums by the absolute value of partial sum, then sum those by order.
Of course the above is impractical excepting under dire circumstances.
An alternative is to select a well chosen fence value that is dependent upon the distribution of the numbers.
sum_LE=0. sum_GT=0. !dir$ simd reduction(+:sum_LE,sum_GT) do i=1,ie diff = A(i)-A(i-1) if(abs(diff) .LE. YourFenceValue) then sum_LE = sum_LE + diff else sum_GT = sum_GT + diff endif enddo sum = sum_LE + sum_GT
Note, the newer vector instructions have masked operations. Using these, the compiler can generate a vector version of the above loop without branches for the enclosed if. (assuming your CPU also supports unaligned loads). It is your responsibility to determine the appropriate YourFenceValue.
For IEEE floating-point values, it is possible to get "exact" results for accumulation and for "dot products" using a couple of different techniques.
One that I like (because it is simple enough to understand!) is the "exact accumulator" developed by Kulisch, which I discuss at https://sites.utexas.edu/jdm4372/2012/02/15/is-ordered-summation-a-hard-problem-to-speed-up/
Kulisch's exact accumulator targets the summation of products of 64-bit IEEE values. The same approach can be used to develop an an exact algorithm for sums of 32-bit IEEE values using a much smaller accumulator. The range of maximum 32-bit FP value to minimum denormal 32-bit value is about 276 bits, so a 320 bit accumulator should suffice to guarantee exact results for the accumulation of up to ~2^44 32-bit floating-point values. The implementation is left as an exercise for the reader....
Interesting article. I will admit I did not read the Patent, nor the paywall article. I do have a question regarding your post.
In there you state for double precision (IEEE 754 1985) that the exact accumulator would require 4288 bits. I would like to know how that number is derived.
The double precision mantissa is 52 bits, add 1 for the implicit 1 bit, making it 53 bits, plus sign making 54 bits. The exponent is 11 bits (signed).
Left shift max is 1023 right shift is 1024 bits thus requiring 1023+1024+54 bits or 2101 bits. This is for exact summation of doubles who's result does not overflow/underflow that representable by IEEE 754.
If we consider FMA, then the partial product requirements are 53+53+1 bits, but the summation (without overflow in eventual result) would still require 2101 bits.
A potential requirement for additional bits, could potentially be to provide for the partial FMA to exceed the eventual complete FMA with the intention (requirement) that the final summation fit within 2101 bits or potentially 2101+53 bits (2154 bits). So my question remains: why does this require ~2x the number of bits?
The smaller (2154 bits) can fit within 5 zmm registers. Exact summation might be something to consider for future instructions (probably after quad/real(16) is implemented).
The full range of IEEE 64-bit floating point numbers is roughly 2^1023 to 2^(-1027) (for denorms), so the product of 2 of these is approximately 2^2046 to 2^(-2054). This range is therefore about 4100 bits. A more detailed analysis gives 2*emax + 2*d + 2*abs(emin), where "emax" is the maximum binary exponent (1023), "d" is the number of binary digits in the mantissa (53), and "emin" is the minimum binary exponent (-1027), which works out to 4206 bits.
The exact accumulator needs to be able to accumulate maxval*maxval many times without overflowing. How many times? This is somewhat arbitrary. Taking into account the need for a few status bits (exact/inexact, infinities, NaN's, etc), Kulisch decided that using 66 64-bit words (4224 bits) would make it possible to overflow using a plausible number of maxval*maxval entries. Increasing this to 67 64-bit words (4288 bits) ensures that the accumulator will not overflow unless you accumulate >2^80 maxval*maxval pairs, and that is unlikely to be possible in the lifetime of a computer system. (If I computed it correctly, it would take 14 years using a system capable of 1 ExaFLOPs to overflow the accumulator.)
I guess it is future-proofing. Keep in mind that the end result cannot be converted back to double (even with round off). It can be converted back to IEEE 754 quadruple (15 bits exponent), for almost all of the numbers, assuming you are willing to accept round off after 112 bits.
With any case (convert to real, double, quadruple, as long as the resultant exponent did not overflow, then the resultant number will be the same regardless of the accumulation order. (Provided the 4224 bits did not overflow).
The _mm512_adc_epi32 (and like), as well as _mm_clmulepi64_si128 would be quite handy in implementing it.
One of the main points of the exact accumulator is to ensure the same result independent of accumulation order. This also means that the result is independent of the number of partial sums, which is very useful in parallel systems (so you get the same answer no matter how many processes are used). (Note that this means that the final summation of the partial sums needs to be done using the full 4288-bit exact accumulator format, but as I noted in my blog entry this is pretty fast on modern processors -- certainly much faster than the network latencies required to bring the exact accumulators together.)
There is no problem converting the result back to 64-bit IEEE float. If the exponent is too large, then it will convert back to an "infinity" value, but one of the many nice features of this approach is that the accumulator can have intermediate results of any size without losing precision, and the only time you care is at the final conversion
A related paper discussing an FPGA implementation of an exact sum and an exact multiply-accumulate (and that should be freely available) is at: