Hi all,
I have been running a program where the precision of doubles matters a great deal.
However, for some strange reason the Xeon Phi seems to round off a few bits (the values differ at about the 10^-8 place), and this seems to be causing instabilities in my model. The small round-off error grows over the time-step iterations and the model fails to converge.
Here are some sample differences:
Xeon Phi values:
4052003.87615914 -3800481.58877535
4743651.75216398 8579922.43342044 4677493.53261335
-1251355.38835838 -8704380.49549063 0.000000000000000E+000
Xeon values (the reference values):
4052003.87615915 -3800481.58877535
4743651.75216399 8579922.43342044 4677493.53261335
-1251355.38835838 -8704380.49549063 0.000000000000000E+000
This small difference grows over the iterations and has made the model unable to converge.
I have tested the code without offload and it works fine.
I am fairly sure this is what causes the error, given this message from the program:
POP Exiting...
POP_SolversChronGear: solver not converged
POP_SolverRun: error in ChronGear
POP_BarotropicDriver: error in solver
Step: error in barotropic
These models are very sensitive and require that level of precision, so how do I avoid this problem with offload?
Please help me force Xeon-level precision on the Xeon Phi.
Thanks in advance
You don't say whether your code is in C/C++ or Fortran; however, I think the relevant flag is the same for either. Please read "Using the -fp-model Option" (from the Fortran compiler manual), which should give you some hints.
Another possibility, of course, is that the code generates different results when run on different numbers of threads; if that's the case you're in for a lot more "fun".
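As a rough illustration (this little program and the compile lines are made up for this post, not taken from POP): under the default -fp-model fast=1 the compiler is free to reassociate and vectorize a sum like the one below, which can change the low-order bits; -fp-model precise (or source) restricts that reordering. As far as I understand the offload build, options such as -fp-model given on the ifort command line also apply to the coprocessor-side compilation of offload regions.

```fortran
! Hypothetical example, not the poster's code.
! Compile e.g.:  ifort -O2 -fp-model precise sumtest.f90
! vs. the default (-fp-model fast=1), where the reduction may be
! reassociated/vectorized and the last bits of the result can change.
program sumtest
  implicit none
  integer, parameter :: dp = selected_real_kind(15, 307)
  integer :: i
  real(dp) :: x(100000), s

  ! Terms of varying magnitude, so the order of partial sums matters.
  do i = 1, size(x)
     x(i) = 1.0_dp / real(i, dp)
  end do

  s = 0.0_dp
  do i = 1, size(x)
     s = s + x(i)
  end do

  print '(a,es25.17)', 'sum = ', s
end program sumtest
```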
Mine is Fortran code.
Right now I have not spawned any threads, but I am planning to later. So I guess I am in no mood for any "fun".
Will this hold good for offload code too?
Right now I have been offloading with strict, and I have been using the default O2 optimization.
The Intel compiler targeting Xeon Phi is likely to be more aggressive with optimizations that diverge slightly from IEEE accuracy, because they are more critical for performance, and because the Phi (KNC) doesn't have firmware support for IEEE divide and sqrt (I don't know whether that may change in the future). Where I would set -prec-div -prec-sqrt when building for the host, I would omit those options for the coprocessor, as they can easily double run time. On the host, those -prec- options would affect only vectorized loops.
It remains important to set options that treat parentheses in accordance with the language standard (ifort -assume protect_parens). icc has lately improved on its past wanton disregard of parentheses, but with icc you need -fp-model options, which also prevent the use of vector reductions, to get parentheses observed.
As you don't have control over the number of partial sums a vectorized reduction uses, or even the order in which they are combined at the end, it may be possible for a vector reduction to produce a slightly different answer on the Phi than on the host, although I haven't seen that myself. A vectorized reduction usually gives a more accurate result than a serial one. In the case of a threaded parallel reduction, results aren't necessarily reproduced exactly between runs with the same number of threads, let alone with differing numbers of threads. Setting 64-byte alignments may help with this. It may be impractical to get identical results on host and coprocessor without ruining coprocessor performance.
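To make the parentheses point concrete, here is a contrived example (not from the original code): as written, the bracketed addition loses the small term, but a compiler that ignores the parentheses and reassociates can return the small term instead, so two builds can disagree in exactly the sort of low-order way you are seeing. With ifort, -assume protect_parens keeps the written order.

```fortran
! Hypothetical illustration of why parentheses must be respected.
program parens
  implicit none
  integer, parameter :: dp = selected_real_kind(15, 307)
  real(dp) :: big, small, r

  big   = 1.0e16_dp
  small = 1.0_dp

  ! As written: (big + small) rounds back to big, so r = 0.
  ! If the compiler reassociates to (big - big) + small, r = 1 instead.
  r = (big + small) - big
  print '(a,f6.3)', 'r = ', r
end program parens
```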
If you want better accuracy and reproducibility for float/single sum reductions, you can promote them to double, if accuracy is more important there than performance.
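A minimal sketch of that promotion (the array and names are hypothetical): keep the accumulator in double precision and convert back to single only once the sum is complete.

```fortran
! Hypothetical sketch: single-precision data, double-precision accumulator.
program promote_sum
  implicit none
  integer, parameter :: sp = selected_real_kind(6, 37)
  integer, parameter :: dp = selected_real_kind(15, 307)
  integer :: i
  real(sp) :: x(100000), total
  real(dp) :: acc

  call random_number(x)

  acc = 0.0_dp
  do i = 1, size(x)
     acc = acc + real(x(i), dp)   ! accumulate in double
  end do
  total = real(acc, sp)           ! round to single only at the end

  print *, 'sum with double accumulator =', total
end program promote_sum
```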
There are a couple of articles looking at this issue of floating-point precision on the coprocessor. I don't know that they will help you any more than what James and Tim have had to say, but here they are:
The Floating Point Model - balancing performance with accuracy and reproducibility
Those are good references. Among the points implied there:
- Default vector math libraries (-fast-transcendentals) don't give identical results between host and MIC. They are allowed to vary by more than the relative amounts you quote. Scalar math libraries might more often give identical results, but are painfully slow, particularly on MIC.
- Host AVX2 fma should give identical results to MIC fma, but if your host arch choice is not AVX2 fma, you would need the no-fma option for MIC to replicate your host results. In principle, fma is more accurate, but not necessarily so in certain combinations where it can't be used symmetrically (see the sketch after this list).
- There is an arch-consistency option for the host to make results reproducible regardless of CPU architecture. I don't think it's available for MIC; otherwise, using that option on both might be a step toward resolving your complaint, at the expense of performance. I suppose it has the effects of -prec-div -prec-sqrt -no-fma, among others.
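A contrived illustration of the fma point (values chosen only to expose the rounding, not taken from the articles): a fused multiply-add rounds a*b + c once, while separate multiply and add round twice, so the last bits can differ between an FMA target (MIC, or an AVX2 host) and a non-FMA host; compiling the MIC side with -no-fma is one way to make them match, at some cost in speed.

```fortran
! Hypothetical demo: a*b + c under FMA (one rounding) vs. no FMA (two roundings).
! Compile the coprocessor side with -no-fma to mimic a non-FMA host.
program fma_demo
  implicit none
  integer, parameter :: dp = selected_real_kind(15, 307)
  real(dp) :: a, b, c, r

  a = 1.0_dp + 2.0_dp**(-30)
  b = 1.0_dp - 2.0_dp**(-30)
  c = -1.0_dp

  ! Exact product is 1 - 2**(-60).
  ! Without FMA it first rounds to 1.0, so r = 0.
  ! With FMA the exact product feeds the add, so r = -2**(-60).
  r = a*b + c
  print '(a,es25.17)', 'a*b + c = ', r
end program fma_demo
```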