Difficulty in tracking down the source of a NaN

Powell__David · ‎01-24-2018

I have some code which sometimes produces NaN results for some elements of an array. Interestingly, this happens when I enable -O2, but not when I enable -O3. It doesn't seem to happen with gfortran under Linux either. Of course, there may be some underlying issue with my algorithm which makes it susceptible to changes in compiler or optimisation level.

I would like to isolate the piece of code which is producing these NaNs. The most obvious thing to do is enable the '-fpe0' option, so that floating point exceptions trap immediately instead of producing NaNs. Combined with the '-traceback' option, this should allow me to immediately see the part of the code causing the problem. However, when I enable either of these options, the NaNs go away.

Are there other approaches that I could use to try and find the source of the NaNs?

Note that my code is compiled as a Python module, so I'm not sure how feasible it would be to run it under the Visual Studio debugger.

abhimodak · ‎01-25-2018

Hi David

One thing to try is the /Qinit:snan /Qinit:arrays options. Another would be to bisect the code using IEEE_Exceptions. (We have seen the latter give false positives, however.)

Abhijit

andrew_4619 · ‎01-26-2018

You could insert some diagnostic routines so that before/after a routine call you can use the ISNAN function to test results and log some diagnostics to a file maybe.
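A minimal sketch of this kind of check, written in Python on the calling side since the code is wrapped as a Python module. The names `nan_indices`, `checked_call` and the stand-in `routine` are hypothetical and only for illustration; inside the Fortran source the equivalent test would use ISNAN, as suggested above.

```python
import math

def nan_indices(values):
    """Return the indices of all NaN entries in a flat sequence of floats."""
    return [i for i, v in enumerate(values) if math.isnan(v)]

def checked_call(routine, values, log):
    """Call routine(values), recording NaN positions seen before and after."""
    before = nan_indices(values)
    if before:
        log.append(f"{routine.__name__}: NaN in input at {before}")
    result = routine(values)
    after = nan_indices(result)
    if after:
        log.append(f"{routine.__name__}: NaN in output at {after}")
    return result

# Hypothetical stand-in for a wrapped routine (not the poster's code):
# for a huge input, v * 1e200 overflows to inf and inf - inf gives NaN.
def routine(values):
    return [(v * 1e200) - (v * 1e200) for v in values]

log = []
out = checked_call(routine, [1.0, 1e200, 2.0], log)
print(log)
```

Here the log is a plain list; in practice the same messages could be appended to a file so that they survive a crash.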

jimdempseyatthecove · ‎01-26-2018

>>my code is compiled as a python module

This may add the possibility that an interface issue (misrepresented pointer/reference) between Python and IVF is causing Python to stomp on the array element(s).

If the NaNs appear in the same array cells each run, then you might be able to insert tests in your major loop. Note that if inserting code to test specific array cells for NaNs causes the NaNs to consistently appear elsewhere, then this is indicative of a misrepresented pointer/reference.
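One way to check whether the NaNs land in the same cells each run is to compare the sets of NaN positions across several runs. The sketch below assumes each run's output is available as a flat list of floats; the function names and the sample data are made up for illustration.

```python
import math

def nan_positions(values):
    """Set of indices holding NaN in one run's output."""
    return frozenset(i for i, v in enumerate(values) if math.isnan(v))

def stable_nan_positions(runs):
    """True if every run has NaNs in exactly the same (non-empty) set of cells."""
    sets = [nan_positions(r) for r in runs]
    return bool(sets) and bool(sets[0]) and all(s == sets[0] for s in sets)

nan = float("nan")
# Illustrative data only, not real results.
run_a = [0.5, nan, 1.5, nan]
run_b = [0.5, nan, 1.5, nan]
run_c = [nan, 0.7, 1.5, nan]

print(stable_nan_positions([run_a, run_b]))  # same cells in both runs
print(stable_nan_positions([run_a, run_c]))  # NaNs moved between runs
```

If the positions are stable, testing those specific cells inside the main loop is worthwhile; if they move when instrumentation is added, that points towards the pointer/reference explanation.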

Often NaNs are produced with convergence routines that fail to converge. The compiler options affect which math intrinsic functions are used. You may have a convergence issue when a specific variant of the intrinsic function is used.

Try using /fp:precise

Mix with /Qfast-transcendentals-

Note trailing -

Also, please be aware that having a symptom go away with different optimization options is not an absolute assurance that this fixes a bug. It can also mean the error (misrepresented pointer/reference) moves elsewhere to a benign location.

Jim Dempsey

Powell__David · ‎01-28-2018

Thanks all for the tips, I'll definitely try them all when I get the chance.

@jimdempseyatthecove, I am using the f2py tool to wrap my libraries for Python. Are you suggesting that I might have got my function signatures wrong, or that there could be some bug in f2py? I've had errors with incorrect function signatures in the past, but they always resulted in very obvious segmentation faults etc.

I should have noted that when my code isn't producing NaNs, then the results are correct (final results compare well with analytical models, although it's difficult to check intermediate results produced by this function). So I'm suspecting that the error is numerical.

This code is actually a fixed-order, 2D integration of a hard-coded function, so it's not an iterative routine looking for convergence. I guess some intermediate result is overflowing, and performing subsequent operations on an Inf leads to a NaN result. The specific variants of compiler intrinsics, or the non-associativity of floating point operations, are the kind of problems I'm expecting to find.
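To make the suspected mechanism concrete, here is a small sketch in Python, whose floats are IEEE double precision. The values are chosen purely for illustration and have nothing to do with the actual integrand. It shows an intermediate overflowing to Inf, a later operation turning that Inf into a NaN, and the same three terms giving different results depending on how they are grouped.

```python
import math

big = 1e308                  # close to the largest finite double (about 1.8e308)
overflowed = big * 10.0      # out of range for a double, so it becomes +inf

nan_from_sub = overflowed - overflowed   # inf - inf is NaN
nan_from_mul = overflowed * 0.0          # inf * 0 is NaN

# Non-associativity: identical terms, different grouping.
a, b, c = 1e308, 1e308, -1e308
left = (a + b) + c     # a + b overflows to inf first, and inf + c stays inf
right = a + (b + c)    # b + c is exactly 0.0, so the sum is a finite 1e308

print(overflowed, nan_from_sub, nan_from_mul)
print(left, right)
```

If the compiler evaluates a sum or product in a different order at one optimisation level, an intermediate like `a + b` may overflow at that level only, which would be consistent with NaNs that appear at -O2 but not at -O3.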

jimdempseyatthecove · ‎01-29-2018

Function signatures can be problematic. Also be wary of making a call from Python (or another interoperable language) to obtain a pointer to an array allocated in Fortran, then making subsequent calls to Fortran that might reallocate the array, without re-obtaining the array base address in Python. Fortran now has reallocation of the left-hand side on assignment in effect. The particular error symptoms you described, however, are not indicative of this potential problem.

You should be able to instrument your code working backwards from the insertion of the NaN into your output arrays. This may take several iterations.

Jim Dempsey