
lacek · ‎12-07-2012

Dear All,

I know that to track NaNs at runtime there are convenient compiler settings: -check all -traceback -fpe0

My traceback is:

kvec               0000000000899F43 Unknown               Unknown Unknown
kvec               0000000000837DD7 Unknown               Unknown Unknown
kvec               00000000008307B0 Unknown               Unknown Unknown
kvec               00000000007EF235 Unknown               Unknown Unknown
kvec               00000000007E7A11 Unknown               Unknown Unknown
kvec               000000000061635E eigen_mp_ev3_              90 eig.F90
kvec               000000000052EF76 mps_func_mp_mps_r        1813 mp2.F90
kvec               000000000064C5C7 propagate_                893 kvec.F90
kvec               00000000006419E2 MAIN__                    593 kvec.F90
kvec               000000000040B50C Unknown               Unknown Unknown
libc.so.6          00000038B722135D Unknown               Unknown Unknown
kvec               000000000040B409 Unknown               Unknown Unknown

eig.F90 contains just:

call zheevr(jobz, range, uplo, n1, a, lda, vl, vu, il, iu, abstol, m1, DD, U, ldz, isuppz, work, lwork, rwork, lrwork, iwork, liwork, info)

So how is it possible that it catches a NaN? If I remove -fpe0, zheevr completes and returns with info=0.

The matrix a is just:

(0.499999999999999,0.000000000000000E+000)
(0.707106781186547,0.000000000000000E+000)
(0.707106781186547,0.000000000000000E+000)
(1.00000000000000,0.000000000000000E+000)
(0.499999999999999,0.000000000000000E+000)
(0.707106781186547,0.000000000000000E+000)
(0.000000000000000E+000,0.000000000000000E+000)
(0.707106781186547,0.000000000000000E+000)
(1.00000000000000,0.000000000000000E+000)
(0.000000000000000E+000,0.000000000000000E+000)
(0.000000000000000E+000,0.000000000000000E+000)
(0.000000000000000E+000,0.000000000000000E+000)
(0.000000000000000E+000,0.000000000000000E+000)
(0.499999999999999,0.000000000000000E+000)
(0.707106781186547,0.000000000000000E+000)
(0.707106781186547,0.000000000000000E+000)
(0.999999999999999,0.000000000000000E+000)
(1.00000000000000,0.000000000000000E+000)
(0.888888888888890,0.000000000000000E+000)
(0.314269680527355,0.000000000000000E+000)
(0.314269680527355,0.000000000000000E+000)
(0.444444444444444,0.000000000000000E+000)
(0.666666666666667,0.000000000000000E+000)
(0.471404520791032,0.000000000000000E+000)
(0.000000000000000E+000,0.000000000000000E+000)
(0.471404520791032,0.000000000000000E+000)
(1.00000000000000,0.000000000000000E+000)
(0.000000000000000E+000,0.000000000000000E+000)
(0.000000000000000E+000,0.000000000000000E+000)
(0.000000000000000E+000,0.000000000000000E+000)
(0.000000000000000E+000,0.000000000000000E+000)
(0.111111111111111,0.000000000000000E+000)
(0.157134840263677,0.000000000000000E+000)
(0.157134840263677,0.000000000000000E+000)
(0.722222222222222,0.000000000000000E+000)
(0.166666666666666,0.000000000000000E+000)
(0.499999999999999,0.000000000000000E+000)
(0.500000000000000,0.000000000000000E+000)
(-0.707106781186547,0.000000000000000E+000)
(0.000000000000000E+000,0.000000000000000E+000)
(-0.707106781186547,0.000000000000000E+000)
(0.999999999999999,0.000000000000000E+000)
(-6.661338147750939E-016,0.000000000000000E+000)
(5.551115123125783E-017,0.000000000000000E+000)
(-6.106226635438361E-016,0.000000000000000E+000)
(0.999999999999999,0.000000000000000E+000)

Anonymous66 · ‎12-07-2012

-fpe0 turns on floating-point invalid, divide-by-zero, and overflow exceptions. Underflows are flushed to 0. Without that option, the default is to disable exceptions, and floating-point underflow is gradual. This is why you are aborting on the NaN with -fpe0 but the program runs to completion without it.

lacek · ‎12-07-2012

Dear Annalee, I understand what -fpe0 does, but I do not understand why zheevr (I am using Intel's MKL) triggers a NaN catch. The program is completely deterministic, the output does not contain any NaN, and stat=0, but when I use -fpe0 some NaNs are caught. They seem to originate from the MKL routine, which:
- does not contain NaN on input
- does not contain NaN on output
- terminates with stat=0

Anonymous66 · ‎12-07-2012

The NaN may occur within the zheevr calculations without causing the final result to be NaN. Alternatively, flush-to-zero may result in a NaN that does not otherwise occur. If your question is specific to MKL, I would suggest posting on the MKL forum as well.

Regards, Annalee

lacek · ‎12-07-2012

Ok, thanks. I will do that.

lacek · ‎12-07-2012

I have recompiled the file containing the call to zheevr without the -fpe0 flag and linked it into my program this way, but this did not really help. It seems that the NaN trapping is unable to locate the particular line that throws the NaN (in my case, call zheevr), but it is able to locate the enclosing routine containing zheevr. So the trigger is 99.9% still zheevr, but it is not a real source of a problem. I assume this is because the monitoring for NaN is done by observing some processor flags, which are always triggered. Is it possible to change the default behaviour of the NaN-catching function, so that it prints a warning but does not stop the program?

Anonymous66 · ‎12-07-2012

There is no way to do that, but you can get more information about where the NaN occurs by compiling with -g as well as -traceback. I would also suggest running it within a debugger.

mecej4 · ‎12-08-2012

"Is it possible to change default behaviour of the NaN catching function - make it print a warning but not stop the program"

Although one may agree with that wish in principle, there are reasons why it is not practical to implement such a change. For example, what if the number of NaNs caught during a single execution runs into the millions? Does the user want the NaN error reports mixed into the program output? What if the standard output has been redirected to a file?

lacek · ‎12-14-2012

Ok, I see that this could be a problem. Thanks for the comments. Then I guess the simplest option would be to watch variables in the debugger.

tracking NaN problem