piet_de_weer · ‎11-15-2014

I'm having a weird issue. I recently switched from the Intel 10.1 compiler to the latest XE2015. Some users are now reporting crashes, and after adding logging I found that the cause is NaNs sometimes popping up in several locations of my program. That makes sense, since I don't always check for zeros in things like divisions - it would slow down the program far too much.

But what's weird is that the same code worked perfectly fine, without crashing or other weird behavior, before I switched to the new compiler. So, my question is: is there any difference in how these two compilers handle NaNs? Ideally, I would have all NaNs automatically replaced by 0 - or any other number that responds properly to min and max calls. I usually do protect the output with min and max, but for NaNs that doesn't help; min(100, NAN) returns NAN and not 100...
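One way to get the "NaN becomes a safe number" behaviour asked for here is to test for NaN explicitly before clamping. The sketch below is only an illustration in C++: `clamp_or` and `std_min_is_order_dependent` are hypothetical names, not anything from the original program, and the sketch assumes the compiler has not been told it may assume NaNs never occur (under such settings the `std::isnan` test could be optimized away). It also shows that, for the standard library's `std::min`, whether a NaN propagates depends on the argument order.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical helper: clamp x to [lo, hi], returning `fallback` when x is NaN.
inline double clamp_or(double x, double lo, double hi, double fallback = 0.0) {
    assert(lo <= hi);                    // precondition of this sketch
    if (std::isnan(x)) return fallback;  // NaN fails every ordered comparison, so test it first
    return std::min(std::max(x, lo), hi);
}

// std::min(a, b) is defined as (b < a) ? b : a, so a NaN in the first
// position is returned, while a NaN in the second position is dropped.
inline bool std_min_is_order_dependent() {
    const double nan = std::numeric_limits<double>::quiet_NaN();
    return std::isnan(std::min(nan, 100.0)) && std::min(100.0, nan) == 100.0;
}
```

With this, `clamp_or(x / y, 0.0, 100.0)` yields 0 when the division produced a NaN instead of passing it on. `std::fmin(100.0, x)` is an alternative when "ignore the NaN operand" is the behaviour wanted.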

(Note: I'm compiling multiple code paths, for SSE2, SSE3, SSE4.1, SSE4.2, AVX and AVX2. Some of the issues seem to happen more on older systems which would be using SSE3. In the old compiler I generated a single code path for SSE2).

jimdempseyatthecove · ‎11-15-2014

Have you enabled all the diagnostics, including runtime checks for uninitialized variables, index out of bounds, and argument checking?

Jim Dempsey

piet_de_weer · ‎11-16-2014

In debug mode, yes. It does not show any errors - not with the 10.1 compiler, not with the XE2015 compiler, and also not with the GNU compilers on Linux and Mac, in both 32- and 64-bit modes. I recently checked the whole thing with Valgrind under Linux and fixed everything it reported.

The people who are reporting these issues are using the release version, and it's not really an option to let them test the debug version (among other reasons because it often takes multiple days before an error occurs, and in debug mode the run speed is so much lower that it will likely take weeks). When an exception occurs I know the exact address and I can easily look up the code where it happened - but NaNs don't cause an exception.

The problems that I found so far (I added a lot of logging that checks for NaNs and logs something) were all really caused by calculations that were returning NaN values - which was correct, looking at the code (most of them were divisions by 0). In one case I saw that I took the log of a negative number (someone had a bug in a configuration file) - so it makes perfect sense that that returns NaN. But in the version of my software compiled with the 10.1 compiler it somehow didn't.
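The kind of NaN logging described here can be sketched roughly as follows. This is an illustration under stated assumptions, not the poster's actual code: `is_nan_bits`, `check_nan`, and `CHECK_NAN` are hypothetical names. The check inspects the IEEE 754 bit pattern (all exponent bits set, non-zero mantissa) rather than calling `std::isnan`, on the assumption that a release build with relaxed floating-point settings might otherwise optimize the test away. It also assumes `double` is a 64-bit IEEE 754 value.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <limits>

// True if v has a NaN bit pattern: exponent all ones and a non-zero mantissa.
inline bool is_nan_bits(double v) {
    static_assert(sizeof(double) == sizeof(std::uint64_t), "assumes 64-bit double");
    std::uint64_t bits;
    std::memcpy(&bits, &v, sizeof bits);  // well-defined way to read the representation
    const std::uint64_t exp_mask  = 0x7FF0000000000000ULL;
    const std::uint64_t frac_mask = 0x000FFFFFFFFFFFFFULL;
    return (bits & exp_mask) == exp_mask && (bits & frac_mask) != 0;
}

// Logs the source location when v is NaN and returns whether it was.
inline bool check_nan(double v, const char* expr, const char* file, int line) {
    if (!is_nan_bits(v)) return false;
    std::fprintf(stderr, "NaN in '%s' at %s:%d\n", expr, file, line);
    return true;
}

// Captures the expression text and call site automatically.
#define CHECK_NAN(x) check_nan((x), #x, __FILE__, __LINE__)
```

A call such as `CHECK_NAN(ratio)` placed right after a suspicious division or `log` call reports where the value first went bad, which narrows the search without needing a debug build.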

PS: Sorry about the "then" instead of "than" in the title, I can't edit it :(

TimP · ‎11-16-2014

For any chance of correct NaN handling you need -fp-model strict (/fp:strict on Windows).

jimdempseyatthecove · ‎11-17-2014

In large projects (100's of files) that exhibit this type of error, where the Debug build works but takes days or weeks to complete and the Release build is fast but crashes, the technique to use is to compile some files in Debug mode and others in Release mode. Actually, I create a new build type derived from the Debug build, then project by project manipulate the optimization levels to find the problem project, then do the same for the files within that project.

It takes a while, but it is well worth it in the end, as it gives you a starting point for when you add new feature components. Also, if something breaks at some optimization level, it gives you a way of getting fast production code (only the quirky files are compiled with /O0).

Jim Dempsey

NAN handled differently in XE2015 then 10.1 compiler?