I'm running a large MPI job of the WRF model application (720 cores), compiled using Intel 2015u1 (15.0.1) and the MVAPICH2 MPI library.
When compiling in debug mode I'm using the following switches:
-g -O0 -fno-inline -no-ip -traceback -fpe0 -check noarg_temp_created,bounds,format,output_conversion,pointers,uninit -ftrapuv -unroll0 -u
The job runs until it hits a floating-point overflow exception: error (72): floating overflow
However, the traceback in the output file from the specific core is empty, with no useful info.
Any ideas on how to proceed in locating the specific problematic line of code?
Do you see a stack trace in the program output with just the program counter (PC) addresses, or no stack trace at all? The latter would suggest that something had been overwritten. You should at least be seeing something like
Image PC Routine Line Source
wrf.exe 0000000003AB0000 Unknown Unknown Unknown
wrf.exe 0000000003ABCDEF Unknown Unknown Unknown
I would start by using fewer debug options at once and focusing on the floating-point overflow. Some of those options have side effects that might interfere with each other. So try
-O0 -fno-inline -no-ip -traceback -fpe0
-g doesn't hurt, but you don't need it if you just want to get a traceback without doing interactive debugging.
I take it you are not using OpenMP. If you were building at -O2, I might suggest -fno-omit-frame-pointer, but this isn't necessary at -O0.
Looking at the output file, how far has WRF progressed? Has it got beyond initialization? Are you able to get to the same point by running with a single MPI rank? (I realize that might be slow unless the error occurs early on.) If so, you could probably try interactive debugging.
You could also try inserting CALL TRACEBACKQQ('location',-1) at one or two strategic locations in the code to see whether you can get a normal-looking traceback from there. You'll want USE IFCORE to get access to the interface.
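As a rough sketch of the TRACEBACKQQ suggestion: the wrapper name debug_checkpoint and the tag text are made up for illustration; only TRACEBACKQQ and the IFCORE module come from the Intel compiler. Passing -1 as the exit code asks the routine to return to the caller instead of terminating the program.

```fortran
! Hypothetical helper (name is an assumption, not part of WRF or IFCORE):
! prints a traceback labelled with 'tag', then returns to the caller.
subroutine debug_checkpoint(tag)
  use ifcore, only: tracebackqq   ! Intel-specific module providing TRACEBACKQQ
  implicit none
  character(len=*), intent(in) :: tag

  ! Second argument -1: do not exit, continue execution after the traceback.
  call tracebackqq(tag, -1)
end subroutine debug_checkpoint
```

A call such as call debug_checkpoint('entering new physics block') placed at the start of the added code should then show whether a readable traceback can be produced from that point at all.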
-traceback works for compiled Fortran user code. The call stack can't be unwound through C functions unless these have also been compiled with -traceback. This doesn't give line numbers and names for the C functions, but it should allow access to Fortran function names and line numbers further up the stack.
If you want to detect uninitialized variables, I recommend -init=snan,arrays instead of -ftrapuv and/or -check uninit. The -init option works for some types of floating-point variables in the 15.0 compiler, but covers a much wider range in 16.0, if you have access to a 16.0 (or 17.0) compiler.
If none of this gets you anywhere, I'd remove the -fpe0 and -ftrapuv options and see whether you learn anything from the -check options, especially the bounds checking.
Thanks for your suggestions (I'm not sure how the question got duplicated; maybe the IDZ guys can merge them together).
I'm getting the stack trace in the program output with just the program counter (PC) addresses.
The WRF output is pretty far from initialization (it's deep within the physics part of the program). I know that it aborts because of my changes, but my new code is pretty large (~20k lines of code), so the stack trace info is necessary. I can try interactive debugging, but a single MPI rank is out of the question: for a job this large (originally 720 cores) I would need around 100 cores just to have enough memory for the domain, and that many ranks may be cumbersome to interact with.
It's definitely pure Fortran code. I will try a different compiler and change the switches.
NOTE: I have succeeded with the good old approach of adding print statements at strategic points in my added code to narrow down the problematic block.
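For anyone wanting something a bit more systematic than bare prints, a minimal sketch of such a checkpoint could look like the following. The module name debug_checks, the routine check_field and the limit argument are all made up for illustration, and the routine is meant for a build without -fpe0 (with -fpe0 the trap would fire before the check is reached).

```fortran
! Hypothetical helper module (all names are assumptions for illustration).
module debug_checks
  use, intrinsic :: ieee_arithmetic, only: ieee_is_finite
  use, intrinsic :: iso_fortran_env, only: error_unit
  implicit none
contains
  ! Reports the first element of 'field' that is Inf/NaN or larger in
  ! magnitude than 'limit'. 'label' identifies the call site in the output.
  subroutine check_field(label, field, limit)
    character(len=*), intent(in) :: label
    real, intent(in) :: field(:)
    real, intent(in) :: limit
    integer :: i

    do i = 1, size(field)
      if (.not. ieee_is_finite(field(i)) .or. abs(field(i)) > limit) then
        write(error_unit, '(a,a,a,i0,a,es12.4)') 'check_field: ', label, &
             ' suspicious at index ', i, ' value ', field(i)
        return   ! report only the first offending element
      end if
    end do
  end subroutine check_field
end module debug_checks
```

Calling check_field before and after each suspect block, with a limit well above any physically sensible value, narrows down where a variable first starts to grow toward overflow rather than where it finally overflows.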