I'm running a large MPI job of the WRF model application (720 cores), compiled using Intel 2015u1 (15.0.1) and the MVAPICH2 MPI library.
When compiling in debug mode, I use the following switches:
-g -O0 -fno-inline -no-ip -traceback -fpe0 -check noarg_temp_created,bounds,format,output_conversion,pointers,uninit -ftrapuv -unroll0 -u
The job runs until it aborts with a floating-point overflow exception: error (72): floating overflow
However, the traceback in the output file from the affected core contains no useful information.
Any ideas on how to locate the specific line of code that causes the problem?
Please present the traceback information that you dismissed as being "empty with no useful info". (If it is really empty, we can't know whether the absent information is useful or not, can we?)
You're right. Here is the traceback, which is not useful from my perspective:
forrtl: error (72): floating overflow
Image PC Routine Line Source
wrf_FAST_43bins_V 0000000011072C41 Unknown Unknown Unknown
wrf_FAST_43bins_V 0000000011071397 Unknown Unknown Unknown
libnetcdff.so.6 00002AAAAAB9FD12 Unknown Unknown Unknown
libnetcdff.so.6 00002AAAAAB9FB66 Unknown Unknown Unknown
libnetcdff.so.6 00002AAAAAB865BC Unknown Unknown Unknown
libnetcdff.so.6 00002AAAAAB8AE12 Unknown Unknown Unknown
libpthread.so.0 0000003BB080F7E0 Unknown Unknown Unknown
wrf_FAST_43bins_V 000000000E9BA787 Unknown Unknown Unknown
wrf_FAST_43bins_V 0000000006547964 Unknown Unknown Unknown
wrf_FAST_43bins_V 000000000639E156 Unknown Unknown Unknown
wrf_FAST_43bins_V 00000000043C54DE Unknown Unknown Unknown
wrf_FAST_43bins_V 0000000003A75634 Unknown Unknown Unknown
wrf_FAST_43bins_V 0000000003496833 Unknown Unknown Unknown
wrf_FAST_43bins_V 000000000050C3D5 Unknown Unknown Unknown
wrf_FAST_43bins_V 0000000000408572 Unknown Unknown Unknown
wrf_FAST_43bins_V 0000000000407AC5 Unknown Unknown Unknown
wrf_FAST_43bins_V 0000000000407A7E Unknown Unknown Unknown
libc.so.6 0000003BB001ED5D Unknown Unknown Unknown
wrf_FAST_43bins_V 0000000000407989 Unknown Unknown Unknown
From the PC (program counter) values in the traceback and compiler listings one could find the line number where the fault occurred. That, however, is a bit time-consuming and cumbersome, and is necessary only if the fault occurs in optimized code.
Did you build the object/library containing wrf_FAST_43bins_V yourself? Did you specify -traceback when compiling that, as well as when compiling libnetcdff.so.6? If not, you could recompile at least the source containing wrf_FAST_43bins_V with -traceback (and relink with -traceback) and run again to obtain a traceback with line numbers displayed.
Thanks for your time taken answering my question.
The WRF model *.exe file, linking the *FAST_43bins* module as well as the NetCDF library, is being generated using a special make file. In that file I used the debug mode as mentioned above.
I have attached the make file "configure.wrf_IDZ" that I used for compilation. You will find a variant of the above compiler switches under "FCDEBUG" (I tried some changes since then, no luck).
Do you see anything in it that I should change because it might prevent the -traceback switch from showing full details, including line numbers?
I'm afraid that I cannot help you directly with WRF, since I have never used it or attempted to build it. In fact, as of now the WRF site is not working, and the sign-up page is off-line.
However, you can find out whether a particular object file, e.g., wrf_FAST_43bins_V.o, contains line-number information by using the command:
readelf --debug-dump=decodedline wrf_FAST_43bins_V.o
If the file contains line-number debugging information, you should see output resembling:
Decoded dump of debug contents of section .debug_line:

CU: wrifes.f:
File name    Line number    Starting address
wrifes.f     1              0
wrifes.f     9              0x1c
wrifes.f     10             0x26
wrifes.f     11             0x84
wrifes.f     13             0xa0
wrifes.f     14             0xaa
wrifes.f     15             0x11b
I used a test file, wrifes.f, just to give you concrete information; this file has nothing to do with WRF. Had I compiled the file without specifying -traceback or -g, the table that I just showed would have been empty.
Note that you now have a table of line-numbers and offsets relative to the routine entry points. You can obtain entry point offsets by running nm -g on the executable that you ran before you obtained the traceback, or you can tell the linker to produce a map when you build your a.out the next time. For each address in the traceback, you can subtract the entry-point address of the routine to obtain the relative address, which you can use in the table produced by readelf to find the corresponding line number in the pertinent source code.
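To make the arithmetic concrete, here is a minimal Python sketch of that lookup. It is only an illustration under stated assumptions: the nm listing, the symbol names (main, wrifes_, other_) and the PC value are made up, and the line table is the wrifes.f example shown earlier. It also assumes the executable is loaded at its link-time addresses, so that the addresses printed by nm can be compared directly with the PC column of the traceback. That will generally not hold for frames inside shared libraries such as libnetcdff.so.6.

```python
import bisect

# Hypothetical `nm -g` output; real symbol names and addresses will differ.
NM_TEXT = """\
0000000000401000 T main
0000000000401200 T wrifes_
0000000000401400 T other_
"""

# (starting address, line number) rows taken from the readelf example above.
LINE_TABLE = [(0x0, 1), (0x1c, 9), (0x26, 10), (0x84, 11),
              (0xa0, 13), (0xaa, 14), (0x11b, 15)]


def parse_nm(nm_text):
    """Return sorted (entry address, name) pairs for code symbols."""
    symbols = []
    for line in nm_text.splitlines():
        parts = line.split()
        # Undefined symbols have no address column, so they are skipped here.
        if len(parts) == 3 and parts[1] in ("T", "t", "W", "w"):
            symbols.append((int(parts[0], 16), parts[2]))
    return sorted(symbols)


def locate(pc, symbols):
    """Return (routine name, offset of pc from that routine's entry point)."""
    addresses = [addr for addr, _ in symbols]
    i = bisect.bisect_right(addresses, pc) - 1
    if i < 0:
        raise ValueError("PC lies below every known entry point")
    entry, name = symbols[i]
    return name, pc - entry


def line_for_offset(offset, line_table):
    """Return the line whose starting address is the last one <= offset."""
    starts = [start for start, _ in line_table]
    i = bisect.bisect_right(starts, offset) - 1
    if i < 0:
        raise ValueError("offset lies before the first table entry")
    return line_table[i][1]


pc = 0x401290  # hypothetical PC value copied from a traceback line
name, offset = locate(pc, parse_nm(NM_TEXT))
line = line_for_offset(offset, LINE_TABLE)
print(f"{name}+{offset:#x} -> line {line}")  # wrifes_+0x90 -> line 11
```

With real data you would paste in the nm output for the executable that produced the traceback and the readelf table for the object file containing the routine that locate() reports. If an object file contains several routines, the addresses in its table may be relative to the start of its code section rather than to each routine, so check that before trusting the result.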