Strategy to find a difficult bug

Francois_F_ · ‎05-29-2019

Hi,

I've been given a program written in a mix of Fortran 77, Fortran 2003, and C++. There are a lot of COMMON blocks and Fortran includes. It consists mostly of Fortran and is spread across 150 files. The C++ code is compiled into a DLL and is called from Fortran. I don't know the program, and I've been told that it gives wrong results with -O3.

What I found is worse than that. I've managed to get segfaults with some compiler flags that disappear with others (-ipo can remove some of them). I've never managed to get a segfault with -O0, but I got some at -O2. I cannot trust the debugger, and adding a write statement on a variable can remove the segfault! I think that the stack is corrupted or something like that.

Is there any tool or strategy I could use to find the bug(s)? I have Parallel Studio XE 2019.

Should I look for a bug close to the segfault? There is clearly no bug in the code at that point. It comes from somewhere earlier, but I have no idea whether the bug is likely to be close by or somewhere else in the code. As adding a write statement "removes" the segfault, I am kind of lost.

Best regards,

François

mecej4 · ‎05-29-2019

I suspect an optimizer bug, from the description.

Build and run with a different compiler (an older version of IFort, Gfortran, etc.) and see whether the error still occurs.

Build using IFort with profiling enabled at the subroutine level, and run until the segfault occurs. From the traceback and the profiler output, identify which subprograms were not executed. Comment out the calls to those subprograms, remove the files containing the subprograms, and verify that the pruned down source code still generates the segfault. You can now profile line-by-line, and identify portions of subprograms that were not executed, enabling further reduction in code size. You now have a "bug reproducer". Zip up all the files (sources, include files, data files, project files) needed to build and run the reproducer, and submit the set to Intel, along with instructions to build and run.

FortranFan · ‎05-29-2019

@François,

An option you can consider is to share your code with Intel support who can maintain your required confidentiality while reviewing it and offering you suggestions. This is particularly relevant given the issue with /O3 optimization but not with /O0 or even /O2.

Assuming floating-point calculations are involved, review the material by Dr Fortran, especially toward -fp compiler options: http://sc13.supercomputing.org/sites/default/files/WorkshopsArchive/pdfs/wp129s1.pdf

Separately, can you confirm the Fortran code adopts explicit typing e.g., IMPLICIT NONE in all the relevant scopes?

Also that it makes use of explicit interfaces especially if there is interoperation with C++ (e.g., https://software.intel.com/en-us/forums/intel-visual-fortran-compiler-for-windows/topic/808879 especially Quote #11)?

An additional aspect which helps is the use of consistent kinds of intrinsic types in Fortran: see this Dr Fortran blog: https://software.intel.com/en-us/blogs/2017/03/27/doctor-fortran-in-it-takes-all-kinds

As you know, the above 3 aspects can help you locate an overwhelming majority of coding issues. However if you think your code is bug-free (!!) these may not be of much help!

By the way, have you tested your code with run-time checks offered by Intel Fortran /check:bounds, etc.?

https://software.intel.com/en-us/fortran-compiler-developer-guide-and-reference-check

Don't forget the compile-time checks, which might reveal issues you may have overlooked:

https://software.intel.com/en-us/fortran-compiler-developer-guide-and-reference-warn

jimdempseyatthecove · ‎05-29-2019

Back up your development folder first.

In VS, start with the Debug build: perform Clean, then Build, then run your test suite and verify correctness. If, and only if, you have correct results, continue as follows.

a) Make a list (notepad) or spreadsheet of all the source files in your solution.
b) Using the Solution Explorer, highlight (select as you do in Windows Explorer) half of the source files, then right click on the highlight. You should have a pop-up option for Properties; use it to set the optimization level to O2 (the one that segfaults). *** Keep note of which files are altered. On older versions of VS, source files with altered properties had a red tick mark on the file type icon. You may want to keep track of which ones are O2'd.

Now perform a build and retest.

If no error, then mark half of the remaining files and repeat. (IOW do a binary search)
If error, then unmark half of the marked files and repeat. (IOW do a binary search)
...

I've found this quite useful on a Solution with 13 Projects and 750 files (9 or 10 cycles at worst will find the problematic file).
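The binary search described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it assumes that a single file triggers the failure when optimized, and `fails_when_optimized` is a hypothetical callback that would rebuild with only the given files at O2 (everything else at O0) and run the test. Here it is replaced by a fake so the sketch runs on its own; the file names and the culprit are made up.

```python
import math
from typing import Callable, List, Sequence, Set


def find_problem_file(files: Sequence[str],
                      fails_when_optimized: Callable[[Set[str]], bool]) -> str:
    """Bisect for the one file whose optimization triggers the failure.

    Assumes exactly one culprit. With several interacting culprits the
    search can end on the wrong file and must be repeated.
    """
    suspects = list(files)
    while len(suspects) > 1:
        half = suspects[:len(suspects) // 2]
        if fails_when_optimized(set(half)):
            suspects = half                   # culprit is in the optimized half
        else:
            suspects = suspects[len(half):]   # culprit is in the other half
    return suspects[0]


# Stand-in for a real "rebuild and run" step (hypothetical names).
files = [f"src{i:03d}.f90" for i in range(150)]
culprit = "src097.f90"
build_log: List[int] = []


def fake_build_and_test(optimized: Set[str]) -> bool:
    build_log.append(len(optimized))  # record one rebuild per call
    return culprit in optimized


found = find_problem_file(files, fake_build_and_test)
worst_case = math.ceil(math.log2(len(files)))  # upper bound on rebuilds
print(found, len(build_log), worst_case)
```

The number of rebuilds is bounded by the base-2 logarithm of the file count rounded up, which for 750 files gives 10 and is consistent with the "9 or 10 cycles" quoted above.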

When you find the problematic file(s), and if it is a compiler error, then you can go to the effort of making a reproducer for submission to Intel.

*** lot of COMMONS and Fortran includes... if it is a compiler error...

If you are in the process of "modernizing" the code and converting functioning COMMONs into MODULEs, then you may have a coding error in your conversions.

Jim Dempsey

David_Billinghurst · ‎05-29-2019

I support Jim Dempsey's divide and conquer strategy. It efficiently identifies the problem file. It is even easier with a command line build process: you just mix and match the object files you feed the linker.

You can extend it by asking yourself "What is the simplest way to rule in/out half (or a decent chunk) of the code?".

Francois_F_ · ‎06-01-2019

Thanks. I'll try your suggestions on Monday.

A few things :

- I don't suspect a compiler bug. The code has enough rough edges that it can mess up the stack by itself. In my initial post I just said that the code crashes at a point where there is no bug. The bug must be somewhere before.

- There is no unit test. It would be too easy :-)

- The code only compiles with Intel Fortran as it is full of DEC pragmas. It has been compiled with different versions of Intel Fortran, and the compiler options that make it fail change from one version to another.

- I am quite sure that there are parts of the code without "implicit none".

I'll be back to work on Monday. I'll try another full day before I suggest that a major refactoring should be done.

GVautier · ‎06-01-2019

Hi

The description of your problem, especially "adding a write statement "removes" the segfault", suggests a stack corruption that may be caused by an array or character overflow somewhere during the execution.

It's very difficult to detect because the location of the error, and even its occurrence, depends on the generated code, which in turn depends on the compilation flags and on any added write statements.

If there is no interface, first check all procedure array and character arguments declared with a fixed size: if such a procedure is called with an actual argument smaller than the declared size and writes to it, it will overrun the actual argument and can corrupt the stack.
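To see why such a mismatch shows up far from its cause, here is a toy simulation in Python. It is not Fortran and not real stack memory: a `bytearray` stands in for the caller's storage, and `blank_fill` is a hypothetical callee that trusts its own declared length (8 bytes) instead of the length of what was actually passed (4 bytes). All names and sizes are invented for illustration.

```python
def blank_fill(memory: bytearray, offset: int, declared_len: int) -> None:
    """Hypothetical callee: blank-fills its character argument,
    trusting the length it declared rather than the caller's length."""
    memory[offset:offset + declared_len] = b" " * declared_len


# Caller's storage: a 4-byte character variable followed by a 4-byte integer.
memory = bytearray(8)
memory[0:4] = b"ABCD"
memory[4:8] = (42).to_bytes(4, "little")

counter_before = int.from_bytes(memory[4:8], "little")

# The callee declares the argument as 8 bytes, but only 4 were passed.
blank_fill(memory, 0, 8)

counter_after = int.from_bytes(memory[4:8], "little")
print(counter_before, counter_after)
```

The integer next to the character variable is silently overwritten with blanks, and nothing fails at the point of the write. Which neighbour gets hit depends on how the compiler lays out storage, which would fit the observation that flags and extra write statements move or hide the crash.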