I am experiencing an issue that is hard to debug, and I wanted to ask for help. Unfortunately, the code base I am talking about is large, so I cannot really share much of the code. I am using ifort20 on Windows, but I can observe a similar issue on Mac.
What is happening? I get an access violation (heap corruption).
What have I tried? All of the following attempts to pinpoint the issue have caused it to disappear:
- compile in debug mode
- compile in release mode, but enable runtime checks
I have never had an issue that disappears when I enable the runtime checks (the latter have often been very helpful). I have tried ifort19, but it shows the same behavior.
Any help would be appreciated!
This is sometimes known as a Heisenbug: it is there, but observing it with debugging means causes it to change behaviour.
- Use a different compiler - this may cause slightly different behaviour in the program and with any (or a lot of) luck you get a more useful indication of where things go awry.
- Compile on Linux and run the program via valgrind. That program is very good at detecting memory bugs.
Have you any clues as to where in the program it occurs (I do not mean where it is caused) or not even that? Have you tried write-statements to pinpoint the location? Write-statements have a tendency to change the behaviour as well, especially with certain memory problems, but it may give an idea where it goes wrong and due to the mentioned change perhaps even where it is caused.
Unfortunately there is no silver bullet or panacea for this sort of problem.
I would recommend enabling all compile-time checks, particularly interface checking, which will identify mismatches between calls and routines. With old code this may throw up quite a few errors, some of which will probably not cause a problem, but maybe some that will. When a mismatched call clobbers some memory, the effects are unpredictable because they depend on what gets clobbered. Small changes to the code or build options can change what memory gets affected, and so can cause the problem "to go away", but it is always there and can come back with other changes in the future.
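As a minimal sketch of the kind of mismatch that interface checking catches (the names scale_mod and scale_value are hypothetical and not from the code under discussion): once a routine lives in a module, its interface is explicit and the compiler can reject a call whose argument types do not match. With ifort, /warn:interfaces (Windows) or -warn interfaces (Linux/Mac) asks for comparable checks on external procedures.

```fortran
module scale_mod
  implicit none
contains
  ! Hypothetical routine: expects a double precision value and a default integer.
  subroutine scale_value(x, n)
    double precision, intent(inout) :: x
    integer, intent(in) :: n
    x = x * n
  end subroutine scale_value
end module scale_mod

program check_interfaces
  use scale_mod, only: scale_value
  implicit none
  double precision :: x
  real :: y

  x = 1.5d0
  y = 1.5
  call scale_value(x, 2)      ! matches the interface
  ! call scale_value(y, 2)    ! single precision actual argument: rejected at compile time
  print *, x, y
end program check_interfaces
```

If scale_value were an external routine without an explicit interface, the commented-out call could compile silently and overwrite memory next to y at run time, which is the clobbering described above.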
Optimizer bugs and bugs related to uninitialized variables, mismatched interfaces, etc., can be hard to reproduce and fix. Here are a couple of suggestions.
If you are lucky enough to have the access violation always occur in the same place, see if the subprogram where the violation occurs has a compact state. That is, see if it has a modest number of dummy arguments and few variables in COMMON or in modules.
Capture the values of all these variables at the beginning of the subprogram in the instance where you expect the access violation, and dump those values into a file. Write a small driver program to read that file and call the subprogram where the fault occurs. If the access violation is preserved, you would then have a reproducer that you can submit here.
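The capture-and-replay idea could be sketched roughly as follows. Everything here is an assumption made for illustration: suspect stands in for the subprogram where the fault occurs, dump_state is a hypothetical helper, and suspect_state.bin is an arbitrary file name. Unformatted stream I/O is used so that the values are replayed bit for bit.

```fortran
module replay_mod
  implicit none
contains
  ! Hypothetical stand-in for the subprogram where the violation occurs.
  subroutine suspect(n, a, scale, total)
    integer, intent(in) :: n
    double precision, intent(in) :: a(n), scale
    double precision, intent(out) :: total
    total = scale * sum(a)
  end subroutine suspect

  ! Call this at the top of the suspect routine in the failing run.
  subroutine dump_state(fname, n, a, scale)
    character(*), intent(in) :: fname
    integer, intent(in) :: n
    double precision, intent(in) :: a(n), scale
    integer :: u
    open(newunit=u, file=fname, form='unformatted', access='stream', &
         status='replace', action='write')
    write(u) n
    write(u) a
    write(u) scale
    close(u)
  end subroutine dump_state
end module replay_mod

program replay_driver
  use replay_mod
  implicit none
  integer :: n, u
  double precision, allocatable :: a(:)
  double precision :: scale, total

  ! Stand-in for the capture step, which would normally run inside the real application.
  call dump_state('suspect_state.bin', 3, [1.0d0, 2.0d0, 3.0d0], 0.5d0)

  ! Replay: read the captured state back and call the suspect routine in isolation.
  open(newunit=u, file='suspect_state.bin', form='unformatted', access='stream', &
       status='old', action='read')
  read(u) n
  allocate(a(n))
  read(u) a
  read(u) scale
  close(u)

  call suspect(n, a, scale, total)
  print *, 'total =', total
end program replay_driver
```

In practice the dump and the replay would be two separate programs, and any COMMON or module variables the routine reads would need to be written and restored in the same way.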
>>All of the following attempts to pinpoint the issue have caused it to disappear:- compile in debug mode
You should be aware that you can compile some source files in Debug mode (no optimizations) and other source files in Release mode (full optimizations).
This is relatively easy to do in MS VS by right-clicking on the source file in the Solution Explorer and then selecting the desired optimization level. You should (may) see a colored (red?) tick mark on the files that do not have the default options for the selected build.
Using this process you can isolate the problem to one or a few source files.
1) Using your release build, select the computationally heavy and/or complex source files, and for each of those right-click and set its properties to optimizations disabled.
2) Run a test that is known to cause problems.
3) Should the problem appear, then the issue resides in the code still built as Release. In this case, make a checklist of the sources that are optimized, then arbitrarily select half of them to be built with no optimizations. Go back to step 2.
4) Should the problem go away, then the issue resides in the code built with no optimizations. In this case, make a checklist of the sources that are not optimized, then arbitrarily select half of them to be built with optimizations. Go back to step 2.
Essentially you are performing a binary search. On a large application, say with 1000 source files, you have a good chance of isolating the problem source in 10 iterations.
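To make the iteration count concrete, here is a small simulation of that binary search. It is only a sketch under stated assumptions: a single source file is responsible, the failure disappears whenever that file is built unoptimized, and test_fails is a hypothetical stand-in for "rebuild and run the failing test". The culprit index is made up; in practice it is the unknown being searched for.

```fortran
program bisect_sources
  implicit none
  integer, parameter :: nfiles = 1000
  integer, parameter :: culprit = 637   ! hypothetical; unknown in a real search
  integer :: lo, hi, mid, builds

  ! Invariant: the culprit is among files lo..hi; all other files stay optimized.
  lo = 1
  hi = nfiles
  builds = 0
  do while (lo < hi)
    mid = (lo + hi) / 2
    builds = builds + 1
    ! Build files lo..mid without optimization, then run the failing test.
    if (test_fails(lo, mid)) then
      lo = mid + 1   ! still crashes: the culprit is among the optimized files mid+1..hi
    else
      hi = mid       ! crash gone: the culprit is among the unoptimized files lo..mid
    end if
  end do
  print *, 'suspect file index:', lo, ' found after', builds, 'builds'

contains

  ! Simulated test run: the failure shows only while the culprit is optimized.
  logical function test_fails(first_unopt, last_unopt)
    integer, intent(in) :: first_unopt, last_unopt
    test_fails = culprit < first_unopt .or. culprit > last_unopt
  end function test_fails

end program bisect_sources
```

Each pass halves the candidate range, so 1000 files need about ten rebuilds. If more than one file is involved, the halving can give inconsistent answers and the checklist approach from the steps above is the safer guide.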
In Linux, you could manipulate the make file such that instead of having a single object rule you have two (one optimized and the other not), then the link draws from both. You simply move files between the two rules.
With luck, the problem can be identified; in other words, the bug is a code generation bug and not a source code bug. With code generation issues, the problem will likely not be sensitive to placement (though the symptom may occur at a different point in the program). With a source code issue, the symptom (crash) may go away, but the problem still remains (and may show up again).
When the error can be confirmed to be a code generation issue, you can experiment to see if changing from /Od to /O1 generates working code. If so, go with that; otherwise leave this one file (or these few files) unoptimized. Then document this as having issues with full optimizations, and note that a re-test should be made when you get an update for the compiler.
Thank you all for your suggestions!
I think I have confirmed, at least experimentally, that it is a code-generation issue. The following should give you an idea of the structure of the code that is related to this issue. I did not check whether I am actually able to reproduce the runtime error with this example:
module example
  ...
contains
  ...
#define unscaleDp(arg, inputDp, inputInt) (merge(inputDp, (inputDp) / arg%scalars(inputInt), dabs(inputDp) .ge. arg%infinity))
  ...
  subroutine myMainSub (myDerivedData)
    type (someDerivedDataType) :: myDerivedData
    ...
  contains
    subroutine driverSub (inputInteger)
      use scaling
      integer, intent(in) :: inputInteger
      ...
      integer :: myint
      double precision :: dp, unscaledDp
      ...
      call getData (inputInteger, myint, dp)
      ...
      ! Using the function from the 'scaling' module always works.
      unscaledOutputDp = unscaleDp (myDerivedData, dp, myint)
      ! Using a preprocessor macro instead of a function call in debug mode
      ! creates a run-time access violation.
    end subroutine driverSub

    subroutine getData (inputInteger, outputInteger, outputDp)
      integer, intent(in) :: inputInteger
      integer, intent(out) :: outputInteger
      double precision, intent(out) :: outputDp
      ! compute the output values
      ...
    end subroutine getData
  end subroutine myMainSub
end module example

module scaling
  double precision function unscaleDp (myDerivedData, inputDp, inputInteger)
    type (someDerivedDataType), intent(in) :: myDerivedData
    double precision, intent(in) :: inputDp
    integer, intent(in) :: inputInteger
    ! unscale
    unscaleDp = merge(inputDp, inputDp / myDerivedData%scalars(inputInteger),&
                      dabs(inputDp) .ge. myDerivedData%infinity)
  end function unscaleDp
  ...
end module scaling
I get a runtime access violation when I use the preprocessor macro instead of the function call. Now, I double-checked that all outputs from function "getData" are correct. However, adding an unnecessary initialization at the beginning of this subroutine fixes the runtime error.
subroutine getData (inputInteger, outputInteger, outputDp)
  integer, intent(in) :: inputInteger
  integer, intent(out) :: outputInteger
  double precision, intent(out) :: outputDp
  ! unnecessary initialization
  outputInteger = huge (outputInteger)
  outputDp = huge (outputDp)
  ! compute the actual output values
  ...
end subroutine getData
I still do not understand what exactly causes the issue.
>>I get a runtime access violation when I use the preprocessor macro instead of the function call
It may help if you show the preprocessor macro.
You should be aware that a preprocessor macro is not scoped to the subprogram in which it is defined. After
end subroutine foo
*** macros continue to expand in following code ***
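A small sketch of that scoping behaviour, using hypothetical names (macro_scope, foo, twice) that are not from the code in this thread. It assumes the file is run through the preprocessor, for example by giving it a .F90 suffix or by compiling with /fpp or -fpp under ifort. The macro and the module function deliberately share a name, similar to unscaleDp in the posted outline.

```fortran
module macro_scope
  implicit none
contains
  subroutine foo(x)
    double precision, intent(inout) :: x
#define twice(a) (2.0d0 * (a))
    x = twice(x)
  end subroutine foo
#undef twice
  ! Past the directive above, the name is an ordinary Fortran identifier again.
  ! Leaving the directive out would let the macro rewrite the function statement
  ! below, which then no longer compiles.

  double precision function twice(a)
    double precision, intent(in) :: a
    twice = a + a
  end function twice
end module macro_scope

program demo
  use macro_scope
  implicit none
  double precision :: x
  x = 1.0d0
  call foo(x)          ! uses the macro expansion inside foo
  print *, twice(x)    ! calls the module function
end program demo
```

Placing an #undef right after the last intended use, or giving the macro a name that cannot collide with any procedure, keeps the expansion from leaking into later code in the same file.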