release crashes but debug does not

Brian_Murphy · ‎04-16-2017

I have some IVF Fortran code compiled into a DLL that I call from Excel VBA. I've got a case that crashes the DLL in the Release build with no indication of where it's crashing, but runs fine in the Debug build. I tried a Debug build with optimizations set identically to Release, but that runs fine. I then added write statements to the Release build, and I think I know roughly where the crash is happening, but when I start closing in, the write statements also prevent the code from crashing. I've examined the code, but I don't see any likely culprits for the crash.

Are there any tricks I can try to figure out where the code is crashing?

Steve_Lionel · ‎04-16-2017

Well, first you need to explain what you mean by "crashing". Does it cause the monitor to fall on the floor and crack? What is the exact and complete text of the error message?

The symptoms you describe are common with use of uninitialized variables and/or data corruption. Since you are calling from VBA, you need to make sure that the called routines have the STDCALL calling convention.
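As a minimal sketch of what that looks like on the Fortran side, an exported entry point can carry an Intel Fortran ATTRIBUTES directive. The routine name RUNCASE and its two integer arguments are hypothetical, chosen only for illustration; they are not taken from the DLL discussed in this thread.

```fortran
! Hypothetical DLL entry point. The name RUNCASE and the argument list
! are illustrative assumptions, not the actual HSEALH.DLL interface.
subroutine RUNCASE(idum1, idum2)
!DEC$ ATTRIBUTES DLLEXPORT, STDCALL, REFERENCE, ALIAS:'RUNCASE' :: RUNCASE
  implicit none
  ! A VBA Long passed ByRef corresponds to a 4-byte integer here.
  integer(4), intent(inout) :: idum1, idum2
  ! ... read the input files, run the calculation, write the results ...
end subroutine RUNCASE
```

On the VBA side, the matching declaration under these assumptions would be along the lines of Declare Sub RUNCASE Lib "HSEALH.DLL" (ByRef a As Long, ByRef b As Long). A mismatch in calling convention typically corrupts the stack, which can surface far from the call itself.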

Brian_Murphy · ‎04-16-2017

Thanks for the reply, Steve. I'm glad I didn't say it "blows up" :)

When I run the Release code from Excel in the normal way I get "Microsoft Excel has stopped working", and the problem details are as follows. HSEALH.DLL is the name of my IVF compiled DLL.

Problem signature:
  Problem Event Name: APPCRASH
  Application Name: EXCEL.EXE
  Application Version: 11.0.8169.0
  Application Timestamp: 465f27bd
  Fault Module Name: HSEALH.DLL
  Fault Module Version: 0.0.0.0
  Fault Module Timestamp: 58f4091a
  Exception Code: c0000005
  Exception Offset: 0004558f
  OS Version: 6.1.7600.2.0.0.256.1
  Locale ID: 1033

Additional information about the problem:
  LCID: 1033
  Brand: Office11Crash
  skulcid: 1033

If I run the Release code from within the Visual Studio debugger, I get this:

Unhandled exception at 0x5db2558f in EXCEL.EXE: 0xC0000005: Access violation reading location 0x00000000.

The DLL is passed two dummy arguments from Excel which aren't used for anything. Data required by the DLL is read from text files on disk, and it writes its calculation results to text files on disk.

I would say the program must be overstepping array bounds, but the case runs fine as a Debug build.

With write statements I've been able to determine which subroutine in the DLL is running when it crashes. But adding write statements to that subroutine results in it running without crashing. If I could just get the Debug build to trigger the same fault, I ought to be able to fix it.

mecej4 · ‎04-16-2017

Your code is attempting to dereference a null pointer. The offending code offset is shown. You should be able to locate the line number from the offset and find out what the Fortran code is doing to make that happen.

Brian_Murphy · ‎04-16-2017

How is that done? Is the offset you are referring to 0004558f? In decimal that equates to 284047. None of the source code files are that big.

mecej4 · ‎04-16-2017

That is the address offset from the base address, not the line number. You can get the compiler to give you an assembly listing annotated with line numbers, or you can set a break point near that address in the debugger.

If, as you say, the code is not "that big", you could make the source code available. If that is not feasible, you can hunt for the offending line by using binary search: add a PAUSE or STOP statement after N/2 lines, where N is the number of source lines, and see if that statement is executed before the access violation occurs. If so, move the STOP to 3N/4; if not, move the STOP to line N/4; and so on.

Brian_Murphy · ‎04-17-2017

Thanks for the reply, mecej4.

You said "That is the address offset from the base address, not the line number. You can get the compiler to give you an assembly listing annotated with line numbers, or you can set a break point near that address in the debugger." Where can I learn how to do this?

The source code is in a collection of 20 Fortran files. Some are older .f and some are .f90. I will email them to you if you would like to see them, but I can't post them in this forum. You can contact me by going to www.xlrotor.com and using the email address appearing at the very bottom of the home page. I have attached the one source code file where I think the code is crashing, in SUBROUTINE CALCV, somewhere in loop DO J=JVMIN, JVMAX starting at line 883 in this file.

I have done what you suggested with a STOP statement, however I used WRITE statements to unit 6 instead of STOP. The code doesn't crash when I put write statements near the place where it is crashing. You will see some of these statements in the attached file. The code is conducting an iterative calculation to solve a version of the Navier-Stokes equations, and a STOP statement would stop the code too soon.

mecej4 · ‎04-17-2017

If one is hunting for a bug that occurs only when optimized (i.e., "release") code is run, adding WRITE, PAUSE and STOP statements may inhibit some optimizations and in turn the bug may stay latent.

If you use the /FAsc compiler option, but otherwise keep the optimization level the same, the compilation will give, for each source file, a corresponding *.COD listing file containing code offsets and line numbers. Here is an excerpt:

;;;         DO J=1, NX
;;;           P(J, NY)=0.0D0               ! Zero differential end pressure at y=L/R
;;;         END DO
;;;       END IF
;;;       END

  0016d 8d 65 f8         lea esp, DWORD PTR [-8+ebp]            ;s:\lang\Calcsoln.f:68.7
  00170 5f               pop edi                                ;s:\lang\Calcsoln.f:68.7
  00171 5e               pop esi                                ;s:\lang\Calcsoln.f:68.7

The first number is the code offset from the base of the subprogram, in hex. The end of the line shows the corresponding line number (68) and column (7). Similarly, using the /map linker option will give you a map file that contains the cross-reference list between routine names and RVAs (addresses of entry points in the DLL or EXE).

With this information, finding the line number where an access violation occurred is a two-step process. First, find from the map listing the largest code symbol address (A_base) that is less than the address at which the crash occurs (A_crash), and note the subprogram name of that symbol. Second, take the difference between these two addresses, and use that as the offset in the .COD file for that routine. That will give you the line number (it may be off by a couple of lines) where the crash occurred.

This is how debugging used to be done in the 1960s and 1970s. Today's symbolic debuggers are of limited help in debugging optimized code or finding compiler optimization bugs.

It will be helpful to have a detailed printout of variable values from an unoptimized run (which you should examine to see if the values are correct or at least believable).

Brian_Murphy · ‎04-17-2017

Here's the scoop. The main thing I'd like to know is whether I am using the address information correctly.

The offset given in the crash message is 000456c2

The map file contains this
0001:000429b0       _CALCV                     100439b0 f   Calcsoln.obj
0001:00044920       _FLOWS                     10045920 f   Calcsoln.obj
0001:00044be0       _ENDS                      10045be0 f   Calcsoln.obj
0001:00044e20       _EDGEV                     10045e20 f   Calcsoln.obj
0001:00045070       _CALCP                     10046070 f   Calcsoln.obj
0001:00045980       _CALCU                     10046980 f   Calcsoln.obj
0001:00046540       _SETLIM                    10047540 f   Calcsoln.obj

I think the offset is indicating CALCP is where the crash is.
I was expecting it to be in CALCV based on my use of WRITE statements.
But anyhow, subtracting 45070 from 456c2 gives 652
In Calcsoln.cod I see this:

;;;
;;;       Ue=Mspeed*(1.0D0+B*DEXP(-L*Y))      ! exit circ. velocity
;;;       IF (Speed.gt.(0.0D0)) THEN          !...!


  00639 f2 0f 10 3d 18
        00 00 00         movsd xmm7, QWORD PTR [_FACTORS+24]    ;...\HsealH DLL\Calcsoln.f:225.7
  00641 66 0f ef c0      pxor xmm0, xmm0                        ;...\HsealH DLL\Calcsoln.f:225.16
  00645 f2 0f 10 35 10
        00 00 00         movsd xmm6, QWORD PTR [_FACTORS+16]    ;...\HsealH DLL\Calcsoln.f:224.7
  0064d f2 0f 11 94 24
        40 01 00 00      movsd QWORD PTR [320+esp], xmm2        ;...\HsealH DLL\Calcsoln.f:221.15
  00656 f2 0f 11 9c 24
        80 00 00 00      movsd QWORD PTR [128+esp], xmm3        ;...\HsealH DLL\Calcsoln.f:222.7
  0065f f2 0f 11 ac 24
        30 01 00 00      movsd QWORD PTR [304+esp], xmm5        ;...\HsealH DLL\Calcsoln.f:222.7
  00668 f2 0f 11 b4 24
        38 01 00 00      movsd QWORD PTR [312+esp], xmm6        ;...\HsealH DLL\Calcsoln.f:224.7

So I think this tells me the crash is in line 221 near column 15.
Looking in the actual source code file, I see this with their respective line numbers:

220      Kx=FNKX(Uin,Vin,Hi,Speed,Rep,Rho,Emu,Kxrot)
221      L=Kx/Rey/Rho/Vin                    !
222      B=(2.0D0*Alpha-1.0D0)*Inerl         !
223
224      Ue=Mspeed*(1.0D0+B*DEXP(-L*Y))      ! exit circ. velocity
225      IF (Speed.gt.(0.0D0)) THEN          !...!
226          Uee=Ue/Speed                        !
227      ELSE                                    !
228          Uee=0.0D0                           !
229      END IF                              !...!

The function call to FNKX is being passed only scalar variables, and returns a scalar value.

mecej4 · ‎04-17-2017

Communication is error-prone when you and I are looking at different versions of objects, listings and map files. I don't know which compiler version and which libraries you are using, but I think that you should use the "Rva+Base" column of the map file. That means that the break-on-violation address, 456C2, is in CALCV, and is at offset 456C2-439B0 = 1D12 from the start of CALCV.
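The lookup itself (find the largest Rva+Base entry that does not exceed the crash address) can be sketched in a few lines of Fortran. The table below uses the low five hex digits of the Rva+Base column from the map excerpt in the previous post, which assumes the image base contributes only the leading 100; the program name find_routine is arbitrary.

```fortran
program find_routine
  implicit none
  integer, parameter :: n = 7
  character(len=6) :: names(n)
  integer :: rva(n), a_crash, i, best

  ! Routine names and the low five hex digits of their Rva+Base entries.
  names = [character(len=6) :: 'CALCV', 'FLOWS', 'ENDS', 'EDGEV', &
                               'CALCP', 'CALCU', 'SETLIM']
  rva = [int(Z'439B0'), int(Z'45920'), int(Z'45BE0'), int(Z'45E20'), &
         int(Z'46070'), int(Z'46980'), int(Z'47540')]
  a_crash = int(Z'456C2')   ! low five hex digits of the crash address

  ! Pick the entry with the largest address that is still <= a_crash.
  best = 0
  do i = 1, n
    if (rva(i) <= a_crash) then
      if (best == 0) then
        best = i
      else if (rva(i) > rva(best)) then
        best = i
      end if
    end if
  end do

  if (best > 0) then
    write (*, '(A,A,A,Z5.5)') 'crash is in ', trim(names(best)), &
                              ' at offset ', a_crash - rva(best)
  else
    write (*, '(A)') 'crash address lies below every listed routine'
  end if
end program find_routine
```

With the values above, only CALCV starts at or below 456C2, so the program reports CALCV with offset 01D12.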

At this point I would need to see the /FAsc listing (*.COD) file, because the instructions and addresses depend a lot on the specific compiler version and compiler options. In fact, when I compiled your source file with /FAsc /c /MD /O2 /QxSSE3, the extent of the addresses in CALCV was only 1AD0, which is not compatible with the offset 1D12 noted earlier. You may have also noted that the crash offsets in #3 and #9 are different, probably because you used slightly different compiler options.

Please attach the map and .COD files, along with a screen capture of the abort message, if you don't mind.

You can also look up the lines of the source code in CALCV that correspond to an offset of 1D12 from the base of _CALCV. Finally, note that the crash address reported is the address of the very first instruction after the one that caused the crash.

Brian_Murphy · ‎04-17-2017

I think the offsets in #3 and #9 are different because of the WRITE statements which I added.

I had used the addresses appearing to the left of the column of routine names in the MAP file. Using instead the addresses to the right then points me to CALCV, and looking for 1D12 in the .COD file then leads me to line 421 which is in a routine named SLAND. This is the routine that calls CALCV. But line 421 is a call to routine CALCU which has no arguments (it gets its data from COMMON blocks). Line 421 happens to be between a pair of calls to CALCV (lines 403 and 435).

I have attached the COD and MAP files that go with the source file attached in post #7. I had to change their extensions to protect the innocent.

Considering what we know up to now, do you think it is reasonable to conclude this is not a math-related error like divide by zero or square root of a negative number?

For the Release build, I could try changing the optimization options. I see there are separate options for compiling and linking.

mecej4 · ‎04-17-2017

Based on the listings that you attached, I think that the crash occurred on the line

0638a     movaps xmm0, XMMWORD PTR [144+esp]     ;...\Calcsoln.f:994.11

which corresponds to the source line

          BV(J)=HV(J,K)*DXP(J)/BV(J)              !

Beyond that, I cannot say why there should be a problem. Your code contains clean Fortran 77, with all variables declared explicitly. On the other hand, you use /Qsave and almost all subroutine data exchange is done through COMMON blocks rather than argument lists, which makes debugging quite difficult.

You used the 13.1.3.198 compiler, which is a few years old. I would be curious to know if the problem is still present if a current compiler is used.

Brian_Murphy · ‎04-17-2017

Thank you for looking at the code. 13.1.3.198 is the latest compiler version I have available to me. I did not write any of this code. It was all written at Texas A&M University after I was a student there myself. Some of it is over 25 years old, and is used in many other codes. The /Qsave may have come from importing this from a Compaq fortran project.

I would like to understand what led you to line 994. What offset did you use, and how did you get it?

I put Write statements at two places in CALCV: near the entry to the routine, and right before the DO loop containing line 994, which is almost at the very end of the routine. The Release build crashes with that, and the last thing written to unit 6 is from the Write statement at the beginning of the routine. So if the Writes are getting flushed correctly by calls to Flush(6), the crash was triggered somewhere in CALCV, but before execution reached line 994. Do you agree?

mecej4 · ‎04-17-2017

Here is the address calculus:

(i) You gave the last five hex digits of the crash location as 456C2 in #9.

(ii) The next lower routine address in the MAP file is that of CALCV, at 439B0.

(iii) The difference between those numbers is 01D12.

(iv) We look for CALCV in the .COD file. It has a base address of 04680 (in the COD/OBJ file, prior to linking).

(v) We add the offset found in (iii) to that base address to obtain the post-crash address as 04680+01D12 = 06392.

(vi) Now look up the post-crash address in the COD file. The preceding address is 0638A. That is the address of the instruction that caused the access violation. At the end of that line in the COD file, you see the corresponding Fortran source line/column numbers: 994.11.
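The arithmetic in steps (i) to (v) can be checked with a short Fortran sketch. The variable names are arbitrary; the hex values are the ones quoted in the steps above.

```fortran
program cod_address
  implicit none
  integer :: a_crash, a_base, cod_base, offset, cod_addr

  a_crash  = int(Z'456C2')   ! (i)   last five hex digits of the crash address
  a_base   = int(Z'439B0')   ! (ii)  CALCV entry from the MAP file
  cod_base = int(Z'04680')   ! (iv)  CALCV base address in the .COD listing

  offset   = a_crash - a_base      ! (iii) offset within CALCV
  cod_addr = cod_base + offset     ! (v)   address to look up in the .COD file

  write (*, '(A,Z5.5)') 'offset within CALCV : ', offset
  write (*, '(A,Z5.5)') 'address in .COD file: ', cod_addr
end program cod_address
```

This prints 01D12 and 06392, matching steps (iii) and (v). Step (vi) is then a manual search of the listing for the instruction at or just before that address.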

Now that all these calculations have been done, you should be able to place a breakpoint at that location and run the debugger. When the breakpoint is hit, note the ESP value and do a single instruction step. The crash should occur and you should be able to see a stack trace.

All that, however, is a narrowly focused answer to the question of how to find the source line where the crash occurred. Now, given the size and code style of your program, I would switch to a different approach, since address chasing is tedious and necessary only when everything else fails. Instead of that, I suggest that you prepare a "reproducer" by removing as much of the code as possible: (a) replace the Excel front end by a Fortran driver that passes the same arguments to the routines in the DLLs that your VBA code calls. (b) When this has been done, you should be able to post the reproducer code here without worrying about disclosing proprietary information. If that is not possible, you can (c) provide the reproducer by private mail to one of the Intel people (which excludes me -- please note that I am a forum user just like you).
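A stand-alone driver for step (a) could look like the sketch below. The entry point name RUNCASE and its two integer arguments are hypothetical stand-ins for whatever routine the VBA code actually calls; the stub is included only so that the sketch compiles on its own.

```fortran
! Minimal driver that replaces the Excel/VBA caller.
program driver
  implicit none
  integer :: idum1, idum2

  ! The thread states that the two arguments are dummies, so any
  ! values should do; zero is an arbitrary choice.
  idum1 = 0
  idum2 = 0

  call RUNCASE(idum1, idum2)
  print *, 'RUNCASE returned normally'
end program driver

! Stub for the hypothetical entry point. Remove it and link the real
! sources (or the DLL's import library) in its place.
subroutine RUNCASE(idum1, idum2)
  implicit none
  integer :: idum1, idum2
end subroutine RUNCASE
```

If the real entry point is declared STDCALL, the driver would need a matching interface or directive so that both sides agree on the calling convention. Because the program reads its input from text files, the failing case should be reproducible by running the driver in the same working directory.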

Brian_Murphy · ‎04-17-2017

The case that triggers the crash used to run without crashing. I made one change to the code in an include file that contains array sizes (see below). The original code had MAXNXT=491, and I recently increased it to 1001 to enable running larger models. The code has been running all cases just fine until I stumbled onto the one that crashes. If I change MAXNXT back to 491, the case runs fine without crashing. I tried 1000, and that runs fine, too. Strange, but true. So I'm going to leave it at 1000 and continue testing, and be on the lookout for more crashes. With luck, if it happens again maybe a Debug build will snag on it, too. A Release build runs in 2/3 the time of a Debug build.

!C Maximum number of grid points in circumferential direction
      INTEGER MAXNXT
      PARAMETER (MAXNXT=1000)
!C Maximum number of grid points in axial direction
      INTEGER MAXNYI
      PARAMETER (MAXNYI=51)
!c maximum number of interpolation points for clearance evaluation
      INTEGER NSL
      PARAMETER (NSL=21)

      INTEGER MAXNXTP2
      PARAMETER (MAXNXTP2=MAXNXT+2)

mecej4 · ‎04-18-2017

In such a situation I would check that none of the integer parameters defined in the PARAMS.f file are to be found in other PARAMETER statements in your source files. Multiple inconsistent definitions of array sizes and common block variables are almost sure to cause problems.

Brian_Murphy · ‎04-18-2017

Thanks for the suggestions, mecej4. I will try these as soon as I can, hopefully later today.

jimdempseyatthecove · ‎04-18-2017

mecej4,

Great analysis work. Here is something else to consider.

Most of the code preceding the fault location uses scalar SSE instructions (sd suffix). The faulting section uses aligned vector instructions (aps suffix), and the listing comment "; Infreq" marks it as code that is entered infrequently. The problem may be an alignment issue rather than a null pointer reference (though the reported exception, code 0xC0000005, does not indicate this).

The code that I suspect is causing the problem is in the infrequently called section:

  06392 0f 29 84 f1 60 
        1f 00 00         movaps XMMWORD PTR [_DPUV+8032+ecx+esi*8], xmm0 ;C:\Users\Brian\Documents\Visual Studio 2010\Projects\Xlrotor\HsealH DLL\Calcsoln.f:994.11

Where the location of _DPUV and/or the offset 8032 was relocated to an inopportune (not alignable) address as a result of changing MAXNXT from 491 to 1001, thus exposing a compiler bug relating to alignment (there have been plenty of those over the past few years).
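To illustrate why a shift of one element can matter: movaps requires its memory operand to be 16-byte aligned, while double precision elements are only 8 bytes apart. The following standard-Fortran sketch prints the address of consecutive elements modulo 16. The array name work is hypothetical and unrelated to the DPUV common block.

```fortran
program check_align
  use iso_c_binding, only: c_loc, c_intptr_t
  implicit none
  double precision, target :: work(4)
  integer(c_intptr_t) :: addr
  integer :: i

  work = 0.0d0
  do i = 1, 4
    ! Reinterpret the element's C address as an integer.
    addr = transfer(c_loc(work(i)), addr)
    print '(A,I0,A,I0)', 'element ', i, ': address mod 16 = ', &
          mod(addr, 16_c_intptr_t)
  end do
end program check_align
```

Consecutive elements differ by 8 in their address modulo 16, so at most every other element can serve as the start of an aligned 16-byte access. Changing an array dimension from an odd to an even value (or the reverse) can therefore change which elements of later arrays in the same COMMON block are aligned.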

Try this:

Change MAXNXT (and nothing else from the working set) from 491 to 982 (2x491). In other words, increase the array size by a factor that is a power of 2 (the next step would be 4x491).

Jim Dempsey

mecej4 · ‎04-18-2017

Thanks, Jim, you have raised a good question, and I see that there are some long forum threads on misaligned operands and their effects on the running of code on various CPUs, some of them started by you.

I wonder if those problems could be mitigated by doing away with the common blocks and putting the common block data into modules that are used where needed. That would give the compiler more freedom to align individual data items.
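A minimal sketch of that idea, assuming the parameter names from the include file posted earlier in the thread: the module names grid_params and pressure_data, and the dimensions given to P, are hypothetical.

```fortran
! Array-size parameters defined once, in one place.
module grid_params
  implicit none
  integer, parameter :: MAXNXT   = 1000        ! circumferential grid points
  integer, parameter :: MAXNYI   = 51          ! axial grid points
  integer, parameter :: MAXNXTP2 = MAXNXT + 2
end module grid_params

! Hypothetical replacement for one COMMON block: the data lives in a
! module, and the compiler chooses the layout of each variable.
module pressure_data
  use grid_params
  implicit none
  double precision :: P(MAXNXTP2, MAXNYI)
end module pressure_data

program demo
  use pressure_data
  implicit none
  P = 0.0d0
  print *, 'P dimensions:', size(P, 1), size(P, 2)
end program demo
```

Every routine that previously declared the COMMON block would instead contain a USE statement, which also removes the risk of two routines declaring the same block with different sizes.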

jimdempseyatthecove · ‎04-18-2017

Brian is using (must use) an older version of the compiler, presumably with the alignment bug(s). Doing away with COMMON blocks would be difficult to achieve (as the above disclosed code may be a small fraction of the complete application). Brian's only controls at this point are:

a) don't mess up the alignment of things that formerly were accidentally aligned (and referenced with a requirement to be aligned).
b) Use instruction generation options that, for that compiler, do not generate instructions that require alignment (perhaps to use the FPU as opposed to SSE).

Option a) may be problematic in that this does not remove the code generation problem, it just avoids execution faults for test data actually used (not necessarily those used in the field).

As for options to affect alignment, Brian will have to issue

ifort /?>ifort.txt
notepad ifort.txt

and search for alignment options

/Zp16 /align:commons /align:array16byte

may work for the current compiler; these options may or may not be available on this older compiler. If they are available, then (Brian) configure for the failing situation (MAXNXT=1001) and add the options /Zp16 /align:commons /align:array16byte to test for a fix.

Jim Dempsey

Brian_Murphy · ‎04-18-2017

Some of the suggestions are over my head. But I was able to run the crash case with the following compiler options: /Zp16 /align:commons /align:array16byte which I found are present by looking at ifort /?

I ran that with the array size parameter set to 1001, and it still crashes.

I checked all code files for other PARAMETER statements, and found none.

The source code file Calcsoln.f is one of 20 files which make up this program, and it's not the biggest file.

The idea I like the best is to increase 491 by a power of two. So I'll make it 982 and continue with that. The professor that guided this program's original development tells me he guesses that 491 came from the maximum allowable value in some old compiler, probably running on an old PC in the days of DOS (the name Ryan-McFarland rings a distant bell).

Line 994 executes over 10,000 times in a Debug run of the case that crashes with a Release build.

I might be able to easily make this into an EXE instead of a DLL because no necessary arguments are passed in the CALL statement to the DLL. Is there an easy way to change a Visual Studio project to make an EXE? Or do I have to make a new project?