Help/suggestions debugging memory corruption problem

j0e · ‎08-06-2011

Hi All,

I have a program written entirely in Fortran that solves an optimization problem where the objective function is defined by a set of ODEs. The problem can take hours to run, and most of the time it runs fine. However, on occasion the program effectively locks up: there are no access violations, and the process continues to run, but it appears to be stuck in a loop and makes no progress.

If I attach to the running process and break it (release code compiled and linked with debugging info), the process is stuck at one statement, executing what is presumably an infinite loop of assembly code. I assume some memory has been corrupted by writing to an array beyond its bounds, or something similar. Here are some of the problems I have in trying to debug:

If I run the program in full debug mode, it simply takes too long to get to the problem. Furthermore, there is no guarantee that the problem will arise in debug mode, since slightly different rounding errors can lead the solution to a different local optimum, where the lock-up does not occur.
When I attach to the running process, many of the variables cannot be examined because the debugger says "cannot view register variable". So I have not been able to get much info from the debugger as yet, other than the statement where the problem arises.
I have run the release code with "check array bounds" turned on. Interestingly, this causes the code to execute differently (sensitivity problem mentioned above), but the code still locks up w/o generating any errors.

Any general help/suggestions on compiler options or features of the VS debugger (I'm sure there are many I'm unaware of) that may help ID the problem would be greatly appreciated.

Thanks for your help!
-joe

TimP · ‎08-06-2011

Only full debug mode could avoid that "register variable" visibility problem. If it doesn't hang at -O1 -fp:source, get a full set of subroutines compiled individually with those options and with your normal optimization, and do a binary search to find out which subroutine is hanging.

Stating your compiler version and compile options would help narrow down our responses.

mecej4 · ‎08-07-2011

Is it feasible for you to provide the source code, at least the part of it where execution is getting stuck? Even better would be a stripped down version of your code that can be compiled and run.

j0e · ‎08-07-2011

@TimP
I like this idea, but I'm not sure it will work, because changing conditions even a little can keep the problem from occurring. For instance, if I compile with /O2 optimization, the code runs OK, at least with the run parameters set to the values that cause a hang with /O3 optimization. The code also runs OK with /O3 if I set the compiler option /check:all (i.e., check for all run-time errors). Similar small changes to compiler options can also allow the code to run OK. That doesn't mean the problem is not there; it just doesn't occur with the particular model parameters that caused the problem before. To restate, sometimes the model runs fine, even with 1,000-10,000 objective function evaluations. In fact, I have another version of the code that uses a different optimization routine that takes advantage of MPI. I have had that code running on an 8-core workstation for several days at a time. It has never hung, even though the objective function has been called 10^7 times or more. The code for the objective function is almost the same, except for a few MPI statements.

I do know what statement line the code hangs on, when it does occur, because 1) when I attach to the process, it always points to the same statement, and 2) If I put print statements around the subroutine, such as "Entering" and "Exiting", I see the "Entering" but no exit.

Compiler info: OS, Windows 7 Pro x64; VS2008; Initially, I was using Parallel Studio XE for windows, Version 2011, Update 2. Last night I upgraded to Update 5 on the C and Fortran composers. Both versions had the program hang at the same statement line; however, each produced different results and the hang occurred at different times in the iteration.

All code is compiled for x64. All source code is compiled together, no libraries other than the default ones are used, and I have cleaned and rebuilt the code numerous times.

Sorry to be long winded here, and I realize most of the above info is of little to no help. I will try compiling subroutines with differing optimization levels and sub them in and out and see if I can spot the problem that way.

j0e · ‎08-07-2011

@mecej4
Yes, this would normally be the way I would proceed as well, and it is a good suggestion; however, I have only been able to get the problem to occur with the full version of the code. I've tried pulling pieces out and conducting Monte Carlo runs on them, but they don't hang...

The code hangs in this loop:

[fxfortran]        DO 2 K = 1,5
          BND = B(K,I)
          IF (K .LT. 5  .AND.  ABS(BND) .GE. BMX) GO TO 1
          IF (K .LE. 2) THEN
            IFL = 3 - 2*K
            S = SIG0 (X(I),X(I+1),Y(I),Y(I+1),YP(I),YP(I+1),
     .                IFL,BND,TOL, IERR)
          ELSEIF (K .LE. 4) THEN
            IFL = 7 - 2*K
            S = SIG1 (X(I),X(I+1),Y(I),Y(I+1),YP(I),YP(I+1),
     .                IFL,BND,TOL, IERR)
          ELSE
            IF (BND .EQ. 0.D0) GO TO 1
            IFL = -1
            IF (BND .GT. 0.D0) IFL = 1
            S = SIG2 (X(I),X(I+1),Y(I),Y(I+1),YP(I),YP(I+1),
     .                IFL,TOL, IERR)
          ENDIF
          IF (IERR .EQ. -2) THEN
C
C   An invalid constraint was encountered.  Increment
C     ICFLG(I).
C
            ICFLG(I) = ICFLG(I) + ICFK
          ELSE
C
C   Update SIG.
C
            SIG = MAX(SIG,S)
          ENDIF
C
C   Bottom of loop on constraints K:  update ICFK.
C
    1     ICFK = 2*ICFK
    2     CONTINUE[/fxfortran]

The code appears to hang on line 6, where it makes the call to the function SIG0. This code is from TSPACK, published in TOMS, and I do not think it is the cause of the problem; rather, I think it is where corruption caused by my own code shows up.

TimP · ‎08-07-2011

Apparently, you've identified a subroutine which gives you trouble at -O3 but not -O2, which takes care of the subroutine level search question. /check could easily cancel any difference between O2 and O3.
A number of issues of superficially similar nature have come up lately; I doubt we can give more advice without at least seeing some of the source code there. If necessary, you could submit a problem report on premier.intel.com

j0e · ‎08-07-2011

OK, thanks. Because of the erratic nature of the problem, I didn't expect any problem specific help. I'm just looking to see if anyone has some general debugging ideas that I might be able to use to locate the problem.

jimdempseyatthecove · ‎08-07-2011

I am assuming you have done these preliminary tests, but I will ask anyway:

What is I at crash?
Are I and I+1 within the index ranges of B,X,Y,YP?
What is TOL?
Is SIG0 a convergence routine to within tolerance of TOL?
If so, is TOL too small for convergence?
(e.g. used to work with the 80-bit x87 FPU (64-bit mantissa), but will not work with 64-bit SSE doubles (53-bit mantissa))

Can you capture the arguments passed to SIG0 (and still produce the hang)?
(e.g. make an enveloping subroutine that stores the args into a findable structure)
If so, does a direct call to SIG0 with those args immediately hang?
If it does, expect sensitivity of a too-small TOL to the floating-point precision in use.
If it does not, suspect SIG0 (or the code it calls) of corrupting the stack copy of the loop control variable(s) of your DO 2 K loop.
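
One way to build such an enveloping routine is sketched below in free-form Fortran. The names sig0_capture, sig0_logged and sig0_args.log are hypothetical, the argument order is taken from the call shown earlier in the thread, and the SIG0 here is only a stand-in stub so that the sketch compiles on its own (in the real program the TSPACK function would be linked instead). Each argument set is written with enough digits to round-trip a double and is flushed before the call, so the last record in the file belongs to the call that hung.

```fortran
! Stand-in stub so the sketch is self-contained; NOT the TSPACK routine.
double precision function sig0(x1, x2, y1, y2, yp1, yp2, ifl, bnd, tol, ierr)
  implicit none
  double precision, intent(in) :: x1, x2, y1, y2, yp1, yp2, bnd, tol
  integer, intent(in)  :: ifl
  integer, intent(out) :: ierr
  ierr = 0
  sig0 = 0.0d0
end function sig0

module sig0_capture            ! hypothetical module name
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer :: log_unit = -1     ! opened on first use
contains
  function sig0_logged(x1, x2, y1, y2, yp1, yp2, ifl, bnd, tol, ierr) result(s)
    real(dp), intent(in)  :: x1, x2, y1, y2, yp1, yp2, bnd, tol
    integer,  intent(in)  :: ifl
    integer,  intent(out) :: ierr
    real(dp) :: s
    real(dp), external :: sig0
    if (log_unit < 0) then
      open(newunit=log_unit, file='sig0_args.log', status='replace', &
           action='write')
    end if
    ! Record the arguments BEFORE the call and force them to disk.
    write(log_unit, '(8es26.17e3,1x,i0)') x1, x2, y1, y2, yp1, yp2, &
                                          bnd, tol, ifl
    flush(log_unit)
    s = sig0(x1, x2, y1, y2, yp1, yp2, ifl, bnd, tol, ierr)
  end function sig0_logged
end module sig0_capture

program capture_demo
  use sig0_capture
  implicit none
  real(dp) :: s
  integer  :: ierr
  ! Illustrative argument values only.
  s = sig0_logged(0.0_dp, 1.0_dp, 0.0_dp, 1.0_dp, 1.0_dp, 1.0_dp, &
                  1, 0.5_dp, 0.0_dp, ierr)
  print *, 'sig0 returned', s, ' ierr =', ierr
end program capture_demo
```

In the loop above, the calls to SIG0 would be replaced by calls to the wrapper; the logged values can then be fed to a standalone driver to see whether the hang reproduces in isolation.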

Jim Dempsey

j0e · ‎08-07-2011

I've been working on this for several hours now, and this is not so easy to assess. I have yet to get code compiled for the debugger that reproduces the problem. It would take days to get to the problem with debugger code, assuming the problem would even occur. Consequently, I must run with release code.

However, I did start inserting more hard code for debugging. I have found the following so far:

The variable values (and array bounds) passed to SIG0 are fine. No problems with I+1, and TOL can be exactly zero (and it is passed as such). There appear to be no problems with the arguments passed, and those arguments print as expected from within SIG0.
An infinite loop DOES occur within SIG0, and explains the hang-up.
However, if I pull the code for SIG0 out (and two dependent routines it calls), compile it, and pass to it the exact same arguments that cause the hang-up in the full code, the hang-up never occurs. In fact, the code exits before it ever gets to the point where the infinite loop occurs. The test that determines whether the routine should exit appears not to work correctly when the whole code is compiled. I'm still working on why this happens, but it is a rather slow process without the use of the debugger.

An interesting side note: it's almost like I'm working on a quantum system with entanglement. On some occasions when I have added debugging statements that only print out variable values, the iteration count at which the hang-up occurs changes. It seems sampling the system causes it to behave differently, like collapsing the wave function, lol. My current suspicion is that the O3 optimization changes the execution order in a way that causes the problem, but I can't say that for sure yet.

mecej4 · ‎08-07-2011

Is the source code for function SIG0 short enough to post?

I second Jim's suspicion that a tolerance value has been specified that is too small to satisfy with the algorithm used. Seeing the source for SIG0 may help address that doubt as well.

j0e · ‎08-07-2011

OK, I think I've found the problem. Code in TSPACK (which was published as TOMS Algorithm 716 in 1993 for producing constrained splines) has the following test:

IF (A0 .EQ. 0.D0) THEN

where A0 = 3.D0*(S-T0), and S and T0 are derived from arguments passed to SIG0. The values I get for S, T0 and A0 where the problem occurs are:

S = -0.252797068929040
T0 = -0.252797068929040
A0 = 1.665334536937735E-016

Even though A0 should really be treated as 0.0D0 here, it is not. This ultimately causes the infinite loop in the code. Interestingly, when I try to duplicate this with extracted code, I get

S = -0.252797068929041
T0 = -0.252797068929043
A0 = 6.661338147750939E-015

But the value of A0 in this case does not cause the infinite loop.
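
The first A0 above equals 3*2^-54, which is what results when S and T0 differ by a single unit in the last place at this magnitude. The sketch below reproduces that situation under an assumption: the actual bit patterns of S and T0 were not posted, so T0 is taken as the printed value and S as the next representable double above it. The two values print identically to 15 decimal places, yet A0 is nonzero and the .EQ. 0.D0 test fails.

```fortran
program one_ulp_demo
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  real(dp) :: s, t0, a0

  t0 = -0.252797068929040_dp      ! value as printed in the post
  s  = nearest(t0, 1.0_dp)        ! assumption: S is one ulp above T0
  a0 = 3.0_dp*(s - t0)

  print '(a,f18.15)',  'S             = ', s
  print '(a,f18.15)',  'T0            = ', t0
  print '(a,es22.15)', 'A0            = ', a0
  print '(a,es22.15)', '3*spacing(T0) = ', 3.0_dp*spacing(t0)
  print '(a,l1)',      'A0 .eq. 0.0d0 : ', a0 == 0.0_dp

  ! Self-check: the difference of adjacent doubles is exact.
  if (a0 /= 3.0_dp*spacing(t0)) error stop 'unexpected A0'
end program one_ulp_demo
```

By the same arithmetic, the second A0 (6.661338147750939E-015) is 40 times larger, i.e. S and T0 were about 40 units in the last place apart in the extracted code, which is consistent with the printed values differing in the last digit.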

This explains all the subtle rounding and compiling problems....

Thanks for everyone's help, your suggestions did allow me to isolate the problem!
cheers,
-joe

DavidWhite · ‎08-07-2011

Joe,

Exact equality comparisons of computed reals against zero are almost never going to work reliably. If you can possibly determine an appropriate scale and tolerance, you should make the comparison like
IF (ABS(A0).LT.TOL) THEN ...

Given that this is old code, I suggest you search carefully for any more traps like this one. The code could work perfectly for years, but is inherently flawed and will surprise you by failing at some time in the future.

Regards,

David

jimdempseyatthecove · ‎08-08-2011

Joe (and anyone else reading this post)...

This is an example of why a convergence routine should not loop until the result .EQ. 0.0D0 (or .EQ. x). These routines should be written to converge when the intermediate result is within proximity of the desired value .AND. where the variance in that proximity (from iteration to iteration) is calculable. See EPSILON in the IVF documentation. In this case consider using:

IF(ABS(A0) .LT. MAX(ABS(S),ABS(T0))*EPSILON(A0)) RETURN

Note: A0, S, and T0 are assumed to be the same type, and you might want to consider convergence at some multiplier above or below 1.0D0 * MAX(ABS(S),ABS(T0))*EPSILON(A0),
and/or conditioned on the sign of the convergence variable (A0).
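
The choice of multiplier matters for the values Joe reported. The sketch below wraps the test in a hypothetical helper, negligible, with the multiplier as an argument, and applies it to the posted S, T0 and A0 (typed in as printed, so they are approximations of the in-memory values). With a multiplier of 1 the threshold is about 5.6E-17 and the reported A0 is not caught; with a multiplier of 4 it is. The value 4 is illustrative only, not a recommendation.

```fortran
program relative_test_demo
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  real(dp), parameter :: s  = -0.252797068929040_dp
  real(dp), parameter :: t0 = -0.252797068929040_dp
  real(dp), parameter :: a0 = 1.665334536937735e-16_dp

  print '(a,es10.3)', 'threshold, factor 1: ', &
        max(abs(s), abs(t0))*epsilon(a0)
  print '(a,l1)', 'negligible, factor 1: ', negligible(a0, s, t0, 1.0_dp)
  print '(a,l1)', 'negligible, factor 4: ', negligible(a0, s, t0, 4.0_dp)

contains

  ! Hypothetical helper: relative test with a caller-chosen multiplier.
  pure logical function negligible(a, u, v, factor)
    real(dp), intent(in) :: a, u, v, factor
    negligible = abs(a) < factor*max(abs(u), abs(v))*epsilon(a)
  end function negligible

end program relative_test_demo
```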

These routines should also be tested to assure convergence regardless of input data. And optionally handle NaN (at least in diagnostic build).

Jim Dempsey

j0e · ‎08-08-2011

Yes, I completely agree with the previous posts. The code that caused the problem was not mine, but rather one published by TOMS (see http://portal.acm.org/citation.cfm?id=151277). Testing any real number against any specific value is a bad idea.

I have modified the test to

IF (abs(A0) .le. epsilon(a0)) THEN ...

But this may not be a sufficient fix for all cases. I have decided just to drop TSPACK altogether and use my own linear interpolation routines.
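
A small sketch of why an absolute EPSILON threshold depends on scale: EPSILON(A0) is the spacing of doubles near 1.0, so the test only absorbs last-place differences when S and T0 are of order one or smaller. The magnitudes below are illustrative assumptions, not values from the program; at each one, S is set one representable value above T0.

```fortran
program abs_eps_scale_demo
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  real(dp) :: t0, s, a0
  integer  :: i

  t0 = 0.25_dp                       ! illustrative starting magnitude
  do i = 1, 3
    s  = nearest(t0, 1.0_dp)         ! one ulp above T0
    a0 = 3.0_dp*(s - t0)
    print '(a,es10.3,a,es10.3,a,l1)', 'T0 =', t0, '  A0 =', a0, &
          '  abs(A0) <= epsilon(A0): ', abs(a0) <= epsilon(a0)
    t0 = t0*1000.0_dp                ! move to a larger magnitude
  end do
end program abs_eps_scale_demo
```

The test passes at 0.25 but fails at 250 and 250000, even though S and T0 are as close as two distinct doubles can be in every case.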

Thanks for your help!
cheers
-joe