
David2 · ‎12-04-2013

I am trying to debug a long-running Windows Fortran model. The problem seems to appear only after the model has been running for over 100 hours. The model is a single-threaded Fortran application with lots of places to go wrong...

The process stops and a Windows error dialog appears stating:

Problem Event Name: APPCRASH

Fault Module Name: KERNELBASE.DLL

Exception Code: c0000005

This is a generic Access Violation code and I need more information. Even though I compiled with traceback on, no additional information is available. The process is still running until I close the dialog. I have tried dumping the stack and looking at it with WinDbg, but it seems only to tell me that I have requested a core dump. See attached file. Is there any way to get more information about where this error is occurring?

I recently found a solution to a similar problem by replacing an inline call with an explicit memory allocation:

Original -

[fortran]
Call write_stuff(group, &
    arpart(:,pack(idx_part, mask_part)), &
    alpart(:,pack(idx_part, mask_part)) )
[/fortran]

Replacement

[fortran]
allocate(ar(npseudo,records),al(npseudo,records))
ar = arpart(:,pack(idx_part, mask_part))
al = alpart(:,pack(idx_part, mask_part))
Call write_stuff(group, ar, al)
deallocate(ar,al)
[/fortran]

In this case, it seems that the call to write_stuff was not working well with the temporary arrays, in a strange way that depended on the size of the array. This kind of black-magic, works-sometimes stuff is extremely painstaking to find, especially when the error takes 100 hours to appear.

Is there some way to get more information when the dialog appears - a different core dump or some other analysis in windbg?

David

IanH · ‎12-04-2013

This is a bit of a wild guess... but I wonder if your program is running out of stack space, which then constrains the traceback error reporting in such a way that it cannot complete its job. In the past I have experienced a variety of atypical methods of crashing that have been associated with stack exhaustion.

Certainly, the explicit memory allocation solution that you show could be part of a stack exhaustion pattern. In the original code the compiler would (perhaps "might", given I'm not 100% sure about this...) create a temporary for the result of the PACK intrinsic references and also for the vector-indexed reference to bits of arpart. If that section of code was in a loop in a procedure that hung around for a lengthy period of time, then the progressive creation of temporaries might ultimately exhaust your stack, and boom, your program dies. When you move to explicit allocation, you take over management of the lifetime of the temporary arrays (plus the arrays are on the heap), and the lifetime of each temporary is explicitly confined to that stretch of code, not the lifetime of the procedure that contains that stretch of code.

However, the last paragraph might be complete nonsense.
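To make the temporary-lifetime argument concrete, here is a minimal sketch of the two forms. The array sizes, the mask, and the one-argument write_stuff are hypothetical stand-ins, not the original model code. Whether the compiler really places the temporary on the stack depends on the compiler and its options. Building with Intel Fortran's /check:arg_temp_created option may help confirm at run time that an argument temporary is being created.

[fortran]
program temp_lifetime_sketch
  implicit none
  integer, parameter :: npseudo = 3, npart = 8   ! hypothetical sizes
  real :: arpart(npseudo, npart)
  integer :: idx_part(npart), i, j
  logical :: mask_part(npart)
  integer, allocatable :: sel(:)
  real, allocatable :: ar(:,:)

  idx_part = [(i, i = 1, npart)]
  mask_part = mod(idx_part, 2) == 0              ! hypothetical mask: even indices
  do j = 1, npart
    arpart(:, j) = real(j)
  end do

  ! Form 1: vector-subscripted actual argument. The compiler may build
  ! temporaries whose placement and lifetime the programmer does not control.
  call write_stuff(arpart(:, pack(idx_part, mask_part)))

  ! Form 2: explicit heap copies, alive only between allocate and deallocate.
  allocate(sel(count(mask_part)))
  sel = pack(idx_part, mask_part)
  allocate(ar(npseudo, size(sel)))
  ar = arpart(:, sel)
  call write_stuff(ar)
  deallocate(ar, sel)

contains

  ! Stand-in for the real output routine; it only reports the shape it received.
  subroutine write_stuff(a)
    real, intent(in) :: a(:,:)
    print *, 'columns written:', size(a, 2)
  end subroutine write_stuff

end program temp_lifetime_sketch
[/fortran]

Both calls pass the same data; the difference is only in who owns the copy and for how long.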

Are you using the /heap-arrays:0 compiler option? If not, give it a go.

Exhaustion of heap space or virtual memory address space can also happen in long running programs, and might cause a bit of chaos in the subsequent error reporting.
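If heap or address-space exhaustion is a suspect, one low-cost check is to add stat= and errmsg= to the explicit allocations so that a failed allocation is reported by your own code instead of being left to the runtime error path. The sketch below assumes a hypothetical work array and size; it is an illustration, not code from the model.

[fortran]
program alloc_check_sketch
  implicit none
  real, allocatable :: work(:,:)     ! hypothetical work array
  integer :: ierr
  character(len=256) :: msg

  msg = ''
  ! With stat= present, a failed allocation sets ierr instead of aborting.
  allocate(work(1000, 1000), stat=ierr, errmsg=msg)
  if (ierr /= 0) then
    write(*, '(a,i0,2a)') 'allocate failed, stat=', ierr, ': ', trim(msg)
    stop 1
  end if

  work = 0.0
  deallocate(work)
end program alloc_check_sketch
[/fortran]

Writing the message to a log file the model already keeps would leave a record even if the later error reporting goes wrong.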

You could perhaps set a breakpoint on the Fortran runtime signal handler before you run your program. Inspecting its arguments might tell you what it was trying to report.

David2 · ‎12-04-2013

Thanks Ian

I like the idea of setting a breakpoint on the Fortran runtime signal handler, but I don't know how to do that, or even what to google. I have only ever set a breakpoint in VS by selecting a line number, but how do you break on the runtime? Also, I am a little reluctant to run this large model case on my development box inside VS. Is there a lighter-weight debug tool I can use to set the breakpoint on the server where I normally run the models?

As for the stack size, I will try that out tomorrow, but it will take me 4 days to find out the answer. Is there no way to tell what happened from a core dump, or more to the point, to generate a useful core dump from an application in that limbo AppCrashed state?

David

Steven_L_Intel1 · ‎12-05-2013

Yeah, it looks as if there's a problem while the Fortran RTL is trying to issue a traceback. You might want to try setting the environment variable FOR_DIAGNOSTIC_LOG_FILE to point to a writeable file path. Defining TBK_ENABLE_VERBOSE_STACK_TRACE to 1 might give you additional clues. As suggested, running out of stack might be a problem here.
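Since each test run takes days, it may be worth having the program record at startup whether these variables were actually visible to the process. The following is a small sketch using the standard get_environment_variable intrinsic; the program and routine names are hypothetical, and in practice the two calls would go at the top of the model's main program.

[fortran]
program log_diag_env
  implicit none
  call report('FOR_DIAGNOSTIC_LOG_FILE')
  call report('TBK_ENABLE_VERBOSE_STACK_TRACE')

contains

  ! Print the value of one environment variable, or note that it is missing.
  subroutine report(name)
    character(len=*), intent(in) :: name
    character(len=512) :: val
    integer :: n, stat

    call get_environment_variable(name, val, length=n, status=stat)
    ! status 0 means the value was read; -1 means it was read but truncated.
    if ((stat == 0 .or. stat == -1) .and. n > 0) then
      print '(3a)', name, ' = ', trim(val)
    else
      print '(2a)', name, ' is not set (or could not be read)'
    end if
  end subroutine report

end program log_diag_env
[/fortran]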

Bernard · ‎12-06-2013

It looks more like a hardcoded breakpoint which has been hit. At least the .cxr trap frame command reveals the exception code. @David, can you upload the full process minidump (not the triage file)? I would like to check the raw stack and search for the exception code.

David2 · ‎12-09-2013

Hi Iliyapolak

Thank you so much for your offer of assistance. Unfortunately I am unable to provide the full process minidump as there is proprietary information included in the state of some of the model variables. I am trying to reproduce the error with values that are not subject to these restrictions. I will let you know what progress I make. In the meantime, can you recommend any additional steps I can take in WinDbg to get information from the minidump? I am somewhat limited as the exe is 32-bit running on a 64-bit OS.

David

Bernard · ‎12-11-2013

Hi David

I understand your information-security concern. Unfortunately I do not know of any more efficient steps besides those which I have already recommended. Bear in mind that the power of WinDbg lies in its intrusive debugging capabilities and its use of heuristics in automated analysis. In WinDbg there is an option which disables extensive debugging, but this applies only to kernel-mode debugging (.secure). AFAIK a triage dump file hides some of the information, such as parameters passed to the called functions.

Bernard · ‎12-11-2013

@David

Consider using Application Verifier to perform extensive testing on your application.

David2 · ‎12-12-2013

@Ilyapoak
Can you suggest any tools or methods to extract information about which array has caused heap corruption in a program like this gist? This is obviously a trivial case.

I have posted this code with a separate question about why the behavior is different depending on the index which exceeds the bounds here.
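For reference, a minimal sketch of the kind of guard that names the offending array before the heap is damaged. The module, routine, and array names here are hypothetical and this is not the gist code. Compiling with Intel Fortran's /check:bounds option aims to give a similar report automatically, at some run-time cost.

[fortran]
module checked_store_mod
  implicit none
contains

  ! Store val in a(i) only if i is in range; otherwise name the array and stop.
  ! Note: inside this routine the assumed-shape dummy a(:) always has lower bound 1.
  subroutine checked_store(a, i, val, name)
    real, intent(inout) :: a(:)
    integer, intent(in) :: i
    real, intent(in) :: val
    character(len=*), intent(in) :: name

    if (i < lbound(a, 1) .or. i > ubound(a, 1)) then
      write(*, '(3a,i0,a,i0,a,i0)') 'out-of-bounds store to ', name, &
        ': index ', i, ', bounds ', lbound(a, 1), ':', ubound(a, 1)
      stop 1
    end if
    a(i) = val
  end subroutine checked_store

end module checked_store_mod

program checked_store_demo
  use checked_store_mod
  implicit none
  real, allocatable :: x(:)

  allocate(x(10))
  call checked_store(x, 10, 1.0, 'x')   ! in range, stored normally
  call checked_store(x, 11, 2.0, 'x')   ! out of range, reported and stopped
end program checked_store_demo
[/fortran]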

Bernard · ‎12-12-2013

Hi David

The best tool for troubleshooting application run-time errors like heap corruption or access violations is WinDbg. Heap corruption investigation techniques are heavily automated in WinDbg.

Debugging APPCRASH windows dialog