Stack overflow in a recursive subroutine

krishnabc · ‎08-09-2011

Hello,

I encountered a stack overflow in a recursive sorting subroutine for three 1-D arrays: two integer(4) arrays and one real(8) array. The size of the arrays was 122,230,324. I expect even larger arrays for some other problems. I had used /heap-arrays:0.

Any suggestions would be greatly appreciated.

Thank you.
Krishna

mecej4 · ‎08-10-2011

Sorting algorithms that use comparisons have a best case recursion depth of order lg N when sorting N items. Algorithms without safeguards can degenerate to recursion depth of order N. Bugs in coding the sort algorithm can also lead to similar pathological behavior.

Even when you use the /heap-arrays option, recursion involves consumption of stack space. It would help if you give a short example with source code where the problem occurs.

Please state the compiler version and OS version.

jimdempseyatthecove · ‎08-10-2011

Some recursive routines are short and concise to state but are resource hogs.
Most of these recursive routines can be replaced with an iterative (loop-based) process that may have small demands on resources (memory, either allocated or stack).

You may be too young to know what a card sorter is or how it operates. A card sorter is a relatively simple machine. It uses an iterative process whereby it looks at one column of the card as it passes through a reader, then picks one of n bins to drop the card into. After a pass (or during the pass if you are deft enough with your hands) the bins are collected in bin order for use in the next pass, selecting on the next column of the card. (In the case of alphanumeric data, multiple passes on portions of the column may be necessary.)

Many good large data set sorting algorithms are based on the card sorter (distribute a sub-set to n buckets, then consolidate the buckets, then move on to the next sub-set). For text keys or integer keys a radix sort bin distribution can be used. For REAL data you may need to know the dynamic range and/or distribution of the data for the bin selection criteria. While this approach may require multiple passes on the data, the passes are sequential memory accesses and are friendly to cache access and hardware prefetching.
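For illustration, here is a minimal sketch of the card-sorter idea as an iterative radix sort. It makes several assumptions that are not from this thread: the keys are non-negative 32-bit default integers, one byte is examined per pass (256 bins, 4 passes), and the names radix_sort and perm are hypothetical. The routine returns a permutation so that companion arrays can be reordered the same way. Note that it needs two scratch arrays as large as the input, which matters at the array sizes discussed here.

```fortran
module radix_sort_mod
  implicit none
contains
  ! Iterative LSD radix sort of NON-NEGATIVE default-integer keys (assumption),
  ! one byte per pass. perm returns the ordering applied to the keys.
  subroutine radix_sort(keys, perm)
    integer, intent(inout) :: keys(:)
    integer, intent(out)   :: perm(:)      ! assumed to have size(keys) elements
    integer, allocatable   :: ktmp(:), ptmp(:)
    integer :: bin_count(0:255), bin_start(0:255)
    integer :: n, pass, shift, i, b, d

    n = size(keys)
    allocate(ktmp(n), ptmp(n))             ! scratch space lives on the heap
    do i = 1, n
      perm(i) = i
    end do

    do pass = 0, 3
      shift = -8 * pass                    ! negative shift = shift right
      ! Pass 1 over the data: count how many keys fall into each bin
      bin_count = 0
      do i = 1, n
        d = iand(ishft(keys(i), shift), 255)
        bin_count(d) = bin_count(d) + 1
      end do
      ! First output slot of each bin
      bin_start(0) = 1
      do b = 1, 255
        bin_start(b) = bin_start(b - 1) + bin_count(b - 1)
      end do
      ! Pass 2 over the data: drop each key into its bin, keeping input order
      do i = 1, n
        d = iand(ishft(keys(i), shift), 255)
        ktmp(bin_start(d)) = keys(i)
        ptmp(bin_start(d)) = perm(i)
        bin_start(d) = bin_start(d) + 1
      end do
      keys = ktmp
      perm = ptmp
    end do
  end subroutine radix_sort
end module radix_sort_mod

program radix_demo
  use radix_sort_mod
  implicit none
  integer :: keys(8) = [170, 45, 75, 90, 802, 24, 2, 66]
  real(8) :: vals(8) = [1d0, 2d0, 3d0, 4d0, 5d0, 6d0, 7d0, 8d0]
  integer :: perm(8)

  call radix_sort(keys, perm)
  vals = vals(perm)                        ! reorder a companion array the same way
  print *, keys
  print *, vals
end program radix_demo
```

There is no recursion here, so stack use stays constant regardless of N; the cost is the extra scratch memory and a fixed number of sequential passes.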

A fast sort that stack overflows is useless.

Jim Dempsey

krishnabc · ‎08-14-2011

Thank you mecej4 and Jim Dempsey for your insights.

Attached is the code requested by mecej4. Basically, the code is modified from ORDERPACK (authored by Michel Olagnon). I isolated the code from the bigger section and tried to reproduce the error on another computer with Windows 7 32-bit OS and 4 GB RAM, but couldn't succeed. It may be because, when I was solving a big problem, some of the stack space might have been used by other parts of the code or common blocks. [At this moment I cannot test on the computer where the error occurred last time (Windows 7 64-bit OS, 16 GB RAM), because that computer is being used for another big problem analysis and I have to wait for a week or so.]

If the error is OS dependent, I may have to consider using a loop iterative process suggested by Jim Dempsey. I am using the latest IVF 12 update 3 compiler.

To Intel people,

I did not get any runtime error for this problem in release mode. Only upon debugging (after an unusual result) did the output window show a stack overflow message, and the attached unhandled exception popped up at the recursive routine.

Thank you.
Krishna

jimdempseyatthecove · ‎08-15-2011

Your array size of ~123 million would recurse 26 times if split in half, but you terminate splits at 16 (2^4), so your call stack would require ~22 levels. From what I see in the calling parameters (references) and stack locals (all scalars), this should not have caused a stack overflow. This leaves a subscript out of bounds. I suggest you insert an array index test into your code to assert that your various indices are what you expect (1:Nzua). What you are likely seeing is that one of your index arrays (Irr, Jrr) is not having an element filled. IOW, an odd record count results in an odd number of result indexes, and your merge code assumes matched pairs of results. This in turn results in your code using an Irr(x)/Jrr(x) where the x'th element was not written with a valid index. This will not only present you with wrong results but may intermittently result in a segment failure, a symptom of accessing an address outside of your process's valid virtual memory.
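As a rough cross-check of the level estimate, here is a small sketch that counts how many times a range must be halved before a piece holds no more than a cutoff number of elements. It assumes perfectly even splits with the larger half rounded up, which a real sort will not achieve exactly; split_levels is a hypothetical name.

```fortran
module split_depth_mod
  use, intrinsic :: iso_fortran_env, only: int64
  implicit none
contains
  ! Number of halvings needed before a piece holds <= cutoff elements.
  ! Assumes cutoff >= 1 and ideal, even splits.
  pure function split_levels(n, cutoff) result(levels)
    integer(int64), intent(in) :: n, cutoff
    integer :: levels
    integer(int64) :: piece

    levels = 0
    piece = n
    do while (piece > cutoff)
      piece = (piece + 1_int64) / 2_int64   ! size of the larger half
      levels = levels + 1
    end do
  end function split_levels
end module split_depth_mod

program show_split_levels
  use, intrinsic :: iso_fortran_env, only: int64
  use split_depth_mod
  implicit none

  print *, split_levels(122230324_int64, 16_int64)   ! splits stop at 16 elements
  print *, split_levels(122230324_int64, 1_int64)    ! splits all the way down
end program show_split_levels
```

Under these assumptions it prints 23 and 27, in the same range as the rough figures above. Either way, a few dozen frames of scalar locals is far too little to exhaust a default stack.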

Jim Dempsey

krishnabc · ‎08-15-2011

Thank you very much Jim Dempsey. I'll check the array indices. However, I believe that you are touching the problem exactly. Does it mean that there is a problem in the subroutine where I generate Irr, Jrr? Or is it related to split termination at 16 (which I do not understand well)? It only happens when the array size is very big. Small or medium sized arrays are fine. Hence, I believe that my array generation routine is ok. Or will an explicit interface help? I would appreciate your suggestions on a possible resolution.

mecej4 · ‎08-16-2011

It is a characteristic of many sorting algorithms (especially those that switch over from one algorithm to another, as in your program) that, even when there is a bug in their implementation, they still sort the input correctly, but at the expense of drastically increased computing time and/or stack consumption.

Knuth comments about this characteristic in Volume 3 of his book The Art of Computer Programming.

As Jim has already written, quicksort involves O(lg N) recursion levels and O(N lg N) comparisons on average. Without safeguards, a naive implementation can see a degradation of performance to O(N) recursion levels and O(N²) comparisons. You may consider whether something is causing such behavior in your implementation. In particular, it would be easy to test whether the recursion level is O(lg N) or not, using values of N that do not cause the stack overflow that you reported.
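One way to run such a test is to carry a depth argument through the recursion and record its maximum. The sketch below does this for a plain quicksort on a single integer array; it is not the ORDERPACK-derived routine from this thread, and probe_sort and max_depth are hypothetical names. The same depth counter could be added to the real routine.

```fortran
module depth_probe_mod
  implicit none
  integer :: max_depth = 0        ! deepest recursion level seen so far
contains
  ! Plain recursive quicksort (middle-element pivot) that records its depth.
  recursive subroutine probe_sort(a, lo, hi, depth)
    integer, intent(inout) :: a(:)
    integer, intent(in)    :: lo, hi, depth
    integer :: i, j, pivot, tmp

    if (lo >= hi) return
    max_depth = max(max_depth, depth)
    pivot = a(lo + (hi - lo) / 2)
    i = lo
    j = hi
    do
      do while (a(i) < pivot)
        i = i + 1
      end do
      do while (a(j) > pivot)
        j = j - 1
      end do
      if (i <= j) then
        tmp = a(i); a(i) = a(j); a(j) = tmp
        i = i + 1
        j = j - 1
      end if
      if (i > j) exit
    end do
    call probe_sort(a, lo, j, depth + 1)
    call probe_sort(a, i, hi, depth + 1)
  end subroutine probe_sort
end module depth_probe_mod

program probe_depth
  use depth_probe_mod
  implicit none
  integer, allocatable :: a(:)
  real, allocatable    :: r(:)
  integer :: n, k, i

  do k = 10, 20, 5
    n = 2**k
    allocate(a(n), r(n))
    call random_number(r)
    do i = 1, n                   ! explicit loop avoids a large array temporary
      a(i) = int(r(i) * real(n))
    end do
    max_depth = 0
    call probe_sort(a, 1, n, 1)
    print '(a,i9,a,i5,a,f6.1)', 'N =', n, '  max depth =', max_depth, &
          '  lg N =', log(real(n)) / log(2.0)
    deallocate(a, r)
  end do
end program probe_depth
```

If the maximum depth grows roughly in step with lg N as N increases, the recursion is behaving; if it grows in proportion to N, something in the partitioning has degenerated.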

jimdempseyatthecove · ‎08-16-2011

The problem (IMHO) is not a stack overflow. The error message in the .png file stated: Access violation writing to location 0x0000000000030ca4. This is a virtual address that is not mapped and does not abut typical stack addresses, and if it did, the address offset (0ca4) is not within the relative size of the program's stack consumption (per nest level). This leaves only an invalid address generated by the user application.

My best guess is that one of his index tables is not filled in but is used under the assumption that it was filled in. A guess as to what is going on is that the program quits splitting at a threshold of 16 records (nothing wrong with this). However, the merge phase (or upper level merge phases) may assume multiples of 16 records, which is not always the case. When it is not the case and the errant code runs, there is a possibility that the extracted index represents a valid record number that is not the correct record number, and no GP fault occurs (bad results occur instead). At other times the incorrect index extracted results in an invalid address and a GP fault occurs.

If running in Debug build (_with_ array bounds checking enabled) does not locate the bug then I suggest the user adds

subroutine bugcheck(Nzua, i)
  implicit none
  integer, intent(in) :: Nzua, i
  ! Flag any index that falls outside the valid range 1:Nzua
  if ((i .le. 0) .or. (i .gt. Nzua)) then
    write(*,*) "Put break point here as there is a bug in your program"
  end if
end subroutine bugcheck

Then sprinkle into his code

call bugcheck(Nzua, imil)
call bugcheck(Nzua, ideb)
call bugcheck(Nzua, Irr(imil))
call bugcheck(Nzua, Irr(ideb))
IF (Irr(imil)THEN

xwrk = Irr(ideb)

call bugcheck(Nzua, Jrr(ideb))
ywrk = Jrr(ideb)

call bugcheck(Nzua, Crr(ideb))
zwrk = Crr(ideb)

...

IOW, add code to verify the indices and the contents of the indices in a manner that yields a debugger break point. Examining the state of the program at the point of discovery will inevitably yield clarity as to the problem.

Jim Dempsey

jimdempseyatthecove · ‎08-16-2011

I should add you can use the preprocessor and

#define chk(x) call bugcheck(Nzua, x)

then use

chk(imil)
chk(ideb)
chk(Irr(imil))
...

I am not sure about a contained (internal) subroutine within a recursive subroutine. If the contained subroutine is not an issue, then the bugcheck subroutine can be a contained subroutine and Nzua need not be passed. Efficiency is not the issue since the diagnostic code will be removed after the bug is fixed.
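On that open question: as far as the Fortran standard goes, a recursive subroutine may have an internal procedure after its own contains statement, and that procedure sees the host's Nzua by host association. Whether a particular compiler version handles it without trouble is worth confirming with a small test such as the sketch below, where walk and bad_index_count are hypothetical names and walk only stands in for the real recursive sort.

```fortran
module checked_walk_mod
  implicit none
  integer :: bad_index_count = 0   ! number of out-of-range indices seen
contains
  ! Stand-in for a recursive routine: visits idx(lo:hi) by repeated halving.
  recursive subroutine walk(idx, Nzua, lo, hi)
    integer, intent(in) :: idx(:), Nzua, lo, hi
    integer :: mid

    if (lo > hi) return
    mid = lo + (hi - lo) / 2
    call bugcheck(idx(mid))          ! Nzua is not passed
    call walk(idx, Nzua, lo, mid - 1)
    call walk(idx, Nzua, mid + 1, hi)
  contains
    ! Internal procedure: Nzua comes from the current invocation of walk
    subroutine bugcheck(i)
      integer, intent(in) :: i
      if (i <= 0 .or. i > Nzua) then
        bad_index_count = bad_index_count + 1   ! put a break point here
      end if
    end subroutine bugcheck
  end subroutine walk
end module checked_walk_mod

program walk_demo
  use checked_walk_mod
  implicit none

  ! 7 and 0 lie outside 1:5, so two bad indices should be counted
  call walk([3, 1, 7, 2, 0], 5, 1, 5)
  print *, 'bad indices found:', bad_index_count
end program walk_demo
```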

Jim Dempsey

krishnabc · ‎08-16-2011

Many thanks to both of you for the suggestions. I'll update you on the debugging outcome in a few days (possibly next week).

Best regards,
Krishna

krishnabc · ‎02-28-2012

Dear Jim Dempsey,

Sorry for not responding for a long time.

My problem is still there. I sprinkled the 'bugcheck' at different places in the code but, it doesn't seem to have an array bound problem. I also checked the arrays before calling the sorting routine. It couldn't find any issue with the bounds. For the problem I am trying to solve, it is trying to sort the arrays of sizes NZUA = 122750968. With bugcheck, it breaks at:

When it breaks, the values of variables are:

imil = 97239581

ideb = 97239548

ifin = 97239615

Irr(imil) = 831990

Irr(ideb) = 831990

Nzua = 122750968

In debug output window, the message was:

First-chance exception at 0x000000013fc69b29 in Program.exe: 0xC00000FD: Stack overflow.

First-chance exception at 0x771c3560 in Program.exe: 0xC0000005: Access violation writing location 0x0000000000180cf4.

Unhandled exception at 0x771c3560 in Program.exe: 0xC0000005: Access violation writing location 0x0000000000180cf4.

I also tried with all the dummy arrays as assumed shape declarations. For this case, the message in the debug output window was:

First-chance exception at 0x000000013f6a8598 in Program.exe: 0xC00000FD: Stack overflow.

First-chance exception at 0x771cfcab in Program.exe: 0xC0000005: Access violation writing location 0x00000000001d0fe8.

Unhandled exception at 0x771cfcab in Program.exe: 0xC0000005: Access violation writing location 0x00000000001d0fe8.

Hope you or someone else has any idea. Many thanks in advance.

Krishna

mecej4 · ‎02-29-2012

> Hope you or someone else has any idea?

Not me. Considering that, as far as a cursory glance shows, this is about sorting/partial-ordering, which is a well-worn problem, I'd guess that you have pushed the size of the data set so far beyond what is reasonable that many limits (register size, address limits, stack limits, OS limits, compiler limits) may have been exceeded.

If you feel that it is worth your time to pin down the reason for the failure, of course, you may do so. If the failure is caused by limits beyond the user's control, I feel that limiting the data set size is the sensible thing to do.

Considering the 4-1/2 month gap between postings, it seems that you may have reached a similar conclusion.

jimdempseyatthecove · ‎02-29-2012

Your screenshot indicates either

a) Line 161 imil = ... is getting the error writing to memory
IOW the address of imil got bunged up

b) Line 162 the call statement is getting the error writing the return address (args are likely passed in registers).

Use the disassembly window.

Report back:

stack register (rsp)
frame pointer (rbp)

use goto address, 0x771c3560 and see what is in the instruction

use Memory, examine 0x771c3500 (starting 0x60 bytes before the problem), set view to bytes
The purpose being to see if the code has been modified.
Note, you can copy a representative window around the address 0x771c3560, say -100:+100 bytes and save to text window (Readme.txt)

Start a new program run, set break points at lines 160 and 162. Break on the first time through that section of code, use the memory window around the address 0x771c3560, say -100:+100 bytes, and save to a text window (Readme.txt); note as "Before". Remove the break at 160, press F5 (continue), and see if you run to the break again or get the error on the first pass through the THEN section. Press F5 to check for the error on the second pass. No need to run more than 2-5 passes.

Note, report if the error occurs on any of the first few passes. After say 5 passes, remove the break, press F5 and see if it crashes. If so, see if the address is the same (0x771c3560) and recapture the instructions as before (making a 3rd report). See if the code is bunged up.

A few years ago I had a similar problem with a bug in Visual Studio (2005). Where the inspection of the code bytes did not show a problem. However the code bytes were altered during program execution, and restored at break. The problem was tricky to find. I ended up adding code to compare the code bytes during run time. I found out that the debugger was setting a break point (one not shown in the break points window) at an address that was not at the start of an instruction byte stream. This resulted in the instruction being executed with incorrect address information. The fix was relatively simple: Open the break points window and select "Delete all break points" (do not delete one by one, or select all and delete selected).

Jim Dempsey

krishnabc · ‎02-29-2012

I'll try a run after "deleting all break points" (Ctrl + Shift + F9). However, I think I am close to agreeing with mecej4. Looking at my problem size and expecting even larger problems in the future, I need to look for alternative ways to either avoid sorting or do it another way.

Many thanks Jim and mecej4.

SergeyKostrov · ‎03-04-2012

Quoting krishnabc

...The size of the arrays was 122,230,324. I expect even larger arrays for some other problems...

Please take a look at how much memory will be used at the beginning of processing:

122,230,324 * 2 * 4 = 977,842,592 bytes - 2 integer arrays
122,230,324 * 1 * 8 = 977,842,592 bytes - 1 double-precision array
1,955,685,184 bytes - in total for three arrays

2,147,483,648 bytes = 2 GB - this is the maximum amount of memory an application can allocate / use on a 32-bit Windows platform
-
1,955,685,184 bytes
=
191,798,464 bytes - this is the amount of memory left, without taking into account the memory for the Fortran application and dependent DLLs.

I'm not sure that ~190 MB of memory will be enough to sort all three arrays using a recursive QuickSort algorithm that "switches" to InsertSort at some threshold. I'll do a test with one 122,230,324-element array and the QuickSort algorithm in order to see how much memory will be used.
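The arithmetic above can be reproduced with 64-bit integers, which avoids overflowing a default 32-bit integer on the intermediate products. This is only a sketch of the byte counts; three_array_bytes is a hypothetical name, and the 2 GB figure is the per-process limit assumed above.

```fortran
module footprint_mod
  use, intrinsic :: iso_fortran_env, only: int64
  implicit none
contains
  ! Bytes needed for two integer(4) arrays and one real(8) array of n elements
  pure function three_array_bytes(n) result(bytes)
    integer(int64), intent(in) :: n
    integer(int64) :: bytes
    bytes = n * 2 * 4 + n * 1 * 8
  end function three_array_bytes
end module footprint_mod

program array_footprint
  use, intrinsic :: iso_fortran_env, only: int64
  use footprint_mod
  implicit none
  integer(int64), parameter :: n = 122230324_int64
  integer(int64), parameter :: limit32 = 2147483648_int64   ! assumed 2 GB limit
  integer(int64) :: total

  total = three_array_bytes(n)
  print '(a,i0)', 'bytes in three arrays:   ', total
  print '(a,i0)', 'left under a 2 GB limit: ', limit32 - total
end program array_footprint
```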

krishnabc · ‎03-04-2012

Sergey Kostrov,

Thanks for memory checks. I am trying to solve this problem on a 64-bit Windows platform.

jimdempseyatthecove · ‎03-05-2012

Krishna,

I downloaded your test program and module files.

Using:

Windows 7 x64 (16GB RAM)
Core i7 2600K
Microsoft Visual Studio 2010 Version 10.0.40219.1 SP1Rel
Intel Parallel Studio XE 2011
Intel Visual Fortran w_fcompxe_2011.9.300
(above is what VS Help shows for IVF)
Output window lists: Intel Visual Fortran Compiler XE 12.1.3.300 [Intel 64]...

Debug, x64 build

Input 122230324 (size you reported in first post)

The program ran to completion.

(no changes to default options other than to add x64 configuration)

Jim Dempsey

jimdempseyatthecove · ‎03-05-2012

BTW, the SORT portion (sans build of arrays) took about 13 seconds in Release Build x64.

In Debug Build

Run up to, but not including allocation of the arrays of 122,230,324 elements each.
Task Manager showed application footprint 704KB.
After allocate 708KB.
What this means is the allocation acquired address space but had not yet acquired memory/page file (which apparently will be deferred until first touch).
As initialization loop ran, the footprint grew. Final size was 1,914,324KB.
Stepping over the call to SORT only bumped the size up a few KB (recursion level may have been about 27 levels).

Did your system run out of space for its page file?

Jim Dempsey

SergeyKostrov · ‎03-05-2012

Quoting jimdempseyatthecove

...
As initialization loop ran, the footprint grew. Final size was 1,914,324 KB.
...

It matches the estimated value:
...
122,230,324 * 2 * 4 = 977,842,592 bytes - 2 integer arrays
122,230,324 * 1 * 8 = 977,842,592 bytes - 1 double-precision array
1,955,685,184 bytes - in total for three arrays
~1,909,849 KB

SergeyKostrov · ‎03-05-2012

This is really a task for a 64-bit platform.

Based on results of my tests this is an extreme case for a 32-bit platform and only one 128MB array
could be sorted at a time.

Here are some statistics for a pure QuickSort algorithm:

Array size: 16777216 // 2^24 - 16MB - Sorted
Array size: 33554432 // 2^25 - 32MB - Sorted
Array size: 67108864 // 2^26 - 64MB - Sorted
Array size: 134217728 // 2^27 - 128MB - Sorted
Array size: 268435456 // 2^28 - 256MB - Failed - Not enough memory
Array size: 536870912 // 2^29 - 512MB - Failed - Not enough memory
Array size: 1073741824 // 2^30 - 1GB - Failed - Not enough memory

A "crash point" for a 32-bit platform with 2GB limit is somewhere between 128MB and 256MB.

jimdempseyatthecove · ‎03-06-2012

Krishna,

On the post with the screenshot you list in the text below the screenshot:
>>

First-chance exception at 0x000000013fc69b29 in Program.exe: 0xC00000FD: Stack overflow.

First-chance exception at 0x771c3560 in Program.exe: 0xC0000005: Access violation writing location 0x0000000000180cf4.

Unhandled exception at 0x771c3560 in Program.exe: 0xC0000005: Access violation writing location 0x0000000000180cf4.
<<

Note:

"at 0x000000013fc69b29"
"at 0x771c3560"

Did you omit the leading zeros?
Or were the leading zeros omitted in the actual error message?

If they were omitted in the original error message, then this leads me to guess that you may be linking to a 32-bit DLL that operates in conjunction with a 64-bit application via "thunks". There are limitations on the data that passes between the 64-bit address space and the 32-bit address space.

Does the mocked-up test program you posted fail for you when compiled as 64-bit application?
(IOW you compile the actual code you sent, as opposed to assuming it is the same as compiling equivalent code in your application.) We want you to eliminate unknowns for us.

Jim Dempsey