simple program uses all memory when stacksize set unlimited ...

Ernest_Bertrand · ‎03-27-2009

One of the users I support introduced an error into his code which, when run, used all of the memory on the computer and never returned, forcing us to cycle the power. This was unfortunate, as it is a multi-user environment. I've boiled his code down to a simple example that, most of the time, causes the same problem. Strangely, we sometimes have to run the executable twice in succession to see the massive memory leak. In any case, I wonder if anyone else can reproduce the problem (I will copy the program below). If you try it, I recommend that you do NOT set stacksize to unlimited, as many people recommend in these forums. Rather, set the stacksize to some value below your computer's memory limit but still large enough to see what happens (perhaps with top running in another terminal). I am using a server with 8 GB of RAM and set the stacksize to 6 GB. That said, I am happy if you do set it to unlimited and see what happens; we are forced to cycle the power when we do that.

We are running a current (supported) version of Ubuntu server 64bit on new Dell 1950 servers.

ifort -V reports "Build 20081105 Package ID: l_cprof_p_11.0.074" so that's pretty current as well.

Here's the code. Yes I know it has a bug in it. Nevertheless it compiles (I tried -warn and -check) and runs. I will be happy to hear that it causes problems elsewhere and not just on my collection of equipment and software. Perhaps it can be made even simpler.

      program Fail

c ... use ifort -r8 -g fail.f
c     ulimit -s unlimited
c     ./a.out
c     Uses all memory on machine and never returns

      implicit none
      integer im
      real x

      im = 100
      call fail_sub (x,im)

      end

      subroutine fail_sub (dout, len)
      implicit none

      integer, intent(in) :: len
      real, intent(out) :: dout(len)
      real(kind=8) din(len)

      dout(1:len) = din(1:len)

      return
      end

Ron_Green · ‎03-27-2009

I feel your pain - before joining Intel I was providing user support at a large national lab. I've also spent hours recovering my clusters when users do really stupid things like this.

This example is a good one since we can point out several tips, tricks, and techniques.

First, the user could have used -gen-interfaces -warn interfaces to catch this bug at compile time. A handy option to keep in your bag of debug tricks.

Second, unlimiting stack in general is a good idea. I always used to set up my clusters with the default stacksize set to unlimited. I just got tired of having to tell user after user to do this.

Code wise: INTERFACE use: If the user had used -gen-interfaces -warn interfaces it would have caught the bug. But the user could have used INTERFACE to define the subroutine within the code as well and/or put the subroutine in a MODULE and had USE statement(s) for it. There is a side benefit here for the user: by coding INTERFACE blocks or using modules the compiler has more information about the arguments being passed (so it can flag errors at compile time) but in addition it will often AVOID array temporary creation (on stack by default). This reduces memory requirements AND gives the user a performance boost since unnecessary copies are avoided.
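As a minimal sketch of the MODULE approach described above: the module name fail_mod and the initialization of din are assumptions added for illustration, and the source is free-form (for example a .f90 file) rather than the fixed-form fail.f from the original post.

```fortran
! Hypothetical module wrapper around the original fail_sub.
! A caller would write:
!     use fail_mod
!     real :: x(100)
!     call fail_sub(x, 100)
module fail_mod
  implicit none
contains

  ! Same argument list as the original fail_sub. Because the routine lives
  ! in a module, every caller that says "use fail_mod" is checked against
  ! this explicit interface at compile time.
  subroutine fail_sub(dout, len)
    integer, intent(in) :: len
    real, intent(out)   :: dout(len)
    real(kind=8)        :: din(len)   ! automatic array, as in the original

    din = 0.0d0                       ! assumption: the original never set din
    dout(1:len) = real(din(1:len), kind(dout))
  end subroutine fail_sub

end module fail_mod
```

With this in place, the original call that passes the scalar x to the array dummy dout should be rejected by the compiler as an argument mismatch instead of compiling cleanly and overrunning the stack at run time.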

Next, the way this code is written, array temps are created on the stack. However, -heap-arrays would override this and put the data on the heap. I did this for laughs and grins with your code - it seg faults right away, as the out-of-bounds access is caught almost immediately. Heap by its nature is a collection of disjoint memory segments, and thus segmentation violations here GENERALLY get caught faster than those originating on stack (which is a nice linear collection of memory-resident pages perfect for overrunning and corrupting). Also, when you corrupt the stack, very strange, unpredictable, and generally nasty things happen when RETURN is hit. Which brings up this point:

-check bounds

would IDEALLY catch the array bounds error in this code. However, it does not in this case. If you read the docs on this option, they will tell you that there are limits to what we check - and this argument is one of those cases where we do not check. We've had requests to add bounds checking for arguments and automatic arrays (and any temporaries) OR to allow the user to seed dynamically allocated data with certain values such as NaN to help catch these cases. This feature request is near the top of our list for a future version of the product (no guarantees or timelines here).
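Until the compiler checks this case, one option is for the routine to verify the extents itself. The following is only a sketch under stated assumptions: the names checked_copy_mod, checked_copy, and ok are hypothetical, and it relies on assumed-shape dummies, which need an explicit interface (here supplied by a module). It is not something the original code or the compiler does for you.

```fortran
! Hypothetical helper: the caller's array extents travel with the arguments,
! so the routine can refuse to copy when they do not match.
module checked_copy_mod
  implicit none
contains

  subroutine checked_copy(dout, din, ok)
    real, intent(inout)      :: dout(:)   ! inout: left untouched on mismatch
    real(kind=8), intent(in) :: din(:)
    logical, intent(out)     :: ok

    ! Copy only when source and destination have the same number of elements.
    ok = (size(dout) == size(din))
    if (ok) dout = real(din, kind(dout))
  end subroutine checked_copy

end module checked_copy_mod
```

The design choice here is to report the mismatch through ok rather than abort, so the caller decides how to react; stopping the program inside the routine would be an equally reasonable variant.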

And finally, Linux itself is not the best behaved OS when users run amok. Even good code can bring a cluster down IF the user accidentally sizes his/her simulation such that it overconsumes the memory on the nodes. This is why I'd run GANGLIA to monitor the cluster if you aren't already. This won't prevent the user from melting down the cluster BUT after the fact you have a record of their memory usage and you can go to them and say "See, you ding dong, you paged my cluster to meltdown. Resize your job and try again AND you owe me a beer."

So far we have yet to produce a user-proof compiler OR cluster.

Interesting issue, thanks for sending this one in. I'm sure others will have more comments to add. I think I've just scratched the surface on this one.

ron

jimdempseyatthecove · ‎03-28-2009

Ernest,

Take Ron's advice well - use interfaces.

The bad code sample you submitted is particularly nasty (aside from the fact it brought down your server).

In the caller

the scalar variables x and im are instantiated on the stack
references to these variables are pushed onto the stack
(or inserted into registers if on x64, but room for the stack references is reserved),
the call to fail_sub is made pushing the return address onto the stack

In the callee:

the store into dout overwrites the return address on the stack
the return from subroutine goes to la la land.
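To put a rough number on the overwrite described above, here is a small sketch. The module and function names are hypothetical, and it assumes the callee stores n contiguous elements starting at the address of the scalar actual argument.

```fortran
! Hypothetical helper for the arithmetic only; it does not touch the stack.
module overrun_mod
  implicit none
contains

  ! Bytes stored past the end of a scalar actual argument when the callee
  ! writes n elements of elem_bytes each through an array dummy.
  pure integer function bytes_past_scalar(n, elem_bytes) result(extra)
    integer, intent(in) :: n, elem_bytes
    extra = (n - 1) * elem_bytes
  end function bytes_past_scalar

end module overrun_mod
```

For the posted program, len is 100 and -r8 makes a default real 8 bytes, so bytes_past_scalar(100, 8) gives 792 bytes written beyond x; without -r8 (4-byte reals) it would be 396.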

The particularly nasty thing about this is the "la la land" often is somewhere else in the application, and that runs for some period of time before crashing. The crash traceback is particularly hard to follow due to the numerous false positives for errors.

It sounds like in your case the "la la land" was making system calls that chewed through your page file and, worse yet, appeared to cause a system deadlock. The system deadlock should not have been caused (unless your app is running with administrative privileges).

Jim Dempsey

Ernest_Bertrand · ‎03-30-2009

Thanks for the quick solutions. We weren't aware that -warn required -gen-interfaces (or I assume properly defined interfaces) to work. Since -check and -warn didn't report any errors or warnings we assumed all was well as far as the compiler was concerned (I guess it was, really).

I will be adding appropriate compiler options to all of the processes that I can and educating the users in other cases.

The process was not running with any special privileges. It just seems to continue to attempt to allocate more and more memory. Eventually the system starts killing other processes to compensate but can't keep up. With the paging to and from disk the system becomes completely unresponsive. I imagine if we waited long enough it might answer at the prompt.

Thanks again.

Steven_L_Intel1 · ‎03-30-2009

In a future release, -gen-interfaces will not need to be separately specified.

nooj · ‎04-01-2009

> In a future release, -gen-interfaces will not need to be separately specified.

I'm glad to hear that. I just found out that I had a bad argument list bug that -check and -warn did not catch. I looked hard many times for an option or technique to check argument lists, but didn't see one, and didn't put together "-warn interfaces" with -gen-interfaces.

Also, I once caused exactly the symptoms described in the original post--I use an unlimited stack size--and never found the error. It was probably something like this. I only saved the machine from reboot by realizing I could kill the process locally by terminating the ssh connection. (I run screen locally, not remotely.)

Thanks Ernest, for an extremely useful post!

- Nooj