According to a little test program (attached), it looks as if stack memory is not freed after leaving a subroutine. I have had this problem with a much larger code, which after a while died because it was running out of memory. The problem can be worked around by adding the compiler flag "-heap-arrays". (This is, however, only a partial solution, because stack memory still seems to get eaten up in all the libraries that I use: MPI, HDF5, FFTW, ScaLAPACK.)
The simple test program calls a subroutine in which a temporary array is created, and it reports the RSS value (function mem_usage) before and after the call (and also the array size). The output when compiling with "ifort -cpp test.f -o test" is
872 before sub
Array size 31250
32220 after sub
So, it seems that the stack memory is not freed after leaving the subroutine. Compiling with "ifort -cpp -heap-arrays test.f -o test" gives
880 before sub
Array size 31250
988 after sub
This looks better. (With "-no-heap-arrays" we get the first output.) Tested versions are 12.1.3 20120212 and 16.0.2 20160204.
The code is attached. I don't know if this is a bug. I submit it as a question.
Thanks in advance!
I'll think later about getting around the non-portability of your /proc/ access. I intend to see whether the problem is reproduced with a current ifort.
It does look like a bug in automatic arrays. Generally, allocatable arrays are considered better practice (at least when the size is significant): they allow you to check the allocation status and to see whether explicit deallocation (required before Fortran 95) helps. It may be that automatic arrays are supported primarily for legacy code, on the assumption that they would not be used in new code.
In normal usage, if the procedure isn't called recursively and the automatic memory is reused, it may not become a problem. If it is called recursively, omitting the RECURSIVE declaration is a bad idea.
Thank you for your quick reply! The function mem_usage() is only included for testing, so non-portability is not really an issue.
In fact, if the array is defined as allocatable, it is put on the heap instead of the stack, and then there is no memory problem. The same effect is achieved by using the flag "-heap-arrays", but, as I mentioned, all the libraries that I use still have that problem.
The procedure is not called recursively, but usually the arrays are larger, and, in the case of an MPI code, every process has its own instance of the array. The problem is that the memory seems to be unavailable afterwards, so unfreed stack memory keeps accumulating until there is no free memory left.
What you think is happening (from reading your code) is not what actually happens when you execute it. I will try to explain:
/proc/.../status returns the virtual-memory footprint of the process. This is, at page granularity, all the pages touched (and currently held) by the process within the process's virtual address range. If the O/S provisions for 48-bit virtual addressing, this represents about 256TB of virtual address space. Your process will only consume those addresses actually touched, at page granularity. Page size can vary per process; the default may be 4KB, but pages may be 2MB or much larger.
If you ALLOCATE a large array, then, other than for any heap-node management headers that are created (and touched), and provided those virtual-address pages hadn't been touched before, those pages will not be mapped to your process and thus will not appear in the /proc/.../status report. Your assumption that DEALLOCATE frees virtual memory is false. DEALLOCATE returns the memory to the heap, but does not (necessarily) unmap the virtual-memory addresses (pages), neither in RAM nor in the page file.
Your sample code, with its fixed-size array, is NOT placing the array on the stack. By default it is placed in static memory, the same as if the array had the SAVE attribute. As a consequence, the process load may have wiped (loaded an empty block into) the static data area and thus "first-touched" the memory, causing it to appear in the /proc/.../status report.
Thank you for the explanation. I am actually not so worried about the numbers that the test program gives. The test program was just my attempt to reduce the problem to a test case that is easily understood (even if I have misinterpreted the memory issues; I am not a computer scientist). So, let me tell you what the problem is:
- When compiled without "-heap-arrays", my program (which uses a lot of memory and which is parallelized with MPI3) dies because it runs out of memory. (The system kills jobs that exceed 96GBytes per node.)
- When compiled with "-heap-arrays", it does not die (and it should not because my program needs less than the maximum 96GBytes).
- To analyze this problem, I have introduced the function mem_usage(), which tells me the current status of the RSS value while the program is running.
- It is my observation that the RSS value compares quite well with the memory demand that I would naively calculate from all my "allocates" and automatic arrays -- at least when I use "gfortran" or when I use "ifort" with "-heap-arrays". Even in the latter case, however, after a while the RSS value grows more strongly than expected, which I would attribute to the other libraries (MPI, FFTW, HDF5...) that are NOT compiled with "-heap-arrays".
Your situation is strange. I had a similar situation on my KNL system with 96GB. It may or may not be related to your situation. My application was hybrid:
MPI -> C# -> C++ unmanaged .so -> Fortran .so -> OpenMP
If your application is:
MPI -> someAppSpawningThreads -> OpenMP
MPI -> someAppSpawningThreads -> MKL (multi-threaded)
Then you may be seeing the same issue.
MKL (multi-threaded) internally uses OpenMP.
In my app, the C# program spawned-joined, spawned-joined, ..., using up different thread handles (though not more than a reasonable total at any one time). If your application performs spawn-join, spawn-join, ... threading, then the memory-consumption issue is that the underlying OpenMP runtime maintains per-thread-ID context for future use. As long as your app reuses spawned thread IDs, you will be OK. If your app keeps generating new thread IDs, you will be in trouble.
The "-heap-arrays" version might not be dying (yet) because the memory creep is slower.
The above said, you may have an entirely different issue.
I guess it is a different issue, because I do not use OpenMP. I use MPI3, which allows the use of shared memory. On the other hand, the cause might be the same ...
I looked at this case too and can reproduce the behavior. It does appear ifort may not be properly deallocating the automatic array. I submitted this to our Developers for further analysis.
(Internal tracking id: DPD200418254)
In fact, Development analyzed the report and indicated this is not a defect. What they indicated is:
What the test program measures is memory usage as seen by the operating system (OS). But each process additionally does its own memory management which is transparent to the operating system. One important fact to note is that after a process acquires memory from the OS, it is not required to immediately release it back to the OS when it doesn't need it. More often, the process itself keeps track of its free memory and reuses it for subsequent allocations.
This is what can be seen here: the program allocates some memory and no longer needs it. But the memory is not lost, the process would reuse it for its next allocations. You can see that if a single call to "sub" is replaced with for example 5 calls, the memory usage at the end will be identical.
The above is true for both stack and heap allocations. Now let's examine the differences between these two.
In the test program, ifort normally allocates the array on the stack. This is a much faster operation than a heap allocation. However, stack space is subject to a limit, so a program can run out of stack space even if there is still memory available. A user may configure the stack limit; one example is using the bash builtin "ulimit -s unlimited". Stack space, once allocated, is normally not returned to the OS - doing so would greatly hurt execution performance. But the process will reuse the stack space.
Heap memory can, but doesn't have to, be returned to the OS when it is freed. It is a decision of the heap memory allocator.
In the report I noted that ifort and gfortran had different default behavior and the developer noted that "ifort -heap-arrays" and "gfortran" do heap allocations for the test program. In other words, gfortran defaults to using the heap.
In short, there is no underlying defect. I hope that explanation helps.
Thank you. This is interesting.
In fact, when compiled with "gfortran -fstack-arrays" (gfortran then always uses the stack), the result is (nearly) identical to what "ifort" produces:
728 before sub
Array size 31250
31996 after sub
You write "Stack space, once allocated, is normally not returned to the OS." Do I understand correctly that a process will always drag along an ever-increasing chunk of stack memory? (The OS would keep this stack memory reserved for the process, independent of whether the stack is actually in use or just reserved for later.) This is, of course, problematic in the following situation:
First, the process uses a large stack and a small heap. Later, the process needs a large heap but no stack at all. But since the stack memory from before is not freed, the total memory demand will consist of the large heap memory (in use) plus the large stack memory (reserved for later). If the memory demand exceeds a certain limit, this can lead to an "out-of-memory" program crash.
Indeed, when compiled without "-heap-arrays", my program runs out of memory and dies. When compiled with "-heap-arrays", my program continues. (But even then the stack size grows because of calls to the libraries.) In other words, the program would not crash in the former case if the OS got back the stack memory that is unused anyway.
If this is so, it would be advisable to put large arrays on the heap instead of the stack, right? In addition, libraries that require large stack memory should be compiled with "-heap-arrays".
Kevin, #13 is an elaboration of what I discussed in #4.
In your test case, the loss isn't of much concern (when the process exits, the O/S reclaims the memory).
Some applications do exhibit memory creep. This is usually dependent on two factors:
a) memory allocation patterns that interrelate with
b) choice of heap (memory) manager
That results in memory fragmentation.
Typically, a heap manager that is designed for speed tends to fragment memory. Using what is commonly called a low-fragmentation heap is slower, as it has higher overhead from consolidating adjacent returned nodes.
For Virtual Memory systems, process memory isn't physically allocated until the page is touched (addresses are consumed, RAM and/or Page File is not).
You should place large allocations on the heap when you write a multi-threaded program. When you have a very small number of threads, stack-based allocations might be OK.
Consider what happens when you have several (8, 16) or lots (256 or more) of threads. The address space of a process is a limited resource: most operating systems tend to impose a maximum limit (typically the page-file size), and ulimit is bounded by this maximum. Ask yourself: with each thread in the application, how do you partition the addresses of a process?
1) subtract code space
2) subtract static data space
3) subtract initialization data
4) partition remainder into Heap and nThreads number of stacks
If only one or a few threads need the really large arrays on the stack, configuring for this results in the equivalent addressing capability being reserved for every other thread's stack as well. This may result in running out of addresses, as well as drastically diminishing the address space available to the heap (and to memory allocated by other means).