Do you use MKL?

Christoph_F_ · ‎02-23-2017

Dear all,

according to a little test program (attached) it looks as if the stack memory is not freed after leaving a subroutine. I have had this problem with a much larger code, which after a while died because it was running out of memory. The problem can be solved by adding the compiler flag "-heap-arrays". (This is, however, only a partial solution to the problem because stack memory still seems to get eaten up in all the libraries that I use: MPI, HDF5, FFTW, ScaLAPACK.)

The simple test program calls a subroutine where a temporary array is created and reports the RSS value (function mem_usage) before and after (and also the array size). The output when compiling with "ifort -cpp test.f -o test" is

       872 before sub
Array size                 31250
     32220 after sub

So, it seems that the stack memory is not freed after leaving the subroutine. Compiling with "ifort -cpp -heap-arrays test.f -o test" gives

       880 before sub
Array size                 31250
       988 after sub

This looks better. (With "-no-heap-arrays" we get the first output.) Tested versions are 12.1.3 20120212 and 16.0.2 20160204.

The code is attached. I don't know if this is a bug. I submit it as a question.

Thanks in advance!

Christoph

TimP · ‎02-23-2017

I'll think about trying to get around the non-portability of your /proc/ access later on. I was intending to see if the problem is reproduced with a current ifort.

It does look like a bug in automatic arrays. Generally, allocatable is considered better practice (at least when the size is significant), and allows you to check status and see whether explicit deallocation (required before f95) helps. I don't know whether automatic array may be supported primarily for legacy code, on the assumption it would not be used for new code.

In normal usage, if the procedure isn't called recursively, and the automatic memory is re-used, it may not become a problem. If it is called recursively, omitting the recursive declaration is a bad idea.

Christoph_F_ · ‎02-23-2017

Thank you for your quick reply! The function mem_usage() is only included for testing, so non-portability is not really an issue.

In fact, if the array is defined as allocatable, it is put on the heap instead of the stack, and then there is no memory problem. The same effect is achieved by using the flag "-heap-arrays", but, as I mentioned, all the libraries that I use still have that problem.

The procedure is not called recursively, but usually the arrays are larger, and, in the case of an MPI code, all processes will have their own instance of the array. The problem is that it seems that the memory is not available anymore afterwards. So, unfreed stack memory keeps accumulating until there is no free memory anymore.

jimdempseyatthecove · ‎02-23-2017

Christoph,

What you think is happening (by reading your code), is not what is actually happening by executing your code. I will try to explain;

/proc/.../status will return the Virtual Memory footprint of the process. This is, with page granularity, all the pages touched (and currently held) by the process within the process Virtual Address range. If the O/S is provisioning for 48-bit physical addressing this represents about 256TB Virtual Memory address space. Your process will only consume those addresses actually touched in page granularity. Page size can vary with process, default may be 4KB, but may be 2MB, or much larger.

If you ALLOCATE a large array, other than for any heap node management headers that are created (and touched), and if those virtual address pages hadn't been touch before, those pages will not be mapped to your process, and thus will not appear in the /proc/.../status report. Your assumption that DEALLOCATE frees virtual memory is false. Deallocate return to the heap, but does not (necessarily) un-map the Virtual Memory addresses (pages), neither in RAM, nor page file.

Your sample code, showing fixed size static array, is NOT placing the array on stack. The default, places this in static memory, same as if the array had the SAVE attribute. As a consequence, the process load, may have wiped (loaded empty block) into the static data area, and thus "first-touched" the memory causing it to appear in the /proc/.../status report.

Jim Dempsey

Christoph_F_ · ‎02-24-2017

Thank you for the explanation. I am actually not so worried about the numbers that the test program gives. The test program was just my attempt to reduce the problem to a test case that is easily understood (even if I have misinterpreted the memory issues; I am not a computer scientist). So, let me tell you what the problem is:

- When compiled without "-heap-arrays", my program (which uses a lot of memory and which is parallelized with MPI3) dies because it runs out of memory. (The system kills jobs that exceed 96GBytes per node.)

- When compiled with "-heap-arrays", it does not die (and it should not because my program needs less than the maximum 96GBytes).

- To analyze this problem, I have introduced the function mem_usage(), which tells me the current status of the RSS value while the program is running.

- It is my observation that the RSS value compares quite well with the memory demand that I would naively calculate from all my "allocates" and automatic arrays -- at least when I use "gfortran" or when I use "ifort" with "-heap-arrays". Even in the latter case, however, after a while the RSS value grows more strongly than expected, which I would attribute to the other libraries (MPI, FFTW, HDF5...) that are NOT compiled with "-heap-arrays".

Christoph

jimdempseyatthecove · ‎02-24-2017

Your situation is strange. I had a similar situation on my KNL system with 96GB. It may or may not be related to your situation. My application was hybrid:

MPI -> C# -> C++ unmanaged .so -> Fortran .so -> OpenMP

If your application is:

MPI -> someAppSpawingThreads -> OpenMP
or
MPI -> someAppSpawingThreads -> MKL (multi-threaded)

Then you may be seeing the same issue.

MKL (multi-threaded) internally uses OpenMP.

In my app, the C# program spawned-joined, spawned-joined, ... using up different thread handles (though not more than a reasonable total at any one time). If your application is performing spawned-joined, spawned-joined, ... threading, then the memory consumption issue is the underlying OpenMP maintains thread ID'd context for future use. As long as your app reuses spawned thread ID's you will be ok. If your app keeps generating new thread ID's you will be in trouble.

The "-heap-arrays" version might not be dying (yet) because the memory creep is slower.

The above said, you may have an entirely different issue.

Jim Dempsey

Christoph_F_ · ‎02-24-2017

I guess it is a different issue because I do not use OpenMP. I use MPI3, which allows the usage of shared memory. On the other hand, the cause might be the same ...