New Contributor I

Strange Memory Cache Leak (or at least it looks like one), -assume buffered_io

I have a strange problem that looks like a "Memory Cache Leak" (not a memory leak).

Let me set the stage first. Reproducibly (using ganglia to monitor), on a cluster I have noticed that the cached memory increases, relatively slowly. When it becomes large, something like 2/3 of the total memory (Intel Gold with 32 cores & 192 GB), a program runs slower by a factor of ~1.5. If I clear the cache and sync the disk (I have not tested which matters) with "sync ; echo 3 > /proc/sys/vm/drop_caches", the program speeds back up (~1.5 times faster).

The issue seems to be associated with I/O -- the relevant code uses MPI, and only the core that does any I/O shows the cache leak. The program does a fair amount of I/O, but not massive amounts (10-40 MB). I compile using ifort with -assume buffered_io. My suspicion is that this may leave some files cached at the end, effectively a "cache leak".
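
For what it's worth, I believe ifort also lets buffering be requested per unit with the BUFFERED= specifier on OPEN (and globally via the FORT_BUFFERED environment variable) instead of turning it on everywhere with -assume buffered_io, which might help narrow down which unit is responsible. A rough sketch -- the unit number and file name are only placeholders:

program buffered_one_unit
  implicit none
  integer :: i
  ! BUFFERED='YES' is an Intel Fortran extension to OPEN; unit 20 and
  ! 'scratch.dat' are placeholders for illustration only.
  open(unit=20, file='scratch.dat', form='unformatted', &
       access='sequential', buffered='yes', status='replace')
  do i = 1, 1000
     write(20) i
  end do
  close(20)   ! closing the unit should release the run-time library's buffer
end program buffered_one_unit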

Has anyone seen anything like this? I don't believe the amount of cached memory is supposed to matter with Linux -- but it does!

Are there any calls/flags/tricks that might remove this?

Any other ideas about how to probe it?

5 Replies
Black Belt

If you open new files repeatedly without closing previous ones, this effect would be expected.  buffered_io would tie up more memory per file buffer.

Black Belt Retired Employee

What -assume buffered_io does is ask the run-time library to "bundle" smaller I/O requests into a single, larger I/O request to the OS. To do this it allocates a buffer. By default it tries to manage the size of this automatically. When the unit is closed or the program exits, the buffer memory is freed.

I recall that, several versions ago, there was a problem where, for some patterns of I/O, it ended up allocating enormous buffers. You didn't say which version you are using. (I went to look up the release notes on that issue, but the release notes from version 15 and older are missing from Intel's site, even though they are still linked. I'll file a complaint about that.)


Read: https://www.tecmint.com/clear-ram-memory-cache-buffer-and-swap-space-on-linux/

Instead of using cron to periodically clear the file cache, have it periodically run a script to obtain the amount of memory used by the file cache as well as the total memory. Then, when that is above some critical limit (TBD), run an application that allocates and initializes RAM equal to the amount of file cache you wish to reclaim, then exits. This is done under the assumption that the file cache has some sort of LRU scheme in place.
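
A minimal sketch of that idea in Fortran (the /proc/meminfo parsing and the one-third-of-RAM limit are assumptions to adjust for your setup):

program squeeze_cache
  implicit none
  integer(8) :: mem_total_kb = 0, cached_kb = 0, nbytes, n, i
  integer :: ios, unit
  character(len=256) :: line

  ! Read MemTotal and Cached from /proc/meminfo (values are in kB).
  open(newunit=unit, file='/proc/meminfo', status='old', action='read')
  do
     read(unit, '(A)', iostat=ios) line
     if (ios /= 0) exit
     if (line(1:9) == 'MemTotal:') read(line(10:), *) mem_total_kb
     if (line(1:7) == 'Cached:')   read(line(8:), *) cached_kb
  end do
  close(unit)

  ! Only act when the file cache exceeds the critical limit (here 1/3 of RAM, TBD).
  if (cached_kb * 3 < mem_total_kb) stop 'cache below limit, nothing to do'

  ! Allocate and touch as much memory as the cache currently holds, so the
  ! kernel (assumed to use an LRU-like policy) evicts cached file pages instead.
  nbytes = cached_kb * 1024_8
  n = nbytes / 8
  block
    real(8), allocatable :: buf(:)
    allocate(buf(n))
    do i = 1, n, 512            ! one touch per 4 KiB page
       buf(i) = 0.0d0
    end do
    print *, 'touched', nbytes, 'bytes; exiting returns the memory'
  end block
end program squeeze_cache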

Jim Dempsey

New Contributor I

Tim P. wrote:

If you open new files repeatedly without closing previous ones, this effect would be expected.  buffered_io would tie up more memory per file buffer.

A good thought. No new files are being repeatedly opened -- at least not in the code. I added a subroutine that closes everything prior to the program exiting, and it has no effect.
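
A sketch of what such a close-everything helper looks like (the unit range is arbitrary; it just INQUIREs over the range and closes whatever is open):

subroutine close_all_units()
  implicit none
  integer :: u
  logical :: is_open
  ! Sweep a plausible range of unit numbers and close any that are open.
  do u = 10, 999
     inquire(unit=u, opened=is_open)
     if (is_open) close(u)
  end do
end subroutine close_all_units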

N.B., for reference (hopefully not confusing the issue more), the program is running on a child node and using an NFS-mounted RAID disk.

New Contributor I

Steve Lionel (Ret.) wrote:

What -assume buffered_io does is ask the run-time library to "bundle" smaller I/O requests into a single, larger I/O request to the OS. To do this it allocates a buffer. By default it tries to manage the size of this automatically. When the unit is closed or the program exits, the buffer memory is freed.

I recall that, several versions ago, there was a problem where, for some patterns of I/O, it ended up allocating enormous buffers. You didn't say which version you are using. (I went to look up the release notes on that issue, but the release notes from version 15 and older are missing from Intel's site, even though they are still linked. I'll file a complaint about that.)

I am using the 2019 cluster compiler, but I saw similar behavior in the past with the 2013 and 2016 compilers.
