memory leak when using linked list with allocatable instead of pointer?

may_ka · ‎10-06-2019

Hi,

from what I understand ifort 19.04 supports types having allocatable type components of its own. This would allow to create linked list which can be easily copied via "move_alloc" and having deep deallocation. An example:

module testt
  !!list element
  type :: lle
    integer(kind=8) :: i
    type(lle), allocatable :: next
  end type lle
  !!list container
  type llc
    integer(kind=8) :: n=0
    type(lle), allocatable :: start
    type(lle), pointer :: last=>null()
  contains
    procedure :: addend=>subaddend
    procedure :: addstart=>subaddstart
  end type llc
contains
  subroutine subaddend(this,i)
    class(llc), intent(inout), target :: this
    integer(kind=8), intent(in) :: i
    if(.not.allocated(this%start)) Then
      allocate(this%start)
      this%last=>this%start
    else
      allocate(this%last%next)
      this%last=>this%last%next
    end if
    this%last%i=i
    this%n=this%n+1
  end subroutine subaddend
  subroutine subaddstart(this,i)
    class(llc), intent(inout), target :: this
    integer(kind=8), intent(in) :: i
    type(lle), allocatable :: x
    if(.not.allocated(this%start)) Then
      allocate(this%start)
    else
      call move_alloc(this%start,x)
      allocate(this%start)
      call move_alloc(x,this%start%next)
    end if
    this%start%i=i
    this%n=this%n+1
  end subroutine subaddstart
end module testt
program test
  use testt
  implicit none
  type(llc), allocatable :: x
  integer(kind=8) :: k,i,j=100000000
  Do k=1,5
    write(*,*) k
    allocate(x)
    do i=1,j
      call x%addstart(i)
      !call x%addend(i)
    end do
    deallocate(x)
  end Do
end program test

I run this program on a computer with 32GB of RAM.

subroutine "addend" adds to the end of the list via an end pointer, subroutine "addstart" adds to the beginning of the list without using any pointer.

Compiled with gfortran 9.1 the memory usage builds up in round 1 to ~40% in round 1 and then remains there over subsequent iterations implying that deep deallocation works. This happens invariably of whether I use "addstart" or "addend".

Compiled with ifort 19.04 when using "addend" the deallocate statement did not seem have an effect as the program quickly flooded all available RAM.

Using "addstart" it still used ~80% of the memory which builds up over the first two rounds, then it remains at ~80%.

My suspicion is that this is a bug in ifort, but maybe somebody can confirm or point to what I did wrong before I launch the bug report.

Thanks

may_ka · ‎10-06-2019

I could at least get ifort to work by adding a finalizer and a deleter

module testt
  !!the list element
  type :: lle
    integer(kind=8) :: i
    type(lle), allocatable :: next
  end type lle
  !!the container
  type llc
    integer(kind=8) :: n=0
    type(lle), allocatable :: start
    type(lle), pointer :: last=>null()
  contains
    procedure :: addend=>subaddend
    procedure :: addstart=>subaddstart
    procedure :: del => subdel
    Final :: subfinal
  end type llc
contains
  subroutine subaddend(this,i)
    class(llc), intent(inout), target :: this
    integer(kind=8), intent(in) :: i
    if(.not.allocated(this%start)) Then
      allocate(this%start)
      this%last=>this%start
    else
      allocate(this%last%next)
      this%last=>this%last%next
    end if
    this%last%i=i
    this%n=this%n+1
  end subroutine subaddend
  subroutine subaddstart(this,i)
    class(llc), intent(inout), target :: this
    integer(kind=8), intent(in) :: i
    type(lle), allocatable :: x
    if(.not.allocated(this%start)) Then
      allocate(this%start)
    else
      call move_alloc(this%start,x)
      allocate(this%start)
      call move_alloc(x,this%start%next)
    end if
    this%start%i=i
    this%n=this%n+1
  end subroutine subaddstart
  Subroutine subdel(this)
    class(llc), intent(inout), target :: this
    integer :: i
    type(lle), allocatable :: x
    do
      if(allocated(this%start%next)) Then
        call move_alloc(this%start%next,x)
        deallocate(this%start)
        call move_alloc(x,this%start)
      else
        deallocate(this%start)
        exit
      end if
    end do
    this%n=0
    nullify(this%last)
  end Subroutine subdel
  Subroutine Subfinal(this)
    type(llc), intent(inout) :: this
    call this%del()
  end Subroutine Subfinal
end module testt
program test
  use testt
  implicit none
  type(llc), allocatable :: x,y
  integer(kind=8) :: k,i,j=100000000
  real(kind=8) :: t0,t1
  Do k=1,5
    write(*,*) k
    allocate(x)
    call cpu_time(t0)
    do i=1,j
      call x%addend(i)
    end do
    call cpu_time(t1)
    write(*,*) "add ",t1-t0
    call cpu_time(t0)
    deallocate(x)
    call cpu_time(t1)
    write(*,*) "delete ",t1-t0
  end Do
end program test

The additional subdel was necessary to circumvent a segfault in gfortran which occured when the subdel code was put into the finalizer directly.

However, what struck me most was the performance difference. The program was compiled with -03 only. gfortran needed ~10% of the RAM and had this timing:

add    2.6331760000000002     
 delete   0.74638599999999977     
                    2
 add   0.95624600000000015     
 delete   0.87373999999999974     
                    3
 add    1.1742239999999997     
 delete   0.74350200000000033     
                    4
 add    1.4223770000000009     
 delete   0.83331300000000041     
                    5
 add   0.87081300000000006     
 delete   0.73836900000000050

ifort 19.04 needed 20% of the RAM and had this timing:

                    1
 add    4.32092700000000     
 delete    3.47127200000000     
                     2
 add    2.19527500000000     
 delete    4.37747900000000     
                     3
 add    2.19691300000000     
 delete    3.48525300000000     
                     4
 add    2.17025300000000     
 delete    4.34131900000000     
                     5
 add    2.15258800000000     
 delete    3.50221600000000

This is a performance difference of almost factor 3.

FortranFan · ‎10-06-2019

may.ka wrote:
.. My suspicion is that this is a bug in ifort, but maybe somebody can confirm or point to what I did wrong before I launch the bug report. ..

I fail to notice any bug in Intel Fortran with the code in the original post.

The standard for Fortran leaves a lot of flexibility to the processors and for the shown code, it seems to me Intel Fortran does its "thing" i.e., whatever it deems appropriate with the high frequency allocations and move_alloc's but which a coder may not "like" per the code design being followed. The user may be able to take it up with the vendor not as a bug report but as a feature request.

By the way, the code in the original post fails with gfortran 10.

may_ka · ‎10-06-2019

FortranFan wrote:
I fail to notice any bug in Intel Fortran with the code in the original post.

Does that mean you run the code or you looked at it???

Why should the program go to 19% RAM usage during add, and then use another +60% (so in total +80%) during deallocation??

I cannot imagine that this is the default behavior intended by Intel.

Cheers

jimdempseyatthecove · ‎10-07-2019

>>Why should the program go to 19% RAM usage during add, and then use another +60% (so in total +80%) during deallocation??

This will depend on the heap manager .AND. the method you use to determine how much memory is used. If you are using the Virtual Memory footprint, this is the total memory touched since process start and not the amount of free heap (available virtual memory addresses touched or not).

The increase on deletion (~3x that of initial allocation) would seem to indicate that the method use is (unintentionally) performing allocations during deallocation. Possibly by performing an object copy operation during deallocation. Note, the total number of visible and unseen allocations equal the total number of visible and unseen deallocations, but due to the technique use in the heap manager, the effect is to walk virtual memory. IOW an object freed, while returning to the heap, is intentionally (by design) not immediately available for reallocation. This is either done for performance reasons on allocations and/or security reasons (to hide stale data). If on Windows, you have an option to choose to use LFH (Low Fragmentation Heap). I do not recall how to do this on Linux without googling it.

Jim Dempsey

IanH · ‎10-08-2019

When run under windows, runtime memory diagnostics conducted at the level of the underlying heap manager show no evidence of a leak. As others have noticed, heap managers can still suffer from this like fragmentation even if all objects have been released, and typically a heap manager does not return memory to an operating system, so at a lower level you may still see significant memory use being reported against the program even if the Fortran runtime is doing the right thing.

On the Windows platform, the deallocation of the container structure is done using recursive calls. For a deeply nested structure, this series of calls may result in a stack overflow. I suspect some of your observations relate to this use of recursion.

Use of recursion to deallocate a recursive structure seems reasonable to me. On the other hand, a recursive structure with several million levels of recursion is extreme. I don't consider this a compiler bug.

Use of the addend procedure would be non-conforming in the code as otherwise shown - the actual argument to the call does not have the target attribute, so pointers referencing components in the actual argument become undefined when the procedure completes.

Be careful defining non-default-kind integers with large values that do not have the appropriate kind suffix.

may_ka · ‎10-08-2019

Hi all,

thanks for the comments. As an additional information I tested the program on Linux, kernel 5.3. I monitored the memory usage with "top". Now one may argue that that does not give the full/true picture, but lets just stick at an even more "superficial" level.

Below are two programs for the linked list, both doing exactly the same, one with pointer, one with allocatable.

With "superficial" I mean that when running with pointers, RAM usages peaks at ~30% (of 32GB) and the computer is still usable. When running with allocatable, the RAM usage is so bad that kswapd sets in and the whole computer freezes. I am happy that you guys telling me that everything is fine ........... but the above observation renders the allocatable implementation useless. Thus it cannot do what it is meant to do .......... replacing the pointer implementation and with that allowing for deep copying and deep deallocation.

Here the allocatable version (pretty much the same as in the op, but kind=8 replaced by iso_fortran and the number of list elements increased to 150,000,000)

module testt
  use, intrinsic :: iso_fortran_env, only: int64
  implicit none
  !!the list element
  type :: lle
    integer(int64) :: i
    type(lle), allocatable :: next
  end type lle
  !!the container
  type llc
    integer(int64) :: n=0
    type(lle), allocatable :: start
    type(lle), pointer :: last=>null()
  contains
    procedure :: addend=>subaddend
    procedure :: addstart=>subaddstart
  end type llc
contains
  subroutine subaddend(this,i)
    class(llc), intent(inout), target :: this
    integer(int64), intent(in) :: i
    if(.not.allocated(this%start)) Then
      allocate(this%start)
      this%last=>this%start
    else
      allocate(this%last%next)
      this%last=>this%last%next
    end if
    this%last%i=i;this%n=this%n+1
  end subroutine subaddend
  subroutine subaddstart(this,i)
    class(llc), intent(inout), target :: this
    integer(int64), intent(in) :: i
    type(lle), allocatable :: x
    if(.not.allocated(this%start)) Then
      allocate(this%start)
    else
      call move_alloc(this%start,x)
      allocate(this%start)
      call move_alloc(x,this%start%next)
    end if
    this%start%i=i;this%n=this%n+1
  end subroutine subaddstart
end module testt
program test
  use testt
  implicit none
  type(llc), allocatable :: x,y
  integer(int64) :: k,i,j=150000000
  real(real64) :: t0,t1
  Do k=1,5
    write(*,*) k
    allocate(x)
    call cpu_time(t0)
    do i=1,j
      call x%addend(i)
    end do
    call cpu_time(t1)
    write(*,*) "add ",t1-t0
    call cpu_time(t0)
    deallocate(x)
    call cpu_time(t1)
    write(*,*) "delete ",t1-t0
  end Do
end program test

The pointer version. Accounting for the pointers it has a finalizer, but the rest is pretty much the same as the allocatable version:

module testt
  use, intrinsic :: iso_fortran_env, only: int64, real64
  implicit none
  !!the list element
  type :: lle
    integer(int64) :: i
    type(lle), pointer :: next=>null()
  end type lle
  !!the container
  type llc
    integer(int64) :: n=0
    type(lle), pointer :: start=>null(), last=>null()
  contains
    procedure :: addend=>subaddend
    procedure :: addstart=>subaddstart
    procedure :: del => subdel
    Final :: subfinal
  end type llc
contains
  subroutine subaddend(this,i)
    class(llc), intent(inout), target :: this
    integer(int64), intent(in) :: i
    if(.not.associated(this%start)) Then
      allocate(this%start)
      this%last=>this%start
    else
      allocate(this%last%next)
      this%last=>this%last%next
    end if
    this%last%i=i;this%n=this%n+1
  end subroutine subaddend
  subroutine subaddstart(this,i)
    class(llc), intent(inout), target :: this
    integer(int64), intent(in) :: i
    type(lle), pointer :: x=>null()
    if(.not.associated(this%start)) Then
      allocate(this%start)
    else
      x=>this%start;nullify(this%start)
      allocate(this%start);this%start%next=>x
    end if
    this%start%i=i;this%n=this%n+1
  end subroutine subaddstart
  Subroutine subdel(this)
    class(llc), intent(inout), target :: this
    type(lle), pointer :: x=>null()
    x=>this%start
    do
      if(associated(this%start%next)) Then
        this%start=>this%start%next
        deallocate(x);x=>this%start
      else
        deallocate(this%start);exit
      end if
    end do
    nullify(this%start,this%last)
  end Subroutine subdel
  Subroutine Subfinal(this)
    type(llc), intent(inout) :: this
    call this%del()
  end Subroutine Subfinal
end module testt
program test
  use testt
  implicit none
  type(llc), allocatable :: x,y
  integer(int64) :: k,i,j=150000000
  real(real64) :: t0,t1
  Do k=1,5
    write(*,*) k
    allocate(x)
    call cpu_time(t0)
    do i=1,j
      call x%addend(i)
    end do
    call cpu_time(t1)
    write(*,*) "add ",t1-t0
    call cpu_time(t0)
    deallocate(x)
    call cpu_time(t1)
    write(*,*) "delete ",t1-t0
  end Do
end program test

BTW, 150,000,000 elements in a linked list is standard in my applications and given that fortran is meant to be THE hpc language my assumption would be that it can easily cope with it (and the pointer version indeed does).

Thanks again for the comments.

cheers

IanH · ‎10-08-2019

Conceptually, you don't have a linked list if you are using an allocatable component, you have a (very) deeply nested, very large object of derived type. The value of the container object is the entire structure. Concepts do not always map through to implementation, but it does not surprise me that you are running into issues here.

jimdempseyatthecove · ‎10-09-2019

I think this may be an exemplar of CS courses instruction that recursion should be favored over iteration. IOW ease of programming is to be made over costs in resource consumption. It is clear to see from the two methods listed in post #7 that the iterative process is less resource intensive, and for this type of usage, should be implemented as opposed to the recursive technique.

I will leave it as an exercise for may.ka to modify the allocatable implementation to use an iterative deallocate, and post back here the results.

Jim Dempsey

may_ka · ‎10-09-2019

Hi,

thanks for the comments

jimdempseyatthecove (Blackbelt) wrote:
I will leave it as an exercise for may.ka to modify the allocatable implementation to use an iterative deallocate, and post back here the results.
Jim Dempsey

have a look at post #2.

Cheers

jimdempseyatthecove · ‎10-09-2019

>>have a look at post #2

Why did you use move_alloc???

Why not iteratively:

1) locate last node and next to last node
2) delete last node by way of next to last node
3) condition former next to last node now into last node

Note, if you maintain a pointer to prior node, you can avoid multiple step 1)'s

While you could recursively search for last node, then delete on pop of recursion. this still presents a problem of excessive use of stack.

For linked lists, using pointer's alone is much less resource intensive than having recursive allocatables.

Jim Dempsey

may_ka · ‎10-09-2019

Hi Jim.

I used "move_alloc" to have an implementation which is almost pointer-free (except last).

My interpretation of "move_alloc" is that "move_alloc" is the same as moving a pointer except that the pointee can be owned by one pointer only. In that sense an allocatable entity is an unbreakable combination of a pointer and a resource. The only option to assign another pointer includes nullifing the initial pointer (e.g. "move_alloc"). If that interpretation holds, the complexity of "move_alloc" should be the same as assigning a pointer.

Anyway, in the #2 implementation I delete the list from the start and iterate towards the end. Since it is still a single linked list, though via allocation, I cannot see how one could delete the list from the end without recursion. If you don't mind you can change the example #2 to make it more clear what you mean.

Thanks.

jimdempseyatthecove · ‎10-09-2019

while list not empty
     1) locate last node and next to last node
     2) delete last node by way of next to last node
     3) condition former next to last node now into last node
end while

FortranFan · ‎10-09-2019

may.ka wrote:
Quote:
FortranFan wrote:

I fail to notice any bug in Intel Fortran with the code in the original post.

Does that mean you run the code or you looked at it???
Why should the program go to 19% RAM usage during add, and then use another +60% (so in total +80%) during deallocation??
I cannot imagine that this is the default behavior intended by Intel.
Cheers

My point is I do not think what someone, say OP, might think is a less "efficient" compiler implementation of the Fortran 2008 feature of "allocatable components of recursive type" relative to their expectation or with memory/CPU usage of another compiler to be a bug. Sure someone, like OP, can contact the vendor such as Intel and say, hey code built using that vendor's development toolchain is not competitive in terms of whatever performance measurements they have relative to whatever benchmark they can agree upon mutually and work out the next steps. But that ain't a bug per se.

Looking at the code in the original post, I don't see Intel Fortran compiler having a bug or causing a memory leak, etc. One might view Intel Fortran's treatment of this facility - allocatable components of recursive type - as being way more demanding than it needs to be in terms of system memory availability and utilization, say, gfortran on some OS platforms resulting in slower performance but as I indicate above, it's a different argument. I do not notice a memory leak.