I realized that the program below, which implements a linked list, differs in its RAM requirement depending on whether the compiler flag "-parallel" is used.
Module mod_ll

  Type :: llele
    integer*8 :: a, b
    type(llele), pointer :: next => null()
  end type llele

  Type :: container
    integer*8 :: length = 0   ! default-initialize the counter
    Type(llele), pointer :: start => null(), end => null()
  contains
    procedure, pass :: add => SubAdd
  end type container

contains

  Subroutine SubAdd(this, val1, val2)
    Implicit none
    class(container), intent(inout) :: this
    Integer*8, intent(in) :: val1, val2
    if (.not. associated(this%start)) then
      allocate(this%start)
      this%end => this%start
      this%start%a = val1
      this%start%b = val2
    else
      allocate(this%end%next)
      this%end => this%end%next
      this%end%a = val1
      this%end%b = val2
    End if
    this%length = this%length + 1
  end Subroutine SubAdd

End Module mod_ll

Program Test
  use mod_ll, only: container
  Type(container), allocatable :: xx
  integer*8 :: i
  allocate(xx)
  Do i = 1, 50000000
    call xx%add(i, i)
  end Do
  read(*,*)
end Program Test
ifort -O3 -o Test Test.f90
more than doubles the RAM demand compared to compiling with
ifort -O3 -parallel -o Test Test.f90
Tested on Linux kernel 4.14 with ifort 17.05, the first requires about 3.7GB of RAM, whereas the latter requires about 1.5GB. I measured the RAM usage with "top".
Any reasons for that?
If your code is indeed run in parallel, it is not thread-safe as written. To be thread-safe in parallel execution, the body of the add subroutine would have to be inside a critical section.
As to whether this is causing the excess memory consumption, I cannot say. An additional cause could be the stack requirements for the extra threads: how many threads are created, and what are the stack requirements for each thread?
Thanks for the comment. However, it implies that you understood the version compiled with "-parallel" to have the excessive memory usage, but it is exactly the opposite: the one compiled WITHOUT "-parallel" has the excessive memory usage.
I am aware that the code cannot be run in parallel. When executed, regardless of whether "-parallel" was set, no multi-thread usage occurred (I assume the compiler did not find anything to parallelize, which was also not intended).
Type llele is 24 bytes (assuming a 64-bit build): two 8-byte integers plus an 8-byte pointer.
Each node is allocated individually, and a heap allocator adds a per-allocation overhead of at least two size_t values (16 bytes).
Minimally the node load is therefore 40 bytes, but heap granularity is typically 16 bytes, so the effective node load is likely 48 bytes.
50 million allocations then require 2,400 million bytes, ~2.4GB.
Therefore the run showing 1.5GB of RAM in use must be in error.
Reversing the logic:
1.5GB / 50M nodes = 30 bytes/node (assuming 0 bytes for the program and stack).
Even if the allocation load were only the node data plus a single hidden size_t (in other words, an allocator header with no link), that would be 32 bytes per node, and the 50M allocations would still exceed 1.5GB.
Something is amiss in the figure reported by "top".