
Slow deallocation in derived type data

De_Vita__Paolo
Beginner

Hello,

I am seeing very slow deallocation of a derived data type, and I cannot fix it.

I'm using Intel Visual Fortran Compiler 16.0.3.207 for Windows, with the default release compiler options.

Here below is a small example that reproduces my problem:

program test_dealloc

   implicit none
   integer(kind=4) :: n1 = 5827396
   integer(kind=4) :: n2 = 75
   integer(kind=4) :: i
   type ddata
      complex(kind=4), dimension(:,:), allocatable :: elem
   end type ddata
   type(ddata), pointer, dimension(:) :: ddata_p

   real(4) time1, time0
   ! start

   time0 = secnds(0.0)
   allocate (ddata_p(n1))
   do i = 1, n1
      allocate (ddata_p(i)%elem(n2,n2))
   end do
   time1 = secnds(time0)
   print *, ' '
   print *, 'Allocation time', time1

   ! computation kernel

   time0 = secnds(0.0)
   do i = 1, n1
      deallocate (ddata_p(i)%elem)
   end do
   deallocate (ddata_p)
   time1 = secnds(time0)
   print *, ' '
   print *, 'Deallocation time', time1

end program test_dealloc

 

I used the function secnds for the timing; the output on screen is:

>test_dealloc.exe

 Allocation time   15.62891

 Deallocation time   4075.336

The machine is a Xeon E5-2690 (x64) with 192 GB of RAM; the OS is Windows Server 2008 R2 Enterprise.

From Task Manager, the allocated memory is around 44 GB. The deallocation time is very large: much longer than the allocation time, and also much longer than the computation kernel in which the data structure is filled.

Is there any way to reduce this deallocation time?

Thank you very much. Regards

Paolo

 

Andrew_Smith
Valued Contributor I

This issue sounds serious and deserves consideration by Intel.

John_Campbell
New Contributor II

"From the task manager the allocated memory is around 44GB"

I did a calculation for n1 = 5827396 and n2 = 75 and get an estimate of 244 GB: each elem holds 75 × 75 complex(kind=4) values of 8 bytes each, so the total is 5827396 × 75 × 75 × 8 ≈ 2.6 × 10^11 bytes. I suggest you try n1 = 5827396/100, apply the following changes, and run with Task Manager open.

  integer(kind=8) :: sizet
  real*4 gb
  ! start

  time0 = secnds(0.0)

  allocate (ddata_p(n1))
  write (*,*) 'allocate ddata_p size =', sizeof (ddata_p)
  sizet = 0

  do i = 1, n1
    allocate (ddata_p(i)%elem(n2,n2))
    sizet = sizet + sizeof (ddata_p(i)%elem)
  end do

  time1 = secnds(time0)
  gb = sizet / 1024.**3
  write (*,*) 'fill ddata_p size =', sizet, gb
  write (*,*) 'allocate ddata_p size =', sizeof (ddata_p)
  read (*,*) gb   ! pause so the memory usage can be inspected in Task Manager
 

 

De_Vita__Paolo
Beginner

Thank you John,

You are right: the real size of the data structure is 244 GB. In fact, if I initialize the variable, I read in Task Manager the same allocated memory that you estimated.

I made this test on a bigger machine (Intel Xeon Gold 6144 with 684 GB of RAM, running Windows Server 2016 Datacenter), adding the following initialization:

...

time0 = secnds(0.0)
do i = 1, n1
   ddata_p(i)%elem = 0
end do
time1 = secnds(time0)
print *, ' '
print *, 'Setting time', time1

...

 

I obtained these new values:

test_dealloc_setting.exe

 Allocation time   14.67578
 Setting time   79.22656
 Deallocation time   3202.836

 

The deallocation time is still very large.

Paolo

 

jimdempseyatthecove
Honored Contributor III

This looks like an issue with the C++ heap manager and/or virtual-memory paging caused by fragmentation. Try deallocating in reverse order of allocation:

time0 = secnds(0.0)
do i = n1, 1, -1
   deallocate (ddata_p(i)%elem)
end do
deallocate (ddata_p)
time1 = secnds(time0)
print *, ' '
print *, 'Deallocation time', time1

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

Also, if you have memory-deallocation debugging enabled (e.g. Valgrind, or the Windows debug C runtime library), it may have an effect on free (deallocate).

Jim Dempsey

De_Vita__Paolo
Beginner

Hi Jim,

I tried to deallocate in reverse order but nothing changed.

Regarding the compiler options, I have left the defaults for the release configuration, which are (from the compiler's command window):

/nologo /O2 /module:"x64\Release\\" /object:"x64\Release\\" /Fd"x64\Release\vc100.pdb" /libs:dll /threads /c

Paolo

LRaim
New Contributor I

One possible solution is:

1) Change elem into a pointer instead of an allocatable array.
2) Allocate a single block of n1*n2*n2 elements.
3) Set ddata_p(i)%elem to the initial element of each n2*n2 matrix.

In this case you will have a single deallocation (see the sketch below).
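A minimal sketch of the three points combined (illustrative code, not from the thread; it assumes a fixed n2 and uses pointer rank remapping, a Fortran 2008 feature):

program single_block
   implicit none
   integer(kind=4), parameter :: n1 = 58274   ! reduced size, for illustration only
   integer(kind=4), parameter :: n2 = 75
   type ddata
      complex(kind=4), dimension(:,:), pointer :: elem   ! 1) pointer, not allocatable
   end type ddata
   type(ddata), pointer, dimension(:) :: ddata_p
   complex(kind=4), allocatable, target :: block(:)
   integer(kind=8) :: off
   integer(kind=4) :: i

   allocate (ddata_p(n1))
   allocate (block(int(n1,8)*n2*n2))           ! 2) one allocation for everything

   do i = 1, n1
      off = int(i-1,8)*n2*n2
      ! 3) remap each n2 x n2 slice of the block onto elem
      ddata_p(i)%elem(1:n2,1:n2) => block(off+1 : off+int(n2,8)*n2)
   end do

   ! ... computation kernel uses ddata_p(i)%elem exactly as before ...

   deallocate (block)                          ! one deallocation instead of n1
   deallocate (ddata_p)
end program single_block

With this layout the n1 separate heap blocks collapse into one, so the deallocation is a single call into the heap manager.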

Regards 

De_Vita__Paolo
Beginner

Hi Luigi,

Thank you for the suggestions. Regarding your points:

1) I tried with a pointer, but there was no real change.

2) That would really be the worst solution for my case: in the real code n2 depends on n1 and I don't know the final dimension of the data in advance, so the derived structure is the best option for my application.

3) I didn't understand; could you explain further, please (or give an example)?

Regards

Paolo

 

LRaim
New Contributor I

Sorry for the hurried answer.
The three points I suggested must be implemented together to form a possible solution.

About your point 2): the approach can still be applied if n2(i) for each i can be computed in a preliminary do loop.

About your point 3): I do not have time to set up a full example, but the step is similar to setting N pointers to the columns of an NxN matrix (after the matrix has been allocated); a minimal sketch of that analogy follows.
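A minimal sketch of the column-pointer analogy (illustrative names, not code from the thread): N pointers associated with the columns of an NxN matrix, with no allocation per column:

program column_pointers
   implicit none
   integer, parameter :: n = 4
   type colptr
      real, dimension(:), pointer :: col
   end type colptr
   real, allocatable, target :: a(:,:)
   type(colptr) :: cols(n)
   integer :: j

   allocate (a(n,n))            ! the single allocation
   do j = 1, n
      cols(j)%col => a(:,j)     ! pointer association only, no new memory
   end do

   a = 0.0
   cols(2)%col = 1.0            ! writes straight into column 2 of a
   print *, a(:,2)

   deallocate (a)               ! the single deallocation
end program column_pointers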

Regards

 

 

jimdempseyatthecove
Honored Contributor III

Do you have VTune available to perform a performance test?

About the only thing left to check: there used to be an Intel floating-license check issue that caused long delays in a program, though I do not recall it being related to deallocation.

Jim Dempsey

Andrew_Smith
Valued Contributor I

This is a very serious 300-fold drop in performance. Why has it not been acknowledged by Intel as an issue?

andrew_4619
Honored Contributor III

I took the program in #1 and built it. I reduced n1 by a factor of 100 because my machine has nowhere near the memory you asked for.

The allocation took 0.21 s and the deallocation 0.08 s. Our machines will have different speeds, but broadly speaking the allocation time is in proportion with your bigger n1 value; the deallocation time, however, is not, and is orders of magnitude out!

I would suggest running your test at several increasing values of n1; I guess we will see some threshold value at which there is a step change in the deallocation time (a sketch of such a test follows below). This might show something interesting.

The limit (if that is what we see) might be a function of your Windows version and/or hardware, or it may be some limit within the compiler runtime.
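A possible shape for that scaling test (an illustrative sketch, not code from the thread; it doubles n1 each pass and reuses the secnds timing from the original program):

program dealloc_scaling
   implicit none
   integer(kind=4), parameter :: n2 = 75
   type ddata
      complex(kind=4), dimension(:,:), allocatable :: elem
   end type ddata
   type(ddata), pointer, dimension(:) :: ddata_p
   integer(kind=4) :: n1, i, pass
   real(4) :: time0, t_alloc, t_dealloc

   n1 = 50000                  ! each pass needs about n1 * 45 KB of memory
   do pass = 1, 6
      allocate (ddata_p(n1))
      time0 = secnds(0.0)
      do i = 1, n1
         allocate (ddata_p(i)%elem(n2,n2))
      end do
      t_alloc = secnds(time0)

      time0 = secnds(0.0)
      do i = 1, n1
         deallocate (ddata_p(i)%elem)
      end do
      t_dealloc = secnds(time0)
      deallocate (ddata_p)

      print *, 'n1 =', n1, '  alloc', t_alloc, '  dealloc', t_dealloc
      n1 = n1*2
   end do
end program dealloc_scaling

Each elem is 75 × 75 × 8 = 45000 bytes, so the last pass (n1 = 1.6 million) commits about 72 GB; trim the pass count to suit the machine.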

 

John_Campbell
New Contributor II

My initial thought was that the large delay was due to a virtual-memory event being initiated during deallocation: physical memory is not committed during allocation, but only when the array is used, and perhaps released at deallocation. I notice an initialisation phase has now been added to the tester, which appears to indicate that the deallocation delay is not associated with virtual-memory usage. Given the memory sizes being tested, I am also not able to reproduce this problem. The luxury of having this problem!
