I am seeing very slow deallocation of a derived data type, and I have not been able to fix it.
I'm using IVFC 188.8.131.52 for Windows, with default release compiler options.
Below is an example of my problem:
I used the function secnds for the timing, and the output printed to the screen is:
Allocation time 15.62891
Deallocation time 4075.336
The machine is a XEON E5-2690 x64 with 192 GB of RAM, the OS is Windows Server 2008 R2 Enterprise.
According to Task Manager, the allocated memory is around 44 GB. The deallocation time is very large: much longer than the allocation time, and also much longer than the computation kernel in which the data structure is filled.
Is there any possibility to reduce such a deallocation time?
Thank you very much. Regards
"From the task manager the allocated memory is around 44GB"
I did a calculation for n1 = 5827396 and n2 = 75 and got an estimate of 244 GB. I suggest you try n1 = 5827396/100, make the following changes, and watch the run in Task Manager.
    integer(kind=8) :: siza, sizi, sizet
    real*4 gb
    ! start
    time0 = secnds(0.0)
    allocate (ddata_p(n1))
    write (*,*) 'allocate ddata_p size =', sizeof (ddata_p)
    sizet = 0
    do i=1,n1
      allocate (ddata_p(i)%elem(n2,n2))
      sizet = sizet + sizeof ( ddata_p(i)%elem )
    enddo
    time1 = secnds(time0)
    gb = sizet / 1024.**3
    write (*,*) 'fill ddata_p size =', sizet, gb   ! c_sizeof (ddata_p)
    write (*,*) 'allocate ddata_p size =', sizeof (ddata_p)
    read (*,*) gb
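For reference, the 244 GB estimate can be reproduced with a few lines of arithmetic. This is only a sketch: it assumes each element of elem is an 8-byte real (which is what the 244 GB figure implies), and it ignores the array descriptor stored for every allocatable component, so the true footprint is somewhat larger.

```fortran
program size_estimate
  use iso_fortran_env, only: int64, real64
  implicit none
  ! Values quoted in the thread
  integer(int64), parameter :: n1 = 5827396_int64
  integer(int64), parameter :: n2 = 75_int64
  ! Assumption: elem holds 8-byte reals
  integer(int64), parameter :: bytes_per_elem = 8_int64
  integer(int64) :: total_bytes
  real(real64)   :: gib

  ! 64-bit arithmetic is required: the product overflows a default integer
  total_bytes = n1 * n2 * n2 * bytes_per_elem
  gib = real(total_bytes, real64) / 1024.0_real64**3

  write (*,'(a,i0)')   'total bytes = ', total_bytes
  write (*,'(a,f0.1)') 'size in GiB = ', gib
end program size_estimate
```

With these inputs the payload alone comes to roughly 244 GiB, far above the 44 GB shown by Task Manager before the arrays are touched.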
Thank you John,
You are right, the real size of the data structure is 244 GB. In fact, if I initialize the variable, Task Manager reports the same allocated memory that you estimated.
I ran this test on a bigger machine (Intel Xeon Gold 6144 with 684 GB RAM, OS Windows Server 2016 Datacenter), adding the following initialization:
    ...
    time0=secnds(0.0)
    do i=1,n1
      ddata_p(i)%elem = 0
    enddo
    time1=secnds(time0)
    print*, ' '
    print*, 'Setting time', time1
    ...
And I obtained these new values:
    test_dealloc_setting.exe
    Allocation time 14.67578
    Setting time 79.22656
    Deallocation time 3202.836
The deallocation time is still very large.
This looks like an issue with the C++ heap manager and/or virtual memory paging due to fragmentation issues. Try deallocating in reverse order of allocation:
    time0=secnds(0.0)
    do i=n1,1,-1
      deallocate(ddata_p(i)%elem)
    enddo
    deallocate (ddata_p)
    time1=secnds(time0)
    print*, ' '
    print*, 'Deallocation time', time1
Also, if you have memory deallocation debugging enabled (e.g. valgrind, or the Windows debug C Runtime Library), this may have an effect on free (deallocate).
I tried to deallocate in reverse order but nothing changed.
Regarding the compiler options, I left the default ones for the release configuration, which are (from the compiler's command-line window):
/nologo /O2 /module:"x64\Release\\" /object:"x64\Release\\" /Fd"x64\Release\vc100.pdb" /libs:dll /threads /c
One possible solution is:
1) change elem into a pointer instead of an allocatable array.
2) allocate a single block of n1*n2*n2 elements.
3) set ddata_p(i)%elem to point to the initial element of each n2*n2 matrix
In this case you will have a single de-allocation.
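A minimal sketch of the three steps taken together, for the case where n2 is the same for every entry. The type name ddata_t, the pool array, and the small sizes are made up for illustration; it also assumes 8-byte real elements and a compiler that supports Fortran 2003 pointer bounds remapping.

```fortran
program single_block
  use iso_fortran_env, only: int64, real64
  implicit none
  ! Step 1: elem is a pointer instead of an allocatable array
  type :: ddata_t                       ! hypothetical type name
     real(real64), pointer :: elem(:,:) => null()
  end type ddata_t
  integer, parameter :: n1 = 1000, n2 = 75   ! small illustrative sizes
  real(real64), allocatable, target :: pool(:)
  type(ddata_t), allocatable :: ddata_p(:)
  integer(int64) :: first, last
  integer :: i

  ! Step 2: one allocation holding all n1*n2*n2 elements
  allocate (pool(int(n1, int64) * n2 * n2))
  allocate (ddata_p(n1))

  ! Step 3: point each elem at its own n2*n2 slice of the pool
  do i = 1, n1
     first = int(i - 1, int64) * n2 * n2 + 1
     last  = first + n2 * n2 - 1
     ddata_p(i)%elem(1:n2, 1:n2) => pool(first:last)
  end do

  pool = 0.0_real64
  ddata_p(2)%elem(1, 1) = 42.0_real64
  ! The write above lands in the pool; expected to print T
  print *, pool(n2 * n2 + 1) == 42.0_real64

  ! The elem pointers are never deallocated individually:
  ! freeing the pool releases all the matrices at once
  deallocate (ddata_p)
  deallocate (pool)
end program single_block
```

The heap manager then sees two large blocks instead of millions of small ones, which is the point of the suggestion. Whether this removes the delay on the machines discussed here has not been tested.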
Thank you for the suggestions. About the options:
1) I tried with pointer but no real change
2) That would really be the worst solution for my case, because in the real code n2 depends on n1 and I don't know the final dimensions of the data in advance, so the derived structure would be the best option for my application.
3) I didn't understand this one. Can you explain it better, please (or give an example)?
Sorry for the hurry in answering.
The 3 points I suggested must be implemented together to get a possible solution.
About your point 2): the possible solution can be applied if n2(i) can be computed for each i in a previous do-loop.
About your point 3): I do not have time to set up an example. This step is similar to setting N pointers to the columns of an NxN matrix (after the matrix has been allocated).
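When n2 varies with i, the same idea can be sketched as a two-pass scheme: a first loop computes every n2(i) and a running offset, then a single block is allocated and each pointer is set into it. The rule used to fill n2(i) below is a placeholder, and the names ddata_t, pool and offset are hypothetical; 8-byte real elements are assumed.

```fortran
program variable_block
  use iso_fortran_env, only: int64, real64
  implicit none
  type :: ddata_t                       ! hypothetical type name
     real(real64), pointer :: elem(:,:) => null()
  end type ddata_t
  integer, parameter :: n1 = 5          ! small illustrative size
  integer        :: n2(n1)              ! matrix order of each entry
  integer(int64) :: offset(n1 + 1)      ! offset(i) = elements stored before entry i
  real(real64), allocatable, target :: pool(:)
  type(ddata_t), allocatable :: ddata_p(:)
  integer :: i

  ! Pass 1: compute the sizes and the running offsets
  offset(1) = 0
  do i = 1, n1
     n2(i) = i + 1                      ! placeholder for the real rule
     offset(i + 1) = offset(i) + int(n2(i), int64)**2
  end do

  ! One allocation sized by the final offset
  allocate (pool(offset(n1 + 1)))
  pool = 0.0_real64
  allocate (ddata_p(n1))

  ! Pass 2: each elem views its own n2(i) x n2(i) slice of the pool
  do i = 1, n1
     ddata_p(i)%elem(1:n2(i), 1:n2(i)) => pool(offset(i) + 1 : offset(i + 1))
  end do

  do i = 1, n1
     print *, i, shape(ddata_p(i)%elem)
  end do

  ! A single deallocation frees every matrix
  deallocate (ddata_p)
  deallocate (pool)
end program variable_block
```

This only works if the sizes are known before any data is stored, which is exactly the condition stated for point 2).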
Do you have VTune available to perform a performance test?
About the only thing left to check: there used to be an Intel floating license check issue that caused long delays in a program, though I do not recall it being related to deallocation.
I took the program in #1 and built it. I reduced n1 by a factor of 100 because the machine had nowhere near the memory you asked for.
The allocation took 0.21 s and the deallocation 0.08 s. Our machines will have different speeds, but broadly speaking the allocation time is in proportion to your bigger n1 value. The deallocation time, however, is not: it is out by orders of magnitude!
I would suggest running your test at several increasing values of n1. I guess we will see some threshold value at which there is a step change in the deallocation time. This might show something interesting.
The limit (if that is what we see) might be a function of your Windows version and/or hardware, or it may be some limit within the compiler.
My initial thought was that the large delay was due to a virtual memory event being initiated during deallocation. Physical memory is not used during allocation, only when the array is actually used, which might first happen at deallocation. I notice the initialisation phase has now been added to the tester, which appears to indicate that the deallocation delay is not associated with virtual memory usage. Given the memory sizes being tested, I am also not able to reproduce this problem. The luxury of having this problem!
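A rough sketch of such a sweep, doubling n1 on each pass and timing allocation and deallocation separately. The type name ddata_t, the starting value and the cap n1_max are made up and should be scaled to the memory available; system_clock is used here instead of secnds, and 8-byte real elements are assumed.

```fortran
program dealloc_sweep
  use iso_fortran_env, only: int64, real64
  implicit none
  type :: ddata_t                       ! hypothetical type name
     real(real64), allocatable :: elem(:,:)
  end type ddata_t
  integer, parameter :: n2 = 75
  integer, parameter :: n1_max = 16000  ! hypothetical cap; raise to fit your RAM
  type(ddata_t), allocatable :: ddata_p(:)
  integer(int64) :: t0, t1, t2, rate
  real(real64)   :: t_alloc, t_dealloc
  integer :: n1, i

  call system_clock(count_rate=rate)
  write (*,'(a10,2a14)') 'n1', 'alloc (s)', 'dealloc (s)'

  n1 = 1000
  do while (n1 <= n1_max)
     call system_clock(t0)
     allocate (ddata_p(n1))
     do i = 1, n1
        allocate (ddata_p(i)%elem(n2, n2))
        ddata_p(i)%elem = 0.0_real64    ! touch the memory so it is committed
     end do
     call system_clock(t1)

     do i = 1, n1
        deallocate (ddata_p(i)%elem)
     end do
     deallocate (ddata_p)
     call system_clock(t2)

     t_alloc   = real(t1 - t0, real64) / real(rate, real64)
     t_dealloc = real(t2 - t1, real64) / real(rate, real64)
     write (*,'(i10,2f14.4)') n1, t_alloc, t_dealloc

     n1 = n1 * 2
  end do
end program dealloc_sweep
```

If the deallocation column stops scaling roughly linearly with n1 at some point, that value of n1 is the threshold worth investigating.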