Hello,
I am seeing very slow deallocation of a derived data type, and I cannot find a way to fix it.
I'm using IVFC 16.0.3.207 for Windows, with the default release compiler options.
Below is an example that reproduces my problem:
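The listing itself is not preserved here. The following is only a minimal sketch of what such a test might look like, pieced together from details given elsewhere in the thread (`ddata_p(i)%elem`, n1 = 5827396, n2 = 75, timing with `secnds`). The type name `ddata_t` and the 8-byte real element kind are assumptions, and n1 is reduced so the sketch runs on an ordinary machine.

```fortran
program test_dealloc
  implicit none
  type :: ddata_t                            ! hypothetical type name
    real(kind=8), allocatable :: elem(:,:)   ! element kind is an assumption
  end type ddata_t
  type(ddata_t), allocatable :: ddata_p(:)
  integer, parameter :: n1 = 58273           ! thread value is 5827396; reduced here
  integer, parameter :: n2 = 75
  integer :: i
  real(kind=4) :: time0, time1

  ! secnds is a nonstandard extension (available in Intel Fortran)
  time0 = secnds(0.0)
  allocate (ddata_p(n1))
  do i = 1, n1
    allocate (ddata_p(i)%elem(n2,n2))
  enddo
  time1 = secnds(time0)
  print *, 'Allocation time', time1

  time0 = secnds(0.0)
  do i = 1, n1
    deallocate (ddata_p(i)%elem)
  enddo
  deallocate (ddata_p)
  time1 = secnds(time0)
  print *, 'Deallocation time', time1
end program test_dealloc
```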
I used the function secnds for the timing, and the output printed on the screen is:
>test_dealloc.exe
Allocation time 15.62891
Deallocation time 4075.336
The machine is a Xeon E5-2690 (x64) with 192 GB of RAM; the OS is Windows Server 2008 R2 Enterprise.
According to Task Manager, the allocated memory is around 44 GB. The deallocation time is very large: much longer than the allocation time, and also much longer than the computation kernel where the data structure is filled.
Is there any way to reduce this deallocation time?
Thank you very much. Regards
Paolo
This issue sounds serious and deserves consideration by Intel.
"From the task manager the allocated memory is around 44GB"
I did a calculation for n1 = 5827396 and n2 = 75 and got an estimate of 244 GB. I suggest you try n1 = 5827396/100, make the following changes, and run with Task Manager open.
integer(kind=8) :: siza, sizi, sizet
real*4 gb
! start
time0 = secnds(0.0)
allocate (ddata_p(n1))
write (*,*) 'allocate ddata_p size =', sizeof (ddata_p)
sizet = 0
do i=1,n1
  allocate (ddata_p(i)%elem(n2,n2))
  sizet = sizet + sizeof ( ddata_p(i)%elem )
enddo
time1 = secnds(time0)
gb = sizet / 1024.**3
write (*,*) 'fill ddata_p size =', sizet, gb
! c_sizeof (ddata_p)
write (*,*) 'allocate ddata_p size =', sizeof (ddata_p)
read (*,*) gb
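The 244 GB estimate can be checked without allocating anything. The sketch below assumes 8-byte elements (the element type is not shown in the thread, but 8 bytes is what makes the figure come out) and counts only the array payload, not the per-array descriptors or heap overhead.

```fortran
program size_estimate
  implicit none
  integer(kind=8), parameter :: n1 = 5827396_8, n2 = 75_8
  integer(kind=8), parameter :: elem_bytes = 8_8   ! assumption: 8-byte elements
  integer(kind=8) :: total_bytes
  real(kind=8) :: gb

  ! n1 matrices of n2 x n2 elements each
  total_bytes = n1 * n2 * n2 * elem_bytes
  gb = real(total_bytes, kind=8) / 1024.0d0**3
  write (*,'(a,i0,a,f8.1,a)') 'payload = ', total_bytes, ' bytes = ', gb, ' GB'
end program size_estimate
```

With these values the payload works out to roughly 244 GB, well above the 192 GB of RAM on the first machine.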
Thank you John,
You are right: the real size of the data structure is 244 GB. In fact, if I initialize the variable, Task Manager shows the same allocated memory that you estimated.
I ran this test on a bigger machine (Intel Xeon Gold 6144 with 684 GB of RAM, OS Windows Server 2016 Datacenter), adding the following initialization:
...
time0=secnds(0.0)
do i=1,n1
  ddata_p(i)%elem = 0
enddo
time1=secnds(time0)
print*, ' '
print*, 'Setting time', time1
...
And I obtained these new values:
test_dealloc_setting.exe
Allocation time 14.67578
Setting time 79.22656
Deallocation time 3202.836
The deallocation time is still very large.
Paolo
This looks like an issue with the C++ heap manager and/or virtual memory paging due to fragmentation issues. Try deallocating in reverse order of allocation:
time0=secnds(0.0)
do i=n1,1,-1
  deallocate(ddata_p(i)%elem)
enddo
deallocate (ddata_p)
time1=secnds(time0)
print*, ' '
print*, 'Deallocation time', time1
Jim Dempsey
Also, if you have memory deallocation debugging enabled (e.g. valgrind, or the Windows debug C Runtime Library), this may affect the speed of free (deallocate).
Jim Dempsey
Hi Jim,
I tried to deallocate in reverse order but nothing changed.
Regarding the compiler options, I left the defaults for the release configuration, which are (from the command-line window of the compiler):
/nologo /O2 /module:"x64\Release\\" /object:"x64\Release\\" /Fd"x64\Release\vc100.pdb" /libs:dll /threads /c
Paolo
One possible solution is:
1) change elem into a pointer instead of an allocatable array.
2) allocate a single block of n1*n2*n2 elements.
3) point ddata_p(i)%elem to the first element of each n2*n2 matrix.
In this case you will have a single deallocation.
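A minimal sketch of these three steps taken together, assuming a fixed n2, 8-byte real elements, and a compiler that supports Fortran 2003 pointer bounds remapping. The names `ddata_t` and `pool` are hypothetical, and n1 is kept small for illustration.

```fortran
program pooled_elem
  implicit none
  type :: ddata_t                                    ! hypothetical type name
    real(kind=8), pointer :: elem(:,:) => null()     ! step 1: pointer component
  end type ddata_t
  type(ddata_t), allocatable :: ddata_p(:)
  real(kind=8), allocatable, target :: pool(:)       ! hypothetical backing store
  integer, parameter :: n1 = 1000, n2 = 75
  integer(kind=8) :: off
  integer :: i

  allocate (ddata_p(n1))
  allocate (pool(int(n1,8) * n2 * n2))               ! step 2: one big block
  pool = 0.0d0

  do i = 1, n1                                       ! step 3: map each matrix
    off = int(i-1,8) * n2 * n2
    ddata_p(i)%elem(1:n2,1:n2) => pool(off+1 : off+n2*n2)
  enddo

  ddata_p(3)%elem = 1.0d0                            ! use it like a normal matrix
  print *, 'sum over pool =', sum(pool)

  deallocate (ddata_p)                               ! pointers are not freed one by one
  deallocate (pool)                                  ! single deallocation of the data
end program pooled_elem
```

The per-element `deallocate` loop disappears entirely; the cost is that the individual matrices can no longer be resized or freed independently.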
Regards
Hi Luigi,
Thank you for the suggestions. About the options:
1) I tried with a pointer, but there was no real change.
2) That would really be the worst solution for my case, because in the real code n2 depends on n1 and I don't know the final dimensions of the data in advance, so the derived structure is the best option for my application.
3) I didn't understand this point; could you explain it further, please (or give an example)?
Regards
Paolo
Sorry for answering in a hurry.
The three points I suggested must be implemented together to get a possible solution.
About your point 2): the solution can be applied if n2(i) can be computed for each i in a preliminary do-loop.
About your point 3): I do not have time to set up an example. This step is similar to setting N pointers to the columns of an NxN matrix (after the matrix has been allocated).
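For the variable-size case, a two-pass layout could look like the sketch below. It only applies if each n2(i) is known before the data is filled; the sizing rule `n2(i) = 2 + i`, the names `ddata_t` and `pool`, and the 8-byte real elements are all placeholders chosen for illustration.

```fortran
program pooled_variable
  implicit none
  type :: ddata_t                                    ! hypothetical type name
    real(kind=8), pointer :: elem(:,:) => null()
  end type ddata_t
  type(ddata_t), allocatable :: ddata_p(:)
  real(kind=8), allocatable, target :: pool(:)       ! hypothetical backing store
  integer, parameter :: n1 = 5
  integer :: n2(n1), i
  integer(kind=8) :: off, total

  ! Pass 1: compute every block size and the total storage needed
  total = 0
  do i = 1, n1
    n2(i) = 2 + i                                    ! placeholder sizing rule
    total = total + int(n2(i),8)**2
  enddo

  allocate (ddata_p(n1), pool(total))
  pool = 0.0d0

  ! Pass 2: point each elem at its own slice of the pool
  off = 0
  do i = 1, n1
    ddata_p(i)%elem(1:n2(i),1:n2(i)) => pool(off+1 : off+int(n2(i),8)**2)
    off = off + int(n2(i),8)**2
  enddo

  do i = 1, n1
    print *, i, shape(ddata_p(i)%elem)
  enddo

  deallocate (ddata_p)
  deallocate (pool)                                  ! one deallocation for all data
end program pooled_variable
```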
Regards
Do you have VTune available to profile the program?
About the only thing left to check: there used to be an Intel floating license check issue that caused long delays in a program, though I do not recall it being related to deallocation.
Jim Dempsey
This is a very serious 300-fold drop in performance. Why has it not been acknowledged by Intel as an issue?
I took the program in #1 and built it. I reduced n1 by a factor of 100 because the machine had nowhere near the memory you asked for.
The allocation took 0.21 s and the deallocation 0.08 s. Our machines will have different speeds, but broadly speaking the allocation time is in proportion with your bigger n1 value. The deallocation time, however, is not: it is orders of magnitude out!
I would suggest running your test at several increasing values of n1. I guess we will see some threshold value at which there is a step change in the deallocation time. This might show something interesting.
The limit (if that is what we see) might be a function of your Windows version and/or hardware, or it may be some limit within the compiler.
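A sweep of this kind might be sketched as follows. The starting size, the doubling schedule, the type name `ddata_t`, and the 8-byte real elements are assumptions; `system_clock` is used instead of `secnds` to stay within standard Fortran. The upper limit would need to be raised well beyond what is shown here to approach the sizes in the original report.

```fortran
program dealloc_scaling
  implicit none
  type :: ddata_t                            ! hypothetical type name
    real(kind=8), allocatable :: elem(:,:)   ! element kind is an assumption
  end type ddata_t
  type(ddata_t), allocatable :: ddata_p(:)
  integer, parameter :: n2 = 75
  integer :: n1, i, k
  integer(kind=8) :: c0, c1, c2, rate

  call system_clock(count_rate=rate)
  n1 = 1000
  do k = 1, 5                                ! n1 = 1000, 2000, ..., 16000
    call system_clock(c0)
    allocate (ddata_p(n1))
    do i = 1, n1
      allocate (ddata_p(i)%elem(n2,n2))
      ddata_p(i)%elem = 0.0d0                ! touch the memory so it is committed
    enddo
    call system_clock(c1)
    do i = 1, n1
      deallocate (ddata_p(i)%elem)
    enddo
    deallocate (ddata_p)
    call system_clock(c2)
    ! columns: n1, allocate+fill seconds, deallocate seconds
    write (*,'(i10,2f12.4)') n1, real(c1-c0,8)/rate, real(c2-c1,8)/rate
    n1 = n1 * 2
  enddo
end program dealloc_scaling
```

If deallocation scales linearly, the last column should roughly double from row to row; a sudden jump would mark the threshold in question.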
My initial thought was that the large delay was due to a virtual memory event being initiated during deallocation. Physical memory is not used at allocation, only when the array is actually used, which might first happen at deallocation. I notice an initialisation phase has now been added to the tester, which appears to indicate that the deallocation delay is not associated with virtual memory usage. Given the memory sizes being tested, I am also not able to reproduce this problem. The luxury of having this problem!