lanH, - Page 2

Frank_W_1 · ‎05-24-2013

I have defined a Type Called A_Type which has dynamic array inside,

then I construct a dynamic array of A_Type,

When I test the memory I could use ,

I find the available memory decreases after I deallocate all the dynamic array.

Why? The phenomenon looks like memory leak.

Is there any suggestion ?

IanH · ‎06-06-2013

Your c++ Memory_Total_Test is testing a different thing to your Fortran case. The C++ test ultimately allocates its blocks from a heap managed by an operating system library using HeapAlloc. The Fortran case supplies the allocation (because it is large) directly from the process' address space using VirtualAlloc. This is the aspect that I was referring to above, when I said that the Fortrans programs way of testing for available contiguous memory might be confusing things.

If you rewrote your memory_total_test to use VirtualAlloc I expect that you would see very similar results.

Your Fortran approach implies that you want both one large contiguous chunk and many, many smaller identical size chunks. Is that really your use case? If you are worried about fragmentation preventing the availbility of the large chunk, and easy solution is to get (and retain) that allocation early - before you do the smaller allocations. That's easy enough to do in standard Fortran.

John_Campbell · ‎06-06-2013

Steve,

Could you explain the difference between C malloc to allocate the storage and the WIndows API routine VirtualAlloc ? At what allocation size does the switch take place ?

It is my understanding form your earlier explaination that smaller amounts of memory are allocated off the heap. However Frank's example allocates 2,000,000 small blocks of 600 bytes, which is 1.2gb in total. At some point, does the heap overflow ? If so does a new heap become created or does it revert to VirtualAlloc ?

There appears to be some memory management structure left over from the DEALLOCATE. The situation appears to be worse when N=250 and the ALLOCATE fails before 2,000,000 are allocated.

Given that there are two types of allocation methods, could the routine "Memory_Total_Test" be getting different answers, depending on if it uses MALLOC or VirtualAlloc ?

John

Perhaps I should modify the test program to report the address and size of each successfull ALLOCATE. This would then provide a memory map of where all the memory is being used, and identify what memory is not being used. I could then analyse for contiguous blocks and report them.

Gap Size (bytes)                 Number
                        40                      14
                        56        1,994,948
                        88                         1
                     456                4,726
                     584                         7
                     648                         4
                     680                         7
                     696                         9
                  1,480                    198
                 1,672                         1
               35,128                         1
               62,920                      52
               63,944                      14
               64,968                      12
             196,040                         1
             262,600                         2
         4,758,248                         1
     160,088,856                         1

The table represents the count of the gaps between each B(i)%C allocation address, after allowing for 600 bytes in size. The minimum gap is 40 bytes and the typical is 56 bytes. So typically each allocation is stepping 656 bytes through memory. Is this an overhead of the %C index or a memory gap from MALLOC ?
There are 5,037 gaps larger than 56 bytes.

The allocation : ALLOCATE(B(M),STAT=Error) appears to allocate 2,000,000 x 80 bytes of memory, based on the report :
write (*,*) 'mem_index',m, loc(B), loc(B(m)), B(m)%N, error ; giving
mem_index 2000000 6094920 166094840 150
this would represent the 160,000,000 gap identified above.

Why is there 80 bytes per entry of B(:) ?

Edit : 06/11/2013
The test I reported above was done using Intel(R) Visual Fortran Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 12.1.5.344 Build 20120612
I have now re-tested the example, using Intel(R) Visual Fortran Compiler XE for applications running on IA-32, Version 12.1.5.344 Build 20120612

The changes are that:
There are now 40 bytes per B(:) entry. With the address information changing from 8 bytes to 4 bytes, there is potentially 10 addresses per B(:) entry ?, or is the default integer in ifort_64 8 bytes ?

The table of gaps between each B(i)%C now has a mixture of 40 bytes or 56 bytes, as the main gap, but a few 24 bytes (?).

[plain]Gap (bytes)Count of gap
                          24                       7
                          40          999,982
                          56          995,037
                       136                       4
                       216                       1
                       232                    14
                       392                       1
                       408              4,792
                    1,432                    78
                 35,064                       1
                 63,896                    25
                 64,920                    28
                 65,944                    24
               262,552                       2
               634,056                       1
           2,492,888                       1
         81,658,264                       1
Grand Total     1,999,999 [/plain]

These large gaps between each entry still indicate there is potential for extra information being associated with each B(k)%C(150) allocation.

Frank_W_1 · ‎06-06-2013

lanH,

Thank you for your help.

But I don't care and have no ability to know what have been done in the lower level with fortran or C++;

As a programer ,I will always focus on the algorithm with the program tool,meanwhile I will not think what happened in deep level and how

the program tool works, or I will be a compiler instead of a coder now.

In my opinion, when I have made the same data structure and algorithm in different program language ,it should have the same or similar result.

You explains to me that my c++ Memory_Total_Test is different with my Fortran case.

But I can't tell the difference between them from algorithm.They are twins just with different program languages.

As Steve has said,

"My experiment replacing the ALLOCATE/DEALLOCATE with malloc and frree, using the identical amounts of storage, demonstrates the exact same fragmentation issue, which shows that Fortran is not doing something extra resulting in the fragmentation. It is the Microsoft C library that is not properly consolidating the freed blocks",

Due to Steve's Opinion, malloc and free maybe has some problem under such situation, if I didnt make a mistake to understand him.

But after my test with C++, this doesnt happen. So I feel puzzled.

Now I need such data structure to program some complex algorithm,what should I do,do you have any suggestions?

Change to C++ because of no fragmentation? I do hope not.

Frank

Frank_W_1 · ‎06-06-2013

Lanh,

As you have explained,

"The C++ test ultimately allocates its blocks from a heap managed by an operating system library using HeapAlloc. The Fortran case supplies the allocation (because it is large) directly from the process' address space using VirtualAlloc. "

Which is larger? Both are large. The algorithm of Memory_Total_Test is to get the biggest continuous memory, the array of Temp in both C++ and Fortran will be about 2GB at the final successful allocation.

From your opinion, HeapAlloc or VirtualAlloc maybe have some problem, if so ,why not change to the better one?

"Your Fortran approach implies that you want both one large contiguous chunk and many, many smaller identical size chunks. Is that really your use case? If you are worried about fragmentation preventing the availbility of the large chunk, and easy solution is to get (and retain) that allocation early - before you do the smaller allocations. That's easy enough to do in standard Fortran."

I don't need one large contiguous chunk.The function of Memory_Total_Test ,as I have said above,is just to test the biggest continuous memory.

During the process of my code, the dimension of dynamic array is always changing .They are identical just at the beggining. After many cycles of deallocate/allocate,about 300 MB Memory is lost. The fortran case is just a abstract case to show this problem.

IanH · ‎06-07-2013

Hang a tick - the requested allocations are wildly different between the Fortran examples and the C++ example!

When I adjust the C++ program to have the same characteristics as one of the Fortran programs (2 million allocations of about 600 bytes or so) - I see the same fragmentation behaviour.

Alternatively, if I change the Fortran code to have the same characteristics as the latter C code (500 small allocations of 3MB each) - I see no fragmentation.

With the search for maximum allocation strategy used (based off John's code) the size of allocation used by new in the C++ code is always bigger than what the underlying heap manager appears to accomodate - it just ends up setting aside memory separate to the "normal" heap memory using VirtualAlloc anyway - so the results from Fortran "directly" calling VirtualAlloc and the C runtime calling HeapAlloc become consistent for the largest contiguous free block search.

I can reproduce the behaviours using language agnostic calls direct to the Win32 api. When fragmentation occurs it has nothing to do with the Fortran language runtime. I guess fragmentation is an occasional threat for any dynamic memory system where allocated memory blocks cannot be moved around. If fragmentation is causing you issues there are some workarounds - one example mentioned previously - in the Win32 API code you can see how obliterating a fragmented heap and then creating a new heap lets you "reset" things. But implementing this wouldn't be pretty.

VirtualAlloc and HeapAlloc have different roles to play. VirtualAlloc is not appropriate to use for small allocations - the minimum it can allocate is a page of memory and it requires a kernel call. HeapAlloc can efficiently dish out smaller allocations and, unless the heap is full and more address space needs to be added to the heap (in which case HeapAlloc calls VirtualAlloc to do that reservation) doesn't necessarily have to call into the kernel. There is a small overhead associated with HeapAlloc for large allocations that VirtualAlloc doesn't incur, and VirtuaAlloc gives you more control over the access settings, etc for the allocated memory. But anyway - that's all detail - let the language runtimes worry about what's best.

John_Campbell · ‎06-07-2013

Two points I don't understand from my previous post:
1) Why are there these 56 byte gaps between the c(100) allocations ? Is this an indication of something else being allocated, or just a lot of wastage. I tried another compiler and it allocated 16 byte gaps. The next question is if this full 600 + 56 bytes is being released and then merged with the adjacent memory pool. The symptoms of the problem suggest this is not taking place.
2) The total allcation is over1.2gb. I didn't think the heap was so large, so this "heap" must be a more general definition, rather than the bottom of the stack.

Maybe this is what we have, in regards to the capability of ALLOCATE / DEALLOCATE, although I'm surprised that there can not be more clean-out achieved when removing the Type structure.

John

Frank_W_1 · ‎06-07-2013

LanH，
Thank you very much for you help.
I am very sorry I have made a unfair test example between C++ and Fortran.Without you detail
explaination,I will not find out such bad test design case.Steve has also pointed it out,
but I didnt catch it.I should make an apology.

It is really complicated.Is this a bug or a bad design of heap management ? Maybe or Maybe not!
But now I do know some algorithm should be changed under such situation.
Some data stucture should be optimized too.Millions of dynamic arrays with random size
should be forbidden.And cycles of deallocate/allocate should be limited to some level.

Now I begin to worry about the data structure such as list/tree,which have dynamic nodes
with random amount.Once there is a very big list or very huge tree with small data nodes ,cycles of construct and reconstrcuct,
will the available memory become smaller and smaller?If so,it is really horrible.

By the way, will this happen under x64 OS? Do you have any idea?
I have made some test,but it is hard to explain.
Some evidence shows it will not happen under X64.But I am not sure.

Thank you !

Frank

Frank_W_1 · ‎06-07-2013

Two dimmension dynamic array has the same problom too.

int m,n;
int row,col;
int **p;
/*Allocate*/

Memory_Total_Test(1);
m =1000000;
p=(int **)malloc(m * sizeof(int *));
for (row=0;row<m;row++)
{
  n = 200;
  p[row]=(int *)malloc(n * sizeof(int));
  for(col=0;col<n;col++)
   p[row][col]=0;
}
Memory_Total_Test(2);

/*deallocate*/
for (row=0;row<m;row++)
{
free(p[row]);
}
free(p);

Memory_Total_Test(3);

Position =1Memory that could be used(M):1728
Position =2Memory that could be used(M):941
Position =3Memory that could be used(M):941

IanH · ‎06-08-2013

(This can all vary a bit depending on you runtime library options (static vs DLL, debug vs release, etc). Because the underlying heap manager is an operating system service - it will also vary with operating system version and perhaps system settings. As mentioned before, the mere presence of a debugger can also change behaviour.)

The Fortran programmer executes an ALLOCATE statement - as part of that the Fortran runtime [may] call the malloc function in the C runtime. The C runtime then asks the operating system provided heap manager to allocate some memory using HeapAlloc (with earlier versions of Microsoft C++ the C runtime provided its own heap manager. I'm pretty sure earlier versions of Windows (pre W2K?) didn't provide a heap manager for general use either.) The heap manager that sits behind HeapAlloc will have some overhead associated with the allocation (things like the size of the block, pointer to the next block, status of the block, etc.). You can query that overhead using the HeapWalk API - it appears to be about 24 bytes. I expect that 24 bytes is probably located immediately before the allocation, though it could be before and after, or part could even be located remotely.

The debug variant of the C runtime then has further overhead located immediately before and after the allocated block. If you have a professional edition of VS you can actually inspect the source code for this in crtdbg.h, but it is also well documented on msdn. Effectively the allocation returned by HeapAlloc is filed with a structure in Fortran that looks like:

[fortran] TYPE, PUBLIC, BIND(C) :: CrtMemBlockHeader
    TYPE(C_PTR) :: pBlockHeaderNext         ! struct _CrtMemBlockHeader*
    TYPE(C_PTR) :: pBlockHeaderpREV         ! struct _CrtMemBlockHeader*
    TYPE(C_PTR) :: szFileName               ! char *
    INTEGER(C_INT) :: nLine
    ! On x64 the order of the next two members are reversed from the 32 bit
    ! variant to keep the structure packed and aligned all nicely. Academic
    ! for me until the CFO ponies up more cash, but hopefully this
    ! directive covers it.
!DEC$ IF DEFINED(_M_IA64) .OR. DEFINED(_M_AMD64) .OR. DEFINED(__x86_64__)
    INTEGER(C_INT) :: nBlockUse
    INTEGER(C_SIZE_T) :: nDataSize
!DEC$ ELSE
    INTEGER(C_SIZE_T) :: nDataSize
    INTEGER(C_INT) :: nBlockUse
!DEC$ END IF
    INTEGER(C_LONG) :: lRequest
    CHARACTER(C_CHAR) :: gap(nNoMansLandSize)
    ! Psuedo fields that follow.
    !CHARACTER(C_CHAR) :: data(nDataSize)
    !CHARACTER(C_CHAR) :: anotherGap(nNoMansLandSize)
END TYPE CrtMemBlockHeader
[/fortran]

where the data field is the actual memory that holds the data of the Fortran allocatable variable. (The allocatable variable will itself have its own descriptor, but that is located elsewhere in the storage for the derived type itself). The no mans land gaps are used to detect heap overruns.and are four bytes long - so C runtime overhead here on 32 bit systems is 36 bytes.

(36 + 24 is more than your 56, but we're within engineering tolerance...)

Then you might also have a small amount of overhead associated with various runtimes ensuring that the memory is appropriately aligned.

The underlying heaps (there can be more than one - one is created for the process when the process is created, but other heaps are created by various subsystems. The C runtime creates its own heap separate to the process default heap - and that is typically the fifth heap whenever I've gone process memory spelunking) are grown as required - see the discussion for the HeapAlloc etc. This growth happens by the heap manager calling VirtualAlloc and asking for another chunk (referred to as a region) of the process' address space - it looks like it does this in a doubling sense (first it might request 4K, then 8K, 16K, etc, up to a maximum of a 16 MB increment). That region may not be contiguous with the previous regions. Typically once the heap manager has added a region it doesn't give it back to the operating system until the heap is destroyed (effectively when the program terminates, unless the programmer explicitly destroys the heap). From experimentation yesterday I see exceptions - "super" regions are requested for each large allocations (more than 16MB) and are effectively managed separately to the regions that make the heap up proper. There is a small amount of overhead for each region.

From what I can tell (not categoric), allocations within a region are coalesced when they are released, but blocks in separate regions that happen to be adjacent are not.

Importantly - when you ALLOCATE a large block of memory (16 MB or more - or something like that) - the Fortran runtime doesn't go via the C runtime - it directly asks the operating system via VirtualAlloc for a chunk of address space. This request cannot be satisfied by address space that the heap manager has already requested, even if the heap is completely empty. In extreme cases ALLOCATE might fail to find a single 16MB spot for its data, but could quite happily allocate two hundred or so 10 MB blocks. The Fortran test cases in this thread show that problem - they make a whole heap of small allocations, which are fulfilled from a heap - so the heap grows, then they free those small allocations (the heap is mostly empty), then they try and find a single large allocation.

If you want to go investigating further:

- install the vmmap application that Microsoft put out through their sysinternals "site".

- I translated bits of crtdbg.h to Fortran when debugging a memory leak associated with polymorphic components a while back. It provides various routines that can be useful for analysing the debug heap (there are more still that I didn't translate across and I see I've been hacking without testing - hopefully it still works).

- The AllocateTest routine shows use of HeapWalk and friends - you could use this to build a memory map of heap allocations. vmmap has similar capability.

To the OP - are you sure this is really a problem for you? It is one thing to observe symptoms with test cases, it is another to experience those in production code. If you do run into problems I think it is more due to the simple fact that you are trying to use the majority of your process' address space, and not so much the performance of the heap manager. A consequence of this is that moving to 64 bit would help massively here. With dynamic allocation you are always going to have some overhead/loss associated with things like fragmentation. I don't see this as a heap manager bug or as a heap manager design issue - its simply a question of whether your program is making resonable use of the resources of the machine. The heap manager performs acceptably for the vast majority of Windows applications, and most of those have far more complicated memory allocation and deallocation patterns than your typical Fortran program, but perhaps less total memory demand.

jimdempseyatthecove · ‎06-08-2013

If this is a pressing problem, you can replace the MS VS CRTL malloc and free with a different thread-safe malloc and free with the required behavior. Just link in before the CRTL.

Jim Dempsey

Frank_W_1 · ‎06-14-2013

lanH，

Many thanks for your help.

The problem did happen in production code due to the algorithm I have designed.
Millions of dynamic arrays have been used to collect and group some result data
with inserting and sorting operation.

But when I find out the problem is related to the data structure,
I have changed the data structure to avoid this problem with some efficiency loss.

In the area of computational mechanics,with the development of program language,
due to the flexible use of some new features,millions of dynamic arrays do may happen.
In the past,all the computation may be done in a very big array,which is hard to manage and read,
we hope to do some improvement.But since the fragmentation problem, the big array looks a good way.

For it is a commercial code, we should consider the need of Win32 users, so just
updating it to win64 is not a perfect solution.We will change the algorithm to avoid such problem.

Thank you for the great help.

Thanks to Steve.

Thanks to Jhon.

Thanks to Jim.

Frank

Steven_L_Intel1 · ‎06-14-2013

Sorry I missed the earlier question about VirtualAlloc - it is used for allocations 16MB and larger, otherwise it calls aligned_malloc.

Dynamic Array Memory Problem