Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.
2465 Discussions

scalable_malloc fails to allocate memory while there is plenty of memory available.

wzpstbb
Beginner
1,262 Views
Hi,

We are using tbbmalloc to manage our system memory. We use scalable_malloc/scalable_free to allocate/deallocate system memory. Everything worked fine until we ran into the case below:
1. Keep allocating 1 MB textures using DX10 until the allocation fails. Note that some of the system memory will be consumed by doing this.
2. Release all the textures allocated in step #1.
3. Create some objects on the heap. scalable_malloc returns null, while I believe there is a lot of system memory available at this point. When we replaced scalable_malloc with malloc, the memory could be allocated.

Does anyone have any idea why?

Thanks,
Wallace
0 Kudos
25 Replies
wzpstbb
Beginner
Quoting - wzpstbb
Hi,

We are using tbbmalloc to manage our system memory. We use scalable_malloc/scalable_free to allocate/deallocate system memory. Everything worked fine until we ran into the case below:
1. Keep allocating 1 MB textures using DX10 until the allocation fails. Note that some of the system memory will be consumed by doing this.
2. Release all the textures allocated in step #1.
3. Create some objects on the heap. scalable_malloc returns null, while I believe there is a lot of system memory available at this point. When we replaced scalable_malloc with malloc, the memory could be allocated.

Does anyone have any idea why?

Thanks,
Wallace

By the way, the case is run on Vista 32-bit, 4 GB RAM, and an NVIDIA 8800 GTS card with 640 MB of video memory.

- Wallace
jimdempseyatthecove
Honored Contributor III

TBB uses a pools concept. It probably uses aligned_malloc to acquire a multi-megabyte pool when a scalable malloc would otherwise fail. Once one or more pools are allocated, scalable malloc draws from them until it runs out, and then it allocates another large pool. Scalable free returns memory to a pool, and pools are "never" returned to the C/C++ heap. Malloc internally does something similar to expand its heaps in virtual memory. Once either memory allocator (malloc or TBB) has acquired virtual memory, it does not return this memory until exit (under the assumption that your program will allocate/free again and again using the same allocator).

When using a combination of malloc and TBB scalable malloc, you can run into allocation failures when memory gets fragmented not only within a heap or pool but also among the heap and pool(s).

A palliative measure (when on Windows) might be to enable the Low Fragmentation Heap (search MSDN for LFH).

Alternately, before your step 1), add a step 0) that uses the TBB scalable allocator to allocate a working set of memory, then scalable-frees this memory. After which, your step 1) will have a reduced upper bound for allocations, and step 3) will have a working set available.
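This "step 0" idea can be sketched as follows. This is a minimal, hypothetical illustration; plain malloc/free stand in for scalable_malloc/scalable_free so the sketch is portable, and the block sizes are arbitrary assumptions:

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Step 0 sketch: grow the allocator's pool up front by allocating and
// then freeing a working set. With the TBB allocator you would call
// scalable_malloc/scalable_free here; malloc/free are portable stand-ins.
bool prewarm_allocator(std::size_t total_bytes, std::size_t block_bytes) {
    std::vector<void*> blocks;
    std::size_t allocated = 0;
    while (allocated < total_bytes) {
        void* p = std::malloc(block_bytes);
        if (!p) break;                 // stop early if memory is tight
        blocks.push_back(p);
        allocated += block_bytes;
    }
    bool ok = (allocated >= total_bytes);
    for (void* p : blocks)
        std::free(p);                  // pooled allocators keep this space reserved
    return ok;
}
```

Called once at startup, before the texture allocations begin, this reserves address space inside the allocator's pools so that later requests can be served from memory the allocator already owns.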

Jim Dempsey
Alexey-Kukanov
Employee
Quoting - wzpstbb
Hi,

We are using tbbmalloc to manage our system memory. We use scalable_malloc/scalable_free to allocate/deallocate system memory. Everything worked fine until we ran into the case below:
1. Keep allocating 1 MB textures using DX10 until the allocation fails. Note that some of the system memory will be consumed by doing this.
2. Release all the textures allocated in step #1.
3. Create some objects on the heap. scalable_malloc returns null, while I believe there is a lot of system memory available at this point. When we replaced scalable_malloc with malloc, the memory could be allocated.

Does anyone have any idea why?

Thanks,
Wallace

Most likely, DX10 and the MS CRT share the same memory pool, which the TBB allocator cannot use. Once virtual memory is exhausted, it is all hoarded in that pool, so the TBB allocator does not succeed in its attempts to map some more memory.
To prove or disprove that, a reproducing test case would be helpful.
wzpstbb
Beginner

Most likely, DX10 and the MS CRT share the same memory pool, which the TBB allocator cannot use. Once virtual memory is exhausted, it is all hoarded in that pool, so the TBB allocator does not succeed in its attempts to map some more memory.
To prove or disprove that, a reproducing test case would be helpful.

Thanks for the quick reply.

Sounds like the cause. I have attached a reproducing test case. Please pay attention to Tutorial01.cpp, lines 223~270. I am using VS 2008 and the DirectX SDK (March 2009).

- Wallace.
Alexandr_K_Intel1

Sadly, I am unable to reproduce your situation, i.e. when I run your test case, scalable_malloc returns a memory block successfully.

Also, I see that on the second and subsequent calls to Render(), zero 2D textures can be allocated. Is this expected behavior? Is the memory really released? Could it be connected with the scalable_malloc behavior you observe?

There is a GlobalMemoryStatusEx function to report the available virtual memory size. Could you check the available size before the first scalable_malloc call?

If the reason is lack of virtual address space, as Alexey supposes above, the VirtualAlloc for a 1 MB block can fail in place of the first scalable_malloc call (that is what scalable_malloc does internally on the first call).
wzpstbb
Beginner

Sadly, I am unable to reproduce your situation, i.e. when I run your test case, scalable_malloc returns a memory block successfully.

Also, I see that on the second and subsequent calls to Render(), zero 2D textures can be allocated. Is this expected behavior? Is the memory really released? Could it be connected with the scalable_malloc behavior you observe?

There is a GlobalMemoryStatusEx function to report the available virtual memory size. Could you check the available size before the first scalable_malloc call?

If the reason is lack of virtual address space, as Alexey supposes above, the VirtualAlloc for a 1 MB block can fail in place of the first scalable_malloc call (that is what scalable_malloc does internally on the first call).

That is weird! I can easily reproduce the problem with the test case I attached. I ran the case on a Vista 32-bit OS, 4 GB RAM, and an NVIDIA 8800 GTS which has 640 MB of video memory.

Yes, the textures are allocated and then released immediately. Therefore, it is expected that on the second and subsequent calls to Render(), zero 2D textures can be allocated.

I will have a try with the GlobalMemoryStatusEx function to see the available virtual memory size. If the reason is lack of virtual address space, is there a solution for it?

Thanks,
- Wallace
turks
Beginner
Quoting - wzpstbb
I will have a try with the GlobalMemoryStatusEx function to see the available virtual memory size. If the reason is lack of virtual address space, is there a solution for it?

Yes. I ran into what I think is the same problem. A program that legitimately mallocs or scalable_mallocs and then frees up everything allocated still eventually runs out of virtual address space (not memory).
You will see that the dwAvailVirtual as reported by GlobalMemoryStatusEx does not go back up after the free.
It has effectively "stepped on" a huge range of addresses, even though the memory was given back.

I did find a solution, which I think will work for you.
Instead of malloc or scalable_malloc use the following two functions for alloc and free.
They DO give back the virtual address SPACE as well as the memory.

data = (BYTE *)VirtualAlloc(NULL, BLOCK_SIZE, MEM_COMMIT, PAGE_READWRITE);
and
VirtualFree(data, 0, MEM_RELEASE);

This did totally solve this sticky problem for us.
Good luck!
Mitch

Alexey-Kukanov
Employee
In most cases, VirtualAlloc shouldn't be the allocation method of choice, for at least two reasons: it is much slower than malloc, and it operates with relatively large blocks - a range in the address space must first be reserved in 64K chunks, then committed in 4K pages. Basically, it is suitable for building custom memory pools on top of, but not as a substitute for malloc.
turks
Beginner
Alexey is right. Use VirtualAlloc only for these large textures, not as a general malloc replacement. In this case, since textures are 1 MB allocations, the size is not an issue, and the speed was not an issue for us.

Write a very simple loop that does the following a few dozen times: { display available virtual memory, allocate 1000 textures, free 1000 textures }. Each pass through the loop gets a gigabyte of memory and address space, and the next pass also gets 1 GB of memory and address space. If you don't use the VirtualAlloc/VirtualFree mechanism, the addresses of the textures will keep crawling throughout the full 2-4 GB address range, and the displayed available VM will go down with each loop pass. This will also be seen if you use the Task Manager to view the VM usage.

Depending on your speed requirements and how often your program allocates textures, you may or may not be able to use this solution. After re-reading your post, I think there may be a better solution, as follows.

Consider address space fragmentation. There is a big difference between doing 100 allocates for 100 textures and allocating a single array of 100 textures. The latter requires the address space to be contiguous, which may quickly become impossible after many alloc/de-allocs. Changing your code to do the less-efficient, separate allocation per texture will get around this address space fragmentation problem and could solve your problem without resorting to VirtualAlloc.
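The difference can be sketched with two hypothetical helpers (malloc stands in for whatever allocator is in use; all names and sizes are illustrative only):

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Contiguous: one allocation must find count * size bytes of *contiguous*
// address space -- this is what fails first in a fragmented 32-bit process.
void* alloc_contiguous(std::size_t count, std::size_t size) {
    return std::malloc(count * size);
}

// Separate: each "texture" only needs 'size' contiguous bytes, so free
// gaps scattered through the address space can all be used.
std::vector<void*> alloc_separate(std::size_t count, std::size_t size) {
    std::vector<void*> blocks;
    for (std::size_t i = 0; i < count; ++i) {
        void* p = std::malloc(size);
        if (!p) break;                 // stop at the first failure
        blocks.push_back(p);
    }
    return blocks;
}

void free_separate(std::vector<void*>& blocks) {
    for (void* p : blocks) std::free(p);
    blocks.clear();
}
```

In a fragmented address space, alloc_separate can often satisfy requests that alloc_contiguous cannot, because no single free range needs to hold the whole array.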

Though we were taught that a demand-paging system pretty much solves your memory issues, it never anticipated nor dealt with the address SPACE management issues.

Address space fragmentation may in fact be your main problem.
I'm curious whether your code allocates textures in blocks and, if so, whether it's easy to change that and see if the problem totally goes away.

Mitch
turks
Beginner
Wallace,
I just took a look at your tutorial01.cpp. You do realize that since your program tries to allocate 65k 1 MB textures, that is 65 gigabytes of virtual memory. The limit for any one process is either 2 GB (XP), 3 GB (XP with the /3GB switch), or 4 GB with a 64-bit OS. Thus you are always using up your whole virtual address space before you do the scalable_malloc.
Yes, you are giving it all back, but scalable_malloc needs to start with some virtual address space of its own.
If you insert a call to GlobalMemoryStatus before the texArray alloc and again right after the texArray delete, you will see that dwAvailVirtual has gone down and NOT been restored.

Since you are doing separate 1 MB texture allocations until all of memory is used, your problem is not the address space fragmentation I mentioned above, but the former problem. If you can change the alloc methods CreateTexture2D() and Release() to use VirtualAlloc/VirtualFree, I'm pretty sure tutorial01.cpp will work.
wzpstbb
Beginner
Quoting - turks
Wallace,
I just took a look at your tutorial01.cpp. You do realize that since your program tries to allocate 65k 1 MB textures, that is 65 gigabytes of virtual memory. The limit for any one process is either 2 GB (XP), 3 GB (XP with the /3GB switch), or 4 GB with a 64-bit OS. Thus you are always using up your whole virtual address space before you do the scalable_malloc.
Yes, you are giving it all back, but scalable_malloc needs to start with some virtual address space of its own.
If you insert a call to GlobalMemoryStatus before the texArray alloc and again right after the texArray delete, you will see that dwAvailVirtual has gone down and NOT been restored.

Since you are doing separate 1 MB texture allocations until all of memory is used, your problem is not the address space fragmentation I mentioned above, but the former problem. If you can change the alloc methods CreateTexture2D() and Release() to use VirtualAlloc/VirtualFree, I'm pretty sure tutorial01.cpp will work.

Hi turks,

Thanks for exploring the issue. And sorry for my late reply; I was on vacation last week.

But I don't want to discard the TBB allocator. As Alexey noted, VirtualAlloc/VirtualFree is much slower than scalable_malloc and is not recommended for common allocation.

Alexey, can the TBB allocator resolve this issue? That is, is it possible for the TBB allocator to share the virtual address space with DX10 and the CRT?

- Wallace
Alexey-Kukanov
Employee
Quoting - wzpstbb
Alexey, can the TBB allocator resolve this issue? That is, is it possible for the TBB allocator to share the virtual address space with DX10 and the CRT?

I'd tell you upfront if that was possible; unfortunately it is not - at least not without TBB source changes.

The way for the TBB allocator to use the same pool as malloc would be to call malloc instead of VirtualAlloc; that's a relatively easy change one could make with the TBB sources. But that's only half the work or even less, because the TBB allocator is also "greedy" and in most cases does not return memory back. And finding a good balance between keeping memory blocks to speed up future allocations and returning them back to be more cooperative with other memory managers is a challenge with some ambiguous tradeoffs.

Possibly the best thing I can suggest is to pre-allocate enough memory with the scalable_allocator before allocating textures.
wzpstbb
Beginner
Can TBB expose some interfaces for releasing the virtual address space? Then we could ask TBB to release the virtual address space when the other allocator fails to allocate memory.
jimdempseyatthecove
Honored Contributor III
Wallace,

I suggest you adapt your 32-bit strategy to:

1. Keep allocating 1 MB textures using DX10 until the allocation fails.
1.a) Determine how to prorate memory into three general sections: 1 MB textures, TBB pool, malloc pool.
1.b) Keep the decided-upon number of 1 MB textures in your own private pool of textures.
2. Release all the non-reserved textures allocated in step #1.
3. Remove (if present) the overload of new/delete/malloc/free (via the TBB header), and then explicitly use the scalable memory allocator in those sections of code that are suitable for scalable allocation.
4. Use the private pool of previously allocated 1 MB texture buffers as you require 1 MB textures. When you run out, adapt your code to run with this limited number of textures.
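Steps 1.b and 4 might look roughly like the hypothetical pool below. Plain malloc stands in for the real texture allocation, and all names are made up for illustration:

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical fixed-size texture pool: buffers are allocated once up
// front (step 1.b) and recycled thereafter (step 4), so the process
// never has to ask the OS for texture memory again.
class TexturePool {
public:
    TexturePool(std::size_t count, std::size_t bytes) {
        for (std::size_t i = 0; i < count; ++i)
            if (void* p = std::malloc(bytes))
                free_list_.push_back(p);
    }
    ~TexturePool() {
        for (void* p : free_list_) std::free(p);
    }
    // Returns nullptr when the pool is exhausted -- the caller must
    // adapt to running with a limited number of textures.
    void* acquire() {
        if (free_list_.empty()) return nullptr;
        void* p = free_list_.back();
        free_list_.pop_back();
        return p;
    }
    void release(void* p) { free_list_.push_back(p); }
private:
    std::vector<void*> free_list_;
};
```

Because the pool never grows, the upper bound on texture memory is fixed at construction time, which is exactly the prorating decided in step 1.a.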

Jim Dempsey
wzpstbb
Beginner
Hi Jim,

My sample app is just for reproducing the problem. Our use case is much more complex. In our system, we overload the global new/delete operators. We use scalable_malloc/scalable_free to process memory requests from new/delete. It is difficult for us to predict how much memory would be allocated by TBB. Usually the memory consumed by TBB is constant. However, at some points the memory allocated using TBB could be extremely high. We want TBB to reserve a medium amount of virtual address space instead of the maximum. So it would be great if TBB had some APIs for us to release some virtual address space during off-peak periods.

Thanks,
Wallace
Vladimir_P_1234567890
Alexandr_K_Intel1

Wallace,

Did you try the TBB allocator from TBB 4.0 (preferably 4.0 update 2)? For 4.0 we made significant changes in controlling virtual address space, and there is hope that the changes can help in your case.

jimdempseyatthecove
Honored Contributor III
Scalable allocators tend not to be friendly towards returning memory once it is allocated to an allocator pool. You may have some success with returning the entire pool or none. By this I mean you might have some success by adding (using) a feature that lets you scope-instantiate a new scalable allocator pool. When you exit that scope, that pool evaporates, and the prior scalable allocator pool is reactivated. This would have the requirement that objects allocated in the nested layer do not persist as you pop out of that scope. This technique might be a can-o'-worms if you are not careful.

The better strategy might be to not overload the default new/delete. Instead, overload specific objects' new/delete with ones using the scalable allocator; i.e. only those objects with high flux are subject to scalable allocation.

An intermediary technique would be to overload a specific object's new/delete with one NOT using the scalable allocator, but rather a concurrent_queue. On 'new', pop an item from the queue; if the queue is empty, then malloc. On 'delete', push the object pointer onto the queue. When tight on memory, you can pull items from the queue and free them... however... depending on your program flow you might not get the peak level back again.
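A minimal sketch of this intermediary technique, assuming a hypothetical high-flux class and using a std::mutex-guarded vector as a portable stand-in for tbb::concurrent_queue:

```cpp
#include <cstddef>
#include <cstdlib>
#include <mutex>
#include <new>
#include <vector>

// Hypothetical high-flux class: 'delete' pushes the block onto a free
// list, 'new' pops one if available. drain() is the "tight on memory"
// hook that actually returns blocks to the heap.
class Particle {
public:
    static void* operator new(std::size_t size) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (!free_list_.empty()) {
            void* p = free_list_.back();   // recycle a freed block
            free_list_.pop_back();
            return p;
        }
        if (void* p = std::malloc(size)) return p;
        throw std::bad_alloc();
    }
    static void operator delete(void* p) noexcept {
        std::lock_guard<std::mutex> lock(mutex_);
        free_list_.push_back(p);           // keep the block for reuse
    }
    // Return cached blocks to the heap; returns how many were freed.
    static std::size_t drain() {
        std::lock_guard<std::mutex> lock(mutex_);
        std::size_t n = free_list_.size();
        for (void* p : free_list_) std::free(p);
        free_list_.clear();
        return n;
    }
    double x = 0, y = 0, z = 0;
private:
    static std::mutex mutex_;
    static std::vector<void*> free_list_;
};

std::mutex Particle::mutex_;
std::vector<void*> Particle::free_list_;
```

As the post warns, drain() only releases what happens to be on the free list at that moment; blocks still in use cannot be reclaimed, so the peak may not come back.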

If this becomes a problem (out of memory), I suggest you re-think your application such that it is impossible to make more allocations than will fit.

Example (made up example with similar issues):

Problem: Make a parallel anti-virus scan

Pseudo code:

program
for each file
enqueue fileTask
end program

The problem with the above is that you could end up with 500,000 fileTasks, plus the buffering requirements, plus the tasks spawned by fileTask.

A better route would be

program
useParallelPipeline(withSomeTokens);
end program

Where the parallel_pipeline pulls in the next file only when a token is available. This puts an upper limit on concurrent file processing at 'withSomeTokens' number of tokens (say 10) instead of an unbounded number of files (say 500,000).
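The token idea can be sketched portably with plain threads. This is not TBB's parallel_pipeline API, just an illustration of how a fixed token count bounds in-flight work; all names are hypothetical:

```cpp
#include <atomic>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// A simple token counter: acquire() blocks until a token is free.
class TokenLimiter {
public:
    explicit TokenLimiter(std::size_t max_tokens) : tokens_(max_tokens) {}
    void acquire() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return tokens_ > 0; });
        --tokens_;
    }
    void release() {
        { std::lock_guard<std::mutex> lock(mutex_); ++tokens_; }
        cv_.notify_one();
    }
private:
    std::size_t tokens_;
    std::mutex mutex_;
    std::condition_variable cv_;
};

// Process 'total' items with bounded concurrency; returns the peak
// number of items that were in flight simultaneously.
std::size_t run_bounded(std::size_t total, std::size_t max_tokens) {
    TokenLimiter limiter(max_tokens);
    std::atomic<std::size_t> in_flight{0}, peak{0};
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < total; ++i) {
        limiter.acquire();                  // wait for a free token
        workers.emplace_back([&] {
            std::size_t now = ++in_flight;
            std::size_t prev = peak.load();
            while (now > prev && !peak.compare_exchange_weak(prev, now)) {}
            // ... per-item work (e.g. scan one file) would go here ...
            --in_flight;
            limiter.release();              // hand the token back
        });
    }
    for (auto& t : workers) t.join();
    return peak.load();
}
```

However many items arrive, at most max_tokens of them (and their buffers) exist concurrently, which is exactly how the token count caps peak allocations.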

Your problem is not necessarily dealing with files, but it may have a large number of things to process, which in your current design apparently experiences a congestion point where an excessive amount of allocations is required. The point of the programming change is to restrict the peak allocations. Note, your program may run faster on a restricted number of buffers. The thread performing the (excessive) allocations will be available for processing current allocations while waiting for the next input token.

Jim Dempsey




SergeyKostrov
Valued Contributor II

Wallace,

Did you try the TBB allocator from TBB 4.0 (preferably 4.0 update 2)? For 4.0 we made significant changes in controlling virtual address space, and there is hope that the changes can help in your case.


The test case created by Wallace in 2009 uses TBB version 2.1 (some TBB headers are included in the VS project):

tbb_stddef.h

...
#define TBB_VERSION_MAJOR 2
#define TBB_VERSION_MINOR 1
...

It really makes sense to try the latest version, 4.0, of TBB.

wzpstbb
Beginner
Thank you for all the answers.

One reason we decided to override the global new/delete is that we want to handle all the low-memory/out-of-memory situations ourselves. In addition, dropping the overridden global new/delete operators would have a big impact on our clients. We will update to TBB 4.0 in our next release.
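A global override of the kind described here might be sketched as follows; malloc stands in for scalable_malloc, and the low-memory counter is a purely illustrative stand-in for a real recovery hook:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdlib>
#include <new>

// All new/delete traffic funnels through one place (scalable_malloc in
// the real system, malloc here), so low-memory situations can be
// handled centrally instead of at every call site.
std::atomic<int> g_low_memory_events{0};  // hypothetical diagnostic counter

void* operator new(std::size_t size) {
    if (void* p = std::malloc(size ? size : 1))
        return p;
    ++g_low_memory_events;    // central hook: a real system would release
                              // caches or drain pools here, then retry
    throw std::bad_alloc();
}

void operator delete(void* p) noexcept {
    std::free(p);
}

void operator delete(void* p, std::size_t) noexcept {
    std::free(p);
}
```

The appeal of this design, as the post notes, is that every out-of-memory path in the whole program hits the same hook; the cost is that every allocation in the process, including third-party code, now depends on that one allocator.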

Wallace