We are using tbbmalloc to manage our system memory. We use scalable_malloc/scalable_free to allocate/deallocate system memory. Everything worked fine until we ran into the following case:
1. Keep allocating 1 MB textures using DX10 until the allocation fails. Note that some of the system memory will be consumed by doing this.
2. Release all the allocated textures in step #1.
3. Create some objects on the heap. scalable_malloc returns null, while I believe there is a lot of system memory available at this point. We tried replacing scalable_malloc with malloc, and then the memory could be allocated.
Does anyone have any idea why?
TBB uses a pool concept. It probably uses aligned_malloc to acquire a multi-megabyte pool when a scalable_malloc would otherwise fail. Once one or more pools are allocated, scalable_malloc draws from a pool until it runs out, and then it allocates another large pool. scalable_free returns memory to a pool, and pools are "never" returned to the C/C++ heap. malloc internally does something similar to expand its heaps in virtual memory. Once either memory allocator (malloc or TBB) has acquired virtual memory, it does not return that memory until exit (under the assumption that your program will allocate/free again and again using the same allocator).
When using a combination of malloc and TBB scalable_malloc, you can run into allocation failures when memory gets fragmented not only within a heap or pool but also among the heaps and pools.
A palliative measure (on Windows) might be to enable the Low Fragmentation Heap (search MSDN for LFH).
Alternatively, before your step 1, add a step 0: use the TBB scalable allocator to allocate a working set of memory, then scalable_free this memory. After that, your step 1 will have a reduced upper bound for allocations, and step 3 will have a working set available.
Most likely, DX10 and the MS CRT share the same memory pool, which the TBB allocator cannot use. Once virtual memory is exhausted, it is all hoarded in that pool, so the TBB allocator does not succeed in its attempts to map some more memory.
To prove or disprove that, a reproducing test case would be helpful.
Thanks for the quick reply.
That sounds like the cause. I have attached a reproducing test case. Please pay attention to Tutorial01.cpp, lines 223~270. I am using VS 2008 and the DirectX SDK (March 2009).
Sadly, I am unable to reproduce your situation; i.e., when I run your test case, scalable_malloc returns a memory block successfully.
Also, I see that on the second and subsequent calls to Render() zero 2D textures can be allocated. Is this expected behavior? Is the memory really released? Could this be connected with the scalable_malloc behavior you observe?
There is a GlobalMemoryStatusEx function to report the available virtual memory size. Could you check the available size before the first scalable_malloc call?
If the reason is a lack of virtual address space, as Alexey supposes above, then a VirtualAlloc call for a 1 MB block should fail in place of the first scalable_malloc call (that is what scalable_malloc does internally on the first call).
That's weird! I can easily reproduce the problem with the test case I attached. I ran the case on Vista 32-bit, 4 GB RAM, and an Nvidia 8800 GTS, which has 640 MB of video memory.
Yes, the textures are allocated and then released immediately. Therefore, it is expected that on the second and subsequent calls to Render() zero 2D textures can be allocated.
I will try the GlobalMemoryStatusEx function to see the available virtual memory size. If the reason is a lack of virtual address space, is there a solution for it?
Yes. I ran into what I think is the same problem. A program that legitimately mallocs or scalable_mallocs and then frees everything it allocated still eventually runs out of virtual address space (not memory).
You will see that the dwAvailVirtual as reported by GlobalMemoryStatusEx does not go back after the free.
It has effectively "stepped on" a huge range of addresses, even though the memory was given back.
I did find a solution, which I think will work for you.
Instead of malloc or scalable_malloc use the following two functions for alloc and free.
They DO give back the virtual address SPACE as well as the memory.
data = (byte *)VirtualAlloc(NULL, BLOCK_SIZEP, MEM_COMMIT, PAGE_READWRITE);
VirtualFree(data, 0, MEM_RELEASE);
This did totally solve this sticky problem for us.
Depending on your speed requirements and how often your program allocates textures, you may or may not be able to use this solution. After re-reading your post, I think there may be a better solution, as follows.
Consider address space fragmentation. There is a big difference between doing 100 allocations for 100 textures and allocating a single array of 100 textures. The latter requires the address space to be contiguous, which may quickly become impossible after many allocs/de-allocs. Changing your code to do the less efficient separate allocation per texture will get around this address space fragmentation problem and could solve your problem without resorting to VirtualAlloc.
Though we were taught that a demand paging system pretty much solves your memory issues, it never anticipated nor dealt with the address SPACE management issues.
Address space fragmentation may in fact be your main problem.
I'm curious whether your code allocates textures in blocks and, if so, whether it's easy to change that and see if the problem goes away completely.
I just took a look at your tutorial01.cpp. You do realize that since your program tries to allocate 65k 1 MB textures, that is 65 gigabytes of virtual memory. The limit on any one process is either 2 GB (XP), 3 GB (XP with the /3GB switch), or 4 GB on a 64-bit OS. Thus you are always using up your whole virtual address space before you do the scalable_malloc.
Yes, you are giving it all back, but scalable_malloc needs to start with some virtual address space of its own.
If you insert a call to GlobalMemoryStatus before the texArray alloc and again right after the texArray delete, you will see that dwAvailVirtual has gone down and NOT been restored.
Since you are doing separate 1M texture allocations until all of memory is used, then your problem is not the address space fragmentation I mentioned above, but the former problem. If you can change the alloc methods CreateTexture2D() and Release() to use the VirtualAlloc/VirtualFree, I'm pretty sure tutorial01.cpp will work.
Thanks for exploring the issue. And sorry for my late reply because I was on a vacation last week.
But I don't want to discard the TBB allocator. As Alexey suggested, VirtualAlloc/VirtualFree is much slower than scalable_malloc and is not recommended for common allocations.
Alexey, can the TBB allocator resolve this issue? That is, is it possible for the TBB allocator to share virtual address space with DX10 and the CRT?
I'd tell you upfront if that was possible; unfortunately it is not - at least not without TBB source changes.
The way for the TBB allocator to use the same pool as malloc would be to call malloc instead of VirtualAlloc; that's a relatively easy change one could make with the TBB sources. But that's only half the work, or even less, because the TBB allocator is also "greedy" and in most cases does not return memory. And finding a good balance between keeping memory blocks to speed up future allocations and returning them back to be more cooperative with other memory managers is a challenge with some ambiguous tradeoffs.
Possibly the best thing I can suggest is to pre-allocate enough memory with the scalable allocator before allocating textures.
I suggest you adapt your 32-bit strategy to:
1. Keep allocating 1 MB textures using DX10 until the allocation fails.
1.a) Determine how to prorate memory into three general sections: 1M textures, TBB pool, malloc pool
1.b) Keep the decided upon number of 1M textures in your own private pool of textures.
2. Release all the non-reserved allocated textures in step #1
3. Remove (if present) the overload of new/delete/malloc/free (via TBB header) and then explicitly use the memory allocator in those sections of code that are suitable for scalable allocation.
4. Use the private pool of previously allocated 1 MB texture buffers as you require 1 MB textures. When you run out, adapt your code to run with this limited number of textures.
My sample app is just for reproducing the problem. Our use case is much more complex. In our system, we overload the global new/delete operators and use scalable_malloc/scalable_free to serve the memory requests from new/delete. It is difficult for us to predict how much memory will be allocated through TBB. Usually the memory consumed by TBB is fairly constant; however, at some points the memory allocated through TBB can be extremely high. We want TBB to reserve a medium amount of virtual address space instead of the maximum, so it would be great if TBB had some APIs for us to release some virtual address space during off-peak periods.
Did you try the TBB allocator from TBB 4.0 (preferably 4.0 update 2)? For 4.0 we made significant changes to how virtual address space is controlled, and there is hope that these changes can help in your case.
The better strategy might be to not overload the default new/delete. Instead, overload specific objects' new/delete with versions using the scalable allocator, i.e., only those objects with high allocation flux are subject to scalable allocation.
An intermediate technique would be to overload a specific object's new/delete with a version that does NOT use the scalable allocator, but instead recycles that object's nodes through a concurrent_queue.
If this becomes a problem (out of memory), I suggest you re-think your application so that it is impossible for it to make more allocations than it can run with.
Example (made up example with similar issues):
Problem: Make a parallel anti-virus scan
for each file
    spawn a fileTask (which buffers and scans that file)
The problem with the above is you could end up with 500,000 fileTask's plus the buffering requirements plus the tasks spawned by fileTask.
A better route would be a parallel_pipeline, where the pipeline pulls in the next file only when a token is available. This puts an upper limit on concurrent file processing of 'withSomeTokens' tokens (say 10) instead of an unbounded number of files (say 500,000).
Your problem is not necessarily dealing with files, but it may have a large number of things to process, which in your current design apparently hits a congestion point where an excessive amount of allocation is required. The point of the programming change is to restrict the peak allocations. Note that your program may run faster with a restricted number of buffers: the thread performing the (excessive) allocations will be available for processing current work while waiting for the next input token.
The test case created by Wallace in 2009 uses TBB version 2.1 (some TBB headers are included in the VS project):
#define TBB_VERSION_MAJOR 2
#define TBB_VERSION_MINOR 1
It really makes sense to try the latest version, TBB 4.0.
One reason we decided to override the global new/delete is that we want to handle all low-memory/out-of-memory situations ourselves. In addition, dropping the overridden global new/delete operators would have a big impact on our clients. We will update to TBB 4.0 in our next release.