I only briefly skimmed through the code in tbbmalloc/MemoryAllocator.cpp, I was not trying to model the situation, but it looks like thing which can cause unbounded memory consumption in scalable_malloc.
It's about returnEmptyBlock() function. It's called from scalable_free() when we are freeing own block, and it's called from mallocThreadShutdownNotification(). But it is NOT called on 'remote free' path (neither from freePublicObject() nor getPublicFreeListBlock()).
Assume following situation. Thread 1 allocated big amount of memory (large number of small objects), then thread 2 frees them. No free blocks will be returned to global pool, all memory will remain cached in thread 1. Then 2 scenarios are possible.
(1) Thread 1 doesn't make any further calls to scalable malloc (assume it is 'main' thread which has made 'initialization' and then blocks waiting for program completion). Memory still remains cached and unused.
(2) Thread 1 makes some further calls to scalable malloc. Memory can be reused. But if thread 1 again doesn't call scalable_free() itself, free blocks will not be returned to global pool.
And unbounded memory consumption can occur if such pattern repeats recurrently. Assume thread 1 allocates memory, and thread 2 frees it. Then thread 2 allocates memory, and thread 3 frees it. Then thread 3 allocates memory, and thread 4 frees it. And so on.
The simple solution is to provide function scalable_compact_thread_heap(), which must be called by user episodically or periodically (if he notices some problems). The hard solution is to make automatic reclamation of excessive unused memory.
Am I missing something? Or the problem doesn't worth solving? What do you think?
"I only briefly skimmed through the code in tbbmalloc/MemoryAllocator.cpp, I was not trying to model the situation, but it looks like thing which can cause unbounded memory consumption in scalable_malloc." See getPublicFreeListBlock() and privatizePublicFreeList(): scalable_malloc() will not get new memory while publically freed objects exist.
I understand that thread will not allocate new memory while there are free items in it's blocks.
I am talking about different thing. 2 possible scenarios:
(1) the thread itself frees last item in own block
(2) another thread frees last item in the thread's block
Behavior differs between these scenarios. In (1) free blocks will go to global pool, in (2) free blocks will stay in thread's cache. Imho, (2) can lead to problems sometimes.
"I am talking about different thing." Sorry for assuming it might be so simple. I did more than skim myself, but I haven't examined everything either. Even though the required scenario seems unlikely, my friend Murphy always tells me that that should not lull one into a false sense of security. However, each thread is also occupying a stack, so the thread's maximum allocations would have to be significant relative to the stack size, otherwise we're talking about a situation where the program would otherwise just skip through. But I think I should have passed on this one.
Yes, this is definitely rare scenario. But on the other hand we are talking about general-purpose memory allocator...
For example, maybe what I've described causes problems here:
And yes, I am talking about large amounts of memory cached. 2 free blocks anyway stays in thread cache.
Thread stack can be around 1 Mb (Mr. Joe The Programmer can actually create 10,000 threads with 64Kb stacks). And at the same time thread can allocate hundreds of megabytes.
I think the function would be mostly useless; I bet no user would be keenof calling this function at unknown times. Second, it could only collect unused blocks in the calling thread, because otherwise it would have to impose all the same synchronization needs I described above - so users would need to call this in every active thread.
As for periodic background compacting - it sounds like a poor man's garbage collector, in the sense that it has no benefits of real garbage collection but unfortunately has some shortcomings of it, such as rather significant overhead at random time. Though might be if the points to call this function internally are well defined and do already have rather big overhead, it might be worth looking into.
I agree that periodic background compacting might not be such a great idea, but if it could be done well I'd probably use it.
However I think that the most important point is that as Dmitriy points out the border-case problem does exist, and that's ok in my book, but it MUST definitelybe documented in the tbb guide and perhaps the tutorial docs (can't remember if the tutorial doc mentions allocators but if it does then a mention there isnecessary). Although the case is rare, the poor programmer that runs into this problem will be at a total loss, since not many people would feel like the general allocator is causing such an issue, thus possibly causing the poor soul days of painstaking debugging (if he's lucky enough to notice the error during development).
Also considering the border-case as it stands, I could imagine a scenario where someone uses some threads to create objects (say using factories) and then handing them over to be administrated by other threads (a good design would require objects to be returned to the factory's sink, but too many times I've seen these types of objects simply get deleted), so it's not as an unlikely scenario in real life since this usage is directly related to a well used design pattern, and without any notice in the docs there is no way toforecastthe issue or take it into consideration when designing the system, so it'll likely show up when fixing the issue will cost a lot (lots of refactoring to be done, and possible damages and downtimes).
What would be the cost of having all local memory reachable from other threads, using a spin mutex on all local memory, and writing an atomic time stamp each time control passes through local-memory administration? That way the local thread could do reclamation after timeout1 and/or N other operations, and other threads could intervene after timeout2 before getting another 1-MB chunk, i.e., the mutex would almost never be locked by another thread.
I thought more about it... Some relief could be provided byreducing the (already low) probability of memory blowup, if the search for objects is extended with the following two rules:
Does that make sense to you? The intuition behind the proposed actions is that their overhead should be roughly equal to what would the next step (in parentheses) introduce. Note that the actions are still local to the thread heap, and the protocol of remote deallocation does not change.
Just wanted to confirm your observation....
We just started testing TBB for our application.
Wwe are experiencingthe unbound memory growth as explained. The application has thread pool(around 100+ threads), it allocates memory in many threads and all the memory pointers are stored in one place. At the end of transaction it goes and deletes all the memory from the stored location.
We have traffic of 4-5 transactions/sec.
When we enabled TBB, it started with 4G and then grew till 30 G and then crashed. It took almost 250 K transactions for it to crash.
Without TBB,other standard testtook15 G and is running fine.
Ok, my previous understanding might have been mistaken, but if I got it right now step 1 is a sure winner, step 2 is something that would need to measured.
I'm not sure if its faster or not than calling malloc (notnecessarilythe OS), and even the OS allocation depends on platform and what malloc does (for example I've placed hoard allocator to be the standard allocator and have tbb call it, while hoard then call the OS when it deems necesarry, with doubleindirection, I had no performance gains and took it out, but my point is you cannot safely know what crazy things a simple call to malloc will do).
Another way to reclaim memory in step 2 is how the Lua(VM) does GC for example, you make a few steps and bail out when your ok instead of collecting everything. And next time you need memory you continue from where you left off.
This is what the allocator does now, with only difference that the search is limited to just one bin. Step 1 will extend it to a neighbor been. As for step 2, well, it is unclear without measurements whether full reclamation through all bins will have significant impact or not. If it will, then I agree the search can be performed until a single empty block is found.
Jury sitting in my brain is still out on the issue, whether it requires fixing or not. But this are good points, Robert.
If factory-thread constantly allocates objects then all memory will be reclaimed and reused, there will be no problems. Before getting block from global pool, thread checks whether there are outstanding remote deallocations.
Problems will be when thread allocates huge amount of memory, and other threads free that memory AND first thread doesn't make any further allocations of the same size. So this must be some unusual situation to trigger unbounded memory consumption.
Yes, it have to be called from every thread (only by applications that feel some problems). Server software (which is most suspected to the problem because of long-running-ness (24x7)) has natural points to call such function - after every N processed requests. Also the function can be called before long blocking.
Well, it is not very beautiful solution, but it can be not very bad from technical point of view.
Another possible solution is to provide special macros (something like TBB_MALLOC_LOW_FRAGMENTATION_MODE) which will switch allocator to low memory consuption mode... to more careful to memory consumption mode. Allocator will stay distributed, local allocation and local free paths will stay relatively lightweight, but threads will be trying harder to get rid of unused memory (possibly imposing overheads even on fast-path... but, well, being blazingly fast is not the only ultimate goal of general-purpose memory allocator, right? :) ).
The good news are that if user is satisfied with current behaviour and doen't define TBB_MALLOC_LOW_FRAGMENTATION_MODE than he won't get any additional overheads.
I think that authors of following posts will be totally satisfied with such solution:
If the problem requires fixing then I think it's the step in the right direction. Though I think that it will be helpful to better identify real application patterns that can cause bad behaviour first. Because, for example, if thread that allocates huge amount of memory gets blocked (for example for user input in console or taken out of circulation in the thread pool) then such solution won't help...
How many memory thread allocates during transaction processing?
In what order threads are extracted from thread pool? Is it possible that some threads will be "starved" in the thread pool, i.e. will remain inactive for too long in the thread pool?
Poor man GC can be not as bad for native non-managed environment as one might think. The point is to get the best from reactive manual memory management, and the best from GC. Reactive manual memory management will provide prompthness and usual behaviour for native world, and GC will provide more guarantees for various corner cases. I believe that GC can run really infrequently (for example, one time in 1-10 secs) and impose really low overheads for fast-paths. It will be just watching out that everything is going smoothly and aptly.
The negative side is that even poor man's GC requires substantial amount of coding and redesign...