Intel® oneAPI Threading Building Blocks

Possible problem in scalable allocator

Dmitry_Vyukov
Valued Contributor I

I only briefly skimmed through the code in tbbmalloc/MemoryAllocator.cpp and was not trying to model the situation, but it looks like something that can cause unbounded memory consumption in scalable_malloc.

It's about the returnEmptyBlock() function. It's called from scalable_free() when we are freeing our own block, and it's called from mallocThreadShutdownNotification(). But it is NOT called on the 'remote free' path (neither from freePublicObject() nor from getPublicFreeListBlock()).
Assume the following situation. Thread 1 allocates a big amount of memory (a large number of small objects), then thread 2 frees them. No free blocks will be returned to the global pool; all memory will remain cached in thread 1. Then two scenarios are possible.
(1) Thread 1 doesn't make any further calls to scalable malloc (assume it is the 'main' thread, which did 'initialization' and then blocks waiting for program completion). The memory remains cached and unused.
(2) Thread 1 makes some further calls to scalable malloc. The memory can be reused. But if thread 1 again doesn't call scalable_free() itself, the free blocks will not be returned to the global pool.
And unbounded memory consumption can occur if such a pattern repeats recurrently. Assume thread 1 allocates memory, and thread 2 frees it. Then thread 2 allocates memory, and thread 3 frees it. Then thread 3 allocates memory, and thread 4 frees it. And so on.
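
To make the pattern concrete, here is a minimal sketch of scenario (1), using the public scalable_malloc/scalable_free entry points from tbb/scalable_allocator.h (the hand-off between threads is simplified to a global array; this illustrates the usage pattern only, not tbbmalloc internals):

[cpp]#include <pthread.h>
#include "tbb/scalable_allocator.h"

#define N_OBJECTS (1024 * 1024)

static void* objects[N_OBJECTS];

/* 'Thread 2': free all the objects remotely. */
static void* remote_freer(void* arg) {
    for (int i = 0; i < N_OBJECTS; ++i)
        scalable_free(objects[i]);
    return NULL;
}

int main(void) {
    /* 'Thread 1' is main: allocate a large number of small objects... */
    for (int i = 0; i < N_OBJECTS; ++i)
        objects[i] = scalable_malloc(64);

    /* ...and let another thread free every one of them. */
    pthread_t t;
    pthread_create(&t, NULL, remote_freer, NULL);
    pthread_join(t, NULL);

    /* Main makes no further allocator calls, so the emptied blocks
       stay cached in its per-thread heap instead of returning to the
       global pool. */
    return 0;
}[/cpp]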

The simple solution is to provide a function scalable_compact_thread_heap(), which the user calls episodically or periodically (if he notices problems). The hard solution is to make reclamation of excessive unused memory automatic.

Am I missing something? Or is the problem not worth solving? What do you think?

RafSchietekat
Valued Contributor III

"I only briefly skimmed through the code in tbbmalloc/MemoryAllocator.cpp, I was not trying to model the situation, but it looks like thing which can cause unbounded memory consumption in scalable_malloc." See getPublicFreeListBlock() and privatizePublicFreeList(): scalable_malloc() will not get new memory while publically freed objects exist.

Dmitry_Vyukov
Valued Contributor I
Quoting - Raf Schietekat

"I only briefly skimmed through the code in tbbmalloc/MemoryAllocator.cpp, I was not trying to model the situation, but it looks like thing which can cause unbounded memory consumption in scalable_malloc." See getPublicFreeListBlock() and privatizePublicFreeList(): scalable_malloc() will not get new memory while publically freed objects exist.

I understand that the thread will not allocate new memory while there are free items in its blocks.

I am talking about a different thing. Two possible scenarios:

(1) the thread itself frees the last item in its own block

(2) another thread frees the last item in the thread's block

Behavior differs between these scenarios. In (1) the free blocks go to the global pool; in (2) the free blocks stay in the thread's cache. Imho, (2) can sometimes lead to problems.

RafSchietekat
Valued Contributor III

"I am talking about different thing." Sorry for assuming it might be so simple. I did more than skim myself, but I haven't examined everything either. Even though the required scenario seems unlikely, my friend Murphy always tells me that that should not lull one into a false sense of security. However, each thread is also occupying a stack, so the thread's maximum allocations would have to be significant relative to the stack size, otherwise we're talking about a situation where the program would otherwise just skip through. But I think I should have passed on this one.

Dmitry_Vyukov
Valued Contributor I
Quoting - Raf Schietekat

"I am talking about different thing." Sorry for assuming it might be so simple. I did more than skim myself, but I haven't examined everything either. Even though the required scenario seems unlikely, my friend Murphy always tells me that that should not lull one into a false sense of security. However, each thread is also occupying a stack, so the thread's maximum allocations would have to be significant relative to the stack size, otherwise we're talking about a situation where the program would otherwise just skip through. But I think I should have passed on this one.

Yes, this is definitely a rare scenario. But on the other hand we are talking about a general-purpose memory allocator...

For example, maybe what I've described causes problems here:

http://software.intel.com/en-us/forums/showthread.php?t=61624

Who knows?

And yes, I am talking about large amounts of cached memory. Two free blocks stay in the thread cache anyway.

A thread stack can be around 1 MB (Mr. Joe The Programmer can actually create 10,000 threads with 64 KB stacks). And at the same time a thread can allocate hundreds of megabytes.

Alexey-Kukanov
Employee
Yes, in order for a block emptied by a "foreign" (aka remote) thread to be moved to the global pool, it should first pass through reclamation of objects by the owning thread. This is by design, to keep all "foreign" activity on the block clearly separate from the local hot path. Returning an empty block remotely would require changing the list of blocks in the bin. The list is doubly-linked, and blocks can be removed from the middle (see outOfTLSBin), so I doubt any non-blocking scheme would succeed without a significant redesign with unknown performance impact. The practical solution is therefore to lock two subsequent blocks in the list on removal, and also to lock a block on traversal or any other operation that dereferences pointers to its neighbours. Also, letting a remote thread detect that a block is empty would require atomic operations on the counter of allocated objects; yet another penalty.
So while you are correct that there are corner-case scenarios leading to memory blowup, I do not see how they could be resolved without a performance penalty on the hot local path, and I am not convinced that these rare corner cases deserve the performance cost.
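
For illustration only, a rough sketch of the locking this would imply (hypothetical names, not the actual tbbmalloc code; mutexes are assumed initialized, and a sentinel head node is assumed so every block has a predecessor):

[cpp]#include <pthread.h>

struct block {
    struct block*   prev;
    struct block*   next;
    pthread_mutex_t lock;
};

/* Remove a block from the doubly-linked bin list while holding locks
   on both the block and its predecessor, predecessor first. */
void unlink_block(struct block* b) {
    for (;;) {
        struct block* p = b->prev;
        pthread_mutex_lock(&p->lock);
        if (b->prev != p) {
            /* a concurrent removal changed our neighbour; retry */
            pthread_mutex_unlock(&p->lock);
            continue;
        }
        pthread_mutex_lock(&b->lock);
        p->next = b->next;
        if (b->next)
            b->next->prev = p;
        pthread_mutex_unlock(&b->lock);
        pthread_mutex_unlock(&p->lock);
        return;
    }
}[/cpp]

Even this sketch shows the cost: two lock acquisitions plus a retry loop on what is currently a lock-free local path.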
Dmitry_Vyukov
Valued Contributor I
So while you are correct that there are corner-case scenarios leading to memory blowup, I do not see how they could be resolved without a performance penalty on the hot local path, and I am not convinced that these rare corner cases deserve the performance cost.

And what do you think about providing a scalable_compact_thread_heap() which can be called by the user manually? Or about periodic background compacting?

Alexey-Kukanov
Employee
Quoting - Dmitriy V'jukov
And what do you think about providing a scalable_compact_thread_heap() which can be called by the user manually? Or about periodic background compacting?

I think the function would be mostly useless; I bet no user would be keen on calling this function at unknown times. Second, it could only collect unused blocks in the calling thread, because otherwise it would have to impose all the same synchronization needs I described above - so users would need to call this in every active thread.

As for periodic background compacting - it sounds like a poor man's garbage collector, in the sense that it has none of the benefits of real garbage collection but unfortunately has some of its shortcomings, such as rather significant overhead at random times. Though, if the points to call this function internally are well defined and already have rather big overhead, it might be worth looking into.

robert_jay_gould
Beginner

I think the function would be mostly useless; I bet no user would be keen on calling this function at unknown times. Second, it could only collect unused blocks in the calling thread, because otherwise it would have to impose all the same synchronization needs I described above - so users would need to call this in every active thread.

As for periodic background compacting - it sounds like a poor man's garbage collector, in the sense that it has none of the benefits of real garbage collection but unfortunately has some of its shortcomings, such as rather significant overhead at random times. Though, if the points to call this function internally are well defined and already have rather big overhead, it might be worth looking into.

I agree that periodic background compacting might not be such a great idea, but if it could be done well I'd probably use it.

However I think that the most important point is that, as Dmitriy points out, the border-case problem does exist, and that's OK in my book, but it MUST definitely be documented in the TBB guide and perhaps the tutorial docs (can't remember if the tutorial doc mentions allocators, but if it does then a mention there is necessary). Although the case is rare, the poor programmer that runs into this problem will be at a total loss, since not many people would suspect the general allocator is causing such an issue, thus possibly costing the poor soul days of painstaking debugging (if he's lucky enough to notice the error during development).

Also, considering the border case as it stands, I can imagine a scenario where someone uses some threads to create objects (say using factories) and then hands them over to be administered by other threads (a good design would require objects to be returned to the factory's sink, but too many times I've seen these types of objects simply get deleted). So it's not as unlikely a scenario in real life, since this usage is directly related to a well-used design pattern, and without any notice in the docs there is no way to forecast the issue or take it into consideration when designing the system. It will likely show up when fixing the issue costs a lot (lots of refactoring to be done, and possible damages and downtimes).

RafSchietekat
Valued Contributor III

What would be the cost of having all local memory reachable from other threads, using a spin mutex on all local memory, and writing an atomic time stamp each time control passes through local-memory administration? That way the local thread could do reclamation after timeout1 and/or N other operations, and other threads could intervene after timeout2 before getting another 1-MB chunk, i.e., the mutex would almost never be locked by another thread.
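
A rough sketch of that idea, assuming C11 atomics (all names hypothetical):

[cpp]#include <stdatomic.h>
#include <time.h>

struct thread_heap {
    atomic_long last_activity;  /* written on every local operation */
    atomic_flag busy;           /* spin mutex, ATOMIC_FLAG_INIT */
};

/* Called on every pass through local-memory administration. */
void note_activity(struct thread_heap* h) {
    atomic_store(&h->last_activity, (long)time(NULL));
}

/* Called by another thread before it asks for a new 1-MB chunk:
   reclaim from an apparently idle heap after timeout2 seconds. */
void try_remote_reclaim(struct thread_heap* h, long timeout2) {
    if ((long)time(NULL) - atomic_load(&h->last_activity) < timeout2)
        return;                      /* owner looks active: back off */
    if (atomic_flag_test_and_set(&h->busy))
        return;                      /* someone else got there first */
    /* ... move empty blocks back to the global pool ... */
    atomic_flag_clear(&h->busy);
}[/cpp]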

Alexey-Kukanov
Employee
As for periodic background compacting - it sounds like a poor man's garbage collector, in the sense that it has none of the benefits of real garbage collection but unfortunately has some of its shortcomings, such as rather significant overhead at random times. Though, if the points to call this function internally are well defined and already have rather big overhead, it might be worth looking into.

I thought more about it... Some relief could be provided by reducing the (already low) probability of memory blowup, if the search for objects is extended with the following two rules:

  1. For a given request size, if the search through existing local blocks was unsuccessful, switch to the next bigger allocation size and try to reclaim some remotely freed memory there (before going to the global list of empty blocks).
  2. If there are no more empty blocks in the global storage, perform full reclamation for every allocation size in the local heap (before requesting more memory from the OS via a system call).

Does that make sense to you? The intuition behind the proposed actions is that their overhead should be roughly equal to what the next step (in parentheses) would introduce. Note that the actions are still local to the thread heap, and the protocol of remote deallocation does not change.
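
In pseudo-code, the extended search might look roughly like this (a sketch; all helper names are hypothetical, not the actual tbbmalloc internals):

[cpp]#include <stddef.h>

#define NUM_BINS 32

struct thread_heap;
int   size_to_bin(size_t size);
void* take_from_local_blocks(struct thread_heap* h, int bin);
int   reclaim_remote_frees(struct thread_heap* h, int bin);
void* take_from_global_pool(struct thread_heap* h, int bin);
void* allocate_from_os(size_t size);

void* allocate(struct thread_heap* heap, size_t size) {
    int bin = size_to_bin(size);
    void* obj;

    if ((obj = take_from_local_blocks(heap, bin)))
        return obj;
    /* Rule 1: before the global list of empty blocks, try to reclaim
       remotely freed memory in the next bigger size. */
    if (bin + 1 < NUM_BINS && reclaim_remote_frees(heap, bin + 1)
            && (obj = take_from_local_blocks(heap, bin + 1)))
        return obj;
    if ((obj = take_from_global_pool(heap, bin)))
        return obj;
    /* Rule 2: before requesting more memory from the OS, perform full
       reclamation for every allocation size in the local heap. */
    for (int b = 0; b < NUM_BINS; ++b)
        reclaim_remote_frees(heap, b);
    if ((obj = take_from_local_blocks(heap, bin)))
        return obj;
    return allocate_from_os(size);
}[/cpp]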

sadbhaw
Beginner
Quoting - Dmitriy V'jukov

I only briefly skimmed through the code in tbbmalloc/MemoryAllocator.cpp and was not trying to model the situation, but it looks like something that can cause unbounded memory consumption in scalable_malloc.

It's about the returnEmptyBlock() function. It's called from scalable_free() when we are freeing our own block, and it's called from mallocThreadShutdownNotification(). But it is NOT called on the 'remote free' path (neither from freePublicObject() nor from getPublicFreeListBlock()).
Assume the following situation. Thread 1 allocates a big amount of memory (a large number of small objects), then thread 2 frees them. No free blocks will be returned to the global pool; all memory will remain cached in thread 1. Then two scenarios are possible.
(1) Thread 1 doesn't make any further calls to scalable malloc (assume it is the 'main' thread, which did 'initialization' and then blocks waiting for program completion). The memory remains cached and unused.
(2) Thread 1 makes some further calls to scalable malloc. The memory can be reused. But if thread 1 again doesn't call scalable_free() itself, the free blocks will not be returned to the global pool.
And unbounded memory consumption can occur if such a pattern repeats recurrently. Assume thread 1 allocates memory, and thread 2 frees it. Then thread 2 allocates memory, and thread 3 frees it. Then thread 3 allocates memory, and thread 4 frees it. And so on.

The simple solution is to provide a function scalable_compact_thread_heap(), which the user calls episodically or periodically (if he notices problems). The hard solution is to make reclamation of excessive unused memory automatic.

Am I missing something? Or is the problem not worth solving? What do you think?

Hi,

Just wanted to confirm your observation....

We just started testing TBB for our application.

We are experiencing the unbounded memory growth as explained. The application has a thread pool (around 100+ threads); it allocates memory in many threads, and all the memory pointers are stored in one place. At the end of a transaction it goes and deletes all the memory from the stored location.

We have traffic of 4-5 transactions/sec.

When we enabled TBB, it started at 4 GB, then grew to 30 GB and crashed. It took almost 250K transactions for it to crash.

Without TBB, another standard test took 15 GB and is running fine.

Thank you,

Sadbhaw

robert_jay_gould
Beginner

I thought more about it... Some relief could be provided by reducing the (already low) probability of memory blowup, if the search for objects is extended with the following two rules:

  1. For a given request size, if the search through existing local blocks was unsuccessful, switch to the next bigger allocation size and try to reclaim some remotely freed memory there (before going to the global list of empty blocks).
  2. If there are no more empty blocks in the global storage, perform full reclamation for every allocation size in the local heap (before requesting more memory from the OS via a system call).

Does that make sense to you? The intuition behind the proposed actions is that their overhead should be roughly equal to what the next step (in parentheses) would introduce. Note that the actions are still local to the thread heap, and the protocol of remote deallocation does not change.

OK, my previous understanding might have been mistaken, but if I've got it right now, step 1 is a sure winner; step 2 is something that would need to be measured.

I'm not sure if it's faster or not than calling malloc (not necessarily the OS), and even the OS allocation depends on the platform and on what malloc does. For example, I've set the Hoard allocator as the standard allocator and had TBB call it, while Hoard then calls the OS when it deems necessary; with the double indirection I had no performance gains and took it out. But my point is you cannot safely know what crazy things a simple call to malloc will do.

Another way to reclaim memory in step 2 is how the Lua VM does GC, for example: you make a few steps and bail out when you're OK, instead of collecting everything. And the next time you need memory you continue from where you left off.
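
A sketch of that incremental style applied here (hypothetical helper names; the resume point would live in thread-local state):

[cpp]#define NUM_BINS 32

struct thread_heap;
/* returns the number of blocks freed in this bin (hypothetical) */
int reclaim_remote_frees(struct thread_heap* h, int bin);

/* Reclaim incrementally: scan at most 'budget' bins per call, bail out
   as soon as something was freed, and resume from there next time. */
int incremental_reclaim(struct thread_heap* heap, int* next_bin, int budget) {
    int freed = 0;
    while (budget-- > 0 && freed == 0) {
        freed += reclaim_remote_frees(heap, *next_bin);
        *next_bin = (*next_bin + 1) % NUM_BINS;
    }
    return freed;
}[/cpp]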

Alexey-Kukanov
Employee
Robert, thanks for the feedback.
Another way to reclaim memory in step 2 is how the Lua VM does GC, for example: you make a few steps and bail out when you're OK, instead of collecting everything. And the next time you need memory you continue from where you left off.

This is what the allocator does now, with the only difference being that the search is limited to just one bin. Step 1 will extend it to a neighbor bin. As for step 2, well, it is unclear without measurements whether full reclamation through all bins will have significant impact or not. If it does, then I agree the search can be performed until a single empty block is found.

Dmitry_Vyukov
Valued Contributor I

I agree that periodic background compacting might not be such a great idea, but if it could be done well I'd probably use it.

However I think that the most important point is that, as Dmitriy points out, the border-case problem does exist, and that's OK in my book, but it MUST definitely be documented in the TBB guide and perhaps the tutorial docs (can't remember if the tutorial doc mentions allocators, but if it does then a mention there is necessary). Although the case is rare, the poor programmer that runs into this problem will be at a total loss, since not many people would suspect the general allocator is causing such an issue, thus possibly costing the poor soul days of painstaking debugging (if he's lucky enough to notice the error during development).

Also, considering the border case as it stands, I can imagine a scenario where someone uses some threads to create objects (say using factories) and then hands them over to be administered by other threads (a good design would require objects to be returned to the factory's sink, but too many times I've seen these types of objects simply get deleted). So it's not as unlikely a scenario in real life, since this usage is directly related to a well-used design pattern, and without any notice in the docs there is no way to forecast the issue or take it into consideration when designing the system. It will likely show up when fixing the issue costs a lot (lots of refactoring to be done, and possible damages and downtimes).

The jury sitting in my brain is still out on whether the issue requires fixing or not. But these are good points, Robert.

If the factory thread constantly allocates objects, then all memory will be reclaimed and reused; there will be no problems. Before getting a block from the global pool, a thread checks whether there are outstanding remote deallocations.

Problems arise when a thread allocates a huge amount of memory, other threads free that memory, AND the first thread doesn't make any further allocations of the same size. So it must be some unusual situation that triggers unbounded memory consumption.

Dmitry_Vyukov
Valued Contributor I

I think the function would be mostly useless; I bet no user would be keen on calling this function at unknown times. Second, it could only collect unused blocks in the calling thread, because otherwise it would have to impose all the same synchronization needs I described above - so users would need to call this in every active thread.

Yes, it would have to be called from every thread (only by applications that experience problems). Server software (the most likely suspect for the problem because of its long-running nature, 24x7) has natural points to call such a function: after every N processed requests. The function can also be called before blocking for a long time.

Well, it is not a very beautiful solution, but it may not be all that bad from a technical point of view.
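
For example (a sketch; scalable_compact_thread_heap() is the proposed function, not an existing TBB API):

[cpp]void process_one_request(void);          /* application code */
void scalable_compact_thread_heap(void); /* proposed, does not exist yet */

#define COMPACT_EVERY 1000

void worker_loop(void) {
    unsigned n = 0;
    for (;;) {
        process_one_request();
        if (++n % COMPACT_EVERY == 0)
            scalable_compact_thread_heap(); /* return empty blocks */
    }
}[/cpp]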

Dmitry_Vyukov
Valued Contributor I
Quoting - Dmitriy V'jukov

Yes, it would have to be called from every thread (only by applications that experience problems). Server software (the most likely suspect for the problem because of its long-running nature, 24x7) has natural points to call such a function: after every N processed requests. The function can also be called before blocking for a long time.

Well, it is not a very beautiful solution, but it may not be all that bad from a technical point of view.

Another possible solution is to provide a special macro (something like TBB_MALLOC_LOW_FRAGMENTATION_MODE) which would switch the allocator into a low-memory-consumption mode, i.e. a mode more careful about memory consumption. The allocator would stay distributed, and the local allocation and local free paths would stay relatively lightweight, but threads would try harder to get rid of unused memory (possibly imposing overheads even on the fast path... but, well, being blazingly fast is not the only ultimate goal of a general-purpose memory allocator, right? :) ).

The good news is that if the user is satisfied with the current behaviour and doesn't define TBB_MALLOC_LOW_FRAGMENTATION_MODE, then he won't get any additional overheads.

I think the authors of the following posts would be totally satisfied with such a solution:

http://software.intel.com/en-us/forums/showthread.php?t=61624

http://software.intel.com/en-us/forums/showpost.php?p=69500

Dmitry_Vyukov
Valued Contributor I

Does that make sense to you? The intuition behind the proposed actions is that their overhead should be roughly equal to what would the next step (in parentheses) introduce. Note that the actions are still local to the thread heap, and the protocol of remote deallocation does not change.

If the problem requires fixing, then I think it's a step in the right direction. Though I think it would be helpful to first better identify the real application patterns that can cause the bad behaviour. Because, for example, if the thread that allocated a huge amount of memory gets blocked (say, waiting for console input, or taken out of circulation in the thread pool), then such a solution won't help...

Dmitry_Vyukov
Valued Contributor I
Quoting - sadbhaw

Hi,

Just wanted to confirm your observation....

We just started testing TBB for our application.

We are experiencing the unbounded memory growth as explained. The application has a thread pool (around 100+ threads); it allocates memory in many threads, and all the memory pointers are stored in one place. At the end of a transaction it goes and deletes all the memory from the stored location.

We have traffic of 4-5 transactions/sec.

When we enabled TBB, it started at 4 GB, then grew to 30 GB and crashed. It took almost 250K transactions for it to crash.

Without TBB, another standard test took 15 GB and is running fine.

Thank you,

Sadbhaw

How much memory does a thread allocate during transaction processing?

In what order are threads extracted from the thread pool? Is it possible that some threads get "starved" in the pool, i.e. remain inactive for too long?

Dmitry_Vyukov
Valued Contributor I

As for periodic background compacting - it sounds like a poor man's garbage collector, in the sense that it has no benefits of real garbage collection but unfortunately has some shortcomings of it, such as rather significant overhead at random time. Though might be if the points to call this function internally are well defined and do already have rather big overhead, it might be worth looking into.

A poor man's GC may not be as bad for a native, non-managed environment as one might think. The point is to get the best from reactive manual memory management and the best from GC. Reactive manual memory management provides promptness and the usual behaviour for the native world, and GC provides more guarantees for various corner cases. I believe such a GC can run really infrequently (for example, once every 1-10 seconds) and impose really low overheads on fast paths. It would just be watching that everything is going smoothly.
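
Something like the following, perhaps (a sketch only; compact_all_thread_heaps() is hypothetical and would need access to a registry of thread heaps, with all the synchronization discussed above):

[cpp]#include <pthread.h>
#include <unistd.h>

void compact_all_thread_heaps(void); /* hypothetical */

/* The 'poor man's GC': a background thread that wakes up every few
   seconds and returns unused cached blocks; fast paths stay untouched. */
void* gc_thread(void* arg) {
    for (;;) {
        sleep(5);                    /* run infrequently, ~1-10 s */
        compact_all_thread_heaps();
    }
    return NULL;
}[/cpp]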

The negative side is that even a poor man's GC requires a substantial amount of coding and redesign...

Chris_M__Thomasson
New Contributor I
Quoting - Dmitriy V'jukov

A poor man's GC may not be as bad for a native, non-managed environment as one might think. The point is to get the best from reactive manual memory management and the best from GC. Reactive manual memory management provides promptness and the usual behaviour for the native world, and GC provides more guarantees for various corner cases. I believe such a GC can run really infrequently (for example, once every 1-10 seconds) and impose really low overheads on fast paths. It would just be watching that everything is going smoothly.

The negative side is that even a poor man's GC requires a substantial amount of coding and redesign...

Let me see here. You describe the problem as follows:
Thread A allocates a shi%load of objects (many MB worth).
Thread B allocates a shi%load of objects (many MB worth).
Thread C allocates a shi%load of objects (many MB worth).
Thread D allocates a shi%load of objects (many MB worth).
Thread E allocates a shi%load of objects (many MB worth).
Threads A-E do not allocate any more...
Thread F frees all of those objects onto the remote list.
BAM!!!!
All objects leaked for the duration of the program! OH SHI%!
;^(...
Is that right? Well, AFAICT, the vZOOM allocator "attempts" to solve this potential problem by putting an upper bound on leaked memory. It can do this with a single-word CAS in the remote-free portion, but let me show you how it does it with DWCAS:
[cpp]/* very quick pseudo-code w/ "some/a lot" of detail omitted for brevity */

#define REMOTE_FREE_THRESHOLD 1024 /* whatever */

struct block {
  struct block* next;
};

struct remote_free_anchor {
  struct block* head;
  int32_t count;
};

struct thread; /* per-thread allocator state, details omitted */

struct slab {
  struct thread* owner;
  struct remote_free_anchor flist;
  /* [...] */
};

void free(void* mem) {
  struct block* block = mem;
  struct thread* thread = pthread_getspecific(...);
  struct slab* slab = get_slab_from_block(block);
  if (slab->owner == thread) {
    /* local free: [...] */
  } else {
    /* forget remote cache for now... */
    struct remote_free_anchor xchg, cmp = slab->flist;
    do {
      /* link our block in front of the current remote list */
      block->next = cmp.head;
      if (cmp.count > REMOTE_FREE_THRESHOLD) {
        /* over the threshold: detach the whole list */
        xchg.head = NULL;
        xchg.count = 0;
      } else {
        xchg.head = block;
        xchg.count = cmp.count + 1;
      }
    } while (! DWCAS(&slab->flist, &cmp, &xchg));
    if (cmp.count > REMOTE_FREE_THRESHOLD) {
      /* we detached the list; our block points at its old head, so
         walking from 'block' returns everything to the global heap */
      while (block) {
        struct block* next = block->next;
        coalesce_into_global_heap(block);
        block = next;
      }
    }
  }
}[/cpp]

AFAICT, there can be an upper bound on leaked objects per size bin... Also, vZOOM has per-bin thresholds... Any thoughts?
