My name is Chris M. Thomasson, and I have developed a non-blocking algorithm that transforms an inherently single-threaded memory allocator into a scalable multi-threaded allocator. This invention lets a programmer write an allocator algorithm without worrying about any multi-threading issues. Once you're done with your single-threaded allocator, you apply my non-blocking algorithm, and PRESTO! It's now a scalable multi-threaded allocator! Is that cool or what?
Okay, here is an initial proof of concept: I have transformed the Windows Heap object created with the HEAP_NO_SERIALIZE flag. This flag disables the internal lock used by the heap, which makes it a heck of a lot more efficient. However, since the heap is no longer serialized, it cannot be used as a general allocator in a multi-threaded application; until now... The algorithm I created is fairly simple; basically, it goes like this:
I create a thread-local user allocator in every thread.
An allocation request is forwarded directly to this thread-local user allocator.
If a free request comes from the thread that allocated the block, it is forwarded to this thread-local user allocator.
If a free request comes from another thread, the block is accumulated on a per-thread stack-based freelist.
Blocks on this freelist are actually freed in batches when the owning thread allocates or deallocates another block.
I mentioned this algorithm a while back:
Well, I have released a pre-alpha version which demonstrates how to transform the Windows Heap object, under the GPL license; the example source code can be found here:
Can you please run the test as-is, which uses the stock malloc implementation, and record the output... Then uncomment the following line:
which engages my algorithm; run the test again; and post the difference in timings between the two. I think you will be fairly surprised at the performance improvement. The lock in the Windows Heap object that `malloc' is based on severely damages its scalability in a multi-threaded environment. When `USE_MEM_ALLOC' is defined, ALL of the locking is removed and the heaps are created on a per-thread basis. This distribution, and the clever support algorithm I invented, allow the Windows Heap to scale quite nicely indeed.
I am going to tweak the algorithm so that one could plug in different allocators on a per-thread basis! In other words, thread A can use a different allocation algorithm than thread B, and so on. My algorithm actually supports this. I think the possibilities are truly amazing. It even allows a block to be allocated in thread A and freed in thread B, even though they use completely different allocation algorithms. That's pretty darn cool. Some may think it's impossible; it's not!
BTW, I am thinking that this algorithm might be able to transform other single-threaded things besides allocators... Hmm, I need to think some more on that.
The test portion of the code uses the pthreads-win32 library, which can be downloaded from here:
I am thinking of removing the dependence on it and just using pure Windows primitives.
It looks like a challenge. Okay, let's see how I would implement it in a commercial library, if I had one.
struct block_t {
    struct block_t *next;
    struct per_thread_t *thread;
};

struct per_thread_t {
    struct { struct block_t *front; } local;   /* owner-only freelist        */
    struct { struct block_t *front; } shared;  /* lock-free remote-free list */
    int refs;
};

void* slab_malloc( size_t sz ) {
    per_thread_t *_this = pthread_getspecific(...);
    block_t *block = slab_sys_local_pop( _this );
    /* ... (refill path elided) */
    return block;
}

void slab_free( void *mem ) {
    block_t *block = ((block_t*)mem) - 1;
    per_thread_t *_this = pthread_getspecific(...);
    if ( block->thread == _this )
        slab_sys_local_push( block );
    else
        slab_sys_shared_push( block );
}

block_t* slab_sys_local_pop( per_thread_t *_this ) {
    block_t *block = _this->local.front;
    if ( block ) {
        _this->local.front = block->next;
        return block + 1;
    }
    return slab_sys_shared_pop( _this );
}

void slab_sys_local_push( block_t *_this ) {
    _this->next = _this->thread->local.front;
    _this->thread->local.front = _this;
}

block_t* slab_sys_shared_pop( per_thread_t *_this ) {
    block_t *block = XCHG( &_this->shared.front, 0 );
    if ( block ) {
        _this->local.front = block->next;
        return block + 1;
    }
    /* ... (allocate a fresh superblock) */
}

void slab_sys_shared_push( block_t *_this ) {
    do {
        cmp = _this->thread->shared.front;
        _this->next = cmp & ~1;  /* bit 0 of the head marks a dead owner */
    } while ( ! CAS( &_this->thread->shared.front, cmp,
                     ( cmp & 1 ) ? _this | 1 : _this ) );
    if ( ( cmp & 1 ) && XADD( &_this->thread->refs, -1 ) == 1 )
        /* destroy _this->thread */ ;
}

void slab_sys_tls_dtor( void *state ) {
    per_thread_t *_this = state;
    _this->refs = 1 + BLOCK_COUNT_PER_SUPERBLOCK;
    int count = 1;
    block_t *block = _this->local.front;
    while ( block ) {
        count += 1;
        block = block->next;
    }
    /* mark the shared list dead (set bit 0) so remote frees drop refs */
    block = XCHG( &_this->shared.front, 1 );
    while ( block ) {
        count += 1;
        block = block->next;
    }
    if ( XADD( &_this->refs, -count ) == count )
        /* destroy _this */ ;
}
If the new size is not bigger than the fixed size of the slab, why should the block be reallocated, even if it is "owned" by another thread? It can simply be reused. If the new size is bigger than the slab can serve, it should be reallocated, again independently of whether the block is local or remote.
As for caching blocks in remote threads, in the TBB allocator it is impractical to implement.
Why? Please clarify.
Well, if the allocator is used with tasks, then yes, it is impractical: stealing is rare, so there are few remote deallocations.
But TBB's scalable allocator is a stand-alone thing. It can be used in arbitrary software. And a substantial amount of software relies on the producer-consumer pattern in some form, for which caching can be crucial, because there is a *constant* stream of remote deallocations.
blockSize in the TBB allocator is not the same as sizeof(Block). The former is the total size of any slab in bytes, while the latter is the size of the slab header (yes, Block is in fact the slab header structure). So sizeof(Block) does not formally need to be a power of two. And yes, due to alignment requirements the actual allocation area in the slab might not start immediately after the header; in a sense, this effectively increases the "virtual" size of the header (i.e. memory overhead). But there is no problem with an application's alignment requirements, because aligned allocations should use a separate set of API functions and explicitly specify the required alignment; this allows the allocator to select a slab with the proper alignment, though it may mean even more memory overhead - all for the sake of speed.
One point is still unclear to me. How do you cache nodes in caching mode in the VZOOM memory allocator?
Nodes must be sorted into batches by owner thread id. If the number of threads is constant and known, then it's simple: we create an array of lists and give every thread a unique id 0, 1, 2, ... Then the task is trivial.
But if the number of threads is unknown and variable, then one has to employ something like a hash map or... I don't know... it must be something very cheap...
I've remembered that I was already thinking about it:
I was able to come up with 2 variants: "Cache list + global lock" and "Hashed cache array"
Depending on the operating system, two references could be used to access the cache list. One could be via Thread Local Storage, which, depending on the O/S and runtime system, may reduce to the overhead of supplying a segment override prefix; the second would be your global table of tables for inter-thread allocation/deallocation.
Yes, of course; I was not thinking clearly; you're right! As for remote caching of memory, well, I agree with you. It can create odd scenarios. Dmitriy and I have already discussed this elsewhere.
A cache list plus a global lock does not seem viable. Not only does it add a serialization bottleneck into an otherwise distributed design, it also makes the thread go through a bunch of different block headers, many of which could be cold in the CPU cache.
The second design looks more appealing, though instead of hash collisions I would probably try a kind of LRU policy to flush some lists - it is a cache, after all :) On the other hand, this design might lead to a scenario where you flush just one or two objects at a time - almost as if you returned each object immediately, except that the corresponding block header is already cold in your CPU cache.
The biggest problem with object caching policies is that a general-purpose allocator never knows how many more objects will come from this exact thread (in the case of the TBB allocator, from this exact slab block - the free lists, both local and global, are per slab in the current design), so you do not know whether it makes sense to cache an object in order to return a whole bunch at once.
Also, isn't it you who thinks that sharing is almost the root of all evil in multithreaded programming? ;)) And I tend to agree :) From this point of view, object exchange between threads should be as rare as possible, so contention and atomic operations on the cold path of cross-thread deallocation should be well amortized by the fast hot path of working with the local free list.
Yes, I am not fully satisfied with it either. The idea was that remote deallocations are infrequent; further, a thread tries to acquire the global lock only once per N (N = 128, 256, ...) deallocations; further, a thread can use try_lock() and, if the attempt fails, just try again next time.
But yes, blocks will be cold in cache with substantial probability. And yes, this scheme requires periodic activity from the thread.
Yeah, the problem with most such designs is that they fit some situations perfectly, but at the same time it's easy to come up with a counter-example where the design will suck.
Btw, why is it impossible to create per-thread remote freelists instead of per-slab freelists? Because you don't want a thread to go through a bunch of different block headers in order to sort them out into different slabs, right?
Yes, it's me who was saying that sharing is the root of all evil.
Actually, I think that I was saying both things: that it's the root of all evil and that it's a necessary thing :) because without sharing it's no longer a multi-threaded program, it's just a set of independent processes which are unable to solve a single task.
Well, I think that you are right for a substantial number of well-written multi-threaded programs - i.e. there is no need to optimize remote deallocations.
Although it seems that a substantial number of multi-threaded programs rely on the producer-consumer pattern in some form (stages, pipelines, etc). And the producer-consumer pattern means that there is a *constant* flow of remote deallocations.
Also, because we are talking about a general-purpose allocator (and not an allocator for some particular program), we must also consider "not so well-written programs". Isn't it good for a library that a non-expert user can create a not-so-well-written program and it will still behave adequately, because the underlying library uses a bunch of smart distributed and amortized techniques?
I see. It's interesting. Thank you.
I was thinking that you would be able to return a bunch of remote blocks to the owner at once (with one atomic RMW). So you are only able to reuse cached nodes, or to return them to the owner one by one. It's interesting how you sort cached blocks by size, so that a thread is able to quickly find a cached block of the needed size. Even such caching can have a great positive impact in some situations - for example, when 2 threads exchange messages with roughly equal frequency.
But what bothers me is the following. Assume there is an 'asymmetric' message exchange, i.e. only one thread sends messages to another. Such caching won't help here; it will only make things worse, because there will be additional bookkeeping, and some blocks will already be cold in cache.
I think it's possible to create a general-purpose caching scheme in which a bunch of memory blocks can be returned to the owner with one atomic RMW. It will have O(N^2) space complexity.
Assign to every thread a dense index from 0 to N (if a thread dies, its number is reused by the next created thread). Create in TLS an array of remote thread descriptors. This array is accessed only by its owner, so it's easy to implement growing: if a thread wants to fetch the descriptor of thread N+1 but the current array size is N, it just allocates a new, bigger array, copies the information over, and replaces the pointer to the array.
The rest is a piece of cake. A thread caches remote blocks in the corresponding descriptor in its TLS array.
There is one little problem. When a thread wants to allocate a block, it has to check ALL remote thread descriptors. To solve this, 2 lists of blocks can be maintained: one per-remote-thread list (needed when we want to return a bunch of blocks to the owner), and one 'global' (i.e. per-thread) list which combines all non-empty per-remote-thread lists (needed when the thread wants to allocate a block from the cache).
Although some details must be still worked over...
What do you think?
Maybe it's already not worth it... I'm not sure... What do you think?
On the other hand, if it can be implemented efficiently (especially without additional overhead for other usage patterns), then it's always nice to provide efficient support for one more usage pattern...
It isn't impossible, but I think the current design is at least no worse, and might be faster. One of the reasons is exactly as you mentioned - I don't want to go through all the different headers for different allocation sizes. The other is that a single remote free list, even a non-blocking one, is a bottleneck again; per-slab lists should have less contention on update. Also, the overhead of reclaiming remotely freed objects is distributed more over time than with a single public list, yet it is still on demand only. Though it might be interesting to test other schemes.
I also agree that some caching scheme is worth benchmarking at least, and agree that it should return a bunch of memory objects at once. Also, in light of the probable widespread adoption of NUMA systems, I would avoid reusing remote memory - so returning a bunch becomes the only reason to cache.
Well, yes, of course "really remote" memory reuse is very bad, while intra-node (actually, rather intra-socket nowadays) reuse might be fine. But it adds bookkeeping complexity and reduces portability. Being NUMA-oblivious at the object level is simpler. At the level of VM mapping and slab reuse, however, it makes more sense to me to play with non-portable NUMA APIs - as VM mapping is non-portable anyway.
In my region of Russia, we saw the first snow a couple of weeks ago, much earlier this year than usual. It did not settle, of course. But when it does, I'd prefer to go cross-country skiing in my spare time :)