My name is Chris M. Thomasson, and I have developed a non-blocking algorithm that allows one to transform an inherently single-threaded memory allocator into a scalable multi-threaded allocator. This invention lets a programmer create an allocator algorithm without worrying about any multi-threading issues. Once you're done with your single-threaded allocator, you apply my non-blocking algorithm, and PRESTO! It's now a scalable multi-threaded allocator! Is that cool or what?
Okay, here is an initial proof of concept: I have transformed the Windows Heap object created with the HEAP_NO_SERIALIZE flag. This flag disables the internal lock used by the heap, which makes it a heck of a lot more efficient. However, since the heap is no longer serialized, one cannot use it as a general allocator in a multi-threaded application; until now... The algorithm I created is fairly simple; basically, it goes like this:
___________________________________________________________________________
I create a thread-local user allocator in every thread.
An allocation request is forwarded directly to this thread-local user allocator.
If a free request comes from the thread that allocated the block, the request is forwarded to this thread-local user allocator.
If a free request comes from another thread, the block is accumulated in a per-thread stack-based freelist.
Blocks from this freelist are actually freed in batches when the owning thread allocates/deallocates another block.
___________________________________________________________________________
I mentioned this algorithm a while back:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/eb87f9cb2cb62d84
Well, I have released a pre-alpha version under the GPL which demonstrates how to transform the Windows Heap object; the example source code can be found here:
http://webpages.charter.net/appcore/vzoom/malloc/vzmalloc_win_v000_c.html
Can you please run the test as-is, which uses the stock malloc implementation, and record the output... Then uncomment the following line:
#define USE_MEM_ALLOC
which engages my algorithm; run the test again; and post the difference in timings between the two runs. I think you will be fairly surprised at the performance enhancement. The lock in the Windows Heap object that `malloc' is based on severely damages its scalability in a multi-threaded environment. When `USE_MEM_ALLOC' is defined, ALL of the locking is removed and the heaps are created on a per-thread basis. This distribution, plus the clever support algorithm I invented, allows the Windows Heap to scale quite nicely indeed.
I am going to tweak the algorithm so that one can plug in different allocators on a per-thread basis! In other words, thread A can use a different allocation algorithm than thread B, and so on. My algorithm actually supports this. I think the possibilities are truly amazing. It even allows a block allocated in thread A to be freed in thread B, even though the two threads use completely different allocation algorithms. That's pretty darn cool. Some may even think it's impossible; it's not!
BTW, I am thinking that this algorithm might be usable to transform other single-threaded things besides allocators... Humm, need to think some more on that.
P.S.
The test portion of the code uses the pthread-win32 library which can be downloaded from here:
http://sourceware.org/pthreads-win32
I am thinking of removing the dependence on it and just using pure Windows primitives.
Any thoughts?
LONG
mem_thread_sys_reclaim(
    union mem_block* const reset
) {
    LONG count = 0;
    union mem_block* mblk = XCHG(&g_mem_tls->rfree, reset);
    while (mblk) {
        union mem_block* const next = mblk->next;
        if (! HeapFree(g_mem_tls->heap, HEAP_NO_SERIALIZE, mblk)) {
            assert(0); abort();
        }
        ++count;
        mblk = next;
    }
    if (! reset) {
        g_mem_tls->refs += count;
    }
    return count;
}
Can you spot the bug? Well, IMVHO, it's kind of easy... See, the reclaim function is supposed to free the blocks that have been deferred by remote threads... Well, a free is supposed to actually decrement the owner thread's reference count... I was incrementing it! OUCH!
        g_mem_tls->refs += count;
    }
to:
        g_mem_tls->refs -= count;
    }
Anyway, I am not sure how the reclaim threshold feature is going to affect performance. Humm... Need to think about this.
Intel TBB's scalable malloc uses the same technique.
The interesting things about TBB's scalable malloc:
- for large objects (currently bigger than 8064 bytes) a global allocator is used (plain malloc, or mmap - chosen by a define)
- there is also a global cache of superfluous memory superblocks, i.e. if a thread has too many superblocks it moves them to the global cache
- there is no caching of remote frees, i.e. if a thread constantly frees remote memory it will constantly execute atomic ops (very bad in some scenarios - producer/consumer)
- the inter-thread queue is organized as a lock-free stack (IBM freelist), or as a mutex-protected stack
- the thread grabs the lock-free stack with CAS, not with XCHG
I think that TBB's approach has some advantages over automatically turning a purely single-threaded allocator into a multi-threaded one. Consider points 1 and 2.
Though it's beyond question that your automatic approach is faaaaaaaaar better than just protecting a single-threaded allocator with a single mutex (or with fine-grained mutexes) - the thing we currently see in many allocators.
Indeed. Interaction yields fruits.
extern "C" void scalable_free (void *object) {
    Block *block;
    ThreadId myTid;
    FreeObject *objectToFree;
    if (!object) {
        return;
    }
    if (isLargeObject(object)) {
        freeLargeObject(object);
        return;
    }
    objectToFree = (FreeObject *)object;
    myTid = getThreadId();
    block = (Block *) ((uintptr_t)object & ~(blockSize - 1)); /* mask low bits to get the block */
    MALLOC_ASSERT (block->allocatedCount, ASSERT_TEXT);
    if (myTid == block->owner) {
Yes, Block is a fixed-size slab allocator.
If the size requested via realloc is smaller than the current size, then there is no copying. This holds both for small fixed-size objects and for big objects.
If the size requested via realloc is bigger than the current size, then, yes, a new block is allocated and the memory is copied.
I don't get you. What do you mean by "blocks from a slab could be scattered across multiple threads instead of always being directed to the owner"?
Another interesting moment about TBB's scalable malloc is how it determines large objects. This is exactly what we discussed several days ago:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/eb0cc5c0218b8570
/* Does this object start on a 16K boundary + the size of a large object header? */
static inline unsigned int isLargeObject(void *object)
{
    return ((uintptr_t)object & (blockSize - 1)) == sizeof(LargeObjectHeader);
}
Hmmm... I'm thinking about how this can be improved...
For example, say we have 64kb superblocks in the slab allocator. We can avoid placing slab objects at offsets 0kb, 4kb, 8kb, 12kb, ..., 62kb, and place big objects exactly at those offsets. That way we have a bit more freedom in where we can place big objects.
>> block = (Block *) ((uintptr_t)object & ~(blockSize - 1)); /* mask low bits to get the block */
I haven't looked at the allocation code in TBB, and in particular the definition of Block, but...
The above would require sizeof(Block) to be a power of 2 (not much of a problem) and that the first few bytes of the object reside within the Block. This could introduce problems when (if) aligned allocations are supported, and would then require the size of Block to be at least as large as the largest aligned allocation supported. Then consider that the alignment requirements of the application are unknown to the author of the allocation routine.
I would think a better route would be to prepend a locator in front of the returned object pointer (residing within the prepended Block object). This could either be the Block* itself, or a byte count from the object pointer back to the base of the Block.
Jim Dempsey
The allocator ALWAYS knows the alignment requirements of the platform. I'm not sure what you mean by "alignment requirements of the application".
What you are proposing is exactly the thing which developers (and Chris) tried to avoid. Header-prefixed allocations can be too expensive for small objects.
Do you mean that they can be cached for way too long (i.e. infinitely)?
Caching for a limited amount of time cannot be a problem, because even when a thread dies there can still be blocks owned by that thread in use.
Possibly we have some misunderstanding because of terminology.
What is the problem with handling objects bigger than a superblock? One just has to allocate them via ::malloc. They are big, so it must not be a problem. Or they can be allocated directly via mmap()/VirtualAlloc(). This is the approach used in TBB's scalable malloc.
http://groups.google.com/group/comp.programming.threads/msg/f2b1cac59c8e98a3
I have it in an experimental next version of the vZOOM commercial library that is not available to anyone yet. However, I think I will post it here. I have to create a proof and run some more tests, but it definitely makes things work WAY faster when there is a significant number of reclaimed remote blocks. Let's see if you can guess what I did...
Here is a hint: I found an easy way to completely remove the NASTY loop in the `slab_sys_shared_pop()' function... How do you think I accomplished this while still maintaining a valid reference count?
:^D

