topic Re: transforming single-threaded allocators... in Intel® Moderncode for Parallel Architectures

transforming single-threaded allocators...

Chris_M__Thomasson — Wed, 08 Oct 2008 04:34:02 GMT

My name is Chris M. Thomasson and I have developed a non-blocking algorithm that allows one to transform an inherently single-threaded memory allocator into a scaleable multi-threaded allocator. This invention allows a programmer to just create an allocator algorithm without worrying about any multi-threading issues. Once your done with your single-threaded allocator, well, when you apply my non-blocking algorithm, and PRESTO! Its now a scaleable multi-threaded allocator! Is that cool or what?

Okay, here is an initial proof of concept, I have transformed the Windows Heap object created with the HEAP_NO_SERIALIZE flag. This flag disables the internal lock used by the heap, which makes it a heck of a lot more efficient. However, since the heap is no longer serialized, one cannot use it in a multi-threaded application as a general allocator; until now... The algorithm I created is fairly simple; basically, it goes like this:
___________________________________________________________________________
I create thread local user allocator in every thread.

Alloc request is forwarded to this thread-local user allocator directly.

If free request goes from thread that allocate block, then free request is forwarded to this thread-local user allocator.

If free request goes from another thread, then you accumulate this block in per-thread stack-based freelist.

Blocks from this freelist are actually freed in batches when thread allocates/deallocates another block.
___________________________________________________________________________

I mentioned this algorithm a while back:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/eb87f9cb2cb62d84

Well, I have released a pre-alpha version of it which demonstrates how to transform the Windows Heap object under GPL license; the example source-code can be found here:

http://webpages.charter.net/appcore/vzoom/malloc/vzmalloc_win_v000_c.html

Can you please run the test as-is, which uses stock malloc impl, and record the output... Then uncomment the following line:

#define USE_MEM_ALLOC

which engages my algorithm; run the test again; and post the differences in timings between the two. I think you will be fairly surprised at the performance enhancement. The lock in the Windows Heap object that `malloc' is based on severely damages its scalability when used in a multi-threaded environment. When `USE_MEM_ALLOC' is defined, ALL of the locking is removed and the heaps are created on a per-thread basis. This distribution and clever support algorithm I invented allows for the Windows Heap to scale quite nicely indeed.

I am going to tweak the algorithm such that one could plug in different allocators on a per-thread basis! In other words, thread A can use a different allocation algorithm than thread B, and so on. My algorithm actually supports this. I think the possibilities are truly amazing. It even allows thread A to allocate and free in thread B even though they use completely different allocation algorithm. That's pretty darn cool. Some may think its even impossible; its not!

BTW, I am thinking that this algorithm might be able to be used to transform other single-threaded things besides allocators... Humm, need to think some more on that.

P.S.

The test portion of the code uses the pthread-win32 library which can be downloaded from here:

http://sourceware.org/pthreads-win32

I am thinking of removing dependence on it and just use pure Windows primitives.

Any thoughts?

Re: transforming single-threaded allocators...

Chris_M__Thomasson — Wed, 08 Oct 2008 05:21:03 GMT

I posted the WRONG VERSION! I did not include the reclaim threshold experimental feature I am playing around with. Also, there was a bug in the following function `mem_thread_sys_reclaim()'. Here was the original code:

LONG
mem_thread_sys_reclaim(
union mem_block* const reset
) {
LONG count = 0;
union mem_block* mblk = XCHG(&g_mem_tls->rfree, reset);
while (mblk) {
union mem_block* const next = mblk->next;
if (! HeapFree(g_mem_tls->heap, HEAP_NO_SERIALIZE, mblk)) {
assert(0); abort();
}
++count;
mblk = next;
}
if (! reset) {
g_mem_tls->refs += count;
}
return count;
}

Can you spot the bug? Well, IMVHO, its kind of easy... See, the reclaim function is suppose to free the blocks that have been deferred by remote threads... Well, a free is suppose to actually decrement the owner thread reference count... I was incrementing it! OUCH!

;^(...

Anyway, its fixed on the web-site. I needed to change the code:

if (! reset) {
g_mem_tls->refs += count;
}

to:

if (! reset) {
g_mem_tls->refs -= count;
}

So, you should re-download the source-code! Sorry for any trouble.

http://webpages.charter.net/appcore/vzoom/malloc/vzmalloc_win_v000_c.html

Anyway, I am not sure how the reclaim threshold feature is going to effect performance. Humm... Need to think about this.

Re: transforming single-threaded allocators...

Chris_M__Thomasson — Wed, 08 Oct 2008 05:23:35 GMT

BTW, in case anybody is interested, here is the first time I made the base non-blocking algorithm itself public:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/8245f4b48591fc69

Re: transforming single-threaded allocators...

Chris_M__Thomasson — Wed, 08 Oct 2008 05:42:18 GMT

I improved the test application; the web-page is updated. See, before, it was actually calling `my_malloc/free' functions under the protection of the test mutex. Well, that's no good. So, I moved them outside of the critical-region. Now, it would be an interesting experiment to run both versions of the test. One with the allocation functions in the critical-section, and one without. On my end, my algorithm beats normal malloc/free in either occasion, but really whips them when they are not protected by the mutex. Can you please post some numbers? I am curious to see how horrible stock malloc/free actually are!

;^D

Re: transforming single-threaded allocators...

Dmitry_Vyukov — Wed, 08 Oct 2008 08:14:47 GMT

Intel TBB's scalable malloc uses the same technique.

The interesting things about TBB's scalable malloc:

- for large objects (currently bigger than 8064 bytes) global allocator is used (plain malloc, or mmap - chosen by define)

- there is also global cache of superfluous memory superblocks, i.e. if thread has too many superblocks it moves superblocks to global cache

- there is no caching, i.e. if thread constantly frees remote memory it will constantly execute atomic ops (very bad in some scenarios - producer/consumer)

- inter-thread queue organized as lock-free stack (IBM freelist), or as mutex-protected stack

- thread grabs lock-free stack with CAS, not with XCHG

Re: transforming single-threaded allocators...

Dmitry_Vyukov — Wed, 08 Oct 2008 08:20:27 GMT

Quoting - Dmitriy V'jukov

Intel TBB's scalable malloc uses the same technique.

The interesting things about TBB's scalable malloc:

- for large objects (currently bigger than 8064 bytes) global allocator is used (plain malloc, or mmap - chosen by define)

- there is also global cache of superfluous memory superblocks, i.e. if thread has too many superblocks it moves superblocks to global cache

- there is no caching, i.e. if thread constantly frees remote memory it will constantly execute atomic ops (very bad in some scenarios - producer/consumer)

- inter-thread queue organized as lock-free stack (IBM freelist), or as mutex-protected stack

- thread grabs lock-free stack with CAS, not with XCHG

I think that TBB's approach has some advantages over automatically turning purely single-threaded allocator into multi-threaded. Consider points 1 and 2.

Though it's beyond question that your automatic approach is faaaaaaaaar better than just protecting single-threaded allocator with single mutex (or with fine-grained mutexes) - the thing we currently see in many allocators.

Re: transforming single-threaded allocators...

Chris_M__Thomasson — Wed, 08 Oct 2008 10:01:28 GMT

Quoting - Dmitriy V'jukov

I think that TBB's approach has some advantages over automatically turning purely single-threaded allocator into multi-threaded. Consider points 1 and 2.

Though it's beyond question that your automatic approach is faaaaaaaaar better than just protecting single-threaded allocator with single mutex (or with fine-grained mutexes) - the thing we currently see in many allocators.

Here are some of my current thoughts:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/1689f58f066a5995

I seems like Intel's per-thread allocator instances areprobably aheck of a lot more optimized than Windows Heap API w/ `HEAP_NO_SERIALIZE' flag. I also like how they move super-blocks out of threads and defer to global allocator onlarge allocations. However, I do find it a little strange that they don't use XCHG for reclaiming remote deallocations.

As for my "generic" algorithm, well, with Windows Heaps I simply cannot get it to beat my region allocator:

http://webpages.charter.net/appcore/vzoom/malloc/sergey_vzmem_thread.html

Also, I can't get it to beat the allocator in the current vZOOM commercial package. I guess its good in a sense because it can definitely transform basically any single-threaded allocation technique into a distributed thread-safe entity. IMVHO, its a good algorithm to study...

Perhaps, I am going to try some optimizations... Such as limiting the size of per-thread heaps by utilizing the max_size parameter in the HeapCreate function. Perhaps mess around with a caching algorithm on remote frees. When `HeapAlloc()' fails, it defers to a global low-frag heap. However, I don't know how much this is going to buy me because it seems like calling into `HeapAlloc()/Free()' is an expensive operation no matter if `HEAP_NO_SERIALIZE' is set or not.

Although, the "generic" nature of the current algorithm cannot take advantage of header-less blocks. This has major overhead on a per-allocation basis. Since the algorithm does not know anything about the underlying per-thread allocators, well, it basically has to have a nasty header. I am not sure how to overcome this without forcing a deeper interaction between the algorithm and the allocators its managing.

BTW, do you know off hand if TBB allocator uses header-less blocks?

Re: transforming single-threaded allocators...

Dmitry_Vyukov — Wed, 08 Oct 2008 10:23:14 GMT

Quoting - lockfree

Indeed. Interaction yields fruits.

Quoting - lockfree

BTW, do you know off hand if TBB allocator uses header-less blocks?

extern "C" void scalable_free (void *object) {
Block *block;
ThreadId myTid;
FreeObject *objectToFree;

if (!object) {
return;
}

if (isLargeObject(object)) {
freeLargeObject(object);
return;
}

objectToFree = (FreeObject *)object;

myTid = getThreadId();

block = (Block *) ((uintptr_t)object & ~(blockSize - 1));/* mask low bits to get the block */
MALLOC_ASSERT (block->allocatedCount, ASSERT_TEXT);

if (myTid == block->owner) {

Re: transforming single-threaded allocators...

Chris_M__Thomasson — Wed, 08 Oct 2008 11:59:59 GMT

Quoting - Dmitriy V'jukov

Indeed. Interaction yields fruits.

Quoting - lockfree

BTW, do you know off hand if TBB allocator uses header-less blocks?

extern "C" void scalable_free (void *object) {
Block *block;
ThreadId myTid;
FreeObject *objectToFree;

if (!object) {
return;
}

if (isLargeObject(object)) {
freeLargeObject(object);
return;
}

objectToFree = (FreeObject *)object;

myTid = getThreadId();

block = (Block *) ((uintptr_t)object & ~(blockSize - 1));/* mask low bits to get the block */
MALLOC_ASSERT (block->allocatedCount, ASSERT_TEXT);

if (myTid == block->owner) {

That will definitely do it. I use the same technique in vZOOM and the region allocator. Its basically the only way to accomplish it efficiently. I take it that their `Block' object is a slab of fixed sized objects, and that don't have support for `realloc()' right? I mean they could do it for sure, if they used copying...

As for remote deallocations, well, I am thinking of a way to get it done, but its going to be a little tricky. This is because a blocks from a slab could be scattered across multiple threads instead of always being directed to the owner. Interesting problem indeed!

;^D

Re: transforming single-threaded allocators...

Dmitry_Vyukov — Wed, 08 Oct 2008 12:29:49 GMT

Quoting - lockfree

I take it that their `Block' object is a slab of fixed sized objects, and that don't have support for `realloc()' right? I mean they could do it for sure, if they used copying...

Yes, Block is fixed-size slab allocator.

If requested via realloc size is smaller than current size, then no copying. This is true and for small fixed-size objects and for big objects.

If requested via realloc size is bigger than current size, then, yes, new block is allocated and memory copied.

Quoting - lockfree

I don't get you. What do you mean by "blocks from a slab could be scattered across multiple threads instead of always being directed to the owner"?

Re: transforming single-threaded allocators...

Dmitry_Vyukov — Wed, 08 Oct 2008 12:43:28 GMT

Another interesting moment about TBB's scalable malloc - how it determines large objects. This is exactly what we discussed several days ago:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/eb0cc5c0218b8570

/* Does this object start on a 16K boundary + the size of a large object header? */
static inline unsigned int isLargeObject(void *object)
{
return ((uintptr_t)object & (blockSize - 1)) == sizeof(LargeObjectHeader);
}

Hmmm... I thinking how this can be improved...

For example, we have 64kb superblocks in slab-allocator. We can not use objects starting at 0kb, 4kb, 8kb, 12kb, ..., 62 kb in slab-allocator. And place big objects exactly at those offsets. So we will have a bit more freedom where we can place big objects.

Re: transforming single-threaded allocators...

Chris_M__Thomasson — Wed, 08 Oct 2008 12:43:46 GMT

I forget to clarify. I was referring to moment when `realloc()' is invoked from a thread that does not own the slab in which the block resides. AFAICT,this would require creating a new block in calling threads slab, copying the origin block, then pushing the origin block onto its remote owner thread.

As for my comment on blocks being scattered, well, I again forget to clarify another important moment. I was thinking of ways to cache blocks in remote threads. Well, when the owner thread dies, there could be multiple threads which have blocks from its slabs cached. Humm...

Am I making sense now? Anyway, sorry about that!

:^/

Re: transforming single-threaded allocators...

jimdempseyatthecove — Wed, 08 Oct 2008 12:53:11 GMT

>> block = (Block *) ((uintptr_t)object & ~(blockSize - 1));/* mask low bits to get the block */

I haven't looked at the allocation code in TBB and in particular the definition of Block but...

The above would requre sizeof(Block) to be a power of 2 (not much of a problem)
and that the first few bytes object reside within Block. This could introduce problems when (if) alligned allocations are supported and then would require the size of Block to be at least as large as the largest aligned allocation supported.Then consider thatthe alignment requirements of the application are unknown to the author of the allocation routine.

I would think a better route would be to prepend a locator in front of the returned object pointer (residing within the prepended Block object). This could either bethe Block* itself, or a -byte count from the object pointer to the base of the Block.

Jim Dempsey

Re: transforming single-threaded allocators...

Dmitry_Vyukov — Wed, 08 Oct 2008 13:44:17 GMT

Quoting - jimdempseyatthecove

>> block = (Block *) ((uintptr_t)object & ~(blockSize - 1));/* mask low bits to get the block */

I haven't looked at the allocation code in TBB and in particular the definition of Block but...

The above would requre sizeof(Block) to be a power of 2 (not much of a problem)
and that the first few bytes object reside within Block. This could introduce problems when (if) alligned allocations are supported and then would require the size of Block to be at least as large as the largest aligned allocation supported.Then consider thatthe alignment requirements of the application are unknown to the author of the allocation routine.

I would think a better route would be to prepend a locator in front of the returned object pointer (residing within the prepended Block object). This could either bethe Block* itself, or a -byte count from the object pointer to the base of the Block.

Allocator ALWAYS knows alignment requirements of the platform. I'm not sure what you mean by "alignment requirements of the application".

What you are proposing is exactly the thing which developers (and Chris) tried to avoid. Header-prefixed allocations can be too expensive for small objects.

Re: transforming single-threaded allocators...

Chris_M__Thomasson — Wed, 08 Oct 2008 13:48:32 GMT

Quoting - Dmitriy V'jukov

Humm... Very interesting fine-grain approach. I am thinking how to handle allocations that need to span multiple super-blocks. Well, I guess a possible, perhaps naive, approach would be to align them on a boundary that is greater thanthe size of asuper-block. Then one could tell allocations that are over 64k apart from all the others...

Re: transforming single-threaded allocators...

Dmitry_Vyukov — Wed, 08 Oct 2008 13:56:26 GMT

Quoting - lockfree

Do you mean that they can be cached for a way too long (i.e. infinitely)?

Caching for limited amount of time can not be a problem, because when thread dies there still can be blocks owned by that thread in use.

Re: transforming single-threaded allocators...

Dmitry_Vyukov — Wed, 08 Oct 2008 14:04:14 GMT

Quoting - lockfree

Possibly we have some misunderstanding because of terminology.

What is the problem to handle objects bigger than superblock? One just have to allocate then via ::malloc. They are big so it must not be a problem. Or they can be allocated directly via mmap()/VirtualAlloc(). This is approach used in TBB's scalable malloc.

Re: transforming single-threaded allocators...

Chris_M__Thomasson — Wed, 08 Oct 2008 14:36:48 GMT

Quoting - jimdempseyatthecove

Noblocks within the slab would reside anywhere in the header whatsoever; what makes you think that? Consider this trivial naive simple layout:

_________________________________________________________________

#define L2_CACHE_LINE 128

#define PAGE_SIZE 8192

#define BLKBUF_SIZE (PAGE_SIZE - L2_CACHE_LINE)

struct header {

/* [blah, blah]; */

};

union header_l2pad {

struct header this;

char l2pad[L2_CACHE_LINE];

};

struct slab {

union header_l2pad header;

unsigned char blkbuf[BLKBUF_SIZE];

};

typedef char sassert[

(sizeof(union header_l2pad)== L2_CACHE_LINE) &&

(sizeof(struct slab)== PAGE_SIZE) ? 1 : -1

];

_________________________________________________________________

If you ensure that `struct slab' objects reside on a `PAGE_SIZE' boundary, then you can find the header from any address residing in `struct slab::blkbuf' by using the masking technique. The actual buffer space starts out aligned on a l2 cache line boundary. This is fairly trivial task indeed. If the application needs specific alignment, well, it does what they all do and over allocate and perform the alignment themselves:

http://groups.google.com/group/comp.programming.threads/msg/f2b1cac59c8e98a3

The allocator does not care if the application requests more memory than it needs because of special alignment requirements. If the application needs page alignment, well, in this setup objects larger than `BLKBUF_SIZE' would be considered large and can defer to `malloc/free', or whatever.

Re: transforming single-threaded allocators...

Chris_M__Thomasson — Wed, 08 Oct 2008 15:04:31 GMT

Quoting - Dmitriy V'jukov

Possibly we have some misunderstanding because of terminology.

What is the problem to handle objects bigger than superblock? One just have to allocate then via ::malloc. They are big so it must not be a problem. Or they can be allocated directly via mmap()/VirtualAlloc(). This is approach used in TBB's scalable malloc.

Nevermind I was thinking of something else. Sorry. One time I did a region allocator based on reserved address space, and large object could be created out of multiple contigous pages. It went something like:

____________________________________________________________________

#define PAGE_SIZE 8192

struct page {

unsigned char buf[PAGE_SIZE];

};

struct vmregion {

struct page* base;

size_t count;

size_t max_count;

size_t refs;

};

static pthread_mutex_t g_vmregion_mtx = PTHREAD_MUTEX_INITIALIZER;

static struct vmregion g_vmregion = { 0, 0,0, 0};

/* reserve a very large number of pages and store base in

`g_vmregion::base' and max count in `g_vmregion::max_count'*/

struct page*

vmregion_commit(

size_t pages

) {

struct page* p = NULL;

size_t count;

pthread_mutex_lock(&g_vmregion_mtx);

count = g_vmregion.count + pages;

if (count <= g_vmregion.max_count) {

p = g_vmregion.base + g_vmregion.count;

g_vmregion.count = count;

++g_vmregion.refs;

}

pthread_mutex_unlock(&g_vmregion_mtx);

if (p) {

/* commit the pages */

}

return p;

}

void

vmregion_decommit(

struct page* pages,

size_t count

) {

/* decommit the pages */

pthread_mutex_lock(&g_vmregion_mtx);

--g_vmregion.refs;

if (! g_vmregion.refs) {

g_vmregion.count = 0;

}

pthread_mutex_unlock(&g_vmregion_mtx);

}

____________________________________________________________________

But, this has all the caveats that any region type allocation does. Oh well.

Re: transforming single-threaded allocators...

Chris_M__Thomasson — Wed, 08 Oct 2008 15:17:01 GMT

Quoting - Dmitriy V'jukov

Do you mean that they can be cached for a way too long (i.e. infinitely)?

Caching for limited amount of time can not be a problem, because when thread dies there still can be blocks owned by that thread in use.

They can definitely be cached infinitely in certain scenarios indeed... Think of a thread which freesseveral blocks from several different threads, and caches them... However, from that point onward,the threadsimply never interacts with the allocator interface again, and stays alive for the duration of the program. Well, theslabs which hold those blocks will persist for the lifetime of the program. This can be a problem indeed...

In my algorithm I use reference counting trick to ensure that slabs are freed as soon as the last reference to them has been dropped. You can see example of this moment in the following algorihtms:

http://webpages.charter.net/appcore/vzoom/malloc/sergey_vzmem_thread.html

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/eb87f9cb2cb62d84

http://webpages.charter.net/appcore/vzoom/malloc/vzmalloc_win_v000_c.html

All of those algorithms use the same basic reference counting trick. However, it only works because there is no remote caching. Well, there is definitely one way to solve this problem: You could require the application threads to periodically or episodically free a NULL pointer, or call a special allocator function (e.g., mflush()), which would scan the calling threads cache and perform cleanup as necessary. Humm, what do you think?