Re: transforming single-threaded allocators... - Page 3

Chris_M__Thomasson · ‎10-07-2008

My name is Chris M. Thomasson and I have developed a non-blocking algorithm that allows one to transform an inherently single-threaded memory allocator into a scaleable multi-threaded allocator. This invention allows a programmer to just create an allocator algorithm without worrying about any multi-threading issues. Once your done with your single-threaded allocator, well, when you apply my non-blocking algorithm, and PRESTO! Its now a scaleable multi-threaded allocator! Is that cool or what?

Okay, here is an initial proof of concept, I have transformed the Windows Heap object created with the HEAP_NO_SERIALIZE flag. This flag disables the internal lock used by the heap, which makes it a heck of a lot more efficient. However, since the heap is no longer serialized, one cannot use it in a multi-threaded application as a general allocator; until now... The algorithm I created is fairly simple; basically, it goes like this:
___________________________________________________________________________
I create thread local user allocator in every thread.

Alloc request is forwarded to this thread-local user allocator directly.

If free request goes from thread that allocate block, then free request is forwarded to this thread-local user allocator.

If free request goes from another thread, then you accumulate this block in per-thread stack-based freelist.

Blocks from this freelist are actually freed in batches when thread allocates/deallocates another block.
___________________________________________________________________________

I mentioned this algorithm a while back:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/eb87f9cb2cb62d84

Well, I have released a pre-alpha version of it which demonstrates how to transform the Windows Heap object under GPL license; the example source-code can be found here:

http://webpages.charter.net/appcore/vzoom/malloc/vzmalloc_win_v000_c.html

Can you please run the test as-is, which uses stock malloc impl, and record the output... Then uncomment the following line:

#define USE_MEM_ALLOC

which engages my algorithm; run the test again; and post the differences in timings between the two. I think you will be fairly surprised at the performance enhancement. The lock in the Windows Heap object that `malloc' is based on severely damages its scalability when used in a multi-threaded environment. When `USE_MEM_ALLOC' is defined, ALL of the locking is removed and the heaps are created on a per-thread basis. This distribution and clever support algorithm I invented allows for the Windows Heap to scale quite nicely indeed.

I am going to tweak the algorithm such that one could plug in different allocators on a per-thread basis! In other words, thread A can use a different allocation algorithm than thread B, and so on. My algorithm actually supports this. I think the possibilities are truly amazing. It even allows thread A to allocate and free in thread B even though they use completely different allocation algorithm. That's pretty darn cool. Some may think its even impossible; its not!

BTW, I am thinking that this algorithm might be able to be used to transform other single-threaded things besides allocators... Humm, need to think some more on that.

P.S.

The test portion of the code uses the pthread-win32 library which can be downloaded from here:

http://sourceware.org/pthreads-win32

I am thinking of removing dependence on it and just use pure Windows primitives.

Any thoughts?

Dmitry_Vyukov · ‎10-23-2008

Quoting - CHRIS M THOMASSON

Quoting - Dmitriy V'jukov

Maybe it's already not worth it... I'm not sure... What do you think?
On the other hand, if it can be implemented efficiently (especially w/o additional overheads for other usage patterns), then it's always nicely to provide efficient support for one more usage pattern...

Humm... I think I understand what your getting at. I am not sure if its worth it or not. It definitely extra bookkeeping. Humm... I really need to stop and think about this scheme. At the moment, I am not really sure how much better it is than the current scheme vZOOM uses. I apologize for not putting in the sorting and coalescing of blocks from the same slab in the `mflush()' function. This sorting does have overheads indeed, however,I assume that calls to `mflush()' will be far and few between.

I was confused because you didn't display sorting and organization into cohorts. I think it's Ok to sort only in mflush(). I think it's even better than my variant.

Though I think you must employ something like this ("Hashed cache array" variant):

http://groups.google.ru/group/comp.programming.threads/msg/547b4edb494beeb9?pli=1

And if we would have dense thread ids, it can help with sorting and organizing into cohorts.