My name is Chris M. Thomasson and I have developed a non-blocking algorithm that allows one to transform an inherently single-threaded memory allocator into a scalable multi-threaded allocator. This invention lets a programmer create an allocator algorithm without worrying about any multi-threading issues. Once you're done with your single-threaded allocator, you apply my non-blocking algorithm, and PRESTO! It's now a scalable multi-threaded allocator! Is that cool or what?
Okay, here is an initial proof of concept: I have transformed the Windows Heap object created with the HEAP_NO_SERIALIZE flag. This flag disables the internal lock used by the heap, which makes it a heck of a lot more efficient. However, since the heap is no longer serialized, one cannot use it as a general allocator in a multi-threaded application; until now... The algorithm I created is fairly simple; basically, it goes like this:
1. Each thread creates a thread-local instance of the user's allocator.
2. An allocation request is forwarded directly to the calling thread's allocator.
3. If a free request comes from the thread that allocated the block, it is forwarded directly to that thread-local allocator.
4. If a free request comes from another thread, the block is accumulated on the owning thread's per-thread, stack-based freelist.
5. Blocks on this freelist are actually freed in batches the next time the owning thread allocates or deallocates a block.
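The steps above can be sketched in portable C++. This is a minimal sketch only, not the released code: plain malloc/free stands in for the single-threaded user allocator, and all of the names here (PerThreadAllocator, BlockHeader, my_alloc, my_free) are illustrative assumptions, not identifiers from the actual distribution.

```cpp
#include <atomic>
#include <cassert>
#include <cstdlib>
#include <thread>

struct PerThreadAllocator;

// Header prepended to every block; it records the owning thread's allocator
// so a cross-thread free can find the right remote-free stack.
struct BlockHeader {
    PerThreadAllocator* owner;
    BlockHeader* next; // link while the block is parked on the remote-free stack
};

struct PerThreadAllocator {
    // Lock-free LIFO of blocks freed by *other* threads.
    std::atomic<BlockHeader*> remote_free{nullptr};

    // Drain the remote-free stack in one batch and hand the blocks back to
    // the single-threaded allocator (free() stands in for it here).
    void flush_remote() {
        BlockHeader* h = remote_free.exchange(nullptr, std::memory_order_acquire);
        while (h) {
            BlockHeader* next = h->next;
            std::free(h);
            h = next;
        }
    }

    void* alloc(std::size_t n) {
        flush_remote(); // batched frees happen on alloc/free, per the algorithm
        BlockHeader* h =
            static_cast<BlockHeader*>(std::malloc(sizeof(BlockHeader) + n));
        if (!h) return nullptr;
        h->owner = this;
        return h + 1;
    }

    void local_free(BlockHeader* h) {
        flush_remote();
        std::free(h);
    }

    // Called from a *different* thread: push onto the owner's lock-free stack.
    void push_remote(BlockHeader* h) {
        BlockHeader* old = remote_free.load(std::memory_order_relaxed);
        do {
            h->next = old;
        } while (!remote_free.compare_exchange_weak(
            old, h, std::memory_order_release, std::memory_order_relaxed));
    }
};

thread_local PerThreadAllocator tl_alloc;

void* my_alloc(std::size_t n) { return tl_alloc.alloc(n); }

void my_free(void* p) {
    if (!p) return;
    BlockHeader* h = static_cast<BlockHeader*>(p) - 1;
    if (h->owner == &tl_alloc) h->owner->local_free(h); // same-thread free
    else h->owner->push_remote(h);                      // cross-thread free
}
```

One caveat the sketch glosses over: the owning thread's allocator must outlive every block it handed out, so thread shutdown needs extra care in a real implementation.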
I mentioned this algorithm a while back:
Well, I have released a pre-alpha version of it under the GPL, which demonstrates how to transform the Windows Heap object; the example source code can be found here:
Can you please run the test as-is, which uses the stock malloc implementation, and record the output... Then uncomment the following line:
which engages my algorithm; run the test again; and post the differences in timings between the two runs. I think you will be fairly surprised at the performance enhancement. The lock in the Windows Heap object that `malloc' is based on severely damages its scalability when used in a multi-threaded environment. When `USE_MEM_ALLOC' is defined, ALL of the locking is removed and the heaps are created on a per-thread basis. This per-thread distribution, together with the clever support algorithm I invented, allows the Windows Heap to scale quite nicely indeed.
I am going to tweak the algorithm so that one can plug in different allocators on a per-thread basis! In other words, thread A can use a different allocation algorithm than thread B, and so on. My algorithm actually supports this. I think the possibilities are truly amazing. It even allows thread A to allocate a block and thread B to free it, even though they use completely different allocation algorithms. That's pretty darn cool. Some may think it's even impossible; it's not!
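Here is one way the per-thread pluggable backends might look. Everything in this sketch is an illustrative assumption, not the actual code: the names (Backend, plug_alloc, plug_free) and the trivial bump allocator are invented for the example. The idea is that every block records the backend that produced it, so a free issued from any thread can be routed back to the right algorithm.

```cpp
#include <cstddef>
#include <cstdlib>

// A "backend" is any pair of single-threaded alloc/free routines.
struct Backend {
    void* (*alloc)(std::size_t);
    void  (*release)(void*);
};

// Backend A: plain malloc/free.
static void* sys_alloc(std::size_t n) { return std::malloc(n); }
static void  sys_release(void* p)     { std::free(p); }
static Backend malloc_backend = { sys_alloc, sys_release };

// Backend B: a trivial bump allocator (release is a no-op; illustration only).
static alignas(std::max_align_t) unsigned char arena[1 << 16];
static std::size_t arena_top = 0;
static void* bump_alloc(std::size_t n) {
    n = (n + 15) & ~std::size_t(15); // keep returned pointers aligned
    if (arena_top + n > sizeof arena) return nullptr;
    void* p = arena + arena_top;
    arena_top += n;
    return p;
}
static void bump_release(void*) {}
static Backend bump_backend = { bump_alloc, bump_release };

// Each thread installs its own backend; malloc/free is the default.
thread_local Backend* tl_backend = &malloc_backend;

// Every block records which backend produced it.
struct PlugHeader { Backend* owner; };

void* plug_alloc(std::size_t n) {
    PlugHeader* h =
        static_cast<PlugHeader*>(tl_backend->alloc(sizeof(PlugHeader) + n));
    if (!h) return nullptr;
    h->owner = tl_backend;
    return h + 1;
}

void plug_free(void* p) {
    if (!p) return;
    PlugHeader* h = static_cast<PlugHeader*>(p) - 1;
    // In the full algorithm a cross-thread free would be deferred onto the
    // owning thread's remote-free stack; direct dispatch here for brevity.
    h->owner->release(h);
}
```

Because the header carries the owning backend, the freeing thread never needs to know which algorithm produced the block; the deferred-free batching from the steps above does the rest.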
BTW, I am thinking that this algorithm might be usable for transforming other single-threaded things besides allocators... Hmm, need to think some more on that.
The test portion of the code uses the pthread-win32 library which can be downloaded from here:
I am thinking of removing the dependence on it and just using pure Windows primitives.
I was confused because you didn't show the sorting and organization into cohorts. I think it's OK to sort only in mflush(). I think it's even better than my variant.
Though I think you must employ something like this (the "hashed cache array" variant):
And if we had dense thread IDs, that could help with the sorting and organizing into cohorts.