Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

transforming single-threaded allocators...

Chris_M__Thomasson
New Contributor I

My name is Chris M. Thomasson and I have developed a non-blocking algorithm that allows one to transform an inherently single-threaded memory allocator into a scalable multi-threaded allocator. This invention allows a programmer to create an allocator algorithm without worrying about any multi-threading issues. Once you're done with your single-threaded allocator, you apply my non-blocking algorithm, and PRESTO! It's now a scalable multi-threaded allocator! Is that cool or what?

Okay, here is an initial proof of concept: I have transformed the Windows Heap object created with the HEAP_NO_SERIALIZE flag. This flag disables the internal lock used by the heap, which makes it a heck of a lot more efficient. However, since the heap is no longer serialized, one cannot use it in a multi-threaded application as a general allocator; until now... The algorithm I created is fairly simple; basically, it goes like this (a rough code sketch follows the list):
___________________________________________________________________________
A thread-local user allocator is created in every thread.

An allocation request is forwarded directly to this thread-local user allocator.

If a free request comes from the thread that allocated the block, the request is forwarded to this thread-local user allocator.

If a free request comes from another thread, the block is accumulated in a per-thread, stack-based freelist.

Blocks from this freelist are actually freed in batches when the thread allocates/deallocates another block.
___________________________________________________________________________
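
For concreteness, here is a rough sketch of how these steps can play out on top of the Windows Heap. This is not the vzmalloc source: the `get_per_thread()' TLS helper is hypothetical, error handling and alignment of the returned pointer are ignored, the remote freelist lives with the owning thread (as in the sketches later in this thread), and it is drained on every allocation for simplicity.
___________________________________________________________________________
#include <windows.h>

typedef struct remote_block { struct remote_block* next; } remote_block;

typedef struct per_thread {
    HANDLE heap;                          /* HeapCreate(HEAP_NO_SERIALIZE, 0, 0)   */
    remote_block* volatile remote_frees;  /* lock-free stack of cross-thread frees */
} per_thread;

per_thread* get_per_thread(void);         /* hypothetical TLS lookup */

/* the owner drains blocks freed by other threads with one exchange */
static void drain_remote(per_thread* t) {
    remote_block* b = (remote_block*)
        InterlockedExchangePointer((PVOID volatile*)&t->remote_frees, NULL);
    while (b) {
        remote_block* next = b->next;
        HeapFree(t->heap, 0, b);          /* only the owner touches its unlocked heap */
        b = next;
    }
}

void* mt_malloc(size_t sz) {
    per_thread* t = get_per_thread();
    drain_remote(t);                      /* batch-free accumulated remote blocks */
    /* a small header records the owning thread so mt_free() can route the block */
    per_thread** hdr = HeapAlloc(t->heap, 0, sz + sizeof(per_thread*));
    if (!hdr) return NULL;
    *hdr = t;
    return hdr + 1;
}

void mt_free(void* mem) {
    per_thread** hdr = (per_thread**)mem - 1;
    per_thread* owner = *hdr;
    per_thread* self = get_per_thread();
    if (owner == self) {
        HeapFree(owner->heap, 0, hdr);    /* local free goes straight to the heap */
    } else {
        /* remote free: push onto the owner's stack with a CAS loop */
        remote_block* node = (remote_block*)hdr;
        remote_block* cmp;
        do {
            cmp = owner->remote_frees;
            node->next = cmp;
        } while (InterlockedCompareExchangePointer(
                     (PVOID volatile*)&owner->remote_frees, node, cmp) != cmp);
    }
}
___________________________________________________________________________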


I mentioned this algorithm a while back:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/eb87f9cb2cb62d84


Well, I have released a pre-alpha version of it which demonstrates how to transform the Windows Heap object under GPL license; the example source-code can be found here:

http://webpages.charter.net/appcore/vzoom/malloc/vzmalloc_win_v000_c.html


Can you please run the test as-is, which uses the stock malloc implementation, and record the output... Then uncomment the following line:


#define USE_MEM_ALLOC


which engages my algorithm; run the test again, and post the difference in timings between the two runs. I think you will be fairly surprised by the performance enhancement. The lock in the Windows Heap object that `malloc' is based on severely damages its scalability when used in a multi-threaded environment. When `USE_MEM_ALLOC' is defined, ALL of the locking is removed and the heaps are created on a per-thread basis. This distribution, together with the clever support algorithm I invented, allows the Windows Heap to scale quite nicely indeed.

I am going to tweak the algorithm so that one can plug in different allocators on a per-thread basis! In other words, thread A can use a different allocation algorithm than thread B, and so on. My algorithm actually supports this. I think the possibilities are truly amazing. It even allows thread A to allocate and free in thread B even though they use completely different allocation algorithms. That's pretty darn cool. Some may think it's impossible; it's not!

BTW, I am thinking that this algorithm might be able to be used to transform other single-threaded things besides allocators... Humm, need to think some more on that.

P.S.

The test portion of the code uses the pthreads-win32 library, which can be downloaded from here:

http://sourceware.org/pthreads-win32

I am thinking of removing the dependence on it and just using pure Windows primitives.

Any thoughts?

Dmitry_Vyukov
Valued Contributor I
Quoting - lockfree
I have implemented a _significant_ improvement in the way nodes are reclaimed in the following algorithm:

I have it in an experimental next version of the vZOOM commercial library that is not available to anyone yet. However, I think I will post it here. I have to create a proof, and run some more tests, but it definitely makes things work WAY faster when there are a significant number of reclaimed remote blocks. Let's see if you can guess what I did...

Here is a hint: I found an easy way to completely remove the NASTY loop in the `slab_sys_shared_pop()' function... How do you think I accomplished this while still maintaining a valid reference count?


:^D

It looks like a challenge. Ok, let's see how I would implement it in my commercial library, if I had one.

/* Sketch: XCHG/CAS/XADD stand for atomic exchange, compare-and-swap
   (returning whether it succeeded) and fetch-and-add (returning the old
   value). Allocation of fresh blocks/superblocks is omitted. */

typedef struct block_t block_t;
typedef struct per_thread_t per_thread_t;

struct block_t
{
    block_t      *next;
    per_thread_t *thread;    /* owning thread */
};

typedef struct block_anchor_t
{
    block_t *front;
} block_anchor_t;

struct per_thread_t
{
    block_anchor_t local;    /* private freelist, touched only by the owner  */
    block_anchor_t shared;   /* lock-free stack for frees from other threads */
    block_t       *blocks;
    size_t         block_sz;
    int            refs;
};

block_t* slab_sys_local_pop( per_thread_t *_this );
void     slab_sys_local_push( block_t *_this );
block_t* slab_sys_shared_pop( per_thread_t *_this );
void     slab_sys_shared_push( block_t *_this );

void* slab_malloc( size_t sz )
{
    per_thread_t *_this = pthread_getspecific(...);
    block_t *block = slab_sys_local_pop( _this );
    return block;
}

void slab_free( void *mem )
{
    block_t *block = ((block_t*)mem) - 1;
    per_thread_t *_this = pthread_getspecific(...);
    if ( block->thread == _this )
        slab_sys_local_push( block );    /* local free: no atomics needed */
    else
        slab_sys_shared_push( block );   /* remote free: lock-free push   */
}

block_t* slab_sys_local_pop( per_thread_t *_this )
{
    block_t *block = _this->local.front;
    if ( block )
    {
        _this->local.front = block->next;
        return block + 1;                /* user memory starts after the header */
    }
    return slab_sys_shared_pop( _this );
}

void slab_sys_local_push( block_t *_this )
{
    _this->next = _this->thread->local.front;
    _this->thread->local.front = _this;
}

block_t* slab_sys_shared_pop( per_thread_t *_this )
{
    /* grab the whole shared stack with a single exchange */
    block_t *block = XCHG( &_this->shared.front, 0 );
    if ( block )
    {
        _this->local.front = block->next;   /* splice the rest into the local list */
        return block + 1;
    }
    return 0;
}

void slab_sys_shared_push( block_t *_this )
{
    /* bit 0 of the shared anchor marks an owner thread that has terminated */
    block_t *cmp;
    do
    {
        cmp = _this->thread->shared.front;
        _this->next = (block_t*)( (uintptr_t)cmp & ~(uintptr_t)1 );
    }
    while ( ! CAS( &_this->thread->shared.front,
                   cmp,
                   ( (uintptr_t)cmp & 1 ) ? (block_t*)( (uintptr_t)_this | 1 )
                                          : _this ) );
    if ( ( (uintptr_t)cmp & 1 ) && XADD( &_this->thread->refs, -1 ) == 1 )
    {
        /* destroy _this->thread */
    }
}

void slab_sys_tls_dtor( void *state )
{
    per_thread_t *_this = state;
    _this->refs = 1 + BLOCK_COUNT_PER_SUPERBLOCK;
    int count = 1;
    block_t *block = _this->local.front;
    while ( block )                      /* count blocks already back home */
    {
        count += 1;
        block = block->next;
    }
    /* mark the shared anchor as "owner dead" and drain whatever is there */
    block = XCHG( &_this->shared.front, (block_t*)1 );
    while ( block )
    {
        count += 1;
        block = block->next;
    }
    if ( XADD( &_this->refs, -count ) == count )
    {
        /* destroy _this */
    }
}

Chris_M__Thomasson
New Contributor I
Well Dmitriy, you got it right!
:^D
Bingo: I simply moved the loop/counting to the thread dtor. The tweak works and definitely improves performance. AFAICT, wrt my original algorithm, moving the loop to the dtor was the only way to accomplish this improvement. I can't really think of another way to do it so efficiently. Removing/deferring loops can usually be a major plus to basically any algorithm, IMVHO that is!
=^)
Alexey-Kukanov
Employee
Quoting - lockfree
I forgot to clarify. I was referring to the moment when `realloc()' is invoked from a thread that does not own the slab in which the block resides. AFAICT, this would require creating a new block in the calling thread's slab, copying the original block, then pushing the original block onto its remote owner thread.
As for my comment on blocks being scattered, well, I again forgot to clarify another important moment. I was thinking of ways to cache blocks in remote threads. Well, when the owner thread dies, there could be multiple threads which have blocks from its slabs cached. Humm...
Am I making sense now? Anyway, sorry about that!
:^/

If the new size is not bigger than the fixed size of the slab, why should the block be reallocated, even if "owned" by another thread? It can simply be reused. If the new size is bigger than can be served by the slab, it should be reallocated, again independently of whether the block is local or remote.
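
For illustration (not TBB code), the realloc policy described above amounts to something like this; `get_slab_from_block()', `my_malloc()', `my_free()' and the `blksize' field are hypothetical:

#include <stddef.h>
#include <string.h>

struct slab { size_t blksize; /* fixed object size served by this slab, ... */ };

struct slab* get_slab_from_block(void* mem);   /* hypothetical helpers */
void*        my_malloc(size_t size);
void         my_free(void* mem);

void* my_realloc(void* mem, size_t new_size) {
    struct slab* s = get_slab_from_block(mem);
    if (new_size <= s->blksize)
        return mem;                    /* fits in place: reuse, local or remote */
    void* fresh = my_malloc(new_size); /* too big for this slab: move the data  */
    if (fresh) {
        memcpy(fresh, mem, s->blksize);
        my_free(mem);
    }
    return fresh;
}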

As for caching blocks in remote threads, in the TBB allocator it is impractical to implement.

Dmitry_Vyukov
Valued Contributor I

As for caching blocks in remote threads, in the TBB allocator it is impractical to implement.

Why? Please clarify.

Well, if the allocator is used with tasks, then yes, it is impractical, because stealing is rare, so there are few remote deallocations.

But TBB's scalable allocator is a stand-alone thing. It can be used in arbitrary software. And a substantial amount of software relies on the producer-consumer pattern in some form, for which caching can be crucial, because there is a *constant* stream of remote deallocations.

Alexey-Kukanov
Employee
>> block = (Block *) ((uintptr_t)object & ~(blockSize - 1));/* mask low bits to get the block */

I haven't looked at the allocation code in TBB and in particular the definition of Block but...

The above would require sizeof(Block) to be a power of 2 (not much of a problem) and that the first few bytes of the object reside within the Block. This could introduce problems when (if) aligned allocations are supported, and would then require the size of Block to be at least as large as the largest supported aligned allocation. Then consider that the alignment requirements of the application are unknown to the author of the allocation routine.

I would think a better route would be to prepend a locator in front of the returned object pointer (residing within the prepended Block object). This could either be the Block* itself, or a byte offset from the object pointer to the base of the Block.

Jim Dempsey

blockSize in the TBB allocator is not the same as sizeof(Block). The former is the total size of any slab in bytes, while the latter is the size of the slab header (yes, Block is in fact the slab header structure). So sizeof(Block) does not formally need to be a power of two. And yes, due to alignment requirements the actual allocation area in the slab might not start immediately after the header; in a sense, this effectively increases the "virtual" size of the header (i.e. memory overhead). But there is no problem with the alignment requirements of an application, because aligned allocations should use a separate set of API functions and explicitly specify the required alignment; this allows selecting a slab with the proper alignment, though it may mean even more memory overhead - all for the sake of speed.
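
(For illustration only, not TBB source: the masking trick under discussion works because each slab is allocated at an address aligned to its total size; the names below are made up.)

#include <stdint.h>

#define SLAB_BYTES (16 * 1024)          /* total slab size, a power of two */

typedef struct Slab {
    struct Thread* owner;               /* slab header fields ... */
    /* the usable area starts after the (possibly padded) header */
} Slab;

/* recover the slab header from any object address inside the slab */
static Slab* slab_of(void* object) {
    return (Slab*)((uintptr_t)object & ~((uintptr_t)SLAB_BYTES - 1));
}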

Dmitry_Vyukov
Valued Contributor I
Quoting - lockfree
Well Dmitriy, you got it right!
:^D
Bingo: I simply moved the loop/counting to the thread dtor. The tweak works and definitely improves performance. AFAICT, wrt my original algorithm, moving the loop to the dtor was the only way to accomplish this improvement. I can't really think of another way to do it so efficiently. Removing/deferring loops can usually be a major plus to basically any algorithm, IMVHO that is!
=^)

The only moment that is still unclear to me: how do you cache nodes in caching mode in the vZOOM memory allocator?

Nodes must be sorted into batches by owner thread id. If the number of threads is constant and known, then it's simple - we create an array of lists and give every thread a unique id 0, 1, 2... Then the task is trivial.

But if the number of threads is unknown and variable, then one has to employ something like a hash map or... I don't know... it must be something very cheap...

Dmitry_Vyukov
Valued Contributor I
Quoting - Dmitriy V'jukov

The only moment that is still unclear to me: how do you cache nodes in caching mode in the vZOOM memory allocator?

Nodes must be sorted into batches by owner thread id. If the number of threads is constant and known, then it's simple - we create an array of lists and give every thread a unique id 0, 1, 2... Then the task is trivial.

But if the number of threads is unknown and variable, then one has to employ something like a hash map or... I don't know... it must be something very cheap...

I remembered that I was already thinking about this:

http://groups.google.ru/group/comp.programming.threads/msg/547b4edb494beeb9

I was able to come up with 2 variants: "Cache list + global lock" and "Hashed cache array"

jimdempseyatthecove
Honored Contributor III

Dmitriy,

Depending on the operating system, two references could be used to access the cache list. One could be via Thread Local Storage, which depending on the O/S and runtime system may reduce to the overhead of supplying a segment override prefix, and the second would be your global table of tables for inter-thread allocation/deallocation.

Jim Dempsey

Chris_M__Thomasson
New Contributor I

If the new size is not bigger than the fixed size of the slab, why should the block be reallocated, even if "owned" by another thread? It can simply be reused. If the new size is bigger than can be served by the slab, it should be reallocated, again independently of whether the block is local or remote.

As for caching blocks in remote threads, in the TBB allocator it is impractical to implement.

Yes of course; I was not thinking clearly; you're right! As for remote caching of memory, well, I agree with you. It can create odd scenarios. Dmitriy and I discussed this already elsewhere.

Alexey-Kukanov
Employee
Quoting - Dmitriy V'jukov
I remembered that I was already thinking about this:

http://groups.google.ru/group/comp.programming.threads/msg/547b4edb494beeb9

I was able to come up with 2 variants: "Cache list + global lock" and "Hashed cache array"

A cache list plus a global lock does not seem viable. Not only does it add a serialization bottleneck into an otherwise distributed design, but it also makes the thread go through a bunch of different block headers, many of which could be cold in the CPU cache.

The second design looks more appealing, though instead of hash collisions I would probably try a kind of LRU policy to flush some lists - it is a cache, after all :) On the other hand, this design might lead to a scenario where you flush just one or two objects at a time - almost as if you returned each object immediately, except that the corresponding block header is already cold in your CPU cache.

The biggest problem with object caching policies is that a general-purpose allocator never knows how many more objects will come from this exact thread (in the case of the TBB allocator, from this exact slab block - the free lists, both local and global, are per slab in the current design), so you do not know whether it makes sense to cache an object in order to return a whole bunch at once.

Also, isn't it you who thinks that sharing is almost the root of all evil in multithreaded programming? ;)) And I tend to agree :) From this point of view, object exchange between threads should be as rare as possible, so contention and atomic operations on the cold path of cross-thread deallocation should be well amortized by the fast hot path of working with the local free list.

Dmitry_Vyukov
Valued Contributor I

A cache list plus a global lock does not seem viable. Not only does it add a serialization bottleneck into an otherwise distributed design, but it also makes the thread go through a bunch of different block headers, many of which could be cold in the CPU cache.

Yes, I am not fully satisfied with it either. The idea was that remote deallocations are infrequent; furthermore, a thread tries to access the global lock only once per N (N = 128, 256...) deallocations, and a thread can use try_lock() and, if the attempt to lock fails, just try again next time.

But yes, blocks will be cold in cache with substantial probability. And yes, this scheme requires periodic activity from the thread.
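
(A tiny sketch of that amortized try_lock idea - the names are hypothetical, not from any posted code:)

#include <pthread.h>

#define FLUSH_PERIOD 128   /* N: how often a thread even looks at the global list */

struct thread_cache {
    unsigned remote_free_count;
    /* ... locally accumulated remote blocks ... */
};

static pthread_mutex_t g_cache_lock = PTHREAD_MUTEX_INITIALIZER;

/* called on every remote deallocation, after caching the block locally */
void maybe_flush_to_global(struct thread_cache* tc) {
    if (++tc->remote_free_count % FLUSH_PERIOD != 0)
        return;                                      /* cheap fast path */
    if (pthread_mutex_trylock(&g_cache_lock) == 0) { /* never blocks    */
        /* move the locally accumulated blocks onto the global cache list here */
        pthread_mutex_unlock(&g_cache_lock);
    }
    /* if the try-lock fails, simply try again after the next N frees */
}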

The second design looks more appealing, though instead of hash collisions I would probably try a kind of LRU policy to flush some lists - it is a cache, after all :) On the other hand, this design might lead to a scenario where you flush just one or two objects at a time - almost as if you returned each object immediately, except that the corresponding block header is already cold in your CPU cache.

Yeah, the problem with most such designs is that they perfectly fit some situations, but at the same time it's easy to come up with a counter-example where such a design will suck.

The biggest problem with object caching policies is that a general-purpose allocator never knows how many more objects will come from this exact thread (in the case of the TBB allocator, from this exact slab block - the free lists, both local and global, are per slab in the current design), so you do not know whether it makes sense to cache an object in order to return a whole bunch at once.

Btw, why is it impossible to create per-thread remote freelists instead of per-slab freelists? Because you don't want a thread to go through a bunch of different block headers in order to sort them out into different slabs? Right?

Dmitry_Vyukov
Valued Contributor I

Also, isn't it you who thinks that sharing is almost the root of all evil in multithreaded programming? ;)) And I tend to agree :) From this point of view, object exchange between threads should be as rare as possible, so contention and atomic operations on the cold path of cross-thread deallocation should be well amortized by the fast hot path of working with the local free list.

Yes, it's me who was saying that sharing is the root of all evil.

Actually I think that I was saying both things: that it's the root of all evil and that it's a necessary thing :) because w/o sharing it's no longer a multi-threaded program, it's just a set of independent processes which is unable to solve a single task.

Well, I think that you are right for a substantial number of well-written multi-threaded programs - i.e. there is no need to optimize remote deallocations.

Although, it seems that a substantial number of multi-threaded programs rely on the producer-consumer pattern in some form (stages, pipelines etc). And the producer-consumer pattern means that there is a *constant* flow of remote deallocations.

Also, because we are talking about a general-purpose allocator (and not about an allocator for some particular program), we must also consider "not so well-written programs". Isn't it good for a library that a non-expert user can create a not-so-well-written program and it will still behave adequately, because the underlying library uses a bunch of smart distributed and amortized techniques?

Chris_M__Thomasson
New Contributor I
Quoting - Dmitriy V'jukov

The only moment that is still unclear to me: how do you cache nodes in caching mode in the vZOOM memory allocator?

Nodes must be sorted into batches by owner thread id. If the number of threads is constant and known, then it's simple - we create an array of lists and give every thread a unique id 0, 1, 2... Then the task is trivial.

But if the number of threads is unknown and variable, then one has to employ something like a hash map or... I don't know... it must be something very cheap...

The caching mode in vZOOM does something "kind of" like that; let me quickly sketch it out here:
______________________________________________________________________
struct block {
    struct block* next;
};

struct slab {
    struct block* local;       /* blocks freed locally by the owner thread     */
    struct block* remote;      /* blocks cached here from other threads' slabs */
    struct thread* owner;
    size_t blksize;
    size_t remote_count;
    unsigned remote_max;       /* caching threshold, set by mcache()           */
    [...];
};

struct thread {
    struct slab* slabs[16];
};

void free(void* mem) {
    struct block* blk = mem;
    struct slab* slab = get_slab_from_block(blk);
    struct thread* thread = pthread_getspecific(...);
    if (slab->owner == thread) {
        /* local free */
        blk->next = slab->local;
        slab->local = blk;
    } else {
        /* remote free: cache the block in this thread's slab of the same size,
           or hand it back to the owner slab if the cache is full */
        struct slab* lslab = get_slab_from_size(thread, slab->blksize);
        if (lslab->remote_count < lslab->remote_max) {
            blk->next = lslab->remote;
            lslab->remote = blk;
            ++lslab->remote_count;
        } else {
            remote_free_to_owner_slab(slab, blk);
        }
    }
}

void* malloc(size_t size) {
    struct thread* thread = pthread_getspecific(...);
    struct slab* slab = get_slab_from_size(thread, size);
    struct block* blk = slab->remote;    /* check the cached remote list first */
    if (blk) {
        slab->remote = blk->next;
        --slab->remote_count;
    } else {
        blk = slab->local;
        if (! blk) {
            blk = reclaim_remote_blocks_from_slab(slab);
        } else {
            slab->local = blk->next;
        }
    }
    return blk;
}

void sys_slab_flush(struct slab* slab) {
    /* hand every cached block back to the slab it came from */
    struct block* blk = slab->remote;
    while (blk) {
        struct block* const next = blk->next;
        struct slab* bslab = get_slab_from_block(blk);
        remote_free_to_owner_slab(bslab, blk);
        blk = next;
    }
    slab->remote = NULL;
    slab->remote_count = 0;
}

void mflush(void) {
    unsigned i;
    struct thread* thread = pthread_getspecific(...);
    for (i = 0; i < 16; ++i) {
        struct slab* slab = thread->slabs[i];
        if (slab) {
            sys_slab_flush(slab);
        }
    }
}

void mcache(size_t size, unsigned max_count) {
    struct thread* thread = pthread_getspecific(...);
    struct slab* slab = get_slab_from_size(thread, size);
    slab->remote_max = max_count;
    if (slab->remote_count > max_count) {
        sys_slab_flush(slab);
    }
}
______________________________________________________________________
Applications can enable caching mode on a per-thread-slab basis. They can set it up for a slab which contains a specific size. For instance, an application can say it wants caching enabled in consumer threads A-D for allocations of size sizeof(void*) * 2. There is a caveat... If a thread enables caching for ANY of its slabs, it should call `mflush()' every now and then.
Allocations always check the remote list of the slab first...
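
(As a usage illustration of the sketch above - `wait_for_message()' and `process()' are hypothetical:)
______________________________________________________________________
void* wait_for_message(void);   /* hypothetical queue pop */
void  process(void* msg);       /* hypothetical work      */

/* a consumer thread that mostly frees blocks owned by producer threads */
void consumer_thread(void) {
    unsigned n = 0;
    mcache(sizeof(void*) * 2, 1024);  /* cache up to 1024 remote frees of this size */
    for (;;) {
        void* msg = wait_for_message();
        process(msg);
        free(msg);                    /* remote free: cached instead of returned    */
        if ((++n % 4096) == 0)
            mflush();                 /* hand cached blocks back in batches         */
    }
}
______________________________________________________________________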
Dmitry_Vyukov
Valued Contributor I
Quoting - lockfree
The caching mode in vZOOM does something "kind of" like; let me quickly sketch this out here:

Applications can enable caching mode on a per-thread-slab basis. They can set it up for a slab which contains a specific size. For instance, an application can say it wants caching enabled in consumer threads A-D for allocations of size sizeof(void*) * 2. There is a caveat... If a thread enables caching for ANY of its slabs, it should call `mflush()' every now and then.
Allocations always check the remote list of the slab first...

I see. It's interesting. Thank you.

I was thinking that you were able to return a whole bunch of remote blocks to the owner at once (with one atomic RMW). So you are only able to reuse cached nodes, or to return them to the owner one by one. It's interesting how you sort cached blocks by size, so a thread is able to quickly find a cached block of the needed size. Even such caching can have a great positive impact in some situations, for example when 2 threads exchange messages with roughly equal frequency.

But what bothers me is the following. Assume there is an 'asymmetric' message exchange, i.e. only one thread sends messages to the other. Such caching won't help here; it will only make things worse, because there will be additional bookkeeping, and some blocks will already be cold in cache.

I think it's possible to create a general-purpose caching scheme in which it will be possible to return a bunch of memory blocks to the owner with one atomic RMW. It will have N^2 space complexity.

Dmitry_Vyukov
Valued Contributor I
Quoting - Dmitriy V'jukov

I think it's possible to create a general-purpose caching scheme in which it will be possible to return a bunch of memory blocks to the owner with one atomic RMW. It will have N^2 space complexity.

Something like:

Assign to every thread a dense index from 0 to N (if a thread dies, its number will be reused by the next created thread). Create in TLS an array of remote-thread descriptors. This array is accessed only by its owner, so it's easy to implement growing of the array: if a thread wants to fetch the descriptor of thread N+1 but the current array size is N, then the thread just allocates a new, bigger array, copies the information, and replaces the pointer to the array.

The rest is a piece of cake. A thread caches remote blocks in the corresponding descriptor in its TLS array.

Here is a little problem: when a thread wants to allocate a block, it has to check ALL remote-thread descriptors. To solve this, 2 lists of blocks can be maintained: one per-remote-thread list (needed when we want to return a bunch of blocks to the owner), and one 'global' (i.e. per-thread) list which combines all non-empty per-remote-thread lists (needed when the thread wants to allocate a block from the cache).

Although some details still need to be worked out...

What do you think?

Dmitry_Vyukov
Valued Contributor I
Quoting - Dmitriy V'jukov

Something like:

Assign to every thread a dense index from 0 to N (if a thread dies, its number will be reused by the next created thread). Create in TLS an array of remote-thread descriptors. This array is accessed only by its owner, so it's easy to implement growing of the array: if a thread wants to fetch the descriptor of thread N+1 but the current array size is N, then the thread just allocates a new, bigger array, copies the information, and replaces the pointer to the array.

The rest is a piece of cake. A thread caches remote blocks in the corresponding descriptor in its TLS array.

Here is a little problem: when a thread wants to allocate a block, it has to check ALL remote-thread descriptors. To solve this, 2 lists of blocks can be maintained: one per-remote-thread list (needed when we want to return a bunch of blocks to the owner), and one 'global' (i.e. per-thread) list which combines all non-empty per-remote-thread lists (needed when the thread wants to allocate a block from the cache).

Although some details still need to be worked out...

What do you think?

Maybe it's already not worth it... I'm not sure... What do you think?

On the other hand, if it can be implemented efficiently (especially w/o additional overhead for other usage patterns), then it's always nice to provide efficient support for one more usage pattern...

Alexey-Kukanov
Employee
Quoting - Dmitriy V'jukov
Btw, why is it impossible to create per-thread remote freelists instead of per-slab freelists? Because you don't want a thread to go through a bunch of different block headers in order to sort them out into different slabs? Right?

It isn't impossible, but I think the current design is at least no worse, and might be faster. One of the reasons is exactly as you mentioned - I don't want to go through all the different headers for different allocation sizes. The other one is that a single remote free list, even a non-blocking one, is a bottleneck again; per-slab lists should have less contention on update. Also, the overhead of reclaiming remotely freed objects is distributed more over time than with a single public list, yet it is on demand only. Though it might be interesting to test other schemes.

I also agree that some caching scheme is worth benchmarking at least, and agree that it should return a bunch of memory objects at once. Also, in light of the probable widespread adoption of NUMA systems, I would avoid reusing remote memory - so returning a bunch becomes the only reason to cache.

Chris_M__Thomasson
New Contributor I

It isn't impossible, but I think the current design is at least no worse, and might be faster. One of the reasons is exactly as you mentioned - I don't want to go through all the different headers for different allocation sizes. The other one is that a single remote free list, even a non-blocking one, is a bottleneck again; per-slab lists should have less contention on update. Also, the overhead of reclaiming remotely freed objects is distributed more over time than with a single public list, yet it is on demand only. Though it might be interesting to test other schemes.

I also agree that some caching scheme is worth benchmarking at least, and agree that it should return a bunch of memory objects at once. Also, in light of the probable widespread adoption of NUMA systems, I would avoid reusing remote memory - so returning a bunch becomes the only reason to cache.

I suspect that you already read through my quickly sketched caching algorithm; this was for SMP on an intra-NUMA-node basis. Anyway, why would you refrain from reusing remote blocks on NUMA? See, what if the memory were being reused on the same node? Let's say that there were 6 processors per node... If memory from P1 on NodeA was passed to P4 on NodeA, why wouldn't P4 be allowed to reuse the memory when the request came from a thread residing on NodeA? IMHO, caching algorithms can be highly NUMA friendly indeed.
Passing memory on an inter-node basis is a VERY expensive operation as far as I am concerned. So, I agree with you that caching on remote NUMA nodes is BADDDDD. However, you did not specify that moment, so I can only guess here...
vZOOM on NUMA takes advantage of the NUMA APIs of the platform's underlying OS. It takes REALLY remote deallocations into account indeed! When I write "remote deallocation" I am referring to inter-thread deallocations residing on the same NUMA node. When I write "REALLY remote deallocation", well, I am talking about a retarded program which frees memory from CPU1-NodeA to CPU1-NodeB; NOT GOOD!
I have not worked on NUMA for a while now, but I DO know all of the caveats. When I finally decide to purchase my NEAT SiCortex box (should be in the next few months), well, I will be able to play when it's snowing outside! I live in South Lake Tahoe. We had the first snow of the year two days ago. It was 13 degrees last night. Well, I need something to do!!!!!
;^/

Chris_M__Thomasson
New Contributor I
Actually, during the `mflush()' function, it correlates and organizes thread ids into cohorts and can free multiple blocks to remote slabs in a single CAS. However, I did not show this moment in the sketched pseudo-code. I defer the hashing and organizing from per-free calls to amortized per-mflush calls. See, I expect `mfree()' to be called rather frequently. However, I expect `mflush()' calls to happen only every now and then...
I don't see a problem with deferring the sorting until an explicit call to `mflush()' occurs.
;^D
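
(For illustration, this is roughly what returning a whole batch with one CAS looks like. The `shared_head' stack on the owner slab and the `CAS' primitive - returning nonzero on success - are assumptions in the style of the sketches above, not vZOOM code:)

struct block { struct block* next; };
struct slab  { struct block* volatile shared_head; /* owner-side lock-free stack */ };

/* push an already-linked chain [head .. tail] of blocks, all owned by the
   same remote slab, onto that slab's shared stack with a single CAS */
void remote_free_batch_to_owner_slab(struct slab* owner,
                                     struct block* head,
                                     struct block* tail) {
    struct block* cmp;
    do {
        cmp = owner->shared_head;  /* current top of the owner's shared stack */
        tail->next = cmp;          /* splice the whole batch in front of it   */
    } while (! CAS(&owner->shared_head, cmp, head));
}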
Alexey-Kukanov
Employee
Quoting - lockfree
I suspect that you already read through my quickly sketched caching algorithm; this was for SMP on an intra-NUMA-node basis. Anyway, why would you refrain from reusing remote blocks on NUMA? See, what if the memory were being reused on the same node? Let's say that there were 6 processors per node... If memory from P1 on NodeA was passed to P4 on NodeA, why wouldn't P4 be allowed to reuse the memory when the request came from a thread residing on NodeA? IMHO, caching algorithms can be highly NUMA friendly indeed.
Passing memory on an inter-node basis is a VERY expensive operation as far as I am concerned. So, I agree with you that caching on remote NUMA nodes is BADDDDD. However, you did not specify that moment, so I can only guess here...

Well, yes, of course "really remote" memory reuse is very bad, while intra-node (actually, it's rather intra-socket nowadays) reuse might be fine. But it adds to the bookkeeping complexity and reduces portability. Being NUMA-oblivious at the object level is simpler. At the level of VM mapping and slab reuse, however, it makes more sense to me to play with non-portable NUMA APIs - as the VM mapping is non-portable anyway.

Quoting - lockfree
... well, I will be able to play when it's snowing outside! I live in South Lake Tahoe. We had the first snow of the year two days ago. It was 13 degrees last night. Well, I need something to do!!!!!
;^/

In my region of Russia, we saw the first snow a couple of weeks ago, much earlier this year than usual. It did not stay, of course. But when it does, I'd prefer to go cross-country skiing in my spare time :)

Chris_M__Thomasson
New Contributor I
Quoting - Dmitriy V'jukov
Maybe it's already not worth it... I'm not sure... What do you think?

On the other hand, if it can be implemented efficiently (especially w/o additional overhead for other usage patterns), then it's always nice to provide efficient support for one more usage pattern...

Humm... I think I understand what you're getting at. I am not sure if it's worth it or not. It's definitely extra bookkeeping. Humm... I really need to stop and think about this scheme. At the moment, I am not really sure how much better it is than the current scheme vZOOM uses. I apologize for not putting the sorting and coalescing of blocks from the same slab into the `mflush()' function. This sorting does have overhead indeed; however, I assume that calls to `mflush()' will be few and far between.