Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Scalable hash map algorithm

Dmitry_Vyukov
Valued Contributor I
I've attached an implementation of my new concurrent hash map algorithm. The implementation is NOT PRODUCTION READY; it's just an illustration of the algorithm. It compiles only with MSVC/Win32, but the algorithm itself is fairly portable: it requires only single-word atomic RMW operations and very little OS-specific machinery.

Here are benchmark results on Q6600.
Element count = 256, readonly workload, 8 threads.
TBB concurrent_hash_map:
Single core: 338 cycles/find
Cores 0&1: 218 cycles/find (scaling 1.55)
Cores 0&2: 540 cycles/find (scaling 0.63)
All cores: 353 cycles/find (scaling 0.96)

My hash map:
Single core: 30 cycles/find
Cores 0&1: 15 cycles/find (scaling 2)
Cores 0&2: 15 cycles/find (scaling 2)
All cores: 7 cycles/find (scaling 4)

Element count = 65536, readonly workload, 8 threads.
TBB concurrent_hash_map:
Single core: 383 cycles/find
Cores 0&1: 212 cycles/find (scaling 1.80)
Cores 0&2: 584 cycles/find (scaling 0.66)
All cores: 363 cycles/find (scaling 1.06)

My hash map:
Single core: 42 cycles/find
Cores 0&1: 21 cycles/find (scaling 2)
Cores 0&2: 21 cycles/find (scaling 2)
All cores: 10 cycles/find (scaling 4)

Element count = 256, mixed workload (4 threads make 100% finds; 4 threads make 50% finds, 25% inserts, 25% removes)
TBB concurrent_hash_map:
Single core: 348 cycles/find
Cores 0&1: 212 cycles/find (scaling 1.64)
Cores 0&2: 477 cycles/find (scaling 0.73)
All cores: 304 cycles/find (scaling 1.14)

My hash map:
Single core: 51 cycles/find
Cores 0&1: 27 cycles/find (scaling 1.89)
Cores 0&2: 31 cycles/find (scaling 1.65)
All cores: 15 cycles/find (scaling 3.4)

Element count = 65536, mixed workload (4 threads make 100% finds; 4 threads make 50% finds, 25% inserts, 25% removes)
TBB concurrent_hash_map:
Single core: 377 cycles/find
Cores 0&1: 206 cycles/find (scaling 1.83)
Cores 0&2: 516 cycles/find (scaling 0.73)
All cores: 322 cycles/find (scaling 1.17)

My hash map:
Single core: 74 cycles/find
Cores 0&1: 35 cycles/find (scaling 2.12)
Cores 0&2: 43 cycles/find (scaling 1.72)
All cores: 22 cycles/find (scaling 3.36)

On a read-mostly workload this algorithm is up to 50x faster than TBB's hash map; on a mixed workload it is up to 30x faster. And scaling is substantially better, so on an 8-core system the speedup can be 100x and higher.

Memory consumption is 25% lower compared with the TBB hash map.

A few notes on the algorithm.
First of all, it uses amortized proxy-collector memory reclamation (the implementation is included in the archive). This allows read-only transactions to be purely read-only, i.e. to make no modifications to shared state.
Read-only access (the find operation) uses a kind of sequence lock (the so-called SeqLock) to ensure consistency. But I made some modifications, so that read access is obstruction-free, as opposed to the classical SeqLock, where a reader is effectively blocked while it conflicts with a writer.
Mutators (insert/remove) use fine-grained locking with respect to other mutators, and atomic modifications with respect to readers.
The data layout is cache-conscious, so that most operations rarely touch (read or write) more than one cache line.
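To make the read path concrete, here is a minimal sketch of such a SeqLock-style find using C++11 atomics. This is illustrative only, not the attached implementation; the cell shape, the 3-element capacity, and all names are assumptions:

#include <atomic>
#include <cstdint>

// One cell with a version counter: even = stable, odd = writer active.
struct cell_t
{
    std::atomic<uint32_t> ver;
    std::atomic<uint32_t> key[3];  // relaxed atomics keep the sketch race-free
    std::atomic<uint32_t> val[3];
};

// SeqLock-style read: take a consistent snapshot or retry. The classical
// SeqLock spins while a writer is active; the modified algorithm described
// above adds an obstruction-free fallback instead of waiting.
bool find(cell_t& c, uint32_t key, uint32_t* out)
{
    for (;;)
    {
        uint32_t v0 = c.ver.load(std::memory_order_acquire);
        if (v0 & 1)
            continue;  // writer in progress: the classical SeqLock waits here
        bool found = false;
        uint32_t v = 0;
        for (int i = 0; i != 3; ++i)
        {
            if (c.key[i].load(std::memory_order_relaxed) == key)
            {
                v = c.val[i].load(std::memory_order_relaxed);
                found = true;
            }
        }
        std::atomic_thread_fence(std::memory_order_acquire);
        if (c.ver.load(std::memory_order_relaxed) == v0)
        {
            if (found)
                *out = v;
            return found;  // snapshot was consistent
        }
        // version changed underneath us: retry
    }
}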

Dmitry_Vyukov
Valued Contributor I
I would like to hear Arch Robison's opinion on this.
The TBB roadmap:
http://softwareblogs.intel.com/2007/10/12/threading-building-blocks-product-roadmap/
includes this item:
"Container class additions and improvements
The TBB development team is working on new container classes, and interface improvements and enhanced performance for existing classes."

There are synchronization algorithms that are up to 100x faster on modern "moderate multi-core" hardware than traditional lock-based ones, and they have near-linear scaling, which makes them more appropriate for the "many-core future". In my opinion this is the way to bridle many-core. What do you think?


ARCH_R_Intel
Employee

References to the literature on the 100x faster synchronization algorithms would be appreciated.

Your concurrent_hash_map does appear to be aggressive about performance. I like the attention to layout, particularly the packing of a cell and handling of extra items that do not fit in a cell.

I have some concerns about deadlock, because methods like find_and_lock return while holding a lock of granularity larger than an individual record. For example, if a thread tries to acquire locks on two separate keys (say, in lexicographic order), it risks deadlock if the keys are in the same cell. Also, method grow appears to lock all the cells before proceeding. So if a thread has a lock on an item and tries to insert another key that causes grow() to be called, does deadlock ensue?
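To illustrate the hazard with a stand-in sketch (std::mutex stands in for the cell lock; cell_of() is a hypothetical key-to-cell mapping, not the actual code):

#include <mutex>

// Two distinct keys can map to the same cell, so "lock key1, then lock key2"
// can try to re-acquire a lock the thread already holds.
std::mutex cells[4];
std::mutex& cell_of(int key) { return cells[key % 4]; }

int main()
{
    int key1 = 1, key2 = 5;  // 1 % 4 == 5 % 4: same cell
    std::lock_guard<std::mutex> a(cell_of(key1));
    // The next line would self-deadlock on a non-recursive mutex:
    // std::lock_guard<std::mutex> b(cell_of(key2));
}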

Given the speed improvements, maybe an alternative usage model is called for, such as "do not lock more than one cell at once". An alternative interface, driven by C++0x lambdas, might be to replace the find_and_lock/update_and_unlock methods with a method "find_and_update(key,lambdaExpression)" where lambdaExpression does the update.

Dmitry_Vyukov
Valued Contributor I
MADadrobiso:

References to the literature on the 100x faster synchronization algorithms would be appreciated.

I was referring to this algorithm running on an eight-core system :)

If this algorithm still scales linearly and the TBB algorithm continues to degrade, the difference will be >100x on an eight-core system. Unfortunately I don't have access to such a machine to test on. Also, if we are talking about 2 processors with 4 cores each, communication between processor packages must be even costlier than communication between cores in the same package, so an algorithm that issues heavy cache-coherence traffic will degrade sharply there.

And note that such systems are already present in the server segment.


Dmitry_Vyukov
Valued Contributor I
MADadrobiso:

Your concurrent_hash_map does appear to be aggressive about performance.

I think that's not bad for a low-level basic building block in a library which is all about performance and scalability :)


MADadrobiso:

I like the attention to layout, particularly the packing of a cell and handling of extra items that do not fit in a cell.




A cell is designed to fully occupy one cache line. Because the whole block is properly aligned, a cell is guaranteed to start at a cache-line boundary. So if a thread executes a read-only operation, it has to fetch at most one cache line. If a thread executes a write operation, it gets the cache line in Modified state after executing the CAS that locks the cell; no further cache-line transfers or state changes are required. Btw, I am currently investigating the possibility of locking the cell with XCHG (instead of CMPXCHG), which can improve scalability on write workloads to some degree; a sketch of the two options is below.
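For illustration, the two locking options as bare spinlocks in C++11 atomics (an assumption for clarity; the real cell word also carries version information):

#include <atomic>

std::atomic<unsigned> cell_lock(0);

void lock_with_cas()   // CMPXCHG: the RMW can fail and must retry
{
    unsigned expected = 0;
    while (!cell_lock.compare_exchange_weak(expected, 1,
                                            std::memory_order_acquire))
        expected = 0;
}

void lock_with_xchg()  // XCHG: one unconditional RMW per attempt
{
    while (cell_lock.exchange(1, std::memory_order_acquire) != 0)
        ;
}

void unlock_cell()
{
    cell_lock.store(0, std::memory_order_release);
}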
Keys and values are stored separately within a cell, so as not to waste memory on alignment if, for example, the key is 8 bytes (int64) and the value is 4 bytes (a pointer to some structure).
All structures that reside in arrays are padded to occupy 2^N bytes, so that indexing into an array needs no division.
Thanks to the careful layout and the absence of dynamic memory allocation, the data structure is very memory efficient. On some workloads I have seen it use 25% less memory than the TBB hash map.
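A hypothetical sketch of such a cell layout; the field types and element counts are illustrative assumptions, not the attached code:

#include <cstdint>

// One cell packed into a single 64-byte cache line; keys and values live in
// separate arrays so no memory is wasted on per-pair alignment padding.
struct alignas(64) cell_t
{
    uint32_t lock_and_ver;  // lock bit + version, modified with CAS/XCHG
    uint32_t overflow;      // reference to extra items that did not fit
    uint64_t key[3];        // e.g. int64 keys
    uint32_t val[3];        // e.g. 32-bit values
};
// The struct is padded up to its 64-byte (2^6) alignment, so indexing an
// array of cells is a shift, not a division:
static_assert(sizeof(cell_t) == 64, "cell must occupy exactly one cache line");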


Dmitry_Vyukov
Valued Contributor I
MADadrobiso:


I have some concerns about deadlock, because methods like find_and_lock return while holding a lock of granularity larger than an individual record. For example, if a thread tries to acquire locks on two separate keys (say, in lexicographic order), it risks deadlock if the keys are in the same cell. Also, method grow appears to lock all the cells before proceeding. So if a thread has a lock on an item and tries to insert another key that causes grow() to be called, does deadlock ensue?

Given the speed improvements, maybe an alternative usage model is called for, such as "do not lock more than one cell at once". An alternative interface, driven by C++0x lambdas, might be to replace the find_and_lock/update_and_unlock methods with a method "find_and_update(key,lambdaExpression)" where lambdaExpression does the update.



You are 100% right here. Deadlocks are currently possible. I just haven't had time to investigate and document/resolve all the problems...

As you noticed, two approaches are possible: some deadlock possibilities can be eliminated, and some usage patterns that can lead to deadlock can simply be prohibited (maybe with some asserts in the debug version).

For example, consider the possible deadlock when a thread tries to lock 2 items. The hash map can provide a method that accepts an array of keys:

void find_and_lock_several_items(item_array_t* a, size_t count);

where

struct item_array_t
{
    key_t key;
    value_t placeholder_for_return_value;
    bool is_found_and_locked;
};

Alternatively, the list of all locked locations can be stored in TLS, or in the handle_t object, so that find_and_lock (and all other functions) can check for recursion.
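A sketch of such a debug-mode recursion check; the helper names are hypothetical, C++11 thread_local:

#include <cassert>
#include <vector>

// Per-thread list of cells this thread currently holds locked, so that the
// lock/unlock paths can assert in debug builds instead of silently deadlocking.
thread_local std::vector<const void*> locked_cells;

void debug_on_lock(const void* cell)
{
    for (const void* c : locked_cells)
        assert(c != cell && "re-locking a cell this thread already holds");
    locked_cells.push_back(cell);
}

void debug_on_unlock(const void* cell)
{
    assert(!locked_cells.empty() && locked_cells.back() == cell
           && "unlock order must mirror lock order in this simple sketch");
    locked_cells.pop_back();
}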


Lambdas are a good option too, and they can be emulated with "functors" in current C++.
For example, a functor can have the following signature:
bool update_item(key_in_t k, value_in_t old_value, value_out_t new_value); // the return value says whether to update or not


So basically, one has to decide what *interface* and what *semantics* one wants in the first place. And then, I believe, all the problems can be solved.
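For illustration, a sketch of what such a functor-driven method could look like. The mutex-protected body is just a stand-in for the real fine-grained cell locking, and the class name is hypothetical; the point is the interface, under which the caller never holds an item lock across user code:

#include <mutex>
#include <unordered_map>

template<typename Key, typename Value>
class updating_map
{
    std::mutex mtx_;                        // stand-in for per-cell locks
    std::unordered_map<Key, Value> data_;
public:
    // Functor signature: bool(Key const&, Value const& old_v, Value& new_v);
    // the return value decides whether the update is applied.
    template<typename F>
    bool find_and_update(Key const& key, F update)
    {
        std::lock_guard<std::mutex> guard(mtx_);  // user code runs under the lock
        auto it = data_.find(key);
        if (it == data_.end())
            return false;
        Value next = it->second;
        if (update(key, it->second, next))
            it->second = next;
        return true;
    }
};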

Btw, the explicit proxy-collector calls ('th.quescent()') can be merged into TBB's accessors, i.e. in its constructor an accessor would acquire a reference to the proxy, and in its destructor it would release that reference. This way there would be no user-visible calls to the proxy-collector API. BUT here is a caveat: the user is then disallowed from blocking a thread for a long time while holding any references to proxies, AND must make an explicit call to the proxy-collector API before blocking a thread for a long time. Otherwise, system-wide memory reclamation will be blocked.
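A toy sketch of that accessor integration; a single global reference count stands in for the real amortized per-thread proxy-collector, and all names are assumptions:

#include <atomic>

struct proxy_t { std::atomic<int> refs{0}; };
static proxy_t g_proxy;

class const_accessor
{
    proxy_t* proxy_;
public:
    const_accessor() : proxy_(&g_proxy)
    {
        proxy_->refs.fetch_add(1, std::memory_order_acquire);  // pin reclamation
    }
    ~const_accessor()
    {
        proxy_->refs.fetch_sub(1, std::memory_order_release);  // unpin
    }
    const_accessor(const const_accessor&) = delete;
    const_accessor& operator=(const const_accessor&) = delete;
};
// While any accessor is alive (refs != 0), deferred deletions must wait; this
// is why a thread must not block for a long time with an accessor in scope.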


Dmitry_Vyukov
Valued Contributor I
Some people are saying that mutexes are the way to automatically resolve contention, and that other synchronization techniques (usually they call them all 'lock-free') are the way to maximize contention.

Some people are saying that 'wait-free programming' (yes, sometimes they call it that) nonetheless only incurs additional overheads.

Some people are saying that I am hereby prohibited to be that much faster.


JUST download the code and RUN some benchmarks!

RafSchietekat
Valued Contributor III
"Some people are saying that I am hereby prohibited to be that much faster."

Not "prohibited", "forbidden". But I hope the irony didn't get lost in translation (it was a joke)?

So how much faster is this than my hash map contribution (the last version)?

Dmitry_Vyukov
Valued Contributor I
Raf_Schietekat:

"Some people are saying that I am hereby prohibited to be that much faster."
Not "prohibited", "forbidden". But I hope the irony didn't get lost in translation (it was a joke)?


Up to this moment I was not sure. So it was a joke. Ok. No problem :)

Raf_Schietekat:

So how much faster is this than my hash map contribution (the last version)?


I was testing against the latest stable release (2.1). If you know how much faster your version is than the latest stable release, then we can calculate the unknown quantity.
I briefly skimmed through the code (your contribution); well, it still uses a mutex for read access, so it modifies shared state, and thus issues cache-coherence traffic...


RafSchietekat
Valued Contributor III
"If you know how much your version is faster than the last stable release, then we can calculate the unknown quantity." Well, I was hoping to do that calculation the other way around, but it would seem to be irrelevant now, anyway...
Dmitry_Vyukov
Valued Contributor I
randomizer:
MADadrobiso:

References to the literature on the 100x faster synchronization algorithms would be appreciated.

I was referring to this algorithm running on an eight-core system :)



Oh! Did you mean references to some academic articles, and not references to hardcore brain-damaging code?


Anton_M_Intel
Employee
Dmitriy, do you know the Hopscotch Hashing algorithm (http://groups.google.com/group/hopscotch-hashing), a high-performance cache-aware hash map algorithm?
How does yours differ from it?
Dmitry_Vyukov
Valued Contributor I
MADamalakho:
Dmitriy, do you know the Hopscotch Hashing algorithm (http://groups.google.com/group/hopscotch-hashing), a high-performance cache-aware hash map algorithm?
How does yours differ from it?


No, I was not aware of that algorithm. It looks very interesting. Thank you for the link.

I briefly skimmed through the code. Well, it looks similar to my algorithm to some degree: they also use a kind of SeqLock to protect readers.
But they "forget" about memory reclamation, i.e. their map can't grow. My map can grow, thanks to the proxy-collector.
Further, you can't store pointers in their hash map. Assume your value type is my_entity_t*. You insert an object into the map, then remove it from the map. You are not allowed to delete it! It must live forever now, because there can be concurrent readers still reading the object.

I have to read more, and run some benchmarks against their implementation.


Dmitry_Vyukov
Valued Contributor I
MADamalakho:
Dmitriy, do you know the Hopscotch Hashing algorithm (http://groups.google.com/group/hopscotch-hashing), a high-performance cache-aware hash map algorithm?
How does yours differ from it?


Ok, I've read the paper and the source code.
As for the 'synchronization part' of the algorithm: their implementation is very close to mine. Read operations use timestamping, i.e. they don't modify shared data, but can be retried several times. Write operations use fine-grained locking, but their implementation's locking is substantially coarser-grained, because their algorithm requires bigger buckets. My algorithm uses a lock per cache line, or even per half cache line, i.e. 3 or 4 elements at maximum.
But they omit memory reclamation, and as I understand it, that's intentional. They say something like 'we use open hashing because it doesn't require GC, and GC is additional overhead in a multi-threaded environment'. So basically you are not allowed to store pointers in the table; you can store only embedded data.
But then they write 'in benchmarks each bucket encompasses pointers to data and value (satellite key and data). This scheme is thus a general hash-map'. Well, it seems that they just preallocate all keys and data, and don't delete any objects while benchmarking. In my opinion this can't be called a 'general hash-map'... at least in C/C++... unless I am missing something.
As for the hashing algorithm itself, it differs substantially. The only similarity is that both algorithms try to optimize cache-line usage. It's difficult to say which is better until we've done exhaustive testing in different situations.

I am still going to organize a little shootout between them, at least on a synthetic benchmark. It will be interesting. I think that under favorable conditions my algorithm will be faster. Under less favorable conditions... it's difficult to say ahead of time; the hashing algorithms differ substantially.


Dmitry_Vyukov
Valued Contributor I
MADadrobiso:

References to the literature on the 100x faster synchronization algorithms would be appreciated.

Your concurrent_hash_map does appear to be aggressive about performance. I like the attention to layout, particularly the packing of a cell and handling of extra items that do not fit in a cell.




What do you think about incorporating such techniques into TBB?


ARCH_R_Intel
Employee

randomizer:


Oh! Did you mean references to some academic articles, and not references to hardcore brain-damaging code?

Yes, I meant academic or trade journal articles, not code. Maybe you should write one. You do seem to have the knack for exploiting x86 memory ordering rules for all they can yield.

ARCH_R_Intel
Employee

randomizer:


What do you think about incorporating such techniques into TBB?

I think eventually we'll try to incorporate such elaborate packing. Right now, the key issue with TBB hash tables is figuring out the concurrency semantics for the improved hash table. We're looking into trying to parameterize the locking mechanism so users can define their own accessor model, and thus not be tied down to the reader-writer model we have in tbb::concurrent_hash_map.

One of the sticking points is whether non-blocking erasure should be allowed; i.e., erase operations that return before the item is actually deleted. There seem to be pros and cons to this, depending on the use case.

It is perplexing that there are so many ways to implement concurrent hash tables, and each has various merits. The differences seem much more severe than for sequential hash tables. I'm starting to wonder if a "one size fits most" concurrent table is unrealistic.

Dmitry_Vyukov
Valued Contributor I
MADadrobiso:


I think eventually we'll try to incorporate such elaborate packing. Right now, the key issue with TBB hash tables is figuring out the concurrency semantics for the improved hash table. We're looking into trying to parameterize the locking mechanism so users can define their own accessor model, and thus not be tied down to the reader-writer model we have in tbb::concurrent_hash_map.



Can you elaborate a bit more, please? What is an 'own accessor model'? What might it look like?


MADadrobiso:


One of the sticking points is whether non-blocking erasure should be allowed; i.e., erase operations that return before the item is actually deleted. There seem to be pros and cons to this, depending on the use case.



My personal opinion is that Yes, non-blocking erasure should be allowed.

But the behavior should (1) provide sequential self-consistency, i.e. if a thread removes an item, then subsequent operations of *that* thread must not see the item in the table; and (2) respect causal relations between threads, i.e. thread 1 removes an item from the table, then sends a message to thread 2; thread 2 receives the message, then searches for that item - thread 2 must not see the item in the table.
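To make guarantee (2) concrete, here is an executable sketch; a mutex-protected std::set stands in for the concurrent map, since only the ordering matters:

#include <atomic>
#include <cassert>
#include <mutex>
#include <set>
#include <thread>

std::mutex m;
std::set<int> table = { 42 };
std::atomic<bool> message(false);

bool find_item(int k)  { std::lock_guard<std::mutex> g(m); return table.count(k) != 0; }
void erase_item(int k) { std::lock_guard<std::mutex> g(m); table.erase(k); }

int main()
{
    std::thread t1([]{
        erase_item(42);                                  // remove the item,
        message.store(true, std::memory_order_release);  // then send a message
    });
    std::thread t2([]{
        while (!message.load(std::memory_order_acquire)) {}  // receive it,
        assert(!find_item(42));  // then the removed item must not be visible
    });
    t1.join();
    t2.join();
}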

As for all other situations, I don't see how they can end up being non-intuitive. What do you mean by 'cons' here?

My reasoning is that no total order is defined across all operations of all threads, so the term 'before' is not defined across operations, and reasoning based on a wall clock makes no sense in a multi-threaded environment. Only causal relations make sense.

This is how my hash map works. When a thread removes an item, it only marks the item as removed; and if the key or value is a pointer, the thread also queues the actual key/data for deletion with the pcx_thread::defer() function. So all other threads can still access the item for as long as they need. Exactly this is the underlying reason for the high performance: *asynchronous* processing, i.e. each thread operates at its own speed.

And this is exactly how hardware memory models work, i.e. there is no enforcement based on a wall clock; only causal relations are respected.
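A toy illustration of that deferred deletion: here drain() simulates the moment when the collector has proven that no reader can still reach the object, whereas the real pcx_thread::defer() is per-thread and amortized:

#include <functional>
#include <vector>

struct my_entity_t { int payload; };

// Stand-in for pcx_thread::defer(): erase only unlinks the item; the actual
// delete runs later, once no reader can still hold a reference to it.
struct deferred_list
{
    std::vector<std::function<void()>> pending;
    void defer(std::function<void()> f) { pending.push_back(f); }
    void drain()  // called when all proxy references are released
    {
        for (auto& f : pending) f();
        pending.clear();
    }
};

int main()
{
    deferred_list pcx;
    my_entity_t* e = new my_entity_t;
    // ... e is in the map; concurrent readers may hold references to it ...
    // erase: unlink e from the table, then defer the actual deletion
    pcx.defer([e]{ delete e; });
    // ... later, when the collector knows no reader can reach e ...
    pcx.drain();
}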

Anton_M_Intel
Employee
randomizer:
MADadrobiso:

I think eventually we'll try to incorporate such elaborate packing. Right now, the key issue with TBB hash tables is figuring out the concurrency semantics for the improved hash table. We're looking into trying to parameterize the locking mechanism so users can define their own accessor model, and thus not be tied down to the reader-writer model we have in tbb::concurrent_hash_map.

Can you elaborate a bit more, please? What is an 'own accessor model'? What might it look like?

The proposal is to separate out the accessors from the container and make them flexible enough to enable several modes of data access and lifetime management:
  • "classical" RW-protected access
  • direct access for atomic or const values
  • deferred item deletion, similar to reference counting
  • support for garbage-collected items
  • support for making erase() methods blocking
My prototype shows that this is feasible.
I have given your implementation only a quick look, and I also find reads with timestamps useful.
But we'll definitely consider it more closely in the next few months while implementing the new hash containers.

randomizer:
MADadrobiso:

One of the sticking points is whether non-blocking erasure should be allowed; i.e., erase operations that return before the item is actually deleted. There seem to be pros and cons to this, depending on the use case.

My personal opinion is that Yes, non-blocking erasure should be allowed.

But the behavior should (1) provide sequential self-consistency, i.e. if a thread removes an item, then subsequent operations of *that* thread must not see the item in the table; and (2) respect causal relations between threads, i.e. thread 1 removes an item from the table, then sends a message to thread 2; thread 2 receives the message, then searches for that item - thread 2 must not see the item in the table.

Thank you for this argumentation. It seems like exactly what we were looking for.
My arguments, that blocking semantics are only a subset of non-blocking ones (i.e. no lock is acquired for the item being deleted) and that there are applications such as web services which don't require blocking, did not have such a convincing effect :)

Alexey-Kukanov
Employee

randomizer:
the behavior should (1) provide sequential self-consistency, i.e. if a thread removes an item, then subsequent operations of *that* thread must not see the item in the table; and (2) respect causal relations between threads, i.e. thread 1 removes an item from the table, then sends a message to thread 2; thread 2 receives the message, then searches for that item - thread 2 must not see the item in the table.

As for all other situations, I don't see how they can end up being non-intuitive. What do you mean by 'cons' here?

This is a good description of the guarantees the container can provide; thanks a lot.

There is at least one questionable point though (the "other" situation, if you like). What if one and the same thread looks up a key in a hash table, finds an element and keeps a reference to it (in the terms of tbb::concurrent_hash_map, a const_accessor; well, it's a bit stronger than just a reference - it's more of a reader lock), and then repeats the search for the same key - should it find exactly the same element? And if not (which is perfectly possible with "non-blocking erasure"), is that behavior sequentially self-consistent?

Dmitry_Vyukov
Valued Contributor I
MADamalakho:

The proposal is to separate out the accessors from the container and make them flexible enough to enable several modes of data access and lifetime management:
  • "classical" RW-protected access
  • direct access for atomic or const values
  • deferred item deletion, similar to reference counting
  • support for garbage-collected items
  • support for making erase() methods blocking
My prototype shows that this is feasible.


Ok, I see.
But I'm not sure it's possible to implement all types of access for all data structures without imposing huge overheads on all of them, i.e. it would destroy the 'pay as you go' principle.
For example, for my hash map it's difficult to detect the exact moment when an item is actually deleted. To implement that feature we would have to impose huge overheads on readers, whether or not the user uses blocking erase().
So basically we end up with only a restricted set of accessors for each particular data structure anyway.
Or am I completely missing something?
