
Scalable hash map algorithm

Dmitry_Vyukov
Valued Contributor I
I've attached an implementation of my new concurrent hash map algorithm. The implementation is NOT PRODUCTION READY; it's just an illustration of the algorithm. It can be compiled only with MSVC/Win32, but the algorithm itself is fairly portable: only single-word atomic RMW operations and very little OS-specific stuff are required.

Here are benchmark results on a Q6600.
Element count = 256, read-only workload, 8 threads.
TBB concurrent_hash_map:
Single core: 338 cycles/find
Cores 0&1: 218 cycles/find (scaling 1.55)
Cores 0&2: 540 cycles/find (scaling 0.63)
All cores: 353 cycles/find (scaling 0.96)

My hash map:
Single core: 30 cycles/find
Cores 0&1: 15 cycles/find (scaling 2)
Cores 0&2: 15 cycles/find (scaling 2)
All cores: 7 cycles/find (scaling 4)

Element count = 65536, read-only workload, 8 threads.
TBB concurrent_hash_map:
Single core: 383 cycles/find
Cores 0&1: 212 cycles/find (scaling 1.80)
Cores 0&2: 584 cycles/find (scaling 0.66)
All cores: 363 cycles/find (scaling 1.06)

My hash map:
Single core: 42 cycles/find
Cores 0&1: 21 cycles/find (scaling 2)
Cores 0&2: 21 cycles/find (scaling 2)
All cores: 10 cycles/find (scaling 4)

Element count = 256, mixed workload (4 threads do 100% finds; 4 threads do 50% finds, 25% inserts, 25% removes).
TBB concurrent_hash_map:
Single core: 348 cycles/find
Cores 0&1: 212 cycles/find (scaling 1.64)
Cores 0&2: 477 cycles/find (scaling 0.73)
All cores: 304 cycles/find (scaling 1.14)

My hash map:
Single core: 51 cycles/find
Cores 0&1: 27 cycles/find (scaling 1.89)
Cores 0&2: 31 cycles/find (scaling 1.65)
All cores: 15 cycles/find (scaling 3.4)

Element count = 65536, mixed workload (4 threads do 100% finds; 4 threads do 50% finds, 25% inserts, 25% removes).
TBB concurrent_hash_map:
Single core: 377 cycles/find
Cores 0&1: 206 cycles/find (scaling 1.83)
Cores 0&2: 516 cycles/find (scaling 0.73)
All cores: 322 cycles/find (scaling 1.17)

My hash map:
Single core: 74 cycles/find
Cores 0&1: 35 cycles/find (scaling 2.12)
Cores 0&2: 43 cycles/find (scaling 1.72)
All cores: 22 cycles/find (scaling 3.36)

On a read-mostly workload this algorithm is up to 50x faster than TBB's hash map; on a mixed workload it is up to 30x faster. And the scaling is substantially better, so on an 8-core system the speedup can be 100x and higher.

Memory consumption is 25% lower compared with the TBB hash map.

A few notes on the algorithm.
First of all, it uses amortized proxy-collector memory reclamation (the implementation is included in the archive). This allows read-only transactions to be purely read-only, i.e. to make no modifications of shared state.
Read-only access (the find operation) uses a kind of sequence mutex (a so-called SeqLock) to ensure consistency. But I made some modifications to the SeqLock so that read access is obstruction-free, as opposed to the classical SeqLock, which is actually blocking when a reader conflicts with a writer.
Mutators (insert/remove) use fine-grained locking with respect to other mutators, and atomic modifications with respect to readers.
The data layout is cache-conscious, so most operations rarely touch (either write or read) more than one cache line.
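
For readers unfamiliar with the pattern, here is a minimal sketch of a classic SeqLock (illustrative only, not taken from the attached code, and written with C++11 atomics for brevity); the spin in read_begin() is exactly the blocking behaviour that the modification above removes:

#include <atomic>

// Minimal classic SeqLock sketch (illustrative only). Writers are assumed
// to be serialized externally, e.g. by the mutators' fine-grained locks.
struct seqlock
{
    std::atomic<unsigned> seq; // odd while a writer is active

    seqlock() : seq(0) {}

    unsigned read_begin() const
    {
        unsigned s;
        while ((s = seq.load(std::memory_order_acquire)) & 1)
            ; // the classic reader blocks here while a writer is active
        return s;
    }

    bool read_validate(unsigned s) const
    {
        // order the protected data reads before the re-check
        std::atomic_thread_fence(std::memory_order_acquire);
        return seq.load(std::memory_order_relaxed) == s;
    }

    void write_begin() { seq.fetch_add(1, std::memory_order_acq_rel); } // -> odd
    void write_end()   { seq.fetch_add(1, std::memory_order_acq_rel); } // -> even
};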

ARCH_R_Intel
Employee

randomizer:

So basically we end up having only some restricted set of accessors for each particular data structure anyway.
Or am I completely missing something?

You are not missing anything. The hard part is deciding how to balance potential generality against imposing costs on everyone. Our notion is that there ought to be a way to templatize the table (or operations on it) with respect to the locking discipline. For example, consider having a "Synchronizer" type as a template parameter to the table, and an "Accessor" type as a template parameter to method "find" or "insert". Then consider what the minimal set of signatures on Synchronizer and Accessor is that allows a table to operate sanely. Here's a crude early stab I took at coming up with signatures. A is an Accessor, S is a Synchronizer, and T is the type of a table item. There is assumed to be an instance of Synchronizer S wrapped around each table item, and A is a template parameter to the method find or insert. E.g., different calls might have a reader-lock type for A, a writer-lock type for A, or no locking at all.

S(const T& x)

Construct copy of x with access protection. [Should add rvalue-reference form for C++0x.]

~S()

Destructor

T& S::unprotected_ref();

Return reference to underlying type.

bool A::try_acquire(S& s);

Attempt to acquire an access right. Returns true if successful; false otherwise.

void S::wait_while_acquired();

Wait until all access rights have been released.

With this much support, I can choose (by choosing how the operations on A and S are implemented) whether to support only reader locking (reference counting), full reader-writer locking, or no locking at all (which is useful if concurrent erase is not to be supported). Anton has some ideas for extending it to handle more cases.

You could see how far you might templatize your hash map and its operations to open up more flexibility (perhaps in a very different direction from the above).
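
To make the shape concrete, a rough, hypothetical sketch of a find() written against these signatures might look like this (the table internals are elided and every name is illustrative):

// Hypothetical sketch: S wraps each stored item, A is chosen per call site.
template<typename Key, typename T, typename S>
class table
{
    struct bucket { Key key; S item; };

    bucket* lookup(const Key& k); // hash lookup, elided

public:
    // A may be a reader-lock accessor, a writer-lock accessor, or a
    // no-op accessor when concurrent erase need not be supported.
    template<typename A>
    bool find(const Key& k, A& a)
    {
        bucket* b = lookup(k);
        if (!b)
            return false;
        return a.try_acquire(b->item); // the caller then reads through A
    }
};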

Anton_M_Intel
Employee
randomizer:

But I'm not sure that it's possible to implement all types of access for all data structures without imposing huge overheads on all types of access, i.e. it would destroy the 'pay as you go' principle.

This principle must be kept. My initial proposal was to introduce a fixed set of accessor types that are suitable for use both within concurrent_hash_map and independently. I'm not sure whether it can handle 'all cases in the world', but for those listed above there is no significant overhead (since the compiler can optimize out empty inline methods and const values :) )

Dmitry_Vyukov
Valued Contributor I
MADadrobiso:

Yes, I meant academic or trade journal articles, not code. Maybe you should write one. You do seem to have the knack for exploiting x86 memory ordering rules for all they can yield.




I am just coding; I am not writing articles. Maybe in 20 years I will be writing articles... :)
Actually, there are no such articles. At all. At least I am not aware of any, and I try to stay aware.

My algorithms are based on the work of very few people. What I can provide is some links to interesting comp.programming.threads discussions:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/5d4e81b46352905d
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/7d202e726424c0bf
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/c102fae504a67876

There are also some advanced lock-free libraries:
http://atomic-ptr-plus.sourceforge.net/
http://webpages.charter.net/appcore/

And there is also my mini-newsgroup, where I post all my algorithms:
http://groups.google.com/group/lock-free


morantza
Beginner

I am about to finish the resize; it should be published soon.

When you are using C++ with its current standard, i.e. no GC, there are commonly known ways to support this without changing the code.

Since the released C++ code uses templates, it is possible to use something like a smart pointer.

Good references:

1) The C++ Programming Language by Bjarne Stroustrup, Addison-Wesley

2) Design Patterns by Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides, Addison-Wesley

The idea is that you can implement a pointer with a reference counter, and then use it as the Data or Key template parameter.

In particular, read section 11.10 (Dereferencing) in the first reference.

I'll post an example of how to create such a smart reference-counted pointer in C++.
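
Roughly, such a pointer might look like this minimal sketch (illustrative only; boost::shared_ptr / std::tr1::shared_ptr are production-quality versions, and note that this counter is not atomic, so concurrent copying is not safe without extra synchronization):

// Minimal reference-counted smart pointer sketch (illustrative only).
template<typename T>
class ref_ptr
{
    struct body
    {
        T value;
        long refs;
        explicit body(const T& v) : value(v), refs(1) {}
    };
    body* b_;

    void release() { if (--b_->refs == 0) delete b_; }

public:
    explicit ref_ptr(const T& v) : b_(new body(v)) {}
    ref_ptr(const ref_ptr& o) : b_(o.b_) { ++b_->refs; }
    ~ref_ptr() { release(); }

    ref_ptr& operator=(const ref_ptr& o)
    {
        if (b_ != o.b_) { ++o.b_->refs; release(); b_ = o.b_; }
        return *this;
    }

    T& operator*()  const { return b_->value; }
    T* operator->() const { return &b_->value; }
};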

Please let me know if you have further questions.

Thanks for the input,

Moran Tzafrir.

Dmitry_Vyukov
Valued Contributor I
MADakukanov:

randomizer:
behavior should (1) provide sequential self-consistency, i.e. if a thread removes an item, then a subsequent operation of *that* thread must not see the item in the table, and (2) respect causal relations between threads, i.e. thread 1 removes an item from the table, then sends a message to thread 2; thread 2 receives the message and then searches for that item - thread 2 must not see the item in the table.

As for all other situations, I don't see how they can end up being non-intuitive. What do you mean by 'cons' here?

This is a good description of the guarantees the container can provide; thanks a lot.

There is at least one questionable point, though (the "other" situation, if you want). What if one and the same thread looks up a key in a hash table, finds an element and keeps a reference to it (in the terms of tbb::concurrent_hash_map, a const_accessor; well, it's a bit stronger than just a reference - it's more of a reader lock), and then repeats the search for the same key - should it find exactly the same element? And if not (which is perfectly possible with the "non-blocking erasure"), is this behavior sequentially self-consistent?



Whether it's sequentially self-consistent behaviour or not depends on the definition of the accessor. If an accessor is defined only as a thing which provides access to an object, then yes, it's self-consistent behaviour. If an accessor is defined as a thing which provides access to an object and also gives some guarantees about the state of the hash map (for example, that while you are holding an accessor to an element, that element can't be removed from the hash map), then no, it's not self-consistent behaviour.
And I'm not sure how accessors are currently defined in TBB - whether they only provide access to an element, or also provide some additional guarantees about the state of the hash map.

If we must choose only one variant, then personally I would prefer the former, i.e. the accessor only provides access to the element, just because this gives more freedom to the implementor and allows substantially higher performance.

But it seems that in the context of the proposal for different types of accessors:
http://softwarecommunity.intel.com/isn/Community/en-US/forums/permalink/30262959/30262958/ShowThread.aspx#30262958
we no longer have to choose. A const_accessor will only provide access to the element (it can use low-overhead, obstruction-free, timestamp-based validation), while a const_accessor_and_hash_map_state_locker will provide access to the object and also provide some guarantees about the state of the hash map (it will use read locking on a reader-writer mutex).
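
To illustrate the distinction, hypothetical usage of the two accessor flavours might look like this (the map type, key, and read() are placeholders, and neither accessor type exists in TBB today):

// Hypothetical sketch contrasting the two proposed accessor flavours.
{
    hash_map_t::const_accessor a;  // access to the element only; may be
    if (m.find(a, key))            // implemented with obstruction-free,
        read(a->second);           // timestamp-based validation
}   // no guarantee the element is still in the map here

{
    hash_map_t::const_accessor_and_hash_map_state_locker a;
    if (m.find(a, key))            // additionally pins the map state: the
        read(a->second);           // element can't be removed while 'a' is
}   // held (reader lock on a reader-writer mutex)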

Dmitry_Vyukov
Valued Contributor I
MADamalakho:

This principle must be kept. My initial proposal was to introduce a fixed set of accessor types that are suitable for use both within concurrent_hash_map and independently. I'm not sure whether it can handle 'all cases in the world', but for those listed above there is no significant overhead (since the compiler can optimize out empty inline methods and const values :) )


For example, to support reference-counted object lifetime management ('deferred item deletion - similar to reference counters'), an additional level of indirection is required: a separate proxy object must be allocated, and this proxy object holds the user's key/data along with the reference counter. If reference counting is not used, then the additional level of indirection and the additional memory allocations/deallocations are not required.
And whether reference counting will be used or not is not known ahead of time, because this is determined by the type of accessor. How are you going to solve this?
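
To make the cost concrete, here is a sketch of the extra indirection in question (illustrative layout only, not from any real implementation):

#include <atomic>

// With reference counting, each table slot must point to a separately
// allocated proxy that holds the user's key/data plus the counter -
// an extra pointer chase plus a malloc/free per element:
template<typename Key, typename T>
struct counted_proxy
{
    std::atomic<long> refs;
    Key key;
    T data;
};

// Without reference counting, the key/data pair can live inline in the
// table slot - no separate allocation, no indirection:
template<typename Key, typename T>
struct inline_slot
{
    Key key;
    T data;
};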

Dmitry_Vyukov
Valued Contributor I
MADadrobiso:

E.g., different calls might have a reader-lock type for A or a writer-lock type for A, or no locking at all.

S(const T& x)

Construct copy of x with access protection. [Should add rvalue-reference form for C++0x.]

~S()

Destructor

T& S::unprotected_ref();

Return reference to underlying type.

bool A::try_acquire(S& s);

Attempt to acquire an access right. Returns true if successful; false otherwise.

void S::wait_while_acquired();

Wait until all access rights have been released.



As a proof of concept, I've tried to apply this interface to my hash map.
In my hash map, read access is protected by a kind of SeqLock which allows 'retries', i.e. the find function looks like this:

value_t hash_map::find(key_t k)
{
    for (;;) {
        handle_t h = lock.begin_read_transaction();
        value_t v = find_impl(k);           // may observe an inconsistent state
        if (lock.commit_read_transaction(h))
            return v;                       // no writer intervened: v is valid
        // otherwise a writer intervened: retry the whole read
    }
}
And I think the situation will be the same if we use a kind of STM.
This pattern is not supported by the Synchronizer/Accessor interfaces.


RafSchietekat
Valued Contributor III
Just make sure that TBB doesn't end up with just a lightning-fast hash map that makes me have to turn off the TV to be able to get it to reliably do what I want. (Seriously: TBB should primarily aim to be the best enabling toolkit for Joe Programmer, with best overall a close second.)
Dmitry_Vyukov
Valued Contributor I
Raf_Schietekat:
Just make sure that TBB doesn't end up with just a lightning-fast hash map that makes me have to turn off the TV to be able to get it to reliably do what I want. (Seriously: TBB should primarily aim to be the best enabling toolkit for Joe Programmer, with best overall a close second.)


Joe Programmer is better off enjoying single-threaded programming.
You can say, "And what about taking advantage of modern multi-core processors?" Do you realize that the current TBB hash map on a quad-core is substantially (more than 10x) slower than a basic single-threaded hash map on a single core? And the more cores we have, the slower the TBB hash map will get.
After all, who said that raw multi-threaded multi-core programming can be easy? Well, some higher-level programming models (like TBB tasks) can provide a good ratio between simplicity and scalability. But that's definitely not raw multi-threaded multi-core programming.


RafSchietekat
Valued Contributor III
I was not aware of any such performance problems with the hash map.

I'm sure that achieving ultimate performance will never be easy, but it's not just about that (anymore), or so I hear.

Dmitry_Vyukov
Valued Contributor I
Raf_Schietekat:

I was not aware of any such performance problems with the hash map.
I'm sure that achieving ultimate performance will never be easy, but it's not just about that (anymore), or so I hear.


A good sanity check for multi-threaded code is to check it against the best single-threaded code. If it's not faster, then, well, what's the point? Yeah, all 1000 cores are 100% busy, but performance is worse than that of the single-threaded code.

ARCH_R_Intel
Employee

randomizer:

A good sanity check for multi-threaded code is to check it against the best single-threaded code.

That's true when the code is the application. It gets trickier to evaluate when looking at only one piece of an application, like a data structure. Consider a table that supports fetch-and-update. Using a single lock might be faster than locking each item separately, but if most of the time is spent in the update calculation (and not in the search/locking itself), we are better off with the implementation that is slower per operation but permits more concurrency - as long as it is not too much slower.

That said, there is much room for improving the performance of TBB's concurrent_hash_map, though I never expect it to be as fast as a single-threaded hash map. That is, unless someone comes up with zero-overhead transactional memory.

Anton_M_Intel
Employee
randomizer:

And whether reference counting will be used or not is not known ahead of time, because this is determined by the type of accessor. How are you going to solve this?

Accessors are an abstraction of data access and are part of the data storage abstraction of the Synchronizer (the name is as in Arch's specification). So it's known which accessor(s) will be used with the container.
Anton_M_Intel
Employee
randomizer:

As a proof of concept, I've tried to apply this interface to my hash map.
In my hash map, read access is protected by a kind of SeqLock which allows 'retries', i.e. the find function looks like this:
...[skipped]...
And I think the situation will be the same if we use a kind of STM.
This pattern is not supported by the Synchronizer/Accessor interfaces.

I believe this is part of the container's algorithm rather than something related to the access abstraction. You could rewrite it using an accessor which stores a copy of the data inside; the read retries then become part of the container logic.
But it is a good example to consider, thank you.
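
For example, a hypothetical accessor that stores a copy could look like this (sketch only, written against the unprotected_ref() signature from Arch's proposal):

// Hypothetical copying accessor: try_acquire() snapshots the item, and the
// container re-runs its read transaction until the snapshot validates, so
// the retry loop stays inside the container's find().
template<typename T>
class copying_accessor
{
    T copy_;
public:
    template<typename S>
    bool try_acquire(S& s)
    {
        copy_ = s.unprotected_ref(); // snapshot; the container validates it
        return true;                 // never blocks the caller
    }
    const T& operator*() const { return copy_; }
};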
Dmitry_Vyukov
Valued Contributor I
MADamalakho:
randomizer:

And whether reference counting will be used or not is not known ahead of time, because this is determined by the type of accessor. How are you going to solve this?

Accessors are an abstraction of data access and are part of the data storage abstraction of the Synchronizer (the name is as in Arch's specification). So it's known which accessor(s) will be used with the container.


It's known which accessors can *potentially* be used with the container. But it's *not* known which accessors will *really* be used with the container. So it's impossible to keep the 'pay as you go' principle: everybody has to pay for the possibility of reference counting with an additional level of indirection and additional malloc/free calls.

Anton_M_Intel
Employee
randomizer:

You can say, "And what about taking advantage of modern multi-core processors?" Do you realize that the current TBB hash map on a quad-core is substantially (more than 10x) slower than a basic single-threaded hash map on a single core? And the more cores we have, the slower the TBB hash map will get.
After all, who said that raw multi-threaded multi-core programming can be easy? Well, some higher-level programming models (like TBB tasks) can provide a good ratio between simplicity and scalability. But that's definitely not raw multi-threaded multi-core programming.

I'm not sure about the measurements you provide. I compared the current version of concurrent_hash_map with a locked std::map (since a hash map is non-standard), and it looks and scales well for sizes above a few thousand items. And we are able to achieve much better performance and scalability while keeping it simple for users, thanks to your sample implementation, Hopscotch Hashing, some other papers, and my proposals.
Anton_M_Intel
Employee
randomizer:
MADamalakho:

Accessors are an abstraction of data access and are part of the data storage abstraction of the Synchronizer (the name is as in Arch's specification). So it's known which accessor(s) will be used with the container.

It's known which accessors can *potentially* be used with the container. But it's *not* known which accessors will *really* be used with the container. So it's impossible to keep the 'pay as you go' principle: everybody has to pay for the possibility of reference counting with an additional level of indirection and additional malloc/free calls.

Consider this code:

template<typename T>
class Synchronizer {
public:
    class Accessor { /*...*/ };
};

template<typename Key, typename T, typename S = Synchronizer<T> >
class hash_map;

An Accessor of a certain type is defined within the particular Synchronizer class that is used to instantiate the container. So you pay *only when you really need* some complicated stuff like GC.
The default Synchronizer with deferred deletion does not necessarily require heavy reference counting.
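
So, hypothetically, the choice would be made once, at container type level (both Synchronizer names below are made up for illustration, as is MyData):

// The policy is fixed when the container type is instantiated; a
// Synchronizer without reference counting pays none of its costs.
typedef hash_map<int, MyData, refcounting_synchronizer<MyData> > counted_map;
typedef hash_map<int, MyData, plain_synchronizer<MyData> >       plain_map;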
Dmitry_Vyukov
Valued Contributor I
MADamalakho:

I'm not sure about the measurements you provide. I compared the current version of concurrent_hash_map with a locked std::map (since a hash map is non-standard), and it looks and scales well for sizes above a few thousand items. And we are able to achieve much better performance and scalability while keeping it simple for users, thanks to your sample implementation, Hopscotch Hashing, some other papers, and my proposals.


On what hardware did you run the tests?
Note that I was talking about the following situation: the multi-threaded container is tested on a multi-core machine with many threads; the single-threaded container is tested with one thread and is not protected by a mutex.


Anton_M_Intel
Employee
randomizer:

Note that I was talking about the following situation: the multi-threaded container is tested on a multi-core machine with many threads; the single-threaded container is tested with one thread and is not protected by a mutex.

That is of purely academic interest. In a real application, concurrent containers should be used just as a synchronization object - a crossing point of separate threads/tasks which process data independently of each other.
Dmitry_Vyukov
Valued Contributor I
MADamalakho:

And we are able to achieve much better performance and scalability while keeping it simple for users, thanks to your sample implementation, Hopscotch Hashing, some other papers, and my proposals.


A few comments on this.
My map (at least in its current version) is not simple for users. It has some caveats with memory reclamation, and it supports only types for which reading an inconsistent state is safe. Basically, it is not C++ type-safe.
Hopscotch (in its current version) is even worse; in addition, it doesn't support satellite data.
How can one explain those caveats to Joe Programmer? I don't know.
And the other papers are targeted at Java, where life is much easier with GC.

But some improvements are definitely possible. To what point? I don't know. I'm constantly trying to merge high performance with user friendliness in the context of C++, but the results are never comforting. My current point of view is that only two solutions are possible: (1) low-level primitives usable only by experts, and (2) a higher-level programming model usable by all. Low-level primitives usable by all are not possible.


Dmitry_Vyukov
Valued Contributor I
Raf, adrobiso, amalakho,

I understand that in the common case a synchronization primitive can't be as fast as a single-threaded primitive (I got a bit carried away there). I also perfectly understand your concern with convenience and simplicity for the end user. I am not pushing such solutions into TBB; I've just shown what, I think, may be interesting to you. The final choice is up to you.

I agree with Anton that in a well designed application "concurrent containers should be used just as a synchronization object - a crossing point of separate threads/tasks which process data independently of each other". But there are still some aspects of synchronization primitive performance which affect the end user. First, the performance of synchronization primitives shapes the notion of what a "well designed application" is. For example, if the overhead per task execution in TBB is around 500 cycles, then I MUST design my application so that the payload of a task is not less than 5000 cycles. And if the overhead per task is around 50 cycles, then I CAN also design my application so that the payload of a task is between 500 and 5000 cycles. In some situations, high overhead per task execution can force me to abandon the simple and straightforward design of an application and use some roundabout design instead.
Also, the performance of synchronization primitives can have a great impact on 'not so well designed applications'. And if TBB is a library for Joe Programmer too, then it must reckon with 'not so well designed applications' too.

I see several ways for TBB in this context:
1. Concentrate on simplicity for the end user, and get the performance that can be obtained with this approach.
2. Concentrate on performance, and get the simplicity that can be obtained with this approach.
3. Try to incorporate some high-performance techniques without sacrificing simplicity for the end user.
4. Sacrifice a bit of simplicity for the end user for the sake of performance.
5. End up with two versions of the primitives: one for simplicity and one for performance.
6. Concentrate only on higher-level programming models (tasks for HPC, agents for general-purpose computing), incorporating advanced synchronization techniques internally at the low level.
