Intel® oneAPI Threading Building Blocks

Scalable hash map algorithm

Dmitry_Vyukov
Valued Contributor I
I've attached an implementation of my new concurrent hash map algorithm. The implementation is NOT PRODUCTION READY; it's just an illustration of the algorithm. It can be compiled only with MSVC/Win32, but the algorithm is fairly portable: only single-word atomic RMW operations are required, plus very little OS-specific code.

Here are benchmark results on a Q6600 (quad-core).
Element count = 256, readonly workload, 8 threads.
TBB concurrent_hash_map:
Single core: 338 cycles/find
Cores 0&1: 218 cycles/find (scaling 1.55)
Cores 0&2: 540 cycles/find (scaling 0.63)
All cores: 353 cycles/find (scaling 0.96)

My hash map:
Single core: 30 cycles/find
Cores 0&1: 15 cycles/find (scaling 2)
Cores 0&2: 15 cycles/find (scaling 2)
All cores: 7 cycles/find (scaling 4)

Element count = 65536, readonly workload, 8 threads.
TBB concurrent_hash_map:
Single core: 383 cycles/find
Cores 0&1: 212 cycles/find (scaling 1.80)
Cores 0&2: 584 cycles/find (scaling 0.66)
All cores: 363 cycles/find (scaling 1.06)

My hash map:
Single core: 42 cycles/find
Cores 0&1: 21 cycles/find (scaling 2)
Cores 0&2: 21 cycles/find (scaling 2)
All cores: 10 cycles/find (scaling 4)

Element count = 256, mixed workload (4 threads make 100% finds; 4 threads make 50% finds, 25% inserts, 25% removes)
TBB concurrent_hash_map:
Single core: 348 cycles/find
Cores 0&1: 212 cycles/find (scaling 1.64)
Cores 0&2: 477 cycles/find (scaling 0.73)
All cores: 304 cycles/find (scaling 1.14)

My hash map:
Single core: 51 cycles/find
Cores 0&1: 27 cycles/find (scaling 1.89)
Cores 0&2: 31 cycles/find (scaling 1.65)
All cores: 15 cycles/find (scaling 3.4)

Element count = 65536, mixed workload (4 threads make 100% finds; 4 threads make 50% finds, 25% inserts, 25% removes)
TBB concurrent_hash_map:
Single core: 377 cycles/find
Cores 0&1: 206 cycles/find (scaling 1.83)
Cores 0&2: 516 cycles/find (scaling 0.73)
All cores: 322 cycles/find (scaling 1.17)

My hash map:
Single core: 74 cycles/find
Cores 0&1: 35 cycles/find (scaling 2.12)
Cores 0&2: 43 cycles/find (scaling 1.72)
All cores: 22 cycles/find (scaling 3.36)

On a read-mostly workload this algorithm is up to 50x faster than TBB's hash map. On a mixed workload it is up to 30x faster. And scaling is substantially better, so on an 8-core system the speedup can be 100x and higher.

Memory consumption is 25% lower compared with the TBB hash map.

A few notes on the algorithm.
First of all, it uses amortized proxy-collector memory reclamation (implementation included in the archive). This allows read-only transactions to be purely read-only, i.e. no modifications of shared state.
Read-only access (the find operation) uses a kind of sequence mutex (the so-called SeqLock) to ensure consistency. But I made some modifications to the SeqLock so that read access is obstruction-free, as opposed to the classical SeqLock, which is actually blocking if a reader conflicts with a writer.
Mutators (insert/remove) use fine-grained locking with respect to other mutators, and atomic modifications with respect to readers.
The data layout is cache-conscious, so that most operations rarely touch (either write or read) more than one cache line.
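
To make the read path concrete, here is a minimal sketch (C++11 atomics) of the classical SeqLock scheme that the description above starts from. The bucket layout and names are illustrative, not the actual code from the archive; my modified SeqLock further changes this so that a reader never spins on a stalled writer:

#include <atomic>
#include <cstdint>

// Illustrative bucket: one cache line holding a version counter
// plus a few key/value cells (cache-conscious layout).
struct bucket_t
{
    std::atomic<uint32_t> version; // odd while a mutator is mid-update
    std::atomic<uint32_t> key[3];
    std::atomic<void*> value[3];
};

// Classical SeqLock read: snapshot the version, read the cells,
// retry if a writer was active or changed the version meanwhile.
void* find_in_bucket(bucket_t& b, uint32_t k)
{
    for (;;)
    {
        uint32_t v0 = b.version.load(std::memory_order_acquire);
        if (v0 & 1)
            continue; // writer in progress
        void* res = 0;
        for (int i = 0; i != 3; ++i)
            if (b.key[i].load(std::memory_order_relaxed) == k)
                res = b.value[i].load(std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_acquire);
        if (b.version.load(std::memory_order_relaxed) == v0)
            return res; // snapshot was consistent
    }
}

Note that this read path writes nothing to shared memory, which is what lets finds scale linearly in the read-only benchmarks above.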

RafSchietekat
Valued Contributor III
At this point I want to apologise to all programmers named Joe for dragging them into this.

I would prefer getting a solid kick without too much worrying involved (no surprises allowed), with ultimate performance additionally available if and when I'm prepared to deal with the associated complexity (it's not just about being an "expert" or not).

Dmitry_Vyukov
Valued Contributor I
Raf_Schietekat:

I would prefer getting a solid kick without too much worrying involved (no surprises allowed), with ultimate performance additionally available if and when I'm prepared to deal with the associated complexity (it's not just about being an "expert" or not).


Btw, what exactly do you mean by 'associated complexity'? What 'associated complexity' do you see as an end user, not as a developer?


Dmitry_Vyukov
Valued Contributor I
randomizer:

Btw, what exactly do you mean by 'associated complexity'? What 'associated complexity' do you see as an end user, not as a developer?


Mutex-based solutions have associated complexity for the end user. The best known is the possibility of deadlock (a minimal sketch follows). Not so well known is spurious and unexpected performance destruction when moving from single-core to multi-core.
So using mutex-based synchronization instead of... let's call it advanced synchronization... can be considered at least as trading bad for worse.
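
For reference, the deadlock hazard needs nothing exotic; a minimal sketch with two plain mutexes (hypothetical resources, nothing TBB-specific):

#include <mutex>

std::mutex mutex_a, mutex_b;

// Thread 1 locks A then B; thread 2 locks B then A. If each thread
// grabs its first mutex before the other's second acquire, both wait
// forever: the classic lock-ordering deadlock.
void thread1()
{
    std::lock_guard<std::mutex> la(mutex_a);
    std::lock_guard<std::mutex> lb(mutex_b); // can block here forever
}

void thread2()
{
    std::lock_guard<std::mutex> lb(mutex_b);
    std::lock_guard<std::mutex> la(mutex_a); // can block here forever
}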

Anton_M_Intel
Employee
randomizer:

I see several ways for TBB in this context:
1. Concentrate on simplicity for end user, and get performance which can be obtained with this approach.
2. Concentrate on performance, and get simplicity which can be obtained with this approach.
3. Try to incorporate some of high-performance techniques, not sacrificing simplicity for end user.
4. Sacrifice a bit simplicity for end user for the sake of performance.
5. End up with 2 versions of primitives, first is for simplicity, and second is for performance.
6. Concentrate only on higher-level programming models (tasks for HPC, and agents for general-purpose computing), incorporating advanced synchronization techniques internally on low level.

We'll work hard to make the fusion of simplicity and performance as deep as possible. So, I like point 3, and a bit of 4.

But I'm on vacation for a couple of weeks to get married :-) So, I'll be able to continue this work and discussion when I'm back.
RafSchietekat
Valued Contributor III
"Btw, what exactly do you mean by 'associated complexity'? What 'associated complexity' do you see as end user, not as developer?"

I was thinking about the difficulties associated with departing from sequential consistency, and in the present context whether it should be allowed for N>1 different accessors to an object with key K in one map M to all refer to private copies because "delete" doesn't want to wait anymore. Unfortunately, deadlock can be the cruel punishment for using one simple approach; you mentioned "performance destruction" as another adverse outcome (I have to admit that I don't know what this is about). Still, the primary goal seems to be to achieve correctness, which would be easier to deal with if one needed to actively release a security handle before "strange" things could start to occur. It's a design challenge...

michaelmarcin
Beginner
MADamalakho:
But I'm on vacation for a couple of weeks to get married :-) So, I'll be able to continue this work and discussion when I'm back.


Congratulations. Best of luck.
Dmitry_Vyukov
Valued Contributor I
Raf_Schietekat:

I was thinking about the difficulties associated with departing from sequential consistency, and in the present context whether it should be allowed for N>1 different accessors to an object with key K in one map M to all refer to private copies because "delete" doesn't want to wait anymore.


Ok. What problems do you see here? Can you provide an example where this will result in counter-intuitive behavior for the end user?


Raf_Schietekat:

you mentioned "performance destruction" as another adverse outcome (I have to admit that I don't know what this is about).


See the benchmark results in the first post. I call 100x performance degradation 'performance destruction'.


Raf_Schietekat:

Still, the primary goal seems to be to achieve correctness


Nobody is saying the opposite. The problem here is: what is 'correctness'?
At least I can say with confidence that the semantics provided by my hash map are usable too.

Dmitry_Vyukov
Valued Contributor I
MADadrobiso:

That said, there is much room for improving the performance of TBB's concurrent_hash_map, though I never expect it to be as fast as a single-threaded hash map. That is, unless someone comes up with zero-overhead transactional memory.




The cost of a find operation in my hash map is about 30 cycles, which is roughly equal to that of a sequential hash map. But modifying operations are substantially heavier than in a sequential hash map.

As for transactional memory: it's highly probable that hardware transactional memory will provide performance equal to single-threaded code on short read-only transactions. And it seems that we will see HTM much sooner than one might expect. Sun is very close to HTM:
http://blogs.sun.com/dave/resource/transact08-dice.pdf
And AMD is working hard on HTM:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/8d9708596c832a18

But there are some things to mention.
First of all, there is no guarantee that a resize operation will succeed before the next Ice Age. It can just meet constant contention from short modifying transactions (insert, update, remove).
Second, in order to achieve sequential consistency one has to combine the find operation and the user processing into one transaction. So read-only transactions won't be short and fast (see the sketch below).
Oh, and I forgot to mention: it's disallowed to issue any 'external' actions inside transactions, i.e. one can't write to files, can't send/recv on a socket, etc.
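
To sketch the second point, here 'atomically' is a purely hypothetical transaction wrapper (the types are illustrative too); the point is that the user's processing must sit inside the same transaction as the lookup:

struct item_t;
struct map_t { item_t* find(int key); };
void process(item_t& item);
template<class F> void atomically(F body); // hypothetical TM interface

void lookup_and_process(map_t& m, int key)
{
    atomically([&]
    {
        item_t* item = m.find(key); // the bucket enters the read-set...
        if (item)
            process(*item); // ...and so does everything the processing
                            // touches, so the 'short read-only
                            // transaction' is no longer short
        // and no file/socket I/O is allowed anywhere in here
    });
}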

I'm not saying that HTM will be useless. It can be very powerful. But in the hands of a thoughtless user it will be roughly the same as mutex-based programming, just with a different set of caveats. There are no deadlocks; there are livelocks. One can still saturate the interconnect. One can still totally degrade performance by bloating the read- and write-sets of transactions (remember that you have to include the user processing in the hash map's find transaction).
So no silver bullet there!

milom
Beginner
randomizer:
As for transactional memory: it's highly probable that hardware transactional memory will provide performance equal to single-threaded code on short read-only transactions. And it seems that we will see HTM much sooner than one might expect. Sun is very close to HTM... And AMD is working hard on HTM... So no silver bullet there!



I agree that initial support for hardware-based transactional memory will be best for short transactions, but nobody is talking about making them read-only. For example, Sun's Rock supports a small number of writes per transaction using the processor's store buffer. It allows much larger read sets by using "transactionally read" bits in the L1 data cache tags. From what I've heard, AMD is planning to support a bounded TM that guarantees support for a small fixed number, say 16, of memory operations (reads or writes).

Transactional memory isn't a silver bullet for all of multicore programming. However, it is going to be immensely helpful in writing the sort of lock-free and low-lock data structures being discussed in this thread. TM has similar benefits (such as avoiding the lock acquire and providing fine-grained synchronization) with fewer tricky issues regarding memory consistency, compiler transformations, or subtle races.

Yet, if I were designing a chip today, I would add support for "speculative locking", not transactional memory. The mechanisms look a lot like transactional memory, but they target the speculative execution of critical sections. For example, a single-lock queue combined with hardware speculative locking would give the concurrency of a two-lock queue. The key advantages of such speculative locking are that (1) the system can always just fall back on acquiring the lock, and (2) it can be applied to existing programs to make them faster or scale better.
Dmitry_Vyukov
Valued Contributor I
milom:

I agree that initial support for hardware-based transactional memory will be best for short transactions, but nobody is talking about making them read-only.


I am talking about making them read-only. If nobody else is talking about that... well... what can I do? I can only say that they will be as non-scalable as current mutex-based or lock-free algorithms... well, maybe just a bit simpler to implement.


milom:

For example, Sun's Rock supports a small number of writes per transaction using the processor's store buffer. It allows much larger read sets by using "transactionally read" bits in the L1 data cache tags. From what I've heard, AMD is planning to support a bounded TM that guarantees support for a small fixed number, say 16, of memory operations (reads or writes).


Definitely TM must support writes! I am not saying that TM must not support writes; I am saying that TM will be fast/scalable only for short read-only transactions.


milom:

Transactional memory isn't a silver bullet for all of multicore programming. However, it is going to be immensely helpful in writing the sort of lock-free and low-lock data structures being discussed in this thread. TM has similar benefits (such as avoiding the lock acquire and providing fine-grained synchronization) with fewer tricky issues regarding memory consistency, compiler transformations, or subtle races. Yet, if I were designing a chip today, I would add support for "speculative locking", not transactional memory. The mechanisms look a lot like transactional memory, but they target the speculative execution of critical sections. For example, a single-lock queue combined with hardware speculative locking would give the concurrency of a two-lock queue. The key advantages of such speculative locking are that (1) the system can always just fall back on acquiring the lock, and (2) it can be applied to existing programs to make them faster or scale better.


Can you elaborate more on "speculative locking"? What API and semantics will it have?

milom
Beginner
randomizer:
Can you elaborate more on "speculative locking"? What API and semantics will it have?



The semantics are basically the same as a lock's. Yet the implementation speculates past the lock, executes the critical section speculatively, and then commits if no conflicting accesses from other processors occurred. To do this, it obtains read permission to the lock variable to act as a "guard", but in the common case the lock variable isn't written, so no coherence traffic is generated. Otherwise, it has much in common with the subsequent transactional memory proposals.

A paper that describes the idea:

http://pages.cs.wisc.edu/~rajwar/papers/tlr_ieeemicro03.pdf
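
As a rough software-level sketch of the pattern, using RTM-style intrinsics as a stand-in for the hardware interface (the spinlock and names are illustrative, and such intrinsics reached hardware only well after this discussion):

#include <immintrin.h> // RTM intrinsics: _xbegin/_xend/_xabort
#include <atomic>

std::atomic<int> lock_word(0); // 0 = free, 1 = held

template<class F>
void speculative_critical_section(F body)
{
    if (_xbegin() == _XBEGIN_STARTED)
    {
        // Reading the lock word puts it into the read-set as the
        // "guard": if another thread really acquires the lock, the
        // transaction aborts.
        if (lock_word.load(std::memory_order_relaxed) != 0)
            _xabort(0xff);
        body(); // critical section executes speculatively
        _xend(); // commit if no conflicting accesses occurred
        return;
    }
    // Abort, conflict, or no HTM support: fall back on the lock.
    while (lock_word.exchange(1, std::memory_order_acquire)) {} // spin
    body();
    lock_word.store(0, std::memory_order_release);
}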

RafSchietekat
Valued Contributor III
Dmitriy wrote: "Ok. What problems do you see here? Can you provide an example where this will result in counter-intuitive behavior for the end user?" I don't have a specific example, really; you'll have to ask my friend Murphy to come up with one (I'm sure he will, but probably not right now either).

"See benchmark results in first post. I call 100x performance degradation 'performance destruction'." Of course... Still, if manipulating the data structure is incidental (and Amdahl rules out significant benefits/penalties) but correctness is crucial, it may not be the worst thing to locally destroy performance and globally preserve correctness. Dare I say the words "premature optimisation"? Of course, where needed, a faster version had better be available, too.

"Correctness" is about being confident that you can use this stuff and not lose a million dollars here and there if your programmers didn't all "have a PhD". Or about making all software multicore and only have to think really hard where it really matters for performance.

But I don't know how it will turn out in the real world, I was just pointing out that different people have different priorities, and not everybody is equally intimately familiar with the subject matter. Can we leave it at that? I would love to be mistaken, so that puts me in a lose-lose position here...

Dmitry_Vyukov
Valued Contributor I
Raf_Schietekat:

Dmitriy wrote: "Ok. What problems do you see here? Can you provide example where this will result in counter-intuitive behavior for end user?" I don't have a specific example, really, you'll have to ask my friend Murphy to come up with one (I'm sure he will, but probably not right now either).

"See benchmark results in first post. I call 100x performance degradation 'performance destruction'." Of course... Still, if manipulating the data structure is incidental (and Amdahl rules out significant benefits/penalties) but correctness is crucial, it may not be the worst thing to locally destroy performance and globally preserve correctness. Dare I say the words "premature optimisation"? Of course, where needed, a faster version had better be available, too.


Well, for the time being I don't see how my map is not correct. For now I am considering it as just faster.
Yes, in a perfect world threads access shared containers really rarely. But the performance of shared containers can still affect real-world programs. Here I posted some of my thoughts on this:
http://softwarecommunity.intel.com/isn/Community/en-US/forums/permalink/30263737/30263498/ShowThread.aspx#30263498
And note that there is not just a single shared hash map in a program; there are also memory allocation, message queueing, object lifetime management (reference counting), logging, global application settings, database caches, statistics, and so on and on. All those things are shared, so the actual frequency of accesses to shared objects can be much higher than one thinks.



Dmitry_Vyukov
Valued Contributor I
Raf_Schietekat:

"Correctness" is about being confident that you can use this stuff and not lose a million dollars here and there if your programmers didn't all "have a PhD". Or about making all software multicore and only have to think really hard where it really matters for performance.

But I don't know how it will turn out in the real world, I was just pointing out that different people have different priorities, and not everybody is equally intimately familiar with the subject matter. Can we leave it at that? I would love to be mistaken, so that puts me in a lose-lose position here...


It's not fair. You are implying that my hash map is not correct, and you take this as an axiom. First show some evidence.

RafSchietekat
Valued Contributor III
Sorry, I never meant to imply that the map itself was not correct, only that it might be more difficult not to make mistakes dealing with some less intuitive properties. That's why it may make sense to have a choice between a forgiving implementation that happens to be rather slow, and also one that requires more attention but with big rewards where it matters. The first hurdle is getting multithreaded applications to run correctly, the second hurdle is to make them as fast as they can be but without breaking them. Correct me if I'm mistaken, but I had the impression that the first hurdle was still a significant one, even if it is nice to believe that all "future" programmers will just absorb all this as a natural part of their activity and merely wonder what the fuss was all about.

Dmitry_Vyukov
Valued Contributor I
Raf_Schietekat:
Sorry, I never meant to imply that the map itself was not correct, only that it might be more difficult not to make mistakes dealing with some less intuitive properties. That's why it may make sense to have a choice between a forgiving implementation that happens to be rather slow, and also one that requires more attention but with big rewards where it matters. The first hurdle is getting multithreaded applications to run correctly, the second hurdle is to make them as fast as they can be but without breaking them. Correct me if I'm mistaken, but I had the impression that the first hurdle was still a significant one, even if it is nice to believe that all "future" programmers will just absorb all this as a natural part of their activity and merely wonder what the fuss was all about.


I can't conclude whether it's a significant one or not; I'm a bit biased here. This is why I'm asking for examples.
I can only say that currently I don't see where deferred deletion can cause problems.
My hash map provides 2 basic properties: (1) sequential self-consistency, and (2) a cause-and-effect relation. My personal opinion is that those properties are enough to write correct programs.

Consider an example:

void some_thread()
{
    const_accessor a;
    if (hash_map.find(key, a))
    {
        // work with the item here; we really aren't concerned
        // with when exactly the item will be removed from the
        // map, nor when it will be deleted
    }
}

What problems can there be if the item is removed from the collection while this thread is still processing it?

ARCH_R_Intel
Employee

I do not see a problem with Dmitriy's semantics for removing a key. Note that in Java, clients of ConcurrentHashMap can operate freely on a value even after it is removed from the map, so the practice has a major precedent.

Dmitry_Vyukov
Valued Contributor I
MADadrobiso:

I do not see a problem with Dmitriy's semantics for removing a key. Note that in Java, clients of ConcurrentHashMap can operate freely on a value even after it is removed from the map, so the practice has a major precedent.



Great point!
I believe that .NET's hash map provides the same semantics. And I think that most Java/.NET programmers don't even think about this point, i.e. they may not be aware of the semantics; they just write their programs, and the programs work out-of-the-box.

Also note that my hash map can easily be extended to support the current TBB hash map semantics as well: the reader just has to lock the bucket (a sketch follows). This will significantly degrade performance, but if the user needs exactly those semantics, he can get them. And if the user is satisfied with plain access to the item, he can use read access as it's currently implemented in my hash map.
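
A sketch of that extension, with illustrative names (not the code from the archive): the reader takes the bucket's mutator lock, and the accessor holds it until destruction, so the item cannot be removed while the accessor is alive:

#include <cstdint>
#include <mutex>

struct item_t; // illustrative types
struct bucket_t
{
    std::mutex mutator_lock; // already serializes insert/remove
    item_t* lookup(uint32_t key);
};

struct const_accessor
{
    item_t* item;
    std::mutex* held;
    const_accessor() : item(), held() {}
    ~const_accessor() { if (held) held->unlock(); } // unpin on scope exit
};

bool find_locked(bucket_t& b, uint32_t key, const_accessor& a)
{
    b.mutator_lock.lock(); // now no mutator can touch this bucket
    item_t* item = b.lookup(key);
    if (!item) { b.mutator_lock.unlock(); return false; }
    a.item = item;
    a.held = &b.mutator_lock; // item is pinned while 'a' is alive
    return true;
}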

RafSchietekat
Valued Contributor III
Two against one, that's so like a TBB scheduler (unfair), so I'll yield (it's not a perfect comparison...). :-) Well, not without dismissing Java as a precedent: it does not have synchronous destructor handling, so emulating C++ accessors would involve requiring close() to be called a lot, and it also doesn't store values, only pointers to garbage-collected objects.

(Added clarification and smiley.)
Dmitry_Vyukov
Valued Contributor I
Raf_Schietekat:

it does not have synchronous destructor handling, so emulating C++ accessors would involve requiring close() to be called a lot


Java/C# don't need accessors on the read side at all. The GC will do all the work.

Raf_Schietekat:

, and it also doesn't store values, only pointers to garbage-collected objects.


It's easy to specialize a C++ template class for pointers only. So for pointers there can be one implementation, and for "fat" embedded objects another (a minimal sketch below).
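
A minimal sketch of such a specialization (illustrative, not from the attached code):

#include <atomic>

// Primary template: "fat" values embedded in the table itself;
// readers need a consistent multi-word snapshot (e.g. the SeqLock
// scheme from the first post).
template<class K, class V>
struct cell
{
    K key;
    V value;
};

// Partial specialization for pointer values: only a single word is
// stored, so a reader can load it atomically, and the pointee's
// lifetime is managed elsewhere (GC, reference counting, or the
// proxy-collector).
template<class K, class V>
struct cell<K, V*>
{
    K key;
    std::atomic<V*> value;
};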

RafSchietekat
Valued Contributor III
"Java/C# don't need accessors on read side at all. GC will do all the work." Exactly, but that's also why nobody would expect any synchronous locking semantics. If all the object provides is, e.g., a snapshot, then that's what it should be called, so that the user knows what he's dealing with; changing the semantics of an "accessor" is just not cool. It's like training somebody on Microsoft Windows, and then letting him use Unix without first telling him all about unlink(): this will almost never cause a problem... until it does.
