Here are benchmark results on a Q6600.
Element count = 256, readonly workload, 8 threads.
TBB concurrent_hash_map:
Single core: 338 cycles/find
Cores 0&1: 218 cycles/find (scaling 1.55)
Cores 0&2: 540 cycles/find (scaling 0.63)
All cores: 353 cycles/find (scaling 0.96)
My hash map:
Single core: 30 cycles/find
Cores 0&1: 15 cycles/find (scaling 2)
Cores 0&2: 15 cycles/find (scaling 2)
All cores: 7 cycles/find (scaling 4)
Element count = 65536, readonly workload, 8 threads.
TBB concurrent_hash_map:
Single core: 383 cycles/find
Cores 0&1: 212 cycles/find (scaling 1.80)
Cores 0&2: 584 cycles/find (scaling 0.66)
All cores: 363 cycles/find (scaling 1.06)
My hash map:
Single core: 42 cycles/find
Cores 0&1: 21 cycles/find (scaling 2)
Cores 0&2: 21 cycles/find (scaling 2)
All cores: 10 cycles/find (scaling 4)
Element count = 256, mixed workload (4 threads make 100% finds; 4 threads make 50% finds, 25% inserts, 25% removes).
TBB concurrent_hash_map:
Single core: 348 cycles/find
Cores 0&1: 212 cycles/find (scaling 1.64)
Cores 0&2: 477 cycles/find (scaling 0.73)
All cores: 304 cycles/find (scaling 1.14)
My hash map:
Single core: 51 cycles/find
Cores 0&1: 27 cycles/find (scaling 1.89)
Cores 0&2: 31 cycles/find (scaling 1.65)
All cores: 15 cycles/find (scaling 3.4)
Element count = 65536, mixed workload (4 threads make 100% finds; 4 threads make 50% finds, 25% inserts, 25% removes).
TBB concurrent_hash_map:
Single core: 377 cycles/find
Cores 0&1: 206 cycles/find (scaling 1.83)
Cores 0&2: 516 cycles/find (scaling 0.73)
All cores: 322 cycles/find (scaling 1.17)
My hash map:
Single core: 74 cycles/find
Cores 0&1: 35 cycles/find (scaling 2.12)
Cores 0&2: 43 cycles/find (scaling 1.72)
All cores: 22 cycles/find (scaling 3.36)
On a read-mostly workload this algorithm is up to 50x faster than TBB's hash map; on a mixed workload it is up to 30x faster. And scaling is substantially better, so on an 8-core system the speedup can be 100x or higher.
Memory consumption is 25% lower compared with the TBB hash map.
A few notes on the algorithm.
First of all, it uses amortized proxy-collector memory reclamation (implementation included in the archive). It allows read-only transactions to be purely read-only, i.e. no modifications of shared state.
Read-only access (the find operation) uses a kind of sequence mutex (so-called SeqLock) to ensure consistency. But I made some modifications to the SeqLock so that read access is obstruction-free, as opposed to the classical SeqLock, which is actually blocking if a reader conflicts with a writer.
Mutators (insert/remove) use fine-grained locking with respect to other mutators, and atomic modifications with respect to readers.
The data layout is cache-conscious, so most operations rarely touch (either write or read) more than one cache line.
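The read protocol described above can be sketched roughly as follows. This is a minimal classical-SeqLock cell with hypothetical names; the post's actual variant additionally makes readers obstruction-free, which this sketch does not show.

```cpp
#include <atomic>

// Minimal classical-SeqLock cell (hypothetical names). A writer keeps the
// sequence odd for the duration of an update; a reader retries if it saw an
// odd or changed sequence. Note: the plain reads/writes of key/value are a
// benign-race idiom, not strictly conforming C++; a fully conforming version
// would make the data fields relaxed atomics.
struct SeqLockCell {
    std::atomic<unsigned> seq{0};
    int key = 0;
    int value = 0;

    void write(int k, int v) {  // mutators assumed serialized elsewhere
        seq.store(seq.load(std::memory_order_relaxed) + 1,
                  std::memory_order_relaxed);         // odd: update in progress
        std::atomic_thread_fence(std::memory_order_release);
        key = k;
        value = v;
        seq.store(seq.load(std::memory_order_relaxed) + 1,
                  std::memory_order_release);         // even: stable again
    }

    bool read(int k, int* out) {  // purely read-only: no shared-state writes
        for (;;) {
            unsigned s1 = seq.load(std::memory_order_acquire);
            if (s1 & 1)
                continue;                             // writer active: retry
            int kk = key;
            int vv = value;
            std::atomic_thread_fence(std::memory_order_acquire);
            if (seq.load(std::memory_order_relaxed) != s1)
                continue;                             // torn read: retry
            if (kk != k)
                return false;                         // consistent; key absent
            *out = vv;
            return true;
        }
    }
};
```

The point of the scheme is that the reader writes nothing to shared memory, so read-side cache lines stay in shared state across cores.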
Raf_Schietekat:
"Java/C# don't need accessors on read side at all. GC will do all the work." Exactly, but that's also why nobody would expect any synchronous locking semantics.
I don't see the connection between those things.
Raf_Schietekat:
If all the object provides is, e.g., a snapshot, then that's what it should be called, so that the user knows what he's dealing with; changing the semantics of an "accessor" is just not cool.
This is the question I was asking earlier in this topic: what are the *official* semantics of TBB's accessors now? Do they have to provide more than just access to the object? Well, at least their name suggests that they must provide access to the object, just like a smart pointer or a Java reference.
An "accessor" has a synchronous destructor, which is often used in C++. In Java losing reference reachability is a non-event, so any locking scopes have to be explicit, they cannot be assumed.
Raf_Schietekat: "I don't see the connection between those things."
An "accessor" has a synchronous destructor, which is often used in C++. In Java losing reference reachability is a non-event, so any locking scopes have to be explicit, they cannot be assumed.
Ok. I see your point.
But if there is some object on the stack, it does not necessarily follow that an item in the hash map is locked. It only means that there are possibly some 'scoped semantics' for something, or maybe not.
Also I've added support for satellite data. The archive contains an example with strings used as keys and values. The lifetime of satellite data is managed with the PDR (Partial-copy-on-write Deferred Reclamation) system, so that the find operation remains purely read-only (no shared memory is modified).
I will resubmit it as a contribution.
The proposal is to separate the accessors from the container and make them flexible enough to enable several ways of accessing data and managing lifetime:
- "classical" RW protected access
- direct access for atomic or const values.
Thanks for expressing interest. We are working on the new interface specification of concurrent_unordered_map, but it has turned out to be more difficult and longer work than I expected, so unfortunately it won't come out soon.
Sure, the principle "pay as you go" should be kept.
Sure, the principle "pay as you go" should be kept.
Both variants require payment. Deferred garbage collection increases memory footprint and introduces latency. Prompt reclamation inevitably degrades performance and scalability.
So one will pay whether he goes or marks time :)
Btw, there are deferred reclamation schemes (I think that's a better name than GC in this context) which do provide guarantees regarding memory consumption (bounded memory consumption). Safe Memory Reclamation (SMR) is the best example: it guarantees that there will be no more than N*NUMBER_OF_THREADS (N is very small, probably 1) deferred objects. A proxy-collector based on differential reference counting (like the one I used in the hash map) can be modified to allow no more than N deferred objects. Such a solution provides a very good compromise between scalability and memory pressure.
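To illustrate where the N*NUMBER_OF_THREADS bound comes from, here is a hypothetical minimal hazard-pointer (SMR) sketch. All names are illustrative; each thread may publish at most one pointer, so with N = 1 at most one object per thread can ever be held back from reclamation.

```cpp
#include <atomic>

// Hypothetical minimal SMR (hazard pointer) sketch: one publishable slot
// per thread, hence at most 1 * MAX_THREADS deferred objects in flight.
const int MAX_THREADS = 8;
std::atomic<void*> g_hazard[MAX_THREADS];  // zero-initialized (static storage)

// Reader: publish the pointer, then re-validate that it is still current;
// the object is safe to dereference until the slot is cleared.
void* protect(int tid, std::atomic<void*>& src) {
    void* p;
    do {
        p = src.load(std::memory_order_acquire);
        g_hazard[tid].store(p, std::memory_order_seq_cst);  // publish
    } while (src.load(std::memory_order_acquire) != p);     // still current?
    return p;
}

void unprotect(int tid) {
    g_hazard[tid].store(nullptr, std::memory_order_release);
}

// Reclaimer: a retired object may be freed only if no thread has published it.
bool can_reclaim(void* p) {
    for (int i = 0; i < MAX_THREADS; ++i)
        if (g_hazard[i].load(std::memory_order_seq_cst) == p)
            return false;
    return true;
}
```

Unlike a proxy collector, this bounds deferred memory per thread, at the cost of a store plus re-validation on every read-side access.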
In the TBB roadmap:
http://softwareblogs.intel.com/2007/10/12/threading-building-blocks-product-roadmap/
there is a point:
"Container class additions and improvements
The TBB development team is working on new container classes, and interface improvements and enhanced performance for existing classes."
There are synchronization algorithms which are up to 100x faster on modern "moderate multi-core" systems than traditional lock-based ones, and they have near-linear scaling, so they are better suited to the "many-core future". In my opinion this is the way to bridle many-core. What do you think?
How do you measure CPU cycles so precisely? On my Linux/Xeon the resolution is 4000...
How do you measure CPU cycles so precisely? On my Linux/Xeon the resolution is 4000...
Resolution of what?
I used RDTSC: measure the time of 10^6 operations, then divide by 10^6.
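The aggregate measurement described here can be sketched as follows (the helper name is hypothetical; __rdtsc is the GCC/Clang intrinsic from <x86intrin.h>, and MSVC provides the same intrinsic in <intrin.h>):

```cpp
#include <x86intrin.h>  // __rdtsc (GCC/Clang); MSVC: <intrin.h>
#include <cstdint>

// Time 10^6 operations with the timestamp counter and divide, so the
// per-operation figure has sub-cycle resolution even though a single
// RDTSC read is itself fairly coarse.
template <class Op>
double cycles_per_op(Op op, uint64_t iters = 1000000) {
    uint64_t start = __rdtsc();
    for (uint64_t i = 0; i != iters; ++i)
        op();
    uint64_t end = __rdtsc();
    return double(end - start) / double(iters);
}
```

On modern CPUs one would also pin the thread to a core and beware of frequency scaling, since TSC ticks do not always equal core cycles.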
Resolution of what?
I used RDTSC: measure the time of 10^6 operations, then divide by 10^6.
Could you give some advice about "memory pool" implementation? I saw some implementations based on a free list plus lock protection.
But locks such as spin_mutex, and the other CAS-based implementations, are not regarded as scalable on multi-core platforms (right?). Is it necessary to design a new algorithm, as with your concurrent_hash_map?
Mitigate lock use, load-balance the free-list write requests across multiple sub-hash-maps? That means we cannot rely on the lock alone.
I am studying your code and the TBB concurrent data structures, trying to find some help.
Could you give some advice about "memory pool" implementation? I saw some implementations based on a free list plus lock protection.
But locks such as spin_mutex, and the other CAS-based implementations, are not regarded as scalable on multi-core platforms (right?). Is it necessary to design a new algorithm, as with your concurrent_hash_map?
Mitigate lock use, load-balance the free-list write requests across multiple sub-hash-maps? That means we cannot rely on the lock alone.
I am studying your code and the TBB concurrent data structures, trying to find some help.
I tend to prefer headerless slab allocators with per-thread caches, along the lines of the one you can find in TBB (scalable_alloc). The basic algorithm is as follows. Each thread has a bunch of caches for various block sizes (usually powers of 2). Allocation requests are served from those caches and are very cheap:
[cpp]void* alloc(size_t sz)
{
    cache* c = thread_caches[get_nearest_cache_idx(sz)];
    return c->pop();
}[/cpp]
During deallocation the thread checks whether the block came from its own cache or a remote one, and acts accordingly:
[cpp]void free(void* p)
{
    cache* c = get_cache_from_block(p);
    if (is_own(c))
        c->push(p);
    else
        c->remote_push(p);
}[/cpp]
Here is a sketch of the cache implementation:
[cpp]struct cache
{
    void* local_head;   // private free list, owner thread only
    void* remote_head;  // blocks freed by other threads

    void* pop()
    {
    retry:
        if (local_head)
        {
            void* p = local_head;
            local_head = *(void**)p;
            return p;
        }
        else if (remote_head)
        {
            // privatize the whole remote list in one atomic exchange
            local_head = ATOMIC_XCHG(remote_head, 0);
            goto retry;
        }
        else
        {
            local_head = alloc_more_blocks_from_system();
            goto retry;
        }
    }

    void push(void* p)
    {
        *(void**)p = local_head;
        local_head = p;
    }

    void remote_push(void* p)
    {
        // ATOMIC_CAS is assumed to update 'cmp' and return nonzero on failure
        void* cmp = remote_head;
        do
        {
            *(void**)p = cmp;
        }
        while (ATOMIC_CAS(&remote_head, &cmp, p));
    }
};[/cpp]
As you can see, the algorithm is naturally distributed and favors locality: if a thread both allocates and frees a block, the operations are atomic-free and do not touch any remote memory.
There are many improvements on the basic algorithm: batching of remote frees; privatization of freed blocks, which makes both alloc and free completely atomic-free; double-buffering of remote blocks; etc.
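As an illustration of the first improvement mentioned, batching of remote frees might look roughly like this. The names and the batch threshold are hypothetical, and std::atomic stands in for the ATOMIC_* macros used in the sketch above:

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical sketch of "batching of remote frees": a freeing thread chains
// remote blocks into a private batch and splices the whole batch onto the
// owner cache's remote list with a single CAS, instead of one CAS per block.
struct RemoteBatch {
    std::atomic<void*> remote_head{nullptr};  // owner cache's remote list
    void* batch_head = nullptr;               // freeing thread's private batch
    void* batch_tail = nullptr;
    size_t batch_size = 0;

    void remote_free(void* p) {               // no atomics on this fast path
        *(void**)p = batch_head;
        batch_head = p;
        if (batch_tail == nullptr)
            batch_tail = p;
        if (++batch_size >= 32)               // arbitrary batch threshold
            flush();
    }

    void flush() {                            // one CAS for the whole batch
        if (batch_head == nullptr)
            return;
        void* cmp = remote_head.load(std::memory_order_relaxed);
        do {
            *(void**)batch_tail = cmp;        // link batch in front of list
        } while (!remote_head.compare_exchange_weak(
                     cmp, batch_head,
                     std::memory_order_release, std::memory_order_relaxed));
        batch_head = batch_tail = nullptr;
        batch_size = 0;
    }
};
```

This amortizes contention on remote_head over the batch size and keeps the common free path entirely thread-local.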
Hi,
Does anyone know the performance status of concurrent_hash_map in Intel TBB 2.2 Update 3? I've noticed that the What's New for Intel TBB 2.2 (released after Dmitriy's original post) mentions "Improved performance of the concurrent hash map".
Thanks!
Is this update Production Ready?
Does it resolve the deadlock issues raised by Arch Robison?
Does it compile and run as 64 bit (MSVC/x64)?
BTW, the zip files you posted in Google Groups for Relacy and the concurrent data structures are broken. I believe Google has disabled zip file downloads. Can you make these files available elsewhere? Specifically the concurrent data structures.
> Is this update Production Ready?
No.
> Does it resolve the deadlock issues raised by Arch Robison?
This discussion is almost 2 years old and contains 8 pages; which issue exactly do you mean?
> Does it compile and run as 64-bit (MSVC/x64)?
Don't remember. Most likely no.
> BTW, the zip files you posted in Google Groups for Relacy and the concurrent data structures are broken. I believe Google has disabled zip file downloads. Can you make these files available elsewhere? Specifically the concurrent data structures.
Yeah, I know. Do you know an appropriate and reliable place for them?
I have some concerns about deadlock, because methods like find_and_lock return while holding a lock at a granularity larger than an individual record. For example, if a thread tries to acquire locks on two separate keys (say, in lexicographic order), it risks deadlock if the keys are in the same cell. Also, the grow method appears to lock all the cells before proceeding. So if a thread has a lock on an item and tries to insert another key that causes grow() to be called, does deadlock ensue?
The deadlock scenario outlined is impossible given correct usage of the hash map, which is not to lock more than one item at a time. The idea was just to split the update method into two parts to get by without lambdas, but logically it is a single operation which must not be "preempted" by other operations from the same thread.
And there are ways to deal with potential deadlocks, as outlined in post #5.