Here are benchmark results on a Q6600.
Element count = 256, readonly workload, 8 threads.
TBB concurrent_hash_map:
Single core: 338 cycles/find
Cores 0&1: 218 cycles/find (scaling 1.55)
Cores 0&2: 540 cycles/find (scaling 0.63)
All cores: 353 cycles/find (scaling 0.96)
My hash map:
Single core: 30 cycles/find
Cores 0&1: 15 cycles/find (scaling 2)
Cores 0&2: 15 cycles/find (scaling 2)
All cores: 7 cycles/find (scaling 4)
Element count = 65536, readonly workload, 8 threads.
TBB concurrent_hash_map:
Single core: 383 cycles/find
Cores 0&1: 212 cycles/find (scaling 1.80)
Cores 0&2: 584 cycles/find (scaling 0.66)
All cores: 363 cycles/find (scaling 1.06)
My hash map:
Single core: 42 cycles/find
Cores 0&1: 21 cycles/find (scaling 2)
Cores 0&2: 21 cycles/find (scaling 2)
All cores: 10 cycles/find (scaling 4)
Element count = 256, mixed workload (4 threads make 100% finds; 4 threads make 50% finds, 25% inserts, 25% removes).
TBB concurrent_hash_map:
Single core: 348 cycles/find
Cores 0&1: 212 cycles/find (scaling 1.64)
Cores 0&2: 477 cycles/find (scaling 0.73)
All cores: 304 cycles/find (scaling 1.14)
My hash map:
Single core: 51 cycles/find
Cores 0&1: 27 cycles/find (scaling 1.89)
Cores 0&2: 31 cycles/find (scaling 1.65)
All cores: 15 cycles/find (scaling 3.4)
Element count = 65536, mixed workload (4 threads make 100% finds; 4 threads make 50% finds, 25% inserts, 25% removes).
TBB concurrent_hash_map:
Single core: 377 cycles/find
Cores 0&1: 206 cycles/find (scaling 1.83)
Cores 0&2: 516 cycles/find (scaling 0.73)
All cores: 322 cycles/find (scaling 1.17)
My hash map:
Single core: 74 cycles/find
Cores 0&1: 35 cycles/find (scaling 2.12)
Cores 0&2: 43 cycles/find (scaling 1.72)
All cores: 22 cycles/find (scaling 3.36)
On a read-mostly workload this algorithm is up to 50x faster than TBB's hash map; on a mixed workload it is up to 30x faster. And scaling is substantially better, so on an 8-core system the speedup can be 100x or higher.
Memory consumption is 25% lower compared with the TBB hash map.
A few notes on the algorithm.
First of all, it uses amortized proxy-collector memory reclamation (implementation included in the archive). It allows read-only transactions to be purely read-only, i.e. no modifications of shared state.
Read-only access (the find operation) uses a kind of sequence mutex (so-called SeqLock) to ensure consistency. But I made some modifications to the SeqLock so that read access is obstruction-free, as opposed to the classical SeqLock, which is actually blocking if a reader conflicts with a writer.
Mutators (insert/remove) use fine-grained locking with respect to other mutators, and atomic modifications with respect to readers.
The data layout is cache-conscious, so most operations rarely touch (either write or read) more than one cache line.
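The read protocol described above can be sketched roughly as follows. This is a minimal classical-SeqLock cell with hypothetical names; the post's actual variant additionally makes readers obstruction-free, which this sketch does not show.

```cpp
#include <atomic>

// Minimal classical-SeqLock cell (hypothetical names). A writer keeps the
// sequence odd for the duration of an update; a reader retries if it saw an
// odd or changed sequence. Note: the plain reads/writes of key/value are a
// benign-race idiom, not strictly conforming C++; a fully conforming version
// would make the data fields relaxed atomics.
struct SeqLockCell {
    std::atomic<unsigned> seq{0};
    int key = 0;
    int value = 0;

    void write(int k, int v) {  // mutators assumed serialized elsewhere
        seq.store(seq.load(std::memory_order_relaxed) + 1,
                  std::memory_order_relaxed);         // odd: update in progress
        std::atomic_thread_fence(std::memory_order_release);
        key = k;
        value = v;
        seq.store(seq.load(std::memory_order_relaxed) + 1,
                  std::memory_order_release);         // even: stable again
    }

    bool read(int k, int* out) {  // purely read-only: no shared-state writes
        for (;;) {
            unsigned s1 = seq.load(std::memory_order_acquire);
            if (s1 & 1)
                continue;                             // writer active: retry
            int kk = key;
            int vv = value;
            std::atomic_thread_fence(std::memory_order_acquire);
            if (seq.load(std::memory_order_relaxed) != s1)
                continue;                             // torn read: retry
            if (kk != k)
                return false;                         // consistent; key absent
            *out = vv;
            return true;
        }
    }
};
```

The point of the scheme is that the reader writes nothing to shared memory, so read-side cache lines stay in shared state across cores.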
Raf_Schietekat:
"Java/C# don't need accessors on read side at all. GC will do all the work." Exactly, but that's also why nobody would expect any synchronous locking semantics.
I don't see the connection between those things.
Raf_Schietekat:
If all the object provides is, e.g., a snapshot, then that's what it should be called, so that the user knows what he's dealing with; changing the semantics of an "accessor" is just not cool.
This is the question I was asking earlier in this topic: what are the *official* semantics of TBB's accessors now? Do they have to provide more than just access to the object? Well, at least their name suggests that they must provide access to the object, just like a smart pointer or a Java reference.
An "accessor" has a synchronous destructor, which is often used in C++. In Java losing reference reachability is a non-event, so any locking scopes have to be explicit, they cannot be assumed.
Raf_Schietekat: "I don't see the connection between those things."
An "accessor" has a synchronous destructor, which is often used in C++. In Java losing reference reachability is a non-event, so any locking scopes have to be explicit, they cannot be assumed.
Ok. I see your point.
But if there is some object on the stack, it does not necessarily follow that an item in the hash map is locked. It only means that there are possibly some 'scoped semantics' for something, or maybe not.
Also I've added support for satellite data. The archive contains an example with strings used as keys and values. The lifetime of satellite data is managed with the PDR (Partial-copy-on-write Deferred Reclamation) system, so that the find operation remains purely read-only (no shared memory is modified).
I will resubmit it as a contribution.
The proposal is to separate the accessors from the container and make them flexible enough to enable several ways of accessing data and managing lifetime:
- "classical" RW protected access
- direct access for atomic or const values.
Thanks for expressing interest. We are working on the new interface specification of concurrent_unordered_map, but it has turned out to be more difficult and longer work than I expected, so unfortunately it won't come out soon.
Sure, the principle "pay as you go" should be kept.
Sure, the principle "pay as you go" should be kept.
Both variants require payment. Deferred garbage collection increases memory footprint and introduces latency. Prompt reclamation inevitably degrades performance and scalability.
So one will pay whether he goes or marks time :)
Btw, there are deferred reclamation schemes (I think that's a better name than GC in this context) which do provide guarantees regarding memory consumption (bounded memory consumption). Safe Memory Reclamation (SMR) is the best example: it guarantees that there will be no more than N*NUMBER_OF_THREADS (N is very small, probably 1) deferred objects. A proxy-collector based on differential reference counting (like the one I used in the hash map) can be modified to allow no more than N deferred objects. Such a solution provides a very good compromise between scalability and memory pressure.
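To illustrate where the N*NUMBER_OF_THREADS bound comes from, here is a hypothetical minimal hazard-pointer (SMR) sketch. All names are illustrative; each thread may publish at most one pointer, so with N = 1 at most one object per thread can ever be held back from reclamation.

```cpp
#include <atomic>

// Hypothetical minimal SMR (hazard pointer) sketch: one publishable slot
// per thread, hence at most 1 * MAX_THREADS deferred objects in flight.
const int MAX_THREADS = 8;
std::atomic<void*> g_hazard[MAX_THREADS];  // zero-initialized (static storage)

// Reader: publish the pointer, then re-validate that it is still current;
// the object is safe to dereference until the slot is cleared.
void* protect(int tid, std::atomic<void*>& src) {
    void* p;
    do {
        p = src.load(std::memory_order_acquire);
        g_hazard[tid].store(p, std::memory_order_seq_cst);  // publish
    } while (src.load(std::memory_order_acquire) != p);     // still current?
    return p;
}

void unprotect(int tid) {
    g_hazard[tid].store(nullptr, std::memory_order_release);
}

// Reclaimer: a retired object may be freed only if no thread has published it.
bool can_reclaim(void* p) {
    for (int i = 0; i < MAX_THREADS; ++i)
        if (g_hazard[i].load(std::memory_order_seq_cst) == p)
            return false;
    return true;
}
```

Unlike a proxy collector, this bounds deferred memory per thread, at the cost of a store plus re-validation on every read-side access.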
In the TBB roadmap:
http://softwareblogs.intel.com/2007/10/12/threading-building-blocks-product-roadmap/
there is a point:
"Container class additions and improvements
The TBB development team is working on new container classes, and interface improvements and enhanced performance for existing classes."
There are synchronization algorithms which are up to 100x faster on modern "moderate multi-core" systems than traditional lock-based ones, and they have near-linear scaling, so they are better suited to the "many-core future". In my opinion this is the way to bridle many-core. What do you think?
How do you measure CPU cycles so precisely? On my Linux/Xeon the resolution is 4000...
How do you measure CPU cycles so precisely? On my Linux/Xeon the resolution is 4000...
Resolution of what?
I used RDTSC: measure the time of 10^6 operations, then divide by 10^6.
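The aggregate measurement described here can be sketched as follows (the helper name is hypothetical; __rdtsc is the GCC/Clang intrinsic from <x86intrin.h>, and MSVC provides the same intrinsic in <intrin.h>):

```cpp
#include <x86intrin.h>  // __rdtsc (GCC/Clang); MSVC: <intrin.h>
#include <cstdint>

// Time 10^6 operations with the timestamp counter and divide, so the
// per-operation figure has sub-cycle resolution even though a single
// RDTSC read is itself fairly coarse.
template <class Op>
double cycles_per_op(Op op, uint64_t iters = 1000000) {
    uint64_t start = __rdtsc();
    for (uint64_t i = 0; i != iters; ++i)
        op();
    uint64_t end = __rdtsc();
    return double(end - start) / double(iters);
}
```

On modern CPUs one would also pin the thread to a core and beware of frequency scaling, since TSC ticks do not always equal core cycles.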
Resolution of what?
I used RDTSC: measure the time of 10^6 operations, then divide by 10^6.
Could you give some advice about "memory pool" implementation? I saw some implementations based on a free list plus lock protection.
But locks such as spin_mutex, and the other CAS-based implementations, are not regarded as scalable on multi-core platforms (right?). Is it necessary to design a new algorithm, as with your concurrent_hash_map?
Mitigate lock use, load-balance the free-list write requests across multiple sub-hash-maps? That means we cannot rely on the lock alone.
I am studying your code and the TBB concurrent data structures, trying to find some help.
Could you give some advice about "memory pool" implementation? I saw some implementations based on a free list plus lock protection.
But locks such as spin_mutex, and the other CAS-based implementations, are not regarded as scalable on multi-core platforms (right?). Is it necessary to design a new algorithm, as with your concurrent_hash_map?
Mitigate lock use, load-balance the free-list write requests across multiple sub-hash-maps? That means we cannot rely on the lock alone.
I am studying your code and the TBB concurrent data structures, trying to find some help.
I tend to prefer headerless slab allocators with per-thread caches, along the lines of the one you can find in TBB (scalable_alloc). The basic algorithm is as follows. Each thread has a bunch of caches for various block sizes (usually powers of 2). Allocation requests are served from those caches and are very cheap:
[cpp]void* alloc(size_t sz)
{
    cache* c = thread_caches[get_nearest_cache_idx(sz)];
    return c->pop();
}[/cpp]
During deallocation the thread checks whether the block came from its own cache or a remote one, and acts accordingly:
[cpp]void free(void* p)
{
    cache* c = get_cache_from_block(p);
    if (is_own(c))
        c->push(p);
    else
        c->remote_push(p);
}[/cpp]
Here is a sketch of the cache implementation:
[cpp]struct cache
{
    void* local_head;   // private free list, owner thread only
    void* remote_head;  // blocks freed by other threads

    void* pop()
    {
    retry:
        if (local_head)
        {
            void* p = local_head;
            local_head = *(void**)p;
            return p;
        }
        else if (remote_head)
        {
            // privatize the whole remote list in one atomic exchange
            local_head = ATOMIC_XCHG(remote_head, 0);
            goto retry;
        }
        else
        {
            local_head = alloc_more_blocks_from_system();
            goto retry;
        }
    }

    void push(void* p)
    {
        *(void**)p = local_head;
        local_head = p;
    }

    void remote_push(void* p)
    {
        // ATOMIC_CAS is assumed to update 'cmp' and return nonzero on failure
        void* cmp = remote_head;
        do
        {
            *(void**)p = cmp;
        }
        while (ATOMIC_CAS(&remote_head, &cmp, p));
    }
};[/cpp]
As you can see, the algorithm is naturally distributed and favors locality: if a thread both allocates and frees a block, the operations are atomic-free and do not touch any remote memory.
There are many improvements on the basic algorithm: batching of remote frees; privatization of freed blocks, which makes both alloc and free completely atomic-free; double-buffering of remote blocks; etc.
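As an illustration of the first improvement mentioned, batching of remote frees might look roughly like this. The names and the batch threshold are hypothetical, and std::atomic stands in for the ATOMIC_* macros used in the sketch above:

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical sketch of "batching of remote frees": a freeing thread chains
// remote blocks into a private batch and splices the whole batch onto the
// owner cache's remote list with a single CAS, instead of one CAS per block.
struct RemoteBatch {
    std::atomic<void*> remote_head{nullptr};  // owner cache's remote list
    void* batch_head = nullptr;               // freeing thread's private batch
    void* batch_tail = nullptr;
    size_t batch_size = 0;

    void remote_free(void* p) {               // no atomics on this fast path
        *(void**)p = batch_head;
        batch_head = p;
        if (batch_tail == nullptr)
            batch_tail = p;
        if (++batch_size >= 32)               // arbitrary batch threshold
            flush();
    }

    void flush() {                            // one CAS for the whole batch
        if (batch_head == nullptr)
            return;
        void* cmp = remote_head.load(std::memory_order_relaxed);
        do {
            *(void**)batch_tail = cmp;        // link batch in front of list
        } while (!remote_head.compare_exchange_weak(
                     cmp, batch_head,
                     std::memory_order_release, std::memory_order_relaxed));
        batch_head = batch_tail = nullptr;
        batch_size = 0;
    }
};
```

This amortizes contention on remote_head over the batch size and keeps the common free path entirely thread-local.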
Hi,
Does anyone know the performance status of concurrent_hash_map in Intel TBB 2.2 Update 3? I've noticed that the What's New for Intel TBB 2.2 (released after Dmitriy's original post) mentions "Improved performance of the concurrent hash map".
Thanks!
Is this update Production Ready?
Does it resolve the deadlock issues raised by Arch Robison?
Does it compile and run as 64 bit (MSVC/x64)?
BTW, the zip files you posted in Google Groups for Relacy and the concurrent data structures are broken. I believe Google has disabled zip file downloads. Can you make these files available elsewhere? Specifically the concurrent data structures.
> Is this update Production Ready?
No.
> Does it resolve the deadlock issues raised by Arch Robison?
This discussion is almost 2 years old and contains 8 pages; which issue exactly do you mean?
> Does it compile and run as 64-bit (MSVC/x64)?
Don't remember. Most likely no.
> BTW, the zip files you posted in Google Groups for Relacy and the concurrent data structures are broken. I believe Google has disabled zip file downloads. Can you make these files available elsewhere? Specifically the concurrent data structures.
Yeah, I know. Do you know an appropriate and reliable place for them?
I have some concerns about deadlock, because methods like find_and_lock return while holding a lock at a granularity larger than an individual record. For example, if a thread tries to acquire locks on two separate keys (say, in lexicographic order), it risks deadlock if the keys are in the same cell. Also, the grow method appears to lock all the cells before proceeding. So if a thread has a lock on an item and tries to insert another key that causes grow() to be called, does deadlock ensue?
The deadlock scenario outlined is impossible given correct usage of the hash map, which is not to lock more than one item at a time. The idea was just to split the update method into two parts to get by without lambdas, but logically it is a single operation which must not be "preempted" by other operations from the same thread.
And there are ways to deal with potential deadlocks, as outlined in post #5.