Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Issues with concurrent_unordered_map

rhl_
Beginner
1,691 Views
Hi,

I have some code which uses an unordered_map. The code is broken into two phases:

- Phase 1: Building the map
- Phase 2: Further computation, which uses the map in a read-only way

I was previously using boost::unordered_map, but it does not support concurrent insertion. I switched to the concurrent_unordered_map provided by TBB 4.0 (open source edition) and have been able to get parallel speedup in Phase 1,
but Phase 2 has slowed down dramatically.

In Phase 2 I have multiple threads reading concurrently from this object (this worked fine with the boost object); it seems that with the new object this is not so fine.

The way Phase 2 works is that it maintains a vector of pointers into the map, divided into equal-sized pieces, each processed in parallel; each thread iterates through its portion of the vector and performs some computation at every step.

The computation is basically of this form: it looks up the value for a given key in the map, then performs some find()s back into the map, stores the results of those finds, and keeps going.

Important: This code never modifies the underlying structure.
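For concreteness, the access pattern described above might look like the following sketch. It uses std::unordered_map and std::thread as stand-ins (the real code uses tbb::concurrent_unordered_map, and the actual Cell/Data/Hash types are not shown in the post); int keys and values and the phase2() helper are assumptions for illustration only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <unordered_map>
#include <vector>

// Stand-in for the post's tbb::concurrent_unordered_map<Cell, Data, Hash>.
using Map = std::unordered_map<int, int>;

// Phase 2 sketch: the pointer vector is split into equal pieces and each
// thread performs read-only find()s back into the shared map.
std::vector<int> phase2(const Map& m,
                        const std::vector<const Map::value_type*>& ptrs,
                        unsigned nthreads) {
    std::vector<int> results(ptrs.size(), 0);
    std::vector<std::thread> pool;
    std::size_t chunk = (ptrs.size() + nthreads - 1) / nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        pool.emplace_back([&, t] {
            std::size_t lo = t * chunk;
            std::size_t hi = std::min(ptrs.size(), lo + chunk);
            for (std::size_t i = lo; i < hi; ++i) {
                // Read-only lookup; the map is never modified in Phase 2.
                // Here the stored value is reused as a new key to mimic the
                // "find()s back into the map" step.
                auto it = m.find(ptrs[i]->second);
                results[i] = (it != m.end()) ? it->second : -1;
            }
        });
    }
    for (auto& th : pool) th.join();
    return results;
}
```

Each thread writes only its own slice of results, so Phase 2 itself needs no synchronization; with the TBB container, a tbb::parallel_for over a blocked_range would replace the manual thread pool.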

Is there some sort of locking going on in a call to find() (or maybe begin()/end()) that is causing this slowdown?

Can someone (preferably, if possible, the implementor(s) of this structure) give me some idea of what is going on under the hood: where does synchronization need to occur, and where might I _look_ for what is causing this?

Otherwise I will have to revert to the non-TBB container and take the serial performance hit with that container.
13 Replies
rhl_
Beginner
Hi, after some more investigation I am fairly confident that the problem with my slowdown lies within the find() method of the concurrent_unordered_map.
rhl_
Beginner
With only a single thread, I built (in plain old serial) a concurrent_unordered_map and a boost::unordered_map with exactly the same data, and then printed the bucket stats:

-- Bucket Data for Concurrent map --
Bucket count: 524,288
Load factor: 4

-- Bucket Data for Serial map --
Bucket count: 3,145,739
Load factor: 0.666664

just for sanity here are the typedef lines for each of the objects:
typedef tbb::concurrent_unordered_map<Cell, Data, Hash> Map;
typedef boost::unordered_map<Cell, Data, Hash> Map;

It looks like the two hash map implementations are dramatically different.
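For reference, the numbers above come from the containers' bucket interface; here is a minimal sketch using std::unordered_map, which exposes the same members as boost::unordered_map (and which tbb::concurrent_unordered_map mirrors):

```cpp
#include <cassert>
#include <cstdio>
#include <unordered_map>

// Prints the same two statistics quoted above for any unordered map that
// exposes the standard bucket interface.
void print_bucket_stats(const std::unordered_map<int, int>& m) {
    std::printf("Bucket count: %zu\n", m.bucket_count());
    std::printf("Load factor: %g\n", m.load_factor());  // size() / bucket_count()
}
```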

Anyone have a clue what is going on here?
rhl_
Beginner
Upon reading the source for concurrent_unordered_map, I have a number of questions:

for example, why is it that in internal_find() we have this:

// If bucket is empty, initialize it first
if (!is_initialized(bucket))
    init_bucket(bucket);

also, if the bucket is empty and we are going to initialize it, the key is clearly not in the list.
Yet the init_bucket() code adds a dummy iterator to the bucket so that the next giant for() loop executes, doing a bunch of seemingly useless things since the bucket is supposedly empty.

then it returns end().

if we are going to init_bucket() why not:

// If bucket is empty, initialize it first
if (!is_initialized(bucket)) {
    init_bucket(bucket);
    return end();
}


???

*hopes developers actually read this forum*
jimdempseyatthecove
Honored Contributor III
>>if the bucket is empty, and we are going to initalize it, the key is clearly not in the list.
yet the init_bucket() code adds a dummy iterator to the bucket so that the next giant for() loop executes, doing a bunch of seemingly useless things since the bucket is supposedly empty.

On a concurrent system, the moment after init_bucket(), a different thread may insert an item before your thread gets its opportunity to complete an insert (into what you think is a "supposedly" empty bucket).

Jim Dempsey
rhl_
Beginner
Oh, that's true. That's me not thinking clearly late at night.

Nevertheless, the load factor seems unnecessarily high, and the bucket count is off by a HUGE factor compared to the boost unordered_map.

The result on the load factor and bucket count appears to be the same no matter how many threads I am using.

Is this a limitation of a concurrent container, or simply an inferior implementation?

(Or am I just doing it wrong? I hope this is the answer)..
Anton_M_Intel
Employee
Hi
The implementation of concurrent_unordered_map is based on the split-ordered-list algorithm, which ensures no locking while resizing and rehashing the table for growth. Of course, it involves a lot of in-memory atomic synchronization to provide thread safety, and thus it is definitely slower than a serial algorithm (on one thread).
Unfortunately, there is no serial counterpart method for find() similar to concurrent_hash_map::equal_range().
BTW, I would add serial interfaces for all the concurrent containers, but our architect is against it. He doubts it has more advantages than potential problems.
As for the load factor, please use the method max_load_factor(0.6) to set the value you want.
Hint 1: please pay attention to your hash function, which may not be well suited to (our concurrent) hash tables with a size that is a power of two (please see the Reference). The hash function must provide good randomness in the lower bits, e.g. via multiplication by a prime number.
Hint 2: reserve enough initial buckets before growing the table, using the rehash(1000) method. This can further improve your Phase 1.
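Anton's two hints can be sketched as follows, using std::unordered_map since the same member names (max_load_factor, rehash) exist on tbb::concurrent_unordered_map; the make_map() helper and the expected_keys estimate are illustrative assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>

// Sketch of the two hints: cap the load factor and pre-reserve buckets
// before the build phase starts inserting.
std::unordered_map<int, int> make_map(std::size_t expected_keys) {
    std::unordered_map<int, int> m;
    m.max_load_factor(0.6f);      // Hint 1 companion: keep chains short
    // rehash() takes a *bucket* count, not an element count; with a max
    // load factor of 0.6 we need at least expected_keys / 0.6 buckets,
    // so reserving twice the key count is a safe over-estimate.
    m.rehash(expected_keys * 2);  // Hint 2: reserve before growing
    return m;
}
```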
rhl_
Beginner
Hi Anton,

thanks for the reply!

I in fact construct the table with as many buckets as I have keys. Again, this seems not to help. I can try calling rehash when doing this.

The slowness seems to be agnostic to the number of threads. However, it does seem to be due to these atomics and the unusual implementation. I actually set the load factor to 0.6 but it doesn't seem to help (if I do this, the built map ends up with a load_factor of 0.5). Is the load_factor in this implementation comparable to load factors for traditional hash tables?

My hash function is the one used by Aho in the dragon book (the compilers book), p. 436:

size_t h = 0, g;  // declarations added; the post confirms h is size_t
for (typename Container::const_iterator i = c.begin(); i != c.end(); ++i) {
    h = (h << 4) + *i;
    if ((g = h & 0xf0000000)) {
        h = h ^ (g >> 24);
        h = h ^ g;
    }
}
return h;

To be honest, I don't really understand this function, but it seems to be doing a good job for me.

I see your answer to my post on your blog, so what you seem to be suggesting is that using concurrent_hash_map should solve my problem? Does concurrent_hash_map's find() do synchronization? I see the Reference suggests that concurrent finds are fine in this structure, as long as you don't call insert/erase. Good :)

I will try implementing this in an hour or so and see if it improves things.
Anton_M_Intel
Employee
Yes, it is the "traditional" load factor. The growth of the table is not (it doubles the bucket count when the load reaches the max load factor). The 0.5 value looks good here.
But the hash function looks weak. The reason is that the lower 4 bits are always mapped directly from the last value in a string (I suppose "typename Container" is a string). One simple solution is a final multiplication by a prime.
Also, the highest 4 bits seem always zeroed. That is bad for big tables (I suppose 'h' is of size_t type? So, on 32-bit, the table can hold up to ~268M elements, which is a fairly realistic db size; for smaller types it is a catastrophe).
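The low-bit observation is easy to verify. Below is a self-contained version of the hash posted earlier in the thread, with the missing declarations filled in and h fixed to 32 bits purely for this demonstration:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Self-contained version of the posted Aho/PJW-style hash (h and g are
// declared here; width fixed to 32 bits for the demonstration).
std::uint32_t aho_hash(const std::vector<unsigned>& c) {
    std::uint32_t h = 0, g;
    for (unsigned v : c) {
        h = (h << 4) + v;           // low 4 bits now come from v alone
        if ((g = h & 0xf0000000u)) {
            h = h ^ (g >> 24);      // touches bits 4..7 only
            h = h ^ g;              // touches bits 28..31 only
        }
    }
    return h;
}
```

Since (h << 4) zeroes the low 4 bits before each addition and the folding step only touches bits 4 and above, the final low 4 bits are exactly the low 4 bits of the last element. In a power-of-two table the bucket index is hash & (bucket_count - 1), so keys whose last element shares those bits pile into a fraction of the buckets.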
Yes, concurrent_hash_map should fit your needs. Sure, find() is a concurrent operation which is thread-safe and can be executed concurrently with insert()/erase(). So your wording above is not precise: what the Reference says is that find() invalidates (serial) iterators, as do the other concurrent operations. But it can still be used together with iterators if rehash() was called after insert/erase.
rhl_
Beginner
It would be safer to assume Container is vector<unsigned int> (because that is what it is),

and yes h is of size_t.

the hash works very well with boost::unordered_map.

What I meant by synchronization is: if I call multiple find()s at once, will the calls trample each other, i.e. will some form of synchronization (locks, atomics, etc.) prevent each find() from doing its job? (Note: two find() calls will never be made for the same key by two separate threads.)
RafSchietekat
Valued Contributor III
"What I meant by synchronization is: if I call multiple find()s at once, will the calls trample each other, i.e. will some form of synchronization (locks, atomics, etc.) prevent each find() from doing its job? (Note: two find() calls will never be made for the same key by two separate threads.)"
find() operations on different keys do not block each other, and that includes the resulting accessor instances' lifetimes.
Anton_M_Intel
Employee
>>the hash works very well with boost::unordered_map
Nevertheless, for our hash tables your hash function is definitely weak, because they have a size that is a power of 2 and so are vulnerable to poor randomness in the lower bits of the hash values.
RafSchietekat
Valued Contributor III
#8 "The hash function must provide good randomness in lower bits, e.g. via multiplication by a prime number."
Such multiplication cannot transfer randomness from higher to lower bits, only the other way around.

How about multiplying by internal::hash_multiplier inside the hashing containers (allowing for a simplification of at least one tbb_hasher overload), and then using mainly the most significant bits of the product? Just an idea; I don't know what impact it would have on performance.
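Raf's suggestion can be sketched as follows. The multiplier below is the prime near 2^32/phi commonly used for this style of multiplicative hashing; it is only a stand-in, since the actual value of TBB's internal::hash_multiplier is an implementation detail.

```cpp
#include <cassert>
#include <cstdint>

// Multiply the raw hash by a fixed odd constant and take the *high* bits of
// the product as the bucket index. Multiplication propagates entropy upward,
// so the top bits mix in all input bits, whereas the low bits of the product
// still depend only on the low input bits (Raf's point above).
std::uint32_t bucket_index(std::uint32_t raw_hash, unsigned log2_buckets) {
    return (raw_hash * 2654435761u) >> (32 - log2_buckets);
}
```

Two hashes that agree in all their low bits but differ high up would collide under plain masking, yet land in different buckets here.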
Anton_M_Intel
Employee
Thanks, Raf! You are right. We are looking at what can be done for our tbb_hasher and the Reference.