Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.
This community is designed for sharing of public information. Please do not share Intel or third-party confidential information here.

Performance of concurrent_hash_map growth


Before using concurrent_hash_map in parallel code I would like to initialize it with a huge (~million and more) number of keys. I am using concurrent_hash_map::insert() in a loop and observe a dramatic performance degradation with a map growth that is not acceptable for me (please see the log below). STL map grows much faster. Possible solution for me could be some API in concurrent_hash_map allowing a quick serial initialization with no overhead for locks etc., for example unsafe_insert(). If a reason of perofrmance degradation is in the memory allocations, is it possible to preallocate the neccessary size?

-I-: Map size=10000 Time=0.09885 sec Time delta=0.098992 sec
-I-: Map size=20000 Time=0.400218 sec Time delta=0.301233 sec
-I-: Map size=30000 Time=0.938947 sec Time delta=0.538718 sec
-I-: Map size=40000 Time=1.52345 sec Time delta=0.584489 sec
-I-: Map size=50000 Time=2.65433 sec Time delta=1.13086 sec
-I-: Map size=60000 Time=4.69511 sec Time delta=2.04078 sec
-I-: Map size=70000 Time=8.09903 sec Time delta=3.4039 sec
-I-: Map size=80000 Time=13.4937 sec Time delta=5.39462 sec
-I-: Map size=90000 Time=20.6317 sec Time delta=7.13807 sec
-I-: Map size=100000 Time=29.8772 sec Time delta=9.24545 sec
-I-: Map size=110000 Time=41.3442 sec Time delta=11.467 sec
-I-: Map size=120000 Time=54.5208 sec Time delta=13.1766 sec
-I-: Map size=130000 Time=69.704 sec Time delta=15.1832 sec
-I-: Map size=140000 Time=86.9157 sec Time delta=17.2117 sec
-I-: Map size=150000 Time=105.944 sec Time delta=19.0281 sec
-I-: Map size=160000 Time=126.076 sec Time delta=20.1317 sec
-I-: Map size=170000 Time=148.236 sec Time delta=22.1606 sec
-I-: Map size=180000 Time=172.054 sec Time delta=23.8178 sec
-I-: Map size=190000 Time=197.327 sec Time delta=25.2729 sec
-I-: Map size=200000 Time=223.773 sec Time delta=26.4461 sec
-I-: Map size=210000 Time=242.954 sec Time delta=19.1812 sec
-I-: Map size=220000 Time=260.733 sec Time delta=17.7786 sec
-I-: Map size=230000 Time=278.89 sec Time delta=18.157 sec
-I-: Map size=240000 Time=297.838 sec Time delta=18.9478 sec
-I-: Map size=250000 Time=317.326 sec Time delta=19.4883 sec
-I-: Map size=260000 Time=337.068 sec Time delta=19.7423 sec
-I-: Map size=270000 Time=353.319 sec Time delta=16.2504 sec
-I-: Map size=280000 Time=374.058 sec Time delta=20.7392 sec
-I-: Map size=290000 Time=395.48 sec Time delta=21.4224 sec
-I-: Map size=300000 Time=419.424 sec Time delta=23.9436 sec
-I-: Map size=310000 Time=453.983 sec Time delta=34.5595 sec
-I-: Map size=320000 Time=486.734 sec Time delta=32.751 sec

I use concurrent_hash_map with the keys being 24-byte objects, values being integersand hash function returning a unique integer object id. Map size is the number of keys in map.

Fastune shows the most time (>70%) is spent in concurrent_hash_map::lookup() function.

0 Kudos
2 Replies
Black Belt
Serial performance may have been overlooked so far (although you don't specify exactly how much slower it is than STL where you add a hash code computation for each insertion because std::map uses order instead of hashing, nor what TBB version you are using), but meanwhile what are your results with parallelised initialisation, if possible (unless you're running more than (currently) 64 cores and assuming a good hashing function with maximum randomness in the lowest-value bits you should get very good scalability)? If your hash function is rather expensive, with 24 bytes in the key already you might want to consider cacheing the hash code (inside the key, so it holds the hash code plus the normal key), and then you could tell me whether this is just an idle obsession of mine or not.

(Correction corrected) I reverted an earlier correction because hash_map is indeed not yet part of the C++ Standard Library (C++0x will have an unordered_map), but some kind of STL hash_map could of course be used as a comparison point (although there are allegedly subtle differences). TBB uses direct chaining, array allocation takes O(n) amortised time (like std::vector) and seems unlikely to be the cause of the problem, pre-allocation would have to assume near-perfect hashing module 64. You also might want to check out the changes I contributed in "concurrent_hash_map::erase", because its fine-grained locking to make concurrent array growth possible will probably have even worse serial performance. :-)

Thank you for the reply. My hash function is cached and not time-expensive. But I was mistaken regarding my hash function: it occured to be not unique for every key. And that was the reason of long lookups in the table.