Implementation uses a bunch of advanced synchronization techniques, like timestamp validation, deferred memory reclamation and cache conscious data layout. As a result this implementation beats TBB's concurrent_hash_map by a factor of 50 on read-mostly workload on Intel Core 2 Quad (Q6600), and by a factor of 20 on modest write workload. The cost of read transaction (find operation) is about 30 cycles, which is basically equal to that of single-threaded hash map. Algorithm also have perfect linear scalability on read-mostly workload so on greater number of cores/processors it will have even higher performance difference with traditional lock-based synchronization (on 8 cores I expect >100x).
Such synchronization techniques, I believe, is the way to deal with future many-core systems.