Re: Concurrent_unordered_set performance issue with clear()

FlorentD · ‎05-28-2022

Hi,

In my AI project I use a concurrent hash table with almost 20 000 000 items. Insert and find operations are fine, but destroying all the elements takes a long time.

This is my code:

typedef tbb::concurrent_unordered_set <Item*, Item::Hash, Item::Equality> ConcurrentHashTable;

ConcurrentHashTable m_explored;

// Fill the hash table with 20 000 000 items
...

// Release memory before destroying the hash table
for (auto& item : m_explored) { delete item; }

// Destroy the hashtable
m_explored.clear(); // performance issue here..
m_explored = ConcurrentHashTable(); // same performance issue here..

It takes about 10 seconds to clear the entire hash table.

With std::unordered_set, it takes only 2 seconds.

How can I fix it? (I use Windows 10.)

Note that I use the latest version of oneTBB: 2021.5

Thanks

NoorjahanSk_Intel · ‎06-01-2022

Hi,

Thanks for reaching out to us.

The performance decrease you are observing with tbb::concurrent_unordered_set might be because tbb::concurrent_unordered_set does not support concurrent erasure.

You can try using tbb::concurrent_hash_map as it supports concurrent insertion, lookup, and erasure.

Please refer to the below link for more details:

https://spec.oneapi.io/versions/latest/elements/oneTBB/source/containers/concurrent_hash_map_cls.html

Thanks & Regards,

Noorjahan.

Michael_V_Intel · ‎06-01-2022

I don’t think it is related to concurrent erasure, since there is no concurrent erasure in the example.

I think it is due to the inherent differences between serial and concurrent containers. There is sometimes a performance penalty for using a concurrent container instead of a sequential one. The concurrent containers are designed to scale as multiple threads access them concurrently. To safely support concurrent access, their internal structures are also different, including the need for additional memory, and this can lead to a penalty even for seemingly simple, non-concurrent operations such as clear. Even so, a 5x slowdown is unexpectedly large!

In previous cases, such as this one, we have found that using the tbb::scalable_allocator class with the container can reduce some of these overheads. You can find a deeper analysis of the possible causes in that other case. When I ran your test case on my system, I saw a slowdown, but not of the same magnitude as yours. Perhaps you can check whether you are using the scalable_allocator and, if not, see if that helps.
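To see why the allocator is worth checking at all, here is a minimal sketch that counts how many deallocations a single clear() triggers on a node-based hash set. It uses std::unordered_set only, so it says nothing directly about TBB's internals; CountingAllocator, g_deallocations and deallocations_during_clear are hypothetical names made up for this illustration.

```cpp
#include <cstddef>
#include <functional>
#include <memory>
#include <unordered_set>

// Hypothetical counter for this illustration; not part of TBB or the standard library.
static std::size_t g_deallocations = 0;

// Minimal allocator that forwards to std::allocator and counts deallocate() calls.
template <class T>
struct CountingAllocator {
    using value_type = T;

    CountingAllocator() = default;
    template <class U>
    CountingAllocator(const CountingAllocator<U>&) noexcept {}

    T* allocate(std::size_t n) { return std::allocator<T>{}.allocate(n); }

    void deallocate(T* p, std::size_t n) noexcept {
        ++g_deallocations;
        std::allocator<T>{}.deallocate(p, n);
    }
};

template <class T, class U>
bool operator==(const CountingAllocator<T>&, const CountingAllocator<U>&) noexcept { return true; }
template <class T, class U>
bool operator!=(const CountingAllocator<T>&, const CountingAllocator<U>&) noexcept { return false; }

using CountedSet =
    std::unordered_set<int, std::hash<int>, std::equal_to<int>, CountingAllocator<int>>;

// Inserts n distinct keys, then reports how many deallocations clear() performed.
std::size_t deallocations_during_clear(std::size_t n) {
    CountedSet s;
    for (std::size_t i = 0; i < n; ++i) s.insert(static_cast<int>(i));
    const std::size_t before = g_deallocations;
    s.clear();
    return g_deallocations - before;
}
```

Calling deallocations_during_clear(n) should report at least n, one per stored node, so with 20 000 000 items the speed of each individual free matters. Whether that is the dominant cost in the TBB container is an assumption that would need measuring.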

FlorentD · ‎06-01-2022

Hi Michael,

Indeed it is not a problem with erasure.

I am not sure I can use scalable_allocator because, according to the documentation:

The scalable_allocator requires the memory allocator library. If the library is missing, calls to the scalable allocator fail. In contrast to scalable_allocator, if the memory allocator library is not available, tbb_allocator falls back on std::malloc and std::free.

I am not sure I have the "memory allocator library". Do I need another DLL?

And what exactly is the scalable allocator? I don't understand its purpose from reading the official documentation:

https://spec.oneapi.io/versions/1.1-rev-1/elements/oneTBB/source/memory_allocation/scalable_allocator_cls.html

Note that your two links point to the same resource.

Thanks,

NoorjahanSk_Intel · ‎06-09-2022

Hi,

>>I am not sure to have the "memory allocator library". Do I need another DLL?

The scalable_allocator allocator template requires that the TBBmalloc library be available. This class is defined with #include <tbb/scalable_allocator.h>.

Please refer to the below link for more details:

https://oneapi-src.github.io/oneTBB/main/tbb_userguide/Scalable_Memory_Allocator.html

>>what is exactly the scalable allocator, I don't understand the purpose of it

Please refer to the link below for a better understanding of the scalable allocator, and get back to us if you face any issues.

https://link.springer.com/chapter/10.1007/978-1-4842-4398-5_7

Thanks & Regards,

Noorjahan.

NoorjahanSk_Intel · ‎06-19-2022

Hi,

We haven't heard back from you. Could you please provide an update on your issue?

Thanks & Regards,

Noorjahan.

FlorentD · ‎06-19-2022

Hi,

Same issue with the scalable allocator. Note that I also have a third DLL, libtbbmalloc_proxy. What is it for?

In the end, I no longer use the TBB unordered_set because of its performance with millions of entries. Maybe something could be improved in its implementation?

Thanks,

Mark_L_Intel · ‎06-21-2022

Hello,

The tbbmalloc proxy redirects ALL of the application's memory allocation calls to scalable_malloc. You can also write your own new/delete operators to define what should go to the scalable allocator.

This is a good article describing the TBB proxy approach:

https://www.infoworld.com/article/3201285/why-effective-parallel-programming-must-include-scalable-memory-allocation.html
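As a rough sketch of the "write your own new/delete operators" option mentioned above, the class below routes only its own allocations through a chosen allocation function. TrackedItem, raw_alloc, raw_free, g_live_items and the USE_TBB_MALLOC macro are hypothetical names for this illustration; scalable_malloc and scalable_free are the C functions declared in tbb/scalable_allocator.h, and that branch assumes the TBBmalloc library is linked.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdlib>
#include <new>

#ifdef USE_TBB_MALLOC  // hypothetical opt-in macro; requires linking the TBBmalloc library
#include <tbb/scalable_allocator.h>
static void* raw_alloc(std::size_t n) { return scalable_malloc(n); }
static void  raw_free(void* p)        { scalable_free(p); }
#else                  // default: plain C allocation, so the sketch builds without TBB
static void* raw_alloc(std::size_t n) { return std::malloc(n); }
static void  raw_free(void* p)        { std::free(p); }
#endif

// Live-object counter, for demonstration only.
static std::atomic<std::size_t> g_live_items{0};

// Stand-in for an application class; only objects of this type use raw_alloc/raw_free.
struct TrackedItem {
    int value = 0;

    static void* operator new(std::size_t n) {
        if (void* p = raw_alloc(n)) {
            ++g_live_items;
            return p;
        }
        throw std::bad_alloc{};
    }

    static void operator delete(void* p) noexcept {
        if (p) {
            --g_live_items;
            raw_free(p);
        }
    }
};
```

With this in place, new TrackedItem and delete item go through the selected functions, while every other allocation in the program is left alone. The proxy library, by contrast, replaces allocation for the whole process.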

Mark_L_Intel · ‎06-21-2022

One more thing: could you provide a complete reproducer (source) with the scalable allocator and the unordered set so I can file an internal ticket? If you could also consolidate all the information regarding the TBB version, OS version, the build command you used, the run command you used, etc., that would be greatly appreciated.
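A possible starting point for such a reproducer is sketched below. It is not the original poster's code: the Item class, its Hash and its Equality were never shown, so the versions here are stand-ins, and USE_TBB is a hypothetical opt-in macro (define it and link the TBB and TBBmalloc libraries to time the concurrent container with tbb::scalable_allocator; otherwise the std::unordered_set baseline is timed).

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <unordered_set>
#include <vector>

#ifdef USE_TBB  // hypothetical opt-in macro; requires the TBB and TBBmalloc libraries
#include <tbb/concurrent_unordered_set.h>
#include <tbb/scalable_allocator.h>
#endif

// Stand-in for the poster's Item; the real class, hash and equality are unknown.
struct Item {
    std::size_t id;
    struct Hash {
        std::size_t operator()(const Item* p) const { return std::hash<std::size_t>{}(p->id); }
    };
    struct Equality {
        bool operator()(const Item* a, const Item* b) const { return a->id == b->id; }
    };
};

#ifdef USE_TBB
using HashTable = tbb::concurrent_unordered_set<Item*, Item::Hash, Item::Equality,
                                                tbb::scalable_allocator<Item*>>;
#else
using HashTable = std::unordered_set<Item*, Item::Hash, Item::Equality>;
#endif

// Fills a table with n items, then times only the clear() call. Returns seconds.
double time_clear(std::size_t n) {
    std::vector<Item> storage(n);  // items are owned here, so no per-item delete is timed
    HashTable table;
    for (std::size_t i = 0; i < n; ++i) {
        storage[i].id = i;
        table.insert(&storage[i]);
    }
    assert(table.size() == n);

    const auto start = std::chrono::steady_clock::now();
    table.clear();
    const auto stop = std::chrono::steady_clock::now();

    assert(table.empty());
    return std::chrono::duration<double>(stop - start).count();
}
```

Calling time_clear(20000000) from main and printing the result for both build variants would give the two numbers being compared in this thread. Keeping the items in a vector is a deliberate simplification so that the per-item delete loop from the original post does not end up in the measurement.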

Mark_L_Intel · ‎06-28-2022

Hello,

If we do not receive a response within 5 days, this ticket will no longer be supported by Intel.