Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Concurrent_unordered_set performance trouble

diedler_f_
Beginner

Hi,

In my AI project, I use a concurrent hash table with almost 20,000,000 items. Insert and find operations are no problem, but destroying all the elements takes a very long time...

This is my code:

typedef tbb::concurrent_unordered_set<MY_NODE*, MY_NODE::Hash, MY_NODE::Equality> ConcurrentHashTable;

// Release memory before destroying the hash table
for (auto& b : this->m_explored) { delete b; }

// Destroy the hashtable
this->m_explored = ConcurrentHashTable(); // performance issue here..

I also tried clear(), but the result is the same: it takes about 10 seconds to clear the entire hash table...

With std::unordered_set, it takes only 2 seconds.
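For reference, here is a simplified, standalone sketch of how I measure it (plain int keys standing in for my MY_NODE pointers, and the default hash/equality instead of my own):

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <unordered_set>
#include <tbb/concurrent_unordered_set.h>

// Fill a set with n integer keys, then time how long clearing it takes.
template <typename Set>
double time_clear(std::size_t n) {
    Set s;
    for (std::size_t i = 0; i < n; ++i)
        s.insert(static_cast<int>(i));
    auto start = std::chrono::steady_clock::now();
    s.clear();  // same slow step as assigning a fresh container
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

int main() {
    const std::size_t n = 20000000;  // roughly the size of my real table
    std::printf("std::unordered_set            : %.2f s\n",
                time_clear<std::unordered_set<int> >(n));
    std::printf("tbb::concurrent_unordered_set : %.2f s\n",
                time_clear<tbb::concurrent_unordered_set<int> >(n));
    return 0;
}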

How can I fix this? (I use Windows 10.)

Thanks

Alexey-Kukanov
Employee

Thanks for reporting the issue. We will investigate and get back to you.

diedler_f_
Beginner

Thanks,

Please keep me posted.

Alexey-Kukanov
Employee

We have reproduced and looked at the issue. Unfortunately, we see no way to fix it.

First, the internal structure of our concurrent unordered containers requires approximately 25% more allocations than the number of elements. But this is a small issue.

The bigger part of the slowdown is caused by the standard free() function (which redirects to the HeapFree() WinAPI function). The internal structure of both containers causes the calls to free() to be made in a very different order than the corresponding calls to malloc(). However, for some reason the allocation/deallocation pattern of the TBB concurrent_unordered_set makes free() run a few times slower than with the C++ unordered_set. I do not have an explanation for that (I guess it can be caused by some hardware effects), and we cannot fix this problem either.

My recommendation is to try the tbb::scalable_allocator class with the container:

typedef tbb::concurrent_unordered_set<MY_NODE*, MY_NODE::Hash, MY_NODE::Equality,
                                      tbb::scalable_allocator<MY_NODE*> > ConcurrentHashTable;

In my experiments, both insertion into and clearing of the container run faster with it than with std::allocator. Clearing a concurrent_unordered_set is still slower than clearing a std::unordered_set, but by a much smaller margin.

diedler_f_
Beginner

Hi,

Thanks for your investigation.

Unfortunately, I have other problems with tbb::scalable_allocator instead of the standard allocator, because I need the memory to be released immediately when requested.

Is tbb::scalable_allocator the same thing as the tbbmalloc_proxy DLL? I remember that the memory is not released right away when I use the tbbmalloc DLL...

Thanks

Alexey-Kukanov
Employee

When you say "I need the memory to be released immediately", what do you actually mean?

All allocators "cache" some portion of released memory for future use. It is true that the TBB memory allocator caches more memory than the standard allocator; this is necessary to obtain better performance in parallel programs. Also, if several different allocators are used, they cannot share memory, so the overall working set might increase.
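As a side note, recent versions of the TBB memory allocator provide the scalable_allocation_command() function (declared in tbb/scalable_allocator.h) that asks tbbmalloc to return its internally cached buffers. A minimal sketch, assuming your TBB version includes it:

#include <tbb/scalable_allocator.h>

void release_cached_buffers() {
    // Ask tbbmalloc to release the memory it keeps cached internally.
    // Returns TBBMALLOC_OK on success, or TBBMALLOC_NO_EFFECT if there
    // was nothing to release. The second parameter is reserved and must be NULL.
    scalable_allocation_command(TBBMALLOC_CLEAN_ALL_BUFFERS, NULL);
}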

There is a difference between tbb::scalable_allocator and tbbmalloc_proxy.dll. The underlying memory manager, tbbmalloc.dll, is the same, but the API is different. scalable_allocator is a standard-compliant C++ allocator that lets you use tbbmalloc for selected containers only. The tbbmalloc_proxy DLL is the way to automatically redirect all C/C++ allocations in your program (i.e. those done with malloc, operator new, std::allocator, etc.) to tbbmalloc.
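For illustration, here is a rough sketch of the two usage models, assuming a Windows build that links the corresponding TBB malloc libraries:

// (a) Per-container: only containers that name tbb::scalable_allocator are
//     served by tbbmalloc; the rest of the program keeps using the CRT heap.
#include <vector>
#include <tbb/scalable_allocator.h>

// (b) Whole program: including this header in any one source file of the
//     application (and linking tbbmalloc_proxy.lib) redirects all malloc/free
//     and operator new/delete calls in the process to tbbmalloc.
#include <tbb/tbbmalloc_proxy.h>

int main() {
    std::vector<int, tbb::scalable_allocator<int> > numbers(1000, 42);  // model (a)
    return 0;
}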
