I am now using tbb::concurrent_vector to collect data of dynamic set in parallel tasks.
Compared with use of mutex, it is faster, however, I hope more speed up.
Is there way or idea to fasten the above problem?
(This is severe performance problem for me)
It is hard to tell whether the highest possible scalability was reached with tbb::concurrent_vector. Maybe tbb::concurrent_queue be more suitable for this task, could you please describe what type of data is collected and how it is stored? Also, you may try to use tbb::scalable_allocator inside the vector in order to increase allocation scalability inside parallel region.
In general, the concurrent push_back operation cannot be cheaper than one atomic operation. So, even in the best case the scalability is limited with scalability of atomic operations. The typical BKM is not to push_back each element but to organize elements into packs and use the grow_by operation to reserve a space for a pack and amortize atomic operations.
Another approach is to use thread locals, e.g. combinable to store elements into local vectors (e.g. std::vector), then combine them into the final vector. If a serial merge is too slow, you may want to consider enumerable_thread_specific and merge values in parallel into concurrent_vector (via grow_by operation).
As Nikita has mentioned, it is quite difficult to provide advise without understanding a particular use case. It would be great if you can provide some details.
Dear Alex and Nikita,
Thank you for your everytime kind replies.
As a conceptual explanation,
I am now implementing some finding algorithm of elements in binary tree. In recursive procedure, elements I query are collected.
Therefore, it is unclear that performance bottle neck is rooted in dynamical allocation or synchronization between tasks.
I will at first try the thread local storage (TLS) as Alex told me since TLS probably does not need synchronization, by which I would like to distinguish them.
I always appreciate your helpful comments, and respect sophisticated technical knowledge and arts of Intel's engineers.
Dear Alex and Nikita,
I am sorry for my questions.
After testing TLS and several performance trials, I understand the overhead may perhaps not be due to synchronization (or dynamic memory allocation), but due to increase of evaluation of binary traversal and check I query.
I deeply appreciate your kind advice, and to improve my performance, I am going to create tasks per recursive call to make the best of task stealing.
Thank you so much.
You are welcome.
When creating tasks per recursive call, pay attention that each task should have reasonable amount of work to amortize tasking and stealing overhead (however, it is recommended to create many more tasks than there are threads). So, you might want to think about some cutoff to stop parallel recursion and continue recursion serially. In addition, I would recommend using parallel_do or task_group to simplify task programming.
Thank you for your rapid and helpful comment. I keep in mind to avoid overhead of creation of too much tasks.
I am going to try performance enhancement according to your advice, and the book, "Outfitting C++ for Multi-core Processor Parallelism".