We've had some discussions lately regarding concurrent_vector and erasing elements, and I appreciate the feedback I've been given. I'm learning more about the internals of concurrent_vector, and some of the design considerations made.
The fundamental problem is that concurrent_vector must retain the ordering of its internal elements. I think this is what leads to the problems we have been discussing. Personally, I just need the concurrent_vector as "a place to put stuff" with good parallel insert and delete performance.
This requirement might be better served by some form of std::set that supports insert/delete and concurrent operations with good performance... and somehow makes good use of the cache.
Has this been attempted by the TBB team? Would it be best to just put a wrapper around an ordinary std::set?
Adrien, I did make a suggestion on the other thread which may provide a solution to your "holding set" design problem: just add a "valid" flag to the concurrent_vector element type.

Creating an enhanced std::set that supports insert/delete AND concurrent operations with good cache use and performance may be a tall order. The interface for concurrent_vector carefully avoids defining operations on the vector elements themselves. Adding the constraint that operations such as includes, set_union, and others defined for std::set be able to operate on sets with holes means that the ranges specified in the InputIterators need to be locked down before the operation starts, in order to prevent other threads from modifying the sets while it is in progress. The obvious solution is a global lock on the whole set. Doing anything less pervasive gets back to the question of where to hang per-element state, in this case a lock.
It's a tough problem. If you've got a solution that actually makes concurrent operations on malleable sets possible, I'm sure we'd love to hear it.
While I don't have a solution, I do think this would be a good idea. And I don't want to be the picky person asking for something with all these unrealistic dream capabilities... I will share what I learn with you as I read more papers. The performance of YetiSim is very much constrained by the operations of its internal containers, so this is something I must research anyway.
Also, I appreciate your help with my problem using concurrent_vector. Right now I have an alternative design which involves using STL containers with mutexes. This will do for now, but I'm also looking to help TBB grow and mature, so I'm talking about the issues to get a good understanding and to get some discussions flowing... as well as providing a problem for me to learn more on the subject.
Could you provide a short summary of what a data structure should do to provide good cache performance? Is it just that a contiguous section of memory like concurrent_vector is best, or is it that the data structure is best split into cache-line-sized chunks that can be scattered?
The biggest problem with a set, as far as I can reckon, is that it is a data structure that relies on pointer traversal. A tree implementation is going to result in bad cache usage, since the nodes will be scattered in memory. And even when the nodes happen to be contiguous, the prefetch of the next cache line won't necessarily help, because the traversal order doesn't follow the memory layout.
It has been suggested on #tbb that I look at concurrent B+ trees that are cache-friendly. It looks like we can learn a lot from concurrent database structures. I still have a lot more reading to go, but I'm starting to appreciate the issues.