We've had some discussions lately regarding concurrent_vector and erasing elements, and I appreciate the feedback I've been given. I'm learning more about the internals of concurrent_vector, and some of the design considerations made.
The fundamental problem is that concurrent_vector must retain the ordering of its internal elements. I think this is what leads to the problems we have been discussing. Personally, I just need the concurrent_vector as "a place to put stuff" with good parallel insert and delete performance.
This requirement might be better served by some form of std::set that supports insert/delete and concurrent operations with good performance... and somehow makes good use of the cache.
Has this been attempted by the TBB team? Would it be best to just put a wrapper around an ordinary std::set?
Adrien, I did make a suggestion on the other thread which may provide a solution to your "holding set" design problem: just add a "valid" flag to the concurrent_vector element type.

Creating an enhanced std::set that supports insert/delete AND concurrent operations with good cache use and performance may be a tall order. The interface for concurrent_vector carefully avoids defining operations on the vector elements themselves. Adding the constraint that operations such as includes, set_union, and others defined for std::set be able to operate on sets with holes means that the ranges specified in the InputIterators need to be locked down before the operation starts, in order to prevent other threads from modifying the sets while it is in progress. The obvious solution is a global lock on the whole set. Doing anything less pervasive gets back to the question of where to hang per-element state, in this case a lock.
It's a tough problem. If you've got a solution that actually makes concurrent operations on malleable sets possible, I'm sure we'd love to hear it.
While I don't have a solution, I do think this would be a good idea. And I don't want to be the picky person asking for something with all these unrealistic dream capabilities... I will share what I learn with you as I read more papers. The performance of YetiSim is very much constrained by the operations of its internal containers, so this is something I must research anyway.
Also, I appreciate your help with my problem using concurrent_vector. Right now I have an alternative design which involves using STL containers with mutexes. This will do for now, but I'm also looking to help TBB grow and mature, so I'm talking about the issues to get a good understanding and to get some discussions flowing... as well as providing a problem for me to learn more on the subject.
Could you provide a short summary of what a data structure should do to provide good cache performance? Is it just that a contiguous section of memory like concurrent_vector is best, or is it that the data structure is best split into cache-line-sized chunks that can be scattered?
The biggest problem with a set, as far as I can reckon, is that it is a data structure that relies on pointer traversal. A tree implementation is going to result in bad cache usage, since the nodes will be scattered in memory. And even when the nodes happen to be contiguous, the prefetch of the next cache line won't necessarily help, because the traversal order doesn't follow the memory layout.
It has been suggested on #tbb that I look at concurrent B+ trees that are cache-friendly. It looks like we can learn a lot from concurrent database structures. I still have a lot more reading to go, but I'm starting to appreciate the issues.