I have a vector of objects on which I need to run a function in parallel. Each task involves a recursive octree traversal, and I'm curious if there is a better way to organize my TBB usage to get more optimum depth-first parallel traversal.
Right now each tree node uses a blocked range parallel_for to start the recursion into each tree. Each of the function calls that kick off the tree-traversals are also called using a blocked range parallel_for.
Running everything with parallel_for appears to complete a software-thread-ish number of the tree traversals at around the same time! It is more desirable for me to execute fewer of the octree traversals faster instead of a number of them at the same time.
Please correct me if I'm wrong, but this is my current thinking!
Will I get the behavior I want if I create each of the recursive function calls as TBB tasks instead of using parallel_for's? At the top of the tree I would spawn() a task for each of my octrees to kickstart the parallel recursive traversals. Inside of the recursive functions I would spawn_and_wait() because of tree dependencies.
Would this create more of a depth-first behavior, or what other design pattern must I use?
If I interpret this correctly (and it is not very clear to me so I have to speculate somewhat), you issued a (conceptual) execution tree that includes the vector's parallel execution as well as each of its elements' trees, in the form of a parallel_for() on the vector and a recursive parallel traversal of the tree in each element. TBB does not distinguish those two levels, and does a bona fide depth-first execution. This is optimal for throughput performance, and it should also produce results quickly for individual elements, although at unpredictable locations throughout the vector. Is this interpretation correct?
If you want to improve apparent performance by quickly producing finished results for a prefix of the vector, you can easily use parallel_do() over the vector instead, although this may slightly increase the total time.
It's not clear which problem you try to address by reducing the number of octrees traversed concurrently. But the simplest way to do that, I think, is to run the outermost loop over trees serially, thus traversing one octree at a time.
If you want some degree of parallelism at the outer level, just not the maximal degree, there are two options I can think of:
1) Set the grain size for the outermost parallel_for to be 1/2 (or 1/4) of the number of trees by e.g. blocked_range<int>(0, N, N/2). Thus you will have each half (or quarter) of the vector processed serially, and so no more than 2 (or 4) trees traversed at the same time.
2) If you want to traverse the trees closer to their serial order, or maybe want to process 3 or 5 trees at a time (which you *will not* achieve with blocked_range<int>(0,N,N/3)), then parallel_pipeline can be used. The first *serial* stage would just iterate over the vector, get the next tree, and pass it down; the second *parallel* stage would do the traversal of the received tree. The first argument to parallel_pipeline is the max number of items to process at a time, and you might set it to 3 or 5.
Note that in both cases you do not need to modify your parallel traversal procedure at all, unless you want to improve something else.
This has been informative! You have the right idea Raf. Alexey, it's not that I want to reduce the number of trees being traversed concurrently, but that I know there is enough work on one of the trees to keep all of the cores occupied and focused on one tree.There are 5/6 levels of octree and a significant work load at each branch, so there should be enough work for one tree at a time. At the same time I don't want there to be any unoccupied CPU time.
It is pretty fast already, and I'm sure that I am just trying to over-optimize. I think I'll just trust TBB to do the right thing.