Looking at the new version of the TBB Reference (v1.17), it says the dummy task in the "Letting Main Thread Work While Child Tasks Run" section will stay around until it is explicitly destroyed.
Doesn't this mean I can keep adding new children to it and spawning them as I like? I.e., for each task walking the tree, there is one dummy task. When the walking task finds a "big-node", it creates a new child of its dummy task and spawns that child. This looks like a good solution. And since the children are all alike, if I could figure out a way to recycle them, the potential to remove overhead looks very good.
This seems the same as your suggestion above, except that the new children are spawned as they are found. Does your suggestion mean the group of "big-nodes" would be handled by the one parent task instead of by separate tasks, as I was thinking?
Am I understanding this correctly?
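To make sure I am describing the same thing, here is a rough, untested sketch of what I have in mind, following the dummy-task pattern from that section of the Reference (BigNodeTask and Node are placeholders for my real types):

#include "tbb/task.h"

struct Node;   // node of the shared, read-only tree

// Placeholder for the heavy work done on a "big-node".
class BigNodeTask : public tbb::task {
    Node* node_;
public:
    explicit BigNodeTask( Node* n ) : node_(n) {}
    tbb::task* execute() {
        // ... heavy per-node work goes here ...
        return NULL;
    }
};

void example_walk() {
    // Dummy task that only serves as a parent; the extra reference keeps it
    // alive until it is destroyed explicitly.
    tbb::empty_task* dummy = new( tbb::task::allocate_root() ) tbb::empty_task;
    dummy->set_ref_count(1);

    // ... during the walk, each time a big-node is found:
    Node* big_node = NULL;   // placeholder: in reality, the node just found
    tbb::task& child =
        *new( dummy->allocate_additional_child_of(*dummy) ) BigNodeTask( big_node );
    dummy->spawn( child );

    // ... when the whole walk is finished:
    dummy->wait_for_all();     // wait for any big-node tasks still running
    dummy->destroy( *dummy );  // the dummy must be destroyed explicitly
}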
Second issue (but related):
Writing a single recursive task is easy. Writing something that spawns new tasks only part of the time is not as straightforward, since the task that is spawned is not always the same as the spawning task. It would be useful to have an example or two like this for people to look at. (Or maybe I am misunderstanding the right way to do this.)
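For instance, the closest I have come to such an example myself looks roughly like this (untested sketch; Node, is_big() and do_heavy_work() are placeholders): the walker normally just recurses, and only when it finds a big-node does it spawn a task of a different type, as another child of the long-lived dummy.

#include "tbb/task.h"

struct Node {
    Node* left;
    Node* right;
    bool  is_big() const;
};
void do_heavy_work( Node* n );

class BigNodeTask : public tbb::task {       // same idea as in the earlier sketch
    Node* node_;
public:
    explicit BigNodeTask( Node* n ) : node_(n) {}
    tbb::task* execute() { do_heavy_work( node_ ); return NULL; }
};

class TreeWalkTask : public tbb::task {
    Node*      root_;
    tbb::task& dummy_;                       // long-lived parent for big-node tasks
public:
    TreeWalkTask( Node* r, tbb::task& dummy ) : root_(r), dummy_(dummy) {}
    tbb::task* execute() { walk( root_ ); return NULL; }
private:
    void walk( Node* n ) {
        if( n == NULL ) return;
        if( n->is_big() ) {
            // Only here does a new task get created, and it is not a TreeWalkTask:
            // it becomes another child of the dummy, so the walker never waits for it.
            tbb::task& t =
                *new( dummy_.allocate_additional_child_of(dummy_) ) BigNodeTask( n );
            spawn( t );
        }
        walk( n->left );
        walk( n->right );
    }
};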
Thank you once again for your very valuable help!
Thank you for the suggestion. I will look at this after I finish banging my head against all of the task-creation solutions mentioned above.
Working on this, I have started to think a task pool might be a good addition to TBB. For my solution, if I could create a bunch of tasks with the same environment, change a couple of variables in the environment of one, and start it up when I find a "big-node", I would never need to create any new tasks.
The tasks share a big read-only tree; the input to each task is a pointer to the node of interest, and its output goes to a thread-safe container that is also shared.
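In code terms, the per-task environment I am imagining boils down to a single pointer plus a shared output container, something like this (sketch only; Result and compute_result() are placeholders, and tbb::concurrent_vector is just one choice of thread-safe container):

#include "tbb/task.h"
#include "tbb/concurrent_vector.h"

// The tree is read-only and shared, so it is not part of the task state at all.
struct Node;
struct Result { double value; };                  // placeholder for a task's output
Result compute_result( const Node* n );           // placeholder for the real work

tbb::concurrent_vector<Result> shared_results;    // the shared thread-safe output

class BigNodeTask : public tbb::task {
    const Node* node_;                            // the only per-task input
public:
    explicit BigNodeTask( const Node* n ) : node_(n) {}
    tbb::task* execute() {
        shared_results.push_back( compute_result(node_) );
        return NULL;
    }
};

Since concurrent_vector::push_back is safe to call from many tasks at once, no extra locking is needed around the output.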
Maybe one of the mentioned solutions is so close to this that it is not a unique idea.
I am now getting a 48% speedup (on 4 cores) by allowing new task creation only down to a set depth. It is time to dynamically tweak the depth to see what I can get.
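The only real change from the walking sketch earlier in the thread is the spawn test; the walk now carries its depth, and the walker checks something like n->is_big() && may_spawn_new_task(depth):

// Depth cutoff for task creation: only levels above CUTOFF_DEPTH are allowed
// to spawn separate tasks. CUTOFF_DEPTH is a guessed value to be tuned
// (or varied dynamically at run time).
const int CUTOFF_DEPTH = 8;

inline bool may_spawn_new_task( int depth ) {
    return depth < CUTOFF_DEPTH;
}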
Now, the "big bang" will be if I can somehow collect pairs of tasks and convert their floating-point calculations to 64-bit x 2 SIMD instructions. Since all of the tasks run independently of each other, I am not sure how to do this. A global barrier would be just the ticket. I suppose I will have to dig into the task scheduler to figure this one out. If there is an example you know of, please feel free to point it out.
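Just to be concrete about what I mean by 64-bit x 2 (illustration only, assuming SSE2 intrinsics; it does not answer the pairing/barrier question):

#include <emmintrin.h>   // SSE2: each register holds two 64-bit doubles

// If the inputs of two independent node calculations could be lined up side by
// side, one instruction would serve both. a0/b0 belong to one node, a1/b1 to the
// other; the hard part (pairing up the two tasks) is not shown here.
void multiply_two_nodes( double a0, double b0, double a1, double b1,
                         double* r0, double* r1 )
{
    __m128d a = _mm_set_pd( a1, a0 );   // pack both nodes' inputs
    __m128d b = _mm_set_pd( b1, b0 );
    __m128d r = _mm_mul_pd( a, b );     // one multiply covers both nodes
    double out[2];
    _mm_storeu_pd( out, r );
    *r0 = out[0];                       // result for the first node
    *r1 = out[1];                       // result for the second node
}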
Maybe it is time to look into IPP. :-)
Once again, thank you for the help! Hopefully this thread will help other people in the future.
p.s. All of this is amazingly similar to the Fib. example in the O'Reilly book. It is too bad the examples in the book are not quite complete and in some cases misleading (based on my experience getting to where I am). Maybe I will put complete versions on the Web to help future TBB'ers.
You are absolutely right. Different solutions might be possible for that; my first thought was spawning a task to call some_func instead of executing it in place, so I provided some information on how to do that.
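In its simplest form, that just means wrapping the call in a small task type (sketch only; Arg is a placeholder for whatever some_func takes):

#include "tbb/task.h"

struct Arg;
void some_func( Arg* a );

class CallSomeFuncTask : public tbb::task {
    Arg* arg_;
public:
    explicit CallSomeFuncTask( Arg* a ) : arg_(a) {}
    tbb::task* execute() {
        some_func( arg_ );   // the call now runs in its own task instead of in place
        return NULL;
    }
};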
I am now limiting the depth for new task creation and getting a speedup of close to 100% on four processors. (I would like to see something closer to linear (300%).)
All of the algorithms mention tree balance. My tree is perfectly balanced, but the algorithm does tree pruning, which (I believe) is causing some of the worker threads to end up with no tasks.
My current implementation recycles the parent as a child. Is this bypassing of the task scheduler perhaps causing a thread to sit idle when the recycled parent ends up in a pruned portion of the tree (and thus spawns no new tasks)?
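For context, what I am doing is patterned on the continuation-passing Fibonacci example from the Tutorial, roughly like this (simplified sketch; Node, is_pruned() and process() are placeholders, and I assume both children exist whenever a node is not pruned):

#include "tbb/task.h"

struct Node {
    Node* left;
    Node* right;
    bool  is_pruned() const;
    void  process();
};

class WalkTask : public tbb::task {
    Node* node_;
public:
    explicit WalkTask( Node* n ) : node_(n) {}

    tbb::task* execute() {
        if( node_ == NULL || node_->is_pruned() )
            return NULL;                   // pruned subtree: no new work at all
        node_->process();

        // Continuation that completes when both subtrees are done.
        tbb::empty_task& c = *new( allocate_continuation() ) tbb::empty_task;
        WalkTask& right = *new( c.allocate_child() ) WalkTask( node_->right );

        recycle_as_child_of( c );          // reuse this task object for the left subtree
        node_ = node_->left;

        c.set_ref_count( 2 );
        c.spawn( right );
        return this;                       // scheduler bypass: run the recycled task next
    }
};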
Also, I have not tried Alexey's parallel_for implementation. He mentions issues with tree imbalances also, so I am not sure that is going to be a solution.
Thank you for any suggestions.