Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

How to spawn worker children while parent continues to execute?

todd-bezenek
Beginner
This is the general idea of my code:

I have a recursive_algorithm (I assume this will be a task class, as it is now) which does this:

1. Most of the time it crawls down a binary tree--I do NOT want this done with other threads, as the thread creation overhead is as large as the overhead for checking a node (10s of instructions).

2. It finds a node that will require work (1,000 instructions or more). This can be done by calling a method which works on only global/passed data.

So, I want to be able to create new tasks to work on the nodes I find that need real work (but not a new task at EVERY node), while the parent continues through the tree at the same time looking for more "large amount of work" nodes. Any ideas?

When looking at the task structure, I do not see an obvious way to do this. Everything seems to rely on the parent blocking or finishing, in which case a continuation task will start and execute the children ("large amount of work" nodes in my case).

Thank you for any help!

-Todd
15 Replies
RafSchietekat
Valued Contributor III
1,285 Views
I think you know that thread creation overhead is many orders of magnitude more than dozens of instructions, but you may not yet know that even a task should only be given several thousand instructions' worth of work to be fully useful (otherwise throughput performance is sacrificed, gaining only earlier completion on an oversized machine). A continuation task executes after its children have finished; those children would have been spawned by a different task.

You should probably explore the tree in parallel instead of just splitting off the big nodes. You'll find many more opportunities for parallelism that way, and they will be of higher quality because they involve less inter-thread communication, which improves performance.

If you can't change your program to do that anyway, have a look at "Letting Main Thread Work While Child Tasks Run" in the reference manual, but use allocate_additional_child_of() instead (the section should probably be changed to include a mention of it). If you have parallelism available in your computer, it will be used as soon as node tasks are being spawned; on a single thread (TBB is about optional parallelism, not required concurrency), those tasks are all tackled together at wait_for_all() time.
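
The skeleton of that pattern looks roughly like this (just a sketch; WorkTask and big_node are placeholders for whatever class you derive from tbb::task and the node it processes):

#include "tbb/task.h"

// One dummy task acts as the parent of everything spawned during the walk.
tbb::empty_task* dummy = new( tbb::task::allocate_root() ) tbb::empty_task;
dummy->set_ref_count(1);   // this single reference is for the final wait_for_all()

// ... inside the tree walk, whenever a "large amount of work" node is found:
tbb::task* t = new( dummy->allocate_additional_child_of(*dummy) ) WorkTask(big_node);
dummy->spawn(*t);          // workers can start on it while the walk continues

// ... when the walk is finished:
dummy->wait_for_all();     // blocks until all spawned children have completed
dummy->destroy(*dummy);    // the dummy is not consumed automatically; destroy it explicitly

allocate_additional_child_of() bumps the dummy's reference count atomically for each new child, so you never need to know the number of children in advance.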
todd-bezenek
Beginner
Thank you, Raf. I should have said "task" at each position where I said "thread". I am still new to TBB. :-)

Exploring the tree in parallel was my first attempt. It gave me a 2x slowdown on a 4-core machine. I then implemented "task recycling with scheduler bypass" and managed to get a speedup of about 2%. It is all gravy from here!

I do not see a better way to break up the tree scan, other than creating several tasks for the sub-trees and then letting them run without creating any new tasks. Hmmm.

I will take a look at the section "Letting Main Thread Work While Child Tasks Run" at work tomorrow. It sounds like what I need.

It is also possible my tasks are more expensive to create than I realize. I assume there is a way to see how many bytes the managed memory allocation is allocating/copying.

Thank you again.

-Todd
todd-bezenek
Beginner
Raf,

The copy of the TBB Reference I was looking at (downloaded about a week ago) was Document Number 315415-001U Revision: 1.6 (2006). It does not have the section you mentioned. I wonder if this is the document referenced from the open-source Web site.

I grabbed a "new" copy of the document (Document Number 315415-004US--Revision 1.17), which does have the section you mentioned. I had already tried the method described in this section (I guess I thought it up without the doc) without much luck.

I will try creating multiple tasks to scan the tree, but NOT one per node. I assume the standard way to do this is to create multiple tasks from a parent and have each one scan its own subtree without creating new tasks. Or, I suppose each task could count the number of nodes it visits before spawning an additional task... There are lots of choices! It is time to start thinking in tasks.

-Todd
RafSchietekat
Valued Contributor III
Strange that you get so little speedup from parallel exploration. Maybe you could probe the number of nodes ahead to see whether it's worthwhile creating a new task, thus avoiding many expensive task creations near the leaves that bring no net benefit? Perhaps the depth reached by iterating over the first few nodes is a suitable heuristic, subject to tuning. The number of nodes already visited, as you suggest, may also work to avoid an excessive number of tasks, but it takes longer to get up to full speed, and so would work well only with bigger trees to amortise the startup.

Tasks are a lot cheaper than threads, but they should still have thousands of instructions of work to do even if you add negligible overhead to task creation yourself.

As I wrote, use allocate_additional_child_of() instead of the predetermined child count suggested in the reference manual.

(Added) Perhaps as you explore the tree you could allocate a task, add big-node references to it until you have assembled enough, and only then spawn the task and allocate a new one for the next batch.
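
In code, that could look something like this (again just a sketch; Node, BatchTask, BATCH_SIZE and do_big_work() are placeholders, and "dummy" is the extra root task from the pattern above):

#include <vector>
#include "tbb/task.h"

class BatchTask : public tbb::task {
    std::vector<Node*> my_nodes;          // the big nodes collected during the walk
public:
    BatchTask( const std::vector<Node*>& nodes ) : my_nodes(nodes) {}
    tbb::task* execute() {
        for( size_t i = 0; i < my_nodes.size(); ++i )
            do_big_work( my_nodes[i] );   // the ~1000-instruction job per node
        return NULL;
    }
};

// In the tree walker:
std::vector<Node*> batch;
// ... each time a big node is found:
batch.push_back( node );
if( batch.size() == BATCH_SIZE ) {
    tbb::task* t = new( dummy->allocate_additional_child_of(*dummy) ) BatchTask(batch);
    dummy->spawn(*t);
    batch.clear();
}
// (spawn whatever remains in "batch" after the walk, before the wait_for_all())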
Alexey-Kukanov
Employee
> There are lots of choices!

One interesting and non-trivial choice is to use parallel_for with auto_partitioner. Yes, I am not joking :) The trick is to create a custom range class. I have attached example code for your reference and will comment on it a little.

See that the Action() function in the example does not apply a function to a leaf node but instead does a serial traversal of a subtree. At the same time, the is_divisible function of the special range returns true if a subtree exists. So it might seem like this code will chunk the tree down to leaves, and so create huge overhead. But the second part of the trick is to use auto_partitioner (which is the default partitioner for parallel_for since TBB 2.2). This scheduling helper will stop splitting ranges when it sees no need, whether or not they are divisible further.

So I expect this trick to do a few splits in the very beginning to create several subtrees, and then process the subtrees sequentially, thus eliminating most of the overhead for task allocation and spawn.
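
In outline, such a range could look like this (a simplified sketch only, not the attached code itself; Node, some_func() and serial_walk() stand in for your tree node type and per-node work):

#include "tbb/parallel_for.h"

struct Node { Node* left; Node* right; /* payload ... */ };
void some_func( Node* n );               // the per-node work (defined elsewhere)

void serial_walk( Node* n ) {            // plain recursive traversal, no tasks
    if( !n ) return;
    some_func( n );
    serial_walk( n->left );
    serial_walk( n->right );
}

struct TreeRange {
    Node* my_item;                       // root of the subtree this range still covers
    explicit TreeRange( Node* root ) : my_item(root) {}
    bool empty() const { return my_item == NULL; }
    // Divisible as long as the subtree has something to split off.
    bool is_divisible() const {
        return my_item != NULL && ( my_item->left != NULL || my_item->right != NULL );
    }
    // Splitting constructor: the new range takes one subtree, the old range keeps
    // the other, and the node at the split point is processed on the spot.
    TreeRange( TreeRange& other, tbb::split ) {
        Node* n = other.my_item;
        my_item = n->right;
        other.my_item = n->left;
        some_func( n );
    }
};

struct TreeBody {
    // Action(): serial traversal of whatever subtree is left in the range.
    void operator()( const TreeRange& r ) const { serial_walk( r.my_item ); }
};

// tbb::parallel_for( TreeRange(tree_root), TreeBody(), tbb::auto_partitioner() );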

If you wish to spawn tasks for heavy processing of nodes rather than do it in place, you will need to access a tbb::task instance from parallel_for's range and body classes. The way to do it is to call tbb::task::self(). E.g. I would use something like the following code to spawn:
tbb::task* process_the_node = new( tbb::task::self().allocate_additional_child_of(*tbb::task::self().parent()) ) NodeProcessingTask(my_current_node);
tbb::task::self().spawn(*process_the_node);
This way, under the hood the new task becomes a sibling of the one currently being executed.

If you decide to try it out, let me know whether it works for you, and of course feel free to ask any questions.
Todd_Bezenek
Beginner
As I wrote, use allocate_additional_child_of() instead of the predetermined child count suggested in the reference manual.

(Added) Perhaps as you explore the tree you could allocate a task, add big-node references to it until you have assembled enough, and only then spawn the task and allocate a new one for the next batch.

Raf,

Looking at the new version of the TBB Reference (v1.17), it says the dummy task in the "Letting Main Thread Work While Child Tasks Run" section will stay around until it is explicitly destroyed.

Doesn't this mean I can keep adding new children to it as I like and spawning them? I.e., for each task walking the tree, there is one dummy task. When the walking task finds a "big-node", it creates a new child of its dummy task and spawns it. This looks like a good solution. And since the children are all the same, if I could figure out a way to recycle them, the potential to remove overhead looks very good.

This seems the same as your suggestion above, but the new children are spawned as they are found. Does your suggestion mean the group of "big-nodes" will be handled by the one parent task instead of separate tasks as I was thinking?

Am I understanding this correctly?

Second issue (but related):

Writing a single recursive task is easy. Writing something which spawns new tasks only part of the time is not as straightforward, since the task being spawned is not always the same kind as the spawning task. It would be useful to have an example or two like this for people to look at. (Or maybe I am misunderstanding the right way to do this.)

Thank you once again for your very valuable help!

-Todd

Todd_Bezenek
Beginner
> There are lots of choices!

One interesting and non-trivial choice is to use parallel_for with auto_partitioner. Yes, I am not joking :) The trick is to create a custom range class. I have attached example code for your reference and will comment on it a little.

Alexey,

Thank you for the suggestion. I will look at this after I finish banging my head against all of the task-creation solutions mentioned above.

Working on this, I have started to think a task pool might be a good addition to TBB. For my solution, if I could create a bunch of tasks with the same environment, then change a couple of variables in the environment of one and start it up when I find a "big-node", I would never need to create any new tasks.

The tasks share a big read-only tree, with the input being a pointer to the node of interest, and their output goes to a thread-safe container that is also shared.

Maybe one of the mentioned solutions is so close to this that it is not a unique idea.

-Todd

RafSchietekat
Valued Contributor III
#5 "One interesting and non-trivial choice is to use parallel_for with auto_partitioner."
Unless the tree is hugely unbalanced, that's an excellent approach.

#6 "Doesn't this mean I can keep adding new children to it as I like and spawing them? I.e., for each task walking the tree, there is one dummy task."
Yes, there's no limit to the number of additional children. Keeping multiple dummies may be helpful for performance, although I'm not sure how much.

#6 "This seems the same as your suggestion above, but the new children are spawned as they are found. Does your suggestion mean the group of "big-nodes" will be handled by the one parent task instead of separate tasks as I was thinking?"
Try to assemble a number of big nodes per spawned task, and then you don't need to worry too much about recycling tasks (recycling is always a good idea if the opportunity presents itself; I'm just not sure how hard you should try to create the opportunity out of thin air, because much of the overhead would be in the respawning itself).

#6 "Second issue (but related):"
Not sure what you mean.

#7 "I will look at this after I finish banging my head against all of the task-creation solutions mentioned above."
Go there directly, and you may not even need to bother handling big nodes in separate tasks.

#7 "Working on this, I have started to think a task pool might be a good addition to TBB."
Nah... Unless you add a lot of overhead to task creation (with an alternative solution being to recycle an object that's referenced by several tasks in succession), most of the overhead is probably in the scheduling. Alexey may have some numbers to support that?
RafSchietekat
Valued Contributor III
With the parallel_for/auto_partitioner approach, I would suspect that it may often be beneficial not to do some_func() inside the range's splitting constructor, so that the work can more quickly be distributed to all the workers. A simple solution may be to have an additional reference my_ancestor, possibly equal to my_item, so that the range contains my_ancestor, its left descendants down to my_item, and the whole subtree rooted in my_item.
todd-bezenek
Beginner
Strange that you get so little speedup from parallel exploration. Maybe you could probe the number of nodes ahead to see whether it's worthwhile creating a new task, thus avoiding many expensive task creations near the leaves that bring no net benefit? Perhaps the depth reached by iterating over the first few nodes is a suitable heuristic, subject to tuning. The number of nodes already visited, as you suggest, may also work to avoid an excessive number of tasks, but it takes longer to get up to full speed, and so would work well only with bigger trees to amortise the startup.

Raf,

I am now getting a speedup of 48% (4 cores) by allowing new tasks until a set depth. It is time to dynamically tweak the depth to see what I can get.
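
Roughly what the depth cutoff looks like (a sketch only, not my exact code; Node is as above, and CUTOFF_DEPTH, visit() and serial_walk() are placeholders for the cutoff constant, the cheap per-node check and the sequential traversal):

#include "tbb/task.h"

class WalkTask : public tbb::task {
    Node* my_node;
    int   my_depth;
public:
    WalkTask( Node* n, int depth ) : my_node(n), my_depth(depth) {}
    tbb::task* execute() {
        if( my_node == NULL ) return NULL;     // pruned/absent subtree: nothing to do
        visit( my_node );                      // the cheap per-node check
        if( my_depth >= CUTOFF_DEPTH ) {       // past the cutoff: no more task creation
            serial_walk( my_node->left );
            serial_walk( my_node->right );
            return NULL;
        }
        set_ref_count(3);                      // two children + one for the wait
        spawn( *new( allocate_child() ) WalkTask( my_node->right, my_depth+1 ) );
        spawn_and_wait_for_all( *new( allocate_child() ) WalkTask( my_node->left, my_depth+1 ) );
        return NULL;
    }
};

// Kicked off with:
// tbb::task::spawn_root_and_wait( *new( tbb::task::allocate_root() ) WalkTask( tree_root, 0 ) );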

Now, the "big bang" will be if I can somehow collect pairs of tasks and convert their floating-point calculations to 64bits x 2 SIMD instructions. Since all of the tasks are running by themselves, I am not sure how to do this. A global barrier would be just the ticket. I suppose I will have to dig into the task scheduler to figure this one out. If there is an example you know of, please feel free to point it out.

Maybe it is time to look into IPP. :-)

Once again, thank you for the help! Hopefully this thread will help other people in the future.

-Todd

p.s. All of this is amazingly similar to the Fib. example in the O'Reilly book. It is too bad the examples in the book are not quite complete and in some cases misleading (based on my experience getting to where I am). Maybe I will put complete versions on the Web to help future TBB'ers.

RafSchietekat
Valued Contributor III
"I am now getting a speedup of 48% (4 cores) by allowing new tasks until a set depth. It is time to dynamically tweak the depth to see what I can get."
Or to abandon that approach and use Alexey's instead, which is the most obvious way yet to quite possibly obtain near-linear scaling. But feel free to experiment with what you've already got and share some details.

"collect pairs of tasks and convert their floating-point calculations to 64bits x 2 SIMD instructions"
For big nodes, you mean? How about... creating tasks for an even number of big nodes? :-)

"A global barrier would be just the ticket."
Sounds scary, performance-wise.

"I suppose I will have to dig into the task scheduler to figure this one out."
That would be impressive, but probably not the best use of your time.
Alexey-Kukanov
Employee
With the parallel_for/auto_partitioner approach, I would suspect that it may often be beneficial not to do some_func() inside the range's splitting constructor, so that the work can more quickly be distributed to all the workers.

You are absolutely right. Different solutions are possible for that; my first thought was spawning a task to call some_func instead of executing it in place, which is why I provided some information on how to do that.

todd-bezenek
Beginner
Strange that you get so little speedup from parallel exploration. Maybe you could probe the number of nodes ahead to see whether it's worthwhile creating a new task, thus avoiding many expensive task creations near the leaves that bring no net benefit?

Raf,

I am now limiting the depth for new task creation and getting a speedup of close to 100% on four processors. (I would like to see something closer to linear (300%).)

All of the algorithms mention tree balance. My tree is perfectly balanced, but the algorithm does tree pruning, which (I believe) is causing some worker threads to end up with no tasks.

My current implementation is to recycle the parent as a child. Is this bypassing of the task scheduler perhaps causing a thread to sit idle when the recycled parent ends up covering a pruned portion of the tree (and thus spawns no new tasks)?

Also, I have not tried Alexey's parallel_for implementation. He mentions issues with tree imbalances also, so I am not sure that is going to be a solution.

Thank you for any suggestions.

-Todd

RafSchietekat
Valued Contributor III
"My current implementation is to recycle the parent as a child. Is this bypassing of the task scheduler perhaps causing a thread to sit idle when the recycled parent ends up being a pruned portion of the tree (thus, spawns no new tasks)?"
Recycling a task does not by itself bypass the task scheduler like returning a non-NULL value from execute() does, but I don't see anything here that would cause a thread to sit idle while work is still available?
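
To make the distinction concrete, the combined recycle-plus-bypass idiom (as in the Fibonacci example) looks roughly like this, reusing the WalkTask names from your sketch above (a sketch only, not your code):

tbb::task* WalkTask::execute() {
    // ... leaf/cutoff handling omitted ...
    // Continuation that completes (here: does nothing) after both children finish.
    tbb::empty_task& c = *new( allocate_continuation() ) tbb::empty_task;
    c.set_ref_count(2);                              // two children, nobody waits on c
    c.spawn( *new( c.allocate_child() ) WalkTask( my_node->right, my_depth+1 ) );
    recycle_as_child_of(c);      // reuse this object for the left child: no new allocation
    my_node = my_node->left;
    ++my_depth;
    return this;                 // the non-NULL return is what bypasses the scheduler:
                                 // this thread runs the recycled task immediately
}

The recycle_as_child_of() call only saves the allocate/free pair; it is the returned pointer that keeps the work on the current thread instead of going through spawn and stealing.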

"Also, I have not tried Alexey's parallel_for implementation. He mentions issues with tree imbalances also, so I am not sure that is going to be a solution."
It should be easy enough to give it a quick try. If it doesn't help right away, you could still substitute simple_partitioner and implement a cheap heuristic for divisibility.
todd-bezenek
Beginner
Thank you, Raf. I will be working on this over the weekend.

-Todd