I'm trying to convert some old serial dynamic DAG evaluation code to run in parallel using TBB. I've looked at both the General Acyclic Graphs of Tasks example as well as the parallel_reduce (parallel_preorder) example. My use case differs from these examples enough that I figured I'd ask for some pointers. Forgive the noob. :)
My DAG evaluation is similar to the examples except that: 1) the graph topology is not known a priori and can change with each evaluation, 2) post-order traversal is required since the results of evaluating a node depend on the results of its predecessor nodes, and 3) continuation passing is not practical due to legacy reasons.
Given 1) and 3), it would seem that the "blocking style" of spawning tasks for evaluation would be the most natural. Except it seems to be a less-than-optimal way to get tasks scheduled when memory is not a concern? In my case, I also expect the work done at each node to vary. Is there no way to do a spawn_and_wait_for_all() that will work-steal from any waiting task regardless of its depth? If not, is this something that might be considered for TBB? What are the alternatives?
For the dynamic post-order traversals, multiple successor nodes might try to evaluate the same predecessor node simultaneously. In this case, I want the first task that gets to the predecessor node to spawn off a task and then wait for it to finish. For the second task that also wants to evaluate the same predecessor node, I want it to wait until that evaluation finishes. It's not clear to me which TBB mechanism I can use for this second "wait". Is there some way to add the predecessor task as a TBB child task so that the wait_for_all() mechanism can be used? Or do I have to set up some sort of barrier myself to do this? Perhaps this isn't the best approach?
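To make the desired "first one evaluates, everyone else waits" behavior concrete, here is a minimal sketch in plain C++ (not TBB; `Node`, its members, and the stand-in computation are all invented for illustration):

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

std::atomic<int> evaluations{0};  // counts how often the real work runs

// Hypothetical node: the first task to reach a predecessor claims it and
// evaluates; any later arrival blocks until the claimed evaluation has
// finished, then reuses the result.
struct Node {
    std::mutex m;
    std::condition_variable cv;
    enum State { Unevaluated, InProgress, Done };
    State state = Unevaluated;
    int value = 0;

    int evaluate() {
        std::unique_lock<std::mutex> lk(m);
        if (state == Unevaluated) {
            state = InProgress;       // claim the node
            lk.unlock();
            int v = 42;               // stand-in for the real computation
            ++evaluations;
            lk.lock();
            value = v;
            state = Done;
            cv.notify_all();          // wake any waiting successors
        } else {
            // second "wait": block until the claimer finishes
            cv.wait(lk, [this] { return state == Done; });
        }
        return value;
    }
};
```

However the waiting is implemented, the key property is that the node's work runs exactly once no matter how many successors race to it.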
Thanks!
Conversion of my example to continuation-passing style is straightforward using a "pseudo program counter" approach. Attached is the modified example. It does post-order evaluation without deadlocking.
I wrote the program counter logic explicitly to show the details. I've seen people write macros for it in other systems.
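For readers without the attachment, here is a minimal illustration of the pseudo program counter idea in plain C++ (`NodeTask` and its members are invented; the attached example is the real thing). Instead of blocking in wait_for_all(), the task records which phase it is in and yields; when its children are done, execute() is re-run and jumps to the recorded phase:

```cpp
#include <vector>

// Sketch: a task that must "spawn children, then combine results" without
// blocking. The pc member is the pseudo program counter.
struct NodeTask {
    enum Phase { SpawnChildren, Combine };
    Phase pc = SpawnChildren;
    std::vector<int> child_results;  // filled in by children as they finish
    int result = 0;

    // Returns true when the task is finished, false when it yielded.
    bool execute() {
        switch (pc) {
        case SpawnChildren:
            // ... spawn child tasks here; their results land in child_results
            pc = Combine;   // remember where to resume
            return false;   // yield instead of calling wait_for_all()
        case Combine:
            for (int r : child_results) result += r;
            return true;
        }
        return true;
    }
};
```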
Obviously this means we are doing some of the scheduling ourselves.
If you don't know the order at the beginning of the evaluation, it is vastly more complex. I'd be interested to hear what you come up with.
Mike
If you are on a Windows platform I might have a solution for you. I can send you a demo SDK if you are willing to spend a little time investigating the possibility. QuickThread has a task enqueuing method that provides the type of flexibility that you are describing. For example, your requirement for each node to contain a gatekeeper that permits non-concurrent processing of a specific node in a concurrent system. Something that would require a critical section in an (oversubscribed) multi-threading system, but which you would rather not use in a tasking system such as TBB or QuickThread.
Assume you wished to perform a function named DoSomething taking the node address and two arguments. When the node is not in use you want DoSomething called immediately; when the node is in use you want DoSomething with node and arguments enqueued for processing upon completion of the current task (which may be using the node in DoSomething or DoSomethingElse, ...).
In QuickThread you can place a qtControl object within the node class/struct, then use that in the task enqueue:
parallel_task(OnDone$, &node->control, DoSomething, node, arg1, arg2);
You can cram that into a member function if you so desire
node->DoSomething(arg1, arg2);
This control, within the node, can be linked to a parent control node as an inverted tree of control nodes.
Here is sketch code for a two-level control tree.
[bash]#include "stdafx.h"
#include "QuickThread.h"
#include "parallel_task.h"

using namespace qt;

qtControl masterNodeControl;

struct node_t
{
    struct node_t* left;
    struct node_t* right;
    qtControl ctrl;
    node_t()
    {
        left = NULL;
        right = NULL;
        ctrl.Parent = &masterNodeControl;
    }
    ~node_t()
    {
        ctrl.WaitTillDone();
        // other cleanup after tasks complete
    }
    // node data here
};

int _tmain(int argc, _TCHAR* argv[])
{
    // ...
    // at some point in traversal
    node_t* node;
    ...
    parallel_task(OnDone$, &node->ctrl, DoSomething, node, arg1, arg2);
    ...
    // now wait for all tasks on all nodes to complete
    masterNodeControl.WaitTillDone();
    return 0;
}
[/bash]
Jim Dempsey
TBB's wait_for_all (in all its forms) is not as inefficient as one might think. Its main deficiencies are (a) consuming thread stack space (due to recursion), and (b) workers not returning to their outermost dispatch loop, which may affect fairness of worker distribution between different master threads. The latter is normally not an issue, and the former should be OK as long as your graph depth fits into 1M of stack space (most modern OSes reserve even larger space for thread call stacks). Entering/exiting wait_for_all also incurs some overhead, but it is reasonably small (normally on the order of tens of clock ticks).
Regarding wait_for_all behavior, it works exactly as you wished :). It steals and executes any available (that is, spawned) tasks regardless of their depth. Starting from the last stable release (and from TBB 3.0 that will be released soon), wait_for_all exits as soon as all the predecessors of the task it waits on are done. But while it waits, it steals and executes.
With regard to dynamic graph support, TBB provides a few methods that may be useful to you. Method task::allocate_additional_child_of allows dynamically adding new predecessors to a task already being executed (or not yet spawned, it does not matter). You just have to make sure that the successor has not left its wait_for_all yet when you add a new predecessor. As Mike noted above, methods task::increment_ref_count and task::decrement_ref_count can be used to track when a successor task has to be spawned, or as a safety belt (to add/remove an extra reference) when adding new predecessors as described above.
The same last stable release also added a new task::enqueue method that allows scheduling tasks in (somewhat relaxed) FIFO order. Though its semantics is not completely finalized yet, it still may be useful for building graphs. And as a last note, in one of the post-3.0 releases TBB will add a set of APIs specifically for dynamic graph evaluation and implementation of actor-like models. But even now TBB seems to have enough (even if low-level) means to solve your problem.
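To illustrate the ref_count idea in isolation, here is a sketch in plain C++ (not the TBB API; `Successor` and its members are invented): the successor counts outstanding predecessors, and the predecessor that drops the count to zero is the one that runs (in TBB terms: spawns) the successor.

```cpp
#include <atomic>
#include <functional>

// Sketch of reference-count-driven spawning. Mirrors the roles of
// task::increment_ref_count / task::decrement_ref_count, not their API.
struct Successor {
    std::atomic<int> ref_count;
    std::function<void()> body;

    Successor(int predecessors, std::function<void()> b)
        : ref_count(predecessors), body(std::move(b)) {}

    // Register a predecessor added after construction (the dynamic-graph
    // case): analogous to incrementing the ref count.
    void add_predecessor() { ref_count.fetch_add(1, std::memory_order_relaxed); }

    // Called by each finishing predecessor: analogous to decrementing the
    // ref count; the last one to finish runs the successor.
    void predecessor_done() {
        if (ref_count.fetch_sub(1, std::memory_order_acq_rel) == 1)
            body();
    }
};
```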
Nice, do you have a rough release date for post-3.0?
Mike
On a more serious note, it's in exploration phase now, so dates are hard to predict. And we need to get done with 3.0 first. So please stay tuned.
[bash]#include "stdafx.h"
#include "QuickThread.h"
#include "parallel_task.h"

using namespace qt;

qtControl masterNodeControl;

struct node_t
{
    struct node_t* parent;
    struct node_t* left;
    struct node_t* right;
    qtControl ctrl;
    node_t()
    {
        parent = NULL;
        left = NULL;
        right = NULL;
        ctrl.Parent = &masterNodeControl;
    }
    ~node_t()
    {
        // precaution in case of deletion of node
        // prior to completion of tasks using node
        ctrl.WaitTillDone();
    }
    // node data here
};

node_t* treeHead = NULL;

void exclusiveTask_1(node_t* node)
{
    // ... task performed with exclusive use of node
    // (excepting traversal of tree permitted)
}

void exclusiveTask_2(node_t* node)
{
    // ... task performed with exclusive use of node
    // (excepting traversal of tree permitted)
}

typedef void(*void_fn_node_t)(node_t*);

bool makeDecision(node_t* node)
{
    // ...
    return true;
}

void parseTreeNode(node_t* node, void_fn_node_t fn)
{
    if(!node) return; // nothing to do
    if(node->left)
    {
        // recursively enqueue ourself using master control node
        // and without exclusive reservation of node
        parallel_task(&masterNodeControl, parseTreeNode, node->left, fn);
    }
    if(node->right)
    {
        // recursively enqueue ourself using master control node
        // and without exclusive reservation of node
        parallel_task(&masterNodeControl, parseTreeNode, node->right, fn);
    }
    // perform non-exclusive node calculations
    if(makeDecision(node))
    {
        // ta-da, our parent node requires exclusive operation
        parallel_task(OnDone$, &node->ctrl, fn, node->parent);
    }
}

int _tmain(int argc, _TCHAR* argv[])
{
    // ... (tree built)
    // at some point
    // Tree traversal performed without node ownership
    // perform exclusiveTask_1 on selected nodes
    parallel_task(&masterNodeControl, parseTreeNode, treeHead, exclusiveTask_1);
    // perform exclusiveTask_2 on selected nodes
    parallel_task(&masterNodeControl, parseTreeNode, treeHead, exclusiveTask_2);
    // now wait for all tasks on all nodes to complete
    masterNodeControl.WaitTillDone();
    return 0;
}
[/bash]The above code compiles, although it doesn't do anything.
The "tree" is recursively searched for selected nodes using two parallel search passes, each passing an action functor. Each search is itself parallel. In a real example, the last two parallel_tasks would likely pass in the decision functor as well; IOW the selection criteria for exclusiveTask_1 would likely differ from that of exclusiveTask_2.
One of the principal strengths of QuickThread is sheer elegance in coding.
Thanks for pointing out the new wait_for_all() behaviour. I'm currently still on TBB 2.2, so it looks like I'll have to update.
I'm still not clear on how to handle the case where multiple successor nodes try to evaluate the same predecessor in TBB. By the time the successor nodes try to evaluate their common predecessor, they're both already executing within their own tasks, so the successor tasks are both already allocated and executing. I'm not sure how task::allocate_additional_child_of or the ref_count methods can be used to solve this problem. In an ideal world, there would be some way to make a task have multiple parents in TBB. It sounds like this might need to wait for post-TBB-3.0?
Mike
Upon further thought, I kind of see now where Andrey was going with task::enqueue(). I was trying to find documentation for it until I realized that it's not an official function yet (what Andrey said in the first place :). So in this example, let's suppose that B is faster and spawns A. When C goes to evaluate and notices that A is already spawned (i.e., already in the queue), C can create an empty child task, enqueue() it, and then wait on the empty child task to finish. This should guarantee that C will only continue after A has finished executing.
[plain]- Class task was extended with enqueue() method, and slightly changed
semantics of methods spawn() and destroy(). For exact semantics,
refer to TBB Reference manual.
[/plain]
However, the same pdf reference manual on the documentation page does not mention enqueue().
Mike
static void enqueue ( task& );
Effects
The task is scheduled for eventual execution by a worker thread even if no thread ever explicitly waits for the task to complete. If the total number of worker threads is zero, a special additional worker thread is created to execute enqueued tasks. Enqueued tasks are processed in roughly, but not precisely, first-come first-serve order.
CAUTION: Using enqueued tasks for recursive parallelism can cause high memory usage, because the recursion will expand in a breadth-first manner. Use ordinary spawning for recursive parallelism.
CAUTION: Explicitly waiting on an enqueued task should be avoided, because other enqueued tasks from unrelated parts of the program might have to be processed first. The recommended pattern for using an enqueued task is to have it asynchronously signal its completion, for example, by posting a message back to the thread that enqueued it.
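The recommended pattern from the second CAUTION can be sketched in plain C++ (`Mailbox` is an invented name, not a TBB class): instead of being waited on, the enqueued work posts a completion message back to whoever enqueued it.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

// A minimal thread-safe mailbox the enqueuing thread can poll, so the
// enqueued task signals completion asynchronously rather than being
// explicitly waited on.
struct Mailbox {
    std::mutex m;
    std::condition_variable cv;
    std::queue<std::string> messages;

    void post(std::string msg) {
        std::lock_guard<std::mutex> lk(m);
        messages.push(std::move(msg));
        cv.notify_one();
    }
    std::string receive() {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [this] { return !messages.empty(); });
        std::string msg = std::move(messages.front());
        messages.pop();
        return msg;
    }
};
```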
As for the CAUTION regarding waiting on enqueued tasks: are you saying that other enqueued tasks might be executed even after the enqueued task being waited upon has finished, before control returns to the parent?
Unfortunately, blocking style for post-order evaluation of a DAG will deadlock if the TBB task stealing mechanism is used. I first ran into this years ago when prototyping futures for TBB. The problem arises from what we call the "trapped task" problem. This happens when a thread running a task blocks waiting for child tasks to complete, and in the meantime steals other work. In the DAG case, this other work can end up waiting on the original task, which leads to deadlock.
To see this, consider a DAG of vertices (i,j) where evaluation of (i,j) requires evaluation of (i,j-1) and (i-1,j). Here is a troublesome sequence:
1. Thread A starts evaluating (1,1).
2. Thread A recursively starts evaluation of (0,1).
3. Thread B attempts to evaluate (0,2).
4. Thread B recursively attempts to evaluate (0,1). But Thread A got to it first, so Thread B goes off and steals other work.
5. Thread B steals the task for evaluating (0,3).
6. Thread A finishes evaluation of (0,1). The evaluation of (0,2) can now complete, but that's the (0,2) from step 3, which is now buried in Thread B's stack.
7. While evaluating (0,3), Thread B recursively attempts to evaluate (0,2). It will wait forever, since it is already in the middle of evaluating (0,2) from step 3, lower on its own stack.
With continuation passing, the problem goes away since no task becomes trapped on a stack.
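One deadlock-free discipline for this grid DAG can be sketched in plain C++ (the grid size, value function, and names are all invented for illustration): give every vertex a count of unfinished predecessors, and have whichever evaluation drops a successor's count to zero evaluate that successor, so nothing ever blocks on a stack. Here the "spawn" is a direct call; a real scheduler would enqueue the ready vertex as a task.

```cpp
#include <atomic>
#include <functional>
#include <vector>

const int N = 4;  // grid size; an assumption for this sketch

// Each vertex (i,j) depends on (i-1,j) and (i,j-1). Completing a vertex
// decrements its successors' predecessor counts; whoever drops a count
// to zero evaluates that successor. No task ever waits on another.
std::vector<std::vector<int>> evaluate_grid() {
    std::vector<std::vector<int>> value(N, std::vector<int>(N, 0));
    std::atomic<int> deps[N][N];
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            deps[i][j].store((i > 0 ? 1 : 0) + (j > 0 ? 1 : 0));

    std::function<void(int, int)> run = [&](int i, int j) {
        int left = (j > 0) ? value[i][j - 1] : 0;
        int up   = (i > 0) ? value[i - 1][j] : 0;
        // Stand-in computation: counts lattice paths (Pascal's triangle).
        value[i][j] = (i == 0 && j == 0) ? 1 : left + up;
        // Notify successors; evaluate any whose last predecessor we were.
        if (j + 1 < N && deps[i][j + 1].fetch_sub(1) == 1) run(i, j + 1);
        if (i + 1 < N && deps[i + 1][j].fetch_sub(1) == 1) run(i + 1, j);
    };
    run(0, 0);  // the only vertex with no predecessors
    return value;
}
```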
With respect to the "second wait" question: yes, you can add a predecessor task and use wait_for_all(). I've added an example of this trick. As the name of the file and the warnings at the top indicate, it will often deadlock with TBB up to and including 3.0.
Waiting on an enqueued task will return when the enqueued task finishes. The hazard we worry about is that in a large piece of software, the programmer does not know what else was enqueued in front of their task, and so could get themselves into deadlock. (Of course, as my earlier reply showed, spawned tasks can deadlock too when using blocking style for DAGs.) It seems much safer to use enqueued tasks in a fire-and-forget way, have the enqueued task enqueue a reply when it is done, or use continuation passing. TBB 3.0 will have a TBB Design Patterns document explaining some of the use cases for enqueued tasks.
Alas, effectively parallelizing code often requires some re-engineering. I recently saw a paper describe it as "an exercise in refactoring".
Something to consider in your re-engineering is whether individual node evaluations will do enough work to amortize scheduler overheads. I recommend that the work per task be about 10,000 clock cycles or more. If that does not hold in your case, you'll need to aggregate multiple node evaluations together somehow. I've attached an excerpt of the TBB 3.0 Design Patterns document that describes agglomeration for grids and trees. But for DAGs, I have not seen a common pattern for agglomeration.
With respect to proxy tasks: yes, the scheduler does decrement the ref_count of a task when a predecessor finishes. It's somewhat buried in the Reference manual, in the section titled "Processing of execute()". [By the way, the TBB 3.0 documentation has switched from saying child/parent to predecessor/successor. Method names retain the historical names for the sake of source compatibility.]
The reference counts of waiting tasks could be manipulated directly using task::decrement_ref_count. However, a task might be waiting on more than one node, so it might have to be in more than one list. Since I needed to keep a list anyway, I introduced ProxyTask objects. In other words, each ProxyTask serves two purposes: (1) as a way to decrement the ref_count of a waiting task, and (2) as a list element. Because TBB allocates small tasks from thread-local free lists, ProxyTask objects are a fairly cheap way to implement the list aspect, so this seemed easier to me than using task::decrement_ref_count directly.
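A rough plain-C++ analogue of those two roles (all names invented; in TBB the proxies would be real task objects): each node keeps a list of proxies for the tasks waiting on it, and completing the node releases every proxy, each of which just decrements its waiter's reference count.

```cpp
#include <atomic>
#include <vector>

// A waiter blocks (conceptually) on a reference count of unfinished nodes.
struct Waiter {
    std::atomic<int> ref_count{0};
    bool ready() const { return ref_count.load() == 0; }
};

// Stand-in for a ProxyTask: represents one waiter on one node's list.
struct Proxy {
    Waiter* waiter;
};

struct DagNode {
    std::vector<Proxy> waiting;  // proxies of tasks waiting on this node

    void add_waiter(Waiter& w) {
        w.ref_count.fetch_add(1);     // one more node the waiter depends on
        waiting.push_back(Proxy{&w});
    }
    void complete() {                 // node finished: release its proxies
        for (Proxy& p : waiting)
            p.waiter->ref_count.fetch_sub(1);
        waiting.clear();
    }
};
```

Because one waiter appears on several nodes' lists only through its proxies, the waiter itself never needs to be in more than one list.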
In my case, the work done per node can be fairly large. The re-engineering effort will likely result in suboptimal parallelism for me. I have evaluation code that calls functions which in turn may or may not evaluate other nodes. So it's not even possible in many cases to know if/when/where a node will be evaluated from.