Solved: Dynamic post-order evaluation of DAGs

e4lam · ‎04-05-2010

Hi,

I'm trying to convert some old serial dynamic DAG evaluation code to run in parallel using TBB. I've looked at both the General Acyclic Graphs of Tasks example as well as the parallel_reduce (parallel_preorder) example. My use case differs from these examples enough that I'd figure I'd ask for some pointers. Forgive the noob. :)

My DAG evaluation is similar to the examples except that: 1) The graph topology is not known a priori and can change with each evaluation, 2) post-order traversal is required since the results of evaluating a node depends on the results of its predecessor nodes, and 3) continuation passing is not practical due to legacy reasons.

Given 1) and 3), it would seem that the "blocking style" of spawning tasks for evaluation would be the most natural. Except, it seems to be a less than optimal way to get tasks scheduled when memory is not a concern? In my case, I also expect the work done at each node to vary. Is there no way to do a spawn_and_wait_for_all() that will work-steal from any waiting task regardless of its depth? If not, is something that might be considered for TBB? What are the alternatives?

For the dynamic post-order traversals, multiple successor nodes might be trying to evaluate the same predecessor node simultaneously. In this case, I want to let the first task that gets to the predecessor node to spawn off a task and then wait for it to finish. For the second task that also wants to evaluate the same predecessor node, I want it to wait until it finishes. It's not clear to me as to which TBB mechanism I can use for this second "wait". Is there some way to add the predecessor task as a TBB child task so that the wait_for_all() mechanism can be used? Or do I have to set up some sort of barrier myself to do this? Perhaps this isn't the best approach?

Thanks!

ARCH_R_Intel · ‎04-06-2010

Conversion of my example to continuation-passing style is straightforward using a "pseuo program counter" approach. Attached is the modified example. It does post-order evaluation without deadlocking.

I wrote the program counter logic explicitly to show the details. I've seen people write macros for it in other systems.

View solution in original post

mwhenson · ‎04-05-2010

Our solution requires the complete ordering to be known at the beginning of the evaluation. We create every task with the ref count set to the number of children, where parents in the DAG cannot begin execution until their children are done. When a task finishes, it atomically decrements its parent's ref count; when the reference count is zero the task is spawned. This works very well for smaller trees (< 10,000).

Obviously this means we are doing some of the scheduling ourselves.

If you don't know the order at the beginning of the evaluation, it is vastly more complex. I'd be interested to hear what you come up with.

Mike

jimdempseyatthecove · ‎04-05-2010

e4lam,

If you are on aWindows platform I might have a solution for you. I can send you a demo SDK if you are willing to spend a little time investigating the possibility. QuickThread has a task enqueuing method that provides for the type of flexability that you are describing. For example, your requirement for each node to contain a gait keeper that permits non-concurrent processing of a specific node in a concurrent system. Something that would require a Critical Section in a (oversubscribed) multi-threading system but which you would rather not use in a tasking system such as TBB or QuickThread.

Assume you wished to perform a function named DoSomething taking the node address and two arguments. When node is not in use you want DoSomething called immediately, when node is in use you want DoSomething with node and arguments enqueued for processing upon completion of the current task (which may by using node in DoSomething or DoSomethingElse, ...).

In QuickThread you can place a qtControl object within the node class/struct than use that in the task enqueue

parallel_task(OnDone$, &node->control, DoSomething, node, arg1, arg2);

You can cram that into a member function if you so desire

node->DoSomething(arg1, arg2);

This control, within the node, can be linked to a parent control node as an inverted tree of control nodes.
Here is sketch code for a two-level control tree.

[bash]#include "stdafx.h"
#include "QuickThread.h"
#include "parallel_task.h"
using namespace qt;

qtControl masterNodeControl;
struct node_t
{
    struct node_t* left;
    struct node_t* right;
    qtControl    ctrl;
    node_t()
    {
        left = NULL;
        right = NULL;
        ctrl.Parent = &masterControl;
    }
    ~node_t()
    {
        ctrl.WaitTillDone();
        // other cleanup after tasks complets



    }
    // node data here
};

int _tmain(int argc, _TCHAR* argv[])
{
    // ...
    // at some point in traversal
    node_t* node;
    ...
    parallel_task(OnDone$, &node.ctrl, doSomething, node, arg1, arg2);
    ...
    // now wait for all tasks on all nodes to complete
    masterNodeControl.WaitTillDone();
    return 0;
}
[/bash]

QED

Jim Dempsey

Andrey_Marochko · ‎04-06-2010

It would be interesting to see a bit more fleshy outline of your graph construction/processing algorithm. Meanwhile a few notes that hopefully may be helpful to you.

TBB's wait_for_all (in all its forms) is not as inefficient as one might think. Its main deficiencies are (a) consuming thread stack space (due to recursion), and (b) workers not returning to their outermost dispatch loop that may affect fairness of workers distribution between different master threads. The latter is normally not an issue, and the former should be OK as long as your graph depth fits into 1M stack space (most modern OSes reserve even larger space for thread call stacks). Entering/exiting wait_for_all also incur some overhead, but it is reasonably small (normally the order of tens of clock ticks).

Regarding wait_for_all behavior, it works exactly as you wished :). It steals and executes any available (that is spawned) tasks disregarding their depth. Starting from the last stable release (and from TBB 3.0 that will be released soon) wait_for_all exits as sson as all the predecessors of the task it waits on are done. But while it waits, it steals and executes.

With regard to dynamic graph support, TBB provides a few methods that may be useful to you. Method task::allocate_additional_child_of allows to dynamically add new predecessors to a task already being executed (or not yet spawned, it does not matter). You just have to make sure that the successor has not left its wait_for_all yet when you add new predecessor.

As Mike noted above, methods task::increment_ref_count and task::decrement_ref_count could be used to track when successor task has to be spawned, or as a safety belt (to add/remove extra reference) when adding new predecessors as described above.

The same last stable release also added new task::enqueue method that allows scheduling tasks in (somewhat relaxed) FIFO order. Though its semantics is not completely finalized yet, it still may be useful for building graphs.

And as the last note, in one of the post 3.0 releases TBB will add a set of APIs specifically for dynamic graphs evaluation and implementation of actors like models. But even now TBB seems to have enough (even if) low level means to solve your problem.

mwhenson · ‎04-06-2010

Quoting Andrey Marochko (Intel)

And as the last note, in one of the post 3.0 releases TBB will add a set of APIs specifically for dynamic graphs evaluation and implementation of actors like models. But even now TBB seems to have enough (even if) low level means to solve your problem.

Nice, do you have an rough release date for post 3.0?

Mike

Alexey-Kukanov · ‎04-06-2010

Year 2011 :)

On a more serious note, it's in exploration phase now, so dates are hard to predict. And we need to get done with 3.0 first. So please stay tuned.

jimdempseyatthecove · ‎04-06-2010

Here is a fleshier outline

[bash]
#include "stdafx.h"
#include "QuickThread.h"
#include "parallel_task.h"
using namespace qt;

qtControl masterNodeControl;

struct node_t
{
	struct node_t* parent;
	struct node_t* left;
	struct node_t* right;
	qtControl	ctrl;
	node_t()
	{
		left = NULL;
		right = NULL;
		ctrl.Parent = &masterNodeControl;
	}
	~node_t()
	{
		// precaution in case of deletion of node
		// prior to completion of tasks using node
		ctrl.WaitTillDone();
	}
	// node data here
};

node_t*	treeHead = NULL;


void exclusiveTask_1(node_t* node)
{
	// ... task performed with exclusive use of node
	//     (excepting traversal of tree permitted)
}

void exclusiveTask_2(node_t* node)
{
	// ... task performed with exclusive use of node
	//     (excepting traversal of tree permitted)
}

typedef void(*void_fn_node_t)(node_t*);

bool makeDecision(node_t* node)
{
	// ...
	return true;
}

void parseTreeNode(node_t* node, void_fn_node_t fn)
{
	if(!node) return;	// nothing to do
	if(node->left)
	{
		// recursively enqueue ourself using master control node
		// and without exclusive reservation of node
		parallel_task(&masterNodeControl, parseTreeNode, node->left, fn);
	}
	if(node->right)
	{
		// recursively enqueue ourself using master control node
		// and without exclusive reservation of node
		parallel_task(&masterNodeControl, parseTreeNode, node->right, fn);
	}
	// perform non-exclusive node calculations
	if(makeDecision(node))
	{
		// ta-da, our parent node requires exclusive operation
		parallel_task(OnDone$, &node->ctrl, fn, node->parent);
	}
}

int _tmain(int argc, _TCHAR* argv[])
{
	// ... (tree built)
	// at some point
	// Tree travresal performed without node ownership
	// perform exclusiveTask_1 on selected nodes
	parallel_task(&masterNodeControl, parseTreeNode, treeHead, exclusiveTask_1);
	// perform exclusiveTask_2 on selected nodes
	parallel_task(&masterNodeControl, parseTreeNode, treeHead, exclusiveTask_2);
	// now wait for all tasks on all nodes to complete
    masterNodeControl.WaitTillDone();
	return 0;
}

[/bash]

The above code compiles, although it doesn't doanything.

The "tree" is recursively searched for selected nodes using two parallel search paths passing action functors. Each search itself is parallel. In a real example, the last two parallel_tasks would likely pass in the decisiion functor. IOW the selection chritia for exclusiveTask_1 would likely be different from that of exclusiveTask_2

One of the principal strengths of QuickThread is shearelagance in coding.

e4lam · ‎04-06-2010

Jim, I appreciate your time in coming up with the examples. However, I'm not currently considering use of other libraries. Your examples do look like an elegant solution to my problem though via the use of OnDone$.

e4lam · ‎04-06-2010

Andrey,

Thanks for pointing out the new wait_for_all() behaviour. I'm still currently on TBB 2.2 so it looks like I'll have to update.

I'm still not clear on how to handle the case when multiple successor nodes try to evaluate the same predecessor in TBB. By the time the successor nodes try to evaluate their common predecessor, they're both already executing within their own tasks so the successor tasks are both already allocated and executing. I'm not sure how task::allocate_additional_child_of nor task::ref_count can be used to solve this problem. In an ideal world, there would be some way to make a task have multiple parents in TBB. It sounds like this might need to wait for post TBB 3.0?

mwhenson · ‎04-06-2010

task::ref_count can be used if the child creates / spawns the 'parent' (parent in the DAG, not the task's parent). You could completely ignore the task's parenting, i.e. there could be only one root task and every other task is a child of that task. Then the ref count is decremented by the DAG's child's completion.

Mike

e4lam · ‎04-06-2010

Sorry for being dense but I'm not following. When do you propose the ref_count be incremented/set? Recall, that I'm constrained to the blocking style, performing post-order traversals, and graph topology is not known ahead of time. Let's take an example graph where we have the directed edges A->B, A->C, , B->D, C->D. In this graph, we only discover upon evaluating D that it needs to evaluate B and C. And only upon evaluating B and C, do we know that they depend on A. I can certainly ignore task parenting but I can't ignore the fact that task B and C can only evaluate after task A completes.

Upon further thought, I kind of see now where Andrey was going with task::enqueue(). I was trying to find documentation for it until I realized that it's not an official function yet (what Andrey said in the first place :). So in this example, let's suppose that B is faster and spawns A. When C goes to evaluate and notices that A is already spawned (ie. already in the queue), C can create an empty child task, enqueue() it and then wait on the empty child task to finish. This should guarentee that C will only continue after A has been done executing.

e4lam · ‎04-06-2010

I should probably ask this anyhow. What are the current semantics of task::enqueue() in TBB 3.0 stable? From the CHANGES file I see:

[plain]- Class task was extended with enqueue() method, and slightly changed
    semantics of methods spawn() and destroy(). For exact semantics,
    refer to TBB Reference manual.
[/plain]

However, the same pdf reference manual on the documentation page does not mention enqueue().

mwhenson · ‎04-06-2010

Quoting e4lam

Sorry for being dense but I'm not following. When do you propose the ref_count be incremented/set? Recall, that I'm constrained to the blocking style, performing post-order traversals, and graph topology is not known ahead of time. Let's take an example graph where we have the directed edges A->B, A->C, , B->D, C->D. In this graph, we only discover upon evaluating D that it needs to evaluate B and C. And only upon evaluating B and C, do we know that they depend on A. I can certainly ignore task parenting but I can't ignore the fact that task B and C can only evaluate after task A completes.

Sorry it was not clear that the dependencies were not available until evaluation of the node. If it's possible to do a (serial) dependency traversal, followed by an (parallel enabled) evaluation traversal, that may be able to solve your problem. If that is not the case, then my solution does not work; I would also suggest that tbb is not well suited to this problem, at least not yet.

Mike

Alexey-Kukanov · ‎04-06-2010

static void enqueue ( task& );

Effects

The task is scheduled for eventual execution by a worker thread even if no thread ever explicitly waits for the task to complete. If the total number of worker threads is zero, a special additional worker thread is created to execute enqueued tasks. Enqueued tasks are processed in roughly, but not precisely, first-come first-serve order.

CAUTION: Using enqueued tasks for recursive parallelism can cause high memory usage, because the recursion will expand in a breadth-first manner. Use ordinary spawning for recursive parallelism.

CAUTION: Explicitly waiting on an enqueued task should be avoided, because other enqueued tasks from unrelated parts of the program might have to be processed first. The recommended pattern for using an enqueued task is to have it asynchronously signal its completion, for example, by posting a message back to the thread that enqueued it.

e4lam · ‎04-06-2010

How does enqueue() interact with spawned tasks? While I can sort of see the "roughly FIFO" ordering, are there no guarentees that if task A has already been spawned/enqueued by the time task B is enqueued, then task A will execute before task B?

As for the CAUTION regarding waiting on enqueued tasks, are you saying here that other tasks might be executed even after the enqueued task waited upon has already been done executing before returning to the parent?

ARCH_R_Intel · ‎04-06-2010

Unfortunately, blocking style for postorder evaluation of a DAG will deadlock if the TBB task stealing mechanism is used. I first ran into this years ago when prototyping futures for TBB. Theproblem arises from what we call the "trapped task" problem. This happens when a thread running a task blocks waiting for child tasks to complete, and in the meantime steals other work. In the DAG case, this other work can end up waiting on the original task, which leads to deadlock.

To see this, consider a DAG of vertices (i,j) where evaluation of (i,j) requires evaluation of (i,j-1) and (i-1,j). Here is a troublesome sequence:

Thread A starts evaluating (1,1)
Thread A recursively starts evaluation of (0,1)
ThreadB attempts to evaluate (0,2)
ThreadB recursively attempts to evaluate (0,1). But Thread A got to it first, so thread B goes off and steals other work.
Thread B steals the task for evaluating (0,3)
Thread A finishes evaluation of (0,1). The evaluation of (0,2) can complete, but that's the (0,2) from step 3, which is now buried in thread B's stack.
Thread B recursively attempts to evaluate (0,2). It's going to wait forever, since it is also trying to evaluate it at step 3.

With continuation passing, the problem goes away since no task becomes trapped on a stack.

With respect to the "second wait" question, yes, you can add a predecessor task and use the wait_for_all(). I've added an example of this trick. As the name of the file and warnings at the top indicate, it will often deadlock with TBB up to and including 3.0.

ARCH_R_Intel · ‎04-06-2010

When a thread runs out of local work, it attempts to execute queued tasks before it tries to steal spawned tasks. The motivation is that queue tasks are for things where latency, not locality, is more important.

Waiting on an enqueued task will return when the enqueued task finishes. The hazard we worry about is that ina large piece of software, the programmer does not know what else was enqueued in front of their task,and so could get themselves into deadlock. (Of course, as my earlier reply showed, spawned tasks can deadlock too when using blocking style for DAGs.) It seems much safer to use enqueued tasks in a fire and forget way, or have the enqueued task enqueue a reply when it is done, or use continuation passing. TBB 3.0 will have a TBB Design Patterns document explaining some of the use cases for enqueued tasks.

ARCH_R_Intel · ‎04-06-2010

Conversion of my example to continuation-passing style is straightforward using a "pseuo program counter" approach. Attached is the modified example. It does post-order evaluation without deadlocking.

I wrote the program counter logic explicitly to show the details. I've seen people write macros for it in other systems.

e4lam · ‎04-06-2010

Thanks for the enlightening examples, Arch! I guess I will have to try to buckle down and see how to re-engineer things so that the continuation style can be used. I was really hoping to avoid that though. :(

As for doing the "second wait", thanks again for the great example! I was actually thinking of a similar approach while commuting back home from work.Although I can't seem to find it in the documentation, I would assume that whenever a task is done executing, the scheduler automatically decrements the ref count of the executed task's parent? So in your example, it looks like spawning the empty proxy task is just a fancy way for the scheduler to decrement the ref count of the second waiting task. Instead then, could the second task directly add itself into the predecessor task's proxy list? Then when the predecessor task completes, it simply decrements the ref count on all of its proxy tasks, allowing them to be scheduled for execution?

ARCH_R_Intel · ‎04-07-2010

Alas effectively parallelizing codes often requires some re-engineering. I recently say a paper describe itas "an exercise in refactoring".

Something to consider in your re-engineering is whether individual node evaluations will do enough work to amortize scheduler overheads or not. I recommend that the work per task be about 10,000 clock cycles or higher. If not for your case, you'll need to aggregate multiple node evaluations together somehow.I've attached an excerpt of the TBB 3.0 Design Patterns document that describes agglomeration for grids and trees. But for DAGs, I have not seen a common pattern for agglomeration.

With respect toproxy tasks, yes, the schedulerdoes descrement the ref_count of a task when a predecessor finishes. It's somewhat buried in the Reference manual in the section titled "Processing of execute()". [By the way,the TBB 3.0 documentation has switched from saying child/parent to predecessor/successor. Method names retain the historical names for sake of source compatibility.]

The reference counts of waiting tasks could be manipulated directly, using task::decrement_ref_count. However, a task might be waiting on more than one node, so it might have to be in more than one list. So since I needed to keep a list anyway, I introduced ProxyTask objects. In other words each ProxyTask serves two purposes: (1) As a way to decrement the ref_count of a waiting task. (2) As a list element. Because TBB allocates small tasks from thread-local free lists, the ProxyTask objects are a fairly cheap way to implement the list aspect, so it seemed easier to me than trying to use task::decrement_ref_count.

e4lam · ‎04-07-2010

Thanks again for all the pointers, Arch! I'll be taking a look at your document.

In my case, the work done per node can be fairly large. The re-engineering effort will likely result in suboptimal parallelism for me. I have evaluation code that calls functions which in turn may or may not evaluate other nodes. So it's not even possible in many cases to know if/when/where a node will be evaluated from.