Intel® oneAPI Threading Building Blocks

Better Support for Tasks with Multiple Successors

ARCH_R_Intel
Employee
556 Views

This note proposes a small extension to class tbb::task that would improve support for scenarios where a task has multiple successor tasks. The proposal adds two methods to class task:

//! Atomically increment reference count.
void task::increment_ref_count();

//! Atomically decrement reference count and return its new value.
int task::decrement_ref_count();

Currently, tasks with multiple successors are handled in TBB via dummy child tasks. E.g., if task A has two successors B and C that should run after A runs, dummy child tasks b and c are created for B and C respectively. When A finishes, it spawns b and c. When b or c finishes, its parent runs. The disadvantage of this approach is that it incurs the overhead of creating/running/destroying the dummy child tasks, and it complicates exception safety when said dummy child tasks must be cleaned up.

The proposed new methods enable a program to manipulate reference counts just as dummy children would, but without the dummy children. Instead of allocating a dummy child, a program can use method task::increment_ref_count. Instead of running a dummy child to completion, a predecessor task could use:

if( successor->decrement_ref_count()==0 )
    spawn( *successor );

where successor points to the successor task. The means of representing the set of successors is left to the user.

A design alternative is to provide a method that does the decrement and conditional spawning:

void task::decrement_ref_count_and_spawn_if_zero();

An argument for this alternative is that it is the intended idiom, and decrementing a reference count without spawning when it reaches zero can lead to problems (e.g., if another thread is using the intended idiom on the same successor!). But in the spirit of "trust the user", the more primitive interface seems preferable, as creative users may find valid uses for the less restrictive interface.

Thanks to Mike Henson for his early feedback on these additions to class task. A complete example that uses the proposed extension follows.

- Arch Robison

// Example of using proposed extensions to tbb::task to support task DAGs more efficiently.
// Date: Sept. 2, 2008
// Author: Arch D. Robison
#include <cstdio>
#include <cassert>
#include "tbb/task.h"
#include "tbb/task_scheduler_init.h"

//! A task with possibly multiple successors.
class MyTask: public tbb::task {
    static const size_t max_pred = 4;
    tbb::task* successor[max_pred];
    int n;
    const char* name;
    tbb::task* execute() {
        printf("%s starts\n",name);
        // Do some real work.
        for( volatile int i=0; i<100000000; ++i )
            continue;
        printf("%s finishes\n",name);
        // Notify successors that we are done.
        for( int i=0; i<n; ++i )
            if( successor[i]->decrement_ref_count()==0 )
                spawn( *successor[i] );
        return NULL;
    }
    //! Add a predecessor. Max of two predecessors allowed.
    void add_pred( MyTask* t ) {
        if( t ) {
            assert( t->n<(int)max_pred );
            t->successor[t->n++] = this;
            increment_ref_count();
        }
    }
public:
    MyTask( const char* name_, MyTask* pred0=NULL, MyTask* pred1=NULL ) : n(0), name(name_) {
        set_ref_count(1);
        add_pred(pred0);
        add_pred(pred1);
    }
    //! Called once all predecessors have been constructed and the task is safe to run.
    /** In this particular example, we could call ready_to_go at the end of the constructor.
        But in real situations, it's probably best to keep construction and
        "ready to go" as separate actions, in case extra setup needs to be done after construction. */
    void ready_to_go() {
        if( decrement_ref_count()==0 )
            spawn(*this);
    }
};

int main() {
    tbb::task_scheduler_init init;
    // Need a dummy for the final wait.
    tbb::task* dummy = new( MyTask::allocate_root() ) tbb::empty_task;
    dummy->set_ref_count(2);
    // Set up diamond pattern (arcs implicitly downwards in ASCII picture)
    //      A
    //     / \
    //    B   C
    //     \ / \
    //      D   E
    //       \ /
    //        F
    //        |
    //      dummy
    MyTask* A = new( tbb::task::allocate_root() ) MyTask("A");
    A->ready_to_go();
    MyTask* B = new( tbb::task::allocate_root() ) MyTask("B",A);
    B->ready_to_go();
    MyTask* C = new( tbb::task::allocate_root() ) MyTask("C",A);
    C->ready_to_go();
    MyTask* D = new( tbb::task::allocate_root() ) MyTask("D",B,C);
    D->ready_to_go();
    MyTask* E = new( tbb::task::allocate_root() ) MyTask("E",C);
    E->ready_to_go();
    // F must be a child of dummy, because we wait on dummy eventually.
    MyTask* F = new( dummy->allocate_child() ) MyTask("F",D,E);
    F->ready_to_go();
    dummy->wait_for_all();
    dummy->destroy(*dummy);
    return 0;
}


13 Replies
Dmitry_Vyukov
Valued Contributor I
MADadrobiso:

Currently, tasks with multiple successors are dealt with in TBB via dummy child tasks.



Why can't the user currently employ the proposed technique?
It seems that it's no more than:

class my_task : public tbb::task
{
    ...
    atomic<int> rc;

    void increment_ref_count() {
        atomic_inc_acquire(rc);
    }

    bool decrement_ref_count() {
        if (1 == load_acquire(rc))
            return true;
        if (0 == atomic_dec_release(rc)) {
            memory_fence_acquire();
            return true;
        }
        return false;
    }
};

Ok, while writing this I understood why: because he can't get all the memory fences right :)
But the user can just use atomic RMW with full fences.

Btw, does increment_ref_count() really have to be atomic? It seems that the intended usage is in the 'construction' phase, so increment_ref_count() can use a plain non-atomic increment.

Dmitry_Vyukov
Valued Contributor I
MADadrobiso:

This note proposes a small extension to class tbb::task that would improve support for scenarios where a task has multiple successor tasks. The proposal adds two methods to class task:

 //! Atomically increment reference count.
void task::increment_ref_count();
 //! Atomically decrement reference count and return its new value.
int task::decrement_ref_count();


Btw, .NET TPL solves the problem by adding a Task::ContinueWith() method:
http://blogs.msdn.com/pfxteam/archive/2008/07/23/8768673.aspx
http://blogs.msdn.com/pfxteam/archive/2008/08/05/8835612.aspx

...since you were implementing the task_group interface, I think you are already aware of ContinueWith() too.

And they do exactly the same thing: decrement a CountdownLatch in the ContinueWith() callback.

At first glance, ContinueWith() looks like a more general interface, and it is also non-intrusive wrt the predecessor task, i.e. a task can be unaware of the fact that it has some 'synchronous continuations'.
But I understand your concern with not handling a list of successor tasks. It would require dynamic memory allocation. Right?
It seems that the .NET guys don't care about such things :)
I need to think some more on this.

mwhenson
Beginner
I really like this. Also, for this problem domain, some of my 'tasks' are extremely quick in terms of CPU cycles (<< 10K) and some are very slow (>> 10K). I'm working on abstracting out the task to have the ability to make a 'task' run 'serially' (i.e., no TBB task allocation) or in parallel using the same concept of ref_count. Code for review may be forthcoming if my boss gives the OK.

I actually agree that increment_ref_count is both unnecessary and probably doesn't need to be atomic, at least not in any of my uses. That being said, I like increment_ref_count since it is more readable than set_ref_count(ref_count()+1).

Mike
ARCH_R_Intel
Employee

Method increment_ref_count() needs to be atomic if it is operating concurrently with decrement_ref_count(), e.g., if more predecessors are being generated while existing predecessors are running. If predecessors are not running concurrently, it does not need to be atomic.

By the way, I erred in my example. The "x.ready_to_go()" calls occur before all references to x are made, so x may run and disappear while pointers to it are still being used to construct other tasks.

Dmitriy is correct that a user could use their own reference count field to achieve the same effect without my proposed additions. The advantage of the proposed additions is that they use the existing reference count, and thus can be used in combination with "real" child tasks.

Dmitry_Vyukov
Valued Contributor I
MADadrobiso:

Method increment_ref_count() needs to be atomic if it is operating concurrently with decrement_ref_count()

... or with other concurrent increment_ref_count() invocations.

But maybe the solution is just to prohibit concurrent invocations of decrement_ref_count() and increment_ref_count()? (Possibly with some asserts in a debug build.) I was not using DAGs of tasks, so it's just an abstract thought.


MADadrobiso:

Dmitriy is correct that a user could use their own reference count field to achieve the same effect without myproposed additions. The advantage of the proposed additions is that they use the existing reference count, andthus can be used in combination with "real" child tasks.


I got entangled a bit. If a task has children, then it must already have been executed. Are you referring here to the continuation of a task?

mwhenson
Beginner
Even in your example, increment_ref_count is insufficient to add a predecessor to a task concurrently, as we would also need to be able to concurrently add the predecessor to the list; hence I can't think of a use case.

Mike
ARCH_R_Intel
Employee

randomizer:

I got entangled a bit. If a task has children, then it must already have been executed. Are you referring here to the continuation of a task?

Yes, I should have said continuation.

ARCH_R_Intel
Employee

mwhenson:
Even in your example, increment_ref_count is insufficient to add a predecessor to a task concurrently, as we would also need to be able to concurrently add the predecessor to the list; hence I can't think of a use case.

Attached is a variation on my example that exploits the concurrency. It starts predecessor B of D before it finishes linking in predecessor C. My previous example tied construction of tasks and linking in a way that precluded this sort of thing. More generally, I think the interesting case for atomic increment_ref_count is when the static structure of the graph is not fully known ahead of time, and there is a desire for the thread creating the graph to start predecessors running before it has enumerated all predecessors.

Dmitry_Vyukov
Valued Contributor I
What do you think about the following interface?

First, task provides the following functions:

//! Atomically increment reference count.
void task::increment_ref_count();

//! Atomically decrement reference count and return its new value.
int task::decrement_ref_count();

Then, task also provides an on_after_execute() callback:

namespace tbb
{
    class task
    {
        //...

        // callback function, called after the user's execute() has finished
        virtual void on_after_execute() {}

        void internal_execute()
        {
            //...

            // user's execute
            this->execute();

            // calling the callback
            on_after_execute();

            //...
        }
    };
}

Then the following helper class is provided by TBB:

namespace tbb
{
    template<size_t count_of_embeded_successors = 4>
    class dag_task : public task
    {
    public:
        void continue_with(task* successor)
        {
            put_to_embed_array_or_allocate_dynamically(successor);
            successor->increment_ref_count();
        }

        void continue_with(task* successor1, task* successor2)
        {
            continue_with(successor1);
            continue_with(successor2);
        }

    private:
        task* embed_successors[count_of_embeded_successors];
        task* dynamically_allocated_successors;

        virtual void on_after_execute()
        {
            foreach successor s
            {
                if (0 == s->dec_ref_count())
                    s->spawn();
            }
        }
    };
}

Then the example can be rewritten this way:

class my_dag_task : public tbb::dag_task<>
{
    // no changes required
};

int main()
{
    my_dag_task *A, *B, *C, *D, *E, *F;
    // construct tasks as usual
    A->continue_with(B, C);
    B->continue_with(D);
    C->continue_with(D, E);
    D->continue_with(F);
    E->continue_with(F);
}

So, basically, TBB provides low-level functions for other interesting usages,
and also high-level convenient wrapper for main use case.

mwhenson
Beginner
Hmmm, this assumes that C will not finish before E adds it as a predecessor. It also assumes that, in add_pred(), the task is not already spawning its predecessors between adding the predecessor to the list and incrementing the ref count. These may be fine assumptions in some cases, but this is completely thread-unsafe in general and hence doesn't imply that increment_ref_count needs to be atomic.

For your example to be correct, you would need some guarantee that all preds will be added to the task before the task completes. This is hard to guarantee in general, hence you would need some locking mechanism to make sure the thread does not spawn its preds while you are trying to add one; once you're already locking, an atomic increment becomes unnecessary. The only use case I can think of for an atomic increment is when you can guarantee that the task will not be done by the time all successors are added AND that you need to add successors in parallel. That seems rather unlikely.

These issues are why I build the whole execution tree before starting any execution.
mwhenson
Beginner
randomizer:
What do you think about following interface?

That's essentially what I've implemented in my code. The whole tree still needs to be constructed before starting execution to be completely safe. I don't have a strong opinion one way or the other on whether it should be included in the library; I like that it's low-level and easy to implement a dag_task as necessary, but others may find it useful.

Mike

ARCH_R_Intel
Employee

I suggest reorganizing the interface to make it more "pay as you go". Adding another virtual function to class task that the scheduler must call after execute() imposes an execution cost on everyone, even those who are not using it. The cost of the extension can be moved to only those using it by making method dag_task::execute() run two virtual methods: user_execute() and on_after_execute(). Users could override user_execute() to do their work.

Dmitriy's earlier remark about "can't get all the memory fences right" has me leaning back towards providing decrement_ref_count_and_spawn_if_zero(), because it seems to be the common use case and the fencing could be simplified. The simplification is that the acquire fences can be eliminated for the decrement, because they occur anyway when spawning causes the task to go in and out of the task pool.

Dmitry_Vyukov
Valued Contributor I
MADadrobiso:

I suggest reorganizing the interface to make it more "pay as you go". Adding another virtual function to class task that the scheduler must call after execute() imposes an execution cost on everyone, even those who are not using it. The cost of the extension can be moved to only those using it by making method dag_task::execute() run two virtual methods: user_execute() and on_after_execute(). Users could override user_execute() to do their work.

I completely understand your concerns. When I was designing a similar library, I eliminated even smaller inefficiencies from the fast path. A virtual function is not the best choice here.

As for "dag_task::execute() run two virtual methods: user_execute() and on_after_execute()": if the user implements dag_task himself, then I think it's OK, because he knows what he is doing. If dag_task is to be a part of TBB, then I think it's not the best solution, because the user has to remember to override user_execute() rather than execute() (as usual). MSVC has the __sealed keyword, so execute() can be marked as sealed in dag_task, but other compilers don't support sealed virtual functions.

Well, if TBB were using templates, life would be easier. In my framework the user defines a task as:

class user_task : public zzz::task<user_task> {...}

In this situation the base class zzz::task can detect whether the descendant class overrides some callback function or not. If the function is not overridden in the descendant class, then it's just not called at all. The runtime cost is zero.


MADadrobiso:

Dmitriy's earlier remarks about "can't get all the memory fences right" has me leaning back towards providing decrement_ref_count_and_spawn_if_zero(), because it seems to be the common use case andthe fencing could be simplified. The simplification is that theacquire fences can be eliminated for the decrement, because they occur anyway when spawning causes the task goes in and out of the task pool.



Yeah, the more that is implemented in the library, the more opportunities for various optimizations.
