Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Starting top level tasks

bangerth
Beginner
525 Views

I'm porting the deal.II library (www.dealii.org), a library for finite element computations, to use TBB. Let me say I love TBB and that pipeline is exactly the tool that is needed in many places.

I want to move to more unstructured cases now. Essentially, here is a sketch of what I want to do:

-------------------------------

void worker (int i);

int main () {
  int i;
  while (std::cin >> i)
    create a new task and run worker(i) on it
}

-------------------------------------

So I will end up with a lot of top-level tasks (more than CPU cores available) that I would like TBB to schedule as it sees fit. Let's not worry for the moment about the fact that we will have to wait for all tasks to finish before we exit main().

My current implementation tries to do something like this:

---------------------------------------

void worker (int);

struct WorkerTask : tbb::task {
  boost::function<void ()> function;
  tbb::task * execute () { function(); return 0; }
};

int main () {
  int i;
  tbb::task_scheduler_init init;
  while (std::cin >> i) {
    WorkerTask * t = new (tbb::task::self().allocate_child()) WorkerTask;
    t->function = boost::bind (worker, i);
    tbb::task::self().spawn (*t);
  }
}

-------------------------------------------------

My ideas for this implementation were:

- Since I have no convenient task around to which to attach the new tasks, I use task::self() to spawn the new tasks from.

- I assume that as a consequence, I should also use tbb::task::self().allocate_child() to allocate the new task before spawning it, right?

- The alternative would have been to use tbb::task::allocate_root() and tbb::task::spawn_root_and_wait(), but the latter of course blocks spawning further tasks. What I really miss is something like a non-blocking tbb::spawn_root().

My main questions are these:

- Is this a useful way to implement things?

- If using tbb::task::self() as the parent for my new tasks, I suppose I have to also do something like tbb::task::self().set_ref_count(). The trouble is that I don't know the number of tasks up front. What *would* work is if there were an atomic function tbb::task::increment_ref_count() that increments the ref count by one and which I could call each time I'm about to spawn a new child of tbb::task::self(). I imagine a function like that could be useful in more contexts than this since set_ref_count() can really only be called once, before any child tasks are created. After that, even if there were a way to figure out how many tasks are currently running/queued, calling set_ref_count() has an obvious race condition.

Any feedback would be much appreciated!

Thanks

Wolfgang

0 Kudos
8 Replies
robert-reed
Valued Contributor II
Quoting - bangerth

I'm porting the deal.II library (www.dealii.org), a library for finite element computations, to use TBB. Let me say I love TBB and that pipeline is exactly the tool that is needed in many places.

I want to move to more unstructured cases now. Essentially, here is a sketch of what I want to do:

-------------------------------

void worker (int i);

int main () {
int i;
while (std::cin >> i)
create a new task and run worker(i) on it
}

-------------------------------------

So I will end up with a lot of top-level tasks (more than CPU cores available) that I would like TBB to schedule as it sees fit.

Let me answer a question with a question. Given your appreciation of the TBB pipeline, why would you want to "end up with a lot of top-level tasks" (from your example, potentially a great many of them) rather than use a pipeline, whose token count limits the number of tasks to a count manageable by the available HW threads? One of the philosophies embedded in the Cilk/TBB approach is to limit task creation to a rate sufficient to keep all HW threads busy without getting too far ahead, creating queues that grow old in cache and get evicted to memory.

I also assume that this is just an example and that in your real code you have more work to do than processing a single integer per task.

bangerth
Beginner
| Let me answer a question with a question. Given your appreciation of TBB pipeline, why would you want to "end up with a lot of
| top-level tasks" which from your example could be a lot more tasks rather than using a pipeline and use the token count to limit the
| number of tasks to a count manageable by the available HW threads?

These are two parts of the program. In finite element programs one splits a body into thousands of cells and has to do something on each of them; I use the pipeline pattern for this: a sequential stage to iterate over all cells, a parallel stage to do something on them, and a sequential stage to put the results together again in the correct order.

But other parts of programs are not of this kind. For example, I may want to read in an input file, construct a mesh, compute boundary conditions, etc, all of which may be independent. For this I wanted to spawn tasks -- currently I use threads for this, but tasks should do better. For this latter job, over the lifetime of the program I will have many more tasks than HW threads, but most of the time after half a dozen jobs there will be a synchronisation point, i.e. I will wait for all spawned threads to finish.

| I also assumed that this is just an example and that in your real code you have more work to do than to process a single integer per task.

Yes, absolutely. The example I gave was purely illustrative. What I really want to do is write code like this:

------------------------------------

void foo(int);
void bar(int);

void baz (int i) {
  Task t1 = new_task (boost::bind(foo, i));
  Task t2 = new_task (boost::bind(bar, i));
  do_something_else_in_the_meantime ();
  t1.join();
  t2.join();
}

---------------------------------------------

You see the obvious analogy to threads here (in fact, we have had this whole setup working with threads for many years). Note that my new_task() function needs to create a task with the given function argument and pass it on to the scheduler without knowledge of how many other tasks exist. baz() also has no obvious knowledge of whether it is itself run in a task, and both foo() and bar() may want to spawn additional tasks themselves.

Any suggestions as to how to achieve this?

Thanks

Wolfgang

RafSchietekat
Valued Contributor III

"tbb::task::self().spawn (*t);" Any reason why you don't use spawn_root_and_wait() (rhetorical question)?

"Any suggestions as to how to achieve this?" Allocate a root l_empty_task, allocate and spawn t1 and t2 as its children (with a correct ref_count on l_empty_task, optionally using allocate_additional_child_of()), do_something_else_in_the_meantime(), l_empty_task.wait_for_all(), l_empty_task.destroy(l_empty_task) (you could also do spawn_root_and_wait(l_empty_task) instead of the last two steps).

I hope that helps; please use the reference to look up what I actually meant.

jimdempseyatthecove
Honored Contributor III

Wolfgang,

Consider the following two samples of pseudo code:
[cpp]void foo(int);

void bar(int);

void baz (int i)
{
  // Queue-up two tasks
  // to an arbitrary set of threads
  parallel_task(foo, i);
  parallel_task(bar, i);
  do_something_else_in_the_meantime ();
  parallel_task_wait();
}

------ or ---

void baz (int i)
{
  // Queue-up two tasks
  // to a set of cores that share an L2 cache
  // with preference towards set with most
  // available threads (waiting threads)
  parallel_task(NotInCache_L2$, foo, i);
  parallel_task(NotInCache_L2$, bar, i);
  do_something_else_in_the_meantime ();
  parallel_task_wait();
}
[/cpp]

In the first case, baz schedules foo and bar to any two (or one) thread without regard to cache sensitivity. When foo and bar have completely independent data sets this type of scheduling is ok.

Consider though the case where foo and bar share some or a considerable amount of data. On currently architected systems with four or more cores, at least two threads share an L2 cache, and the system typically has (number of cores)/2 L2 caches. When foo and bar share data it might be advantageous to select two cores that share the same L2 cache, and to pick the pair that are (or are most likely to be) available for running foo and bar. Further, because you will be running do_something_else_in_the_meantime(), you would prefer that the current thread's L2 cache not be used (there is an alternate mask to specify that you require the current thread not be included in the thread set).

Jim Dempsey
bangerth
Beginner
Quoting - Raf Schietekat

"tbb::task::self().spawn (*t);" Any reason why you don't use spawn_root_and_wait() (rhetorical question)?

"Any suggestions as to how to achieve this?" Allocate a root l_empty_task, allocate and spawn t1 and t2 as its children (with a correct ref_count on l_empty_task, optionally using allocate_additional_child_of()), do_something_else_in_the_meantime(), l_empty_task.wait_for_all(), l_empty_task.destroy(l_empty_task) (you could also do spawn_root_and_wait(l_empty_task) instead of the last two steps).


I see, or create an empty task for each of the two functions I want to call and wait on each in turn.

I think I had never understood the use of empty_task because it didn't occur to me that one may want to create a task but not spawn it. That's a clever technique. I now also see that what you describe is probably what is on page 230/231 of the book. Thanks for the pointer.

W.
bangerth
Beginner

[cpp]void foo(int);

void bar(int);

void baz (int i)
{
  // Queue-up two tasks
  // to an arbitrary set of threads
  parallel_task(foo, i);
  parallel_task(bar, i);
  do_something_else_in_the_meantime ();
  parallel_task_wait();
}
[/cpp]
Yes, this is what I'd love to have (ideally with the parallel_task returning a handle and parallel_task_wait taking the handle of the task I want to wait on).

Is your parallel_task() function meant as an example? I can't find it in the tbb distribution (there is a parallel_task class in the tachyon example, but it does something entirely different). Or maybe I'm just looking in the wrong place for actual code that I could use...

Thanks!
W.
jimdempseyatthecove
Honored Contributor III

The parallel_task and other threading statements are in a threading library I produced for my own purposes. It is called QuikThread (I am looking for beta testers to improve the product). The capability also supports a concept analogous to handles, only more powerful/flexible.

[cpp]void baz (int i)
{
  // create a local control structure
  qtControl qtControl;

  // Queue-up two tasks via explicit qtControl  
  // to a set of cores that share an L2 cache
  // with preference towards set with most
  // available threads (waiting threads)
  parallel_task(NotInCache_L2$, qtControl, foo, i);
  parallel_task(NotInCache_L2$, qtControl, bar, i);
  do_something_else_in_the_meantime ();
  parallel_task_wait(qtControl);
  do_more_work();
}

[/cpp]
The qtControl has an implicit parallel_task_wait in the dtor. Therefore, if do_more_work is not required, you can omit the parallel_task_wait.

Another interesting construction:

[cpp]void baz (int i)
{
  // create a static control structure
  static qtControl qtControl;

  // wait for prior run of baz to complete
  parallel_task_wait(qtControl);

  // Queue-up two tasks via explicit qtControl  
  // to a set of cores that share an L2 cache
  // with preference towards set with most
  // available threads (waiting threads)
  parallel_task(NotInCache_L2$, qtControl, foo, i);
  parallel_task(NotInCache_L2$, qtControl, bar, i);
  // when the above two tasks complete, perform do_more_work()
  parallel_task_on_done(qtControl, do_more_work);

  // while foo and bar (and do_more_work) are pending
  do_something_else_in_the_meantime ();
}
[/cpp]
jimdempseyatthecove
Honored Contributor III

In the first sample code (without the local qtControl) a thread-level default control is used. Prior to entering baz, the task may have enqueued additional tasks, and parallel_task_wait(), waiting for completion of the default control, will wait for those tasks to complete as well as for foo and bar. This is not necessarily what you want. Using the local qtControl provides the means to wait on just those two tasks.

The third example (2nd above) uses a static control structure that lives beyond the lifetime of baz. On the first call the control has no pending tasks and therefore the wait returns immediately. Note that because the control is static, the dtor is not called and therefore the implicit wait is circumvented. Using this technique, baz returns prior to completion, which may be desirable. On the second entry, parallel_task_wait will wait for the prior tasks and completion routine to finish before enqueuing the next set of tasks and completion routine. Omitting the parallel_task_wait(qtControl) would be a programming error: if the two prior tasks and completion code have already finished, the program runs correctly, but if the second foo were enqueued prior to completion of the prior bar, then the completion of the prior bar would not signal the completion event for the first completion request (as the task queue would no longer empty on completion of the first bar).

If the running of the 2nd foo can precede the running of the 1st completion code, then you would likely not see a problem (other than increased time to complete the first session). If the system were busy, though, then without the parallel_task_wait(qtControl) the 2nd foo could potentially be processed prior to the 1st foo.

Proper sequencing of the enqueuing of tasks is the responsibility of the programmer.

What is your system configuration?

Jim Dempsey