Intel® oneAPI Threading Building Blocks

concurrent execution of parallel_for and task reference counts

anton_yemelyanov
Beginner
Hi Everyone,

I decided to post my question here, as I have been staring at this for quite a while now.

#1

I have an image processing algorithm that digests an image scanline by scanline. parallel_for is excellent for this, as I can run parallel_for with a function that processes each scanline and a blocked_range covering 0..height of the scanlines.
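Roughly, the per-image loop looks like this (Image and ProcessScanline here are just stand-ins for my real image type and per-scanline routine):

[cpp]#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

struct Image { int height; /* ... */ };                // stand-in for my real image type
void ProcessScanline( Image& img, int y ) { /* per-scanline work */ }

// Body object: processes the scanlines of its sub-range.
class ScanlineBody {
    Image& my_image;
public:
    ScanlineBody( Image& img ) : my_image(img) {}
    void operator()( const tbb::blocked_range<int>& r ) const {
        for( int y=r.begin(); y!=r.end(); ++y )
            ProcessScanline( my_image, y );
    }
};

void ProcessImage( Image& img ) {
    tbb::parallel_for( tbb::blocked_range<int>(0, img.height), ScanlineBody(img) );
}[/cpp]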

This works great, and I got over a 3:1 speed improvement.

Having done this, I ran into another problem. I now have 5 images and my algorithm demands that I initiate the processing on all 5 images in parallel and then continue performing operations on the main thread.

In pseudo code this would look like:

start parallel_for for 5 images
do "stuff"...
wait_for all parallel_for to complete

I immediately started looking into the task-based model of doing this. My first thought was to create a task that runs parallel_for in its body. However, that would create a blocking task, which is not recommended when using tasks, as it effectively blocks the scheduler's task execution threads.

My natural inclination was to create a thread to handle this; however, since adopting TBB I have been doing my best to avoid creating my own threads.

I ran an experiment that is showing good results. I created nonblocking_parallel_for, which is exactly the same as parallel_for, except that it takes a task as one of its input parameters, and when doing start_for<>() it passes this task to be used as the parent of the parallel_for tasks.

That works very well, as I can do the following:
1. create tbb::empty_task
2. start multiple nonblocking_parallel_for() for 5 images (giving them this empty_task)
3. do "stuff"...
4. wait for empty_task to complete

This really works well, except that I had to create my own class replicating parallel_for functionality, which I did not like.

Is there a way to run multiple tasks/functions (similar to parallel_invoke) that is non-blocking and does not tie up task threads, and then wait at a later time for all of these tasks to complete?

#2

On a separate but related subject, what I found rather annoying is that I had to do "task.set_ref_count(5+1);" on the empty task before using it with these 5 images. A quick side question: is it safe to do set_ref_count(get_ref_count()+1)? I have not fully dug into task processing, but I get the feeling that ref_count might be decremented concurrently, which would make this not thread-safe.

The problem is that I ultimately don't know how many images there are, and I need to pass a parent task to an image so that it can create its own N child tasks, depending on the image format etc. I can then wait on all these tasks.

What I ended up doing for the time being is to set up an array of empty tasks and pass this array to the images. If an image needs a task, it creates a new tbb::empty_task, adds it to the array (a std::vector<tbb::task*>) and creates tasks with this empty_task as the parent.

Once "stuff is done..." on the main thread, it does wait_for_all() sequentially for each task in this std::vector

Is there a better way of doing this? I am basically looking for an efficient way of dynamically adding child tasks to an existing parent task.

Appreciate your feedback,

Anton
Alexey-Kukanov
Employee
#2

On a separate but related subject, what I found rather annoying is that I had to do "task.set_ref_count(5+1);" on the empty task before using it with these 5 images. A quick side question: is it safe to do set_ref_count(get_ref_count()+1)? I have not fully dug into task processing, but I get the feeling that ref_count might be decremented concurrently, which would make this not thread-safe.

The problem is that I ultimately don't know how many images there are, and I need to pass a parent task to an image so that it can create its own N child tasks, depending on the image format etc. I can then wait on all these tasks.

Easy things first :)

You are right that the reference counter might be decremented concurrently, and so set_ref_count may only be used before the first child is spawned.

It's understandable that one may not know the number of children in advance. For such cases, we have allocate_additional_child_of(), which atomically increments the reference counter. So you set the ref_count of the root task to 1, and allocate each child task using the mentioned method. The thing to be careful about is that once you have started to wait for the children to complete (i.e. called wait_for_all on the root task), you may only add more children from inside the execute() method of an existing child, which guarantees that the wait for children has not yet finished.
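A rough sketch of that pattern, assuming a hypothetical ImageTask whose execute() does the per-image work (in newer TBB versions allocate_additional_child_of is a static method of task, but calling it through the root pointer works either way):

[cpp]#include "tbb/task.h"
#include <vector>
#include <cstddef>

struct Image { /* ... */ };                                  // placeholder type
void ProcessOneImage( Image& img ) { /* e.g. an inner parallel_for */ }

// Hypothetical child task: processes one image in its execute() method.
class ImageTask: public tbb::task {
    Image& my_image;
public:
    ImageTask( Image& img ) : my_image(img) {}
    tbb::task* execute() {
        ProcessOneImage( my_image );
        return NULL;
    }
};

void ProcessAllImages( std::vector<Image>& images ) {
    tbb::empty_task* root = new( tbb::task::allocate_root() ) tbb::empty_task;
    root->set_ref_count(1);                                  // 1 for the wait_for_all below
    for( std::size_t i=0; i<images.size(); ++i ) {
        // allocate_additional_child_of atomically increments root's reference count
        tbb::task& child = *new( root->allocate_additional_child_of(*root) )
                            ImageTask( images[i] );
        root->spawn( child );
    }
    // ... do "stuff" on the main thread ...
    root->wait_for_all();                                    // also helps execute pending children
    root->destroy( *root );                                  // the root never runs, so destroy it explicitly
}[/cpp]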
Alexey-Kukanov
Employee
I now have 5 images and my algorithm demands that I initiate the processing on all 5 images in parallel and then continue performing operations on the main thread.

In pseudo code this would look like:

start parallel_for for 5 images
do "stuff"...
wait_for all parallel_for to complete

I immediately started looking into the task-based model of doing this. My first thought was to create a task that runs parallel_for in its body. However, that would create a blocking task, which is not recommended when using tasks, as it effectively blocks the scheduler's task execution threads.

My natural inclination was to create a thread to handle this; however, since adopting TBB I have been doing my best to avoid creating my own threads.

I ran an experiment that is showing good results. I created nonblocking_parallel_for, which is exactly the same as parallel_for, except that it takes a task as one of its input parameters, and when doing start_for<>() it passes this task to be used as the parent of the parallel_for tasks.

That works very well, as I can do the following:
1. create tbb::empty_task
2. start multiple nonblocking_parallel_for() for 5 images (giving them this empty_task)
3. do "stuff"...
4. wait for empty_task to complete

This really works well, except that I had to create my own class replicating parallel_for functionality, which I did not like.

Is there a way to run multiple tasks/functions (similar to parallel_invoke) that is non-blocking and does not tie up task threads, and then wait at a later time for all of these tasks to complete?

First, it was not at all a bad idea to just start each parallel_for in a separate task. It looks like you may have misunderstood the purpose of the wait_for_all and spawn_root_and_wait calls. These methods do not put a thread to sleep until all child tasks complete; instead, they make it execute those tasks. So the call to parallel_for does not block a worker thread; in fact, the thread will produce and execute the parallel_for tasks.

Though the problem was somewhat imaginary in my opinion, your solution to it is good - you effectively made the worker threads avoid any wait_for_all calls :)

In recent versions of TBB there is a new class called task_group that provides a more convenient interface to the pattern "allocate an empty task, spawn a few children of it, and wait for completion at a later time". Here is an example (it uses C++0x lambda expressions for the tasks; task_group works with function objects as well):

[cpp]#include "tbb/task_group.h"
using namespace tbb;
int Fib(int n) {
    if( n<2 ) {
        return n;
    } else {
        int x, y;
        task_group g;
        g.run( [&]{x=Fib(n-1);}); // spawn a task
        g.run( [&]{y=Fib(n-2);}); // spawn another task
        g.wait(); // wait for both tasks to complete
        return x+y;
    }
}[/cpp]
So I would try using task_group as the outer-level parallel construct, and use unmodified parallel_for for the inner-level parallelism of the image processing.
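Applied to your case, that combination might look roughly like this (Image and ProcessScanline are again placeholders for your own image type and per-scanline routine):

[cpp]#include "tbb/task_group.h"
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include <vector>
#include <cstddef>

struct Image { int height; /* ... */ };                      // placeholder type
void ProcessScanline( Image& img, int y ) { /* per-scanline work */ }

void ProcessAllImages( std::vector<Image>& images ) {
    tbb::task_group g;
    for( std::size_t i=0; i<images.size(); ++i ) {
        Image* img = &images[i];                             // capture a pointer by value
        g.run( [img] {                                       // outer level: one task per image
            tbb::parallel_for( tbb::blocked_range<int>(0, img->height),
                [img]( const tbb::blocked_range<int>& r ) {
                    for( int y=r.begin(); y!=r.end(); ++y )
                        ProcessScanline( *img, y );          // inner level: scanlines
                } );
        } );
    }
    // ... do "stuff" on the main thread ...
    g.wait();                                                // waits for, and helps run, all image tasks
}[/cpp]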
anton_yemelyanov
Beginner
Alexey,

Thank you very much for your reply. All of the above was addressed once you made me realize that waiting on a task does not simply block the scheduler thread, but uses that thread as a resource for processing the tasks it is waiting on.

I now create an empty_task as a root, to which different functions add tasks using allocate_additional_child_of(), and these tasks themselves contain parallel_for() loops. I then do root->spawn() for each child and at a later time wait on my root task.

I am wondering about properly deallocating tasks. When do tasks have to be explicitly destroyed?

Currently, when I allocate my root task, I do root->destroy(*root); once I am done with it; however, for the children I do allocate_additional_child_of(), then root->spawn(), and then I just forget about them. Is this correct?

Anton
Alexey-Kukanov
Employee
I am wondering about properly deallocating tasks. When do tasks have to be explicitly destroyed?

Currently, when I allocate my root task, I do root->destroy(*root); once I am done with it; however, for the children I do allocate_additional_child_of(), then root->spawn(), and then I just forget about them. Is this correct?
Yes, this is correct. You only need to take care of destroying tasks that are never executed. Tasks that are executed (i.e. the children in your case) are automatically destroyed after execution, unless they were explicitly recycled.

Using task_group would allow you not to bother about task destruction at all; unlike tasks, task_group objects can be automatic (i.e. stack-allocated).