Intel® oneAPI Threading Building Blocks

Why no task priorities?

christian_convey
Beginner

Does anyone know why TBB doesn't offer task prioritization? Has it just not been implemented yet?

I have an application where that would be the most convenient mechanism for achieving the schedule I want.

ARCH_R_Intel
Employee

We are experimenting with adding priorities.

The difficulty has been defining what priorities mean. For example, if all threads are busy processing low-priority tasks, and a high priority task is injected into the system, what should happen? If priorities are preemptive, how is priority inversion avoided? Can you explain your use case in more detail so we can figure out whether our experimental scheme might help?

A way to implement non-preemptive priorities is described in Chapter 9 of the TBB Design Patterns manual.
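
For a rough idea of what that pattern looks like, here is a minimal sketch, assuming oneTBB's task_group and concurrent_priority_queue; the WorkItem type and its priority field are hypothetical stand-ins, not the manual's actual code:

#include <tbb/concurrent_priority_queue.h>
#include <tbb/task_group.h>
#include <functional>
#include <utility>

struct WorkItem {
    int priority;                 // larger value = more urgent (hypothetical)
    std::function<void()> run;    // the actual work
    bool operator<(const WorkItem& other) const {
        return priority < other.priority;
    }
};

tbb::concurrent_priority_queue<WorkItem> queue;
tbb::task_group workers;

void enqueue(WorkItem item) {
    queue.push(std::move(item));
    // Spawn a generic worker; which item it runs is decided at pop time,
    // so the highest-priority pending item is always served first.
    workers.run([] {
        WorkItem next;
        if (queue.try_pop(next))
            next.run();
    });
}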

christian_convey
Beginner
I'll see if I can do this with ASCII art well enough.

Here's my computation dependency graph:

A1 --> A2 --> A3 --> ... An

A1 --> B1 --> C1 --> output1
A2 --> B2 --> C2 --> output2
...
An --> Bn --> Cn --> outputn

Each operation above ({A, B, C}{1...n}) has useful internal parallelism.

For our application, we need to produce output1 as early as possible, followed by output2 as soon after as possible, etc. The outputs feed a realtime simulation, so we want to minimize the occurrences of output_i being delivered after the simulator is ostensibly at time i.

So one schedule that would not be acceptable is to first compute all of the A* tasks, followed by all of the B* tasks, and so forth.

The graph above is finite, and knowable before any of those operations begins execution. So I was thinking I could assign priorities in this order (listing highest priority first): Output1, C1, B1, A1, Output2, C2, B2, A2, ..., Outputn, Cn, Bn, An.

This way we'd always prioritize the delivery of soon-needed data over further-out-in-simulation-time data.

Since TBB doesn't offer priorities at the moment, I'm kicking around two alternative ideas for lack of any better ones:

1) Create four separate OS threads: one for each of the A, B, C, and Output tasks. Give those threads scheduling priorities such that Output is highest and A is lowest. Each of those threads has a separate TBB scheduler and will use parallel_for when executing a piece of work (e.g., A1 or B5). The OS thread scheduler would quiesce any of these four threads when it was waiting on an empty concurrent_bounded_queue, which is how lower-priority operations would get a chance to run. I don't really know if this would work.
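
For what it's worth, a rough sketch of idea (1)'s plumbing might look like the following; the stage functions and queue names are hypothetical, and the platform-specific OS priority calls are omitted:

#include <tbb/concurrent_bounded_queue.h>
#include <thread>

void do_B(int i); // user-supplied stage work; may call parallel_for internally

tbb::concurrent_bounded_queue<int> a_to_b, b_to_c; // stage-to-stage handoff

void b_stage(int n) {
    for (int k = 0; k < n; ++k) {
        int i;
        a_to_b.pop(i);   // blocks (quiescing this thread) until A(i) is ready
        do_B(i);
        b_to_c.push(i);  // hand the step index to the C stage
    }
}

// Each stage would run on its own std::thread, e.g.:
//   std::thread tb(b_stage, n);
// with each thread's OS priority set by platform-specific means.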

2) Give up on some potential pipelined parallelism by doing something like this:

A_results calculate_one_simulated_second(A_results prev_A_results) {
    A_results new_A_results;
    B_results new_B_results;
    C_results new_C_results;

    parallel_for( A_functor( prev_A_results, &new_A_results ) );
    parallel_for( B_functor( new_A_results, &new_B_results ) );
    parallel_for( C_functor( new_B_results, &new_C_results ) );
    Output( new_C_results );
    return new_A_results;
}

void do_whole_run(int max_seconds) {
    A_results r;
    init_A_results( &r );

    for (int i = 0; i < max_seconds; ++i) {
        r = calculate_one_simulated_second( r );
    }
}


This actually might work for us, because each of those parallel_for calls is likely to use all of our system's cores anyway, so CPU utilization should be pretty decent.

What I don't like, though, is that this approach breaks down on systems where we actually do have more processor cores than we have useful parallelism in those parallel_for loops. It seems to me that if I could set up tasks with priorities, as I originally envisioned, my program structure would be more resilient to increasing numbers of cores on CPUs.
Dmitry_Vyukov
Valued Contributor I

It seems that you can exploit LIFO scheduling. What you need is:

task Ai
{
    spawn Ai+1
    do synchronously Bi, Ci, outputi
}

This will provide the required ordering. However, parallelism is not ideal; whether that's a problem depends on task granularity. If it is a problem, then every n-th Ai task may spawn tasks Ai+1...Ai+n.
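
In current oneTBB terms, a hypothetical rendering of this pattern might look like the following (task_group is assumed; do_A_work/do_B/do_C/do_output stand in for the real stage functions):

#include <tbb/task_group.h>

void do_A_work(int i); void do_B(int i); void do_C(int i); void do_output(int i); // user-supplied

void task_A(tbb::task_group& tg, int i, int n) {
    do_A_work(i);                  // A(i) itself; must precede A(i+1) per the graph
    if (i + 1 < n)
        tg.run([&tg, i, n] { task_A(tg, i + 1, n); });  // spawn A(i+1)
    do_B(i);                       // then finish step i synchronously
    do_C(i);
    do_output(i);
}

// Usage: tbb::task_group tg; task_A(tg, 0, n); tg.wait();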

Btw, why would Ai fork Bi, Bi fork Ci, and then Ci fork outputi? There is no parallelism there; the only thing you achieve that way is increased overhead.

ARCH_R_Intel
Employee
We've had other requests for supporting these kinds of priorities. The trouble is that we do not know how to support them efficiently in a scalable, "pay as you go" way (i.e., without taxing users who do not need priorities). In particular, there is a fundamental conflict between minimizing communication across the system and strictly obeying priorities. E.g., finding the next task with the highest priority is a global problem.

The design pattern for non-preemptive priorities that I pointed to might be a good compromise. It basically lets you write your own custom scheduler to choose what to run next, so you can choose between minimizing communication (e.g., a data structure that supports some sort of work-stealing scheme) and strictly following priorities (e.g., a global priority queue).

christian_convey
Beginner

What about your approach guarantees the required ordering?

Will the spawned "Ai+1" task somehow be ineligible for activation until after Ai completes?

christian_convey
Beginner
Thanks, it sounds like the Non-Preemptive Priorities pattern is my best bet.
Dmitry_Vyukov
Valued Contributor I

What about your approach guarantees the required ordering?

Will the spawned "Ai+1" task somehow be ineligible for activation until after Ai completes?

No. It's already spawned (as the first step of Ai), so it's immediately eligible for stealing and execution.

christian_convey
Beginner
I haven't come across any documentation that explains these rules to which you're appealing. Would you mind pointing me at the right docs?

I'm looking at the TBB Tutorial, section 11.4, "How Task Scheduling Works". I can see how spawning a child task puts it onto the parent's deque.

But what I'm not clear about is why some other thread couldn't steal that child task before its parent finished its own body.

And if such stealing could occur, and if the TBB scheduler isn't fair, then the parent task could end up starving for CPU time for an arbitrarily long time.
Dmitry_Vyukov
Valued Contributor I
I haven't come across any documentation that explains these rules to which you're appealing. Would you mind pointing me at the right docs?

I'm looking at the TBB Tutorial, section 11.4, "How Task Scheduling Works". I can see how spawning a child task puts it onto the parent's deque.

But what I'm not clear about is why some other thread couldn't steal that child task before its parent finished its own body.

And if such stealing could occur, and if the TBB scheduler isn't fair, then the parent task could end up starving for CPU time for an arbitrarily long time.

I don't understand your concern.

The spawned task will indeed be picked up by another thread and then executed. This is exactly what you want: you want them to execute in parallel. This will not cause any starvation. The current thread finishes Ai while another thread starts processing Ai+1; isn't that what you want?

christian_convey
Beginner
I think you only partially understand what I want.

Your code will generate the correct dependency graph, but it could result in an evaluation schedule that's not what I want.

You proposed something like this:

do_A(int i) {
    spawn( do_A(i+1) );
    do_B(i);
    do_C(i);
    do_output(i);
}


Now, suppose TBB has four threads doing work. You could get a schedule like this (I think):

Thread 1: begin running do_A(1), call spawn ( do_A(2) )
Thread 2: begin running do_A(2), call spawn ( do_A(3) )
Thread 3: begin running do_A(3), call spawn ( do_A(4) )
Thread 4: begin running do_A(4), call spawn ( do_A(5) )
Thread 1: begin running do_B(1) (as part of do_A(1))

At this point, all four threads are doing productive work. But there's a problem: Thread 1, running do_A(1), is now ready to perform its do_B(1) operation.

The problem is, I have no reason to believe (based on my limited understanding of TBB's scheduler) that do_B(1) would be executed before do_A(2)...do_A(4) complete their executions. And since do_A(4) can recursively generate additional tasks: do_A(5), do_A(6), ..., do_A(n), it might be a very long time until do_B(1) finally gets CPU time again.

One thing I should mention: I'm assuming that do_B(1) makes a call to parallel_for(...) as a way of capitalizing on its internal parallelism, and that the additional tasks generated by that parallel_for call are what can remain unscheduled for a very long time. The net result would be that do_A(1) might not execute until after all of the other do_A(...) tasks have completed their execution.

In my application, I want do_A(1) to complete its execution as early as possible, followed by the completion of do_A(2) as soon as possible, etc.

jimdempseyatthecove
Honored Contributor III
What you are describing is a parallel pipeline. I suggest you look at the tbb::parallel_pipeline example(s) to see if this is what you want. The do_A(n) will run in parallel but appear to complete serially. When you "run" the pipeline you can specify the number of concurrent do_A's (the number of tokens/buffers), leaving threads for your parallel_for's. This is a tuning parameter you will have to experiment with. parallel_pipeline is a powerful feature, but it may require you to rethink your programming. Once you master parallel_pipeline you will find a whole new set of opportunities for optimization.
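
For illustration, a minimal sketch under current oneTBB names (parallel_pipeline, make_filter, filter_mode); the stage functions are hypothetical stand-ins for the A/B/C/output work:

#include <tbb/parallel_pipeline.h>

void do_A(int); void do_B(int); void do_C(int); void do_output(int); // user-supplied

void run_pipeline(int n, int max_tokens /* concurrent steps in flight */) {
    int i = 0;
    tbb::parallel_pipeline(max_tokens,
        tbb::make_filter<void, int>(tbb::filter_mode::serial_in_order,
            [&](tbb::flow_control& fc) -> int {
                if (i == n) { fc.stop(); return 0; }
                int step = i++;
                do_A(step);            // A-stage; may use parallel_for inside
                return step;
            })
        & tbb::make_filter<int, int>(tbb::filter_mode::parallel,
            [](int step) { do_B(step); do_C(step); return step; })
        & tbb::make_filter<int, void>(tbb::filter_mode::serial_in_order,
            [](int step) { do_output(step); }));  // outputs delivered in order
}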

Warning - plug coming. The QuickThread parallel_pipeline can be constructed with I/O pipes as well as compute pipes. Essentially, you place your do_input() and do_output() on the two ends of the pipeline with your do_work() in the middle. As with TBB, the output is collated (by default), and you can limit the degree of parallelization by controlling the number of tokens/buffers available to the pipeline. Excess threads are then available for parallel_... calls used by the threads consuming/modifying/producing pipeline tokens/buffers.

Check out the TBB parallel pipeline first since it appears you have already jumped into using TBB. If you get stuck or have I/O performance issues then consider exploring the alternatives.

Jim Dempsey

Andrey_Marochko
New Contributor III
The problem is, I have no reason to believe (based on my limited understanding of TBB's scheduler) that do_B(1) would be executed before do_A(2)...do_A(4) complete their executions. And since do_A(4) can recursively generate additional tasks: do_A(5), do_A(6), ..., do_A(n), it might be a very long time until do_B(1) finally gets CPU time again.

One thing I should mention: I'm assuming that do_B(1) makes a call to parallel_for(...) as a way of capitalizing on its internal parallelism, and that the additional tasks generated by that parallel_for call are what can remain unscheduled for a very long time. The net result would be that do_A(1) might not execute until after all of the other do_A(...) tasks have completed their execution.


Don't worry here. Even if do_A(2)..do_A(4) generate a lot more tasks, those tasks will go into the local task pools of threads 2..4. At the same time, T1 will start executing the tbb::parallel_for() inside do_B(1), whose tasks will go into the local task pool of T1 and, since the other threads are busy (and thus do not steal), will be executed by the current thread (T1).

LIFO order of local execution guarantees that tasks created by tbb::parallel_for() inside do_B(1) will be executed before T1 attempts to execute do_A(2), leaving do_A(2) as the first target to be stolen, and thus ensuring progress of the do_A(1) body.

The bottom line is that as long as there is sufficient parallel slack (many tasks in each thread's task pool), each task is executed most efficiently (without its children being stolen with all associated overhead).

Dmitry_Vyukov
Valued Contributor I
One thing I should mention: I'm assuming that do_B(1) makes a call to parallel_for(...) as a way of capitalizing on its internal parallelism, and that the additional tasks generated by that parallel_for call are what can remain unscheduled for a very long time. The net result would be that do_A(1) might not execute until after all of the other do_A(...) tasks have completed their execution.

In my application, I want do_A(1) to complete its execution as early as possible, followed by the completion of do_A(2) as soon as possible, etc.

I see. If it weren't for that nested parallel_for(), what I described would do.

However, with parallel_for(), only one thread will work on B1 while the other threads work on A2, A3, A4.

What you need is not priorities. What you need is to not spawn A2 until A1/B1/C1 have completed. You do not even need tasks for that; you can just do something along the lines of:

for (i = 0; i != N; i += 1)
{
    ...
    parallel_for(...);
    ...
}

Then all threads will participate in A1/B1/C1, and when that finishes, all threads switch to A2/B2/C2.

If parallel_for() is executed over the same range for each i, task parallelism is not optimal, because you need to redistribute work across threads on every iteration. A phased approach would be more efficient here, because you can distribute iterations across threads once:

#pragma omp parallel
{
    calculate my subrange

    for (i = 0; i != N; i += 1)
    {
        for over my subrange

        #pragma omp barrier
    }
}
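
A compilable form of that phased sketch, assuming a hypothetical per-element function process(step, elem), might be:

#include <omp.h>

void process(int step, int elem); // user-supplied per-element work

void run_phased(int n_steps, int n_elems) {
    #pragma omp parallel
    {
        // distribute the element range across threads once
        int nt = omp_get_num_threads();
        int id = omp_get_thread_num();
        int lo = (int)((long long)n_elems * id / nt);
        int hi = (int)((long long)n_elems * (id + 1) / nt);

        for (int i = 0; i != n_steps; i += 1) {
            for (int e = lo; e < hi; ++e)
                process(i, e);
            #pragma omp barrier   // all threads finish step i before step i+1
        }
    }
}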

jimdempseyatthecove
Honored Contributor III

>>
#pragma omp parallel
{
    calculate my subrange

    for (i = 0; i != N; i += 1)
    {
        for over my subrange

        #pragma omp barrier
    }
}
<<

The above assumes i=0:N expands to an interleaving of your total range:

#pragma omp parallel
{
    int me = omp_get_thread_num();
    int stride = omp_get_num_threads();
    for (int i = me; i < N; i += stride)
    {
        do A
        do B
        do C
        (optional barrier)
    }
}
===============
Alternate method, in OpenMP pseudocode:

volatile char* Adone; // N flags, e.g. placed next to A[] or alloca'd
...
memset((void*)Adone, 0, N);
#pragma omp parallel
{
    int me = omp_get_thread_num();
    int stride = omp_get_num_threads();
    for (int i = me; i < N; i += stride)
    {
        if (i > 0)
        {
            while (Adone[i-1] == 0)
                _mm_pause();
        }
        do A
        Adone[i] = 1; // mark A(i) complete
        do B
        do C
        (no barrier here)
    }
}

The above is for when A(n) must complete prior to starting A(n+1).
When A(n), A(n+1), ..., A(n+stride-1) can run concurrently, take out the Adone flags and the _mm_pause() loop.
Or consider whether an Abegun arrangement might work.
In TBB you can construct similar thread teams with an artificial thread number and number of threads for the team.

Jim Dempsey
