We have a custom pipeline whose implementation is based on tbb::flow::graph:
I noticed that tbb::parallel_invoke running inside stage 1 steals and executes a task spawned for an input item of stage 2.
Here's the call stack (I removed all irrelevant entries):
my.dll!stage2::execute()
tbb_debug.dll!tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all(tbb::task & parent, tbb::task * child)
tbb_debug.dll!tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::wait_for_all(tbb::task & parent, tbb::task * child)
my.dll!tbb::task::spawn_and_wait_for_all(tbb::task & child)
my.dll!tbb::internal::parallel_invoke_helper::run_and_finish<fn0>(const fn0 & f0)
my.dll!tbb::parallel_invoke<fn0,fn1>(const fn0 & f0, const fn1 & f1, tbb::task_group_context & context)
my.dll!tbb::parallel_invoke<fn0,fn1>(const fn0 & f0, const fn1 & f1)
my.dll!stage1::execute()
tbb_debug.dll!tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all(tbb::task & parent, tbb::task * child)
tbb_debug.dll!tbb::internal::arena::process(tbb::internal::generic_scheduler & s)
tbb_debug.dll!tbb::internal::market::process(rml::job & j)
tbb_debug.dll!tbb::internal::rml::private_worker::run()
tbb_debug.dll!tbb::internal::rml::private_worker::thread_routine(void * arg)
msvcr120d.dll!_callthreadstartex()
msvcr120d.dll!_threadstartex(void * ptd)
kernel32.dll!BaseThreadInitThunk()
ntdll.dll!RtlUserThreadStart()
Effectively, tbb::parallel_invoke in our case executes tasks that do not belong to its own 'forest', i.e. tasks that are siblings of its parent.
Is this behavior a feature? And is it possible to overcome it?
Hi Evgeniy,
TBB threads are allowed to execute tasks from the outer level when in nested parallelism. Is your question a theoretical one, or did you face some issues related to this behavior? To overcome it, you can use the task_arena interface to isolate nested-level parallelism. However, task_arena can lead to some inefficiency due to underutilization and the feature's overheads.
To expand on (only) the first sentence of what Alex wrote, it is a good thing for throughput performance that threads will favour such tasks for stealing, i.e., it doesn't just happen by accident and is definitely a feature.
Greetings.
Yeah, I guess it's a good approach if all you care about is the throughput.
Though we experience some difficulties with tbb::parallel_invoke's behavior:
Will the following sample code resolve the issue? And will the overhead be significant for functions taking more than 1ms?
template<typename... Functions>
inline void ParallelInvoke( Functions&&... functions )
{
    tbb::task_arena arena;
    arena.execute( [&]() {
        tbb::parallel_invoke( std::forward<Functions>( functions )... );
    } );
}
And won't it clash with the outer tbb::task_arena which task would be calling ParallelInvoke?
Your code is correct. However, creating a task_arena object triggers quite expensive internal activities (especially on big machines). Theoretically, it should not be noticeable for functions taking more than 1 ms. However, if possible, I'd recommend creating one task_arena object and using it for all nested tasks. It is OK to call the execute method on the same task_arena object from different threads. But note that all tasks inside a task_arena are shared between all threads working in that arena.
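A minimal sketch of the single shared arena that Alex recommends (the ParallelInvokeShared name and the function-local static arena are illustrative assumptions, not part of TBB):

#include <tbb/task_arena.h>
#include <tbb/parallel_invoke.h>
#include <utility>

// Hypothetical helper: one task_arena shared by every nested invocation,
// instead of constructing a fresh arena on each call.
template<typename... Functions>
inline void ParallelInvokeShared( Functions&&... functions )
{
    static tbb::task_arena arena;   // created once, reused by all callers and threads
    arena.execute( [&]() {
        tbb::parallel_invoke( std::forward<Functions>( functions )... );
    } );
}

The arena construction cost is then paid once; execute may be called concurrently from several threads, but, as noted above, all tasks submitted this way share the same arena.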
I am not sure what you mean by clashing. Additional task arenas recall some TBB workers from already existing arenas, so task arenas do affect each other in that way.
Thanks Alex,
AFAIU the single tbb::task_arena approach won't help much in case of deeper nesting (e.g. if a function supplied to ParallelInvoke in turn calls another ParallelInvoke), since 2nd-level tasks would not be isolated from the 1st level, am I right?
Rephrasing the 'clash' question: do nested tbb::task_arenas work independently of the outer arena, and are there any limitations on the use cases (as with the nested parallel algorithms)?
And do other parallel primitives (e.g. tbb::parallel_for) have the same behavior?
From your description, it appears as if you have a two-dimensional tasking problem that you wish to implement using parallel_pipeline:
filter-in   filter-1   filter-2   ...   filter-n   filter-out
task-in     task-1.0   task-2.0   ...   task-n.0   task-out
            task-1.1   task-2.1   ...   task-n.1
            ...        ...        ...   ...
            task-1.I   task-2.J   ...   task-n.K
Where the tasks of each filter differ, and the number of tasks required for each filter differs as well.
A technique you could consider is to first view the tasking requirements as a 2D sparse matrix of tasks.
Then determine a maximum (or representative) number of tasks performed in each of your current filters (this could be the thread count on the system, some factor of the thread count, or some other determination).
Then construct a different parallel_pipeline that is the transposition of the above diagrammed tasks.
Assuming filter 2's J is the maximum number of tasks
filter-in   filter-1   filter-2   ...   filter-J+1   filter-out
task-in     task-1.0   task-1.1   ...   task-1.J     task-out
            task-2.0   task-2.1   ...   task-2.J
            ...        ...        ...   ...
            task-n.0   task-n.1   ...   task-n.J
Note that the task lists in some of the later filters may have empty functors at the end; in that case they simply return the token.
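A rough sketch of what such a transposed pipeline could look like (the Token type, the functor-list representation, and the token limit are illustrative assumptions, not the only way to do it):

#include <tbb/pipeline.h>
#include <functional>
#include <vector>
#include <cstddef>

// tasks[j][i] is the functor that filter j runs for token/row i; an empty
// std::function means "no work at this position, just pass the token along".
struct Token { std::size_t row; };

void run_transposed_pipeline(
        std::size_t nRows,
        const std::vector< std::vector< std::function<void(Token&)> > >& tasks )
{
    std::size_t next = 0;

    // Input filter: emits one token per row of the transposed task matrix.
    tbb::filter_t<void, Token*> chain(
        tbb::filter::serial_in_order,
        [&]( tbb::flow_control& fc ) -> Token* {
            if ( next == nRows ) { fc.stop(); return nullptr; }
            return new Token{ next++ };
        } );

    // One parallel filter per column of the transposed matrix.
    for ( std::size_t j = 0; j < tasks.size(); ++j ) {
        chain = chain & tbb::filter_t<Token*, Token*>(
            tbb::filter::parallel,
            [&tasks, j]( Token* t ) -> Token* {
                const std::vector< std::function<void(Token&)> >& column = tasks[j];
                if ( t->row < column.size() && column[t->row] )
                    column[t->row]( *t );   // real work for this row
                return t;                   // empty slot: just forward the token
            } );
    }

    // Output filter: consumes the token.
    tbb::filter_t<Token*, void> output(
        tbb::filter::serial_in_order,
        []( Token* t ) { delete t; } );

    tbb::parallel_pipeline( /* max live tokens */ 8, chain & output );
}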
Jim Dempsey
Hi Evgeniy,
You are right, nested parallel_invoke calls are also allowed to steal tasks from an outer-level parallel_invoke. Any blocking algorithm (e.g. parallel_for, parallel_reduce, and so on) can steal outer-level tasks when used at a nested level.
task_arena cannot be nested; only task_arena::execute method calls can be nested. The main restriction is that nested indirect executes into the same arena are prohibited, i.e. a.execute( []{ b.execute( []{ a.execute( []{} ); } ); } ); may lead to a deadlock. However, a.execute( []{ b.execute( []{ c.execute( []{} ); } ); } ); is an allowed use case.
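Spelled out as a small compilable sketch (the arena names a, b, and c are just illustrative):

#include <tbb/task_arena.h>

tbb::task_arena a, b, c;

void allowed()
{
    // Each nested execute enters a different arena: fine.
    a.execute( []{ b.execute( []{ c.execute( []{ /* work */ } ); } ); } );
}

void prohibited()
{
    // Re-entering arena a from inside its own execute may deadlock:
    // a.execute( []{ b.execute( []{ a.execute( []{ /* work */ } ); } ); } );
}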
Perhaps you may want to consider using a continuation-passing technique instead of nested parallel_invoke calls; however, it requires some experience and may lead to many code changes. Using this technique together with task_arena can help you achieve the required behavior.
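For reference, a minimal sketch of the continuation-passing style in the classic tbb::task API (the JoinTask/Work names and the dummy recursion are illustrative only, not your pipeline stages):

#include <tbb/task.h>

// Empty continuation: it becomes the parent of the children, so the task
// that created it never blocks waiting for them.
struct JoinTask : public tbb::task {
    tbb::task* execute() { return nullptr; }   // nothing to do at the join point
};

struct Work : public tbb::task {
    int depth;
    explicit Work( int d ) : depth( d ) {}
    tbb::task* execute() {
        if ( depth == 0 ) return nullptr;       // leaf: real work would go here
        JoinTask& c = *new( allocate_continuation() ) JoinTask;
        c.set_ref_count( 2 );
        tbb::task& left  = *new( c.allocate_child() ) Work( depth - 1 );
        tbb::task& right = *new( c.allocate_child() ) Work( depth - 1 );
        spawn( left );
        return &right;                          // scheduler bypass: run right next
    }
};

// Usage:
//   tbb::task::spawn_root_and_wait( *new( tbb::task::allocate_root() ) Work( 3 ) );

The point is that no task body blocks in wait_for_all, which is exactly the place where the unrelated stage-2 task was picked up in the call stack above.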
Jim, Alex, thanks for the suggestions!