We have a custom pipeline whose implementation is based on tbb::flow::graph:
I noticed that tbb::parallel_invoke running inside stage 1 steals and executes a task spawned for an input item of stage 2.
Here's the call stack (I removed all irrelevant entries):
my.dll!stage2::execute()
tbb_debug.dll!tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all(tbb::task & parent, tbb::task * child)
tbb_debug.dll!tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::wait_for_all(tbb::task & parent, tbb::task * child)
my.dll!tbb::task::spawn_and_wait_for_all(tbb::task & child)
my.dll!tbb::internal::parallel_invoke_helper::run_and_finish<fn0>(const fn0 & f0)
my.dll!tbb::parallel_invoke<fn0,fn1>(const fn0 & f0, const fn1 & f1, tbb::task_group_context & context)
my.dll!tbb::parallel_invoke<fn0,fn1>(const fn0 & f0, const fn1 & f1)
my.dll!stage1::execute()
tbb_debug.dll!tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all(tbb::task & parent, tbb::task * child)
tbb_debug.dll!tbb::internal::arena::process(tbb::internal::generic_scheduler & s)
tbb_debug.dll!tbb::internal::market::process(rml::job & j)
tbb_debug.dll!tbb::internal::rml::private_worker::run()
tbb_debug.dll!tbb::internal::rml::private_worker::thread_routine(void * arg)
msvcr120d.dll!_callthreadstartex()
msvcr120d.dll!_threadstartex(void * ptd)
kernel32.dll!BaseThreadInitThunk()
ntdll.dll!RtlUserThreadStart()
Effectively, tbb::parallel_invoke in our case executes tasks that do not belong to its own 'forest', i.e. tasks that are siblings of its parent.
Is this behavior a feature? And is it possible to overcome it?
Hi Evgeniy,
TBB threads are allowed to execute tasks from the outer level when in nested parallelism. Is your question a theoretical one, or did you face some issues related to this behavior? To overcome it, you can use the task_arena interface to isolate nested-level parallelism. However, task_arena can lead to some inefficiency due to underutilization and the feature's overheads.
To expand on (only) the first sentence of what Alex wrote, it is a good thing for throughput performance that threads will favour such tasks for stealing, i.e., it doesn't just happen by accident and is definitely a feature.
Greetings.
Yeah, I guess it's a good approach if all you care about is the throughput.
Though we experience some difficulties with tbb::parallel_invoke's behavior:
Will the following sample code resolve the issue? And will the overhead be significant for functions taking more than 1ms?
template<typename... Functions>
inline void ParallelInvoke( Functions&&... functions )
{
    tbb::task_arena arena;
    arena.execute( [&]() {
        tbb::parallel_invoke( std::forward<Functions>( functions )... );
    } );
}
And won't it clash with the outer tbb::task_arena which task would be calling ParallelInvoke?
Your code is correct. However, creating a task_arena object triggers quite expensive internal activities (especially on big machines). Theoretically, it should not be noticeable for functions taking more than 1 ms. However, if possible, I'd recommend creating one task_arena object and using it for all nested tasks. It is OK to call the execute method on the same task_arena object from different threads. But note that all tasks inside a task_arena are shared between all threads working in that arena.
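A minimal sketch of the single shared arena that Alex recommends (the ParallelInvokeShared name and the function-local static arena are illustrative assumptions, not part of TBB):

#include <tbb/task_arena.h>
#include <tbb/parallel_invoke.h>
#include <utility>

// Hypothetical helper: one task_arena shared by every nested invocation,
// instead of constructing a fresh arena on each call.
template<typename... Functions>
inline void ParallelInvokeShared( Functions&&... functions )
{
    static tbb::task_arena arena;   // created once, reused by all callers and threads
    arena.execute( [&]() {
        tbb::parallel_invoke( std::forward<Functions>( functions )... );
    } );
}

The arena construction cost is then paid once; execute may be called concurrently from several threads, but, as noted above, all tasks submitted this way share the same arena.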
I am not sure what you mean by clashing. Additional task arenas recall some TBB workers from already existing arenas, so task arenas do affect each other in that way.
Thanks Alex,
AFAIU the single tbb::task_arena approach won't help much in case of deeper nesting (e.g. if a function supplied to ParallelInvoke in turn calls another ParallelInvoke), since 2nd-level tasks would not be isolated from the 1st level, am I right?
Rephrasing the 'clash' question: do nested tbb::task_arenas work independently of the outer arena, and are there any limitations on the use cases (as with the nested parallel algorithms)?
And do other parallel primitives (e.g. tbb::parallel_for) have the same behavior?
From your description, it appears as if you have a two-dimensional tasking problem that you wish to implement using parallel_pipeline:
filter-in   filter-1   filter-2   ...   filter-n   filter-out
task-in     task-1.0   task-2.0   ...   task-n.0   task-out
            task-1.1   task-2.1   ...   task-n.1
            ...        ...        ...   ...
            task-1.I   task-2.J   ...   task-n.K
Where the tasks of each filter differ, and the number of tasks required for each filter differs as well.
A technique you could consider is to first view the tasking requirements as a 2D sparse matrix of tasks.
Then determine a maximum (or representative) number of tasks performed in each of your current filters (this could be the thread count on the system, some factor of the thread count, or some other determination).
Then construct a different parallel_pipeline that is the transposition of the above diagrammed tasks.
Assuming filter 2's J is the maximum number of tasks
filter-in   filter-1   filter-2   ...   filter-J+1   filter-out
task-in     task-1.0   task-1.1   ...   task-1.J     task-out
            task-2.0   task-2.1   ...   task-2.J
            ...        ...        ...   ...
            task-n.0   task-n.1   ...   task-n.J
Note that the task lists in some of the later filters may have empty functors at the end; in that case they simply return the token.
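A rough sketch of what such a transposed pipeline could look like (the Token type, the functor-list representation, and the token limit are illustrative assumptions, not the only way to do it):

#include <tbb/pipeline.h>
#include <functional>
#include <vector>
#include <cstddef>

// tasks[j][i] is the functor that filter j runs for token/row i; an empty
// std::function means "no work at this position, just pass the token along".
struct Token { std::size_t row; };

void run_transposed_pipeline(
        std::size_t nRows,
        const std::vector< std::vector< std::function<void(Token&)> > >& tasks )
{
    std::size_t next = 0;

    // Input filter: emits one token per row of the transposed task matrix.
    tbb::filter_t<void, Token*> chain(
        tbb::filter::serial_in_order,
        [&]( tbb::flow_control& fc ) -> Token* {
            if ( next == nRows ) { fc.stop(); return nullptr; }
            return new Token{ next++ };
        } );

    // One parallel filter per column of the transposed matrix.
    for ( std::size_t j = 0; j < tasks.size(); ++j ) {
        chain = chain & tbb::filter_t<Token*, Token*>(
            tbb::filter::parallel,
            [&tasks, j]( Token* t ) -> Token* {
                const std::vector< std::function<void(Token&)> >& column = tasks[j];
                if ( t->row < column.size() && column[t->row] )
                    column[t->row]( *t );   // real work for this row
                return t;                   // empty slot: just forward the token
            } );
    }

    // Output filter: consumes the token.
    tbb::filter_t<Token*, void> output(
        tbb::filter::serial_in_order,
        []( Token* t ) { delete t; } );

    tbb::parallel_pipeline( /* max live tokens */ 8, chain & output );
}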
Jim Dempsey
Hi Evgeniy,
You are right, nested parallel_invoke calls are also allowed to steal tasks from an outer-level parallel_invoke. Any blocking algorithm (e.g. parallel_for, parallel_reduce, and so on) can steal outer-level tasks when used at a nested level.
task_arena cannot be nested; only task_arena::execute method calls can be nested. The main restriction is that nested indirect executes into the same arena are prohibited, i.e. a.execute( []{ b.execute( []{ a.execute( []{} ); } ); } ); may lead to a deadlock. However, a.execute( []{ b.execute( []{ c.execute( []{} ); } ); } ); is an allowed use case.
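Spelled out as a small compilable sketch (the arena names a, b, and c are just illustrative):

#include <tbb/task_arena.h>

tbb::task_arena a, b, c;

void allowed()
{
    // Each nested execute enters a different arena: fine.
    a.execute( []{ b.execute( []{ c.execute( []{ /* work */ } ); } ); } );
}

void prohibited()
{
    // Re-entering arena a from inside its own execute may deadlock:
    // a.execute( []{ b.execute( []{ a.execute( []{ /* work */ } ); } ); } );
}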
Perhaps you may want to consider using a continuation-passing technique instead of nested parallel_invoke calls; however, it requires some experience and may lead to many code changes. Using this technique together with task_arena can help you achieve the required behavior.
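For reference, a minimal sketch of the continuation-passing style in the classic tbb::task API (the JoinTask/Work names and the dummy recursion are illustrative only, not your pipeline stages):

#include <tbb/task.h>

// Empty continuation: it becomes the parent of the children, so the task
// that created it never blocks waiting for them.
struct JoinTask : public tbb::task {
    tbb::task* execute() { return nullptr; }   // nothing to do at the join point
};

struct Work : public tbb::task {
    int depth;
    explicit Work( int d ) : depth( d ) {}
    tbb::task* execute() {
        if ( depth == 0 ) return nullptr;       // leaf: real work would go here
        JoinTask& c = *new( allocate_continuation() ) JoinTask;
        c.set_ref_count( 2 );
        tbb::task& left  = *new( c.allocate_child() ) Work( depth - 1 );
        tbb::task& right = *new( c.allocate_child() ) Work( depth - 1 );
        spawn( left );
        return &right;                          // scheduler bypass: run right next
    }
};

// Usage:
//   tbb::task::spawn_root_and_wait( *new( tbb::task::allocate_root() ) Work( 3 ) );

The point is that no task body blocks in wait_for_all, which is exactly the place where the unrelated stage-2 task was picked up in the call stack above.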
Jim, Alex, thanks for the suggestions!