I'm working with a large Linux program that uses TBB in a variety of places throughout the code. When I run "perf top", the top-most function it displays is receive_or_steal_task. That suggests to me that my TBB tasks are too small, which is causing the program to spend too much time looking for new tasks to run. Presuming that conclusion is correct, the hard question: How do I determine, out of the variety of TBB calls in my program, which ones are the too-small tasks?
That's a big presumption. How do you know the program doesn't simply lack enough parallel slack to keep the Intel TBB worker threads busy? An idle worker looks for work to steal.
If I remember correctly, an idle worker thread that cannot find any work to do suspends, so that should not be the case.
Regarding the original question: How do you create those tasks? Do you use TBB algorithms (which and how?) or do you create and spawn the tasks yourself?
Robert, your point is well taken. Full core utilization is also a problem I'm pursuing; possibly the only one, eh?
Jiri, the program mostly uses parallel_for to add tasks, with a smattering of parallel_reduce and parallel_invoke. There might be one or two instances of tasks being generated by hand, but I've converted at least one of those to parallel_invoke when encountering it.
If you are using auto_partitioner or affinity_partitioner, then the parallel_for and parallel_reduce probably do not produce "small tasks". They work on some assumptions that may not be true for your code, but I would personally start looking at other parts of the program first.
When receive_or_steal_task is the top hot spot, it actually indicates load imbalance: the threads have no local work and spend their time trying to steal some. In other words, the presumption is probably wrong; the tasks are too coarse, not too small.
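To make the imbalance argument concrete, here is a rough sketch of a toy model, not TBB code. It hands each task to whichever worker becomes free first and reports the fraction of worker time left idle. The function name idle_fraction and the greedy assignment are assumptions made for illustration; the real scheduler steals work dynamically and will not behave exactly like this.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical model, not part of TBB: assign each task, in order, to the
// worker that becomes free first, then return the fraction of total worker
// time (workers * makespan) that was spent without any task to run.
double idle_fraction(const std::vector<double>& task_costs, std::size_t workers) {
    assert(workers > 0);
    std::vector<double> busy_until(workers, 0.0);
    for (double cost : task_costs) {
        // Pick the worker that finishes its current load earliest.
        auto next_free = std::min_element(busy_until.begin(), busy_until.end());
        *next_free += cost;
    }
    // The run ends when the most loaded worker finishes.
    double makespan = *std::max_element(busy_until.begin(), busy_until.end());
    double total_work = std::accumulate(task_costs.begin(), task_costs.end(), 0.0);
    if (makespan == 0.0) return 0.0;  // no work at all, nothing to measure
    return 1.0 - total_work / (makespan * static_cast<double>(workers));
}
```

In this model, five equal tasks on four workers, idle_fraction(std::vector<double>(5, 1.0), 4), leave 37.5% of the worker time idle, while the same total work cut into twenty tasks, idle_fraction(std::vector<double>(20, 0.25), 4), leaves none. In a real TBB program that idle time would be spent looking for work to steal.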
local_wait_for_all, by contrast, is where scheduler overhead would show up if a thread were consuming local tasks that are too small (the simple_partitioner case).
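To get a feel for how small the tasks become in the simple_partitioner case, the sketch below mimics how a blocked_range is split: it is halved as long as its size exceeds the grainsize, which defaults to 1. This is plain C++ rather than TBB, and count_leaf_ranges is a hypothetical helper that only counts the leaf subranges, i.e. the number of times the loop body would be invoked.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not part of TBB: counts the leaf subranges obtained by
// halving [begin, end) until a piece is no longer divisible, meaning its size
// is less than or equal to grainsize.
std::size_t count_leaf_ranges(std::size_t begin, std::size_t end, std::size_t grainsize) {
    assert(grainsize > 0 && begin <= end);
    std::size_t size = end - begin;
    if (size == 0) return 0;                 // empty range: the body never runs
    if (size <= grainsize) return 1;         // not divisible: one body invocation
    std::size_t middle = begin + size / 2;   // split in the middle
    return count_leaf_ranges(begin, middle, grainsize)
         + count_leaf_ranges(middle, end, grainsize);
}
```

For a range of 1,000,000 iterations, a grainsize of 1 gives 1,000,000 leaf ranges, while a grainsize of 10,000 gives 128 leaves of about 7,800 iterations each. If a loop body is cheap and the loop uses simple_partitioner with the default grainsize, that loop is a good candidate for the "too small tasks" suspicion.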