Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

How expensive is tsak creation / scheduling?


I'm starting to parallelize a big application, and I'm new to TBB.

I'm trying to figure out how granular my parallelism should be, which involves weighing the costs of creating/scheduling tasks vs. the benefits of their potentially parallel execution.

Are there any guidelines as to how expensive task creation / scheduling is in TBB?

0 Kudos
3 Replies
As a small example. I am calculating Fibonaccis in a loop:
unsigned long fib ( unsigned long i )
if ( i < 2 ) return i;
return fib ( i - 1 ) + fib ( i - 2 );
for fib(5) you can better loop serial, because task creation is too expensive. for fib(10) you get better results for creating Tasks that get scheduled and each Task calculates fib(10). the longer each task takes, the more benefit you have of course.
That was for task scheduler.
For parallel loops, the documentation says:
"Typically a loop needs to take at least a million clock cycles for parallel_for to
improve its performance. For example, a loop that takes at least 500 microseconds
on a 2 GHz processor might benefit from parallel_for."
But this again depends on the loop body. if you have a large operation in it, parallel_for benefits from lower iterations.
0 Kudos
Honored Contributor III
I do not think there is one consistent right answer to your question. The overhead cost will very depending on

a) which parallel_xxx you use
b) if the task pool (sans your thread) or to the degree of portions thereof is completely idle
c) if the task pool (sans your thread) or to the degree of portions thereof is looking for work
d) if the task pool (sans your thread) or to the degree of portions thereof is working on other tasks
e) if the other threads are in task stealing or directly scheduled tasks
f) which operating system you are running on
g) number of logical processors
h) the functional level within your program where you place your parallization.

Your best bet is to insturment your code and run tests. Working from the outer layers of the application inwards will generally produce better results. At some point you will/may reach saturation and after this point further parallization is futile (until there is a platform change). Also, the code you insert to insturment performance can be reused to add an autotune capability.

Jim Dempsey
0 Kudos
General rule of thumb is to leave the grainsize as a tuneable knob during development and run experiments. The complexity of modern machines makes performance predictions difficult.

That said, a task size of a 1000 clocks or less is impractically small for TBB, and a task size of greater than a 1,000,000 clocks is likely more than big enough. So that leaves only three orders of magnitude to check with experiments, which is not difficult if you use a logarithmic scale (e.g. for a first cut, try successive grain sizes that differ by a factor of 4).
0 Kudos