I'm starting to parallelize a big application, and I'm new to TBB.
I'm trying to figure out how granular my parallelism should be, which involves weighing the costs of creating/scheduling tasks vs. the benefits of their potentially parallel execution.
Are there any guidelines as to how expensive task creation / scheduling is in TBB?
That depends on:
a) which parallel_xxx you use
b) whether the rest of the task pool (every thread but yours), or what portion of it, is completely idle
c) whether the rest of the task pool, or what portion of it, is looking for work
d) whether the rest of the task pool, or what portion of it, is working on other tasks
e) whether the other threads are stealing tasks or running directly scheduled tasks
f) which operating system you are running on
g) the number of logical processors
h) the functional level within your program where you place your parallelization
Your best bet is to instrument your code and run tests. Working from the outer layers of the application inward will generally produce better results. At some point you may reach saturation, and past that point further parallelization is futile (until there is a platform change). Also, the code you insert to instrument performance can be reused to add an autotune capability.
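As a rough illustration of the instrumentation idea, here is a minimal sketch of a timing helper in plain standard C++. The name best_seconds and the choice of keeping the minimum over a few repeats are my own assumptions, not anything TBB provides; you would wrap whatever region you are considering for parallelization in the callable.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper (not part of TBB): runs `work` `repeats` times and
// returns the smallest observed wall-clock time in seconds. Taking the
// minimum is one common way to reduce noise from other processes.
template <typename F>
double best_seconds(F&& work, int repeats = 5) {
    assert(repeats > 0);
    using clock = std::chrono::steady_clock;  // monotonic clock
    double best = -1.0;
    for (int i = 0; i < repeats; ++i) {
        const auto t0 = clock::now();
        work();
        const auto t1 = clock::now();
        const double s = std::chrono::duration<double>(t1 - t0).count();
        if (best < 0.0 || s < best) best = s;
    }
    return best;
}
```

Timing the serial and parallel versions of the same region with a helper like this, layer by layer from the outside in, gives you the numbers to decide where to stop. The same measurements could later feed an autotuner.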
That said, a task size of 1,000 clocks or less is impractically small for TBB, and a task size of more than 1,000,000 clocks is likely more than big enough. That leaves only three orders of magnitude to check with experiments, which is not difficult if you use a logarithmic scale (e.g. for a first cut, try successive grain sizes that differ by a factor of 4).
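The logarithmic sweep could be sketched roughly as follows. Both function names are hypothetical, and the sketch assumes you supply a callable that returns the measured run time for a given grain size. Note that the candidates are just numbers: if you express them in loop iterations rather than clocks, you need to scale the bounds by your own estimate of clocks per iteration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns lo, lo*factor, lo*factor^2, ... without
// exceeding hi. With lo = 1000, hi = 1000000, factor = 4 this yields
// 1000, 4000, 16000, 64000, 256000.
std::vector<std::size_t> grain_candidates(std::size_t lo, std::size_t hi,
                                          std::size_t factor = 4) {
    assert(lo > 0 && lo <= hi && factor > 1);
    std::vector<std::size_t> out;
    for (std::size_t g = lo; g <= hi; g *= factor) {
        out.push_back(g);
        if (g > hi / factor) break;  // next step would exceed hi (or overflow)
    }
    return out;
}

// Hypothetical helper: calls time_for(grain) once per candidate and returns
// the grain with the smallest reported time. time_for is assumed to run the
// workload with that grain size and return elapsed seconds.
template <typename TimeFn>
std::size_t pick_best_grain(const std::vector<std::size_t>& grains,
                            TimeFn&& time_for) {
    assert(!grains.empty());
    std::size_t best = grains.front();
    double best_t = time_for(best);
    for (std::size_t i = 1; i < grains.size(); ++i) {
        const double t = time_for(grains[i]);
        if (t < best_t) {
            best_t = t;
            best = grains[i];
        }
    }
    return best;
}
```

In a TBB program, time_for would typically time a call such as tbb::parallel_for(tbb::blocked_range<size_t>(0, n, grain), body, tbb::simple_partitioner()), since simple_partitioner keeps splitting down to the grain size you pass. Once the first cut narrows the range, you can repeat the sweep with a smaller factor around the best candidate.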