I have been experimenting with the "wavefront" parallel design pattern, as presented in section 11.6 of the Tutorial. The task graph that my program builds and evaluates is that of Figure 11. I am trying to emulate a real computation by having each task receive data from its left and upper neighbours, execute a quite small workload (100 reg-to-reg FP operations), and then notify its right and lower neighbours by means of decreasing their ref. counters.
I have implemented the following versions of this pattern, all based on the low-level task api:
a) "baseline": this is exactly the one demonstrated in section 11.6, which builds statically the task graph and then evaluates it dynamically
b) "dynamic": this version dynamically allocates (as roots) and spawns new tasks, each time their refcount becomes zero. I had to use atomic counters instead of tbb::task's native ref_count's, because, apparently, the task objects have not been created by that time.
c) "dynamic+reuse": the same as 'b', and additionally tasks are recycled using recycle_as_continuation()
On a 8-core Nehalem system I get the following times at 8 threads, when running a 10000 x 1000 graph (i.e. total of 10M tasks):
a -> 2.42 sec
b -> 0.89 sec
c -> 0.26 sec
I verified that ~75% of the time of (a) is spent on allocation of tasks, which happens sequentially. This allocation occurs concurrently in cases (b) and (c), and additionally seems to scale pretty well. So I'm guessing that the library uses internally scalable allocators on each of these allocate_* task creation calls. Is this the case?
"I would assume the library uses scalable_allocator, if it can be found." Obviously TBB eats its own cooking, as the saying goes, and a scalable allocator is essential to scalable task-based parallelism, which internally uses and explicitly encourages "recursive parallelism", which means that the very division of work is parallelised recursively.
OK, I've additionally implemented a version based on flow graphs, and another one based on parallel_do with feeder (as in the Design Patterns doc). I have tested 3 different task granularities, i.e. 100, 1K and 10K FP ops. Here are the results (all for a 10000x1000 task graph):