Question about body object creation in recursive patterns
According to the reference manual, algorithms such as parallel_for that act recursively on an iteration space create new body objects for every distinct subrange. Since all these bodies share the same code (i.e., the operator() function) and differ only in their input parameters, I was wondering whether TBB optimizes their creation in a way that takes code reuse into account and minimizes redundant copy operations. I guess that when tasks are small, the overhead of creating new objects might also be small. However, what happens when the parallel loop body is fairly large, so that repeatedly copying it risks introducing notable overhead?
The body object of parallel_for was intended to be a closure that captures the context in which a parallel loop should execute, for use inside the loop. That is, think of the body as an instance of a lambda function in C++0x, or as a function that contains the whole scope of a single loop iteration and obtains all necessary data via parameters.
If body objects are big enough that copying them is a worry, maybe it's time to rethink the design. Would you pass that much data/context as parameters into a function? If not, don't do it for the parallel_for body either.
You can also use parallel_reduce, which performs lazy copying of bodies. For parallel_reduce this is important because its body also serves as an accumulator of partial "sums". But if necessary, it can be used without doing any reduction: implement the join() method so that it does nothing.
The usual trade-off between passing arguments by value or by reference also applies to a Body, except that you should think of it as many function calls when weighing that trade-off: just pass a pointer to that big lump of data, assuming it is thread-safe, of course.
Many thanks to both of you for replying. I think I have not stressed enough that the overhead I am talking about would come from copying code, not data. Parameter passing is not an issue for me, since I assume that most of the data processed by the loop can be heap-allocated global objects.
A typical scenario would be, e.g., an application whose code comprises large for-loops parallelized at the outermost level, while the data may be totally decoupled from the code and defined elsewhere. This is a common case in large-scale, array-based scientific codes. In that case, parallel_for would only pass the subrange bounds in each operator() invocation, but would of course need to copy the operator() code every time. [Apart from copying the same code again and again, this could have other side effects, such as thrashing the instruction cache.] Of course, an effective alternative would be to enclose the whole loop body in a separate global function and call it from inside operator(), but I wonder whether it would be better for TBB to provide some way to decouple code from task state.