If I've a nested for loop in parallel_for. Will TBB parallelize just outer for loop or both loop (including innrer for loop)
[&](const tbb::blocked_range<size_t>& r)
for(size_t i = std::begin(r); i != std::end(r); ++i)
tbb::blocked_range<size_t> r2(0, list.size();
for(size_t i = std::begin(r2); i != std::end(r2); ++i)
//Do some work
TBB does not generate code the same way OpenMP does. (I've read your other posts and it seems to be where you are coming from)
So, what you write is what is executed. Nothing more.
Here, you call a parallel_for function to distribute some work over multiple cores. Inner loops are inside that work body. Nothing special will be done to it in regards to parallelism. Effectively, only the outer loop (inside parallel_for method) is parallelized. The other two for(...) will be threated as sequential code but will be run in parallel over multiple sub-partitions of the initial blocked_range<>
Hope it makes sense.
We should also point out that Intel Threading Building Blocks also provides blocked_range2d and blocked_range3d that will accommodate double and triple-nested loops and parallelize all loop levels. Those expansions are explicit and require a bit more syntax to accomplish than it would take in, say, in an OpenMP loop construct to just add a collapse(n) directive to expand parallel chunking to an inner loop.
Your nested construct would be effective under the following subset of conditions:
a) your outer loop partitions into less than number of available threads
b) the outer loop is unable to partition the work relatively equal to the available threads
c) the outer loop iteration space is partitioned relatively equally however the computational requirements for each partition differ
Generally speaking if the outer loop has lots to do then avoid use of additional inner parallel constructs.
Robert has a good suggestion if applicable to your situation.
I'm wondering whether it wouldn't always be possible, in the case of embarrassing parallelism (just mapping an operation), to find one dimension that always offered enough parallel slack by itself alone, either statically when not all dimensions are created equally or dynamically when they can be interchanged (which is both where it's technically possible to switch them around and to be less likely to be able to predict the "largest" dimension in the first place). My feeling is that higher-dimensional blocked ranges are to minimise boundaries for "communicating" heat or waves or other disturbances to a neighbouring box, where it's a lot better to do that between cube-like boxes than flat slabs. Using higher-dimensional blocked ranges just for parallel slack seems like a rather exceptional use case.
Most multiply-nested loops I've seen parallelized have been over dimensional indexes where all the computational work is inside the innermost loop, and various clustering and blocking strategies have been applied less to maximize parallel slack as to localize cache reuse--certainly as Raf pointed out blocked_range2d/3d are intended for localizing computation within particular local regions of memory that mimic spatial compactness, and hopefully get a boost through local data reuse. I must also admit that I've tried using blocked_range2d/3d to gain cache reuse performance employing "cache obliviousness" and not had much success. Whereas a careful analysis of data reuse while walking the planes of a 3d volume has resulted in implementations where blocking in the plane for a strip through z has minimized data use through last-level cache to reuse previous planes in things like stencil operations through a volume. Only if progressing through the nested loops exposes at each level a new wealth of computational operations whose localization to a set of thread partitions offers opportunities for data reuse that you might not get with a simpler loop could you justify threading at multiple levels. So far I have not seen a practical example of such code.
About 20 years ago I recall some buzz about genetic programming. Whereby the code evolves by itself to find a maximal solution (time or convergence precision). Once the best solution was found, it got re-used. I think there is too much expectation on compiler optimizations where by you expect the compiler to know how best to do something without being told all the particulars.