The recursive task-based parallelism of TBB is usually working very well for me. However, occasionally there is a situation where a simple static split of the work into n_thread equal-sized parts, one per thread, is beneficial
I currently have such a situation. In step 1, a simple range-based parallelism is used to construct thread-local oct-trees from a list of positions. In steps 2&3, these thread-local trees are merged and linked to the final data product (all steps use tbb parallelism). In this process, an considerable speed-up (~factor 2) is obtained (in serial and parallel) if the initial positions are already ordered close to the tree order (in practice I use the order from the previous time step). Steps 2&3 (and perhaps also step 1), should benefit from an even better affinity, such as would emerge if in step 1 each thread a works on a contiguous partition of all positions.
How could/should I implement this in TBB. I was thinking about creating a range which splits off N_initial/n_thread until N<=n_thread, when it reverts to ordinary splitting down-the-middle. Would this give me the correct behaviour? In these initial splits, which side should keep the larger portion: the newly created or the old range?
Alternatively, I could spawn n_thread root tasks? would this be better (in the sense of faster distributing the work evenly amongst the threads as intended)?
Have you considered parallel_reduce()? If there are no strict barriers between the steps, you could do step 1 in the execution and step 2 and 3 in the joins. I would think that TBB's decisions align with what you want to achieve, even without using TLS. I'm not saying it would necessarily be the best solution, but it seems to belong on the shortlist.
sorry, but step 1 has to be finished before step 2 can start and the same for step 2 to step 3. I just implemented a simple static_range, which splits off N_initial/default_num_threads() at the end and it does seem to achieve what I wanted, including a small speed-up.