I am using TBB for a numerical computing application. I have an outer for loop that runs sequentially, and an inner loop that is parallelized using TBB's parallel_for(). This inner loop does most of the work of the application. I recently made a change where a work-intensive function gets called one more time within the parallel_for loop.

When the application starts, work is properly distributed across all the available cores at close to 100% utilization. After the first iteration of the outer loop, however, the application seems to run serially, keeping only one core active at close to 100%. Before my recent code change, I sometimes noticed this behavior during test runs; I never understood exactly why it was happening, but I would tweak the grain size or the number of iterations to get it to use all cores. Now my test runs always use all the cores through all iterations, but full executions drop down to one core after the first iteration every time.

I have experimentally tested many grain sizes, not setting a grain size, different numbers of iterations, passing the number of processors as an argument to task_scheduler_init, using auto_partitioner(), using simple_partitioner(), and so on. I also tested the application without the recent code change, and full executions complete using all cores at close to 100%. There is nothing unusual about this function call (that I know of): it is already called several times within the parallel_for loop; the only change is that it is now called one more time, which seems to have disrupted my delicate workload distribution balance.

I have this application running on Linux and Windows and get the same behavior on both. I can't seem to find the magic combination to keep all cores active after the first iteration. Any suggestions are greatly appreciated.
It's hard to suggest something specific with only a vague idea of your code.
Might it be that the function call you added has significant variability in its run time depending on the actual parameters, and you hit an unluckily long call that leaves one thread working while the others sit idle? You could try measuring the function's run time with tbb::tick_count and see how much it varies. Also, if you replace this suspicious function call with a call that runs for approximately the same time but has no side effects (e.g. spinning for a while on a local volatile variable), do you still see the same effect?
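To make the two experiments concrete, here is a rough sketch of what I mean. It uses std::chrono so that it compiles without TBB; inside your application you could take the timestamps with tbb::tick_count::now() and read the difference with .seconds() instead. The names spin_work, time_call, TimingStats and summarize are made up for illustration and are not part of TBB or of your code.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for the suspicious call: burns CPU for `iterations`
// loop steps and only touches a local volatile variable, so it has no side
// effects on shared state. Tune `iterations` to roughly match the real call.
inline void spin_work(unsigned long iterations) {
    volatile unsigned long sink = 0;
    for (unsigned long i = 0; i < iterations; ++i) {
        sink = sink + 1;
    }
}

// Times a single call of `f` and returns the elapsed wall-clock seconds.
// With TBB available, tbb::tick_count::now() and (t1 - t0).seconds()
// could be used here instead of std::chrono.
template <typename F>
double time_call(F&& f) {
    const auto t0 = std::chrono::steady_clock::now();
    std::forward<F>(f)();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

// Summary of a set of timing samples (all values in seconds).
struct TimingStats {
    double min_s;
    double max_s;
    double total_s;
    std::size_t count;
};

// Computes min, max and total over the samples; all zeros if there are none.
inline TimingStats summarize(const std::vector<double>& samples) {
    TimingStats stats{0.0, 0.0, 0.0, samples.size()};
    if (samples.empty()) {
        return stats;
    }
    stats.min_s = *std::min_element(samples.begin(), samples.end());
    stats.max_s = *std::max_element(samples.begin(), samples.end());
    for (double s : samples) {
        stats.total_s += s;
    }
    return stats;
}
```

In the parallel_for body you would wrap the added call in time_call, collect the results per thread (or in a preallocated array indexed by loop iteration, to avoid locking), and compare max_s against the average. If a few calls take far longer than the rest, that points to load imbalance; if swapping in spin_work makes the problem disappear, the cause is more likely something the real function does, such as locking or memory allocation.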
Also, I wonder if your outer loop can be parallelized as well. The usual recommendation is to parallelize at the outermost level that is practical, because finer-grained parallelism in an inner loop can cause too much overhead and leaves fewer opportunities for load balancing.