It's hard to suggest anything specific with only a vague idea of your code.
Could it be that the function call you added varies significantly in run time depending on its actual arguments, and you happened to hit an unusually long call? You could try measuring the function's run time with tbb::tick_count and see how much it varies. Also, if you replace the suspect call with one that runs for approximately the same time but has no side effects (e.g. spinning for a while, updating a local volatile variable), do you see the same effect?
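To make the two experiments above concrete, here is a rough sketch. I don't know your code, so the names (spin_for, measure_calls, TimingStats) are made up for illustration. To keep it standalone it uses std::chrono::steady_clock rather than TBB; with tbb::tick_count you would take t0 = tbb::tick_count::now() and read (tbb::tick_count::now() - t0).seconds() in the same places.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <utility>

// Busy-wait for roughly `seconds`, updating only a local volatile variable,
// so the call costs time but has no side effects visible to other threads.
// (Hypothetical stand-in for the suspect function.)
inline void spin_for(double seconds) {
    using clock = std::chrono::steady_clock;
    volatile unsigned long counter = 0;
    const auto start = clock::now();
    while (std::chrono::duration<double>(clock::now() - start).count() < seconds) {
        counter = counter + 1;
    }
}

// Minimum, maximum, and mean wall-clock time per call, in seconds.
struct TimingStats {
    double min_s;
    double max_s;
    double mean_s;
};

// Run `call` `repetitions` times and record how much its run time varies.
template <typename F>
TimingStats measure_calls(F&& call, int repetitions) {
    assert(repetitions > 0);
    using clock = std::chrono::steady_clock;
    double min_s = 0.0, max_s = 0.0, total_s = 0.0;
    for (int i = 0; i < repetitions; ++i) {
        const auto t0 = clock::now();
        call();
        const double elapsed = std::chrono::duration<double>(clock::now() - t0).count();
        min_s = (i == 0) ? elapsed : std::min(min_s, elapsed);
        max_s = (i == 0) ? elapsed : std::max(max_s, elapsed);
        total_s += elapsed;
    }
    return TimingStats{min_s, max_s, total_s / repetitions};
}
```

For example, measure_calls([&] { your_function(args); }, 1000) on representative arguments would show the spread (your_function is a placeholder). If max_s is far above mean_s, an unlucky long call is plausible. Swapping the real call for spin_for(mean_s) then tells you whether the duration alone reproduces the effect.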
I also wonder whether your outer loop can be parallelized as well. The usual recommendation is to parallelize the outermost loop that is practical, because finer-grained parallelism in an inner loop can add too much overhead and leaves fewer opportunities for load balancing.