I see, the outer loop is also parallel... Well, in TBB, parallel_for doesn't mean that the different tasks can make progress at the same time: that would amount to required parallelism, and TBB is all about optional parallelism. So parallel_for will start executing one or more chunks in parallel, which may occupy all available worker threads but don't necessarily cover the complete range, and then tackle more chunks as worker threads become available. That's why the concept of an overall barrier doesn't apply like it does with threads. Instead, you should probably distribute the contents of the outer body over successive invocations of the outer parallel_for, each of which happens before the next one. When doing that, consider whether each inner parallel_for really offers an opportunity for additional parallelism or merely increases parallel overhead, in which case it might as well be serial instead.
static affinity_partitioner ap;
for(int k=0; k
(k, size, (size-k)/2), lud_division(), ap); /* for(i=r.begin()+1; i
(k, size, (size-k)/2), lud_elimination(), ap); /* for(i=r.begin()+1; i }