Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

using multiple tbb containers in code

Vijay_O_
Beginner
476 Views
Is it okay to use TBB constructs that call other TBB constructs? For example, I have a parallel_for loop whose body class might later call a parallel_reduce or another parallel_for loop. I have nested loops, and functions within a loop that themselves contain loops, and I do not want to alter the code right now just to reduce the looping structure. Will this also balance the load?
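To give an idea, the structure is roughly like the sketch below (the data and names are just illustrative, not my real code):
[cpp]#include "tbb/parallel_for.h"
#include "tbb/parallel_reduce.h"
#include "tbb/blocked_range.h"

// Inner body: reduces over one row (illustrative only).
struct RowSum {
    const float* row;
    float sum;
    RowSum( const float* r ) : row(r), sum(0) {}
    RowSum( RowSum& other, tbb::split ) : row(other.row), sum(0) {}
    void operator()( const tbb::blocked_range<int>& r ) {
        for( int j=r.begin(); j!=r.end(); ++j )
            sum += row[j];
    }
    void join( RowSum& other ) { sum += other.sum; }
};

// Outer body: each iteration calls a nested parallel_reduce.
struct ProcessRows {
    float** rows;
    float* totals;
    int ncols;
    void operator()( const tbb::blocked_range<int>& r ) const {
        for( int i=r.begin(); i!=r.end(); ++i ) {
            RowSum rs( rows[i] );
            tbb::parallel_reduce( tbb::blocked_range<int>(0,ncols), rs );
            totals[i] = rs.sum;
        }
    }
};

// Outer parallel loop over the rows.
void ProcessAll( float** rows, float* totals, int nrows, int ncols ) {
    ProcessRows p;
    p.rows = rows; p.totals = totals; p.ncols = ncols;
    tbb::parallel_for( tbb::blocked_range<int>(0,nrows), p );
}[/cpp]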
0 Kudos
7 Replies
RafSchietekat
Valued Contributor III
476 Views
I don't see any problem with that approach.
0 Kudos
Alexey-Kukanov
Employee
476 Views
Quoting - Vijay Oza
Is it okay to use TBB constructs that call other TBB constructs? For example, I have a parallel_for loop whose body class might later call a parallel_reduce or another parallel_for loop. I have nested loops, and functions within a loop that themselves contain loops, and I do not want to alter the code right now just to reduce the looping structure. Will this also balance the load?

Avoid calling nested parallel loops while holding a lock. There is a known deadlock issue if an inner parallel loop is called while an iteration of the outer parallel loop holds a lock. If you cannot avoid it, use a reentrant lock (e.g. tbb::recursive_mutex) so that it can be acquired again by the thread already holding it.
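For example (just a sketch with made-up names; with a plain tbb::mutex the same pattern could self-deadlock):
[cpp]#include "tbb/recursive_mutex.h"
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

tbb::recursive_mutex theMutex;            // illustrative shared lock

struct Inner {
    void operator()( const tbb::blocked_range<int>& ) const { /* ... */ }
};

struct Outer {
    void operator()( const tbb::blocked_range<int>& r ) const {
        for( int i=r.begin(); i!=r.end(); ++i ) {
            tbb::recursive_mutex::scoped_lock lock( theMutex );
            // While waiting for the inner loop below, the same thread may steal
            // and run another outer iteration that tries to lock theMutex again;
            // a recursive mutex lets it re-acquire the lock it already holds.
            tbb::parallel_for( tbb::blocked_range<int>(0,1000), Inner() );
        }
    }
};[/cpp]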

Other than that, the nested scenario should work fine.
0 Kudos
Vijay_O_
Beginner
476 Views
Right now I have no critical sections, so locking might not be a problem, but will this method cause tasks to be pushed onto the task queue during processing? I have started adding multi-core processing but got a performance hit; I think that might be because my function is not properly handling the load balancing.
0 Kudos
ARCH_R_Intel
Employee
476 Views
Quoting - Vijay Oza
Right now I have no critical sections, so locking might not be a problem, but will this method cause tasks to be pushed onto the task queue during processing? I have started adding multi-core processing but got a performance hit; I think that might be because my function is not properly handling the load balancing.

The TBB scheduler is designed to handle nesting efficiently. If my code takes a serious performance hit after adding TBB, the first thing I do is look at how fast it runs with TBB using a single thread; i.e. declaring something like:
[cpp]int main() {
    tbb::task_scheduler_init init(1);   
    ...
}[/cpp]
If the performance hit is still there even for a single thread, then the problem is likely that the tasks are too small to amortize the scheduler overheads. The TBB 2.2 default of "auto_partitioner" usually takes care of making tasks large enough.
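If you want to control this explicitly (or are on an older TBB where auto_partitioner is not yet the default), a partitioner and grainsize can be passed by hand; this is only a sketch, and the grainsize value is a guess to be tuned:
[cpp]#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include "tbb/partitioner.h"

struct ScaleBody {
    float* a;
    void operator()( const tbb::blocked_range<int>& r ) const {
        for( int i=r.begin(); i!=r.end(); ++i )
            a[i] *= 2;
    }
};

void Scale( float* a, int n ) {
    ScaleBody body; body.a = a;
    // TBB 2.2 default behavior: auto_partitioner picks chunk sizes adaptively.
    tbb::parallel_for( tbb::blocked_range<int>(0,n), body, tbb::auto_partitioner() );
    // Alternative: simple_partitioner splits all the way down to the grainsize,
    // so the grainsize (1000 here is only a guess) must be large enough to
    // amortize the scheduler overhead per chunk:
    // tbb::parallel_for( tbb::blocked_range<int>(0,n,1000), body, tbb::simple_partitioner() );
}[/cpp]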

One situation, however, where it will not save you is if the total loop execution takes less than about 10,000 cycles. In that case, the loop probably cannot be efficiently parallelized with TBB, and the parallel loop constructs just add overhead. If you have loops that are often short, but occasionally very long, consider using a run-time test to determine whether to use the parallel loop. I won't claim that it is always easy to tell which loops are short or long. Alas, it is inherent in parallel execution that you somehow need to know in advance whether the parallelization is profitable on average.
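A sketch of such a run-time test (the threshold of 1000 is just a placeholder to be measured for your own code):
[cpp]#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

struct UpdateBody {
    float* a;
    void operator()( const tbb::blocked_range<int>& r ) const {
        for( int i=r.begin(); i!=r.end(); ++i )
            a[i] += 1;
    }
};

void Update( float* a, int n ) {
    if( n<1000 ) {
        // Short trip count: parallel overhead would dominate, so stay serial.
        for( int i=0; i<n; ++i )
            a[i] += 1;
    } else {
        UpdateBody body; body.a = a;
        tbb::parallel_for( tbb::blocked_range<int>(0,n), body );
    }
}[/cpp]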

One occasional problem is that after a loop is converted to TBB, the compiler may fail to optimize it as well as the original loop. In particular, compilers usually optimize non-address-taken scalar variables much better than fields of a structure. So it sometimes pays to load fields into scalars before entering a loop. E.g.:
[cpp]void Foo::operator()( const tbb::blocked_range<int>& r ) {
    // Hoist read of r.end() into local scalar.
    int e = r.end();
    for( int i=r.begin(); i!=e; ++i ) {
        ...
    }
}[/cpp]
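The same trick applies to fields of the body object itself; a sketch with made-up members:
[cpp]#include "tbb/blocked_range.h"

class Scale {
    float factor;                         // fields read inside the loop
    float* a;
public:
    Scale( float f, float* p ) : factor(f), a(p) {}
    void operator()( const tbb::blocked_range<int>& r ) const {
        float f = factor;                 // hoist fields into local scalars so the
        float* p = a;                     // compiler can keep them in registers
        int e = r.end();
        for( int i=r.begin(); i!=e; ++i )
            p[i] *= f;
    }
};[/cpp]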
Mutexes or replacement of non-atomic operations with atomic operations can be other culprits that slow down single-threaded execution of multi-threaded code.

If the performance of the TBB code on a single thread is okay, but the code slows down for multiple threads (e.g. changing the "1" in the task_scheduler_init), then the problem is trickier to diagnose. Before exploring that, I suggest you try the "tbb::task_scheduler_init init(1);" variation to find out if the problem is sequential efficiency or not.
0 Kudos
RafSchietekat
Valued Contributor III
476 Views
#2 "If you can not avoid it, use a reentrant lock (e.g. tbb::recursive_mutex) so that it can be acquired by the thread already holding it."
How is that going to work if the inner loop's work is stolen? I didn't consider the circumstance that anyone might want to do that, but if I have to, then I'd say this would in fact be a substantial problem with that approach.
0 Kudos
Alexey-Kukanov
Employee
476 Views
Quoting - Raf Schietekat
#2 "If you can not avoid it, use a reentrant lock (e.g. tbb::recursive_mutex) so that it can be acquired by the thread already holding it."
How is that going to work if the inner loop's work is stolen? I didn't consider the circumstance that anyone might want to do that, but if I have to, then I'd say this would in fact be a substantial problem with that approach.

You are right, using a reentrant lock is more of a mediocre workaround than a good solution, and I should write more about it.
The deadlock will be avoided, but the behavior may still be incorrect. The thread will simultaneously execute the critical section for two iterations of the outer loop, with the first one being kind of "paused" in the middle (at the point of the call to the inner parallel loop). Executing another iteration on the same thread is not expected and can change the program state for the "paused" iteration, which can then break when it resumes. So one should be careful when using a reentrant lock for deadlock avoidance in the described case, and understand well how things (don't) work so as not to fall into other trouble.
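Schematically, with made-up names (extending the earlier sketch):
[cpp]#include <cassert>
#include "tbb/recursive_mutex.h"
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

tbb::recursive_mutex theMutex;
int activeRow = -1;                       // shared state guarded by theMutex

struct Inner {
    void operator()( const tbb::blocked_range<int>& ) const { /* ... */ }
};

struct Outer {
    void operator()( const tbb::blocked_range<int>& r ) const {
        for( int i=r.begin(); i!=r.end(); ++i ) {
            tbb::recursive_mutex::scoped_lock lock( theMutex );
            activeRow = i;                // state this iteration relies on later
            tbb::parallel_for( tbb::blocked_range<int>(0,1000), Inner() );
            // While the call above waits, this same thread may steal another outer
            // iteration, re-acquire theMutex, and overwrite activeRow; the "paused"
            // iteration then resumes with changed state.
            assert( activeRow == i );     // may fail even though a lock is held
        }
    }
};[/cpp]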
0 Kudos
RafSchietekat
Valued Contributor III
476 Views
I think I may have misunderstood what you mean, but probably it would lead too far to go into it further.
0 Kudos