parallel_reduce getting stuck

vanswaaij · ‎05-16-2008

Hi,

Is it allowed to call parallel_reduce in a method that is itself already called from parallel_reduce?

The test code below usually finishes in a few seconds, but sometimes it gets stuck in a busy wait; see the stack trace further down for an example.

Is there a bug in this test code, is it misuse of parallel_reduce or is it some other problem?

If it is misuse, what would be the right way to count the leaf nodes in an n-ary tree?

Thanks

=============================================================

#include "tbb/blocked_range.h"
#include "tbb/parallel_reduce.h"
#include "tbb/task_scheduler_init.h"

#define LOOP 16
#define SIZE 1024 
#define NEST 3
#define GRAIN 256

class thing {
public:
        void doit (int & nb, int nest);
};

class reduce_test {
private:
        
        thing * global_obj_;
        int nest_;
        int nb_;

public:

        reduce_test (reduce_test & x, tbb::split) :
                global_obj_ (x.global_obj_), nest_ (x.nest_), nb_ (0)  {}

        reduce_test (thing * global_obj, int nest) :
                 global_obj_ (global_obj), nest_ (nest), nb_ (0) {}

        void operator () (const tbb::blocked_range<size_t> & r) 
        {
                for (size_t i = r.begin (); i != r.end (); ++i) 
                {
                        int nb = 0;
                        global_obj_->doit (nb, nest_);
                        nb_ += nb;
                }
        }

        void join (const reduce_test & x)
        {
                nb_ += x.nb_;
        }

        int nb () {return nb_;}

};


void thing::doit (int & nb, int nest) 
{
        ++nest;

        if (nest < NEST) 
        {
                reduce_test rt (this, nest);
                tbb::parallel_reduce (tbb::blocked_range<size_t>(0, SIZE, GRAIN), rt);
                nb = rt.nb ();

        }
        else 
        {
                for (int i = 0; i < LOOP; ++i) 
                {
                        for (int j = 0; j < LOOP; ++j) 
                        {
                                if (i == j) ++nb;
                        }
                }
        }
}


int main ()
{
        tbb::task_scheduler_init init;

        thing * global_obj = new (thing);

        int nb = 0;

        global_obj->doit (nb, 0);

        return 0;
}

=================================================================
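For reference, the value the test should produce can be worked out with a serial version of the same recursion. This is only a sketch and not part of the original post: `count_serial` and the `k...` constants are hypothetical names that mirror `thing::doit` and the `#define`s above. Two nested levels each iterate over SIZE items and every innermost call contributes LOOP, so the expected total is SIZE * SIZE * LOOP = 1024 * 1024 * 16 = 16777216.

```cpp
#include <cassert>
#include <cstddef>

// Constants mirroring the #defines in the test.
// GRAIN is omitted: it only affects how TBB splits the range, not the result.
constexpr int kLoop = 16;
constexpr std::size_t kSize = 1024;
constexpr int kNest = 3;

// Serial equivalent of thing::doit: same recursion, no parallel_reduce.
long count_serial(int nest) {
    assert(nest >= 0);
    ++nest;
    if (nest < kNest) {
        // Stands in for the parallel_reduce over [0, SIZE).
        long total = 0;
        for (std::size_t i = 0; i != kSize; ++i)
            total += count_serial(nest);
        return total;
    }
    // Innermost level: counts the diagonal of a LOOP x LOOP grid.
    long nb = 0;
    for (int i = 0; i < kLoop; ++i)
        for (int j = 0; j < kLoop; ++j)
            if (i == j) ++nb;
    return nb;
}
```

`count_serial(0)` corresponds to the call `global_obj->doit (nb, 0)` in `main`, so a parallel run that completes can be checked by comparing its `nb` against that value.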

Top of stack for both threads:

(gdb) thread 1

[Switching to thread 1 (Thread 46912498605664 (LWP 10654))]#0 0x0000003b7e9ae159 in sched_yield () from /lib64/tls/libc.so.6

(gdb) thread 2

[Switching to thread 2 (Thread 1077938528 (LWP 10655))]#0 0x0000003b7e9ae159 in sched_yield () from /lib64/tls/libc.so.6

=====================================================================

Full stack trace

[Switching to thread 1 (Thread 46912498605664 (LWP 10654))]#0  0x0000003b7e9ae159 in sched_yield ()
   from /lib64/tls/libc.so.6
(gdb) where
#0  0x0000003b7e9ae159 in sched_yield () from /lib64/tls/libc.so.6
#1  0x00002aaaaaac1b15 in tbb::internal::AtomicBackoff::pause (this=0x7fffffe50920) at tbb_machine.h:149
#2  0x00002aaaaaac8933 in __TBB_LockByte (flag=@0x7fffffe50d10) at tbb_machine.h:563
#3  0x00002aaaaaacbd4b in tbb::task_group_context::unbind (this=0x7fffffe509a0) at ../../src/tbb/task.cpp:2794
#4  0x00002aaaaaacbaa1 in ~task_group_context (this=0x7fffffe509a0) at ../../src/tbb/task.cpp:2744
#5  0x000000000040168f in tbb::internal::start_reduce<:BLOCKED_RANGE>, reduce_test, tbb::simple_partitioner>::run (range=@0x7fffffe50a60, body=@0x7fffffe50a90, partitioner=@0x7fffffe50a8f) at parallel_reduce.h:136
#6  0x00000000004015a7 in tbb::parallel_reduce<:BLOCKED_RANGE>, reduce_test> (range=@0x7fffffe50a60, 
    body=@0x7fffffe50a90, partitioner=@0x7fffffe50a8f) at parallel_reduce.h:301
#7  0x00000000004013ad in thing::doit (this=0x504ae0, nb=@0x7fffffe50ae4, nest=2) at main.cc:58
#8  0x0000000000401c66 in reduce_test::operator() (this=0x7fffffe50dd0, r=@0x505350) at main.cc:36
#9  0x00000000004019c6 in tbb::internal::start_reduce<:BLOCKED_RANGE>, reduce_test, tbb::simple_partitioner>::execute (this=0x505340) at parallel_reduce.h:149
#10 0x00002aaaaaacf1da in tbb::internal::CustomScheduler<:INTERNAL::INTELSCHEDULERTRAITS>::wait_for_all (
    this=0x504680, parent=@0x504d40, child=0x504bc0) at ../../src/tbb/task.cpp:1993
#11 0x00002aaaaaaca294 in tbb::internal::GenericScheduler::spawn_root_and_wait (this=0x504680, first=@0x504bc0, 
    next=@0x504bb8) at ../../src/tbb/task.cpp:1776
#12 0x00000000004016d4 in tbb::task::spawn_root_and_wait (root=@0x504bc0) at task.h:644
#13 0x000000000040165a in tbb::internal::start_reduce<:BLOCKED_RANGE>, reduce_test, tbb::simple_partitioner>::run (range=@0x7fffffe50da0, body=@0x7fffffe50dd0, partitioner=@0x7fffffe50dcf) at parallel_reduce.h:136
#14 0x00000000004015a7 in tbb::parallel_reduce<:BLOCKED_RANGE>, reduce_test> (range=@0x7fffffe50da0, 
    body=@0x7fffffe50dd0, partitioner=@0x7fffffe50dcf) at parallel_reduce.h:301
#15 0x00000000004013ad in thing::doit (this=0x504ae0, nb=@0x7fffffe50e34, nest=1) at main.cc:58
#16 0x0000000000401443 in main () at main.cc:83
(gdb) thread 2
[Switching to thread 2 (Thread 1077938528 (LWP 10655))]#0  0x0000003b7e9ae159 in sched_yield () from /lib64/tls/libc.so.6
(gdb) where
#0  0x0000003b7e9ae159 in sched_yield () from /lib64/tls/libc.so.6
#1  0x00002aaaaaac1b15 in tbb::internal::AtomicBackoff::pause (this=0x403ffe40) at tbb_machine.h:149
#2  0x00002aaaaaac8933 in __TBB_LockByte (flag=@0x7fffffe50d10) at tbb_machine.h:563
#3  0x00002aaaaaacbc54 in tbb::task_group_context::bind_to (this=0x403fff10, parent=@0x7fffffe50ce0)
    at ../../src/tbb/task.cpp:2775
#4  0x00002aaaaaacb19b in tbb::internal::allocate_root_with_context_proxy::allocate (this=0x403fff08, size=48)
    at ../../src/tbb/task.cpp:2542
#5  0x000000000040172b in operator new (bytes=48, p=@0x403fff08) at task.h:815
#6  0x0000000000401606 in tbb::internal::start_reduce<:BLOCKED_RANGE>, reduce_test, tbb::simple_partitioner>::run (range=@0x403fffd0, body=@0x40400000, partitioner=@0x403fffff) at parallel_reduce.h:136
#7  0x00000000004015a7 in tbb::parallel_reduce<:BLOCKED_RANGE>, reduce_test> (range=@0x403fffd0, 
    body=@0x40400000, partitioner=@0x403fffff) at parallel_reduce.h:301
#8  0x00000000004013ad in thing::doit (this=0x504ae0, nb=@0x40400054, nest=2) at main.cc:58
#9  0x0000000000401c66 in reduce_test::operator() (this=0x504ed8, r=@0x5063d0) at main.cc:36
#10 0x00000000004019c6 in tbb::internal::start_reduce<:BLOCKED_RANGE>, reduce_test, tbb::simple_partitioner>::execute (this=0x5063c0) at parallel_reduce.h:149
#11 0x00002aaaaaacf1da in tbb::internal::CustomScheduler<:INTERNAL::INTELSCHEDULERTRAITS>::wait_for_all (
    this=0x505e00, parent=@0x506040, child=0x0) at ../../src/tbb/task.cpp:1993
#12 0x00002aaaaaacae93 in tbb::internal::GenericScheduler::worker_routine (arg=0x5045a0) at ../../src/tbb/task.cpp:2430
#13 0x0000003b7fa060aa in start_thread () from /lib64/tls/libpthread.so.0
#14 0x0000003b7e9c53d3 in clone () from /lib64/tls/libc.so.6
#15 0x0000000000000000 in ?? ()

Alexey-Kukanov · ‎05-16-2008

Thank you for reporting the issue.

TBB parallel algorithms should be nestable, as in your code. The stacks suggest to me that we might have had a problem in the implementation of work cancellation, which is new functionality in TBB and keeps changing. Could you please tell us which TBB package you used?

vanswaaij · ‎05-16-2008

Hi Alexey,

I'm using tbb20_20080408oss_src

Thanks for looking at it.

Andrey_Marochko · ‎05-20-2008

Thanks for the good test case! As Alexey said, we've recently reimplemented most of the internal cancellation code. In particular, we've removed all the locks from the normal execution control path (that is, no locks are used when there is no exception or cancellation in flight). I've tested your example with this new version and it works fine. You can find it in the latest development release (tbb20_20080512oss).

I would also like to ask you (if you don't mind) to contribute your example through our contribution page so that I could add it to our regression test suite.

vanswaaij · ‎05-20-2008

Thanks for the quick turn-around. I've submitted the test case per your request for your regression test suite.
I'm curious how you deal with those regression tests, by the way. For instance, I needed to run this test case a number of times before it would get stuck. A single successful run doesn't guarantee much. I have the same problem, of course, with the application I'm porting.
Debugging is also a problem. I have already resorted to producing a core file first and then inspecting it with the debugger, because the code would not fail when run from scratch in the debugger. I wonder what your thoughts on these issues are.

Andrey_Marochko · ‎05-21-2008

Thank you for your contribution!

Honestly speaking, we do not have a special infrastructure for regression testing so far. Its absence is more or less compensated for by the fact that we run daily automated test sessions (using our unit tests and example apps) on a few dozen machines (several compilers on each machine, plus debug and release modes). Overall, each test case is thus run about a hundred times a day, and normally that is enough for even most sporadic bugs to manifest themselves (at least once in a few days). Besides, we are working on building a performance test suite that will also help with catching correctness bugs.

Yet you are right that multiple runs are necessary to detect poorly reproducible bugs with a higher degree of reliability. A few recent cases (one of which is yours) convinced us that we need to extend our test harness to support multiple runs. This will also require placing regression test cases with poor reproducibility in a separate group, because repeating the whole test session even a hundred times would take a few weeks.

As for the practical aspects of debugging, your technique is probably the most universal one, and it is what we often use too (when I see signs of a sporadic problem, I use the shell's "for" to run a test a thousand times and then either inspect the core dump or attach a debugger if the test hangs).
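The repeated-run approach can also be sketched inside the test program itself. The following is a minimal illustration using only the C++ standard library, not TBB's actual test harness: `stress_result` and `run_repeatedly` are hypothetical names, and it assumes the test function does not throw. Each run executes on its own detached thread, and a run that fails to finish within a timeout is treated as a hang.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <future>
#include <memory>
#include <thread>

// Outcome of a repeated-run session (hypothetical type for this sketch).
struct stress_result {
    int completed;  // runs that finished within the timeout
    bool hung;      // true if a run exceeded the timeout
};

// Runs `test` up to `runs` times, each on its own thread, and stops at the
// first run that does not finish within `timeout`.
stress_result run_repeatedly(std::function<void()> test, int runs,
                             std::chrono::milliseconds timeout) {
    assert(runs > 0);
    stress_result result{0, false};
    for (int i = 0; i < runs; ++i) {
        auto done = std::make_shared<std::promise<void>>();
        std::future<void> finished = done->get_future();
        // Detached so that a hung run cannot block the harness; the shared_ptr
        // keeps the promise alive for as long as the thread needs it.
        std::thread([test, done] { test(); done->set_value(); }).detach();
        if (finished.wait_for(timeout) != std::future_status::ready) {
            result.hung = true;  // the point at which to dump core or attach a debugger
            break;
        }
        ++result.completed;
    }
    return result;
}
```

A caller would pass the test body as the first argument, for example a lambda wrapping the `doit` call from the test case above. When `hung` comes back true, the stuck thread is still alive, so calling `std::abort()` at that point is one way to get a core file with the hang preserved.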

Yet you have another good option for dealing with correctness problems, one that is often unavailable to us (I'll explain why below). There is a great tool called Intel Thread Checker. It is specifically designed to find all sorts of correctness issues in multithreaded applications and, what is really invaluable, it does not require the problem to actually occur in the test run in order to detect it. So you may want to try it out, and I hope that it will help you.

By the way, support for Intel Thread Checker has been significantly improved in the latest development release of TBB. Most of the false positives that we were aware of have been eliminated.

And if you are curious why we cannot use this great tool ourselves (at least most of the time), it is because the TBB scheduler uses very specific mechanisms for inter-thread communication. The only way to let Intel Thread Checker know about them would be to insert a lot of special API calls, which would obviously affect TBB's control flow significantly. You, as a TBB user, are protected from the false positives that might be caused by the TBB internals, but since these internals are what we usually need to debug, we are not.