Intel® oneAPI Threading Building Blocks

stall in parallel_for

Seunghwa_Kang
Beginner

Hello,

I am using TBB 4.1 update 3 (Linux) on a workstation with 4 Intel Xeon X7560 sockets (32 cores, 64 hardware threads in total, Intel compiler 12.0.4) and testing with 4 MPI processes, each configured to use 8 threads (task_scheduler_init init(8)). Each process has three task groups, and each task group invokes parallel_for at multiple nesting levels.

If I use the debug version of the TBB libraries, I see the following assertion failures (mostly the first one, and somewhat less frequently the second one):

Assertion my_global_top_priority >= a.my_top_priority failed on line 530 of file ../../src/tbb/market.cpp

Assertion prev_level.workers_requested >= 0 && new_level.workers_requested >= 0 failed on line 520 of file ../../src/tbb/market.cpp

If I use the production version of the TBB libraries, I see a stall inside parallel_for (this happens roughly once every hour or two, though the timing is very irregular; the assertion failures above occur more frequently).

In one case I see two processes, out of the total four, each with two stalling threads (four stalling threads in total). Here are a few GDB outputs for one of the two problematic processes (both have 10 threads; the remaining two processes, which have no stalling thread, have 8 and 10 threads, respectively):

(gdb) info threads
  10 Thread 0x2afbd94d6940 (LWP 11682)  0x0000003810ad19a9 in syscall () from /lib64/libc.so.6
* 9 Thread 0x2afbd98d7940 (LWP 11684)  0x0000003810abb2f7 in sched_yield () from /lib64/libc.so.6
  8 Thread 0x2afbd9cd8940 (LWP 11690)  0x00002afbb3e5280e in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::receive_or_steal_task (this=0x2afbb43dfe00, completion_ref_count=@0x0, return_if_no_work=true, $I7=<value optimized out>, $I8=<value optimized out>,
    $I9=<value optimized out>) at ../../src/tbb/custom_scheduler.h:253
  7 Thread 0x2afbda0d9940 (LWP 11694)  0x0000003810ad19a9 in syscall () from /lib64/libc.so.6
  6 Thread 0x2afbda4da940 (LWP 11695)  0x00002afbb3e52810 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::receive_or_steal_task (this=0x2afbb42cfe00, completion_ref_count=@0x0, return_if_no_work=5, $I7=<value optimized out>, $I8=<value optimized out>,
    $I9=<value optimized out>) at ../../src/tbb/custom_scheduler.h:253
  5 Thread 0x2afbda8db940 (LWP 11699)  0x0000003810abb2f7 in sched_yield () from /lib64/libc.so.6 <= stalling
  4 Thread 0x2afbdacdc940 (LWP 11705)  0x0000003810abb2f7 in sched_yield () from /lib64/libc.so.6 <= waiting for a MPI message
  3 Thread 0x2afbdd4e9940 (LWP 11714)  tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::receive_or_steal_task (
    this=0x2afbb45a3480, completion_ref_count=@0x0, return_if_no_work=true, $I7=<value optimized out>, $I8=<value optimized out>,
    $I9=<value optimized out>) at ../../src/tbb/custom_scheduler.h:173
  2 Thread 0x2afbdd8ea940 (LWP 11715)  tbb::internal::generic_scheduler::reload_tasks (this=0x2afbdcc57e00, $5=<value optimized out>)
    at ../../src/tbb/scheduler.cpp:854
  1 Thread 0x2afbb41f5f60 (LWP 11678)  0x0000003810abb2f7 in sched_yield () from /lib64/libc.so.6 <=stalling

(gdb) thread 1
[Switching to thread 1 (Thread 0x2afbb41f5f60 (LWP 11678))]#0  0x0000003810abb2f7 in sched_yield () from /lib64/libc.so.6
(gdb) where
#0  0x0000003810abb2f7 in sched_yield () from /lib64/libc.so.6
#1  0x00002afbb3e529b4 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::receive_or_steal_task (this=0x2afbb45a4a00,
    completion_ref_count=@0x0, return_if_no_work=6, $I7=<value optimized out>, $I8=<value optimized out>, $I9=<value optimized out>)
    at ../../src/tbb/custom_scheduler.h:261
#2  0x00002afbb3e516e9 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x2afbb45a4a00,
    parent=..., child=0x6, $I4=<value optimized out>, $I5=<value optimized out>, $I6=<value optimized out>)
    at ../../src/tbb/custom_scheduler.h:547
#3  0x00002afbb3e4fbd0 in tbb::internal::generic_scheduler::local_spawn_root_and_wait (this=0x2afbb45a4a00, first=..., next=@0x6,
    $0=<value optimized out>, $1=<value optimized out>, $2=<value optimized out>) at ../../src/tbb/scheduler.cpp:664
#4  0x00002afbb3e4fafb in tbb::internal::generic_scheduler::spawn_root_and_wait (this=0x2afbb45a4a00, first=..., next=@0x6,
    $1=<value optimized out>, $2=<value optimized out>, $3=<value optimized out>) at ../../src/tbb/scheduler.cpp:672
#5  0x000000000048a272 in tbb::task::spawn_root_and_wait (root=..., $=1=<value optimized out>)
    at /home/kang697/install/tbb41u3/include/tbb/task.h:693
#6  0x0000000000626e6c in tbb::interface6::internal::start_for<tbb::blocked_range<int>, lambda [], tbb::auto_partitioner const>::run(const tbb::blocked_range<int> &, const class {...} &, const tbb::auto_partitioner &) (range=..., body=..., partitioner=..., $1=<value optimized out>,
    $2=<value optimized out>, $3=<value optimized out>) at /home/kang697/install/tbb41u3/include/tbb/parallel_for.h:94
#7  0x0000000000633066 in tbb::parallel_for(const tbb::blocked_range<int> &, const class {...} &) (range=..., body=...,
    $6=<value optimized out>, $7=<value optimized out>) at /home/kang697/install/tbb41u3/include/tbb/parallel_for.h:165
#8  0x00000000005fb993 in operator() (this=0x2afbdbe1ac58, r=..., $F2=<value optimized out>, $F3=<value optimized out>) at mech_intrct.cpp:295
#9  0x0000000000627637 in tbb::interface6::internal::start_for<tbb::blocked_range<int>, lambda [], tbb::auto_partitioner const>::run_body (
    this=0x2afbdbe1ac40, r=..., $7=<value optimized out>, $8=<value optimized out>)
    at /home/kang697/install/tbb41u3/include/tbb/parallel_for.h:110
#10 0x000000000061d883 in tbb::interface6::internal::partition_type_base<tbb::interface6::internal::auto_partition_type>::execute (
    this=0x2afbdbe1ac68, start=..., range=..., $5=<value optimized out>, $0=<value optimized out>, $1=<value optimized out>)
    at /home/kang697/install/tbb41u3/include/tbb/partitioner.h:259
#11 0x0000000000626ff4 in tbb::interface6::internal::start_for<tbb::blocked_range<int>, lambda [], tbb::auto_partitioner const>::execute (
    this=0x2afbdbe1ac40, $7=<value optimized out>) at /home/kang697/install/tbb41u3/include/tbb/parallel_for.h:116
#12 0x00002afbb3e515ba in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x2afbb45a4a00,
    parent=..., child=0x6, $I4=<value optimized out>, $I5=<value optimized out>, $I6=<value optimized out>)
    at ../../src/tbb/custom_scheduler.h:440
#13 0x00002afbb3e4fbd0 in tbb::internal::generic_scheduler::local_spawn_root_and_wait (this=0x2afbb45a4a00, first=..., next=@0x6,
    $0=<value optimized out>, $1=<value optimized out>, $2=<value optimized out>) at ../../src/tbb/scheduler.cpp:664
#14 0x00002afbb3e4fafb in tbb::internal::generic_scheduler::spawn_root_and_wait (this=0x2afbb45a4a00, first=..., next=@0x6,
    $1=<value optimized out>, $2=<value optimized out>, $3=<value optimized out>) at ../../src/tbb/scheduler.cpp:672
#15 0x000000000048a272 in tbb::task::spawn_root_and_wait (root=..., $=1=<value optimized out>)
    at /home/kang697/install/tbb41u3/include/tbb/task.h:693
#16 0x0000000000627540 in tbb::interface6::internal::start_for<tbb::blocked_range<int>, lambda [], tbb::auto_partitioner const>::run(const tbb::blocked_range<int> &, const class {...} &, const tbb::auto_partitioner &) (range=..., body=..., partitioner=..., $4=<value optimized out>,
    $5=<value optimized out>, $6=<value optimized out>) at /home/kang697/install/tbb41u3/include/tbb/parallel_for.h:94
#17 0x000000000063309e in tbb::parallel_for(const tbb::blocked_range<int> &, const class {...} &) (range=..., body=...,
    $8=<value optimized out>, $9=<value optimized out>) at /home/kang697/install/tbb41u3/include/tbb/parallel_for.h:165
#18 0x0000000000604e69 in Sim::computeMechIntrct (this=0x2afbb459fc80, $1=<value optimized out>) at mech_intrct.cpp:293
#19 0x00000000004b041f in operator() (this=0x2afbdc0b4248, $7=<value optimized out>) at run.cpp:174
#20 0x00000000004e0a5c in tbb::internal::function_task<lambda []>::execute (this=0x2afbdc0b4240, $8=<value optimized out>)
    at /home/kang697/install/tbb41u3/include/tbb/task_group.h:79
#21 0x00002afbb3e51f50 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::wait_for_all (this=0x2afbb45a4a00, parent=...,
    child=0x6, $J3=<value optimized out>, $J4=<value optimized out>, $J5=<value optimized out>) at ../../src/tbb/custom_scheduler.h:81
#22 0x00000000004dfd8f in tbb::task::wait_for_all (this=0x2afbb45c7a40, $=0=<value optimized out>)
    at /home/kang697/install/tbb41u3/include/tbb/task.h:704
#23 0x00000000004e040f in tbb::internal::task_group_base::wait (this=0x7fff0f9681d8, $3=<value optimized out>)
    at /home/kang697/install/tbb41u3/include/tbb/task_group.h:157
#24 0x00000000004b6f6b in Sim::run (this=0x2afbb459fc80, $=<value optimized out>) at run.cpp:176
#25 0x000000000042b58f in biocellion (xmlFile="fhcrc.xml") at sim.cpp:181
#26 0x000000000042722a in main (argc=2, ap_args=0x7fff0f96a218) at biocellion.cpp:30

(gdb) thread 5
[Switching to thread 5 (Thread 0x2afbda8db940 (LWP 11699))]#0  0x0000003810abb2f7 in sched_yield () from /lib64/libc.so.6
(gdb) where
#0  0x0000003810abb2f7 in sched_yield () from /lib64/libc.so.6
#1  0x00002afbb3e529b4 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::receive_or_steal_task (this=0x2afbb42b7e00,
    completion_ref_count=@0x0, return_if_no_work=2, $I7=<value optimized out>, $I8=<value optimized out>, $I9=<value optimized out>)
    at ../../src/tbb/custom_scheduler.h:261
#2  0x00002afbb3e516e9 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x2afbb42b7e00,
    parent=..., child=0x2, $I4=<value optimized out>, $I5=<value optimized out>, $I6=<value optimized out>)
    at ../../src/tbb/custom_scheduler.h:547
#3  0x00002afbb3e4fbd0 in tbb::internal::generic_scheduler::local_spawn_root_and_wait (this=0x2afbb42b7e00, first=..., next=@0x2,
    $0=<value optimized out>, $1=<value optimized out>, $2=<value optimized out>) at ../../src/tbb/scheduler.cpp:664
#4  0x00002afbb3e4fafb in tbb::internal::generic_scheduler::spawn_root_and_wait (this=0x2afbb42b7e00, first=..., next=@0x2,
    $1=<value optimized out>, $2=<value optimized out>, $3=<value optimized out>) at ../../src/tbb/scheduler.cpp:672
#5  0x000000000048a272 in tbb::task::spawn_root_and_wait (root=..., $=1=<value optimized out>)
    at /home/kang697/install/tbb41u3/include/tbb/task.h:693
#6  0x0000000000626e6c in tbb::interface6::internal::start_for<tbb::blocked_range<int>, lambda [], tbb::auto_partitioner const>::run(const tbb::blocked_range<int> &, const class {...} &, const tbb::auto_partitioner &) (range=..., body=..., partitioner=..., $1=<value optimized out>,
    $2=<value optimized out>, $3=<value optimized out>) at /home/kang697/install/tbb41u3/include/tbb/parallel_for.h:94
#7  0x0000000000633066 in tbb::parallel_for(const tbb::blocked_range<int> &, const class {...} &) (range=..., body=...,
    $6=<value optimized out>, $7=<value optimized out>) at /home/kang697/install/tbb41u3/include/tbb/parallel_for.h:165
#8  0x00000000005fb993 in operator() (this=0x2afbdc8add58, r=..., $F2=<value optimized out>, $F3=<value optimized out>) at mech_intrct.cpp:295
#9  0x0000000000627637 in tbb::interface6::internal::start_for<tbb::blocked_range<int>, lambda [], tbb::auto_partitioner const>::run_body (
    this=0x2afbdc8add40, r=..., $7=<value optimized out>, $8=<value optimized out>)
    at /home/kang697/install/tbb41u3/include/tbb/parallel_for.h:110
#10 0x000000000061d883 in tbb::interface6::internal::partition_type_base<tbb::interface6::internal::auto_partition_type>::execute (
    this=0x2afbdc8add68, start=..., range=..., $5=<value optimized out>, $0=<value optimized out>, $1=<value optimized out>)
    at /home/kang697/install/tbb41u3/include/tbb/partitioner.h:259
#11 0x0000000000626ff4 in tbb::interface6::internal::start_for<tbb::blocked_range<int>, lambda [], tbb::auto_partitioner const>::execute (
    this=0x2afbdc8add40, $7=<value optimized out>) at /home/kang697/install/tbb41u3/include/tbb/parallel_for.h:116
#12 0x00002afbb3e515ba in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x2afbb42b7e00,
    parent=..., child=0x2, $I4=<value optimized out>, $I5=<value optimized out>, $I6=<value optimized out>)
    at ../../src/tbb/custom_scheduler.h:440
#13 0x00002afbb3e4f189 in tbb::internal::arena::process (this=0x2afbb42b7e00, s=..., $t4=<value optimized out>, $t5=<value optimized out>)
    at ../../src/tbb/arena.cpp:98
#14 0x00002afbb3e4cf15 in tbb::internal::market::process (this=0x2afbb42b7e00, j=..., $0=<value optimized out>, $1=<value optimized out>)
    at ../../src/tbb/market.cpp:465
#15 0x00002afbb3e4a764 in tbb::internal::rml::private_worker::run (this=0x2afbb42b7e00, $L7=<value optimized out>)
    at ../../src/tbb/private_server.cpp:274
#16 0x00002afbb3e4a696 in tbb::internal::rml::private_worker::thread_routine (arg=0x2afbb42b7e00, $L9=<value optimized out>)
    at ../../src/tbb/private_server.cpp:231
#17 0x0000003811a0683d in start_thread () from /lib64/libpthread.so.0
#18 0x0000003810ad4fad in clone () from /lib64/libc.so.6

For a few seemingly normal threads:

(gdb) thread 2
[Switching to thread 2 (Thread 0x2afbdd8ea940 (LWP 11715))]#0  tbb::internal::generic_scheduler::reload_tasks (this=0x2afbdcc57e00,
    $5=<value optimized out>) at ../../src/tbb/scheduler.cpp:854
854    ../../src/tbb/scheduler.cpp: No such file or directory.
    in ../../src/tbb/scheduler.cpp
(gdb) where
#0  tbb::internal::generic_scheduler::reload_tasks (this=0x2afbdcc57e00, $5=<value optimized out>) at ../../src/tbb/scheduler.cpp:854
#1  0x00002afbb3e52726 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::receive_or_steal_task (this=0x2afbdcc57e00,
    completion_ref_count=@0x2afbb45a3a00, return_if_no_work=7, $I7=<value optimized out>, $I8=<value optimized out>, $I9=<value optimized out>)
    at ../../src/tbb/custom_scheduler.h:193
#2  0x00002afbb3e516e9 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x2afbdcc57e00,
    parent=..., child=0x7, $I4=<value optimized out>, $I5=<value optimized out>, $I6=<value optimized out>)
    at ../../src/tbb/custom_scheduler.h:547
#3  0x00002afbb3e4f189 in tbb::internal::arena::process (this=0x2afbdcc57e00, s=..., $t4=<value optimized out>, $t5=<value optimized out>)
    at ../../src/tbb/arena.cpp:98
#4  0x00002afbb3e4cf15 in tbb::internal::market::process (this=0x2afbdcc57e00, j=..., $0=<value optimized out>, $1=<value optimized out>)
    at ../../src/tbb/market.cpp:465
#5  0x00002afbb3e4a764 in tbb::internal::rml::private_worker::run (this=0x2afbdcc57e00, $L7=<value optimized out>)
    at ../../src/tbb/private_server.cpp:274
#6  0x00002afbb3e4a696 in tbb::internal::rml::private_worker::thread_routine (arg=0x2afbdcc57e00, $L9=<value optimized out>)
    at ../../src/tbb/private_server.cpp:231
#7  0x0000003811a0683d in start_thread () from /lib64/libpthread.so.0
#8  0x0000003810ad4fad in clone () from /lib64/libc.so.6

(gdb) thread 3
[Switching to thread 3 (Thread 0x2afbdd4e9940 (LWP 11714))]#0  tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::receive_or_steal_task (this=0x2afbb45a3480, completion_ref_count=@0x0, return_if_no_work=true, $I7=<value optimized out>, $I8=<value optimized out>,
    $I9=<value optimized out>) at ../../src/tbb/custom_scheduler.h:173
173    ../../src/tbb/custom_scheduler.h: No such file or directory.
    in ../../src/tbb/custom_scheduler.h
(gdb) where
#0  tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::receive_or_steal_task (this=0x2afbb45a3480, completion_ref_count=@0x0,
    return_if_no_work=true, $I7=<value optimized out>, $I8=<value optimized out>, $I9=<value optimized out>)
    at ../../src/tbb/custom_scheduler.h:173
#1  0x00002afbb3e516e9 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x2afbdcc6fe00,
    parent=..., child=0x1, $I4=<value optimized out>, $I5=<value optimized out>, $I6=<value optimized out>)
    at ../../src/tbb/custom_scheduler.h:547
#2  0x00002afbb3e4f189 in tbb::internal::arena::process (this=0x2afbb45a3480, s=..., $t4=<value optimized out>, $t5=<value optimized out>)
    at ../../src/tbb/arena.cpp:98
#3  0x00002afbb3e4cf15 in tbb::internal::market::process (this=0x2afbb45a3480, j=..., $0=<value optimized out>, $1=<value optimized out>)
    at ../../src/tbb/market.cpp:465
#4  0x00002afbb3e4a764 in tbb::internal::rml::private_worker::run (this=0x2afbb45a3480, $L7=<value optimized out>)
    at ../../src/tbb/private_server.cpp:274
#5  0x00002afbb3e4a696 in tbb::internal::rml::private_worker::thread_routine (arg=0x2afbb45a3480, $L9=<value optimized out>)
    at ../../src/tbb/private_server.cpp:231
#6  0x0000003811a0683d in start_thread () from /lib64/libpthread.so.0
#7  0x0000003810ad4fad in clone () from /lib64/libc.so.6

(gdb) thread 7
[Switching to thread 7 (Thread 0x2afbda0d9940 (LWP 11694))]#0  0x0000003810ad19a9 in syscall () from /lib64/libc.so.6
(gdb) where
#0  0x0000003810ad19a9 in syscall () from /lib64/libc.so.6
#1  0x00002afbb3e4a8e2 in tbb::internal::rml::private_worker::run (this=0x2afbb4557dac, $L7=<value optimized out>)
    at ../../src/tbb/private_server.cpp:281
#2  0x00002afbb3e4a696 in tbb::internal::rml::private_worker::thread_routine (arg=0x2afbb4557dac, $L9=<value optimized out>)
    at ../../src/tbb/private_server.cpp:231
#3  0x0000003811a0683d in start_thread () from /lib64/libpthread.so.0
#4  0x0000003810ad4fad in clone () from /lib64/libc.so.6

Any clues???

I have also seen, several times, threads ending up inside the bool arena::is_out_of_work() function...

Thank you very much,

RafSchietekat
Valued Contributor III

If you don't use priorities yourself, try again with __TBB_TASK_PRIORITY=0 (somewhere in tbb_config.h).
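
Something like this, as a rough sketch (assuming you rebuild TBB from source; the exact location and default definition of the macro inside tbb_config.h may differ between updates):

[cpp]
// Rough sketch only: in include/tbb/tbb_config.h of the TBB sources, change the
// existing definition of __TBB_TASK_PRIORITY so the feature is compiled out,
// then rebuild libtbb/libtbb_debug and relink the application against it.
#define __TBB_TASK_PRIORITY 0
[/cpp]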

Vladimir_P_1234567890

Hello, what MPI version do you use?

Could it be related to the post http://software.intel.com/en-us/forums/topic/392226?

--Vladimir

MLema2
New Contributor I

I got the same problem. The only way I managed to fix the issue was to not use context priorities at all.

http://software.intel.com/en-us/forums/topic/278901

Seunghwa_Kang
Beginner

Raf Schietekat wrote:

If you don't use priorities yourself, try again with __TBB_TASK_PRIORITY=0 (somewhere in tbb_config.h).

Thank you for the suggestion.

I will try, but I am using priorities (task::self().set_group_priority()) and have a reason to do so. I assigned low, normal, and high priorities to three different task groups (a rough sketch of the pattern is below). It also seems like this problem happens only when there is more than one process per shared-memory node.
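
To make the pattern concrete, a rough sketch of what I mean (not my actual code; the loop bodies are empty placeholders):

[cpp]
#include <tbb/tbb.h>

using namespace tbb;

/* sketch: three task groups, each running at a different group priority and
   each invoking (possibly nested) parallel_for loops */
void runStep( void ) {
    task_group lowGroup, normalGroup, highGroup;

    lowGroup.run( []{
        task::self().set_group_priority( priority_low );
        parallel_for( blocked_range<int>( 0, 1000 ), []( const blocked_range<int>& r ) {
            for( int i = r.begin() ; i < r.end() ; i++ ) { /* nested parallel_for goes here */ }
        } );
    } );

    normalGroup.run( []{
        task::self().set_group_priority( priority_normal );
        parallel_for( blocked_range<int>( 0, 1000 ), []( const blocked_range<int>& ) {} );
    } );

    highGroup.run( []{
        task::self().set_group_priority( priority_high );
        parallel_for( blocked_range<int>( 0, 1000 ), []( const blocked_range<int>& ) {} );
    } );

    lowGroup.wait();
    normalGroup.wait();
    highGroup.wait();
}

int main( void ) {
    task_scheduler_init init( 8 ); /* 8 threads per MPI process, as described above */

    for( int step = 0 ; step < 100 ; step++ ) runStep();

    return 0;
}
[/cpp]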

Seunghwa_Kang
Beginner

Vladimir Polin (Intel) wrote:

Hello what MPI version do you use?

can't it be related to post http://software.intel.com/en-us/forums/topic/392226?

--Vladimir

I am using MPICH2 (1.4.1p1). The blocking thread (waiting for an MPI message) in the GDB output is normal... The problem I am seeing is a stall inside parallel_for even though there are many idle threads.

Thanks!!!

Seunghwa_Kang
Beginner

Michel Lemay wrote:

I got the same problem..  Only way I got to fix the issue is not to use context priorities at all.

http://software.intel.com/en-us/forums/topic/278901

Thank you, and it seems like this is related to priorities... I am assigning low, normal, and high priorities to three different task groups, and each group invokes parallel_for at multiple nesting levels. But it seems like this problem occurs only when there is more than one process per node in my case (I am using 4.1 update 3). Have you ever seen this with just one process per node?

A more general question I have is... is there any sort of resource sharing or communication inside the TBB library among different processes using TBB?

Thanks,

RafSchietekat
Valued Contributor III

The suggestion was just to try and help pinpoint the problem: how about Vladimir's question?

I did do some research about a problem with priorities, but there's a lot more that I haven't figured out yet.

Seunghwa_Kang
Beginner

Raf Schietekat wrote:

The suggestion was just to try and help pinpoint the problem: how about Vladimir's question?

I did do some research about a problem with priorities, but there's a lot more that I haven't figured out yet.

It seems like my replies to other comments are still sleeping in the queue :-(

I commented out all task::self().set_group_priority calls in my code and I haven't observed an assertion failure or stall yet. It seems like this is somewhat relevant to priorities.

I don't think MPI is an issue here. I am using MPICH2 and I am not seeing anything strange related to MPI communication in GDB outputs.

Thanks,

MLema2
New Contributor I

Seunghwa Kang wrote:

Quote:

Michel Lemay wrote:

I got the same problem..  Only way I got to fix the issue is not to use context priorities at all.

http://software.intel.com/en-us/forums/topic/278901

Thank you, and it seems like this is related to priorities... I am assigning low, normal, and high priorities to three different task groups, and each group invokes parallel_for at multiple nesting levels. But it seems like this problem occurs only when there is more than one process per node in my case (I am using 4.1 update 3). Have you ever seen this with just one process per node?

A more general question I have is... is there any sort of resource sharing or communication inside the TBB library among different processes using TBB?

Thanks,

We run only one process with TBB scheduling. This process contains tens of threads with varying priorities, and each such thread assigns a matching priority to the context passed to parallel algorithms; i.e., low-priority threads (background processing) typically launch low-priority parallel loops and tasks, while, conversely, high-priority threads (serving user input) launch high-priority tasks.

I saw this issue on at least two machines, with AMD and Intel processors, with 32 cores and under heavy utilization. However, I've never been able to create a simple piece of code mimicking this problem to send to the TBB team for review.
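
For illustration, a rough sketch of the pattern (not our actual code; it uses an explicit task_group_context, which is what I mean by "context priorities"):

[cpp]
#include <tbb/tbb.h>

/* sketch: a background thread runs its loops in a low-priority context, while
   a thread serving user input runs its loops in a high-priority context */
void backgroundPass( int n ) {
    tbb::task_group_context ctx;
    ctx.set_priority( tbb::priority_low );
    tbb::parallel_for( tbb::blocked_range<int>( 0, n ),
                       []( const tbb::blocked_range<int>& ) { /* background work */ },
                       tbb::auto_partitioner(), ctx );
}

void interactivePass( int n ) {
    tbb::task_group_context ctx;
    ctx.set_priority( tbb::priority_high );
    tbb::parallel_for( tbb::blocked_range<int>( 0, n ),
                       []( const tbb::blocked_range<int>& ) { /* latency-sensitive work */ },
                       tbb::auto_partitioner(), ctx );
}
[/cpp]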

jimdempseyatthecove
Honored Contributor III

Kang,

From your first description of your application, (in summary) you partitioned your MPI processes onto separate sockets, each running 8 threads in TBB... without oversubscription of the system. The system has 64 hardware threads and 4 MPI processes, each TBB'd to 8 threads (32 threads in total).

From reading other forum posts, I seem to recall that the newest TBB library has added a "feature" whereby, if a system is running multiple TBB applications that each assume they have the full complement of hardware threads, each TBB application throttles down its number of worker threads. This code is new (it may have bugs) and may cause issues with MPI, especially across synchronization points. If possible, try to disable this feature.

Jim Dempsey

Seunghwa_Kang
Beginner

I am still having this issue with TBB 4.1 update 4.

The stall is quite difficult to reproduce, but the following assertion fails pretty frequently even with the simple code I pasted below. My wild guess is that this assertion failure is a necessary condition for the stall, and I hope fixing this will also fix the stall issue.

Assertion my_global_top_priority >= a.my_top_priority failed on line 506 of file ../../src/tbb/market.cpp

[cpp]

#include <stdio.h>
#include <stdlib.h>

#include <iostream>
#include <string>

#include <omp.h>

#include <tbb/tbb.h>

using namespace tbb;

#define ENABLE_PRIORITY 1
#define NUM_TIME_STEPS 100000

void myFunc( void );

int main( int argc, char* ap_args[] ) {
    /* initialize TBB */

    task_scheduler_init init( 64 );

    /* start time-stepping */

    /* this loop cannot be parallelized */
    for( int i = 0 ; i < NUM_TIME_STEPS ; ) {
#if ENABLE_PRIORITY
        task::self().set_group_priority( priority_normal );
#endif

        {
            task_group myGroup;

            myGroup.run( []{ myFunc(); } );

            myGroup.wait();
        }

        i++;

        std::cout << i << "/" << NUM_TIME_STEPS << std::endl;
    }

    std::cout << "Simulation finished." << std::endl;

    return 0;
}

void myFunc( void ) {
#if ENABLE_PRIORITY
    task::self().set_group_priority( priority_high );
#endif

    for( int i = 0 ; i < 10 ; i++ ) {
        /* make a copy of the agent and grid data at the beginning of this sub-step */

        parallel_for( blocked_range<int> ( 0, 38 ), [&]( const blocked_range<int>& r ) {
        for( int ii = r.begin() ; ii < r.end() ; ii++ ) {
            parallel_for( blocked_range<int> ( 0, 8 ), [&]( const blocked_range<int>& r2 ) {
            for( int jj = r2.begin() ; jj < r2.end() ; jj++ ) {
            }
            } );
        }
        } );
    }

    return;
}

[/cpp]

I ran the code on a Linux system with 32 cores and 64 hardware threads.

I built the executable by typing:

icpc -std=c++0x -g -Wall -I /home/install/tbb41u4/include -I ../include -c test.cpp -o test.o
icpc -std=c++0x -static-intel  -o test test.o -openmp-link static -openmp -L /home/install/tbb41u4/lib/intel64/gcc4.1 -ltbb_debug -ltbbmalloc_proxy_debug -ltbbmalloc_debug

Vladimir_P_1234567890

Hello, great reproducer!

I was able to reproduce the assertion failure on a Windows machine.
Could you submit it via our contribution page? Then we can add it to our unit testing.

--Vladimir

MLema2
New Contributor I

Wow! This bug has been hiding and creeping for so long! I'm glad someone finally reproduced it with a simple piece of code!

Good job M Kang!

Seunghwa_Kang
Beginner

Just want to report that this assertion is still failing with TBB 4.2 and the stall issue persists.

I hope this will be fixed sometime soon.

jimdempseyatthecove
Honored Contributor III

I am unable to directly test the code. Can you try this:

[cpp]
#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <string>
#include <omp.h>
#include <tbb/tbb.h>
using namespace tbb;
#define ENABLE_PRIORITY 1
#define NUM_TIME_STEPS 100000
void myFunc( void );
int main( int argc, char* ap_args[] ) {
    /* initialize TBB */
    task_scheduler_init init( 64 );
    /* start time-stepping */
    /* this loop cannot be parallelized */
    for( int i = 0 ; i < NUM_TIME_STEPS ; ) {
#if ENABLE_PRIORITY
        task::self().set_group_priority( priority_normal );
#endif
        {
            task_group myGroup;
            myGroup.run( []{ myFunc(); } );
#if ENABLE_PRIORITY
            task::self().set_group_priority( priority_high );
#endif
            myGroup.wait();
#if ENABLE_PRIORITY
            task::self().set_group_priority( priority_normal );
#endif
        }
        i++;
        std::cout << i << "/" << NUM_TIME_STEPS << std::endl;
    }
    std::cout << "Simulation finished." << std::endl;
    return 0;
}
void myFunc( void ) {
#if ENABLE_PRIORITY
    task::self().set_group_priority( priority_high );
#endif
    for( int i = 0 ; i < 10 ; i++ ) {
        /* make a copy of the agent and grid data at the beginning of this sub-step */
        parallel_for( blocked_range<int> ( 0, 38 ), [&]( const blocked_range<int>& r ) {
        for( int ii = r.begin() ; ii < r.end() ; ii++ ) {
            parallel_for( blocked_range<int> ( 0, 8 ), [&]( const blocked_range<int>& r2 ) {
            for( int jj = r2.begin() ; jj < r2.end() ; jj++ ) {
            }
            } );
        }
        } );
    }
    return;
}
[/cpp]

Jim Dempsey

RafSchietekat
Valued Contributor III

One thing that's immediately striking (and I've been here before, but I may have abandoned it at the time; I just don't remember at this moment) is how "volatile intptr_t* tbb::internal::generic_scheduler::my_ref_top_priority" can reference either "volatile intptr_t tbb::internal::arena_base::my_top_priority" (master) or "intptr_t tbb::internal::market::my_global_top_priority" (worker). Perhaps the non-volatile market member variable will work, perhaps not, but I think it should all be tbb::atomic<intptr_t>: first, because a volatile is only a poor man's atomic anyway (it silently relies on cooperation from both compiler and hardware for elementary atomic behaviour, has nonportably implied memory semantics on some compilers but none by the Standard, and provides no RMW operations); and second, because there's no help from the compiler to avoid such confusion between volatile and non-volatile (which may temporarily reside in registers).

I have no idea at this time whether this might be the cause of the problem or just a red herring, but I would get rid of the volatile even if only as a matter of principle. In fact, I would get rid of all uses of volatile in TBB, or at least those that are not burdened with backward compatibility.

(Added) It would seem that the market's my_global_top_priority is protected by my_arenas_list_mutex for writing, but, strictly speaking, although the new value should now not be hidden indefinitely, it's not quite the same as a read-write mutex, especially if there's a breach caused by the alias. Luckily my_ref_top_priority's referent can be made const, so that's not an issue.

(Added) Does the problem occur on both Linux and Windows, or only on Linux? If it occurs only on Linux, that would suggest the code relies on nonportable memory semantics associated with the use of "volatile". Otherwise it's something else, or that and something else.

(Added) TODO: check that update_global_top_priority() is only called by a thread that holds a lock on my_arenas_list_mutex. These things should be documented...

Looking further (and I may add to this posting)...
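
To make the volatile-versus-atomic point concrete, a generic illustration (not the actual TBB members):

[cpp]
#include <stdint.h>

#include <tbb/atomic.h>

/* a plain/volatile integer gives no RMW operations and no portable
   memory-ordering guarantees, and the compiler will happily mix it
   with non-volatile aliases */
volatile intptr_t top_priority_volatile;

/* tbb::atomic<> gives well-defined atomic loads/stores and RMW operations
   with documented memory semantics on every supported platform */
tbb::atomic<intptr_t> top_priority_atomic;

void example( void ) {
    top_priority_atomic = 1;                             /* atomic store, release semantics by default */
    intptr_t observed = top_priority_atomic;             /* atomic load, acquire semantics by default */
    top_priority_atomic.compare_and_swap( 2, observed ); /* atomic RMW, not expressible with volatile alone */
}
[/cpp]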

Seunghwa_Kang
Beginner

Hello Jim,

The code you attached also fails.

I am getting

Assertion my_global_top_priority >= a.my_top_priority failed on line 536 of file ../../src/tbb/market.cpp
Abort

Thanks,

jimdempseyatthecove wrote:

I am unable to directly test the code. Can you try this:

[the code listing from the previous post]

Jim Dempsey

Seunghwa_Kang
Beginner

I have tried this only on Linux, so one possibility is that this problem shows up only on Linux and not on Windows.

Thanks,

Raf Schietekat wrote:

One thing that's immediately striking, and I've been here before but I may have abandoned it at the time (I just don't remember at this moment), is how "volatile intptr_t* tbb::internal::generic_scheduler::my_ref_top_priority" can reference either "volatile intptr_t tbb::internal::arena_base::my_top_priority" (master) or "intptr_t tbb::internal::market::my_global_top_priority" (worker): perhaps the non-volatile market member variable will work, perhaps not, but I think that it should all be tbb::atomic<intptr_t>, at least because a volatile is only a poor man's atomic anyway (silently relies on cooperation from both compiler and hardware for elementary atomic behaviour, nonportably implied memory semantics on some compilers but none by the Standard, no RMW operations), but also because there's no help from the compiler to avoid such confusion between volatile and non-volatile (which may temporarily reside in registers).

I have no idea at this time whether this might be the cause of the problem or just a red herring, but I would get rid of the volatile even if only as a matter of principle. Actually all uses of volatile in TBB, or at least those that are not burdened with backward compatibility.

(Added) It would seem that the market's my_global_top_priority is protected by my_arenas_list_mutex for writing, but, strictly speaking, although the new value should now not be hidden indefinitely, it's not quite the same as a read-write mutex, especially if there's a breach caused by the alias. Luckily my_ref_top_priority's referent can be made const, so that's not an issue.

(Added) Does the problem occur, e.g., on both Linux and Windows, or only on Linux? If so, it would mean that the code relies on nonportable memory semantics associated with the use of "volatile". Otherwise it's something else, or that and something else.

(Added) TODO: check that update_global_top_priority() is only called by a thread that holds a lock on my_arenas_list_mutex. These things should be documented...

Looking further (and I may add to this posting)...

jimdempseyatthecove
Honored Contributor III

Try this:

remove my edits (go back to your posted code)
remove the main thread's messing with priority (leave as-is from init time)
Add a chunksize (grainsize) argument to your two inner parallel_for's such that the product of the two partitionings is less than the total number of threads in the thread pool. This is not a fix but simply a diagnostic.

(0, 38, 13)     - 3 tasks (threads)
(0, 8, 4)        - 2 tasks (threads) x 3 tasks == 6 of the 8 in the init.

 See if the code locks up.
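
Something like this, as an untested sketch against your reproducer (here the "chunksize" is the grainsize, the third blocked_range argument):

[cpp]
#include <tbb/tbb.h>

using namespace tbb;

#define ENABLE_PRIORITY 1

/* untested sketch of myFunc() from the reproducer: same nested loops, but with
   explicit grainsizes so each parallel_for splits into only a few leaf ranges
   (on the order of the 3 x 2 partitioning above) */
void myFunc( void ) {
#if ENABLE_PRIORITY
    task::self().set_group_priority( priority_high ); /* unchanged from the reproducer */
#endif

    for( int i = 0 ; i < 10 ; i++ ) {
        parallel_for( blocked_range<int>( 0, 38, 13 ), [&]( const blocked_range<int>& r ) {
        for( int ii = r.begin() ; ii < r.end() ; ii++ ) {
            parallel_for( blocked_range<int>( 0, 8, 4 ), [&]( const blocked_range<int>& r2 ) {
            for( int jj = r2.begin() ; jj < r2.end() ; jj++ ) {
            }
            } );
        }
        } );
    }

    return;
}
[/cpp]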

------------------------

It is unclear what your expectations are in changing the priorities. These priorities are not system thread priorities; they are task-level priorities.

Jim Dempsey

Seunghwa_Kang
Beginner

This still fails.

"It is unclear as to what your expectations are with changing the priorities. These priorities are not system thread priorities. Instead these are task level priorities."

The code I attached is just a reproducer. What I have in the real code is three task groups doing different kinds of work.

jimdempseyatthecove wrote:

Try this:

remove my edits (go back to your posted code)
remove the main thread's messing with priority (leave as-is from init time)
Add a chunksize argument to your two inner parallel_for's such that the product of the two partitionings is less than total number of threads in the thread pool. This is not a fix but simply a diagnostic.

(0, 38, 13)     - 3 tasks (threads)
(0, 8, 4)        - 2 tasks (threads) x 3 tasks == 6 of the 8 in the init.

 See if the code locks up.

------------------------

It is unclear as to what your expectations are with changing the priorities. These priorities are not system thread priorities. Instead these are task level priorities.

Jim Dempsey
