Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

stall in parallel_for

Seunghwa_Kang
Beginner

Hello,

I am using TBB 4.1 update 3 (Linux) on a workstation with 4 Intel Xeon sockets (X7560; 32 cores, 64 threads in total), built with Intel compiler 12.0.4, and testing with 4 MPI processes, each configured to use 8 threads (task_scheduler_init init(8)). Each process has three task groups, and each task group invokes parallel_for at multiple levels of nesting.

If I use the debug version of the TBB libraries, I see the following assertion failures (mostly the first one, and somewhat less frequently the second one):

Assertion my_global_top_priority >= a.my_top_priority failed on line 530 of file ../../src/tbb/market.cpp

Assertion prev_level.workers_requested >= 0 && new_level.workers_requested >= 0 failed on line 520 of file ../../src/tbb/market.cpp

If I use the production version of the TBB libraries, I see a stall inside parallel_for (this happens roughly once every hour or two... very irregular; the above assertion failures occur more frequently).

In one case, two processes out of the total four each have two stalling threads (four stalling threads in total). Here are a few GDB outputs for one of the two problematic processes (both have 10 threads; the remaining two processes, without a stalling thread, have 8 and 10 threads, respectively):

(gdb) info threads
  10 Thread 0x2afbd94d6940 (LWP 11682)  0x0000003810ad19a9 in syscall () from /lib64/libc.so.6
* 9 Thread 0x2afbd98d7940 (LWP 11684)  0x0000003810abb2f7 in sched_yield () from /lib64/libc.so.6
  8 Thread 0x2afbd9cd8940 (LWP 11690)  0x00002afbb3e5280e in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::receive_or_steal_task (this=0x2afbb43dfe00, completion_ref_count=@0x0, return_if_no_work=true, $I7=<value optimized out>, $I8=<value optimized out>,
    $I9=<value optimized out>) at ../../src/tbb/custom_scheduler.h:253
  7 Thread 0x2afbda0d9940 (LWP 11694)  0x0000003810ad19a9 in syscall () from /lib64/libc.so.6
  6 Thread 0x2afbda4da940 (LWP 11695)  0x00002afbb3e52810 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::receive_or_steal_task (this=0x2afbb42cfe00, completion_ref_count=@0x0, return_if_no_work=5, $I7=<value optimized out>, $I8=<value optimized out>,
    $I9=<value optimized out>) at ../../src/tbb/custom_scheduler.h:253
  5 Thread 0x2afbda8db940 (LWP 11699)  0x0000003810abb2f7 in sched_yield () from /lib64/libc.so.6 <= stalling
  4 Thread 0x2afbdacdc940 (LWP 11705)  0x0000003810abb2f7 in sched_yield () from /lib64/libc.so.6 <= waiting for a MPI message
  3 Thread 0x2afbdd4e9940 (LWP 11714)  tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::receive_or_steal_task (
    this=0x2afbb45a3480, completion_ref_count=@0x0, return_if_no_work=true, $I7=<value optimized out>, $I8=<value optimized out>,
    $I9=<value optimized out>) at ../../src/tbb/custom_scheduler.h:173
  2 Thread 0x2afbdd8ea940 (LWP 11715)  tbb::internal::generic_scheduler::reload_tasks (this=0x2afbdcc57e00, $5=<value optimized out>)
    at ../../src/tbb/scheduler.cpp:854
  1 Thread 0x2afbb41f5f60 (LWP 11678)  0x0000003810abb2f7 in sched_yield () from /lib64/libc.so.6 <=stalling

(gdb) thread 1
[Switching to thread 1 (Thread 0x2afbb41f5f60 (LWP 11678))]#0  0x0000003810abb2f7 in sched_yield () from /lib64/libc.so.6
(gdb) where
#0  0x0000003810abb2f7 in sched_yield () from /lib64/libc.so.6
#1  0x00002afbb3e529b4 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::receive_or_steal_task (this=0x2afbb45a4a00,
    completion_ref_count=@0x0, return_if_no_work=6, $I7=<value optimized out>, $I8=<value optimized out>, $I9=<value optimized out>)
    at ../../src/tbb/custom_scheduler.h:261
#2  0x00002afbb3e516e9 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x2afbb45a4a00,
    parent=..., child=0x6, $I4=<value optimized out>, $I5=<value optimized out>, $I6=<value optimized out>)
    at ../../src/tbb/custom_scheduler.h:547
#3  0x00002afbb3e4fbd0 in tbb::internal::generic_scheduler::local_spawn_root_and_wait (this=0x2afbb45a4a00, first=..., next=@0x6,
    $0=<value optimized out>, $1=<value optimized out>, $2=<value optimized out>) at ../../src/tbb/scheduler.cpp:664
#4  0x00002afbb3e4fafb in tbb::internal::generic_scheduler::spawn_root_and_wait (this=0x2afbb45a4a00, first=..., next=@0x6,
    $1=<value optimized out>, $2=<value optimized out>, $3=<value optimized out>) at ../../src/tbb/scheduler.cpp:672
#5  0x000000000048a272 in tbb::task::spawn_root_and_wait (root=..., $=1=<value optimized out>)
    at /home/kang697/install/tbb41u3/include/tbb/task.h:693
#6  0x0000000000626e6c in tbb::interface6::internal::start_for<tbb::blocked_range<int>, lambda [], tbb::auto_partitioner const>::run(const tbb::blocked_range<int> &, const class {...} &, const tbb::auto_partitioner &) (range=..., body=..., partitioner=..., $1=<value optimized out>,
    $2=<value optimized out>, $3=<value optimized out>) at /home/kang697/install/tbb41u3/include/tbb/parallel_for.h:94
#7  0x0000000000633066 in tbb::parallel_for(const tbb::blocked_range<int> &, const class {...} &) (range=..., body=...,
    $6=<value optimized out>, $7=<value optimized out>) at /home/kang697/install/tbb41u3/include/tbb/parallel_for.h:165
#8  0x00000000005fb993 in operator() (this=0x2afbdbe1ac58, r=..., $F2=<value optimized out>, $F3=<value optimized out>) at mech_intrct.cpp:295
#9  0x0000000000627637 in tbb::interface6::internal::start_for<tbb::blocked_range<int>, lambda [], tbb::auto_partitioner const>::run_body (
    this=0x2afbdbe1ac40, r=..., $7=<value optimized out>, $8=<value optimized out>)
    at /home/kang697/install/tbb41u3/include/tbb/parallel_for.h:110
#10 0x000000000061d883 in tbb::interface6::internal::partition_type_base<tbb::interface6::internal::auto_partition_type>::execute (
    this=0x2afbdbe1ac68, start=..., range=..., $5=<value optimized out>, $0=<value optimized out>, $1=<value optimized out>)
    at /home/kang697/install/tbb41u3/include/tbb/partitioner.h:259
#11 0x0000000000626ff4 in tbb::interface6::internal::start_for<tbb::blocked_range<int>, lambda [], tbb::auto_partitioner const>::execute (
    this=0x2afbdbe1ac40, $7=<value optimized out>) at /home/kang697/install/tbb41u3/include/tbb/parallel_for.h:116
#12 0x00002afbb3e515ba in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x2afbb45a4a00,
    parent=..., child=0x6, $I4=<value optimized out>, $I5=<value optimized out>, $I6=<value optimized out>)
    at ../../src/tbb/custom_scheduler.h:440
#13 0x00002afbb3e4fbd0 in tbb::internal::generic_scheduler::local_spawn_root_and_wait (this=0x2afbb45a4a00, first=..., next=@0x6,
    $0=<value optimized out>, $1=<value optimized out>, $2=<value optimized out>) at ../../src/tbb/scheduler.cpp:664
#14 0x00002afbb3e4fafb in tbb::internal::generic_scheduler::spawn_root_and_wait (this=0x2afbb45a4a00, first=..., next=@0x6,
    $1=<value optimized out>, $2=<value optimized out>, $3=<value optimized out>) at ../../src/tbb/scheduler.cpp:672
#15 0x000000000048a272 in tbb::task::spawn_root_and_wait (root=..., $=1=<value optimized out>)
    at /home/kang697/install/tbb41u3/include/tbb/task.h:693
#16 0x0000000000627540 in tbb::interface6::internal::start_for<tbb::blocked_range<int>, lambda [], tbb::auto_partitioner const>::run(const tbb::blocked_range<int> &, const class {...} &, const tbb::auto_partitioner &) (range=..., body=..., partitioner=..., $4=<value optimized out>,
    $5=<value optimized out>, $6=<value optimized out>) at /home/kang697/install/tbb41u3/include/tbb/parallel_for.h:94
#17 0x000000000063309e in tbb::parallel_for(const tbb::blocked_range<int> &, const class {...} &) (range=..., body=...,
    $8=<value optimized out>, $9=<value optimized out>) at /home/kang697/install/tbb41u3/include/tbb/parallel_for.h:165
#18 0x0000000000604e69 in Sim::computeMechIntrct (this=0x2afbb459fc80, $1=<value optimized out>) at mech_intrct.cpp:293
#19 0x00000000004b041f in operator() (this=0x2afbdc0b4248, $7=<value optimized out>) at run.cpp:174
#20 0x00000000004e0a5c in tbb::internal::function_task<lambda []>::execute (this=0x2afbdc0b4240, $8=<value optimized out>)
    at /home/kang697/install/tbb41u3/include/tbb/task_group.h:79
#21 0x00002afbb3e51f50 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::wait_for_all (this=0x2afbb45a4a00, parent=...,
    child=0x6, $J3=<value optimized out>, $J4=<value optimized out>, $J5=<value optimized out>) at ../../src/tbb/custom_scheduler.h:81
#22 0x00000000004dfd8f in tbb::task::wait_for_all (this=0x2afbb45c7a40, $=0=<value optimized out>)
    at /home/kang697/install/tbb41u3/include/tbb/task.h:704
#23 0x00000000004e040f in tbb::internal::task_group_base::wait (this=0x7fff0f9681d8, $3=<value optimized out>)
    at /home/kang697/install/tbb41u3/include/tbb/task_group.h:157
#24 0x00000000004b6f6b in Sim::run (this=0x2afbb459fc80, $=<value optimized out>) at run.cpp:176
#25 0x000000000042b58f in biocellion (xmlFile="fhcrc.xml") at sim.cpp:181
#26 0x000000000042722a in main (argc=2, ap_args=0x7fff0f96a218) at biocellion.cpp:30

(gdb) thread 5
[Switching to thread 5 (Thread 0x2afbda8db940 (LWP 11699))]#0  0x0000003810abb2f7 in sched_yield () from /lib64/libc.so.6
(gdb) where
#0  0x0000003810abb2f7 in sched_yield () from /lib64/libc.so.6
#1  0x00002afbb3e529b4 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::receive_or_steal_task (this=0x2afbb42b7e00,
    completion_ref_count=@0x0, return_if_no_work=2, $I7=<value optimized out>, $I8=<value optimized out>, $I9=<value optimized out>)
    at ../../src/tbb/custom_scheduler.h:261
#2  0x00002afbb3e516e9 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x2afbb42b7e00,
    parent=..., child=0x2, $I4=<value optimized out>, $I5=<value optimized out>, $I6=<value optimized out>)
    at ../../src/tbb/custom_scheduler.h:547
#3  0x00002afbb3e4fbd0 in tbb::internal::generic_scheduler::local_spawn_root_and_wait (this=0x2afbb42b7e00, first=..., next=@0x2,
    $0=<value optimized out>, $1=<value optimized out>, $2=<value optimized out>) at ../../src/tbb/scheduler.cpp:664
#4  0x00002afbb3e4fafb in tbb::internal::generic_scheduler::spawn_root_and_wait (this=0x2afbb42b7e00, first=..., next=@0x2,
    $1=<value optimized out>, $2=<value optimized out>, $3=<value optimized out>) at ../../src/tbb/scheduler.cpp:672
#5  0x000000000048a272 in tbb::task::spawn_root_and_wait (root=..., $=1=<value optimized out>)
    at /home/kang697/install/tbb41u3/include/tbb/task.h:693
#6  0x0000000000626e6c in tbb::interface6::internal::start_for<tbb::blocked_range<int>, lambda [], tbb::auto_partitioner const>::run(const tbb::blocked_range<int> &, const class {...} &, const tbb::auto_partitioner &) (range=..., body=..., partitioner=..., $1=<value optimized out>,
    $2=<value optimized out>, $3=<value optimized out>) at /home/kang697/install/tbb41u3/include/tbb/parallel_for.h:94
#7  0x0000000000633066 in tbb::parallel_for(const tbb::blocked_range<int> &, const class {...} &) (range=..., body=...,
    $6=<value optimized out>, $7=<value optimized out>) at /home/kang697/install/tbb41u3/include/tbb/parallel_for.h:165
#8  0x00000000005fb993 in operator() (this=0x2afbdc8add58, r=..., $F2=<value optimized out>, $F3=<value optimized out>) at mech_intrct.cpp:295
#9  0x0000000000627637 in tbb::interface6::internal::start_for<tbb::blocked_range<int>, lambda [], tbb::auto_partitioner const>::run_body (
    this=0x2afbdc8add40, r=..., $7=<value optimized out>, $8=<value optimized out>)
    at /home/kang697/install/tbb41u3/include/tbb/parallel_for.h:110
#10 0x000000000061d883 in tbb::interface6::internal::partition_type_base<tbb::interface6::internal::auto_partition_type>::execute (
    this=0x2afbdc8add68, start=..., range=..., $5=<value optimized out>, $0=<value optimized out>, $1=<value optimized out>)
    at /home/kang697/install/tbb41u3/include/tbb/partitioner.h:259
#11 0x0000000000626ff4 in tbb::interface6::internal::start_for<tbb::blocked_range<int>, lambda [], tbb::auto_partitioner const>::execute (
    this=0x2afbdc8add40, $7=<value optimized out>) at /home/kang697/install/tbb41u3/include/tbb/parallel_for.h:116
#12 0x00002afbb3e515ba in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x2afbb42b7e00,
    parent=..., child=0x2, $I4=<value optimized out>, $I5=<value optimized out>, $I6=<value optimized out>)
    at ../../src/tbb/custom_scheduler.h:440
#13 0x00002afbb3e4f189 in tbb::internal::arena::process (this=0x2afbb42b7e00, s=..., $t4=<value optimized out>, $t5=<value optimized out>)
    at ../../src/tbb/arena.cpp:98
#14 0x00002afbb3e4cf15 in tbb::internal::market::process (this=0x2afbb42b7e00, j=..., $0=<value optimized out>, $1=<value optimized out>)
    at ../../src/tbb/market.cpp:465
#15 0x00002afbb3e4a764 in tbb::internal::rml::private_worker::run (this=0x2afbb42b7e00, $L7=<value optimized out>)
    at ../../src/tbb/private_server.cpp:274
#16 0x00002afbb3e4a696 in tbb::internal::rml::private_worker::thread_routine (arg=0x2afbb42b7e00, $L9=<value optimized out>)
    at ../../src/tbb/private_server.cpp:231
#17 0x0000003811a0683d in start_thread () from /lib64/libpthread.so.0
#18 0x0000003810ad4fad in clone () from /lib64/libc.so.6

For a few of the seemingly normal threads:

(gdb) thread 2
[Switching to thread 2 (Thread 0x2afbdd8ea940 (LWP 11715))]#0  tbb::internal::generic_scheduler::reload_tasks (this=0x2afbdcc57e00,
    $5=<value optimized out>) at ../../src/tbb/scheduler.cpp:854
854    ../../src/tbb/scheduler.cpp: No such file or directory.
    in ../../src/tbb/scheduler.cpp
(gdb) where
#0  tbb::internal::generic_scheduler::reload_tasks (this=0x2afbdcc57e00, $5=<value optimized out>) at ../../src/tbb/scheduler.cpp:854
#1  0x00002afbb3e52726 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::receive_or_steal_task (this=0x2afbdcc57e00,
    completion_ref_count=@0x2afbb45a3a00, return_if_no_work=7, $I7=<value optimized out>, $I8=<value optimized out>, $I9=<value optimized out>)
    at ../../src/tbb/custom_scheduler.h:193
#2  0x00002afbb3e516e9 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x2afbdcc57e00,
    parent=..., child=0x7, $I4=<value optimized out>, $I5=<value optimized out>, $I6=<value optimized out>)
    at ../../src/tbb/custom_scheduler.h:547
#3  0x00002afbb3e4f189 in tbb::internal::arena::process (this=0x2afbdcc57e00, s=..., $t4=<value optimized out>, $t5=<value optimized out>)
    at ../../src/tbb/arena.cpp:98
#4  0x00002afbb3e4cf15 in tbb::internal::market::process (this=0x2afbdcc57e00, j=..., $0=<value optimized out>, $1=<value optimized out>)
    at ../../src/tbb/market.cpp:465
#5  0x00002afbb3e4a764 in tbb::internal::rml::private_worker::run (this=0x2afbdcc57e00, $L7=<value optimized out>)
    at ../../src/tbb/private_server.cpp:274
#6  0x00002afbb3e4a696 in tbb::internal::rml::private_worker::thread_routine (arg=0x2afbdcc57e00, $L9=<value optimized out>)
    at ../../src/tbb/private_server.cpp:231
#7  0x0000003811a0683d in start_thread () from /lib64/libpthread.so.0
#8  0x0000003810ad4fad in clone () from /lib64/libc.so.6

(gdb) thread 3
[Switching to thread 3 (Thread 0x2afbdd4e9940 (LWP 11714))]#0  tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::receive_or_steal_task (this=0x2afbb45a3480, completion_ref_count=@0x0, return_if_no_work=true, $I7=<value optimized out>, $I8=<value optimized out>,
    $I9=<value optimized out>) at ../../src/tbb/custom_scheduler.h:173
173    ../../src/tbb/custom_scheduler.h: No such file or directory.
    in ../../src/tbb/custom_scheduler.h
(gdb) where
#0  tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::receive_or_steal_task (this=0x2afbb45a3480, completion_ref_count=@0x0,
    return_if_no_work=true, $I7=<value optimized out>, $I8=<value optimized out>, $I9=<value optimized out>)
    at ../../src/tbb/custom_scheduler.h:173
#1  0x00002afbb3e516e9 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x2afbdcc6fe00,
    parent=..., child=0x1, $I4=<value optimized out>, $I5=<value optimized out>, $I6=<value optimized out>)
    at ../../src/tbb/custom_scheduler.h:547
#2  0x00002afbb3e4f189 in tbb::internal::arena::process (this=0x2afbb45a3480, s=..., $t4=<value optimized out>, $t5=<value optimized out>)
    at ../../src/tbb/arena.cpp:98
#3  0x00002afbb3e4cf15 in tbb::internal::market::process (this=0x2afbb45a3480, j=..., $0=<value optimized out>, $1=<value optimized out>)
    at ../../src/tbb/market.cpp:465
#4  0x00002afbb3e4a764 in tbb::internal::rml::private_worker::run (this=0x2afbb45a3480, $L7=<value optimized out>)
    at ../../src/tbb/private_server.cpp:274
#5  0x00002afbb3e4a696 in tbb::internal::rml::private_worker::thread_routine (arg=0x2afbb45a3480, $L9=<value optimized out>)
    at ../../src/tbb/private_server.cpp:231
#6  0x0000003811a0683d in start_thread () from /lib64/libpthread.so.0
#7  0x0000003810ad4fad in clone () from /lib64/libc.so.6

(gdb) thread 7
[Switching to thread 7 (Thread 0x2afbda0d9940 (LWP 11694))]#0  0x0000003810ad19a9 in syscall () from /lib64/libc.so.6
(gdb) where
#0  0x0000003810ad19a9 in syscall () from /lib64/libc.so.6
#1  0x00002afbb3e4a8e2 in tbb::internal::rml::private_worker::run (this=0x2afbb4557dac, $L7=<value optimized out>)
    at ../../src/tbb/private_server.cpp:281
#2  0x00002afbb3e4a696 in tbb::internal::rml::private_worker::thread_routine (arg=0x2afbb4557dac, $L9=<value optimized out>)
    at ../../src/tbb/private_server.cpp:231
#3  0x0000003811a0683d in start_thread () from /lib64/libpthread.so.0
#4  0x0000003810ad4fad in clone () from /lib64/libc.so.6

Any clues???

I have also seen, several times, some threads end up inside the bool arena::is_out_of_work() function...

Thank you very much,

RafSchietekat
Valued Contributor III

Raf Schietekat wrote:

"volatile intptr_t* tbb::internal::generic_scheduler::my_ref_top_priority" can reference either "volatile intptr_t tbb::internal::arena_base::my_top_priority" (master) or "intptr_t tbb::internal::market::my_global_top_priority" (worker)

I forgot one: "intptr_t tbb::task_group_context::my_priority". This may be concurrently updated, which by itself probably also requires this to be an atomic variable instead of an ordinary one.

Raf Schietekat wrote:

(Added) Does the problem occur, e.g., on both Linux and Windows, or only on Linux? If so, it would mean that the code relies on nonportable memory semantics associated with the use of "volatile". Otherwise it's something else, or that and something else.

Apparently Vladimir Polin had already reproduced it on Windows.

(Added) There is also a potential issue that set_priority() is not serialised, i.e., it might happen that, with A ancestor of B ancestor of C, A has new priority a, B has new priority b, but C also has new priority a (instead of b). This is from inspecting the source code (not confirmed); I don't know whether it is relevant here.

(Added) This is seriously going to take longer than five minutes to figure out...

Seunghwa_Kang
Beginner

This is still failing with 4.2 update 1. The reproducer code still produces the same assertion failure. I hope this gets fixed sometime soon.

Anton_M_Intel
Employee

Sorry, but not in the next release or two. It might be a case of a bad assertion only, i.e. the logic behind task priorities probably works correctly. And thanks for the reminder; it will eventually get into our focus.

Seunghwa_Kang
Beginner

Anton Malakhov (Intel) wrote:

Sorry, but not in the next release or two. It might be a case of a bad assertion only, i.e. the logic behind task priorities probably works correctly. And thanks for the reminder; it will eventually get into our focus.

The stall issue is still there, so it is unlikely that this is just a bad assertion. I have also observed the __TBB_ASSERT failure inside assert_market_valid() in market.h, and another assertion failure I don't remember; these happen significantly less frequently. It is pretty likely there is a data synchronization issue when priority is involved.

 

If you execute the code below with more than one MPI process, you will see the stall pretty frequently (approximately once a minute on my system), but I have not observed the stall once I set ENABLE_PRIORITY to 0.

Another problem is that CPU usage goes to the maximum (e.g., with 64 hardware threads and two MPI processes, the CPU utilization of each process reaches 3200% even though I set the number of threads to 8).

I hope this helps you fix the problem.

[cpp]
#include <assert.h>
#include <iostream>
#include <vector>

#include <mpi.h>

#include <tbb/tbb.h>

using namespace std;
using namespace tbb;

#define ENABLE_PRIORITY 1

const int NUM_THREADS = 8;
const int MAIN_LOOP_CNT = 10000;

int g_rank;

MPI_Comm g_mpiCommWorldDefault;
MPI_Comm g_mpiCommWorldLow;
MPI_Comm g_mpiCommWorldHigh;

void computeLowPrior( void );
void updateHighPrior( void );

int main( int argc, char* ap_args[] ) {
    task_group highPriorGroup;
    task_group lowPriorGroup;

    int level;
    int ret;

    /* initialize MPI */
    ret = MPI_Init_thread( NULL, NULL, MPI_THREAD_MULTIPLE, &level );
    assert( ret == MPI_SUCCESS );
    assert( level == MPI_THREAD_MULTIPLE );

    ret = MPI_Comm_dup( MPI_COMM_WORLD, &g_mpiCommWorldDefault );
    assert( ret == MPI_SUCCESS );
    ret = MPI_Comm_dup( MPI_COMM_WORLD, &g_mpiCommWorldLow );
    assert( ret == MPI_SUCCESS );
    ret = MPI_Comm_dup( MPI_COMM_WORLD, &g_mpiCommWorldHigh );
    assert( ret == MPI_SUCCESS );

    ret = MPI_Comm_rank( g_mpiCommWorldDefault, &g_rank );
    assert( ret == MPI_SUCCESS );

    /* initialize TBB */
    task_scheduler_init init( NUM_THREADS );

    /* main computation */
    for( int i = 0 ; i < MAIN_LOOP_CNT ; i++ ) {
#if ENABLE_PRIORITY
        task::self().set_group_priority( priority_normal );
#endif
        highPriorGroup.run( []{ updateHighPrior(); } );
        lowPriorGroup.run( []{ computeLowPrior(); } );

        lowPriorGroup.wait();
        highPriorGroup.wait();

        ret = MPI_Barrier( g_mpiCommWorldDefault );
        assert( ret == MPI_SUCCESS );

        if( g_rank == 0 ) {
            cout << "loop " << i << " finished." << endl;
        }
    }

    return 0;
}

void computeLowPrior( void ) {
#if ENABLE_PRIORITY
    task::self().set_group_priority( priority_low );
#endif

    int ret;

    /* compute */
    parallel_for( blocked_range<int> ( 0, 10000, 1 ), [&]( const blocked_range<int>& r ) {
        for( int ii = r.begin() ; ii < r.end() ; ii++ ) {
        }
    } );

    /* communicate */
    ret = MPI_Barrier( g_mpiCommWorldLow );
    assert( ret == MPI_SUCCESS );

    /* compute */
    parallel_for( blocked_range<int> ( 0, 10000, 1 ), [&]( const blocked_range<int>& r ) {
        for( int ii = r.begin() ; ii < r.end() ; ii++ ) {
        }
    } );

    return;
}

void updateHighPrior( void ) {
#if ENABLE_PRIORITY
    task::self().set_group_priority( priority_high );
#endif

    for( int i = 0 ; i < 1000 ; i++ ) {
        int ret;

        /* compute */
        parallel_for( blocked_range<int> ( 0, 10000, 1 ), [&]( const blocked_range<int>& r ) {
            for( int ii = r.begin() ; ii < r.end() ; ii++ ) {
            }
        } );

        /* communicate */
        ret = MPI_Barrier( g_mpiCommWorldHigh );
        assert( ret == MPI_SUCCESS );
    }

    return;
}
[/cpp]

 

Anton_M_Intel
Employee

Thanks, it really helps to understand the issue.

First of all, I'll admit there is something non-optimal in the task priorities implementation.

But let's set that aside for a second, because this reproducer clearly shows that task priority is not the culprit, but rather a catalyst for a problem in your code.

It seems nobody paid enough attention to the fact that your problem statement involves 4 MPI processes, and that you use MPI barriers from inside TBB tasks. It should be rule #1, written in capital letters in the Reference: make sure your TBB-based code does not involve any kind of barriers!!! (unless you are really sure what you are doing). Or, similar to what the Cilk folks say about TLS: friends, tell friends DO NOT USE BARRIERS WITH TBB.

Basically, you expect either mandatory parallelism or that two different TBB invocations behave the same way with respect to the sequence of code execution. Neither is the case, and these barriers detect that perfectly.

So, I think your usage model should be fixed for better conformance with TBB application design rules. Another issue with the code: unfortunately, the constructor of task_group provokes default initialization of TBB, so the task_scheduler_init concurrency was not respected (I've filed a bug report for this).
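
To illustrate that last point, here is a minimal sketch (not from the original post) of the ordering that would respect the requested concurrency: construct task_scheduler_init before the first task_group, so the groups do not trigger default initialization.

[cpp]
#include <tbb/tbb.h>

int main() {
    // Establish the desired concurrency before anything else touches the scheduler.
    tbb::task_scheduler_init init( 8 );

    // Constructing the task_groups after task_scheduler_init avoids the
    // default initialization mentioned above.
    tbb::task_group highPriorGroup;
    tbb::task_group lowPriorGroup;

    // ... run and wait on the groups as in the reproducer ...
    return 0;
}
[/cpp]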

As for what's wrong with TBB, I've modified your reproducer to remove MPI and set a barrier between "computeLowPrior" and "updateHighPrior". It halts because the former gets stuck waiting for the low-priority parallel_for, which is blocked because other threads hunt for higher-priority tasks. This does not seem to contradict what TBB promises, though it looks non-optimal.

Seunghwa_Kang
Beginner

Anton Malakhov (Intel) wrote:

Thanks, it really helps to understand the issue.

First of all, I'll admit there is something non-optimal in the task priorities implementation.

But let's set that aside for a second, because this reproducer clearly shows that task priority is not the culprit, but rather a catalyst for a problem in your code.

It seems nobody paid enough attention to the fact that your problem statement involves 4 MPI processes, and that you use MPI barriers from inside TBB tasks. It should be rule #1, written in capital letters in the Reference: make sure your TBB-based code does not involve any kind of barriers!!! (unless you are really sure what you are doing). Or, similar to what the Cilk folks say about TLS: friends, tell friends DO NOT USE BARRIERS WITH TBB.

Basically, you expect either mandatory parallelism or that two different TBB invocations behave the same way with respect to the sequence of code execution. Neither is the case, and these barriers detect that perfectly.

So, I think your usage model should be fixed for better conformance with TBB application design rules. Another issue with the code: unfortunately, the constructor of task_group provokes default initialization of TBB, so the task_scheduler_init concurrency was not respected (I've filed a bug report for this).

As for what's wrong with TBB, I've modified your reproducer to remove MPI and set a barrier between "computeLowPrior" and "updateHighPrior". It halts because the former gets stuck waiting for the low-priority parallel_for, which is blocked because other threads hunt for higher-priority tasks. This does not seem to contradict what TBB promises, though it looks non-optimal.

Thank you very much for the comments, and I have several follow-up questions,

 

"make sure your TBB-based code does not involve any kind of barriers!!! (unless you are really sure what you are doing)"

Is this a correctness requirement or a performance requirement? I understand the performance impact of invoking a barrier inside a TBB thread, and at least if I disable priority, if there are two or more TBB threads, there is no deadlock issue. TBB has its own correctness requirement in this regard?

" It halts because the former stuck waiting for low-priority parallel_for which is blocked because other threads hunt after higher priority tasks. It seems like not contradicting to what TBB promises though look non-optimal."

I don't fully understand this. One TBB worker thread may block on a barrier and become unavailable, but other TBB threads should do work if there is work available, yet that does not seem to be what happens. Are there additional caveats here? My understanding is that if one thread blocks, only that thread should become unavailable for work, rather than the whole thread pool, but it seems that is not the way TBB behaves.

Thank you very much,

 

RafSchietekat
Valued Contributor III

A drop in performance because of blocking can go all the way down to zero, so it's generally too risky to mess with that even if a mere drop in performance would still be acceptable to your application (mistakes like the wrong location of task_scheduler_init, changes in the runtime environment with a default task_scheduler_init, composition with other code, maintenance issues, ...).

I'm surprised too that the scheduler literally "postpones execution of tasks with lower priority until all higher priority task are executed" (and I think that should be "higher-priority tasks"). What's the reason for that?

But I thought there already was a reproducer without MPI barriers, in #12?

 

Vladimir_P_1234567890

Hello,

You can use the synchronization API.

The shared queue does not guarantee precise first-in first-out behavior. Tasks that use a 3rd-party barrier might be executed by the same thread in a different order, which will cause a deadlock. Look at the Scheduling Algorithm chapter of the reference manual.

--Vladimir

Anton_M_Intel
Employee

Seunghwa Kang wrote:

Quote:

"make sure your TBB-based code does not involve any kind of barriers!!! (unless you are really sure what you are doing)"

Is this a correctness requirement or a performance requirement? I understand the performance impact of invoking a barrier inside a TBB thread, and at least if I disable priority, if there are two or more TBB threads, there is no deadlock issue. TBB has its own correctness requirement in this regard?

It is a correctness requirement; please see the caution in the Task Scheduler section of the Reference manual (http://software.intel.com/en-us/node/468188):

CAUTION

There is no guarantee that potentially parallel tasks actually execute in parallel, because the scheduler adjusts actual parallelism to fit available worker threads. For example, given a single worker thread, the scheduler creates no actual parallelism. For example, it is generally unsafe to use tasks in a producer consumer relationship, because there is no guarantee that the consumer runs at all while the producer is running.

So, any relations between TBB tasks must be expressed to the scheduler in TBB terms.

Seunghwa Kang wrote:

Quote:

" It halts because the former gets stuck waiting for the low-priority parallel_for, which is blocked because other threads hunt for higher-priority tasks. This does not seem to contradict what TBB promises, though it looks non-optimal."

I don't fully understand this. One TBB worker thread may block on a barrier and become unavailable, but other TBB threads should do work if there is work available, yet that does not seem to be what happens. Are there additional caveats here? My understanding is that if one thread blocks, only that thread should become unavailable for work, rather than the whole thread pool, but it seems that is not the way TBB behaves.

The current task priority implementation is done in such a way that lower-priority work will not be executed until all higher-priority work is finished. TBB tasks are not intended for long blocking waits.
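
To make the earlier point about expressing relations "in TBB terms" concrete, here is a hedged sketch (not from the original thread) of one way to restructure this kind of code: the MPI barrier stays on the application thread, and TBB tasks only ever do CPU-bound work, so no TBB worker blocks inside a task.

[cpp]
#include <mpi.h>
#include <tbb/tbb.h>

// One compute phase: purely CPU-bound work, safe to hand to TBB.
void computePhase() {
    tbb::parallel_for( tbb::blocked_range<int>( 0, 10000 ),
                       []( const tbb::blocked_range<int>& r ) {
        for( int i = r.begin() ; i < r.end() ; i++ ) {
            /* work */
        }
    } );
}

// The barrier is issued between parallel phases, on the application thread.
void step( MPI_Comm comm ) {
    computePhase();
    MPI_Barrier( comm );
    computePhase();
}
[/cpp]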

Anton_M_Intel
Employee

More specifically, it looks like the thread executing high-priority parallel_for steals a task of low-priority parallel_for and puts it into local 'offloaded' tasks list thus preventing other threads from stealing/executing it. For the given use case it could be worked around by, e.g., copying the last '#if __TBB_TASK_PRIORITY' section from the arena::process() method of arena.cpp into the end of the local_wait_for_all() method of custom_scheduler.h (with corresponding inter-class corrections, like removing 's.' from some members and adding 'my_arena->' to others). But it does not solve the root-cause of the issue and probably not likely to happen in production release.

Seunghwa_Kang
Beginner

Vladimir Polin (Intel) wrote:

Hello,

You can use the synchronization API.

The shared queue does not guarantee precise first-in first-out behavior. Tasks that use a 3rd-party barrier might be executed by the same thread in a different order, which will cause a deadlock. Look at the Scheduling Algorithm chapter of the reference manual.

--Vladimir

Thank you very much for the answer but I don't understand this.

The MPI_Barrier is a barrier between two (or more) different MPI processes, and I am not sure how TBB synchronization functions can achieve synchronization across two different MPI processes.

And I am not sure what the difference is between invoking MPI_Barrier and inserting some sleep call or compute-intensive loop. As far as I understand, TBB does not know whether I call MPI_Barrier or not; the MPI_Barrier invocation will just cause the thread to block for some time. And I don't understand "Tasks that use a 3rd-party barrier might be executed by the same thread in a different order, which will cause a deadlock." In my understanding, nothing gets queued by invoking MPI_Barrier.

 

Seunghwa_Kang
Beginner

Anton Malakhov (Intel) wrote:

More specifically, it looks like the thread executing high-priority parallel_for steals a task of low-priority parallel_for and puts it into local 'offloaded' tasks list thus preventing other threads from stealing/executing it. For the given use case it could be worked around by, e.g., copying the last '#if __TBB_TASK_PRIORITY' section from the arena::process() method of arena.cpp into the end of the local_wait_for_all() method of custom_scheduler.h (with corresponding inter-class corrections, like removing 's.' from some members and adding 'my_arena->' to others). But it does not solve the root-cause of the issue and probably not likely to happen in production release.

This aligns with my experience and the debugger output.

"But it does not solve the root-cause of the issue."

I assume the root cause is violating "There is no guarantee that potentially parallel tasks actually execute in parallel, because the scheduler adjusts actual parallelism to fit available worker threads. For example, given a single worker thread, the scheduler creates no actual parallelism. For example, it is generally unsafe to use tasks in a producer consumer relationship, because there is no guarantee that the consumer runs at all while the producer is running."

But I am curious about the rationale behind this. Clearly if there is only one worker thread, there can be a deadlock, and I am explicitly taking care of this. If the number of worker threads is smaller than the number of potentially blocking threads, I don't run those in parallel. And for most current and future microprocessors (e.g. Xeon Phi) with many cores, I am not sure this is something that really needs to be enforced. And as I remember, there is a TBB example for using priority saying, a high priority task waits for user inputs, and a low priority task performs background computation, but this also violates the requirement.

and from the same page, "they should generally avoid making calls that might block for long periods, because meanwhile that thread is precluded from servicing other tasks."

and this is something I am fully aware of.

To work around, I can spawn pthreads instead of task groups, and invoke parallel_for inside the spawned pthreads (so to block only within the spawned pthread, not inside TBB tasks), but not sure I really should add this additional step. And will this fix the issue you described ("More specifically, it looks like the thread executing high-priority parallel_for steals a task of low-priority parallel_for and puts it into local 'offloaded' tasks list thus preventing other threads from stealing/executing it.")?

"and probably not likely to happen in production release."

and what is the production release? Code compiled without the debug option? The program behavior does not change much whether I use the debug version or the release version, in my experience. Or are you talking about the commercial version of TBB? As far as I am aware, there is no difference between the free and commercial versions of TBB.

Thank you very much,

RafSchietekat
Valued Contributor III

Raf Schietekat wrote:

I'm surprised too that the scheduler literally "postpones execution of tasks with lower priority until all higher priority task are executed" (and I think that should be "higher-priority tasks"). What's the reason for that?

Anton Malakhov (Intel) wrote:

But it does not solve the root-cause of the issue and probably not likely to happen in production release.

Still wondering? Why shouldn't this simply be the last scheduling option instead of just wasting time? If priorities are set at the task_group_context level, why should all the workers in the current arena suddenly be blinded to lower-priority ones even after verifying that there is nothing else to do? Would it cause some form of priority inversion perhaps?

Raf Schietekat wrote:

But I thought there already was a reproducer without MPI barriers, in #12?

Still wondering? I see that Vladimir acknowledged it in #13, Michel chimed in in #14, so that's when I started my (idle?) speculation, and Anton threw cold water on any hope for a quick solution in #24. Is it a different issue or what?

Seunghwa Kang wrote:

But I am curious about the rationale behind this. Clearly if there is only one worker thread, there can be a deadlock, and I am explicitly taking care of this. If the number of worker threads is smaller than the number of potentially blocking threads, I don't run those in parallel. And for most current and future microprocessors (e.g. Xeon Phi) with many cores, I am not sure this is something that really needs to be enforced. And as I remember, there is a TBB example for using priority saying, a high priority task waits for user inputs, and a low priority task performs background computation, but this also violates the requirement.

How are you "explicitly taking care of this"? Maybe just not in #25?

I don't think that not allowing a program to require concurrency is just a matter of availability of resources, but rather a valuable principle for debugging and probably compositing. (My only problem is with the wasted resources by ignoring lower-priority work.)

What example is that? Does it imply concurrent execution of background computation?

Seunghwa Kang wrote:

To work around, I can spawn pthreads instead of task groups, and invoke parallel_for inside the spawned pthreads (so to block only within the spawned pthread, not inside TBB tasks), but not sure I really should add this additional step. And will this fix the issue you described ("More specifically, it looks like the thread executing high-priority parallel_for steals a task of low-priority parallel_for and puts it into local 'offloaded' tasks list thus preventing other threads from stealing/executing it.")?

Each application thread gets its own arena, and there is no catastrophic interference between arenas because only fully unemployed worker threads migrate between them (although becoming unemployed might take a while). So there's one example where available parallelism might well be lower than physical parallelism, but that shouldn't be a problem if the program is according to TBB design rules. It may or may not be a problem for you that this also means that priorities don't work across arenas except to help decide where unemployed workers should migrate to, if my understanding is still correct there ("Though masters with lower priority tasks may be left without workers, the master threads are never stalled themselves.").

Seunghwa Kang wrote:

and what is the production release? Code compiled without the debug option? The program behavior does not change much whether I use the debug version or the release version, in my experience. Or are you talking about the commercial version of TBB? As far as I am aware, there is no difference between the free and commercial versions of TBB.

I think this was about private experiment vs. downloadable release (even non-stable), not debug vs. release.

Anton_M_Intel
Employee

Seunghwa Kang wrote:

Quote:

"But it does not solve the root-cause of the issue."

I assume the root cause is violating "There is no guarantee that potentially parallel tasks actually execute in parallel..."

But I am curious about the rationale behind this.

Raf is right. The rationale for such optional parallelism is composability: a worker thread can join a parallel execution at any moment, or leave it after completing a task (at the outer level).

Seunghwa Kang wrote:

To work around, I can spawn pthreads instead of task groups, and invoke parallel_for inside the spawned pthreads (so to block only within the spawned pthread, not inside TBB tasks), but not sure I really should add this additional step. And will this fix the issue you described ("More specifically...

It will work, since the tasks of the different parallel_for's will be isolated and no stealing will be possible between them. Please also consider using the tbb::task_arena [CPF] feature, as it allows you to create the same isolated arenas without creating an additional thread.
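
For illustration, a hedged sketch (not from the original thread) of the thread-based variant of that isolation, assuming the updateHighPrior() and computeLowPrior() functions from the reproducer above: each group runs on its own plain thread, so each becomes a separate master with its own arena, and a barrier blocking one group cannot strand the other group's parallel_for tasks.

[cpp]
#include <pthread.h>
#include <tbb/tbb.h>

void computeLowPrior( void );   // from the reproducer above
void updateHighPrior( void );   // from the reproducer above

// Each thread becomes a separate TBB master with its own arena.
void* highThreadFunc( void* ) { updateHighPrior(); return NULL; }
void* lowThreadFunc( void* )  { computeLowPrior(); return NULL; }

void runBothGroups() {
    pthread_t high, low;
    pthread_create( &high, NULL, highThreadFunc, NULL );
    pthread_create( &low,  NULL, lowThreadFunc,  NULL );
    pthread_join( low,  NULL );
    pthread_join( high, NULL );
}
[/cpp]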

Seunghwa Kang wrote:

"and probably not likely to happen in production release."

and what is the production release? 

Raf is right; I suggested a private workaround for the TBB library which is not likely to appear in an official version (until we are 100% sure how to do it).

Anton_M_Intel
Employee

Raf Schietekat wrote:

Still wondering? Why shouldn't this simply be the last scheduling option instead of just wasting time? If priorities are set at the task_group_context level, why should all the workers in the current arena suddenly be blinded to lower-priority ones even after verifying that there is nothing else to do? Would it cause some form of priority inversion perhaps?

We admit this inefficiency, as I said. It was done with medium-to-short task sizes in mind. my_offloaded_tasks is a private member of the dispatcher and cannot be made visible to others until the owner makes it visible itself. However, taking lower-priority tasks has to be limited somehow by high-priority tasks anyway; otherwise, priorities will not work as intended. I'm thinking about an implementation of the is_out_of_work() method that would be able to grab offloaded tasks if they are stuck in workers busy with other work.

Raf Schietekat wrote:

Still wondering? I see that Vladimir acknowledged it in #13, Michel chimed in in #14, so that's when I started my (idle?) speculation, and Anton threw cold water on any hope for a quick solution in #24. Is it a different issue or what?

It turned out to be two separate issues. The actual problem was the deadlock; the assertion failure was suspected as the culprit, but they are not connected.

Thanks for answering the rest of the questions.

Seunghwa_Kang
Beginner

How are you "explicitly taking care of this"? Maybe just not in #25?

I don't think that not allowing a program to require concurrency is just a matter of availability of resources, but rather a valuable principle for debugging and probably compositing. (My only problem is with the wasted resources by ignoring lower-priority work.)

What example is that? Does it imply concurrent execution of background computation?

[\quote]

1. How are you "explicitly taking care of this"? Maybe just not in #25?

if( #_threads < 2 ) {
    updateHighPrior();
    computeLowPrior();
}
else {
    highPriorGroup.run( []{ updateHighPrior(); } );
    lowPriorGroup.run( []{ computeLowPrior(); } );
    lowPriorGroup.wait();
    highPriorGroup.wait();
}

2. Composability clearly matters, but I think composability in a main program and composability in a library need to be considered separately. For library code used by arbitrary users, avoiding any potential composability issue is highly desirable, but for a program with more limited use cases, forcing blocking to happen only in single-threaded mode can cause more problems. And for TBB to be used in a broader context, blocking should not cause unexpected side effects beyond making the blocking thread unavailable (e.g. worker threads idling and doing no work even when there are available worker threads and available work). And it seems the suggested update will fix this issue.

3. To explain a bit about the application: this is a simulation program, and each MPI process is responsible for sub-partitions of the entire simulation domain. The program runs on a cluster with multiple nodes, and there are data exchanges at the sub-partition boundaries. The barriers in the sample code stand in for data exchanges in the real program. Forcing data exchanges to occur in only one thread can limit algorithm design in many high-performance computing applications.

RafSchietekat
Valued Contributor III

First of all a correction: I shouldn't have used "compositing" instead of "composing".

I don't fully understand the previous posting, or even the need for priorities in this program, but even if that lower-priority work were still available for second-hand stealing (or whatever that would be called), it's still stretching the main purpose of TBB (which is efficient execution of CPU-bound work while transparently adapting to available parallelism) if you're doing things that require a certain level of concurrency. Sometimes it works, sometimes it becomes... complicated.

It does not seem to be in the cards for TBB to also become a self-supported reactive-programming toolkit. If you want to use a synchronous API for MPI, you should probably do that with plain application threads anyway. Otherwise it depends on your needs whether you should combine TBB with something else or use another toolkit instead, I think.

(Added 2013-12-02) Note that the "something else" above could be either rolling your own solution with an application thread to handle all the asynchronous stuff or using another toolkit for that, together with TBB. You can then hook into TBB by using a continuation and a dummy child to simulate the blocking without actually blocking (spawn the child to execute the continuation). I know it's tempting to second-guess the prohibition against blocking by using platform-specific assumptions or cheating with more threads specified in task_scheduler_init, and maybe composability is not a good-enough reason against that in non-library code, and you may get away with it some or even a lot of the time, but what you also get is new and exciting opportunities to get into trouble.
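
To make the continuation idea a bit more concrete, here is a hedged sketch (not from the original thread) using the old tbb::task API; register_async_completion() is a hypothetical hook standing in for whatever asynchronous machinery delivers the completion event. The task that would have blocked instead creates a continuation with one outstanding dummy child and returns, freeing the worker; when the asynchronous event fires, the dummy child is enqueued and its completion releases the continuation.

[cpp]
#include <tbb/tbb.h>

// Runs once the "blocking" operation has completed.
class Continuation : public tbb::task {
public:
    tbb::task* execute() {
        // ... post-completion work goes here ...
        return NULL;
    }
};

// Hypothetical hook: a real program would store the task and enqueue it from
// the asynchronous completion callback; enqueueing immediately is a stand-in.
void register_async_completion( tbb::task* dummy ) {
    tbb::task::enqueue( *dummy );
}

class AsyncStep : public tbb::task {
public:
    tbb::task* execute() {
        Continuation& c = *new( allocate_continuation() ) Continuation;
        c.set_ref_count( 1 );                                 // one dummy child outstanding
        tbb::empty_task& dummy = *new( c.allocate_child() ) tbb::empty_task;
        register_async_completion( &dummy );                  // completion runs the dummy,
                                                              // which in turn releases c
        return NULL;                                          // the worker is free meanwhile
    }
};
[/cpp]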
