Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Inner parallelization calls outer

onnog
Beginner
1,686 Views
According to the documentation, the task scheduler basically goes depth-first in a thread's own task pool but steals from other threads' pools near the root. To my understanding, this implies that when you have two nested parallel loops, a thread that arrives at the inner loop should never put aside the inner loop, or the continuation after it, in order to start a new task from the outer loop.

However, this is exactly what we observed, after several days of bug-chasing in our program, in some rare and non-reproducible cases.

I was able to write a minimal example that demonstrates this behaviour more reliably. On the two computers where we tested it, the program below eventually prints "inner parallelization called outer". When you set a breakpoint on that line and analyze the call stack, you see that the thread is indeed in a task from the outer loop, had arrived at the inner loop, and then started a task from the outer loop.

We suspect that this behaviour might be a bug in TBB. We are using tbb40_20120613 on Windows 7. I am running 8 threads (4 physical cores with hyperthreading).

[cpp]
#include <iostream>
#include "tbb/parallel_for.h"

volatile bool first_task;
__declspec(thread) bool thread_is_running_inner_parallelization = false;
int g_i;

class F {
public:
    void operator() (int p_task_id) const;
};

void F::operator() (int p_task_id) const
{
    // first task has inner parallelization
    if (first_task) {
        first_task = false;
        int sum = 0;
        thread_is_running_inner_parallelization = true;
        for (int i=0; i<10000; ++i) {
            g_i = i;
            tbb::parallel_for (0, 2, [&](int k) {
                // we came from another thread to help with this task
                // (which is OK but rarely happens, and the bug always occurs shortly after this)
                if (!thread_is_running_inner_parallelization)
                    std::cout << "stolen " << i << "/" << k << " ";
                // the task must do some work, otherwise the problem will not occur
                for (int j=0; j<100000; ++j)
                    sum += (i*j)%7;
            });
        }
        thread_is_running_inner_parallelization = false;
        // print the result of the computation to avoid the work being optimized away
        // and to see when the first task is finished
        std::cout << sum;
    }
    // other tasks basically do nothing
    else {
        // print heartbeat
        if (p_task_id%1000000==0)
            std::cerr << ".";
        // here is the bug
        if (thread_is_running_inner_parallelization)
            std::cout << "inner parallelization called outer " << g_i << " " << std::endl;
    }
}

int main ()
{
    // set global variable
    first_task = true;
    // create very many tasks in the outer parallelization,
    // to make sure we won't run out of them early
    tbb::parallel_for (0, 2000000000, F());
}
[/cpp]

(EDIT: Syntax highlighting improved.)
48 Replies
Anton_M_Intel
Employee
211 Views
"When a task comes to a resource which is not yet initialized, it should be able to terminate itself (by recycling itself as continuationor creating a new continuation - doesn't matter) to allow the scheduler to transfer the thread to high-priority arena."
But that task is not everything the thread is currently doing (exceptions like root tasks notwithstanding), so ad-hoc continuations at the arena boundary are not sufficient.
[AM] Do you mean when nested parallelism (i.e. wait_for_all inside a task) is used? Yes, priorities will not help there either. But if a task finishes at the outermost level (i.e. no nesting), the worker can offload all its remaining tasks and move to the high-priority arena.
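For reference, the idiom I mean looks roughly like this with the tbb::task API (a minimal sketch only; InitResourceTask, resource_ready(), initialize_resource() and use_resource() are made-up placeholders):

[cpp]
#include "tbb/task.h"

// Placeholders assumed by this sketch:
bool resource_ready();
void initialize_resource();
void use_resource();

// Hypothetical task that performs the initialization.
class InitResourceTask : public tbb::task {
public:
    tbb::task* execute() {
        initialize_resource();
        return NULL;
    }
};

class UseResourceTask : public tbb::task {
    bool m_resumed;
public:
    UseResourceTask() : m_resumed(false) {}
    tbb::task* execute() {
        if (!m_resumed && !resource_ready()) {
            m_resumed = true;
            // Recycle this task as the continuation of the initialization:
            // execute() returns immediately, the worker's stack unwinds, and
            // this task runs again once the child completes - meanwhile the
            // thread is free to work elsewhere (e.g. a high-priority arena).
            recycle_as_safe_continuation();
            tbb::task& init = *new( allocate_child() ) InitResourceTask;
            set_ref_count(2); // one child + one guard for the safe continuation
            spawn(init);
            return NULL;
        }
        use_resource(); // second entry (or the resource was already initialized)
        return NULL;
    }
};
[/cpp]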


"It is exactly what this new CPF does.. and it was not an easy work (and still should be improved)."
I'm sure the initial work may have been the hardest part; now the question is how hard it would be to bolt on this new idea: a thread arrives at an arena, changes allegiance until the arena's work is finished, then resumes where it was "interrupted", all typical stack-based operation.
[AM] That is exactly the 'multiple masters' problem.. an explicit join is different from how workers behave; it is like a master. But there is only one slot for a master (and currently it is distinguished from workers by its slot number). Slot-selection code would have to be added to the task_arena implementation, and the scheduler would have to be refactored to extend its assumptions and differentiate user threads from TBB pool threads in some other way.. and that is just for a start..

Of course it would be best to first meditate a bit on possible generalisations: can there be multiple initialisations with DAG-like interdependencies, are there other applications where a thread is free to leave when only its own task pool is emptied out instead of all the task pools in the arena, ... But not for too long!
[AM] There is an idea for how to avoid the deadlock on nested parallelism without dealing with separate arenas at all.. similar to how PPL does it (I guess with great help from ConcRT). But I'm not sure when we can get to implementing it.

RafSchietekat
Valued Contributor III
211 Views
"Do you mean if nested parallelism (i.e. wait_for_all inside a task) is used? Yes, priorities will not help as well. But if a task finishes on the outermost level (i.e. no nesting), the worker can offload all its remaining tasks and move to high-priority arena."
The remaining tasks on the stack will be ancestors of that continuation, or likely constrained in some other way. The only way to expect that worker to be able to offload its remaining tasks is if there were no remaining tasks to offload in the first place, i.e., no blocking calls ever and everything written in terms of continuations. That is not a realistic assumption.
onnog
Beginner
211 Views
[Anton] "When a task comes to a resource which is not yet initialized, it should be able to terminate itself (by recycling itself as continuationor creating a new continuation - doesn't matter) to allow the scheduler to transfer the thread to high-priority arena."
[Raf] But that task is not everything the thread is currently doing (exceptions like root tasks notwithstanding), so ad-hoc continuations at the arena boundary are not sufficient.


Also, there is data belonging to the task itself on the call stack. This would get lost if the task terminated on finding an uninitialized resource. At the very least, it would mean extremely heavy refactoring to write the task in such a way that it can terminate without data loss when it finds an uninitialized resource. I find that highly undesirable.

If I understood you correctly, a thread that is waiting on a task_group (e.g. for the resource) is free to take other tasks from its current arena, but not from other arenas. Right? So, would it be possible to (manually?) migrate the thread that waits for the resource to the initialization arena and have it take tasks from there? It could leave the task that waits for the initialization on the call stack. To me, that seems to solve the problem.
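Roughly, the usage I imagine looks like this (just a sketch against the task_arena CPF; initialize_resource() is a placeholder, and the "join and help" semantics are exactly the part that does not exist yet):

[cpp]
#define TBB_PREVIEW_TASK_ARENA 1
#include "tbb/task_arena.h"

tbb::task_arena init_arena;  // arena dedicated to the initialization
void initialize_resource();  // placeholder for the actual initialization work

void wait_for_resource()
{
    // Desired semantics: the calling thread joins init_arena, helps with the
    // initialization tasks there (instead of stealing unrelated outer tasks),
    // and only returns - resuming the task frames still on its stack - once
    // the work in init_arena is done.
    init_arena.execute([]{ initialize_resource(); });
}
[/cpp]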

EDIT: When the initialization arena is closed, the thread should fall back to the outer arena.
EDIT: Crap at end of post removed.

onnog
Beginner
211 Views
[Anton] "Hm.. I found no direct warning regarding parallel work under a lock in the Reference; we should describe it more explicitly. Thanks!"

What I had in mind when I said you did warn against holding locks too long:
TBB lazy class
You are right, this does not warn against doing parallel work under the lock. It's based on more general considerations.

Note that the improved TBB lazy class from that document will not deadlock when there is parallel work in the initialization, but it might restart the initialization several times (which is almost as bad).
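For illustration, a compare-and-swap based wrapper in the same spirit (my own minimal sketch, not the class from the document) avoids holding a lock during construction, but every thread that sees an empty slot redoes the initialization:

[cpp]
#include "tbb/atomic.h"

template<typename T>
class lazy {
    tbb::atomic<T*> m_value;
public:
    lazy() { m_value = NULL; }
    template<typename Create>
    T& get(Create create) {
        T* v = m_value;
        if (!v) {
            // Several threads may reach this point and construct concurrently.
            T* fresh = new T(create());
            v = m_value.compare_and_swap(fresh, NULL); // publish if still empty
            if (v == NULL)
                v = fresh;    // we won the race
            else
                delete fresh; // we lost: the whole initialization ran for nothing
        }
        return *v;
    }
};
[/cpp]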
onnog
Beginner
211 Views
[AM] Think of it as if an arena had a fixed number of task pools. Each master thread (the main thread or a std::thread) that schedules work or calls task_scheduler_init creates its own implicit arena. So, an (implicit) arena is tied to a thread, and a task is tied to the arena where it was spawned. Worker threads can migrate between arenas, but only when they have completed all the work from their local task pool.

So when a worker thread goes to an arena, it takes over one of the arena's task pools, thereby making it its local task pool? Until now my idea was that every thread has a task pool and that every task pool belongs to a thread. That is just a simplification describing the basic case where there is one arena, and this arena by default has the same number of task pools as there are threads, right?

Does the condition "when they have completed all the work from their local task pool" imply that there is no unfinished task being processed (e.g. waiting for stolen subtasks to finish)?

What I need would be:
1. if task in local pool, take it
2. if no task in local pool, but currently processing unfinished task, steal from other pools in current arena only. If no tasks left in current arena, spin/idle.
3. if no task in local pool and not processing anything, either steal from other pools in current arena or migrate to another arena.

Is that how TBB works?

4. Have a way to manually migrate a thread that is waiting for the initialization to the initialization arena, overriding rule 2.
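Or, as a compilable sketch of my mental model of rules 1-4 (all the types and helpers are made-up stand-ins, not TBB internals):

[cpp]
struct Task {};

struct Worker {
    Task* pop_local()                       { return nullptr; } // take from own task pool
    bool  processing_unfinished_task()      { return false; }   // e.g. waiting for stolen subtasks
    Task* steal_from_current_arena()        { return nullptr; } // other pools, same arena
    bool  must_help_initialization()        { return false; }   // trigger for rule 4
    Task* steal_from_initialization_arena() { return nullptr; } // the manual migration of rule 4
    Task* migrate_to_another_arena()        { return nullptr; } // outermost level only
};

Task* next_task(Worker& w) {
    if (Task* t = w.pop_local())                   // rule 1
        return t;
    if (w.must_help_initialization())              // rule 4: overrides rule 2
        return w.steal_from_initialization_arena();
    if (w.processing_unfinished_task())            // rule 2: must stay in this arena,
        return w.steal_from_current_arena();       // spin/idle if this returns nullptr
    if (Task* t = w.steal_from_current_arena())    // rule 3: steal here first...
        return t;
    return w.migrate_to_another_arena();           // ...or move to another arena
}
[/cpp]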



Anton_M_Intel
Employee
211 Views
Quoting onnog
If I understood you correctly, a thread that is waiting on a task_group (e.g. for the resource) is free to take other tasks from its current arena, but not from other arenas. Right?
[AM] Right
So, would it be possible to (manually?) migrate the thread that waits for the resource to the initialization arena and have it take tasks from there? It could leave the task that waits for the initialization on the call stack. To me, that seems to solve the problem.
[AM] It is not possible yet, but it could become possible in the future with multiple masters joining the same explicit task_arena object.
Anton_M_Intel
Employee
211 Views
Quoting onnog
So when a worker thread goes to an arena, it takes over one of the arena's task pools, thereby making it its local task pool? Until now my idea was that every thread has a task pool and that every task pool belongs to a thread.
[AM] That was the implementation prior to 4.1.. but it doesn't matter who owns a task pool (the thread or the slot), because workers always migrate with an empty task pool.
That is just a simplification describing the basic case where there is one arena, and this arena by default has the same number of task pools as there are threads, right?
[AM] Yes, it's a simplification.

Does the condition "when it complete
all the work from its local task-pool" imply that there is no unfinished task being processed (e.g. waiting for stolen subtasks to finish)?
[AM] Does 'unfinished task' mean nested parallelism, e.g. calls to wait_for_all() or an algorithm invoked from inside a task's execution? In that case, a thread cannot migrate anyway until the task finishes the wait and exits.


What I need would be:
1. if task in local pool, take it
2. if no task in local pool, but currently processing unfinished task, steal from other pools in current arena only. If no tasks left in current arena, spin/idle.
3. if no task in local pool and not processing anything, either steal from other pools in current arena or migrate to another arena.

Is that how TBB works?
[AM] yes


4. Have a way to manually migrate a thread that is waiting for the initialization to the initialization arena, overriding rule 2.
[AM] OK, we will consider implementing multiple masters for task_arena.