Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Inner parallelization calls outer

onnog
Beginner
2,217 Views
According to the documentation, the task scheduler should basically go DFS in a thread's own task pool but steal from other tasks near the root. To my understanding, this implies that when you have two nested parallel loops, a thread that arrives at the inner loop should never decide to postpone processing of the inner loop or continuing after the inner loop but decide to start a new task from the outer loop.

However this is what we observed after several days of bug-chasing in our program in some rare and non-reproducable cases.

I was able to write a minimal example that demonstrates this behaviour more reliably. On the two computers where we tested the example, the program below eventually prints "inner parallelization called outer". When you set a breakpoint on that line and analyze the callstack, you see that in deed the thread is in a task from the outer loop, had arrived at the inner loop and started a task from the outer loop.

We suspect that this behaviour might be a bug in TBB. We are using tbb40_20120613 on Windows 7. I am running 8 threads (4 phyisical cores with hyperthreading).[cpp]volatile bool first_task; __declspec(thread) bool thread_is_running_inner_parallelization = false; int g_i; class F { public: void operator() (int p_task_id) const; }; void F::operator() (int p_task_id) const { // first task has inner parallelization if (first_task) { first_task = false; int sum = 0; thread_is_running_inner_parallelization = true; for (int i=0; i<10000; ++i) { g_i = i; tbb::parallel_for (0, 2, [&](int k) { // we came from another thread to help with this task // (which is OK but rarely happens and the bug always occurs shortly after this) if (!thread_is_running_inner_parallelization) std::cout << "stolen " << i << "/" << k << " "; // task must do some work, otherwise the problem will not occur for (int j=0; j<100000; ++j) sum += (i*j)%7; }); } thread_is_running_inner_parallelization = false; // print the result of the computation to avoid optimizing the work away // and to see when first task is finished std::cout << sum ; } // other tasks basically do nothing else { // print heartbeat if (p_task_id%1000000==0) std::cerr << "."; // here is the bug if (thread_is_running_inner_parallelization) std::cout << "inner parallelization called outer " << g_i << " " << std::endl; } } int main () { // set global variable first_task = true; // create very many tasks on the outer parallelization, // to make sure we won't run out of them early. tbb::parallel_for (0, 2000000000, F()); } [/cpp] (EDIT: Syntax highlighting improved.)
0 Kudos
48 Replies
RafSchietekat
Valued Contributor III
633 Views
Nice to already have a reproducer.

Please do make the resource (single s) pointer an atomic. BTW, what does d_ mean?
0 Kudos
onnog
Beginner
633 Views

Please do make the resource (single s) pointer an atomic

Done (in original post)

BTW, what does d_ mean?

We introduced prefixes for variables more than a decade ago and stick to the letters we chose those days. As m_ was already taken by mutable, we use d_ for member variables, following some outdated book on software design.
0 Kudos
RafSchietekat
Valued Contributor III
633 Views
"Done (in original post)"
Perhaps it's mostly pedantic (especially in this simple situation), but the intent of the atomic may be made clearer (through explicit memory semantics), and some machine cycles saved, by using the following code instead (not tested):

[cpp]int& Lazy::operator* () { if (int* l_resource = d_resource.load<:ACQUIRE>()) return *l_resource; mutex_type::mutex::scoped_lock lock(d_mutex); if (int* l_resource = d_resource.load<:RELAXED>()) return *l_resource; while (g_timer.user_time()<1); smp::parallel_invoke (Task11(), Task12()); int* l_resource = new int (42); d_resource.store<:RELEASE>(l_resource); return *l_resource; } [/cpp]
0 Kudos
RafSchietekat
Valued Contributor III
633 Views
#19 "Well, now would probably be a good moment for somebody from the TBB team to let us know what they think about this, whether it was already part of the plan or whether it will now be added?"
Are there any thoughts about this yet?
0 Kudos
onnog
Beginner
633 Views
Are there any thoughts about this yet?

As far as we are concerned: We have replaced the lazy initialization by an early initialization. This is quite nasty becaue the logic if the resource is needed is quite complicated. We now have to keep the conditions in various modules for resource usage synchronous with the conditions for resource initialization. Moreover, we decided to simplify the logic when doing the initialization: We check necessary conditions for initializations, which means that we occasionally do initialization though it is not needed.

Because of this problem, we put replacement of TBB by Microsoft PPL on our mid-term todo-list. We have already checked that PPL does not have the same problem. However, replacement would affect the whole project and would require some evaluation of PPL beforehand.
0 Kudos
Anton_M_Intel
Employee
633 Views
Ok, after a few attempts to understand what's going on in this forum thread, I think I can answer something relevant. AFA I Understand, you try to call TBB scheduling (parallel_invoke) holding a lock which can be requested on the outer level as well. Didn't we warn against such a design anti-pattern?
So, the bad news is that we don't directlysupport it yet. Good news is that we understand that somethimesfor complex modular programs there is no other way but to support it. Raf is right saying that new CPF feature task_arena coming in the next TBB release will help to isolate the work under the lock so *no* external tasks can be stolen.
Actually, we already have arenas which separated between and tied to user threads as described in the mentioned blog. Thus, we can probably claim we already support this anti-pattern via ugly work-araund :)involving additional user thread
0 Kudos
RafSchietekat
Valued Contributor III
633 Views
"Thus, we can probably claim we already support this anti-pattern via ugly work-araund :)involving additional user thread"
Any prospects for task_arena to include the necessary details so that the user doesn't have to deal with thread pools but can instead reuse all the threads that would be blocked anyway, etc.? Note that this means that the arena should dynamically increase its cardinality as workers arrive at its boundary. But I confess that I should probably have insisted that a partial workaround is already possible without this (by using a thread from a pool and accepting temporary undersubscription).
0 Kudos
onnog
Beginner
633 Views
Dear Anton, Thank you for your Reply.

Understand, you try to call TBB scheduling (parallel_invoke) holding a lock which can be requested on the outer level as well. Didn't we warn against such a design anti-pattern?


In deed, somewhere in the documentation there is a hint that is no good idea to hold locks too long. However, lazy initialization itsself is not an anti-pattern. Plus, parallelization inside the initialization is necessary for performance reasons. We are looking for a way to do that with TBB. (Getting the hold-lock-too-long anti pattern working is one of several possible ways. If you can show me another way, I'll be happy with that.)

I wasn't able to concretize the information from the blog to actually working code. Can you give some hints?

When is the next TBB release with support for task_arena supposed to be released?
0 Kudos
Anton_M_Intel
Employee
633 Views
Hm.. I found no direct warning regarding parallel work under lock in the Reference, we should describe it more explicitly. thanks!
The release is expected in a few weeks. But 'CPF' actually means 'backward-incompatible'. So, you should be careful deciding on usage of a CPF feature in a product.
The workaround is to use separate (e.g. std::) thread which starts the parallel processing under the lock. In this case, the work will be separated between your main thread and this additional initialization thread. If starting a thread under the lock is too expensive, please consider creating the thread in advance and putting it asleep until lazy initialization starts, then it can start a parallel processing, and when it finishes it should release the lock andunblock the main thread.
0 Kudos
RafSchietekat
Valued Contributor III
633 Views
Why would it break compatibility if arenas were never exposed before?

And could you address #27 (workaround suboptimal because parallelism probably not realised in initialisation code, specific features needed)?
0 Kudos
Anton_M_Intel
Employee
633 Views
I meant that we don't promise backward-compatibility and we do not do versioning for community preview features (CPF). In particular, if you link your app with tbb_preview.dll of one version, it is not guaranteed it will work with tbb_preview.dll of the next version without recompilation (as it is for tbb.dll). Worse, on Linux and other platforms with weak symbols, if e.g.two modules are compiled with CPF of different versions, it can easily break the program because a loader can find two symbols with the same name but different layout/implementation, choose one, but it can be incompatible with a context where it is used.
Sorry, can you eleborate what is your question (re #27..)
0 Kudos
onnog
Beginner
633 Views
The workaround is to use separate (e.g. std::) thread which starts the parallel processing under the lock. In this case, the work will be separated between your main thread and this additional initialization thread. If starting a thread under the lock is too expensive,

It will not be too expensive to start the thread under the lock.

We already considered this approach, but we could not complete this to design a solution. How exactly is the work separated? I assume, the std::thread for initialization has its own arena, so I can trick TBB to have two arenas without explicitely accessing arenas, right? On the one hand, tasks are collected in per-thread task pools. On the other hand, tasks belong to an arena. What's the relationship between those per-thread task pools and arenas? Does every thread belong to an arena? (If so, how is the assignment done?) Does every thread have one task pool per arena?

Note that is not enough to prevent the second master (the std::thread for initialization) from stealing outer tasks: The example from comment #20 is based on the initialization thread stealing an outer task. But a deadlock would also occur like this: Thread T from the pool steals work from the initialization thread F, and then another thread T' from the pool steals work from T. At some point, T might have to wait for T', while still processing a subtask of the initialization. If T than starts another outer task, it might end up working for the lock. T' finishes, but T is trapped waiting for the lock and cannot continue. However, F will eventually wait for T and cannont complete the initialization and release the lock.

So: Does TBB prevent any threads that are processing any initialization-subtask (like T above) from stealing outer tasks before they complete their initialization-subtask?

On the other hand, I always want to have the maximum number of threads running. When the first thread arrives at the resource, it falls asleep waiting for the resource. But it starts the initialization thread, so everything is fine. If another thread arrives at the ressource, it should either add a new thread to the thread pool or look for other work itsself to keep the maximum number of threads working. How can I do that?
0 Kudos
RafSchietekat
Valued Contributor III
633 Views
"I meant that we don't promise backward-compatibility and we do not do versioning for community preview features (CPF)."
Ah, of course... I'm afraid "backward-incompatbility" may not be the best term for that.

"Sorry, can you eleborate what is your question (re #27..)"
The suggested workaround would allow initialisation with one thread per resource (manually taken from a pool), but TBB thinks the other threads are all occupied, making that an arena of one master thread without workers, whereas some threads are just waiting for the initialisation to finish, leading to undersubscription in the presence of lots of parallel work to be done as part of that initialisation. So proper support for this scenario would either allow as many threads in as are currently blocked, or would reuse the threads waiting at the boundary but prevent them from stealing outside the arena associated with the resource being initialised, and it would dynamically adjust that number as additional threads arrive at the arena waiting for the resource.

(Added) Well, there may be some workers in the additional arena, but there would still be undersubscription.

(Added) Leaving the floor to Anton to answer #32...
0 Kudos
Anton_M_Intel
Employee
633 Views
Quoting onnog
We already considered this approach, but we could not complete this to design a solution. How exactly is the work separated? I assume, the std::thread for initialization has its own arena, so I can trick TBB to have two arenas without explicitely accessing arenas, right?
[AM] Right

On the one hand, tasks are collected in per-thread task pools. On the other hand, tasks belong to an arena. What's the relationship between those per-thread task pools and arenas? Does every thread belong to an arena? (If so, how is the assignment done?) Does every thread have one task pool per arena?
[AM] Think of it as if arena has the fixed number of task pools. Each master thread (main or std::thread) which schedule a work or calls task_scheduler_init creates own implicit arena. So, an (implicit) arena is tied to a thread, and a task is tied to an arena where it was spawned. Worker threads can migrate between arenas but only when it complete all the work from its local task-pool.

Note that is not enough to prevent the second master (the std::thread for initialization) from stealing outer tasks: The example from comment #20 is based on the initialization thread stealing an outer task. But a deadlock would also occur like this: Thread T from the pool steals work from the initialization thread F, and then another thread T' from the pool steals work from T. At some point, T might have to wait for T', while still processing a subtask of the initialization. If T than starts another outer task, it might end up working for the lock. T' finishes, but T is trapped waiting for the lock and cannot continue. However, F will eventually wait for T and cannont complete the initialization and release the lock.

So: Does TBB prevent any threads that are processing any initialization-subtask (like T above) from stealing outer tasks before they complete their initialization-subtask?
[AM] The tasks in arenas are separated in both directions.

On the other hand, I always want to have the maximum number of threads running. When the first thread arrives at the resource, it falls asleep waiting for the resource. But it starts the initialization thread, so everything is fine. If another thread arrives at the ressource, it should either add a new thread to the thread pool or look for other work itsself to keep the maximum number of threads working. How can I do that?
[AM] Please try tbb::task_group which starts a single initialization task, which creates new std::thread/arena to process the lazy initialization and waits until it finishes, and then any thread can wait on this task_group effectively executing remaining outer work (except a thread blocked on execution of the initialization task - it is necessary to avoid oversubscription.. on the other hand a temporary undersubscription is much worse than temporary oversubscription...).

0 Kudos
RafSchietekat
Valued Contributor III
633 Views
"[AM] Please try tbb::task_group which starts a single initialization task, which creates new std::thread/arena to process the lazy initialization and waits until it finishes, and then any thread can wait on this task_group effectively executing remaining outer work (except a thread blocked on execution of the initialization task - it is necessary to avoid oversubscription.. on the other hand a temporary undersubscription is much worse than temporary oversubscription...)."
It would be nice for the documentation to explicitly state that multiple threads can wait() for a single task_group, instead of requiring the reader to read on about structured_task_group and interpret its restrictions as implicit capabilities of task_group. I wouldn't have guessed so, until now; did I guess right?

And of course it remains to be seen how much work can be done without idling anyway before the underpowered initialisation work finishes, because intuitively a lot of the code will depend on this initialisation. Even then, a lot of (suboptimal) stealing may be involved.
0 Kudos
Anton_M_Intel
Employee
633 Views
"Sorry, can you eleborate what is your question (re #27..)"
The suggested workaround would allow initialisation with one thread per resource (manually taken from a pool), but TBB thinks the other threads are all occupied, making that an arena of one master thread without workers, whereas some threads are just waiting for the initialisation to finish, leading to undersubscription in the presence of lots of parallel work to be done as part of that initialisation. So proper support for this scenario would either allow as many threads in as are currently blocked, or would reuse the threads waiting at the boundary but prevent them from stealing outside the arena associated with the resource being initialised, and it would dynamcally adjust that number as additional threads arrive at the arena waiting for the resource.

(Added) Well, there may be some workers in the additional arena, but there would still be undersubscription.

I hope I answered this question in previous message by suggesting to use task_group. To be more specific, waiting on a task from a different arena does not mean joining that different arena. So, you are still able to wait and execute tasks from current ('outer') arena.

0 Kudos
onnog
Beginner
633 Views
And of course it remains to be seen how much work can be done without idling anyway before the underpowered initialisation work finishes, because intuitively a lot of the code will depend on this initialisation.

In our case, we are quite certain that if a task arriving at the resource can only take other outer tasks, for some instances, eventually all outer tasks will be finished (if not depending on the resource) or waiting for the resource. It's necessary, that threads from the outer arena that wait for the initialization can go into the initialization arena (but not the other way round).

Before a solution for that problem is in sight, we will not try to implement the suggested solution, but rather stick to early initialization or eventually evaluate PPL.
0 Kudos
Anton_M_Intel
Employee
633 Views
Arhg.. I see what is your concern now. yes.. waiting on a task_goup will not help the initialization because it spins in outer arena.
But if such a task can recycle itself instead of waiting on initilization, the task priorities can help. Initialization thread must start work with high priority, it effectively marks initialization arena with high-priority and assign all the available resources to it, then all the workers can switch to initialization as soon as they finish processing a current task. But it will not help the main thread which should just wait when initialization completes (or it can spin on recycling if it is acceptable). Initialization task which starts the std::thread can recycle as well (instead of what I suggested before) to unblock the worker and allow it migrating into initialization arena.
With new feature, the design could be simplified by calling initialization arena directly but the first implementation does not actually support participating of multiple threads excplicitly joined arena to actually do the work.. we are not yet sure it is really desired feature (let's call it 'multiple masters fortask_arena')
0 Kudos
RafSchietekat
Valued Contributor III
633 Views
"Arhg.. I see what is your concern now. yes.. waiting on a task_goup will not help the initialization because it spins in outer arena."
That's confusing: does a task blocked on task_group::wait() steal (as you wrote earlier) or spin (as you write now)?

"But if such a task can recycle itself instead of waiting on initilization, the task priorities can help. Initialization thread must start work with high priority, it effectively marks initialization arena with high-priority and assign all the available resources to it, then all the workers can switch to initialization as soon as they finish processing a current task."
I'm not sure that higher priority is the primary concern.

"But it will not help the main thread which should just wait when initialization completes (or it can spin on recycling if it is acceptable). Initialization task which starts the std::thread can recycle as well (instead of what I suggested before) to unblock the worker and allow it migrating into initialization arena."
I assume you mean that a "continuation" is used (which could be recycled but also a new task)? But how does this free up the entire thread that was runnng the original task, unless this was a root task or everything is organised with continuations? Otherwise I'm afraid this is not a sufficient solution in that regard...
"With new feature, the design could be simplified by calling initialization arena directly but the first implementation does not actually support participating of multiple threads excplicitly joined arena to actually do the work.. we are not yet sure it is really desired feature (let's call it 'multiple masters fortask_arena')"
So the feature has to be widely desired as well as really cool? What a high burden! :-)

How difficult could it be to temporarily reassign a thread to a different arena until its work there is finished?
0 Kudos
Anton_M_Intel
Employee
613 Views
"Arhg.. I see what is your concern now. yes.. waiting on a task_goup will not help the initialization because it spins in outer arena."
That's confusing: does a task blocked on task_group::wait() steal (as you wrote earlier) or spin (as you write now)?
[AM] It spins stealing and executing the tasks but it just spins when there are no tasks to steal. :)

"But if such a task can recycle itself instead of waiting on initilization, the task priorities can help. Initialization thread must start work with high priority, it effectively marks initialization arena with high-priority and assign all the available resources to it, then all the workers can switch to initialization as soon as they finish processing a current task."
I'm not sure that higher priority is the primary concern.
[AM] Are you sure you are not mixing it with a thread priority? task priorities helps transfer workers to high-priority (initialization) arena. It is exactly what you (both) are asking for AFAIU. Without that, a half of workers will keep stealing outer tasks.

"But it will not help the main thread which should just wait when initialization completes (or it can spin on recycling if it is acceptable). Initialization task which starts the std::thread can recycle as well (instead of what I suggested before) to unblock the worker and allow it migrating into initialization arena."
I assume you mean that a "continuation" is used (which could be recycled but also a new task)? But how does this free up the entire thread that was runnng the original task, unless this was a root task or everything is organised with continuations? Otherwise I'm afraid this is not a sufficient solution in that regard...
[AM] When a task comes to a resource which is not yet initialized, it should be able to terminate itself (by recycling itself as continuationor creating a new continuation - doesn't matter) to allow the scheduler to transfer the thread to high-priority arena.

"With new feature, the design could be simplified by calling initialization arena directly but the first implementation does not actually support participating of multiple threads excplicitly joined arena to actually do the work.. we are not yet sure it is really desired feature (let's call it 'multiple masters fortask_arena')"
So the feature has to be widely desired as well as really cool? What a high burden! :-)

[AM] Well :) multiple masters per arena require (further) refactoring of the scheduler as a whole. And if there is no use-cases, why to bother? This discussion shows there are.. priorities will not help if task cannot be terminated but need to block instead.
One of the idea for CPF is to gather feedbacks on the usage model.. well, it is not out yet but starts getting hot already :)

How difficult could it be to temporarily reassign a thread to a different arena until its work there is finished?

[AM] It is exactly what this new CPF does.. and it was not an easy work (and still should be improved).

0 Kudos
RafSchietekat
Valued Contributor III
613 Views
"Are you sure you are not mixing it with a thread priority? task priorities helps transfer workers to high-priority (initialization) arena. It is exactly what you (both) are asking for AFAIU. Without that, a half of workers will keep stealing outer tasks."
Since the threads predominately won't actually be free to migrate (unless everything is rewritten in terms of continuations), priorities won't be of much if any use.

"When a task comes to a resource which is not yet initialized, it should be able to terminate itself (by recycling itself as continuationor creating a new continuation - doesn't matter) to allow the scheduler to transfer the thread to high-priority arena."
But that task is not everything the thread is currently doing (exceptions like root tasks notwithstanding), so ad-hoc continuations at the arena boundary are not sufficient.

"Well :) multiple masters per arena require (further) refactoring of the scheduler as a whole. And if there is no use-cases, why to bother? This discussion shows there are.. priorities will not help if task cannot be terminated but need to block instead."
Prorities will never be the solution: requiring a rewrite of the whole application in terms of continuations is not realistic. If it were, the problem wouldn't occur in the first place, as this forum thread has already established.
"One of the idea for CPF is to gather feedbacks on the usage model.. well, it is not out yet but starts getting hot already :)"
Actually I don't have any idea yet about what else it might be good for! :-)

"It is exactly what this new CPF does.. and it was not an easy work (and still should be improved)."
I'm sure the initial work may have been the hardest work; now the question is how hard it would be to bolt on this new idea: thread arrives at arena, changes allegiance until arena work is finished, thread resumes where it was "interrupted", all typical stack-based operation.

Of course it would be best to first meditate a bit about possible generalisations: can there be multiple initialisations with dag-like interdependencies, are there other applications where a thread is free to leave when only its own task pool is emptied out instead of all the task pools in the arena, ... But not for too long!
0 Kudos
Reply