Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Inner parallelization calls outer

onnog
Beginner
According to the documentation, the task scheduler should basically go DFS in a thread's own task pool but steal from other tasks near the root. To my understanding, this implies that when you have two nested parallel loops, a thread that arrives at the inner loop should never decide to postpone processing of the inner loop, or the continuation after it, in order to start a new task from the outer loop.

However, this is what we observed, after several days of bug-chasing in our program, in some rare and non-reproducible cases.

I was able to write a minimal example that demonstrates this behaviour more reliably. On the two computers where we tested the example, the program below eventually prints "inner parallelization called outer". When you set a breakpoint on that line and analyze the callstack, you see that indeed the thread is in a task from the outer loop, had arrived at the inner loop, and then started a new task from the outer loop.

We suspect that this behaviour might be a bug in TBB. We are using tbb40_20120613 on Windows 7. I am running 8 threads (4 physical cores with hyperthreading).

[cpp]
#include <iostream>
#include "tbb/parallel_for.h"

volatile bool first_task;
__declspec(thread) bool thread_is_running_inner_parallelization = false;
int g_i;

class F
{
public:
    void operator() (int p_task_id) const;
};

void F::operator() (int p_task_id) const
{
    // first task has inner parallelization
    if (first_task)
    {
        first_task = false;
        int sum = 0;
        thread_is_running_inner_parallelization = true;
        for (int i=0; i<10000; ++i)
        {
            g_i = i;
            tbb::parallel_for (0, 2, [&](int k)
            {
                // we came from another thread to help with this task
                // (which is OK but rarely happens and the bug always occurs shortly after this)
                if (!thread_is_running_inner_parallelization)
                    std::cout << "stolen " << i << "/" << k << " ";
                // task must do some work, otherwise the problem will not occur
                for (int j=0; j<100000; ++j)
                    sum += (i*j)%7;
            });
        }
        thread_is_running_inner_parallelization = false;
        // print the result of the computation to avoid optimizing the work away
        // and to see when first task is finished
        std::cout << sum;
    }
    // other tasks basically do nothing
    else
    {
        // print heartbeat
        if (p_task_id%1000000==0)
            std::cerr << ".";
        // here is the bug
        if (thread_is_running_inner_parallelization)
            std::cout << "inner parallelization called outer " << g_i << " " << std::endl;
    }
}

int main ()
{
    // set global variable
    first_task = true;
    // create very many tasks on the outer parallelization,
    // to make sure we won't run out of them early.
    tbb::parallel_for (0, 2000000000, F());
}
[/cpp]

(EDIT: Syntax highlighting improved.)
RafSchietekat
Valued Contributor III
"According to the documentation, the task scheduler should basically go DFS in a thread's own task pool but steal from other tasks near the root."
Tasks are stolen, not being stolen from.

"To my understanding, this implies that when you have two nested parallel loops, a thread that arrives at the inner loop should never decide to postpone processing of the inner loop or continuing after the inner loop but decide to start a new task from the outer loop."
That's the tendency, but it stopped being even an implicit guarantee around 2.1 or so (I found "20090313 open-source release" in CHANGES.txt, and 2.1 is right above that). I don't recall whether it was ever an explicit guarantee.

"However this is what we observed after several days of bug-chasing in our program in some rare and non-reproducable cases.

I was able to write a minimal example that demonstrates this behaviour more reliably. On the two computers where we tested the example, the program below eventually prints "inner parallelization called outer". When you set a breakpoint on that line and analyze the callstack, you see that indeed the thread is in a task from the outer loop, had arrived at the inner loop, and then started a new task from the outer loop.

We suspect that this behaviour might be a bug in TBB. We are using tbb40_20120613 on Windows 7. I am running 8 threads (4 physical cores with hyperthreading)."

It is not a bug: that's exactly what you should expect because of stealing from another thread's depth-agnostic deque-implemented pool. What high-level problem do you want to solve?
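To illustrate what "depth-agnostic deque-implemented pool" means, here is a minimal sketch (invented for illustration only, not TBB's actual source): each worker pushes and pops its own spawned tasks at one end of its deque (depth-first), while idle threads steal from the opposite end, i.e., the oldest and typically outermost tasks. Nothing in the deque records which parallel construct a task belongs to, so a thread that runs dry inside an inner parallel_for and goes stealing may well end up with an outer-loop task on its stack.

[cpp]
// Minimal sketch of a per-thread work-stealing deque (illustration only, not TBB source).
#include <deque>
#include "tbb/spin_mutex.h"

struct TaskItem { void (*fn)(void*); void* arg; };   // hypothetical task record

class WorkDeque
{
    std::deque<TaskItem> d_items;
    tbb::spin_mutex      d_mutex;
public:
    void push (const TaskItem& t)                    // owner: push newest (deepest) work
    {
        tbb::spin_mutex::scoped_lock lock(d_mutex);
        d_items.push_back(t);
    }
    bool pop_local (TaskItem& t)                     // owner: LIFO pop, i.e., depth-first
    {
        tbb::spin_mutex::scoped_lock lock(d_mutex);
        if (d_items.empty()) return false;
        t = d_items.back(); d_items.pop_back();
        return true;
    }
    bool steal (TaskItem& t)                         // thief: FIFO pop, i.e., oldest/outermost
    {
        tbb::spin_mutex::scoped_lock lock(d_mutex);
        if (d_items.empty()) return false;
        t = d_items.front(); d_items.pop_front();
        return true;
    }
};
[/cpp]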
onnog
Beginner
Tasks are stolen, not being stolen from.

Typo. I wanted to write that a thread steals from other threads near the root.

It is not a bug, that's exactly what you should expect because of stealing from another thread's depth-agnostic deque-implemented pool.

Can you explain what happens here and why that is OK?

I understand that it's OK that tasks get stolen from the thread running first_task (let's call it first_thread). It can then happen that first_thread is idle, but one task from the inner for-loop is still running (in another thread, which I'll call slave).

I do not understand why first_thread is stealing from other threads. I would find it slightly better (but still poor) if first_thread just sat idle and waited for slave to return. A really good solution would steal, but only tasks that are direct or indirect children of the node where first_thread is waiting. (It may be possible to express in terms of threads where to steal.) In any case, I want first_thread to be ready when slave finishes, so first_thread can continue first_task. The above stealing policy guarantees that. When first_thread has stolen a task from the outer loop, it is most likely not ready in time.

What high-level problem do you want to solve?

Disregarding our high-level problem, we think that no more outer-loop tasks should run simultaneously than there are threads. The current TBB behaviour results in abandoning an outer-loop task when it arrives at an inner parallelization and starting a new outer-loop task. This results in problems such as unnecessary memory consumption or results from the tasks becoming available later on average (e.g. to display to the user).

Sure, when the outer parallelization uses all threads, the inner is useless. But removing the inner parallelization is not an option because there are fewer outer tasks on other instances or more CPUs on other machines.

For our actual high-level problem we know the solution, but we are not completely satisfied with it. We have the inner parallelization in a lazy initialization. TBB abandons constructing the object when it arrives at the inner parallelization. We had a mutex implementation, so the new outer task waited for the initialization to complete (deadlock). This may be a case of holding the mutex too long. But starting the lazy initialization again (as in your lazy pattern) isn't good either. One might need a lot of attempts before one initialization completes!

The solution we want to implement is to initialize the objects before the outer parallelization.
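In outline (a hypothetical, self-contained sketch of this workaround, not our production code), the shape is:

[cpp]
#include <iostream>
#include "tbb/parallel_for.h"
#include "tbb/parallel_invoke.h"

// Sketch: run the (possibly parallel) initialization eagerly, before the outer parallel_for,
// so that no outer-loop task can ever be scheduled while the initialization is in flight.
static int g_resource = 0;                  // stands in for the lazily initialized object

static void initialize_resource ()
{
    int a = 0, b = 0;
    tbb::parallel_invoke (                  // inner parallelism, finished before any outer task
        [&a] { for (int i=0; i<1000000; ++i) a += i%7; },
        [&b] { for (int i=0; i<1000000; ++i) b += i%5; });
    g_resource = a + b;
}

struct Outer
{
    void operator() (int) const             // outer tasks only read the finished resource
    {
        if (g_resource == -1) std::cout << "never" << std::endl;
    }
};

int main ()
{
    initialize_resource ();                 // eager initialization
    tbb::parallel_for (0, 1000000, Outer());
    std::cout << g_resource << std::endl;
}
[/cpp]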

RafSchietekat
Valued Contributor III
"Can you explain what happens here and why that is OK?"
If all chunks of an inner loop are being executed, the first thread that finishes will have nothing to do but to steal from another thread, and it will tend to grab a task related to the outer loop.

"The solution we want to implement is to initialize the objects before the outer parallelization."
You should probably express the shared dependencies on lower-level initialisations at the TBB level, by maintaining a shared collection of followers in the object being initialised, and releasing each of them when initialisation is complete. A client of such an object would then set a reference count, register itself with the object, and wait for it to finish.
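Very roughly, and with all names invented here (a sketch of the idea only, with simplistic locking and no error handling, not tested code), it could look like this:

[cpp]
#include <vector>
#include "tbb/task.h"
#include "tbb/spin_mutex.h"

// Sketch of the "followers" idea: the first visitor initialises; later visitors register a
// waiter task and wait_for_all() on it; the initialiser releases each waiter by giving it an
// empty child task whose completion ends the wait.
class SharedResource
{
    tbb::spin_mutex         d_mutex;
    std::vector<tbb::task*> d_followers;    // waiters to release when initialisation is done
    bool                    d_ready;
    bool                    d_initialising;
public:
    SharedResource () : d_ready(false), d_initialising(false) {}

    void acquire ()
    {
        tbb::task* waiter = NULL;
        bool i_initialise = false;
        {
            tbb::spin_mutex::scoped_lock lock(d_mutex);
            if (d_ready) return;
            if (!d_initialising) { d_initialising = true; i_initialise = true; }
            else
            {
                waiter = new( tbb::task::allocate_root() ) tbb::empty_task;
                waiter->set_ref_count(2);   // one count for the wait + one for the release child
                d_followers.push_back(waiter);
            }
        }
        if (i_initialise)
        {
            initialise();                   // the expensive, possibly parallel, work
            tbb::spin_mutex::scoped_lock lock(d_mutex);
            d_ready = true;
            for (std::size_t i = 0; i < d_followers.size(); ++i)
            {
                tbb::task& dummy = *new( d_followers[i]->allocate_child() ) tbb::empty_task;
                tbb::task::spawn(dummy);    // completing the child lets the waiter return
            }
        }
        else if (waiter)
        {
            waiter->wait_for_all();         // the thread may pick up other tasks while waiting
            tbb::task::destroy(*waiter);
        }
    }
private:
    void initialise ();                     // not shown
};
[/cpp]

Note that a thread blocked in wait_for_all() can still pick up unrelated tasks, which is exactly the effect the rest of this discussion turns on.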
onnog
Beginner
If all chunks of an inner loop are being executed, the first thread that finishes will have nothing to do but to steal from another thread, and it will tend to grab a task related to the outer loop.

Well, we agree that this happens. We disagree about whether this is desirable behaviour when the first thread that finishes is the thread that spawned the inner tasks.

I don't get your suggestion for how to solve the problem with lazy initialization. Are you talking of something like a future class? How does the solution make sure that the thread that does the initialization is not blocked waiting for the initialization (after it has dropped out of the initialization due to inner parallelization)? Even if it polls for work while waiting, it has an incomplete outer task in its callstack, and therefore cannot complete the initialization.
RafSchietekat
Valued Contributor III
"I don't get your suggestion how to solve the problem with lazy initialization."
I was assuming that you want to apply lazy initialisation from the first thread that uses a resource through a shared handle.

"Are you talking of something like a future class?"
A non-TBB future would block a thread and cause undersubscription, so this would be a way to expose the future to the scheduler.

"How does the solution make sure that the thread that does the initialization is not blocked waiting for the initialization (after it has droped out the initialization due to inner parallelization)?"
That brings me to the limitations of this solution. There have been discussions about futures and agents etc. earlier, and they would seem to cause issues with the scheduler, but I don't recall all the details. I don't see a problem if the "TBB future" (to coin a name for the pattern) is a nonblocking leaf task (no dependencies of its own) that would then either become merged with the first task visiting it (to avoid it getting lost in the deque somehow) or become a dependency of any other task visiting it while it is already being processed. While this is not TBB's normal bread and butter (only tree-shaped graphs of tasks are directly supported in the syntax), the scheduler is fully capable of handling directed acyclic graphs of tasks, and this would ensure that topology. Does this apply to your high-level goal? Let's see... a thread not first at the scene will be "blocked", sure, and able to steal an outer-loop task in the meantime, but the first thread that encountered the resource uninitialised would already be busy executing it, and so nobody would be blocked forever waiting for the initialisation to finish. I don't see what "after it has dropped out of the initialization due to inner parallelization" could possibly mean?

"Even if it polls work while waiting, it has an incomplete outer task in its callstack. And therefore cannot complete the initialization."
Maybe there would be more stealing in this scenario (wild guess), which is not the optimal thing for performance to say the least, but it may well be better than if everybody just blocked on mutexes (a subject for study, or do I just need a coffee at this point?). But I don't see why the initialisation would not be able to complete?

I do concede that working with dags is somewhat of a drag in TBB, but you could probably encapsulate that.

Does that come close?
onnog
Beginner
A non-TBB future would block a thread and cause undersubscription, so this would be a way to expose the future to the scheduler.

OK, so a TBB future is better. Not blocking a thread is nice to have. However, the essence of the problem is not maximizing the number of working threads, but getting the solution to work at all. The pros and cons of blocking the threads in mutexes, non-TBB futures or TBB futures are not what bothers me. The problem is that any of these solutions deadlocks if there is a parallelization in the initialization.

I don't see a problem if the "TBB future" [...] is a nonblocking leaf task

Everything is fine if the TBB future doesn't spawn new tasks.

thread not first at the scene will be "blocked", sure, and able to steal an outer-loop task in the meantime, but the first thread that encountered the resource uninitialised would already be busy executing it

When a "thread not first at the scene" (call it N) is "blocked", it may also steal a task spawned by the first thread (call it F). The stolen task contributes to building the ressource. E.g. the stolen task may be one index in an inner parallel_for. F is currently executing another index of the same parallel_for. Now F happens to finish before N. There are no more indices left. So F steals an outer-loop task (that's what I called dropping out - that's what causes the problem and has to be avoided) and gets "blocked" waiting for the ressource. N finishes, but F still cannot continue with the initalization, because it is "busy" with the outer loop task. Deadlock. Even if F is still polling work while waiting: It cannot resume the initialization task, it can only start new tasks.

To sum up, any lazy initialization only works when there is no parallelization in the initialization. :-(

I don't see what "after it has droped out the initialization due to inner parallelization" could possibly mean?

To visualize that, put a breakpoint on the line "inner parallelization called outer" and look at the callstack. Imagine that the "if (first_task)" scope is the initialization and sum is the resource. The only difference between lazy initialization and my example code is that in my example the other threads are not waiting for the computation of sum to complete.
RafSchietekat
Valued Contributor III
"The problem is that any these solution deadlocks if there is a parallelization in the initialization."
Did I miss that? Even this should not be a deal breaker if you can at least guarantee that the initialisation-task graph is entirely separate from the rest of the program, has no potential cycles, and can therefore be composed with the rest of the program into a bigger graph that is at all times acyclic even if not always connected (as far as TBB is concerned). My initial proposal was simpler, but based on stricter assumptions for higher convincing power. :-)

"The stolen task contributes to building the ressource."
Please explain, because this could imply breaking the assumption of a dag.

"N finishes, but F still cannot continue with the initalization, because it is "busy" with the outer loop task. Deadlock."
F doesn't continue with the initialisation, it just needs to get the go-ahead to return from the wait, which it will eventually do as its other work winds down. It might happen that there is nothing to be stolen at some point even though another way of scheduling would allow an idle thread to resume from a wait now buried somewhere on its stack, but this is not a permanent condition, and how bad this is for performance is a matter for experimentation (unless you can do this in cerebro), not pessimism.

"To sum up, any lazy initialization only works when there is no parallelization in the initialization. :-("
See above.
onnog
Beginner
"The stolen task contributes to building the ressource."
Please explain, because this could imply breaking the assumption of a dag.

Instead of repeating myself I'll post code that demonstrates the deadlock. I hope that this answers your questions. The initialization-task graph is entirely separate from the rest of the program. Our overall task graph is a dag.

Please disregard the variable first_thread in the example. It is not in the production code that we are debugging. The purpose of this variable is to make sure that only the first thread depends on the initialization while tasks processed by other threads finish quickly. Thus it maximizes the probability that the other threads steal from the first thread (which does the initialization). So the variable first_thread is a device that dramatically increases the probability that the bug occurs. In principle, the bug can occur without this variable (and actually did in some runs of our production code).

With mutex_type=tbb the code runs into an assertion when the first thread has dropped out of the initialization and wants to grab the mutex a second time. With mutex_type=boost, it just deadlocks because the first thread waits for the mutex that it already holds.

Note that the problem cannot be solved by replacing the mutex with some code that makes threads poll for other work when they find the resource creation in progress. If you state something else, could you please post code to prove that?

[cpp]
#include <iostream>
#include "tbb/parallel_for.h"
#include "tbb/mutex.h"
#include <boost/thread/mutex.hpp>

//namespace mutex_type = tbb;
namespace mutex_type = boost;

__declspec(thread) bool first_thread = false;
mutex_type::mutex d_mutex;

class Lazy
{
private:
    int* d_ressource;

    Lazy (const Lazy&);
    void operator= (const Lazy&);

public:
    Lazy ();
    int& operator* ();
};

Lazy::Lazy ()
  : d_ressource(nullptr)
{
}

int& Lazy::operator* ()
{
    if (d_ressource)
        return *d_ressource;
    mutex_type::mutex::scoped_lock lock(d_mutex);
    if (d_ressource)
        return *d_ressource;
    int sum = 0;
    for (int i=0; i<10000; ++i)
    {
        tbb::parallel_for (0, 2, [&](int k)
        {
            if (!first_thread)
                std::cout << "stolen" << std::endl;
            for (int j=0; j<100000; ++j)
                sum += (i*j)%7;
        });
    }
    d_ressource = new int (sum);
    return *d_ressource;
}

class F
{
private:
    Lazy* d_lazy;

public:
    F (Lazy*);
    void operator() (int) const;
};

F::F (Lazy* p_lazy)
  : d_lazy(p_lazy)
{
}

void F::operator() (int) const
{
    if (first_thread)
        std::cout << **d_lazy << std::endl;
}

int main ()
{
    first_thread = true;
    Lazy ressource;
    tbb::parallel_for (0, 2000000000, F(&ressource));
}
[/cpp]
RafSchietekat
Valued Contributor III
"Instead of repeating myself I'll post code that demonstrates the deadlock. I hope that this answers your questions. The initialization-task graph is entierely separate from the rest of the program. Our overall task graph is a dag."
BTW, be careful with that double-checked locking: you should use an atomic there.

"Note that the problem cannot be solved by replacing the mutex with some code that makes threads poll for other work when they find the resource creation in progress. If you state something else, could you please post code to prove that?"
Sure it can, even if I don't have the time right now to write that code for you. The first task to visit a resource should attach the resource's initialisation task to the TBB graph by creating it as its child, and any subsequent task should add itself as an additional parent c/o an explicit collection in the resource or its handle (you'll have to be careful how to do that in a thread-safe manner).

onnog
Beginner
BTW, be careful with that double-checked locking: you should use an atomic there.

Yes, I've heard that. But so far the lock never caused problems. If you replace the lock by an atomic, the code would not deadlock, but instead start creation of the resource again and again.
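For reference, that lock-free variant would look roughly like this (a sketch that reuses the Lazy class from my example above, with d_ressource redeclared as tbb::atomic<int*> and the parallel loop moved into a hypothetical compute_sum() helper):

[cpp]
// Sketch of the lock-free variant: every thread that finds the resource empty computes its own
// candidate; only the first compare_and_swap wins, the others throw their (expensive) work away.
// No deadlock, but the computation may run many times.
int& Lazy::operator* ()
{
    int* p = d_ressource;                                   // d_ressource: tbb::atomic<int*>
    if (!p)
    {
        int* candidate = new int (compute_sum());           // compute_sum(): the parallel work
        p = d_ressource.compare_and_swap(candidate, NULL);  // publish only if still empty
        if (p == NULL)
            p = candidate;                                  // we won the race
        else
            delete candidate;                               // somebody else got there first
    }
    return *p;
}
[/cpp]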

Sure it can, even if I don't have the time right now to write that code for you. The first task to visit a resource should attach the resource's initialisation task to the TBB graph by creating it as its child, and any subsequent task should add itself as an additional parent c/o an explicit collection in the resource or its handle (you'll have to be careful how to do that in a thread-safe manner).

This would be interesting for us. However, I don't understand how to implement that so that it makes an essential difference over my code. This is why I asked for code. As you don't have the time to write it, I have some questions about how to do that myself.

The first task creates the initialization task and then does spawn_and_wait_for_all on it?
Other tasks attach themselves and then do wait_for_all?

This way, the threads do not just wait for the mutex, but look for other tasks. Good.

But what's the essential point that prevents the deadlock? Which of the following steps does not happen?
- The initialization task (run by thread F) still spawns subtasks in the inner parallelization.
- Some of these subtasks might still be stolen by another thread N.
- F might still finish its own subtask before N.
- F then polls for work.
- F might start another outer task.
- The outer task attaches itself to wait for the initialization.
- Now F polls for work again. It may or may not find work. But even if N finished in the meantime and F does not find work, F still has the outer task on top of its callstack. Therefore F cannot continue with the initialization. N cannot continue with the initialization either because it does not have the beginning of the initialization in its callstack.
RafSchietekat
Valued Contributor III
"Yes, I've heard that. But so far the lock never caused problems. If you replace the lock by an atomic, the code would not deadlock, but instead start creation of the ressource again and again."
For portability (and presentability), even with the mutex (that's not the issue), d_ressource should really be redeclared as atomic and used with acquire or release semantics.
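Concretely, something like this (a sketch against the Lazy class above; tbb::atomic gives plain reads acquire semantics and plain writes release semantics by default; compute_sum() again stands in for the parallel work):

[cpp]
// Sketch: keep the mutex for mutual exclusion, but publish the pointer through a tbb::atomic so
// that the unlocked first check is an acquire read and the final store is a release write.
class Lazy
{
private:
    tbb::atomic<int*> d_ressource;          // instead of a plain int*
public:
    Lazy () { d_ressource = NULL; }

    int& operator* ()
    {
        int* p = d_ressource;               // acquire read (TBB default for atomic reads)
        if (!p)
        {
            mutex_type::mutex::scoped_lock lock(d_mutex);
            p = d_ressource;                // re-check under the lock
            if (!p)
            {
                p = new int (compute_sum());
                d_ressource = p;            // release write (TBB default for atomic writes)
            }
        }
        return *p;
    }
};
[/cpp]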

I'll have a look at the rest later.
RafSchietekat
Valued Contributor III
Thank you for an interesting question, and sorry for just taking a swing at it before. I try to keep a good batting average here, but occasionally I do strike out. (If I'm messing up even my metaphors, please note that I've never even seen a game on TV.)

I was a sceptic about depth-agnostic deque-based scheduling myself at some point. It seemed vulnerable to stack overruns, against which the countermeasure is not to steal past a certain threshold (traditional sequential stack overruns are still a hazard of course). This implementation is capable of scheduling dags of tasks, but that does not apply here, since the parents of the shared resource-initialisation task are already being executed. That was one part of my mistake. The other part was that, while recognising futures as cyclic dependency-creating devices, I thought that if only this was avoided by policy you would get the above-mentioned safe dag and things would just work.

On the other hand, even considering task depth, it would still be possible for F to start initialising the resource, go out stealing a lower-level task somewhere else (let's say that levels grow downward), and still find itself needing the same resource, similarly resulting in deadlock. This may be something that was not considered in the original analysis around outer and inner loops, so hopefully it is a helpful new insight here.

Could the solution be to apply continuations? That would seem to be a necessary part of getting around frames being buried inside a stack, but how much of the implementation would then need to be in terms of continuations? At first sight it would need to be all of the resource-initialising code, but then what do you do when you would like to use a parallel_for, which is only offered as a blocking call?
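For readers unfamiliar with the term: a continuation task holds the code that runs after the children finish, so the parent does not have to keep a frame waiting on any thread's stack. The classic shape, adapted from the well-known Fibonacci example in the TBB documentation (a sketch of the style, not a solution to the problem at hand):

[cpp]
#include "tbb/task.h"

// Continuation-passing style with tbb::task: the parent allocates a continuation, hands its
// children to it, and returns immediately instead of blocking in wait_for_all().
long SerialFib (long n) { return n<2 ? n : SerialFib(n-1) + SerialFib(n-2); }

struct FibContinuation : public tbb::task
{
    long* const sum;
    long x, y;
    FibContinuation (long* sum_) : sum(sum_) {}
    tbb::task* execute () { *sum = x + y; return NULL; }   // runs only after both children
};

struct FibTask : public tbb::task
{
    const long n;
    long* const sum;
    FibTask (long n_, long* sum_) : n(n_), sum(sum_) {}
    tbb::task* execute ()
    {
        if (n < 16) { *sum = SerialFib(n); return NULL; }
        FibContinuation& c = *new( allocate_continuation() ) FibContinuation(sum);
        FibTask& a = *new( c.allocate_child() ) FibTask(n-2, &c.x);
        FibTask& b = *new( c.allocate_child() ) FibTask(n-1, &c.y);
        c.set_ref_count(2);
        spawn(b);
        return &a;                          // scheduler bypass: no frame left waiting behind
    }
};
[/cpp]

The catch is exactly the one mentioned above: parallel_for and friends are blocking calls, so code written in terms of them cannot simply be dropped into such a continuation-only style.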

Declaring the resource-initialising thread unauthorised for stealing (using API to be added for this purpose), much as when its stack level is past the above-mentioned threshold, is probably still insufficient if its own tasks may be stolen by threads that are not dedicated to the effort by some arena-related mechanism. And of course there's the recurring idea of using fibers between acts of theft so that no frame is ever artificially buried on the stack because of the scheduler alone.

I hope I have rectified some of my errors in this posting, but unfortunately I was only able to come up with largely hypothetical solutions. If anybody can improve on those, please do chime in at this point!
onnog
Beginner
Thank you for an interesting question, and sorry for just taking a swing at it before. I try to keep a good batting average here, but occasionally I do strike out.

Thank you for admitting that honestly.

Could the solution be to apply continuations?

Rewriting the initialization in terms of continuations is not an option. In our real application, the initialization is the full-time job of one of my colleagues, who has been working on it for several years. We cannot refactor all of this.

Declaring the resource-initialising thread unauthorised for stealing [...] is probably still insufficient if its own tasks may be stolen by threads that are not dedicated to the effort by some arena-related mechanism.

Indeed. Moreover, it would be nice if the resource-initializing thread would steal back stolen tasks. Look at the implementation of parallel_for. Say the range is 0..7. F starts to work at 0 and pushes 4..7, 2..3, 1 into its deque (or similar, I have not studied the TBB implementation that intensively). Now 4..7 is stolen by N. N works at 4, pushing 6..7, 5 into N's deque. Suppose 2..3 and 1 compute quickly or are stolen by other threads. Then after finishing with 0, F should steal back 6..7 or 5. Also, if subtasks of N are stolen by N' and then N steals an outer task, we again have a deadlock.
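(For the curious, the halving pattern described above is easy to see with blocked_range itself; a toy illustration:)

[cpp]
#include <iostream>
#include "tbb/blocked_range.h"

// Each split hands the upper half to a new range object (which becomes a stealable task)
// and keeps the lower half for the executing thread.
int main ()
{
    tbb::blocked_range<int> r(0, 8);
    while (r.is_divisible())
    {
        tbb::blocked_range<int> stolen(r, tbb::split());    // takes [middle, end)
        std::cout << "pushed [" << stolen.begin() << "," << stolen.end() << "), "
                  << "kept ["   << r.begin()      << "," << r.end()      << ")" << std::endl;
    }
    // prints: pushed [4,8) kept [0,4), then [2,4)/[0,2), then [1,2)/[0,1)
}
[/cpp]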

And of course there's the recurring idea of using fibers between acts of theft so that no frame is ever artificially buried on the stack because of the scheduler alone.

That's beyond my current understanding of TBB.

Thank you for your replies.
RafSchietekat
Valued Contributor III
"That's beyond my current understanding of TBB."
It's not something currently in TBB, but it has been mentioned before on this forum.

Maybe the upcoming direct control of arenas might support a solution here, with the following twist: each additional thread arriving at a resource and needing to wait for it to be initialised could temporarily authorise the scheduler to add a new thread to the arena (from a pool of course), after the first thread had set it up and either created a new thread or reused itself; when the work on the resource is done, the arena is destroyed, the authorisations forgotten, and the original threads would be able to continue. Not all resources would be lucky enough to have multiple worker threads in their respective arenas, but perhaps there's a good chance that multiple interest is met with commensurate response, so to say.

(Added 2012-07-14) Oh yes, the other threads arriving at the initialisation code could perhaps set affinity for the threads to be admitted to the arena, or, why not, reuse themselves much like the first thread, as long as they are similarly prevented from wandering out of the arena until they are finished with its work.

(Fixed typo 2012-07-16)
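For what it's worth, the arena control that was later added to the library allows roughly this shape (a sketch using the task_arena interface as it eventually shipped; at the time of this discussion it was only being previewed, so take the details with a grain of salt):

[cpp]
#include "tbb/task_arena.h"
#include "tbb/parallel_for.h"

// Sketch: confine the initialisation to its own arena, so that a thread waiting inside the
// arena only picks up initialisation work and cannot wander off into outer-loop tasks.
void initialise_resource ()
{
    tbb::task_arena init_arena(4);          // a small, dedicated arena for the initialisation
    init_arena.execute([]
    {
        tbb::parallel_for (0, 10000, [](int) { /* expensive per-item work */ });
    });                                     // returns only when this arena's work is done
}
[/cpp]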
onnog
Beginner
each additional thread arriving at a resource and needing to wait for it to be initialised could temporarily authorise the scheduler to add a new thread to the arena (from a pool of course), after the first thread had set it up and either created a new thread or reused itself

I don't understand that. Maybe due to the fact that I don't know precisely what an arena is.

If you consider a solution with new threads: it's too late for thread creation when an additional thread arrives at the resource. When F arrives at the resource for the second time, you are already in a losing position. It doesn't help to create a new thread then.

Basically, there are 2 approaches IMHO:

Either make sure that a thread waiting for stolen child tasks to complete only steals tasks that are directly or indirectly spawned by those tasks. I implemented that solution some time ago in a chess-playing program (without TBB, just Windows threads or pthreads respectively). It worked fine because every node in a game tree has potential for parallelism, so a thread can easily find tasks it may legally steal, even when restricted in where to steal. It might perform poorly (leaving threads sitting idle because they don't find tasks they may legally steal) if there is less parallelism in the program. This way, tasks are completed as soon as possible.

An alternative worth considering (but untested) is to have more threads. You have three thread pools. Any thread is in exactly one of them.
- the usual pool P, limited in size by the constructor parameter of the task scheduler (which usually is one thread per CPU core)
- a pool of "clean" threads C. C is initially empty. Threads are put there rather than destroyed when they are done with all of their work. This way they can be reused instead of creating new threads. By "taking a thread from C" when C is empty, I mean creating a new thread.
- a heap of blocked threads B

Scheduling:
- When a thread finds that it cannot continue with its current task (either waiting for a resource or waiting for stolen tasks to complete), it does not steal, but goes to B. The other threads in B are checked to see whether they can continue now. If so, one of them returns to P. Otherwise, a thread from C goes to P and starts stealing.
- When a thread t completes its outermost task, it checks whether one of the threads in B can continue. If so, that thread goes to P and t goes to C. Otherwise, t remains in P.

This way, you usually have more threads than CPU cores, but P is limited to the constructor parameter, so no more threads than that are ever working simultaneously. Tasks do not strictly complete as soon as possible (once in B, a thread can resume only if another thread finishes or gets blocked), but at least there are no deadlocks any more. Also, there might be better CPU usage than when restricting where to steal. Memory consumption for the stacks of the threads might be a problem.

RafSchietekat
Valued Contributor III
"I don't understand that. Maybe due the fact that I don't know what precisely an arena is."
So far arenas are only an implementation construct (see "TBB 3.0 task scheduler improves composability of TBB based solutions. Part 1."), but they will soon be exposed in the API.

"If you consider a solution with new threads: It's too late for thread creation when an additional thread arrives at the ressource. When F arrives at the ressource for the second time, you already are in a losing position. It doen't help to create a new thread then."
I dare hope that my blindness was only intermittent... In what I just proposed, F (or its stand-in) would be constrained to working on initialising the resource. Use of the arena would implement roughly what you're also proposing, so I hope that it will come with the necessary buttons and levers for such a purpose (or I hereby propose that they will). If the other threads arriving at the resource were to reuse themselves, there would also be much less of a penalty than in case of a context switch, and there would still be emphasis on finishing work on the arena/initialisation as quickly as possible, so perhaps that's even better than merely preventing deadlocks. If anything, there might be too many threads in the arena by default if there is not enough parallel slack at that level, but probably not for too long, so it might not be worthwhile to entangle everything for fear of the possibility of a little bit of idle time before at least gaining some experience with this solution.

"Memory consuption for the stacks of the threads might be a problem."
How, exactly?
onnog
Beginner
In what I just proposed, F (or its stand-in) would be constrained to working on initialising the resource.

OK, that would solve the problem. You would have to impose that constraint on F as soon as F starts initializing. I was confused because you didn't write that explicitly. You only wrote what would happen for "each additional thread arriving at a resource".

So do I understand you correctly?
1. When a thread (F) starts to initialize the resource, it creates a new arena and stays in the new arena until the initialization is complete.
2. The other threads stay in the main arena. Only if one of them also waits for the resource does it enter the new arena.
("each additional thread arriving at a resource and needing to wait for it to be intialised could temporarily authorise the scheduler to add a new thread to the arena")

@2: Why do you think you should restrict the number of threads that enter the new arena? Why not allow any thread that polls for work to enter the new arena (especially when there are no tasks left in the main arena)? In any case, a more or less static assignment of threads to arenas may limit the performance.

I don't know what methods/features are provided by arenas and don't have the time to dig into the internals of TBB that far. When a thread hasn't found tasks in the new arena and it doesn't have partially completed tasks from the arena in the callstack, the thread can safely leave the arena. When you enhance your solution with that, it might solve the problem. It's especially nice that it avoids context switches.

"Memory consuption for the stacks of the threads might be a problem."
How, exactly?


I suggested to have more threads. To my understanding, each of them needs its own stack, which is reserved even if not currently used by the thread. So more threads need more memory.
onnog
Beginner
see "TBB 3.0 task scheduler improves composability of TBB based solutions. Part 1."

Thank you for that link. In the second part, I find this code:

[cpp]
task& r1 = *new( task::allocate_root() ) empty_task;
r1.set_ref_count(2);
task& t1 = *new( r1.allocate_child() ) BackgroundTask;
task::spawn(t1);

task& r2 = *new( task::allocate_root() ) empty_task;
r2.set_ref_count(2);
task& t2 = *new( r2.allocate_child() ) ForegroundTask;
task::spawn(t2);

r2.wait_for_all();
task::destroy(r2);
r1.wait_for_all();
task::destroy(r1);
[/cpp]

I wonder if the problem that I reported can occur there in a similar way:
Let F work at ForegroundTask.
If a thread N happens to steal a child of ForegroundTask, F might have to wait for N and steal a child of BackgroundTask in the meantime.
With F being temporarily trapped in BackgroundTask, ForegroundTask cannot complete.
F will eventually finish, so this isn't a deadlock. But ForegroundTask (e.g. a GUI) will still not respond for some time.
RafSchietekat
Valued Contributor III
"So do I understand you correctly?"
Yes.

"@2: Why do you think you should restrict the number of threads that enter the new arena? Why not allow any thread that polls for work to enter the new arena (especially when there are no tasks left in the main arena)? In any case, a more or less static assignment of threads to arenas may limit the performance."
The main thing is that the existing thread reuses itself (my favourite) or taps a thread from a pool to step in (back-up solution, perhaps also based on current stack level), to avoid oversubscription. I'm not entirely sure that the arena would always be able to absorb even every waiting thread, e.g., when many threads need a shared resource with not much inherent parallelism, let alone additional ones, but that could also just be a matter of arena configuration (inherit the count from the arena of the thread that created the arena, or use the maximum from the arenas of the other threads, unless explicitly overridden?). Something like that?

"I don't know what methods/features are provided by areneas and don't have the time to dig into the internals of TBB that far."
I think we're building a wish list for the new API here...

"When a thread hasn't found tasks in new arena and it doesn't have partially completed tasks from the arena in the callstack, the thread can safely leave the arena. When you enhance your solution by that, that might solve the problem."
That might be a good generalisation, even though it wouldn't apply here (waiting for resource initialisation to finish).

"I suggested to have more threads. To my understanding, each of them needs an own stack, which is reserved even if not currently used by the thread. So more threads need more memory."
OK, I thought you meant memory consumption per stack.

Well, now would probably be a good moment for somebody from the TBB team to let us know what they think about this, whether it was already part of the plan or whether it will now be added?
onnog
Beginner
Well, now would probably be a good moment for somebody from the TBB team to let us know what they think about this, whether it was already part of the plan or whether it will now be added?

Indeed, I agree with that.

In the meantime, I created an example that does not depend on lucky or unlucky timing. I do all the timing myself with a Timer class. The code requires some obvious modifications to run without our internal framework. In particular, you will need a Timer class.

(EDIT: atomic added, base class introduced)

[cpp]
//namespace smp = Concurrency;
namespace smp = tbb;
//namespace mutex_type = tbb;
namespace mutex_type = boost;

Timer g_timer;
mutex_type::mutex g_mutex;
int g_count = 0;

class Task
{
private:
    virtual void run () const = 0;

public:
    void operator() () const
    {
        {
            mutex_type::mutex::scoped_lock lock(g_mutex);
            static char* name[] = {"F","N","M"};
            if (Platform::thread_name().empty() && g_count<3)
            {
                Platform::set_thread_name (name[g_count]);
                ++g_count;
            }
            std::cout << std::left;
            std::cout << "thread " << std::setw(10) << Platform::thread_id_and_name() << ":"
                      << std::setw(13) << typeid(*this).name()
                      << " started at " << g_timer.user_time() << std::endl;
        }
        run ();
        {
            mutex_type::mutex::scoped_lock lock(g_mutex);
            std::cout << "thread " << std::setw(10) << Platform::thread_id_and_name() << ":"
                      << std::setw(13) << typeid(*this).name()
                      << " finished at " << g_timer.user_time() << std::endl;
        }
    }
};

class Task11 : public Task
{
public:
    void run () const
    {
        while (g_timer.user_time()<4);
    }
};

class Task12 : public Task
{
public:
    void run () const
    {
        while (g_timer.user_time()<5);
    }
};

class Lazy
{
private:
    mutex_type::mutex d_mutex;
    tbb::atomic<int*> d_resource;

    Lazy (const Lazy&);
    void operator= (const Lazy&);

public:
    Lazy ();
    int& operator* ();
};

Lazy::Lazy ()
  : d_resource()
{
}

int& Lazy::operator* ()
{
    if (d_resource)
        return *d_resource;
    mutex_type::mutex::scoped_lock lock(d_mutex);
    if (d_resource)
        return *d_resource;
    while (g_timer.user_time()<1);
    smp::parallel_invoke (Task11(), Task12());
    d_resource = new int (42);
    return *d_resource;
}

Lazy g_lazy;

class Task1 : public Task
{
public:
    void run () const
    {
        std::cout << *g_lazy << std::endl;
    }
};

class Task21 : public Task
{
public:
    void run () const
    {
        while (g_timer.user_time()<2);
    }
};

class Task221 : public Task
{
public:
    void run () const
    {
        while (g_timer.user_time()<6);
    }
};

class Task222 : public Task
{
public:
    void run () const
    {
        std::cout << *g_lazy << std::endl;
    }
};

class Task22 : public Task
{
public:
    void run () const
    {
        while (g_timer.user_time()<3);
        smp::parallel_invoke (Task221(), Task222());
    }
};

class Task2 : public Task
{
public:
    void run () const
    {
        smp::parallel_invoke (Task21(), Task22());
    }
};

BOOST_AUTO_TEST_SUITE( TbbT );

BOOST_AUTO_TEST_CASE( inner_parallelization_calls_outer )
{
    tbb::task_scheduler_init scheduler_init(3);
    g_timer.start();
    smp::parallel_invoke (Task1(), Task2());
}

BOOST_AUTO_TEST_SUITE_END();
[/cpp]

Expected behaviour:
(EDIT: We don't know the exact output; there are degrees of freedom)
Program should print 42 twice (plus some debug output) and terminate normally after 6 seconds.

Actual behaviour:
(EDIT: Thread names added)
Program deadlocks after printing:

[plain]
thread [6008] F :class Task1   started at 0
thread [4956] N :class Task2   started at 0
thread [4956] N :class Task21  started at 0
thread [4300] M :class Task22  started at 0
thread [6008] F :class Task11  started at 1.01401
thread [4956] N :class Task21  finished at 2.04361
thread [4956] N :class Task12  started at 2.04361
thread [4300] M :class Task221 started at 3.01082
thread [6008] F :class Task11  finished at 4.00923
thread [6008] F :class Task222 started at 4.00923
thread [4956] N :class Task12  finished at 5.00763
thread [4300] M :class Task221 finished at 6.02164
[/plain]