Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Need help with pipeline deadlock

benz__alexander
Beginner
1,786 Views

Hi everyone,

I have a problem concerning proper termination of pipelines and I'm running out of ideas on where the problem might be.

The setup is as follows. I have one class that handles loading and decompressing of video streams. This process is realised via the TBB pipeline mechanism. Once the start() method of my class is called, it begins loading and decompressing frames and delivers them to a follow-up component.

Calling start on the class spawns a thread which initializes the scheduler and calls pipeline.run() (which blocks until the pipeline is done). Calling stop on the class tells the first tbb::filter to stop loading frames and to return NULL (to terminate the pipeline). There are multiple instances of this class running (with different input streams).

The problem I have is that once in a while when calling stop, the first filter returns with NULL but the pipeline is not stopped, which results in the main thread method (the one that called pipeline::run()) not returning.

Inspecting the threads, there are a couple of them waiting in:

tbb::internal::GenericScheduler::wait_while_pool_is_empty ()

but none of the threads is executing inside the instance that blocks.
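For reference, the stop mechanism described above can be sketched with a minimal stand-alone model (plain C++, no TBB; FrameSource, next_frame and the dummy frame are hypothetical stand-ins for the real input filter and its operator()):

```cpp
#include <atomic>

// Plain-C++ model of the first pipeline stage's stop logic (no TBB here;
// FrameSource and next_frame are hypothetical stand-ins for the real
// tbb::filter and its operator()).
struct FrameSource {
    std::atomic<bool> stop_requested{false};
    int frames_left = 3;   // pretend the stream has a few frames remaining

    // Models the input filter: returning nullptr (NULL) is the token that
    // tells the pipeline to terminate.
    void* next_frame() {
        if (stop_requested.load(std::memory_order_acquire) || frames_left == 0)
            return nullptr;
        --frames_left;
        static int dummy_frame = 0;
        return &dummy_frame;   // placeholder for a decoded frame
    }

    // Called from another thread by the owning class's stop() method.
    void stop() { stop_requested.store(true, std::memory_order_release); }
};
```

The flag itself behaves as expected; the hang discussed below is in the scheduler, not in this filter logic.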

Any help is appreciated,

- ALex

0 Kudos
61 Replies
Dmitry_Vyukov
Valued Contributor I
395 Views
We won't know the answer to that until we try it out. If the stall issue turns out to be significant, we may have to create >P software threads on a system with P hardware threads, and somehow block/unblock the software threads so that there are only P running at a time. Or maybe it will turn out that a little occasional oversubscription is not a big problem. There's much to be learned by implementing and trying it out.

Context switches are indeed cheap now, so limited oversubscription can be the answer. But there are other problems - most notably per-thread memory consumption in the scalable allocator (it seems you would nevertheless have to add something like scalable_alloc_flush()), and other per-thread resources (stack, maybe some user per-thread resources).

0 Kudos
Dmitry_Vyukov
Valued Contributor I
395 Views
Quoting - Dmitriy Vyukov

A way to implement the policy is to prevent threads from simultaneously having two employers ("moonlighting"). The formal rules are:

  1. A thread is either employed by a thread (possibly itself) or unemployed.
  2. A master thread is always employed by itself.
  3. A worker thread is unemployed if it is not running any tasks and its task pool is empty. Otherwise a worker is employed.
  4. A worker can change from unemployed to employed by stealing work. The employer of the thief is the employer of the victim. This rule also covers tasks obtained via advertisements in a thread's mailbox.
  5. For two distinct employers X and Y, a thief employed by X cannot steal a task from a victim employed by Y.

And why not just add an additional check into the inner-most master-thread loop? Now it looks like:

void master_thread_loop()
{
    for (;;)
    {
        while (task* t = pop_from_my_stack())
        {
            process(t);
        }

        if (parent_task->reference_counter == 0)
            break;

        steal_and_block_and_etc();
    }
}

The loop must be modified this way:

void master_thread_loop()
{
    for (;;)
    {
        while (task* t = pop_from_my_stack())
        {
            process(t);

            if (parent_task->reference_counter == 0)
                break;
        }

        if (parent_task->reference_counter == 0)
            break;

        steal_and_block_and_etc();
    }

    if (my_stack_is_not_empty)
        transfer_my_stack_to_somewhere();
}

I believe this will also fix the problem. But... hmmm... this looks simpler, and the additional overhead in the inner-most loop is negligible. And there will be no induced stalls, no oversubscription, and no additional memory consumption in the scalable allocator.

The master thread is just free to exit the game as soon as it wants.

What do you think?

IIRC I've seen something similar in Doug Lea's Java Fork/Join Framework (which is also a Cilk clone).
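The effect of the added inner check can be illustrated with a small single-threaded toy model (all names here are made up; where a real scheduler would steal or block, the model simply returns, and transferring the leftover stack is modeled by counting it):

```cpp
#include <functional>
#include <vector>

// Toy model of the patched loop: the "master" drains its local task pool but
// re-checks the root's reference counter after every task, so it can leave
// as soon as its own root task has completed, even with work still queued.
struct ToyScheduler {
    std::vector<std::function<void()>> stack;   // local task pool (LIFO)
    int root_refcount = 1;                      // 0 => this master's root task finished
    int leftover = 0;                           // tasks handed off on early exit

    void run_master_loop() {
        while (!stack.empty()) {
            auto t = stack.back();
            stack.pop_back();
            t();                                // process(t)
            if (root_refcount == 0)             // the check added inside the inner loop
                break;
        }
        leftover = (int)stack.size();           // transfer_my_stack_to_somewhere()
        stack.clear();
    }
};
```

With two queued tasks where the later-pushed one completes the root, the loop exits immediately after it, leaving the other task to be transferred instead of executed by this master.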

0 Kudos
Dmitry_Vyukov
Valued Contributor I
395 Views

Well, worker threads can (must?) also check the parent task's counter more frequently, but only in recursive scheduler loops. And master threads must transfer their uncompleted tasks only in non-recursive scheduler loops. So here is the patched version:

void scheduler_loop()
{
    for (;;)
    {
        while (task* t = pop_from_my_stack())
        {
            process(t);

            if ((recursive || master) && parent_task->reference_counter == 0)
                break;
        }

        if (parent_task->reference_counter == 0)
            break;

        steal_and_block_and_etc();
    }

    if (recursive && my_stack_is_not_empty)
        transfer_my_stack_to_somewhere();
}

I've not worked out all the details, but I believe the idea works.

0 Kudos
RafSchietekat
Valued Contributor III
395 Views

"What do you think?" I think that I don't understand what you're trying to do, but maybe I'm the only one? (I am a bit distracted by very strange problems on the way to floating-point atomics, where I only recently managed to solve a crash in optimised builds by using unions instead of reinterpret_cast(), even before any floating-point type was in sight, so that g++ version I'm using is looking mighty suspicious to me now.)

0 Kudos
Dmitry_Vyukov
Valued Contributor I
395 Views
Quoting - Raf Schietekat

"What do you think?" I think that I don't understand what you're trying to do, but maybe I'm the only one?

I am trying to fix the error described in the first post in this topic.

The master thread doesn't return from spawn_root_and_wait() because it is processing tasks from another pipeline, and is not even checking whether its root task (pipeline) has already finished. If the thread will be checking whether its root task has already finished, then the deadlock is resolved and the problem is solved.

0 Kudos
RafSchietekat
Valued Contributor III
395 Views

"If the thread will be checking whether its root task has already finished, then the deadlock is resolved and the problem is solved." I can see how that might sometimes help, but not how this would actually prevent the thread from still sometimes/often getting caught up in a never-ending story. You would need to guarantee that task closures (a task and its descendants) execute in a bounded time, which happens to be the basic requirement to make global progress possible, but then why run all these additional master threads instead of an arguably more efficient "FIFO"-capable scheduler? How will they be reined in from potentially massive oversubscription (how would you serve 10000 pipelines?), making suboptimal context-switch choices, littering the process with stacks, etc.? Isn't this just retracing the earlier history of multithreading?

I think it's a shame that TBB started with a promising approach (task-based optional concurrency), started with a mistake (stealing LIFO instead of FIFO, thus only supporting finite workloads), and now, in a panic, bolts off in the opposite direction.

0 Kudos
Dmitry_Vyukov
Valued Contributor I
395 Views
Quoting - Raf Schietekat

"If the thread will be checking whether its root task has already finished, then the deadlock is resolved and the problem is solved." I can see how that might sometimes help, but not how this would actually prevent the thread from still sometimes/often getting caught up in a never-ending story.

Please, describe how "never-ending" story can happen if we will apply my proposed patch? (I don't see)

Quoting - Raf Schietekat

You would need to guarantee that...

NO! Master thread is free to exit the game as soon as he wants (even during "never-ending" story).

Quoting - Raf Schietekat

... instead of an arguably more efficient "FIFO"-capable scheduler?

FIFO-capable scheduler won't help in this situation (IIRC we already was here). Master thread can be executing infinite sequence of tasks in any order, but if it will be not checking root task's reference counter it will be in infinite loop!

0 Kudos
Dmitry_Vyukov
Valued Contributor I
395 Views
Quoting - Raf Schietekat

How will they be reined in from potentially massive oversubscription (how would you serve 10000 pipelines?), making suboptimal context-switch choices, littering the process with stacks, etc.? Isn't this just retracing the earlier history of multithreading?

I think it's a shame that TBB started with a promising approach (task-based optional concurrency), started with a mistake (stealing LIFO instead of FIFO, thus only supporting finite workloads), and now, in a panic, bolts off in the opposite direction.

This is a different problem, and I am NOT trying to solve it now.

If you have 10000 threads with independent work, then why you need TBB at all?! You already have sufficient parallelism. Just execute every work in its thread.

0 Kudos
RafSchietekat
Valued Contributor III
395 Views

"Please, describe how "never-ending" story can happen if we will apply my proposed patch? (I don't see)" New master steals task, task spawns children and waits_for_all(), child spawns grandchildren and waits_for_all(), some descendant gets the brilliant idea to run an unbounded client-server operation, ...

"NO! Master thread is free to exit the game as soon as he wants (even during "never-ending" story)." How would you abandon a stolen task's execute() before it returns? Are we still talking about 1:1 schedulers?

"FIFO-capable scheduler won't help in this situation (IIRC we already was here). Master thread can be executing infinite sequence of tasks in any order, but if it will be not checking root task's reference counter it will be in infinite loop!" We already were here, but it seems that some additional effort is required to convince you. :-) With a FIFO-capable scheduler, we wouldn't be using a master thread for running the pipeline in the first place (you can do it all in one thread if need be). It is probably worthwhile repeating that: running 10000 pipelines at the same time, i.e., with overlapping activation periods, does not mean mandatory concurrency in the sense of requiring multithreading (whether they be kernel threads or user threads or a combination) and/or parallelism (multiple cores and/or hyperthreading), and all "context switches" are where the user chooses them to be.

"This is different problem, and I an NOT trying to solve it now." It was meant as a general remark, but you're also thinking threads instead of tasks here.

"If you have 10000 threads with independent work, then why you need TBB at all?! You already have sufficient parallelism. Just execute every work in its thread." The thing is that I don't want 10000 threads, only as many worker threads as will keep the hardware busy.

0 Kudos
Alexey-Kukanov
Employee
395 Views
Quoting - Raf Schietekat
I think it's a shame that TBB started with a promising approach (task-based optional concurrency), started with a mistake (stealing LIFO instead of FIFO, thus only supporting finite workloads), and now, in a panic, bolts off in the opposite direction.

We do steal FIFO; who said the opposite? The problem of only supporting finite workloads is different than that.

And we are not in a panic; nor do we bolt off "in the opposite direction". Support for multiple master threads has been in TBB since the very first version; it is not being invented to solve the problem that initiated this thread, as you might think. We do not see a reason why customers should not start TBB algorithms from different threads if they wish so, and so this mode is supported. The plan Arch proposed is intended to improve this support, because it's not the first time our customers get surprised that two independent jobs (I will call a set of tasks to solve some problem a job) become "tightly coupled" inside TBB so that they only finish together. The algorithm Dmitry proposed is also intended to "decouple" what should not be coupled; and we consider something like that as well.

The ability to start a few independent jobs from a single thread is also a known and popular feature request for us. The very simple and natural thing would be to convert it into nested parallelism, by spawning a task for each job, let the task be stolen, and an algorithm started by the thief; and this will likely be the solution we will implement first of all. However, it does not suit the additional requirement of making simultaneous progress in each of the jobs, because in general it does not provide such a guarantee if there are more jobs than available HW threads.

As I already said, the requirement of making simultaneous progress means that either the jobs are not really independent, or for some reason they require mandatory concurrency (one may argue that it is essentially the same in this case - the jobs are not really independent, and thus require mandatory concurrency, as e.g. in a producer-consumer pattern). The mandatory concurrency does not scale to thousands of jobs. I can't imagine the need to start 100000 pipelines simultaneously and require that every single one should constantly make progress. If those are truly independent, it should not matter in which order the execution happens (and then this thing would be easy to convert into nested parallelism). Otherwise, the whole thing should better be reworked to group dependent jobs together.

0 Kudos
Dmitry_Vyukov
Valued Contributor I
395 Views
Quoting - Raf Schietekat

"Please, describe how "never-ending" story can happen if we will apply my proposed patch? (I don't see)" New master steals task, task spawns children and waits_for_all(), child spawns grandchildren and waits_for_all(), some descendant gets the brilliant idea to run an unbounded client-server operation, ...


Damn! You are 200% right!

Blocking wait for children is so awful that I even forgot about it. In my library I just don't support blocking waits, which solves many problems. And IIRC original Cilk automatically transforms blocking waits to continuations...

Hmmm... But how will FIFO help here? If a task uses a blocking wait, then you can't return from it until all child tasks complete. You can't force it anyhow. Your FIFO trick will work only if tasks don't use blocking waits too. No?

Quoting - Raf Schietekat
"NO! Master thread is free to exit the game as soon as he wants (even during "never-ending" story)." How would you abandon a stolen task's execute() before it returns? Are we still talking about 1:1 schedulers?

I can't.

So the only solution I see for now is Arch's proposal about "Mini Concurrency Runtime"...

Quoting - Raf Schietekat
"FIFO-capable scheduler won't help in this situation (IIRC we already was here). Master thread can be executing infinite sequence of tasks in any order, but if it will be not checking root task's reference counter it will be in infinite loop!" We already were here, but it seems that some additional effort is required to convince you. :-) With a FIFO-capable scheduler, we wouldn't be using a master thread for running the pipeline in the first place

I think it's orthogonal to scheduling order. A LIFO scheduler can gain from continuation style the same way as a FIFO scheduler.


Quoting - Raf Schietekat
(you can do it all in one thread if need be). It is probably worthwhile repeating that: running 10000 pipelines at the same time, i.e., with overlapping activation periods, does not mean mandatory concurrency in the sense of requiring multithreading (whether they be kernel threads or user threads or a combination) and/or parallelism (multiple cores and/or hyperthreading), and all "context switches" are where the user chooses them to be.

Hmmm... I have to think more on this... but if tasks will be using blocking waits... I think they can blow up the thread's stack... or be subject to induced deadlocks...

Quoting - Raf Schietekat

"This is different problem, and I an NOT trying to solve it now." It was meant as a general remark, but you're also thinking threads instead of tasks here.

"If you have 10000 threads with independent work, then why you need TBB at all?! You already have sufficient parallelism. Just execute every work in its thread." The thing is that I don't want 10000 threads, only as many worker threads as will keep the hardware busy.

It's you who mentioned threads first. If you don't want 10000 threads then don't start them. Start pipelines one-by-one (or N-by-N). I think it is not a very good idea to start 10000 root tasks simultaneously anyway.

0 Kudos
Dmitry_Vyukov
Valued Contributor I
395 Views

The algorithm Dmitry proposed is also intended to "decouple" what should not be coupled; and we consider something like that as well.

As Raf noted, there is an inherent problem with blocking waits in my proposal. Basically, if a task executes wait_for_all() in the middle of its execute() method, then only the Almighty can force it to return from execute().

I will think more on this, but initially I don't see any potential fixes for my proposal.

The ability to start a few independent jobs from a single thread is also a known and popular feature request for us. The very simple and natural thing would be to convert it into nested parallelism, by spawning a task for each job, let the task be stolen, and an algorithm started by the thief; and this will likely be the solution we will implement first of all.

Doesn't spawn_root_and_wait( task_list& root_list ) do the thing?

0 Kudos
Dmitry_Vyukov
Valued Contributor I
395 Views

As I already said, the requirement of making simultaneous progress means that either the jobs are not really independent, or for some reason they require mandatory concurrency (one may argue that it is essentially the same in this case - the jobs are not really independent, and thus require mandatory concurrency, as e.g. in a producer-consumer pattern). The mandatory concurrency does not scale to thousands of jobs. I can't imagine the need to start 100000 pipelines simultaneously and require that every single one should constantly make progress. If those are truly independent, it should not matter in which order the execution happens (and then this thing would be easy to convert into nested parallelism). Otherwise, the whole thing should better be reworked to group dependent jobs together.

I also think that the number of simultaneously running root tasks (or jobs) must be determined by the library, because in some sense it's the same as the number of threads.

The interesting area for research in this context is:

1. If there are too many master threads (or just too many executing root tasks, if asynchronous submission of root tasks will be supported), temporarily hold up newly arriving master threads - just block at the beginning of spawn_root_and_wait(). The obvious problem here is that mandatory concurrency (which the user can be relying on) is not respected, and thus this can lead to deadlocks - special care is required to prevent them.

2. ... Damn! I've already forgotten it while writing the first one...

0 Kudos
Dmitry_Vyukov
Valued Contributor I
395 Views
Quoting - Dmitriy Vyukov

I also think that the number of simultaneously running root tasks (or jobs) must be determined by the library, because in some sense it's the same as the number of threads.

The interesting area for research in this context is:

1. [...]

2. ... Damn! I've already forgotten it while writing the first one...

Oh OK, I've remembered.

2. Assume we have N logical processors, and we already have N simultaneously working master threads. Well, we can just turn off all worker threads and let each master thread do its own root task from beginning to end. No stealing, no decrease in locality, no oversubscription, no ..., only perfect utilization of hardware.

But the process is still under control. I.e. if the number of master threads decreases, or the application changes, or the hardware platform changes, the run-time is ready to take everything into its own hands.

0 Kudos
Alexey-Kukanov
Employee
395 Views
Quoting - Dmitriy Vyukov

The ability to start a few independent jobs from a single thread is also a known and popular feature request for us. The very simple and natural thing would be to convert it into nested parallelism, by spawning a task for each job, let the task be stolen, and an algorithm started by the thief; and this will likely be the solution we will implement first of all.

Doesn't spawn_root_and_wait( task_list& root_list ) do the thing?

It probably does most of it, but for the sake of ease of use it should be wrapped into something with a nice name that accepts lambdas :)
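Such a wrapper might look like the following minimal model (plain std::thread stands in here for TBB task spawning; run_jobs is a hypothetical name, not a TBB API):

```cpp
#include <atomic>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical convenience wrapper in the spirit of the post above: submit a
// set of independent jobs and wait for all of them to finish. Plain
// std::thread stands in for a TBB implementation, which would instead spawn
// one root task per job and call spawn_root_and_wait().
void run_jobs(std::vector<std::function<void()>> jobs) {
    std::vector<std::thread> workers;
    workers.reserve(jobs.size());
    for (auto& job : jobs)
        workers.emplace_back(std::move(job));
    for (auto& w : workers)
        w.join();   // the collective wait over the whole job set
}
```

A task-based version would of course reuse worker threads rather than creating one per job; the point is only the shape of the interface.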

0 Kudos
RafSchietekat
Valued Contributor III
395 Views

Just a few reactions at this time (it's Friday night for me):

Alexey: "We do steal FIFO; who said the opposite?" Maybe having another look at the code would convince you?

Alexey: "The problem of only supporting finite workloads is different than that." I must apologise here. I have been fretting ever since I wrote this what exactly was the problem with stealing LIFO. Maybe it's about fairness, maybe it's in conjunction with using a queue, but now I wish I had removed it again because I don't remember exactly and I don't want to be thinking about it now. If I don't come up with a rationale at some later point, feel free to rub it in, though. :-)

Alexey: "Support for multiple master threads is in TBB since the very first version" Sure, but I'm referring to it as a weakness here (because TBB simply has no other way to run multiple pipelines), fraught with problems that now need a hurried fix.

Dmitriy: "It's you who mention threads first. If you don't want 10000 threads then don't start them. Start pipelines one-by-one (or N-by-N). I think that is not very good idea to start 10000 root tasks simultaneously anyway." But I didn't start 10000 threads, the scheduler takes care of running those 10000 pipelines for me, whether it has 1 core and 1 thread or 16 cores and 32 threads to play with. Note that this is not something TBB can do at this time.

Dmitriy: "Doesn't spawn_root_and_wait( task_list& root_list ) do the thing?" If you don't mind spending a lot of time waiting...

0 Kudos
Alexey-Kukanov
Employee
395 Views
Quoting - Raf Schietekat
Just a few reactions at this time (it's Friday night for me):

It's Friday night for me too. Actually, already Saturday :)

Quoting - Raf Schietekat
Alexey: "We do steal FIFO; who said the opposite?" Maybe having another look at the code would convince you?

Oh, yeah. At the same level of depth, we steal LIFO, you are right. This is going to be fixed.

Quoting - Raf Schietekat
Alexey: "Support for multiple master threads is in TBB since the very first version" Sure, but I'm referring to it as a weakness here (because TBB simply has no other way to run multiple pipelines), fraught with problems that now need a hurried fix.

As I said, some customers want to run TBB algorithms from separate threads, for whatever reasons. One of the reasons might be e.g. that they already applied functional thread-level parallelism and are not going to change it for the moment, while they could exploit inner-level data parallelism in those threads. There could be plenty of other reasons. And TBB was designed to coexist with other threading approaches. You might think of supporting multiple masters as a weakness; I prefer to think of it as support of customer needs.

And I do not understand who is "fraught with problems that now need a hurried fix" - you, or you think us? For support of multiple pipelines running, see previous posts and below.

Quoting - Raf Schietekat
But I didn't start 10000 threads, the scheduler takes care of running those 10000 pipelines for me, whether it has 1 core and 1 thread or 16 cores and 32 threads to play with. Note that this is not something TBB can do at this time.

Why not? Run parallel_for(blocked_range(0,10000), MyPipelineStarter()); this won't provide simultaneous progress with each one, but I already said enough on this.

Quoting - Raf Schietekat
Dmitriy: "Doesn't spawn_root_and_wait( task_list& root_list ) do the thing?" If you don't mind spending a lot of time waiting...

Not waiting, but executing the job that needs to be executed. That's what relaxed sequential execution is about: you need to do some job now, and you do, but if spare resources are available, they are used as well.

If you want to offload the jobs and do something else, and then come back and check if the job is done, usually it means that there are some other threads that do the job asynchronously. TBB supports this model as well, though currently on the task level only, if you don't mind waiting (i.e. calling wait_for_all) at the check point, which is the usual case with asynchronous execution. At this point, if the job was not done, the originating thread will do it itself.

But if you do mind waiting at any given moment, now or later, it's called mandatory concurrency. Surprise, TBB supports this as well - via starting a separate tbb_thread :)

What TBB does not support is "let me offload the job and do something else, then maybe do a piece of the job expected to be done by workers, then do something else again, and so on". There is no way back from wait_for_all until all existing job is done. Are you looking for that?

0 Kudos
RafSchietekat
Valued Contributor III
395 Views

"I prefer to think of it as support of customer needs." TBB supports both task-based concurrency and thread-based concurrency... as long as the customer chooses the latter to run multiple pipelines even if he doesn't want to... but it doesn't actually work anyway.

'And I do not understand who is "fraught with problems that now need a hurried fix" - you, or you think us?' The support for multiple master threads is fraught with problems.

"Why not? Run parallel_for(blocked_range(0,10000), MyPipelineStarter()); this won't provide simultaneous progress with each one, but I already said enough on this." Well, that's what I'm saying: TBB doesn't support simultaneous progress yet (as in overlapping activation periods of optional concurrency as I defined it earlier).

"But if you do mind to wait at any given moment, now or later, it's called mandatory concurrency." No, it's called wasting performance by mandatory idle waiting if the only way to provide simultaneous progress is to loop over spawn_root_and_wait_for_all() in whatever guise. Even if I do the work of retrofitting pipeline with a perform_work() method (like ::CORBA::ORB's plus its invisible ramifications), and register my 10000 pipelines with a manager, looping over "parallel_for(blocked_range(0,10000), myManager);" may spend much of its time waiting at the gate, unless I pay the price associated with a low-enough grain size (and the more processors I have, the lower the grain size needs to be). And I have to do it that way, because my only alternative is to run each pipeline in its own thread... but then I can't use TBB's pipeline anyway, because 10000 of those would only trip over each others' shoe laces.

It doesn't have to be like that. Please take the next logical step. Proposing thread partitioning is embarrassing.

0 Kudos
Alexey-Kukanov
Employee
392 Views
Quoting - Raf Schietekat
"But if you do mind to wait at any given moment, now or later, it's called mandatory concurrency." No, it's called wasting performance by mandatory idle waiting if the only way to provide simultaneous progress is to loop over spawn_root_and_wait_for_all() in whatever guise.

I did not mean the current TBB state when I wrote the above quoted sentence. I meant the general situation when some job was offloaded and (since no desire to wait) expected to be performed by someone else (another thread, usually).

Also at the moment I wrote that, I did not fully realize that the key point is the potentially infinite nature of the data source for pipelines, which you probably had in mind from the beginning; thus the misunderstanding. Everything I said before was applicable to pipelines with a priori known finiteness of data.

Quoting - Raf Schietekat
TBB supports both task-based concurrency and thread-based concurrency... as long as the customer chooses the latter to run multiple pipelines even if he doesn't want to... but it doesn't actually work anyway.


Correction: multiple pipelines that never end unless stopped externally, and so are desired to provide simultaneous progress. I agree we do not yet do it right. Still, I think these infinite-work pipelines are rather for a design with asynchronous agents than for a (relaxed) sequential program. And I agree that TBB does not have support for (a big number of) asynchronous agents either.

Quoting - Raf Schietekat
The support for multiple master threads is fraught with problems.

If we had tried to make TBB ideal from the very beginning, either there still would be no TBB, or it would have appeared somewhat later and not been ideal anyway, because we simply cannot foresee all the ways customers try to use it. Though the need to isolate masters from each other might have come to our mind.

We might as well adjust the number of active workers depending on the number of active masters.

By the way, the topic starter did not explain why he starts pipelines from different threads; it might be because of no alternative, but might as well be because his program is designed this way for whatever other reasons.

0 Kudos
RafSchietekat
Valued Contributor III
392 Views

My point is that asynchronous agents, or at least a useful subset of them, including pipelines with overlapping activations, can and should be served by a relaxed sequential scheduler.

0 Kudos
Reply