Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Need help with pipeline deadlock

benz__alexander
Beginner
2,071 Views

Hi everyone,

I have a problem concerning proper termination of pipelines and I'm running out of ideas on where the problem might be.

The setup is as follows. I have one class that handles loading and decompressing of video streams. This process is realised via the TBB pipeline mechanism. Once the start() method of my class is called, it begins loading and decompressing frames and delivers them to a follow-up component.

Calling start on the class spawns a thread which initializes the scheduler and calls pipeline.run() (which blocks until the pipeline is done). Calling stop on the class tells the first tbb::filter to stop loading frames and to return NULL (to terminate the pipeline). There are multiple instances of this class running (with different input streams).
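
In outline, the setup looks like this (a simplified sketch, not the actual code; the class and helper names are made up):

    #include "tbb/pipeline.h"
    #include "tbb/task_scheduler_init.h"
    #include "tbb/atomic.h"
    #include "tbb/tbb_thread.h"

    // First filter: loads frames until asked to stop, then returns NULL
    // to terminate the pipeline.
    class LoadFilter : public tbb::filter {
        tbb::atomic<bool>& my_stop;
    public:
        LoadFilter( tbb::atomic<bool>& stop )
            : tbb::filter(tbb::filter::serial_in_order), my_stop(stop) {}
        /*override*/ void* operator()( void* ) {
            if( my_stop ) return NULL; // NULL terminates the pipeline
            static char frame;
            return &frame;             // placeholder for a loaded frame
        }
    };

    class StreamDecoder {
        tbb::atomic<bool> my_stop;
        tbb::tbb_thread* my_thread;
        static void thread_main( StreamDecoder* self ) {
            tbb::task_scheduler_init init; // initialize the scheduler on this thread
            tbb::pipeline p;
            LoadFilter load( self->my_stop );
            p.add_filter( load );
            // ... decompressing and delivering filters would be added here ...
            p.run( 8 );                    // blocks until the pipeline is done
        }
    public:
        void start() {
            my_stop = false;
            my_thread = new tbb::tbb_thread( thread_main, this );
        }
        void stop() {
            my_stop = true;    // first filter will return NULL
            my_thread->join(); // hangs whenever run() fails to return
            delete my_thread;
        }
    };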

The problem I have is that once in a while, when calling stop, the first filter returns NULL but the pipeline does not stop, which results in the thread's main method (the one that called pipeline::run()) never returning.

Inspecting the threads, there are a couple of them waiting in:

tbb::internal::GenericScheduler::wait_while_pool_is_empty ()

but none of the threads is stuck inside the instance that blocks.

Any help is appreciated,

- Alex

61 Replies
RafSchietekat
Valued Contributor III
450 Views

The band-aid would not restart the process, only the pipelines, which is probably acceptable if it only has to intervene occasionally, for lack of a real solution at this time. "Not very elegant, but it may do the trick."

A solution would use "FIFO" for all unbounded pipelines, well, all unbounded workloads in the program, really (a corrollary of a principle I've presented before), with "bounded" probably more a qualitative concept than a quantitative one. Such pipelines may only spawn global-progress tasks for the first filter, preferably with a grainsize larger than one to trade low latency for high throughput (sic?), i.e., with a higher grainsize both latency and throughput will increase. Downstream work would still be spawned on the "LIFO" part of the scheduler, as well as any internal work (dependencies of the filters). A worker thread will therefore always come back to its own pipeline after a bounded amount of time, so it will always "return from run()" (of course, run() requires parallelism if more than one pipeline is started, so an API change would be welcome, but that's a different matter). Note that this is just for global progress, so there can still be annoying delays if a worker thread gets stuck in a slow lane, but it's still a useful first step because traffic will never come to an unbounded halt (you may notice that my daily commute is a useful source of inspiration).

robert-reed
Valued Contributor II
450 Views
Quoting - Dmitriy V'jukov
Hmmm... reminds me of... deadlock

;)

Hmmmm... smells more to me like a livelock. No thread is starved for resources, and it's not really like either one, because progress is being made, just not on the terminated thread.

robert-reed
Valued Contributor II
450 Views
Quoting - Raf Schietekat

A solution would use "FIFO" for all unbounded pipelines, well, all unbounded workloads in the program, really (a corrollary of a principle I've presented before), with "bounded" probably more a qualitative concept than a quantitative one. Such pipelines may only spawn global-progress tasks for the first filter, preferably with a grainsize larger than one to trade low latency for high throughput (sic?), i.e., with a higher grainsize both latency and throughput will increase. Downstream work would still be spawned on the "LIFO" part of the scheduler, as well as any internal work (dependencies of the filters).

I don't see how this would behave differently than the current scheme if the initiating thread of one pipeline ever momentarily runs out of work and steals from another pipeline. How does the FIFO prevent an errant thread from getting stuck in an abundance of work on some other LIFO?


RafSchietekat
Valued Contributor III
450 Views

"Hmmmm... smells more to me like a livelock." I would call it starvation: livelock means being very busy doing nothing at all. It happens to be intra-thread instead of inter-thread, but that does not seem essential.

"How does the FIFO prevent an errant thread from getting stuck in an abundance of work on some other LIFO?" Between a worker stealing FIFO (like Cilk?) by using deques instead of singly linked lists, and programmer discipline to only spawn bounded workloads for LIFO scheduling, would there still be an opportunity for a task to be starved forever?

ARCH_R_Intel
Employee
450 Views

Andrey Marochko, Alexey Kukanov, and I have been discussing this issue. Here is my summary of our thoughts.

The root problem is "may" versus "must" parallelism. The TBB scheduler assumes "may" parallelism, meaning that parallelism is not required for correctness, and thus scheduling can be as unfair as possible. "Must" parallelism, in contrast, requires some degree of fairness. "May" parallelism enables the efficiency properties of Cilk-style scheduling. But "must" parallelism is sometimes necessary for correctness (notably to avoid deadlock!).
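
To make the distinction concrete, a small illustration (the helper code below is made up, not part of TBB's implementation): a parallel loop remains correct when executed entirely on one thread, whereas a blocking producer/consumer pair does not:

    #include "tbb/parallel_for.h"
    #include "tbb/blocked_range.h"
    #include "tbb/concurrent_queue.h"

    // "May" parallelism: correct even if executed sequentially, so the
    // scheduler is free to be as unfair as it likes.
    struct Work {
        void operator()( const tbb::blocked_range<int>& r ) const {
            for( int i = r.begin(); i != r.end(); ++i ) { /* process item i */ }
        }
    };
    void may_example() {
        tbb::parallel_for( tbb::blocked_range<int>(0, 1000), Work() );
    }

    // "Must" parallelism: pop() blocks until push() happens, so producer
    // and consumer must make progress concurrently; a scheduler that never
    // runs the producer deadlocks the consumer.
    tbb::concurrent_bounded_queue<int> q;
    void consumer() { int x; q.pop(x); } // blocks until a producer pushes
    void producer() { q.push(42); }
    // e.g. tbb::tbb_thread t1(consumer), t2(producer); t1.join(); t2.join();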

In TBB, we call threads created by the user "masters" and threads created by the TBB scheduler "workers". The following policy makes sense: parallelism between masters must be treated as "must" parallelism; threads employed on behalf of masters must respect the "must" parallelism between their employers; all other parallelism is "may" parallelism.

A way to implement the policy is to prevent threads from simultaneously having two employers ("moonlighting"). The formal rules are:

  1. A thread is either employed by a thread (possibly itself) or unemployed.
  2. A master thread is always employed by itself.
  3. A worker thread is unemployed if it is not running any tasks and its task pool is empty. Otherwise a worker is employed.
  4. A worker can change from unemployed to employed by stealing work. The employer of the thief is the employer of the victim. This rule also covers tasks obtained via advertisements in a thread's mailbox.
  5. For two distinct employers X and Y, a thief employed by X cannot steal a task from a victim employed by Y.

The "employed thief" case arises only in nested steals. That is when a thread X executing a task has to wait for other threads to finish stolen subtasks, and is looking for work to do in the meantime.

I believe the policy would fix the reported problem. We'll have to work out the data structures. My impression is that if there are only a few masters, there is a relatively quick hack to implement the policy with tolerable efficiency. The hack is to implement rule 4 by tagging threads with their employer id, and to implement rule 5 as a rejection rule in the stealing logic. Its efficiency is less than ideal because with M masters the probability of success for a steal is 1/M.
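
In sketch form, the rejection rule might look like this (illustrative only; the employer field and helpers below are made up, not the real scheduler internals):

    #include <cstddef>

    struct Task;

    // Hypothetical per-thread scheduler state; -1 means "unemployed".
    struct ThreadState {
        int employer_id;
        Task* pop_from_pool() { return NULL; } // stand-in for the real deque pop
    };

    // Rule 5 as a rejection test in the stealing logic.
    Task* try_steal( ThreadState& thief, ThreadState& victim ) {
        if( thief.employer_id != -1 && victim.employer_id != -1
            && thief.employer_id != victim.employer_id )
            return NULL;                            // refuse to "moonlight"
        Task* t = victim.pop_from_pool();
        if( t != NULL && thief.employer_id == -1 )
            thief.employer_id = victim.employer_id; // rule 4: inherit the employer
        return t;
    }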

We could use the quick hack to verify that the policy indeed solves the problem, then work on an efficient implementation.

- Arch Robison

RafSchietekat
Valued Contributor III
450 Views

My position is that a scheduler should allow the user to specify the "QoS"-related trade-offs to be made, and that global progress should be one of several options. The proposed partitioning skips global progress in the trade-efficiency-for-fairness direction (it will also solve the problem at hand, but it will often be less efficient, unless it is the only acceptable option because of a low tolerance for high latency in the program). How can it be used for serving multiple mouths other than with oversubscription and/or capitulating to old-style multithreading (to which I expected TBB to be the answer)? How will the number of workers at any one time be managed to avoid oversubscription, and how will they be scheduled between the masters? I think that work should instead be devoted to allowing the user to avoid the need for more master threads: running multiple pipelines at once does not inherently require parallelism, nor should TBB's implementation.

Alexey-Kukanov
Employee
450 Views
Quoting - Raf Schietekat
I think that work should instead be devoted to allowing the user to avoid the need for more master threads: running multiple pipelines at once does not inherently require parallelism, nor should TBB's implementation.

Running multiple pipelines at once does inherently require parallelism, otherwise why not just run them one by one? Of course there is some valid sequential execution that has the property of making progress in every simultaneous pipeline; as we know, time sharing and preemption make a semblance of concurrent execution even on a single core. But if making simultaneous progress is a requirement, then concurrency is mandated, even if an execution engine resolves it via time sharing & preemption.

On the other hand, I agree that users should have an option to choose the behavior they need.

RafSchietekat
Valued Contributor III
450 Views

"otherwise why not just run them one by one" Because that would not be simultaneous: even round-robin would be better.

Alexey-Kukanov
Employee
450 Views
Quoting - Raf Schietekat
"otherwise why not just run them one by one" Because that would not be simultaneous: even round-robin would be better.

Exactly. That's my point: despite the existence of a valid sequential execution (such as round-robin), the requirement of simultaneous progress is nothing but mandated concurrency.

RafSchietekat
Valued Contributor III
450 Views

Surely you're joking, Mr. Kukanov!

ARCH_R_Intel
Employee
450 Views
No joke: "simultaneous progress"by definitionimplies "mandatory concurrency".
RafSchietekat
Valued Contributor III
450 Views

Now I believe TBB is in trouble, if the chief architect does not see that... unless you thought that I meant that TBB's present API does not require parallelism to operate multiple pipelines at the same time? Indeed, that would have been a mistake, but actually my point is that this may be so now, but also that it should not be so: in addition to run(), there should be a method that means "go register yourself with the nice scheduler so that it will occasionally call you", and then the scheduler would do just that. Any parallelism between pipelines would be entirely optional: on a single-core system, only one thread would run, everybody would be served (no starvation), and things would proceed smoothly (no deadlock). But of course the organisation should be somewhat more sophisticated than round-robin, for a maximally efficient exploitation of concurrency when the opportunity presents itself.

ARCH_R_Intel
Employee
450 Views
Quoting - Raf Schietekat

Now I believe TBB is in trouble, if the chief architect does not see that... unless you thought that I meant that TBB's present API does not require parallelism to operate multiple pipelines at the same time? Indeed, that would have been a mistake, but actually my point is that this may be so now, but also that it should not be so: in addition to run(), there should be a method that means "go register yourself with the nice scheduler so that it will occasionally call you", and then the scheduler would do just that. Any parallelism between pipelines would be entirely optional: on a single-core system, only one thread would run, everybody would be served (no starvation), and things would proceed smoothly (no deadlock). But of course the organisation should be somewhat more sophisticated than round-robin, for a maximally efficient exploitation of concurrency when the opportunity presents itself.

If a pipeline filter is free to make kernel calls that can block, then a single thread cannot guarantee progress of two pipelines. There has to be at least two software threads, one to serve each pipeline. The concurrency is not optional. Of course the underlying OS is free to context switch a single hardware thread between the software threads, hence there might be no parallelism.

The semantics requested by Alex Benz is entirely reasonable. What we (Alexey, Andrey, and I) are proposing is to delineate "must" versus "may" parallelism by treating user-created threads as "must" parallelism. If we implement this distinction in the scheduler, then the "go register yourself with a nice scheduler" becomes a matter of starting the pipelines on separate threads.

RafSchietekat
Valued Contributor III
450 Views

"If a pipeline filter is free to make kernel calls that can block, then a single thread cannot guarantee progress of two pipelines." Until TBB provides specific, targeted support for filters that can block, they should probably not be used (note that this is related to but separate from the issue of running multiple pipelines simultaneously, because starvation is not necessarily related to blocking, I suppose). A workaround seems to be to have a "master thread" detect work from each input and fire up a related pipeline when required, which will close down when no pending input is found; pooling should alleviate some overhead.

"There has to be at least two software threads, one to serve each pipeline. The concurrency is not optional. Of course the underlying OS is free to context switch a single hardware thread between the software threads, hence there might be no parallelism." Multiple threads in this case are mandatory only because of the current limitations of TBB. Blocking in a filter is a "forced error" for lack of appropriate support from TBB (but see workaround above).

"The semantics requested by Alex Benz is entirely reasonable." Absolutely. If only TBB didn't have these limitations that forced him to use old-style multithreading. At this point TBB looks promising, but there's still work ahead.

"What we (Alexey, Andrey, and I) are proposing is to delineate "must" versus "may" parallelism by treating user-created threads as "must" parallelism. If we implement this distinction in the scheduler, then the "go register yourself with a nice scheduler" becomes a matter of starting the pipelines on separate threads." There will always be situations where multiple threads are mandatory, or just inherited from legacy software. But TBB should not unnecessarily impose them on the user because of a lack of support for "proper" task-based optional parallelism.

RafSchietekat
Valued Contributor III
450 Views

I wrote 'A workaround seems to be to have a "master thread" detect work from each input and fire up a related pipeline when required, which will close down when no pending input is found; pooling should alleviate some overhead.'

To avoid any misunderstanding: this is only a workaround to avoid the use of blocking input filters; it offers no direct relief to pipelines starved for other reasons, but such an arrangement should improve the odds, and at least you'll have the consolation that all worker threads are doing something useful, which may also help against a starvation problem in the first place. Worth a try, perhaps. With a progress-capable scheduler there would be no unbounded starvation if unbounded filters are avoided. Thread partitioning would solve both problems at the same time, but how many more cores than active pipelines must there be for this to still be different from old-style multithreading, assuming that TBB then actively manages the number of worker threads to avoid immediate oversubscription problems?
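
In sketch form, with hypothetical helpers (has_pending_data and build_pipeline_for stand in for the real input handling):

    #include <cstddef>
    #include <vector>
    #include "tbb/pipeline.h"
    #include "tbb/task_scheduler_init.h"

    struct Input {
        bool has_pending_data() { return false; } // stand-in for a real check
    };

    // Adds filters to p; the input filter returns NULL as soon as 'in'
    // runs dry, so run() returns instead of blocking on input.
    void build_pipeline_for( tbb::pipeline& p, Input& in ) { /* ... */ }

    void master_loop( std::vector<Input>& inputs ) {
        tbb::task_scheduler_init init;
        for(;;) {
            for( std::size_t i = 0; i < inputs.size(); ++i )
                if( inputs[i].has_pending_data() ) {
                    tbb::pipeline p;                    // pooling would reuse these
                    build_pipeline_for( p, inputs[i] );
                    p.run( 8 );                         // drains the input, then returns
                }
            // wait here for new data before the next round, to avoid spinning
        }
    }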

benz__alexander
Beginner
450 Views
Quoting - Andrey Marochko

Alex, am I right to understand that your pipelines are endless by themselves, and require an external signal to stop them? And until the first pipeline stops, your program does not stop the other ones?

Yes, the first part is true. I need to tell the pipeline from a different thread to quit; that is, a flag is set which is queried by the first filter in the pipeline. But I can quit a random filter chain (pipeline) any time I want and restart it, as long as the described situation is not met.

Thanks for giving this topic so much thought. This is really appreciated! I'll give you a short heads-up on where I stand.

To get on with our software, I implemented a simple "alternative" for the pipeline part. It works and is not much slower than the pipeline. BUT it's not nearly as comfortable as using a tbb::pipeline, so keep up with the work :-) The problem for me is that I can't use the pipeline in other parts of our software as long as I can't get around the deadlock issues. I already put a lot of time into trying to find the root cause. Unfortunately I also can't extract the problem into a simple example for you to do some debugging. But from the discussions so far, I guess you're on the right track.

Basically, the software we're developing is an HD a/v playout solution which internally uses a filter graph system similar to the DirectShow framework. The tbb::pipelines are used for basic low-level tasks in the filter chain that need to be done as quickly as possible, like loading and decoding frames (encapsulated in one single high-level filter). I think this should give you a basic idea of what I'm trying to achieve.

- Alex

Dmitry_Vyukov
Valued Contributor I
450 Views

Hmmmm... smells more to me like a livelock. No thread is starved for resources, and it's not really like either one, because progress is being made, just not on the terminated thread.

Think of a deadlock on spin-mutexes, with the processing of other pipelines' tasks as the yield :)

RafSchietekat
Valued Contributor III
450 Views

"Think of the deadlock on spin-mutexes with processing of other pipeline's tasks as yield :)" I'm surprised: no observations more substantial than this about a thread that goes to the heart of current and future usability of TBB?

Dmitry_Vyukov
Valued Contributor I
450 Views

A way to implement the policy is to prevent threads from simultaneously having two employers ("moonlighting"). The formal rules are:

  1. A thread is either employed by a thread (possibly itself) or unemployed.
  2. A master thread is always employed by itself.
  3. A worker thread is unemployed if it is not running any tasks and its task pool is empty. Otherwise a worker is employed.
  4. A worker can change from unemployed to employed by stealing work. The employer of the thief is the employer of the victim. This rule also covers tasks obtained via advertisements in a thread's mailbox.
  5. For two distinct employers X and Y, a thief employed by X cannot steal a task from a victim employed by Y.

Oh, I see, a Mini Concurrency Runtime? Will master threads be placing orders for worker threads? :) And since MSVC 2010 there will be a Mini Concurrency Runtime on top of the Big Concurrency Runtime :)

It's an interesting idea; it will probably even increase performance in some cases by improving locality of processing... But what do you think about potential induced stalls (an employed worker thread has to stand idle when there is more work to do)?

ARCH_R_Intel
Employee
446 Views
Quoting - Dmitriy Vyukov
It's an interesting idea; it will probably even increase performance in some cases by improving locality of processing... But what do you think about potential induced stalls (an employed worker thread has to stand idle when there is more work to do)?
We won't know the answer to that until we try it out. If the stall issue turns out to be significant, we may have to create >P software threads on a system with P hardware threads, and somehow block/unblock the software threads so that there are only P running at a time. Or maybe it will turn out that a little occasional oversubscription is not a big problem. There's much to be learned by implementing and trying it out.

RafSchietekat
Valued Contributor III
446 Views

"There's much to be learned by implementing and trying it out." I think you should delegate such half-cocked exploration to other projects and only do innovation with a clear direction to further the strengths of task-based concurrency, but I guess taking a detour to unblock a situation is better than standing still.
