Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

pipeline never terminates

MLema2
New Contributor I

Hi,

We have a problem here that is very rare but problematic enough to raise some questions.  When it happens, the pipeline never finishes, yet it is not executing anything either.

Here is the boiled-down piece of code:

    InterlockedCounter  currentInFlight(0);

    // Open transaction and get next command to process
    auto GetNextCommand = [&]() -> Command* {
        Command* cmd = nullptr;
        while (!cmd && !NeedToStop() && HasWorkToDo()) {
            // Most of the time, skip counting memory and approximate
            if (needToCheckCheckMemory) {
                // Must not clash with other commands populating the structures
                while (currentInFlight != 0 && !NeedToStop()) {
                    Thread::ThisThread::MilliSleep(10);
                }
                ...
            } else { ... }

            cmd = new Command;
            ++currentInFlight;
        }
        return cmd;
    };

    tbb::parallel_pipeline(kMaxInFlightCommands,
        tbb::make_filter<void, Command*>(
            tbb::filter::serial_in_order,
            [&](tbb::flow_control& p_FC) -> Command* {
                Command* cmd = GetNextCommand();
                if (cmd == nullptr) {
                    p_FC.stop();
                }
                return cmd;
            }) &
        tbb::make_filter<Command*, Command*>(
            tbb::filter::serial_in_order,
            [&](Command* p_pCmd) -> Command* {
                ...
                return p_pCmd;
            }) &
        tbb::make_filter<Command*, void>(
            tbb::filter::parallel,
            [&](Command* p_pCmd) {
                delete p_pCmd;
                --currentInFlight;
            })
        );

The idea here is to perform a memory-counting calculation once in a while, but not too often.  When this occurs, we wait for the pipeline to flush itself and let currentInFlight decrement to 0.  At this point, no other thread is modifying these structures and we can safely traverse them.  Then we let the flow continue and issue new commands to process.

However, sometimes currentInFlight never reaches 0 and the first stage of the pipeline spins in that wait loop forever.  This seems to be an impossible condition, since all code paths should lead to a currentInFlight decrement.

How could this be possible?  

It looks as though commands are sleeping somewhere in the pipeline.

Or has such a command been stolen by the thread that is executing the first stage (and is spinning at that moment)?

Does it have anything to do with the last stage being parallel?

Of course, we probably need to refactor this piece of code but I'd like to understand why this is happening.

Any ideas?

14 Replies
Alexey-Kukanov
Employee

Are there any nested TBB parallel constructs in the pipeline stages?

MLema2
New Contributor I

No, there are no nested parallel constructs in this pipeline.  Could that have an impact?

However, the system might be running other TBB tasks at the same time on other threads.

RafSchietekat
Valued Contributor III

Blocking is never recommended, which is putting it very mildly...

Maybe multiple items are stuck in the intermediate stage, a task exits the pipeline with ++my_pipeline.input_tokens reaching 2 and so doesn't recycle itself (otherwise there might be a race on the input stage), and then the only remaining stage_task stalls in the input stage? That only happens with unfortunate timing, i.e., very rarely. Try counting each item as it enters and leaves each stage. You should see some items that have left the initial stage without having started the second stage, indicating that they are stuck in the input buffer of the intermediate stage.
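
For illustration, here is a minimal, self-contained sketch of that per-stage counting on a toy three-stage pipeline (not your application code; the counter names and the 1000-item limit are invented):

    #include <atomic>
    #include <cstdio>
    #include "tbb/pipeline.h"

    struct Command { int payload; };

    int main() {
        std::atomic<long> leftStage1(0), enteredStage2(0), leftStage2(0), enteredStage3(0);
        long produced = 0;

        tbb::parallel_pipeline(8 /* max tokens in flight */,
            tbb::make_filter<void, Command*>(
                tbb::filter::serial_in_order,
                [&](tbb::flow_control& fc) -> Command* {
                    if (produced == 1000) { fc.stop(); return nullptr; }
                    ++produced;
                    ++leftStage1;                 // item left the initial stage
                    return new Command{ static_cast<int>(produced) };
                }) &
            tbb::make_filter<Command*, Command*>(
                tbb::filter::serial_in_order,
                [&](Command* cmd) -> Command* {
                    ++enteredStage2;              // item started the middle stage
                    // ... middle-stage work ...
                    ++leftStage2;
                    return cmd;
                }) &
            tbb::make_filter<Command*, void>(
                tbb::filter::parallel,
                [&](Command* cmd) {
                    ++enteredStage3;              // item reached the last stage
                    delete cmd;
                }));

        // In a hung process dump, leftStage1 > enteredStage2 would indicate tokens
        // parked in the input buffer of the serial middle stage.
        std::printf("left1=%ld entered2=%ld left2=%ld entered3=%ld\n",
                    static_cast<long>(leftStage1), static_cast<long>(enteredStage2),
                    static_cast<long>(leftStage2), static_cast<long>(enteredStage3));
        return 0;
    }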

I did find myself frowning at the use of end_of_input from multiple threads, not all of them reading... And why is the condition on my_pipeline.input_tokens also obeyed if the initial stage is parallel?

My diagnosis is that the pipeline is not misbehaving, and that the problem is caused by blocking in the initial stage. Instead, you should simply stop and restart the pipeline.

(Added) Disclaimer: I have not fully analysed the code, so this is only my best informed guess.

RafSchietekat
Valued Contributor III

An essential detail is that the task that just exited the pipeline (without recycling itself and bypassing the scheduler) must have been stolen. Otherwise its thread would next still execute the task that was spawned when it finished the second stage. It sounds contrived, but weirder things have happened in the parallel world... disclaimer still applying, of course.

I feel it's time once more to rant about that my_ prefix for member variables. Reading the code always takes me back to that scene in Finding Nemo where the seagulls are screaming "Mine! Mine! Mine! Mine! Mine! Mine! Mine!". I might bring myself (sorry for the unintended alliterations) to accept it for permanent or long-term associations, even though it would still be very annoying, but TBB uses it also for temporary ones (stage_task::my_filter), and that's just plain nonsense! It also takes one character more than the more modest m_ prefix (with m for member, of course). There, now I'm feeling better again... :-)

(Edited) Removed comment about IDEs.

MLema2
New Contributor I

Thanks a lot for your time!  

That seems to be a reasonable answer to our problem. 

We have had about 5-6 instances of this issue in the last 2 years, sometimes on low-performance testing VMs and sometimes on very high-performance quad 16-core Xeon servers, but never on the middle class of machines.  That somewhat validates the theory of a task being stolen at a very precise moment, either because the machine is so slow that it always context-switches at the wrong time, or because the machine processes so many tasks that the likelihood of this problem increases.

Of course, we will refactor this part of the pipeline to make sure we do not block in the first stage.  

Alexey-Kukanov
Employee

It is reasonable to assume that some token(s) got stuck in the middle and cannot complete the pipeline.

In a normal pipeline execution, a token might wait for its turn on entering a serial filter. However, the thread that currently executes that filter must upon exit spawn a task for the next token (if one exists) to be processed. I have read and re-read Raf's posts but still cannot imagine/understand that "unfortunate timing" scenario. Of course it's possible that a bug is lurking somewhere.

It might also happen that a stage is blocked in the middle of execution. For that, the thread that processes it needs to enter a new task dispatch loop (e.g. a parallel algorithm, a task_group::wait(), etc.). This is where nested parallelism makes the difference. The scheduler assumes that all spawned tasks are independent and can be processed in any order. In particular, the nested dispatch loop might steal a task from an outer level, i.e. a sibling of the task blocked on the thread's stack waiting for the nested dispatch loop to exit. So if the tasks at the outer level are dependent (as in your scenario), the thread might end up waiting for itself, i.e. stuck in a deadlock.

I agree that the best solution is to avoid extra dependencies between pipeline stages.

MLema2
New Contributor I

Hi Alexey, thanks for the additional insights.

In our case, we have different "systems" that each have their own usage of the TBB scheduler.  However, this particular pipeline does not spawn nested tasks. In fact, that used to be the case: we had a parallel_sort nested deep down in a stage and, as you explained, we indeed experienced deadlocks. That led us to replace it with standard std::sort calls.
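
For what it's worth, the swap itself was trivial; a sketch of the idea (the helper and its data are invented for illustration, not our real code):

    #include <algorithm>
    #include <vector>

    // Hypothetical stage helper: sort a command's payload without re-entering
    // the TBB scheduler from inside the pipeline stage.
    void SortPayload(std::vector<int>& values) {
        // tbb::parallel_sort(values.begin(), values.end());  // old version: nested
        //                                                     // parallelism inside a stage
        std::sort(values.begin(), values.end());               // stays on the calling thread
    }
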
RafSchietekat
Valued Contributor III

I think I'll need to invoke my disclaimer, for starters... There's an embarrassing and fatal flaw in my reasoning, and I'm not sure it can be rescued. Sorry for misleading you.

I blame the seagulls!

RafSchietekat
Valued Contributor III

Just to be sure, was the pipeline definitely blocked forever even after other work ceased and other TBB threads became available to help out where they could, or did this situation only persist while all other TBB threads were busy elsewhere?

Also, is all TBB work started from the same application thread, or is the pipeline separate? Threads in different arenas don't steal from each other, so the situation is then simplified (although it may still not be exactly fair, because threads only migrate between arenas when they become idle).

Meanwhile, even though I was wrong in my diagnosis, the advice to stop and restart the pipeline still looks valid (you're probably stepping out of harm's way, and without sacrificing performance because you were already draining the pipeline). It would still be nicer to know what's going on, of course...

MLema2
New Contributor I

Each time we encountered this problem, we did a dump of the process.  No other thread was doing TBB work.  

We have multiple threads spawning TBB work: some for user requests and others for server processing.

Alexey-Kukanov
Employee

I think it can be caused by so-called missed wakeup in the task scheduler.

For the sake of better performance, the scheduler makes certain assumptions about the way tasks are used. In particular, there are important assumptions for spawned tasks: they are ready to execute (i.e. have no unresolved dependencies; in particular, two spawned tasks should never depend on each other), and most of the time more tasks will be spawned.

Due to these assumptions, the protocol that detects whether any tasks are left in a work arena may miss the last task spawned by a thread; we call it a "missed wakeup" because, from another viewpoint, the spawning thread did not detect ("missed") that it should signal task availability. In other words, threads may leave the arena despite some tasks still being there; and it is not considered an issue because (a) the next spawned task (which is likely to come, see above) will re-invite worker threads, (b) the thread that spawned a task is the one responsible for executing it (i.e. a worker thread may not leave the arena until its own task pool is empty), and (c) parallelism is optional and not guaranteed, so an application thread must participate in executing the work it created.

However, your case mandates concurrency, because the waiting task expects other tasks to make progress (more exactly, it requires that concurrency, once it has appeared, remains available until the end of pipeline execution), thus breaking the above assumptions.

The problem might happen as follows:

  • The application (master) thread executes the middle serial stage for a token, while all other allowed tokens are waiting for their turn to enter that stage. No tasks are available to steal.
  • The master thread completes the middle stage and spawns a task to process the next token, but that task is missed by a worker thread that is checking whether the arena still has work.
  • The worker thread decides that there is no work left, and all workers leave the arena.
  • The master thread passes its current token through the rest of the pipeline. To start processing a new token, it reuses the task object at hand and does not spawn a new task.
  • The master thread enters the input stage with the new token and blocks there, waiting for the previous tokens to finish.

As a result, the pipeline cannot make progress anymore.

RafSchietekat
Valued Contributor III

Time to reassign the Best Reply label!

So the takeaway is that this is indeed caused by requiring concurrency (even if it was previously available!), and that the solution (not workaround!) is therefore (still) to stop and restart the pipeline instead of blocking in the first stage.
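
To make that concrete, here is a minimal sketch of such a restructuring, reusing the names and declarations from the code in the original post; RecountMemory() stands in for the elided memory-counting step and is hypothetical:

    // Run the pipeline repeatedly; do the memory check between runs, when no
    // commands can be in flight, instead of blocking inside the input stage.
    while (!NeedToStop() && HasWorkToDo()) {
        if (needToCheckCheckMemory) {
            // parallel_pipeline has returned, so all previously issued commands
            // have passed the last stage; the structures can be traversed safely.
            RecountMemory();
        }

        tbb::parallel_pipeline(kMaxInFlightCommands,
            tbb::make_filter<void, Command*>(
                tbb::filter::serial_in_order,
                [&](tbb::flow_control& p_FC) -> Command* {
                    // Stop (and later restart) instead of spinning here.
                    if (NeedToStop() || !HasWorkToDo() || needToCheckCheckMemory) {
                        p_FC.stop();
                        return nullptr;
                    }
                    return new Command;
                }) &
            tbb::make_filter<Command*, Command*>(
                tbb::filter::serial_in_order,
                [&](Command* p_pCmd) -> Command* {
                    // ... same middle-stage processing as before ...
                    return p_pCmd;
                }) &
            tbb::make_filter<Command*, void>(
                tbb::filter::parallel,
                [&](Command* p_pCmd) {
                    delete p_pCmd;
                    // currentInFlight is no longer needed: the pipeline returning
                    // is the "everything drained" signal.
                }));
    }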

I'm left wondering (see #4) about what to do when the initial stage is or could be parallel, though. Currently it seems that there is an unnecessary choke point, where tasks falling off the end are not always recycled, with larger values of max_number_of_live_tokens causing a lower chance of recycling, and the effect, if not the aim, seems to be to always execute the initial stage serially, even though the Reference Manual has a special statement about parallel input stages. Then there's end_of_input, which should probably be atomic. And I also thought that it is unnecessary to extract task_info just to store it in an input_buffer, instead of keeping the tasks (as I seem to remember from an earlier implementation), which would probably still work with thread_bound_filter stages: isn't it always better to recycle (also when wrapping around)?

MLema2
New Contributor I

These posts are enlightening!  I think we can safely close the matter and conclude that a valid explanation has been found.

Again, thanks for your time... It could have been answered with a quick "never ever block in TBB threads", but you all took it seriously and I really appreciate it!

Have yourself a merry Christmas (if that applies to you ;))

RafSchietekat
Valued Contributor III

Never ever block in TBB threads! Unless you know what you're doing. Which apparently is not easy... :-)

It should probably be noted that nested parallelism is still a good thing (as long as it has enough work to do), despite what was said above about its possible contribution to deadlock... if there's also blocking.
