Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Unexpected wait_for_all behavior

Florian_Z_
Beginner

Hi. I was wondering if someone can clarify what wait_for_all does, and why it does not behave the way I expected:

  1. Consider an empty root task A with n short-running child tasks. The children of A are spawned on the main thread. A's reference count is set to n+1.
  2. We also have an empty root task B with one long-running child task. I also spawn this child on the main thread. B's reference count is set to 2.
  3. Now we call A->wait_for_all() (see the sketch just below this list).
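For concreteness, the setup looks roughly like this (a sketch against the old tbb::task interface; ShortChild and LongChild stand in for my actual task classes, and n is the child count from step 1):

#include "tbb/task.h"

tbb::empty_task& A = *new (tbb::task::allocate_root()) tbb::empty_task;
A.set_ref_count(n + 1);                                  // n children + 1 for wait_for_all
for (int i = 0; i < n; ++i)
    tbb::task::spawn(*new (A.allocate_child()) ShortChild);

tbb::empty_task& B = *new (tbb::task::allocate_root()) tbb::empty_task;
B.set_ref_count(2);                                      // 1 child + 1 extra
tbb::task::spawn(*new (B.allocate_child()) LongChild);

A.wait_for_all();                                        // step 3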

The call to wait_for_all() won't return until B's long-running child task has been executed. This makes sense, right? The long-running task will be added to the back of the main thread's ready pool. Due to the LIFO scheduling order on the local ready pool, the long-running task is almost guaranteed to be picked up for execution on the main thread, and wait_for_all() can't return until the long-running task is done.

It is my understanding that long-running tasks don't really fit well into a task-parallel programming world: my long-running task simply queries some state (think of a queue data structure), does some amount of work, and then queries the state again. Instead of looping over the available state, I modified the task to do one unit of work at a time and then recycle itself for re-execution (increment the reference count, recycle as a safe continuation). I thought that this would give the scheduler a chance to check the reference count on root task A and return from the call to wait_for_all() on the main thread.
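The modified child looks roughly like this (a sketch; do_one_unit_of_work() and more_work_available() are placeholders for my actual logic):

class LongChild : public tbb::task {
public:
    tbb::task* execute() {
        do_one_unit_of_work();              // one short slice instead of an internal loop
        if (more_work_available()) {
            increment_ref_count();          // keep the task pending...
            recycle_as_safe_continuation(); // ...so it is re-executed after execute() returns
        }
        return NULL;                        // no explicit scheduler bypass
    }
};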

Still, I am left with the same problem as before. Calling wait_for_all() on root task A won't return until all children of A, as well as all continuations of the long-running child of B, are done executing. Any ideas?

RafSchietekat
Valued Contributor III

You have to set the reference count before spawning the first child. Your description suggests that you're doing it the wrong way around.

wait_for_all() could still return before B's long-running child has finished, if somehow all of A's children were stolen and executed before wait_for_all() was called and before the current thread started work on B's long-running child. So for correctness you can't preclude such an execution, however unlikely, but it is indeed not something you can count on.

Long-running tasks don't fit well in the sense that it's better to give the scheduler some "parallel slack" by breaking up the work into smaller pieces. Here of course it isn't really parallel slack because the pieces are strung together sequentially, but it should still allow the thread to finish A->wait_for_all() before B's child's "incarnations" (so to say) finish executing, if meanwhile other threads are stealing A's child tasks.

So you tried that and it didn't help? Are other threads available to help out, or are they just doing their own thing?

If you can't count on those other threads, and you still need a solution where A is not stuck behind B, realise that TBB's LIFO strategy is all about maximising throughput at the expense of fairness, and if you want fairness you have to ask for it by using enqueue() rather than spawn().
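In code, the difference is just this (a sketch; longRunningChild stands for B's child task):

tbb::task::spawn(longRunningChild);   // local pool, newest-first (LIFO): favours throughput
tbb::task::enqueue(longRunningChild); // roughly FIFO, with a fairness guarantee of eventual execution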

RafSchietekat
Valued Contributor III

Obviously another solution in this situation is to first spawn B and then A's child tasks, but that would be too simple, as we're going for understanding rather than just a quick fix here.

I found a blog mentioning explicitly how, starting with TBB 3.0, wait_for_all() should be able to return as soon as the related task tree has finished executing, although the example makes a lot fewer assumptions.

One thing to check maybe is that you aren't bypassing the scheduler by returning "this" from B's child's execute().

Florian_Z_
Beginner

Hi Raf. I do set the reference count before spawning any children. Sorry, I should have made that clear.

Yes, I tried "splitting up" the long-running task by having it recycle itself as a continuation. Unfortunately, that did not fix the problem. It's as if wait_for_all() does NOT check whether the reference count has reached 1 in between scheduling the continuations. Other threads are available and happily execute all the children of A. The main thread churns through all the continuations of the long-running B child before returning from wait_for_all().

Thanks for the suggestion regarding the use of enqueue() instead of spawn(). Using enqueue() for the long-running task fixes the problem, but I can't convince myself that this is due to the FIFO strategy of enqueue(). Instead, I believe it is due to the scheduler rule that states that enqueued tasks always execute on a worker thread. In my case, since I am spawning all these tasks on the main thread, this means that the main thread will never be "blocked" by the long-running task, and wait_for_all() returns as expected. However, in my case there is no guarantee that all these tasks will be spawned from the main thread. They could very well be spawned from a worker, in which case the long-running task may again be picked up by the worker that enqueued it in the first place. Does that make sense?

I also understand that TBB optimizes for maximum throughput, and that's great. I want the "incarnations" of the long-running task to run on the main thread while the workers execute the children of A. I just really need the scheduler to check the reference count before scheduling more incarnations of the long-running task on the main thread, so that wait_for_all() returns in a timely manner.

Thanks for the link to the blog post. The article seems to suggest that what I am trying to do should indeed work. I am baffled. Could this be a TBB bug?

RafSchietekat
Valued Contributor III

"Instead, I believe this is due to the scheduler rule that states that enqueued tasks will always execute on a worker thread."

And I don't believe that that's necessarily true. The documentation does say "In the case where there would ordinarily be no other worker thread to execute an enqueued task, the scheduler creates an extra worker.", but it does not say that only worker threads execute enqueued tasks. What you conclude from your premise does make sense to me and is an interesting hypothesis, but I would have to see some evidence before I would accept it.

"Could this be a TBB bug?"

Those are not unheard of...

Let's try the following, as an experiment and perhaps even as a workaround: still use recycle_as_safe_continuation(), but with the reference count bumped to 2 to accommodate a child empty_task that is then spawned, with execute() still returning NULL. The theory is that tally_completion_of_predecessor() will then not set t_next to the recycled task, because the reference count is still larger than one (unless another thread steals and executes the empty_task in the meantime, but that's not the most likely scenario, and it could be made even less likely by using a task with a small delay instead). Setting t_next to the recycled task is a very suspicious thing to do anyway in code that's also used for recycle_to_enqueue(), considering that, according to the documentation, enqueued tasks are almost at the bottom of the scheduler algorithm's preference list!
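Inside B's child's execute(), the experiment would look roughly like this (a sketch; do_one_unit_of_work() is a placeholder):

class LongChild : public tbb::task {
public:
    tbb::task* execute() {
        do_one_unit_of_work();
        recycle_as_safe_continuation();
        set_ref_count(2);                  // 1 for the dummy child + 1 consumed when execute() returns
        tbb::task& dummy = *new (allocate_child()) tbb::empty_task;
        tbb::task::spawn(dummy);           // its completion goes through the normal bookkeeping
        return NULL;                       // still no explicit bypass
    }
};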

You could also try the deprecated recycle_to_reexecute() (TBB team: the source code still says "may become deprecated"), with a root empty_task returned from execute(), or recycle_to_enqueue() (TBB team: not documented?), just to verify that it doesn't solve the problem, as partially explained above.


Florian_Z_
Beginner

Thanks for the pointers, Raf! I looked at the TBB scheduler code you pointed out, and I was able to figure out why wait_for_all does not check the reference count:

  • The innermost loop of the scheduling algorithm handles the scheduler bypass, i.e. the task pointer returned from the execute() method.
  • Bypassing the scheduler also bypasses the check for the parent's reference count reaching 1. The scheduler only checks the reference count once it leaves the innermost loop and enters the "middle" loop. The middle loop is responsible for retrieving the next task from the local task pool.
  • The scheduler implements an optimization for tasks that have been recycled as continuations: if the recycled task has no pending dependencies (children) AND does not bypass the scheduler (i.e. it returns NULL from its execute() method), then the recycled continuation is scheduled via the scheduler bypass mechanism.

Individually, all these points make sense: It's safe and fast to elide the reference count check while the scheduler is being bypassed. Instead of putting a recycled task with no dependencies and no scheduler bypass back into the task pool, it's faster to just schedule it like a bypass.

However, together they cause this unexpected and undocumented behavior: the implicit scheduler bypass also bypasses the reference count check.

My current workaround looks like this:

increment_ref_count();                          // bump the ref count so the recycled task stays pending
recycle_as_safe_continuation();                 // re-execute once the ref count drops back
return new (allocate_root()) tbb::empty_task(); // dummy bypass task: forces the scheduler back into the middle loop

Returning the empty task via the scheduler bypass mechanism prevents the optimization from taking place. Instead, the inner loop will pick up the empty task, then return to the middle loop, check the parent reference count, and return from wait_for_all() if possible. This solution should also be safe if another worker steals the recycled continuation while the empty task is "executing" on the current thread, because the current thread is guaranteed to enter the middle loop.

Having said that, it's not very pretty, and probably also not as fast as it could be: There is some unnecessary task allocation, as well as execution of an empty task, going on here. I am not sure if there is a better workaround, though.

By the way, recycle_to_reexecute() does not suffer from this problem, but it also requires at least the allocation of an empty task.
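For reference, that variant looks roughly like this (a sketch; do_one_unit_of_work() is again a placeholder):

class LongChild : public tbb::task {
public:
    tbb::task* execute() {
        do_one_unit_of_work();
        recycle_to_reexecute();                         // deprecated, but not affected by the optimization
        return new (allocate_root()) tbb::empty_task(); // recycle_to_reexecute() requires a non-NULL next task
    }
};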

RafSchietekat
Valued Contributor III

"The implicit scheduler bypass also bypasses the reference count check."

Yes, but you would immediately drop out of that scheduler bypass loop if you had a recycle_as_safe_continuation() with children, so that's what I asked you to check, because then we have some more information for a bug report. Of course returning an empty_task also works, just like it does for the deprecated recycle_to_reexecute(), but we wouldn't really want that to remain a required workaround.

I think that tally_completion_of_predecessor() should be adapted so that it doesn't set bypass_slot if local_wait_for_all() can already return, unless perhaps there's a very good reason not to make any change there (which would also require a clear warning in the documentation!). TBB team (also see remarks above)?

Florian_Z_
Beginner

Yes, if the recycled task has children, wait_for_all() also immediately drops out of the scheduler bypass loop. The reason I chose to return an empty task instead of spawning a child is that it is a more reliable way of dropping out of the loop. As you mentioned in comment #5, another worker may steal and run the child to completion before the current thread enters tally_completion_of_predecessor(). Although unlikely, this scenario would again cause the bypass optimization to kick in, because the task's reference count would already have reached 1.

I agree with everything you said: unless this optimization yields big gains, I suggest that bypass_slot not be set if wait_for_all() can return. Alternatively, this behavior should be well documented.

RafSchietekat
Valued Contributor III

I've had another look, and it seems to me that the burden of proof/persuasion is on not returning as soon as possible, i.e., the status quo. But just having some task hot in cache does not really seem important enough, and even on the hot path it seems cheap enough to check.

It also seems easy enough to correct this, but, since this is more generally relevant than something else I'm currently investigating, I'd prefer it if the TBB team would first express their opinion. Of course, if you or somebody else wants to take a shot at this...

RafSchietekat
Valued Contributor III

Could you try the following (it's past midnight for me now, so no guarantees...):

$ diff tbb43_20150316oss{,.justTryingSomething}/src/tbb/custom_scheduler.h
90c90
<     void tally_completion_of_predecessor( task& s, task*& bypass_slot ) {
---
>     void tally_completion_of_predecessor( task& s, task*& bypass_slot, __TBB_atomic reference_count & ref_count, const reference_count quit_point ) {
117c117
<         if( bypass_slot==NULL )
---
>         if( bypass_slot==NULL && as_atomic(ref_count).load<relaxed>() != quit_point )
495c495
<                             tally_completion_of_predecessor(*s, t_next);
---
>                             tally_completion_of_predecessor(*s, t_next, parent.prefix().ref_count, quit_point);
509c509
<                         tally_completion_of_predecessor(*t, t_next);
---
>                         tally_completion_of_predecessor(*t, t_next, parent.prefix().ref_count, quit_point);
$ 

 
