blocking style task scheduling disadvantage.

amitm02 · ‎11-25-2010

Hi All,

The TBB book states that blocking style (spawn_and_wait_for_all()) had the following disadvantage:

"the scheduler limit possible victim tasks to those deeper then waiting task [due to memory issues]. this limit modifyies the policy that the shallowest task should be chosen.

allegedly, this is a disadvantage beacuse doing "breadth-first" execuation (picking the "shallowest tasks),maximize parallelism" as it "unfold the tree more", creating more open tasks for the working threads.

however, to my understanding, the limit of "possible victim tasks" is only for the working thread of the waiting task. i.e all other threads are free the steal the shallowest tasks.

and so:

1) parrallisem is maxed due to the other working threads stealing tasks in"breadth-first" execuation.

2) the working thread of the waiting task is picking tasks that are more hot in cache (good thing).

so where is the disadvatage?

Yours Amit

RafSchietekat · ‎11-25-2010

Since that was written, the scheduler has been changed so that task depth is no longer a consideration. A full answer probably needs to consider NUMA issues, and should be based on more than just intuition, but not favouring breadth-first scheduling increases the frequency of task stealing, which is rumoured to be very costly.

amitm02 · ‎11-25-2010

Thanks Raf,

Is there any information resource i can use to understand better how the current scheduler operate?

I'm courise, why "not favouring breadth-first scheduling increases the frequency of task stealing"?

yours

amit

RafSchietekat · ‎11-25-2010

If the amount of work is constant and the scheduler bites off smaller pieces (depth-first), there will be more pieces...

Andrey_Marochko · ‎12-09-2010

I've started answering your question, but the reply was getting longer and longer, so I decided to convert it into a blog post. You can find it here. Hope it will clarify some of your questions.

What I have not mentioned in the blog is another couple of blocking style disadvantages. See my short post in another forum thread.

RafSchietekat · ‎12-09-2010

Nice. Should I have immediately seen why it's a relatively rare occurrence for a thread to be blocked for excessive stack use, or is that a matter of experimentation?

Andrey_Marochko · ‎12-09-2010

For me and, I believe, for other TBB designers it was rather a conclusion based on the field experience. The common sence reasoning suggests that in case of more or less well balanced workloads stealing intensifies only closer to the end of processing, when remaining chunk are small enough to result in dangerously deep recursion. Even in case of substantially imbalanced work distribution threads eventually hit a motherload, and happily gnewing through it most of their time.

Besides default stack sizes on modern OSes are large enough (normally 1M or above), while most of the numerical algorithms keep reasonably small amounts of data on stack.

After all it was exactly the absence of any reliable theoretical foundation why the initial design went to such great (and painful) lengths to preclude this possibilit :)

RafSchietekat · ‎12-09-2010

Murphy's Law pretty much guarantees that somebody will challenge this assumption, though, and then they might blame TBB. Could relief come in the form of bigger stacks, or tapping stand-by threads (dynamically extending a cumulatiive stack, up to a finite multiplication factor)?

Andrey_Marochko · ‎12-10-2010

I think that the way to move further would be eliminating blocking style at all, by means of switching to another stack (which would mean stand-by thread in many OSes) before stealing from the nested parallelism level. The biggest question here is the cost of such switches, and it looks like only experiment will be able to answer it.

Dmitry_Vyukov · ‎12-10-2010

Quoting Andrey Marochko (Intel)

I think that the way to move further would be eliminating blocking style at all, by means of switching to another stack (which would mean stand-by thread in many OSes) before stealing from the nested parallelism level. The biggest question here is the cost of such switches, and it looks like only experiment will be able to answer it.

You do not need full-fledged threads for that. Light-weight fibers will do. Each worker thread need some limited amount of worker fibers to switch between them on blocking. On Windows fiber switch is very lightweight, it just saves current registers in current fiber context and restores registers from new fiber context, on par with ~20 cycles. Unfortunately, on Linux fiber switch is considerably heavier weight, because it needs to do a syscall in order to switch the signal mask.

Andrey_Marochko · ‎12-11-2010

Sure we know about fibers. But as you noted yourself, they are supported on Windows only, and TBB will have to preserve its efficiency on other OSes too.

Dmitry_Vyukov · ‎12-11-2010

Quoting Andrey Marochko (Intel)

Sure we know about fibers. But as you noted yourself, they are supported on Windows only, and TBB will have to preserve its efficiency on other OSes too.

But what is the alternative? Sit idle?

Dmitry_Vyukov · ‎12-11-2010

Quoting Andrey Marochko (Intel)

Sure we know about fibers. But as you noted yourself, they are supported on Windows only, and TBB will have to preserve its efficiency on other OSes too.

Signal mask reset overhead is not fundamental, since you do not actually need signal mask reset during switch, right? Just do what Windows does.