hidden text to trigger early load of fonts ПродукцияПродукцияПродукцияПродукция Các sản phẩmCác sản phẩmCác sản phẩmCác sản phẩm المنتجاتالمنتجاتالمنتجاتالمنتجات מוצריםמוצריםמוצריםמוצרים
Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

blocking style task scheduling disadvantage.

amitm02
Beginner
761 Views
Hi All,
The TBB book states that blocking style (spawn_and_wait_for_all()) had the following disadvantage:
"the scheduler limit possible victim tasks to those deeper then waiting task [due to memory issues]. this limit modifyies the policy that the shallowest task should be chosen.
allegedly, this is a disadvantage beacuse doing "breadth-first" execuation (picking the "shallowest tasks),maximize parallelism" as it "unfold the tree more", creating more open tasks for the working threads.
however, to my understanding, the limit of "possible victim tasks" is only for the working thread of the waiting task. i.e all other threads are free the steal the shallowest tasks.
and so:
1) parrallisem is maxed due to the other working threads stealing tasks in"breadth-first" execuation.
2) the working thread of the waiting task is picking tasks that are more hot in cache (good thing).
so where is the disadvatage?
Yours Amit
0 Kudos
12 Replies
RafSchietekat
Valued Contributor III
761 Views

Since that was written, the scheduler has been changed so that task depth is no longer a consideration. A full answer probably needs to consider NUMA issues, and should be based on more than just intuition, but not favouring breadth-first scheduling increases the frequency of task stealing, which is rumoured to be very costly.

0 Kudos
amitm02
Beginner
761 Views
Thanks Raf,
Is there any information resource i can use to understand better how the current scheduler operate?
I'm courise, why "not favouring breadth-first scheduling increases the frequency of task stealing"?
yours
amit
0 Kudos
RafSchietekat
Valued Contributor III
761 Views

If the amount of work is constant and the scheduler bites off smaller pieces (depth-first), there will be more pieces...

0 Kudos
Andrey_Marochko
New Contributor III
761 Views
I've started answering your question, but the reply was getting longer and longer, so I decided to convert it into a blog post. You can find it here. Hope it will clarify some of your questions.

What I have not mentioned in the blog is another couple of blocking style disadvantages. See my short post in another forum thread.
0 Kudos
RafSchietekat
Valued Contributor III
761 Views
Nice. Should I have immediately seen why it's a relatively rare occurrence for a thread to be blocked for excessive stack use, or is that a matter of experimentation?
0 Kudos
Andrey_Marochko
New Contributor III
761 Views
For me and, I believe, for other TBB designers it was rather a conclusion based on the field experience. The common sence reasoning suggests that in case of more or less well balanced workloads stealing intensifies only closer to the end of processing, when remaining chunk are small enough to result in dangerously deep recursion. Even in case of substantially imbalanced work distribution threads eventually hit a motherload, and happily gnewing through it most of their time.

Besides default stack sizes on modern OSes are large enough (normally 1M or above), while most of the numerical algorithms keep reasonably small amounts of data on stack.

After all it was exactly the absence of any reliable theoretical foundation why the initial design went to such great (and painful) lengths to preclude this possibilit :)
0 Kudos
RafSchietekat
Valued Contributor III
761 Views
Murphy's Law pretty much guarantees that somebody will challenge this assumption, though, and then they might blame TBB. Could relief come in the form of bigger stacks, or tapping stand-by threads (dynamically extending a cumulatiive stack, up to a finite multiplication factor)?
0 Kudos
Andrey_Marochko
New Contributor III
761 Views
I think that the way to move further would be eliminating blocking style at all, by means of switching to another stack (which would mean stand-by thread in many OSes) before stealing from the nested parallelism level. The biggest question here is the cost of such switches, and it looks like only experiment will be able to answer it.
0 Kudos
Dmitry_Vyukov
Valued Contributor I
761 Views
I think that the way to move further would be eliminating blocking style at all, by means of switching to another stack (which would mean stand-by thread in many OSes) before stealing from the nested parallelism level. The biggest question here is the cost of such switches, and it looks like only experiment will be able to answer it.

You do not need full-fledged threads for that. Light-weight fibers will do. Each worker thread need some limited amount of worker fibers to switch between them on blocking. On Windows fiber switch is very lightweight, it just saves current registers in current fiber context and restores registers from new fiber context, on par with ~20 cycles. Unfortunately, on Linux fiber switch is considerably heavier weight, because it needs to do a syscall in order to switch the signal mask.

0 Kudos
Andrey_Marochko
New Contributor III
761 Views
Sure we know about fibers. But as you noted yourself, they are supported on Windows only, and TBB will have to preserve its efficiency on other OSes too.
0 Kudos
Dmitry_Vyukov
Valued Contributor I
761 Views
Sure we know about fibers. But as you noted yourself, they are supported on Windows only, and TBB will have to preserve its efficiency on other OSes too.

But what is the alternative? Sit idle?


0 Kudos
Dmitry_Vyukov
Valued Contributor I
761 Views
Sure we know about fibers. But as you noted yourself, they are supported on Windows only, and TBB will have to preserve its efficiency on other OSes too.

Signal mask reset overhead is not fundamental, since you do not actually need signal mask reset during switch, right? Just do what Windows does.

0 Kudos
Reply