Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

LIFO vs FIFO

Dmitry_Vyukov
Valued Contributor I
MADadrobiso:

TBB uses an approximation of Cilk-style scheduling (work LIFO; steal FIFO). It has good locality and load-balancing properties while making certain space guarantees. It enables clean nesting of software components, analogous to the way serial code composes by nested subroutine calls.

FIFO has terrible locality and often has no space guarantees, which often leads to complex throttling mechanisms. Nonetheless, FIFO scheduling definitely has its uses, notably for soft real-time. Finding a rational marriage of Cilk-style and FIFO scheduling remains an open problem; if done rigorously, it would be worthy of a Ph.D. thesis.

The "two roots" idea suggests that perhaps said marriage could be accomplished by some kind of epoch mechanism, where each task belongs to an epoch and older epochs are given precedence. There's a long way between a notion and a practical system.



There are two kinds of user algorithms.
The first is 'Cilk-style' computational algorithms ('HPC'). They neither require nor benefit from 'fair' processing; even worse, under fair scheduling they will degrade and require more memory.
The second is general-purpose algorithms (games, client/server software). Sometimes they simply require fair processing (at least to some degree); otherwise they can effectively deadlock.
So I think the first decision one has to make is how to cope with these kinds of user algorithms. For example, TBB can postulate that it doesn't support the second kind at all. Or TBB can have two modes of operation, i.e. the user has to set up the proper type of scheduler at the beginning of the program. Or TBB can support both kinds simultaneously, for example on a per-task basis, i.e. these tasks are HPC and those are general-purpose. Or it can just provide a single compromise scheduler.

I'm curious what the .NET TPL team will decide on this problem:
http://blogs.msdn.com/pfxteam/archive/2008/08/01/8800195.aspx
I hope they will publish some results. Btw, I see some interesting words in the comments, like PRIO, SIZE and AGE.

RafSchietekat
Valued Contributor III

It is impossible for a purely LIFO scheduler to guarantee forward progress unassisted. Why not provide an integrated scheduler to more easily keep the worker threads busy while providing this guarantee?

Each FIFO batch would run in LIFO fashion to exploit locality.

Dmitry_Vyukov
Valued Contributor I
Quoting - raf_schietekat

It is impossible for a purely LIFO scheduler to guarantee forward progress unassisted.

Why? The whole stack in the work-stealing deque will eventually be drained. This inevitably means that all tasks will be processed. What am I missing here?

Quoting - raf_schietekat

Each FIFO batch would run in LIFO fashion to exploit locality.

As I understand it, locality is when you process a task AND its CHILDREN one after another, not when you process just some batch of tasks one after another.

RafSchietekat
Valued Contributor III

"This invietably means that all tasks will be processed." Yes, but also that the LIFO scheduler isn't the whole story. How about a worthy superstructure? It takes some effort to provide efficient forward progress, not to mention handling priorities and timing constraints and whatnot. I would like a Building Block that handles that for me.

"As I understand, locality is when you process task AND its CHILDS one after another, not when you process just some batch of tasks one after another." Yes, "in LIFO fashion": while(true){ task::spawn_root_and_wait(new_batch); }

Dmitry_Vyukov
Valued Contributor I
Quoting - raf_schietekat

"This invietably means that all tasks will be processed." Yes, but also that the LIFO scheduler isn't the whole story. How about a worthy superstructure? It takes some effort to provide efficient forward progress, not to mention handling priorities and timing constraints and whatnot. I would like a Building Block that handles that for me.

"As I understand, locality is when you process task AND its CHILDS one after another, not when you process just some batch of tasks one after another." Yes, "in LIFO fashion": while(true){ task::spawn_root_and_wait(new_batch); }

I still can't catch your point. Can you briefly outline high-level pseudo-code for the scheduler, please?

RafSchietekat
Valued Contributor III

A worker thread always first exhausts its LIFO tasks.

A very simplistic ultra-fair but not terribly efficient addition would be a global queue. Just before stealing from another worker thread, a thread tries to grab a task from the global queue instead. This avoids interfering with locality, thus possibly improving throughput compared to stealing first.

Maybe just before that, a thread could try to steal from any thread that is falling behind based on the last time it tried to steal or grab a new task.

Communication costs could be amortised by also using (fixed-size) task buckets. This may mean that a task that is added later could be processed earlier than some tasks in preceding buckets, but there is still global progress (no task left behind).
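
To make the lookup order concrete, here is a rough, self-contained C++ sketch of the scheme described above (not TBB's actual implementation; Task, Worker, and the plain mutexes are simplifications, and refinements such as preferring victims that are falling behind or moving work in fixed-size buckets are omitted):

    #include <cstdio>
    #include <deque>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <vector>

    using Task = std::function<void()>;

    struct Worker {
        std::deque<Task> local;  // private deque: owner pops the newest, thieves take the oldest
        std::mutex m;            // a real scheduler would use a lock-free deque instead
    };

    struct Scheduler {
        std::vector<Worker> workers;
        std::queue<Task> global;  // the "ultra-fair" global FIFO queue
        std::mutex global_m;

        explicit Scheduler(std::size_t n) : workers(n) {}

        // Lookup order for worker `self`: 1) newest local task (LIFO),
        // 2) oldest global task (FIFO), 3) steal the oldest task from a victim.
        bool get_task(std::size_t self, Task& out) {
            Worker& w = workers[self];
            {
                std::lock_guard<std::mutex> lk(w.m);
                if (!w.local.empty()) {
                    out = std::move(w.local.back());     // best locality
                    w.local.pop_back();
                    return true;
                }
            }
            {
                std::lock_guard<std::mutex> lk(global_m);
                if (!global.empty()) {
                    out = std::move(global.front());     // forward progress
                    global.pop();
                    return true;
                }
            }
            for (std::size_t v = 0; v < workers.size(); ++v) {
                if (v == self) continue;
                std::lock_guard<std::mutex> lk(workers[v].m);
                if (!workers[v].local.empty()) {
                    out = std::move(workers[v].local.front());  // steal the oldest
                    workers[v].local.pop_front();
                    return true;
                }
            }
            return false;  // nothing anywhere; the caller may block or retry
        }
    };

    int main() {
        Scheduler s(2);
        {
            std::lock_guard<std::mutex> lk(s.global_m);
            s.global.push([] { std::puts("independent work from the global queue"); });
        }
        {
            std::lock_guard<std::mutex> lk(s.workers[0].m);
            s.workers[0].local.push_back([] { std::puts("local child, run first"); });
        }
        Task t;
        while (s.get_task(0, t)) t();  // prints the local task, then the global one
        return 0;
    }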

Dmitry_Vyukov
Valued Contributor I
Quoting - raf_schietekat

A worker thread always first exhausts its LIFO tasks.

A very simplistic ultra-fair but not terribly efficient addition would be a global queue. Just before stealing from another worker thread, a thread tries to grab a task from the global queue instead. This avoids interfering with locality, thus possibly improving throughput compared to stealing first.

Maybe just before that, a thread could try to steal from any thread that is falling behind based on the last time it tried to steal or grab a new task.

Communication costs could be amortised by also using (fixed-size) task buckets. This may mean that a task that is added later could be processed earlier than some tasks in preceding buckets, but there is still global progress (no task left behind).

Who adds tasks to the global queue, and when?

Btw, now this strongly recalls the .NET Task Parallel Library scheduler:

  1. A new thread pool thread needs to allocate its work stealing queue.
  2. When queuing a new work item, we must check to see if we're on a pool thread. If so, we will queue the work item into the work stealing queue instead of the global queue.
  3. When a pool thread looks for work, it needs to:
    • First consult its local work stealing queue.
    • If that fails, it then looks at the global queue.
    • Lastly, if that fails, it needs to steal from other work stealing queues.

http://www.bluebytesoftware.com/blog/2008/09/17/BuildingACustomThreadPoolSeriesPart3IncorporatingWorkStealingQueues.aspx
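
For illustration only, item 2 of that list might look roughly like the following standalone sketch; the names and the thread-local pointer are hypothetical, not the actual TPL (or TBB) implementation:

    #include <deque>
    #include <functional>
    #include <mutex>
    #include <queue>

    using WorkItem = std::function<void()>;

    // Hypothetical pool state: one global FIFO queue, plus a thread-local
    // pointer that the pool sets on its own threads and leaves null elsewhere.
    std::queue<WorkItem> global_queue;
    std::mutex global_mutex;
    thread_local std::deque<WorkItem>* local_deque = nullptr;

    void queue_work_item(WorkItem item) {
        if (local_deque) {
            // On a pool thread: push onto the local work-stealing deque; the
            // owner pops it LIFO, idle workers may steal FIFO from the other end.
            local_deque->push_back(std::move(item));
        } else {
            // On an arbitrary thread: fall back to the global FIFO queue.
            std::lock_guard<std::mutex> lk(global_mutex);
            global_queue.push(std::move(item));
        }
    }

    int main() {
        queue_work_item([] {});       // from a non-pool thread: goes global
        std::deque<WorkItem> my_deque;
        local_deque = &my_deque;      // pretend we are a pool thread
        queue_work_item([] {});       // goes into the local deque instead
        return 0;
    }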

RafSchietekat
Valued Contributor III

"Who and when adds tasks to global queue?" A task or a tbb_thread, basically when they detect work that is not a dependency of what they are currently doing (with some care, limited amounts of work can still be stacked locally, and a realistic solution would be more than just a global queue).

"Btw, now this strongly recalls .NET Task Parallel Library scheduler" I'm sure they also use two's complement numbers and other inescapable concepts. But I prefer being able to say I came up with them all by myself (also seems better for making a contribution), so I'll ignore this particular reference.

Dmitry_Vyukov
Valued Contributor I
Quoting - Raf Schietekat

"Who and when adds tasks to global queue?" A task or a tbb_thread, basically when they detect work that is not a dependency of what they are currently doing (with some care, limited amounts of work can still be stacked locally, and a realistic solution would be more than just a global queue).

Now I am starting to get your point. I think this can make sense. For big "user-level tasks" (not TBB tasks), fairness is definitely good.

Though one must be very careful here: how are you going to detect whether a task depends on what the thread is currently doing? Currently I don't see any way...

Quoting - Raf Schietekat

"Btw, now this strongly recalls .NET Task Parallel Library scheduler" I'm sure they also use two's complement numbers and other inescapable concepts. But I prefer being able to say I came up with them all by myself (also seems better for making a contribution), so I'll ignore this particular reference.

Oh, I see... then to be honest you must not look into the current TBB scheduler implementation either :)

RafSchietekat
Valued Contributor III

"Now I am starting getting your point. I think this can make sense. For big "user-level tasks" (not TBB tasks) fairness is definitely good." Well, with appropriate scheduling available there's no reason to make that distinction anymore.

"Though one must be very careful here, how are you going to detect whether task depends on what thread currently doing? Currently I don't see any ways..." I meant the other way around: if you don't "need" to wait for the new task to complete, then maybe you shouldn't.

"Oh, I see... then to be honest you must not look into current TBB scheduler implementation too :)" Hmm, if I want to contribute to something, I had better know how it works now.

Dmitry_Vyukov
Valued Contributor I
Quoting - Raf Schietekat

"Now I am starting getting your point. I think this can make sense. For big "user-level tasks" (not TBB tasks) fairness is definitely good." Well, with appropriate scheduling available there's no reason to make that distinction anymore.

It seems the conversation is starting to go in circles :)

Total fairness (FIFO) is not compatible with locality. Absence of locality is not compatible with performance. Hence total fairness is not compatible with performance.

What am I missing here?

Dmitry_Vyukov
Valued Contributor I
Quoting - Raf Schietekat

"Though one must be very careful here, how are you going to detect whether task depends on what thread currently doing? Currently I don't see any ways..." I meant the other way around: if you don't "need" to wait for the new task to complete, then maybe you shouldn't.

Assume one implements a fork/join algorithm, but with an empty join part, i.e. there are only forks. Your algorithm will decide that all tasks are independent. This will destroy performance and space guarantees (i.e. the user will get an out-of-memory exception).

RafSchietekat
Valued Contributor III

"What I am missing here?" Perhaps that tasks can be the common currency for things that need locality for performance and things that need forward progress?

"Assume one implementing fork/join algorithm, but with empty join part, i.e. there are only forks. Your algorithm will decide that all tasks are independent. This will destroy performance and space guarantees (i.e. user will catch out-of-memory exception)." It's not an algorithm, it's a guideline for a human programmer, who should recognise this situation as a need to wait.

Dmitry_Vyukov
Valued Contributor I
Quoting - Raf Schietekat

"What I am missing here?" Perhaps that tasks can be the common currency for things that need locality for performance and things that need forward progress?

Please rephrase the above somehow. I am unable to catch the meaning :(

Quoting - Raf Schietekat

"Assume one implementing fork/join algorithm, but with empty join part, i.e. there are only forks. Your algorithm will decide that all tasks are independent. This will destroy performance and space guarantees (i.e. user will catch out-of-memory exception)." It's not an algorithm, it's a guideline for a human programmer, who should recognise this situation as a need to wait.

Waiting does introduce overhead too (reference counting for the 'parent' task, a second execution of the parent, etc.). So it looks like trading bad for worse...

RafSchietekat
Valued Contributor III

"Please, rephrase above somehow. I am unable to catch the meaning :(" Whether you need to do something that needs locality for performance, or something that needs forward progress, they're all tasks. The difference is in how you allocate/spawn them so that the scheduler knows how to handle them.

"Waiting do introduces overheads too (reference counting for 'parent' task, second execution for parent etc). So it looks like trading bad for worse..." Hmm, the opportunity for a simple optimisation hardly seems worse than a memory explosion.

Dmitry_Vyukov
Valued Contributor I
Quoting - Raf Schietekat

"Please, rephrase above somehow. I am unable to catch the meaning :(" Whether you need to do something that needs locality for performance, or something that needs forward progress, they're all tasks. The difference is in how you allocate/spawn them so that the scheduler knows how to handle them.

Are you talking about something similar to what I've described in #12 (item 2)?

http://software.intel.com/en-us/forums/showpost.php?p=66177

Quoting - Raf Schietekat


"Waiting do introduces overheads too (reference counting for 'parent' task, second execution for parent etc). So it looks like trading bad for worse..." Hmm, the opportunity for a simple optimisation hardly seems worse than a memory explosion.

Currently, with the LIFO scheduler there is no memory explosion, and no obligation to wait for task completion.

RafSchietekat
Valued Contributor III

"Are you talking about something similar to what I've described in #12" No round robin.

"Currently with LIFO scheduler there is no memory explosion. And no need to obligatory wait for task completions." Oh yes... Well, luckily I did say "maybe" before to leave an out for myself. :-) So the guideline should be "whichever style avoids the memory explosion should be chosen", which can then be elaborated further.

Dmitry_Vyukov
Valued Contributor I
Quoting - Raf Schietekat

"Are you talking about something similar to what I've described in #12" No round robin.

Well, if the user manually and explicitly distinguishes which tasks require FIFO and which are OK with LIFO, then I think it's a viable idea.

Currently I am sceptical only about automatic distinction based on some heuristics.

RafSchietekat
Valued Contributor III
I wrote: "which can then be elaborated further". And before you challenge me on that, it is meant to imply that something like a chain reaction should be scheduled for LIFO, because the need to keep memory requirements down to O(log n) (right?) sort of implies the soft kind of "wait" I intended, i.e., depth-first scheduling, while external inputs (with externally imposed granularity) in a continuous process imply FIFO even if only for the simple reason that starving earlier tasks would cause them to pile up (fairness for self-preservation instead of for its own sake). Well, the effect of FIFO here is only partial of course: finishing up business with your earlier customers first gets them out of the store faster, which does clear up space, but without throttling you can still be overrun.

robert_jay_gould
Beginner
Just an opinion as a library user programming a non-HPC application; I'm working instead on a discrete simulation server.

After I read that tbb::task works in a simple LIFO manner (optimized and all, of course), I decided on the spot it wasn't for me.

Which is good, because I knew immediately that it wouldn't work in my scenario, so I ended up writing more or less a custom task manager (FIFO + epochs) that suits my needs.
Basically, I accumulate tasks into epochs (within which task order is required to be irrelevant) and then operate on each epoch in an arbitrary order (which just happens to be LIFO). But I do want the epochs themselves to be FIFO, of course, as time is supposed to flow forward :)
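
Something along these lines, as a minimal sketch (the class and the names here are hypothetical illustrations, not the poster's actual code):

    #include <deque>
    #include <functional>
    #include <vector>

    using Task = std::function<void()>;

    // FIFO across epochs, arbitrary (here: LIFO) order inside an epoch.
    class EpochScheduler {
        std::deque<std::vector<Task>> epochs_;  // front = oldest epoch
    public:
        void begin_epoch() { epochs_.emplace_back(); }
        void submit(Task t) {
            if (epochs_.empty()) begin_epoch();
            epochs_.back().push_back(std::move(t));  // accumulate into the newest epoch
        }
        // Drain the oldest epoch completely before touching a newer one.
        void run_oldest_epoch() {
            if (epochs_.empty()) return;
            std::vector<Task>& e = epochs_.front();
            while (!e.empty()) {
                Task t = std::move(e.back());  // order inside the epoch is irrelevant
                e.pop_back();
                t();
            }
            epochs_.pop_front();
        }
    };

    int main() {
        EpochScheduler s;
        s.begin_epoch();
        s.submit([] { /* simulation work for time step 1 */ });
        s.submit([] { /* more work for time step 1 */ });
        s.begin_epoch();
        s.submit([] { /* work for time step 2 */ });
        s.run_oldest_epoch();  // time step 1 drains first: time flows forward
        s.run_oldest_epoch();
        return 0;
    }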

On the other hand, if tbb::tasks were processed in a purely FIFO manner, I probably wouldn't have used it either :)

Basically, what I'm trying to say is that anyone working on a non-HPC simulation will likely not use a default task manager anyway, simply because it can't possibly map directly to their needs. So I don't think TBB needs a FIFO manager; it would be an addition that probably no one would use.
RafSchietekat
Valued Contributor III

"So I don't think tbb needs a FIFO manager, it would be an addition no one probably use." Well, LIFO isn't strictly LIFO either, just a general characterisation, but I would not rule out the occasional use for strict FIFO. (Note that I only mentioned the idea of a global queue to avoid the nitty-gritty of implementation details, not to suggest that it might somehow provide strict FIFO.)

RafSchietekat
Valued Contributor III
Quoting - Raf Schietekat

"So I don't think tbb needs a FIFO manager, it would be an addition no one probably use." Well, LIFO isn't strictly LIFO either, just a general characterisation, but I would not rule out the occasional use for strict FIFO. (Note that I only mentioned the idea of a global queue to avoid the nitty-gritty of implementation details, not to suggest that it might somehow provide strict FIFO.)

Hmm, I seem to be a bit careless lately: I linked this sentence with the previous paragraph. Now I would have to simply disagree: forward progress is a very simple and general concept to start with.
