Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

TBB's UnFair Task Scheduling

Shankar1
Beginner
3,939 Views
I have a couple of questions regarding TBB's unfair scheduling.

1. I read through the user manual that describes Task Scheduler and this what it says

" Tasks in Intel Threading Building Blocks are efficient too because the scheduler is unfair. Thread schedulers typically distribute time slices in a round-robin fashion. This distribution is called fair, because each logical thread gets its fair share of time. Thread schedulers are typically fair because it is the safest strategy to undertake without understanding the higher-level organization of a program. In task-based programming, the task scheduler does have some higher-level information, and so can sacrifice fairness for efficiency. Indeed, it often delays starting a task until it can make useful progress."

And as I understand the above TBB scheduler makes sure that the thread executing a task runs until the task completes. Is this correct? If yes how is this done?

2. "Indeed, it often delays starting a task until it can make useful progress."
What does the above line mean?

3. Im planning to use the TBB scheduler in my application wherein I have a thread(boost thread) which waits on data packets to arrive. The boost thread creates a tbb::task and spawns it to the TaskScheduler and waits for taking the next data packet. The frequency at which the data packets arrive is not determinate( they may arrive at high frequency and sometimes at a low frequency). Hence the decision to use boost thread to wait on the data packets and spawn tasks for processing the packet.

And lets say I initialize the scheduler with number of physical threads( lets say 4). Also lets assume the boost thread reads four data packets and spawns tasks for them to the task scheduler( using task::spawn() ) and waits for further data packets to arrive. All the four worker threads in the scheduler steals a task each and start executing the stolen task. Now lets say a data packet arrives when all the worker threads are busy executing their task.

Now the question is when will the boost thread get its turn for collecting the data packet and spawning a task. Will it get its turn only when one of the worker thread finishes executing the task? How will the overall scheduling work now? Will it be fair or unfair?

4. Now in the above case the boost thread will always spawn a task into its own readypool and the worker threads will try to always steal from the ready pool of the boost thread. Is it possible that I target the spawn into the ready pools of other worker threads in the scheduler( may be randomly)? Because it makes sense for me to that because the boost thread will never process tasks in its own ready pool( they are always stolen by worker threads in the scheduler).

--------
Sankar
0 Kudos
44 Replies
RafSchietekat
Valued Contributor III
2,179 Views
1. There is no separate active scheduler, so a worker thread just churns along until it needs to find something else to do, at which time it interacts with the scheduler-related data structures.

2. Nothing comes to mind at this time.

3. In TBB, "stealing" means taking from another thread. If your boost thread spawns 4 tasks, 3 tasks may typically get stolen because the thread itself would have started executing one of them. If you wait_for_all(), then the thread will not be available to process further packets until all tasks have completed, so if you want to be able to race ahead of processing you might allocate an empty_task, to which you add tasks using allocate_additional_child_of(), if you can accept that these will be processed in LIFO order, otherwise you have to store the items outside of the task scheduler, e.g., in a concurrent_queue,and have each task take the oldest one.

4. You could try all sorts of things, but I see no reason to assume that this would be something that needs investigating.
0 Kudos
Dmitry_Vyukov
Valued Contributor I
2,179 Views
TBB task scheduler is a bad fit for your problem. You will get basically no advantages from it, and will have to fight it's obstacles.
I would recommend something like the following.
Create N worker threads. Create N FIFO concurrent queues. IO thread submits new packets to FIFO queues in a round-robin fashion. Worker threads try to get next packet to process from "own" queue first, if empty try to steal from other queues.
Queues must be correctly padded to prevent false-sharing of course.
The easiest way to handle worker thread blocking (when all the queues are empty) is to use hybrid spin waiting. And kernel blocking is a bit trickier, it can be easily handled with an eventcount, but unfortunately my eventcount proposal is still not accepted to the main codebase.

0 Kudos
RafSchietekat
Valued Contributor III
2,179 Views
Quoting - Dmitriy Vyukov
TBB task scheduler is a bad fit for your problem. You will get basically no advantages from it, and will have to fight it's obstacles.
I would recommend something like the following.
[...]
That seems like a big fight to avoid a small fight. :-)
0 Kudos
Shankar1
Beginner
2,179 Views
Quoting - Raf Schietekat
3. In TBB, "stealing" means taking from another thread. If your boost thread spawns 4 tasks, 3 tasks may typically get stolen because the thread itself would have started executing one of them. If you wait_for_all(), then the thread will not be available to process further packets until all tasks have completed, so if you want to be able to race ahead of processing you might allocate an empty_task, to which you add tasks using allocate_additional_child_of(), if you can accept that these will be processed in LIFO order, otherwise you have to store the items outside of the task scheduler, e.g., in a concurrent_queue,and have each task take the oldest one.

Well I dont plan to use wait_for_all(). Im only going to make the boost thread call only spawn() and wait on the next data packet to arrive. Then this means the boost thread is not executing any of the tasks. All the tasks spawned by the boost thread are stolen by threads in the Scheduler.

So I dont understand what you mean by saying "If your boost thread spawns 4 tasks, 3 tasks may typically get stolen because the thread itself would have started executing one of them".

Now in this case since all the worker threads in the scheduler are stealing tasks, the processing of data packets will be in FIFO order I guess.
0 Kudos
Shankar1
Beginner
2,179 Views
Quoting - Dmitriy Vyukov
TBB task scheduler is a bad fit for your problem. You will get basically no advantages from it, and will have to fight it's obstacles.
I would recommend something like the following.
Create N worker threads. Create N FIFO concurrent queues. IO thread submits new packets to FIFO queues in a round-robin fashion. Worker threads try to get next packet to process from "own" queue first, if empty try to steal from other queues.
Queues must be correctly padded to prevent false-sharing of course.
The easiest way to handle worker thread blocking (when all the queues are empty) is to use hybrid spin waiting. And kernel blocking is a bit trickier, it can be easily handled with an eventcount, but unfortunately my eventcount proposal is still not accepted to the main codebase.


Well I was also thinking of having N worker threads and N concurrent queues and distributing them( of course I dint think of how to handle worker thread blocking, kernel blocking). Well the only problem is that I would not be able to reap the other benefits of TBB scheduler. The tasks which my IO thread submits are recursively splittable tasks which can be heavily parallelized.

Now if I have to be able to support recursive splitting and continuation passing and stuff like that, will I not end up writing something like tbb scheduler again?

and what is the difference between when you say worker thread blocking and kernel blocking?
0 Kudos
Shankar1
Beginner
2,179 Views
Quoting - Raf Schietekat
1. There is no separate active scheduler, so a worker thread just churns along until it needs to find something else to do, at which time it interacts with the scheduler-related data structures.

I dint understand this yet. So if I create a threadpool with number of threads equal to one less than actual physical threads and make the main thread in my application submit tasks to the threadpool ( and also make the main thread participate in executing the tasks whenever it is free), Can I say that my threadpool does fair scheduling? Or does the TBB scheduler do something other than this to say that it does fair scheduling?
0 Kudos
RafSchietekat
Valued Contributor III
2,179 Views
Quoting - Shankar
So I dont understand what you mean by saying "If your boost thread spawns 4 tasks, 3 tasks may typically get stolen because the thread itself would have started executing one of them".

Now in this case since all the worker threads in the scheduler are stealing tasks, the processing of data packets will be in FIFO order I guess.

Then it seems like all tasks would in fact be stolen by other threads. It just wasn't entirely clear to me that this is what you meant.

I guess that you guess correctly... until it changes around again? :-)
0 Kudos
RafSchietekat
Valued Contributor III
2,179 Views
"I dint understand this yet. So if I create a threadpool with number of threads equal to one less than actual physical threads and make the main thread in my application submit tasks to the threadpool ( and also make the main thread participate in executing the tasks whenever it is free), Can I say that my threadpool does fair scheduling? Or does the TBB scheduler do something other than this to say that it does fair scheduling?"
It used to be that both local processing and stealing were LIFO, which was definitely unfair. Since 20090313 (according to CHANGES) it seems that stealing is done FIFO, so now at least the thieves are fair (who would have expected that!). If all your work is stolen from one parent, that may be all you need, because any remaining unfairness is purely in the interest of getting individual data packets processed faster (assuming their processing is subdivided).

So some of what I wrote earlier was based on behaviour that has changed (sorry for that, and thanks for drawing my attention to the matter), and it seems you can entirely rely on the scheduler after all.

(Added) You can't have the spawning thread participate in processing, though, otherwise there is a chance that some data packets will get buried under other work. But you may not even have to account for it when deciding on the number of TBB threads, unless it does something expensive that would cause oversubscription of processing resources anyway, but I don't expect that to be the case.
0 Kudos
Dmitry_Vyukov
Valued Contributor I
2,179 Views
Quoting - Shankar
Well I was also thinking of having N worker threads and N concurrent queues and distributing them( of course I dint think of how to handle worker thread blocking, kernel blocking). Well the only problem is that I would not be able to reap the other benefits of TBB scheduler. The tasks which my IO thread submits are recursively splittable tasks which can be heavily parallelized.

Now if I have to be able to support recursive splitting and continuation passing and stuff like that, will I not end up writing something like tbb scheduler again?


Well, requirement to do intra-task parallelization changes picture.
In such situation you indeed better to somehow hack TBB task scheduler to do this.
Frankly I do not know what is the best/official way to do that. Probably you can modify the scheme I've described in the following way. Worker thread (in your thread pool) instead of directly processing a packet submits a task to TBB scheduler. This way you will get around blocking of IO thread and scheduler unfairness.
0 Kudos
RafSchietekat
Valued Contributor III
2,179 Views
Quoting - Dmitriy Vyukov
Create N worker threads. Create N FIFO concurrent queues. IO thread submits new packets to FIFO queues in a round-robin fashion. Worker threads try to get next packet to process from "own" queue first, if empty try to steal from other queues.
Would it be a significant improvement to distribute the work, do you think? It doesn't seem unlikely, because workers wouldn't always need to find work by stumbling upon it by chance, as they normally do. If so, could this also beachieved using affinity (I haven't used that yet, and I see no statement about FIFO or LIFO)?
0 Kudos
RafSchietekat
Valued Contributor III
2,179 Views
Quoting - Dmitriy Vyukov
In such situation you indeed better to somehow hack TBB task scheduler to do this.
Why?
0 Kudos
Dmitry_Vyukov
Valued Contributor I
2,179 Views
Quoting - Shankar
and what is the difference between when you say worker thread blocking and kernel blocking?

I just meant how does worker thread handle absence of work. It may just sleep for some time and then recheck queues, sleep for some time, then recheck queues, and so on (this is called spinning). Or it may block on some kernel primitive (semaphore, condition variable, event) until new work arrive (this is called kernel blocking).

0 Kudos
Dmitry_Vyukov
Valued Contributor I
2,179 Views
Quoting - Raf Schietekat
Quoting - Dmitriy Vyukov
In such situation you indeed better to somehow hack TBB task scheduler to do this.
Why?

To not reimplement it.
0 Kudos
RafSchietekat
Valued Contributor III
2,179 Views
Quoting - Dmitriy Vyukov
To not reimplement it.

Why change anything, I mean?
0 Kudos
Dmitry_Vyukov
Valued Contributor I
2,179 Views
Quoting - Raf Schietekat
Quoting - Dmitriy Vyukov
Create N worker threads. Create N FIFO concurrent queues. IO thread submits new packets to FIFO queues in a round-robin fashion. Worker threads try to get next packet to process from "own" queue first, if empty try to steal from other queues.
Would it be a significant improvement to distribute the work, do you think? It doesn't seem unlikely, because workers wouldn't always need to find work by stumbling upon it by chance, as they normally do. If so, could this also beachieved using affinity (I haven't used that yet, and I see no statement about FIFO or LIFO)?

Since tasks are BIG, distibution will make little sense. Simplier design will be to use single centralized queue.
What can be achieved with affinity?


0 Kudos
Dmitry_Vyukov
Valued Contributor I
2,179 Views
Quoting - Raf Schietekat
Why change anything, I mean?

To get around IO thread blocking until previous packet processing is completed, for example.


0 Kudos
RafSchietekat
Valued Contributor III
2,179 Views
"Since tasks are BIG, distibution will make little sense. Simplier design will be to use single centralized queue."
Hmm, you're the one who proposed to use N queues, so now I'm a bit confused.

"What can be achieved with affinity?"
I'm trying to find a possible rationale for your proposal with N queues, and bring it over to the TBB side. Since there is only one thread to steal from, and there's no way to tell the other threads about that, there might be some overhead by letting them find the stealable work by accident (this thread? no. that thread? no. ... yonder thread? yes.). Instead of using explicit queues, I wanted to know your opinion on whether the affinity mailboxes (sic?) could fill that role (they would have to be FIFO for that to work, though).

"To get around IO thread blocking until previous packet processing is completed, for example."
I think we established that this is not an issue: the I/O thread is kept separate.
0 Kudos
Dmitry_Vyukov
Valued Contributor I
2,179 Views
Quoting - Dmitriy Vyukov
Well, requirement to do intra-task parallelization changes picture.
In such situation you indeed better to somehow hack TBB task scheduler to do this.
Frankly I do not know what is the best/official way to do that. Probably you can modify the scheme I've described in the following way. Worker thread (in your thread pool) instead of directly processing a packet submits a task to TBB scheduler. This way you will get around blocking of IO thread and scheduler unfairness.

However note that your tasks must be large enough in order to intra-task parallelization to make sense. By large enough I mean, let's say, larger than 1 ms. Otherwise you better not do any intra-task parallelization at all.


0 Kudos
Dmitry_Vyukov
Valued Contributor I
2,179 Views
Quoting - Raf Schietekat
"Since tasks are BIG, distibution will make little sense. Simplier design will be to use single centralized queue."
Hmm, you're the one who proposed to use N queues, so now I'm a bit confused.

Initially I assumed that tasks are not that big, so I proposed distributed queues.
Then Shankar said that intra-task parallelization is worth doing, this means that tasks are large enough. So I said that if tasks are large, there is little sense in distributed queues.


Quoting - Raf Schietekat
"What can be achieved with affinity?"
I'm trying to find a possible rationale for your proposal with N queues, and bring it over to the TBB side. Since there is only one thread to steal from, and there's no way to tell the other threads about that, there might be some overhead by letting them find the stealable work by accident (this thread? no. that thread? no. ... yonder thread? yes.). Instead of using explicit queues, I wanted to know your opinion on whether the affinity mailboxes (sic?) could fill that role (they would have to be FIFO for that to work, though).

That's how TBB works anyway, when thread is idle it polls other queues. The overhead is usually not that big, and note that overhead is associated with a thread that is otherwise idle.
And it's not the case that there is only one thread to steal from. All threads but one may have not empty queues.
As a possible optimization you may also apply the following. When IO thread unblocks any worker threads, it communicates to then pointer to the non-empty queue (the queue it's just enqueued task to). So that worker threads have some idea as to what queue to check first.

What exactly do you mean by affinity mailboxes? How do they differ from explicit queues?



Quoting - Raf Schietekat
"To get around IO thread blocking until previous packet processing is completed, for example."
I think we established that this is not an issue: the I/O thread is kept separate.

Do you mean hack with allocate_additional_child_of()? Well... it's one of the options to "hack TBB scheduler" :)

0 Kudos
RafSchietekat
Valued Contributor III
2,113 Views
"So I said that if tasks are large, there is little sense in distributed queues."
Why is that decision related to size?

"And it's not the case that there is only one thread to steal from. All threads but one may have not empty queues."
In this case it is in fact just the I/O thread that has (the most eligible) stealable work.

"What exactly do you mean by affinity mailboxes? How do they differ from explicit queues?"
Explicit queues are so... explicit. :-)

"Do you mean hack with allocate_additional_child_of()? Well... it's one of the options to "hack TBB scheduler" :)"
Seems more like normal use to me.
0 Kudos
Reply