Why not I/O threads

grepya · ‎04-28-2009

Hi,

I'm a newly enthusiastic user of TBB. I truly wish this becomes the standard to remove any excuse for an engineering team to write a yet another user-level thread scheduler "because performance demands it". But I have to say, I was disappointed to read the following in the FAQ:

TBB is best for computational tasks that are not prone to frequent waiting for I/O or events in order to proceed (this is an area the TBB team does want to tackle later).

In my prior experience, multi-threaded server type applications are just as prone to inefficiencies due to cache thrashing and related effects, not to mention the effect of heavy (kernel level) context switches as the sort of data crunching parallelization that all the tbb example code seems to emphasize. I'd like to understand the following: what exactly is it about the design of the task scheduler or parallel algorithms provided in the tbb library that makes it particularly unsuitable for I/O heavy server applications.

For the purposes of this discussion, we can take a particular(-ly simple) application example: Let's say there's an in-memory database application. It's accepting requests over the network to get or store some data values (in-memory only). The internal work that a thread has to do involves processing the query and then preparing the answer by looking up some in-memory data structures (which need to support concurrent access of course). And finally this answer is returned to the network.

Could someone knowledgeable explain to me why I can't use a collection of tasks (that are waiting on the incoming request queue) combined with some of the parallel data structures provided (parallel hash_map etc.) to write this application ? I'd like to generate a hopefully informative discussion that could potentially be added to the FAQ or another relevant document. (and if something like this is already in an existing document, I would appreciate a pointer please).

Thanks a lot
Chetan

Dmitry_Vyukov · ‎04-28-2009

Quoting - grepya

Could someone knowledgeable explain to me why I can't use a collection of tasks (that are waiting on the incoming request queue) combined with some of the parallel data structures provided (parallel hash_map etc.) to write this application ?

If tasks are blocked on request queue then they are excluded from processing. Assume you have N processors and N worker threads, they all are blocked on request queue. One request arrived, one worker thread unblocks and processes it. Where is parallelization? To achieve parallelization worker thread have to wait on task scheduler's queue.

However, in theory, worker thread may wait on both IO and task queue. Efficient combination of computational processing and IO is indeed possible. One of the examples is libasync-smp, AFAIR it basically maintains low-priority task for every worker thread which polls IO with select():
http://www.scs.cs.nyu.edu/~dm/papers/zeldovich:async-mp.pdf

Another example is QuickThread, it takes a different approach: run-time maintains 2 thread pools - one for computational tasks (thread per CPU by default), and another for IO (tasks may block here). Computational tasks may submit IO tasks and vice versa:
http://software.intel.com/en-us/articles/quickthread/

RafSchietekat · ‎04-28-2009

It would be nice if TBB's scheduler could register itself with the kernel to receive information that would help it decide the currently optimum number of worker threads based on how many are currently blocked waiting and on what's going on in other processes.I supposepart of this could be achieved by user-initiated coordination through shared memory (if multiple TBB-based processes are active at the same time), and by doing things like { tbb::blocking anonymous; my_blocking_call(); } in code that has a manageable number of locations to be guarded, and/or active monitoring at scheduling time.

Another problem related to use in servers is that TBB trades away fairness and even global progress for throughput efficiency of assumedly finite workloads. Tasks may become buried under other tasks and be starved virtually forever if this assumption does not hold. It is possible to work around that by processing jobs in batches, but this requires draining the whole pipeline in-between (using a CPU-related simile), which may obviously waste a considerable amount of performance if not done carefully (well,in cerebroanyway...). It would be nice if the trade-off could be tweaked to reduce the effect to perhaps at the most some small bubbles.

Did I forget anything?

grepya · ‎04-29-2009

Quoting - Dmitriy Vyukov

Hi Dmitriy,

Thanks for your response. But I'm not sure I completely understand. Perhaps there are internal assumptions in TBB that I'm not clear about here. Let
me address your question below:

If tasks are blocked on request queue then they are excluded from processing. Assume you have N processors and N worker threads, they all are blocked on request queue. One request arrived, one worker thread unblocks and processes it. Where is parallelization? To achieve parallelization worker thread have to wait on task scheduler's queue.

Well the parallelization is that it's not just one worker thread who is processing a single request. In the ideal case, every task has picked up a request that it's dealing with (obviously in separate, parallel threads) and once it's done, it goes back to waiting on the request queue. If there's nothing on the request queue, the tasks just wait (which is obviously ok). Otherwise, (in heavy load case) they all spend only a fraction of thier time waiting on the requests queue and most of their time actually doing the work.

However, in theory, worker thread may wait on both IO and task queue. Efficient combination of computational processing and IO is indeed possible. One of the examples is libasync-smp, AFAIR it basically maintains low-priority task for every worker thread which polls IO with select():
http://www.scs.cs.nyu.edu/~dm/papers/zeldovich:async-mp.pdf

Another example is QuickThread, it takes a different approach: run-time maintains 2 thread pools - one for computational tasks (thread per CPU by default), and another for IO (tasks may block here). Computational tasks may submit IO tasks and vice versa:
http://software.intel.com/en-us/articles/quickthread/

Of course I can just as easily code somthing like that (and have coded in the past) in the usual OS supplied threads and/or using one of the other approaches you quoted from research projects. The advantages I'm seeking by using the TBB libraries is the same advantages you cite for all the other applications. Cache friendliness, User level scheduling, no hand-tuning to the next CPU architectures etc. Basically leveraging Intel's investment in the library to my application's advantage.

Am I making sense or am I completely misunderstanding the way TBB tasks work ?

Thanks
Chetan

grepya · ‎04-29-2009

Quoting - Raf Schietekat

Hi Raf,

Thanks for your response.

It would be nice if TBB's scheduler could register itself with the kernel to receive information that would help it decide the currently optimum number of worker threads based on how many are currently blocked waiting and on what's going on in other processes.I supposepart of this could be achieved by user-initiated coordination through shared memory (if multiple TBB-based processes are active at the same time), and by doing things like { tbb::blocking anonymous; my_blocking_call(); } in code that has a manageable number of locations to be guarded, and/or active monitoring at scheduling time.

So if I understand you correctly, currently there's no way for the application to decide how many worker threads are spawned ? Or maybe even a higher level hint like "use R cores out of
total N available" etc. ?? That way, the task scheduler could make sure that there are at least R worker threads in the running state.

Another problem related to use in servers is that TBB trades away fairness and even global progress for throughput efficiency of assumedly finite workloads. Tasks may become buried under other tasks and be starved virtually forever if this assumption does not hold. It is possible to work around that by processing jobs in batches, but this requires draining the whole pipeline in-between (using a CPU-related simile), which may obviously waste a considerable amount of performance if not done carefully (well,in cerebroanyway...). It would be nice if the trade-off could be tweaked to reduce the effect to perhaps at the most some small bubbles.

Hmm... trying to make sense of the above paragraph. So are you saying that the tasks are not pre-empted until they are finished (and/or perhaps call a blocking call). If that's the case, I don't see too much of a problem in the application I drew up in my original post above. The lifetime of a task is book-ended by blocking operations as follows: Input -----> in-core-processing---->Output. Now of course all other server processes can be broken down into subtasks that reduce to this simple pattern.

Or if I have misunderstood your point completely, please correct me. The unfairness issue, if it's real, seems to be the most serious issue so far. Perhaps if there was a clear specification of the nature of the scheduling algorithms we could design our applications to work around the problems.

Thanks
Chetan

robert-reed · ‎04-29-2009

Quoting - grepya

...the parallelization is that it's not just one worker thread who is processing a single request. In the ideal case, every task has picked up a request that it's dealing with (obviously in separate, parallel threads) and once it's done, it goes back to waiting on the request queue. If there's nothing on the request queue, the tasks just wait (which is obviously ok). Otherwise, (in heavy load case) they all spend only a fraction of thier time waiting on the requests queue and most of their time actually doing the work.

It really depends on the amount of work to do in a single request versus the frequency with which those requests show up. Threading adds overhead, and if there's not enough computation work to do, the net effect is low concurrency levels. If there's plenty of work to do with each request, there may be an opportunity for recursive parallelism (parallel at the request dispatch and parallel within the fulfillment of each request), but it really depends on the factors previously mentioned.

Quoting - grepya

So if I understand you correctly, currently there's no way for the application to decide how many worker threads are spawned ? Or maybe even a higher level hint like "use R cores out of
total N available" etc. ?? That way, the task scheduler could make sure that there are at least R worker threads in the running state.

There is a very simple way for the application to decide how many worker threads to create, the argument used with the creation of the first task_scheduler_init object. Raf's point is that when those worker threads are blocked waiting for I/O, they are effectively taken off the schedule. If you have a machine with eight HW threads and two of them are blocked waiting for something, only six are available to share in any task scheduled work. The notion of getting hints from the OS would be to be able to automatically spawn (or take from some other thread pool) additional workers to make full use of the machine. If you can solve that problem, you have the corresponding problem on the other side of gracefully terminating (or repooling) the extra threads when the originals become unblocked, or else face a possible condition where too many threads are completing for CPU time and thrashing with each other. And when you solve that problem, you have the additional problem of either running out of memory because of all the additional threads running, or else limiting the number of additional threads and finding yourself back with the original problem, a loss of concurrency because of all the blocked threads.

Quoting - grepya

So are you saying that the tasks are not pre-empted until they are finished (and/or perhaps call a blocking call). If that's the case, I don't see too much of a problem in the application I drew up in my original post above. The lifetime of a task is book-ended by blocking operations as follows: Input -----> in-core-processing---->Output. Now of course all other server processes can be broken down into subtasks that reduce to this simple pattern.

Tasks are not preempted. They run to completion. They are not fairly scheduled, either. There's no preemptive scheduler to ensure all tasks get equal time. The goal of the scheduler is to maximize the time any task and its related tasks run in the same HW thread in order to minimize the latency to local data already resident in some particular cache.

RafSchietekat · ‎04-30-2009

#3 "Of course I can just as easily code somthing like that (and have coded in the past) in the usual OS supplied threads and/or using one of the other approaches you quoted from research projects. The advantages I'm seeking by using the TBB libraries is the same advantages you cite for all the other applications. Cache friendliness, User level scheduling, no hand-tuning to the next CPU architectures etc. Basically leveraging Intel's investment in the library to my application's advantage."
Certainly this make sense, but to get the most out of TBB and/or avoid surprises you sometimes have to work around the impedance mismatch between its world view and your own situation.

#4 "Or if I have misunderstood your point completely, please correct me."
Tasks may never start executing, or never wake up from waiting for their children to finish, even if they each individually have only a bounded amount of work to do. So you have to take your own flow-control precautions. This can be simple, but you have to be aware of it if you want to use TBB to control your nuclear power plant (just kidding). It would be enough to run a loop that waits for any select() or poll() activity, reads something from all active inputs, calls TBB to process the current batch to completion, and repeats, but it probably won't be fully 100% efficient; you can improve matters by allowing bounded amounts of work to be added to a batch in progress. Alternatively, you could keep all data in passive data structures (like concurrent_queue) and only let TBB touch it as part of task executions that never go to sleep inside TBB (a pipeline wouldn't qualify), which may be more efficient anyway if there is only limited processing to be done on each data item.

#5 "If you can solve that problem", "And if you when you solve that problem"
:-)

(Added) That process loop above could simply be while() { select(); parallel_for(sockets); }, I suppose, but then new work ona socket already serviced during a particular iteration would be delayed until the next iteration.

Dmitry_Vyukov · ‎05-03-2009

Quoting - Raf Schietekat

It would be nice if TBB's scheduler could register itself with the kernel to receive information that would help it decide the currently optimum number of worker threads based on how many are currently blocked waiting and on what's going on in other processes.I supposepart of this could be achieved by user-initiated coordination through shared memory (if multiple TBB-based processes are active at the same time), and by doing things like { tbb::blocking anonymous; my_blocking_call(); } in code that has a manageable number of locations to be guarded, and/or active monitoring at scheduling time.

This will be true when TBB will be running on top of the ConcRT (UMS threads). UMS will provide notifications regarding worker thread blocking, as well as information about system load by other entities (process-wide in the first version).
Doing both things w/o kernel support is too... impossible.

Btw, question to TBB team: have you started integration with ConcRT? CTP/RC versions of Win7 and ConcRT are available. Not sure what actual benefits it will bring to end-users, though it will be interesting in itself :)

Dmitry_Vyukov · ‎05-03-2009

Quoting - grepya

Hi Dmitriy,

Thanks for your response. But I'm not sure I completely understand. Perhaps there are internal assumptions in TBB that I'm not clear about here. Let
me address your question below:

If tasks are blocked on request queue then they are excluded from processing. Assume you have N processors and N worker threads, they all are blocked on request queue. One request arrived, one worker thread unblocks and processes it. Where is parallelization? To achieve parallelization worker thread have to wait on task scheduler's queue.

Well the parallelization is that it's not just one worker thread who is processing a single request. In the ideal case, every task has picked up a request that it's dealing with (obviously in separate, parallel threads) and once it's done, it goes back to waiting on the request queue. If there's nothing on the request queue, the tasks just wait (which is obviously ok). Otherwise, (in heavy load case) they all spend only a fraction of thier time waiting on the requests queue and most of their time actually doing the work.

Of course I can just as easily code somthing like that (and have coded in the past) in the usual OS supplied threads and/or using one of the other approaches you quoted from research projects. The advantages I'm seeking by using the TBB libraries is the same advantages you cite for all the other applications. Cache friendliness, User level scheduling, no hand-tuning to the next CPU architectures etc. Basically leveraging Intel's investment in the library to my application's advantage.

Am I making sense or am I completely misunderstanding the way TBB tasks work ?

If tasks will be waiting on IO (blocking call or separate queue) then... well.. you will hardly get any advantages from the TBB. It's not TBB anymore, you are using TBB just as a plain thread pool, you completely ignore TBB's main component - distributed scheduler - by substituting it with own queue-based scheduler.

What is the question? Why TBB does not support IO? Well, I guess the answer is that it just does not support (limited time, different goals).

It's definitely possible to bolt IO support on TBB. For example, create separate thread which will poll IO and submit tasks to TBB (not sure how well it will work).

.NET's TLP has somehow better support for IO. It has separate "root task" queue, so that the order of task polling is following:
1. Check own work-stealing deque.
2. Check centralized "root task" queue.
3. Check other thread's work-stealing deques.
So that you may submit IO tasks (tasks generated by IO) to "root task" queue. Then worker-threads will get new "root tasks" when they are completed with previous (no excessive amount of "root tasks" will be processed simultaneously which must be considered good). Also worker threads will be trying to work on separate "root tasks" if possible (which also must be considered as good).

AJ13 · ‎05-04-2009

Rather than integrating I/O tasks with the TBB scheduler.... why not just take the boost.asio model and implement something like it TBB-style? I'm not suggesting that TBB develop a network library and all the other fluff inside of asio, but perhaps just the basic model of I/O (the so-called proactor pattern).

This could work with UNIX file descriptors, not sure about the windows world.

grepya · ‎05-05-2009

Quoting - Dmitriy Vyukov

If tasks will be waiting on IO (blocking call or separate queue) then... well.. you will hardly get any advantages from the TBB. It's not TBB anymore, you are using TBB just as a plain thread pool, you completely ignore TBB's main component - distributed scheduler - by substituting it with own queue-based scheduler.

What is the question? Why TBB does not support IO? Well, I guess the answer is that it just does not support (limited time, different goals).

Well I supposes the real question is, what exactly in the design of the system prevents it's use for I/O type problems. This thread has partly answered my question. But I still see an advantage in bending the framework to try to use it in the a toy server (along the lines I described in my original post) application and do some benchmarking. The value of various other aspects if tbb libraries, like the efficient user level thread scheduler over multi-cores, multi-core intelligent memory allocator, smart synchronized data structures etc. still applies.

It's definitely possible to bolt IO support on TBB. For example, create separate thread which will poll IO and submit tasks to TBB (not sure how well it will work).

Exactly my thoguht. I guess the challange is to experiment and come up with a workable solution while still taking advantage of the valuable pieces of the library (like the ones I mentioned above).

Thanks
Chetan