I was wondering why creating and spawning 20 IO tasks seems cheaper to write than creating 20 tbb_threads, each running its own short IO function. With the scenario you're describing, I only see one problem at first glance: TBB could (potentially) already be initialized with a pool of the default size, in which case creating a task_scheduler_init object and passing "20" into the constructor will not change a thing.
Also, you're suggesting starting computations right after the IO completes, so you'll need to be able to destroy the last task_scheduler_init object (so that the 20-thread pool gets destroyed) and then initialize a default-size pool to run the computations without oversubscription.
By the way, have you tried mixing TBB and Boost Asio? I've read in the forum somewhere that someone tried it and succeeded...
If you can't use Boost, can you use asynchronous, non-blocking I/O? It would work with TBB even better than Asio, since Asio internally has a lot of critical sections protecting its serial containers.
What are the performance differences between select/poll and the alternatives? I presume that blocking individual threads isn't a very good choice in comparison (for performance, anyway), but I have not yet explored asynchronous APIs or any other alternatives.
Asio prefers epoll (on Linux) and kqueue (on BSD/Mac) to select, and eventfd to a pipe. That's about all I know of their relative performance. The idea is to isolate blocking to a single call in a single thread. Unfortunately, TBB still has known issues even with this approach (spin-idling of the master thread). However, it can be worked around, and I hope we will fix it eventually.
Then one thing that hasn't been mentioned elsewhere already would probably be to make sure that the data is read directly into the cache that will ultimately process it, instead of first being assembled into complete messages by the central tbb_thread that blocks to monitor all input. Just curious: how is the data treated behind the scenes? Does it remain inside the network interface until read() time, or would it already have been copied to the thread that performs the poll() or equivalent, or even to another one? Anyway, there's no use making a redundant copy that also crosses caches once or perhaps even twice, right? Does that make sense?
AFAIK, Boost::Asio and Windows overlapped API work differently. They notify when the data has _already_ been read or written into/from a buffer specified for async operation.
Yes, I was wondering how big the fragments would need to be for it to make sense to have them read and assembled by affinity-based tasks spawned by the central tbb_thread, instead of letting the central tbb_thread assemble them (easier to code, less task overhead) and then provide pre-assembled messages to single tasks (which then have to deal with non-local data from a non-local allocation). I'm less sure about the advantage of calling read() in parallel, or about how that effect could even be isolated, but maybe it plays a role as well.
"AFAIK, Boost::Asio and Windows overlapped API work differently. They notify when the data has _already_ been read or written into/from a buffer specified for async operation."
An opportunity lost?
"However, these questions should rather be addressed to OS gurus."
Maybe one is just lurking (oh boy, Webster's has started to track Internet lingo), biding his time to make a dramatic entrance?
My own experience on the subject suggested that trying to do the IO in parallel was fouling my disk locality (loading several thousand multi-megabyte files for processing), and I ended up getting near-maximum throughput with the TBB pipeline pattern that appeared in an Intel blog post a while back. It amounts to building a pipeline with a serial filter on the front end (reading data into a buffer) and a parallel filter after it (consuming the data), while carefully controlling the number of tokens in flight.
Now, in my case I was doing this to overlap computation with IO. If you have to wait for all of the IO to complete before you can make progress, this may not help.