Re: TBB for IO pattern

robert_jay_gould · ‎08-18-2009

Although I'm aware that TBB and IO don't mix well, I was wondering if this pattern would work:

Start a scheduler with, say, 20 (lots of) threads

Place lots of short IO tasks (of similar length) on the scheduler

Let it run.

My understanding is that this would give me say 20 parallel IO tasks, that would get the IO work done faster than a serial implementation (maybe not as fast as a custom implementation but much cheaper to write).

Of course at the same time (startup) the scheduler isn't actually required to do any heavy weight computations, just IO. And not until the IO is done, does the serious computing start.

Anything wrong with this?

Anton_Pegushin · ‎08-19-2009

Hi,

I was wondering why creating and spawning 20 IO tasks seems cheaper to write than creating 20 tbb_threads, each running its own short IO function. However, with the scenario you're describing I see only one problem at first glance: TBB could (potentially) already be initialized with a pool of default size, so creating a task_scheduler_init object and passing "20" into the constructor will not change a thing.

Also, you're suggesting starting the computations right after the IO completes, so you'll have to destroy the last task_scheduler_init object so that the 20-thread pool gets destroyed, and then initialize the default-size pool to run the computations without oversubscription.

By the way, have you tried mixing TBB and Boost Asio? I've read in the forum somewhere that someone tried it and succeeded...

robert_jay_gould · ‎08-19-2009

Thanks Anton,

Indeed I've used TBB and Boost Asio together before (actually I think I was the one that first mentioned it here in the forums :)

Anyways in this particular application I can use TBB, but not Boost (strange policies and stuff, but it'll take way longer to change the policies, and probably can't be changed anyways). So I'm left without Asio in the equation :/

Also what I was thinking was to have say 100 tasks and 20 threads in the scheduler, so it could sort of balance itself out a bit.

But, as you mention, I had forgotten that someone else might have started the task scheduler (like myself 6 months later), so you are right: making a custom pool of 20 tbb_threads might be a better idea. I was kind of hoping to use the task_scheduler as a thread pool for this situation, but it'll probably bite me in the back down the road... so custom tbb_thread pool it is.

Mmm, maybe adding a tbb_thread_pool to tbb might be a nice feature.
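
For illustration, a minimal sketch of the "100 tasks on 20 threads" idea discussed above. It uses std::thread as a stand-in for tbb_thread (the two have similar interfaces), and the function name run_io_tasks is hypothetical, not part of TBB. It assumes the tasks do not throw.

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Runs every task in `tasks` on `num_threads` worker threads and returns
// once all of them have finished. std::thread stands in for tbb_thread.
void run_io_tasks(const std::vector<std::function<void()>>& tasks,
                  std::size_t num_threads) {
    std::atomic<std::size_t> next{0};  // index of the next unclaimed task

    auto worker = [&] {
        for (;;) {
            std::size_t i = next.fetch_add(1);
            if (i >= tasks.size()) return;  // nothing left to claim
            tasks[i]();                     // a blocking IO call would go here
        }
    };

    std::vector<std::thread> pool;
    pool.reserve(num_threads);
    for (std::size_t t = 0; t < num_threads; ++t) pool.emplace_back(worker);
    for (std::thread& th : pool) th.join();
}
```

Because each thread claims the next index as soon as it is free, tasks of uneven length balance out across the pool without any help from the TBB scheduler, and the pool size is independent of whatever task_scheduler_init was created elsewhere.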

Anton_M_Intel · ‎08-19-2009

Quoting - robert.jay.gould

Indeed I've used TBB and Boost Asio together before (actually I think I was the one that first mentioned it here in the forums :)

Anyways in this particular application I can use TBB, but not Boost (strange policies and stuff, but it'll take way longer to change the policies, and probably can't be changed anyways). So I'm left without Asio in the equation :/

If you can't use Boost, can you use asynchronous, non-blocking I/O? It would work with TBB even better than Asio since Asio has a lot of critical sections inside to use serial containers.

RafSchietekat · ‎08-20-2009

What are the performance differences between select/poll and alternatives? I presume that blocking individual threads isn't a very good choice in comparison (for performance, anyway), but I have not yet explored asynchronous APIs or any other alternatives.

Anton_M_Intel · ‎08-20-2009

Quoting - Raf Schietekat

What are the performance differences between select/poll and alternatives? I presume that blocking individual threads isn't a very good choice in comparison (for performance, anyway), but I have not yet explored asynchronous APIs or any other alternatives.

Asio prefers epoll (on Linux) and kqueue (on BSD/Mac) to select, and eventfd to pipe. That's the only thing I know about their performance. The idea is to isolate the blocking to only one call in only one thread. Unfortunately, TBB still has known issues even with this approach (spin-idling of the master thread). However, it can be worked around, and I hope we will fix it eventually.

RafSchietekat · ‎08-20-2009

Then one thing that hasn't been mentioned elsewhere already would probably be to make sure that the data is read directly into the cache that will ultimately process it, instead of first being assembled into complete messages by the central tbb_thread that would block to monitor all input? Just curious: how is the data treated behind the scenes? Does it remain inside the network interface until read() time, or would it already have been copied to the thread that performs the poll() or equivalent, or even to another one? Anyway, no use making a redundant copy that also crosses caches once or perhaps even twice, right? Does that make sense?

Anton_M_Intel · ‎08-20-2009

Quoting - Raf Schietekat

Then one thing that hasn't been mentioned elsewhere already would probably be to make sure that the data is read directly into the cache that will ultimately process it, instead of first being assembled into complete messages by the central tbb_thread that would block to monitor all input?

There is not much we can do on the user side. select() and co. only notify whether data may be read or written (or whether something else occurred), so it is then safe to call read() or write() without blocking. But it means that we may receive or send only a part of the message (or file) and must compose it in a buffer on the user side.
AFAIK, Boost.Asio and the Windows overlapped API work differently. They notify when the data has _already_ been read or written into/from a buffer specified for async operation.

Quoting - Raf Schietekat

Just curious: how is the data treated behind the scenes? Does it remain inside the network interface until read() time, or would it already have been copied to the thread that performs the poll() or equivalent, or even to another one? Anyway, no use making a redundant copy that also crosses caches once or perhaps even twice, right? Does that make sense?

I doubt the data is copied anywhere as a result of the poll operation. However, these questions should rather be addressed to OS gurus. Just theoretically, a driver may pass the data directly into a user buffer specified for the overlapped API call... who knows. Does mmap() work for sockets? :)

RafSchietekat · ‎08-20-2009

"But it means that we may receive or send only a part of the message (or file) and must compose it in a buffer on the user side."
Yes, I was wondering how big the fragments would need to be for it to make sense to have them read and assembled by affinity-based tasks spawned by the central tbb_thread, instead of letting the central tbb_thread assemble them (easier to code, less task overhead) and then provide pre-assembled messages to single tasks (having to deal with nonlocal data from a nonlocal allocation). I'm less sure about the advantage of calling read() in parallel, or about how that effect can even be isolated, but maybe it would play a role as well.

"AFAIK, Boost.Asio and the Windows overlapped API work differently. They notify when the data has _already_ been read or written into/from a buffer specified for async operation."
An opportunity lost?

"However, these questions should rather be addressed to OS gurus."
Maybe one is just lurking (oh boy, Webster's has started to track Internet lingo), biding his time to make a dramatic entrance?

Anton_M_Intel · ‎08-20-2009

Quoting - Raf Schietekat

Yes, I was wondering how big the fragments would need to be for it to make sense to have them read and assembled by affinity-based tasks spawned by the central tbb_thread, instead of letting the central tbb_thread assemble them (easier to code, less task overhead) and then provide pre-assembled messages to single tasks (having to deal with nonlocal data from a nonlocal allocation). I'm less sure about the advantage of calling read() in parallel, or about how that effect can even be isolated, but maybe it would play a role as well.

Asio uses the second approach on Linux, but that is mostly due to its cross-platform design. I would also be interested in comparing these two approaches: calling read() in the one dispatching thread (btw, not necessarily a tbb_thread; it may be the master thread) just after polling vs. parallel read() executions. And I think this parallel processing of events should be performed by an [affinitized] parallel_for over a set of events, so it could combine several OS calls into one TBB task.

robert_jay_gould · ‎08-20-2009

Quoting - Anton Malakhov (Intel)

Quoting - Raf Schietekat

What are the performance differences between select/poll and alternatives? I presume that blocking individual threads isn't a very good choice in comparison (for performance, anyway), but I have not yet explored asynchronous APIs or any other alternatives.

Asio prefers epoll (on Linux) and kqueue (on BSD/Mac) to select, and eventfd to pipe. That's the only thing I know about their performance. The idea is to isolate the blocking to only one call in only one thread. Unfortunately, TBB still has known issues even with this approach (spin-idling of the master thread). However, it can be worked around, and I hope we will fix it eventually.

Depending on the use case, this can be a deal breaker or not. In my case all the IO is at init, so it's not an issue. But for something like a file server, this issue needs to be worked around. One possibility is to spawn continuation tasks, but that is only a partial solution, as luck becomes involved. Nevertheless, assuming most file servers don't need to be crunching numbers, they likely mix together OK as is.

robert_jay_gould · ‎08-20-2009

Quoting - Anton Malakhov (Intel)

There is not much we can do on the user side. select() and co. only notify whether data may be read or written (or whether something else occurred), so it is then safe to call read() or write() without blocking. But it means that we may receive or send only a part of the message (or file) and must compose it in a buffer on the user side.
AFAIK, Boost.Asio and the Windows overlapped API work differently. They notify when the data has _already_ been read or written into/from a buffer specified for async operation.

Quoting - Raf Schietekat

Just curious: how is the data treated behind the scenes? Does it remain inside the network interface until read() time, or would it already have been copied to the thread that performs the poll() or equivalent, or even to another one? Anyway, no use making a redundant copy that also crosses caches once or perhaps even twice, right? Does that make sense?

I doubt the data is copied anywhere as a result of the poll operation. However, these questions should rather be addressed to OS gurus. Just theoretically, a driver may pass the data directly into a user buffer specified for the overlapped API call... who knows. Does mmap() work for sockets? :)

select, poll & brethren can be used to spawn tasks that get thrown on the scheduler, basically using them to generate new tasks instead of creating tasks to wait on them. So you turn the system into an event-driven architecture instead of a polling architecture. IMHO this is generally a good thing. But I've no idea how cache affinity will affect this; I'm not even sure how to measure the effect.

Charles_Tucker · ‎09-03-2009

Quoting - robert.jay.gould

Although I'm aware that TBB and IO don't mix well, I was wondering if this pattern would work:

Start a scheduler with, say, 20 (lots of) threads

Place lots of short IO tasks (of similar length) on the scheduler

Let it run.

My understanding is that this would give me say 20 parallel IO tasks, that would get the IO work done faster than a serial implementation (maybe not as fast as a custom implementation but much cheaper to write).

Of course at the same time (startup) the scheduler isn't actually required to do any heavy weight computations, just IO. And not until the IO is done, does the serious computing start.

Anything wrong with this?

My own experience on the subject suggested that trying to do the IO in parallel was fouling my disk locality (loading several thousand multi-megabyte files for processing), and I ended up getting near-maximum throughput with the tbb_pipeline pattern that appeared in an Intel blog post a while back. It amounts to building a pipeline with a serial filter on the front end (reading data into a buffer) and a parallel filter after that (consuming data), and carefully controlling the number of tokens in flight.

Now, in my case I was doing this to overlap computation with IO. If you have to wait for all of the IO to complete before you can make progress, this may not help.
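
The shape of that pattern can be sketched without TBB. This is a standard-library stand-in, not the code from the blog post and not the TBB pipeline API. The names run_token_pipeline and Buffer are hypothetical, and the callbacks are assumed not to throw. One thread reads serially, several workers consume in parallel, and a counter caps the number of buffers in flight.

```cpp
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

using Buffer = std::string;  // stand-in for one chunk of file data

// `produce` fills a buffer and returns false once the input is exhausted;
// it is only ever called from the calling thread (the serial stage).
// `consume` may run concurrently on `num_workers` threads (the parallel stage).
// At most `max_tokens` buffers are read but not yet fully consumed.
void run_token_pipeline(const std::function<bool(Buffer&)>& produce,
                        const std::function<void(Buffer&)>& consume,
                        std::size_t num_workers, std::size_t max_tokens) {
    if (num_workers == 0) num_workers = 1;
    if (max_tokens == 0) max_tokens = 1;

    std::mutex m;
    std::condition_variable cv;
    std::queue<Buffer> ready;   // read, waiting for a worker
    std::size_t in_flight = 0;  // tokens currently in use
    bool done = false;          // input exhausted

    auto worker = [&] {
        std::unique_lock<std::mutex> lock(m);
        for (;;) {
            cv.wait(lock, [&] { return !ready.empty() || done; });
            if (ready.empty()) return;  // done and nothing left to process
            Buffer buf = std::move(ready.front());
            ready.pop();
            lock.unlock();
            consume(buf);               // heavy work runs without the lock
            lock.lock();
            --in_flight;                // hand the token back to the reader
            cv.notify_all();
        }
    };

    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < num_workers; ++i) workers.emplace_back(worker);

    std::unique_lock<std::mutex> lock(m);
    for (;;) {
        // Wait for a free token before touching the disk again.
        cv.wait(lock, [&] { return in_flight < max_tokens; });
        lock.unlock();
        Buffer buf;
        bool got = produce(buf);        // serial, possibly blocking read
        lock.lock();
        if (!got) break;
        ++in_flight;
        ready.push(std::move(buf));
        cv.notify_all();
    }
    done = true;
    lock.unlock();
    cv.notify_all();
    for (std::thread& w : workers) w.join();
}
```

Keeping the read stage serial preserves sequential disk access, and the token cap bounds memory use while still letting reads overlap with computation. As noted above, this helps only if the consumers can make progress before all of the IO has finished.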