Well, I think there might be a reasonable TBB solution, though not as easy to use as you wish.
Let's start with what seems most natural solution for the problem. The main thread should produce some work, some other thread(s) should process the work in serial order, and there are several independent streams of work. So the solution seems simple - each work stream is processed by its own thread, and data are passed through the queue to guarantee in-order processing. It could be either tbb::concurrent_queue or a serial queue plus a lock that protect operations with the queue. We havea classical producer-consumer pattern, with a limited degree of parallelism achieved by multiple independent streams.
What is the problem however? I think it's oversubscription; if the number of work streams exceeds the number of cores, threads processing the streams will compete for CPU. And what is the solution? Yes, create a pool of threads, as many as there are cores.
Each thread within the pool should be able to process any work stream. When a thread becomes idle, it chooses any non-empty workstream, takes an item from its queue,and starts processing. While in-order semantics is kept by using queues as before, serial semantics should be re-ensured. And blocking the whole queue during processing would be bad because the work producer would get blocked as well. A better solution would be to usea binary semaphore that allows only single thread in processing stage for a given work stream. The semaphore should have a non-blocking try-acquire semantics however, to avoid workers get blocked while the job exists in another stream. Combining all pieces together, we have: a pool of worker threads, each worker polls all work streams in round-robin fashion; if it fails to acquire the semaphore for the stream it immediately moves to the next one, otherwise pops a work item (if any) from the queue, processes it, releases the semaphore, and goes to checkthe next stream.
What is the problem in this case? Idle spinning. Try-acquire semantics is required, but then workers should known to stop spinning when there is no job anywhere. There could be different solutions for the problem, but since we are in TBB forum, let me choose tasks :)
Indeed, if there are just as many tasks created as there are work items, no tasks means no work, and worker threads can stop spinning and go sleep until a supplied piece of work wakes them up. The mechanisms for thatalready exist in TBB. Additionally, the task can carry information about which stream has the work, so no need for round robin. So the producer thread puts an item into a work queue (which is still necessary to ensure in-order semantics) and spawns a task to process that item. A worker thread takes the task and executes it. The first thing a task should do is still checking the semaphore however, to preserve serial semantics (because there might be many tasks in flight that process the same work stream). But now what does it do if the semaphore is closed for the moment? Probably this task should be re-spawn to execute later, and another task should be taken, ideally for another work stream.
Here the next issue lies.A re-spawned TBB task is put into the local task pool of the current worker; and the very first thing an idle worker would do is to take a task from its own pool - i.e. the same task that it just postponed for later.
With TBB 2.2, the only solution I see to thislast problem is to step back a bit, and make tasks agnostic of what stream they should process. Instead, each task would poll all work streams until a piece of work is found (and there should be one, because there should be as many work items as tasks). With the next TBB version (for which we have a pre-release in open-source downloads), there is another solution: a task can be enqueue()'d, instead of spawn()'d. The difference is that enqueued tasks are put into a global container, instead of the local pool. And the container preserves some fairness (though has no FIFO guarantees), so older tasks tend to be taken earlier than newly added ones. Exactly what is required!
Remember thatnon-blocking semaphores and work queues are still there, to preserve serial-in-order semantics you need. Tasks don't give you that, as Andrey mentioned. Oh, and I should say that any TBB mutex would work just fine as the non-blocking binary semaphore, because they all have try_acquire method.
What's the next problem? Well, I think it would be idle spinning again, this time due to limited degree of parallelism. Imagine there are idle workers, and there are tasks, but all non-empty work streams are already acquired. Due to the serial semantics, free workers would shuffle available tasks until some previous work is completed and the next item can be taken from that queue. It would be good to have threads sleep in this case too; and that should be possible but I will stop for now :) It's probably worth a blog entry already; might be I will publish it later as such.