tbb::task_scheduler_init init(tbb::task_scheduler_init::default_num_threads() + 1);

This leads to temporary oversubscription, but as long as the periods of overutilization are rare enough, overall performance will improve.
Actually I believe that your idea about using a pipeline is not that hopeless. You could use a serial input filter to read packets and parcel up boxes. As soon as a box is ready, you mark it with the type of fruit in the given parcel and pass it to the second filter, which is parallel. The label I've just mentioned could be a pointer to a member function of the second filter that processes the specific type of fruit. The last filter will again be a serial one. If the second filter does not produce anything, it just passes NULL to the last one, which ignores it. What do you think?
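The labelling idea above can be sketched in plain C++, independent of the TBB filter classes. All the names here (`Box`, `ProcessFilter`, `makeBox`) are illustrative, and the handlers just do placeholder arithmetic; in a real TBB pipeline the call would happen inside the parallel filter's `operator()`:

```cpp
#include <string>

class ProcessFilter;  // forward declaration so Box can hold a member pointer

struct Box {
    std::string fruit;
    int count;
    // The "label": which member of ProcessFilter handles this fruit type.
    int (ProcessFilter::*handler)(const Box&);
};

class ProcessFilter {
public:
    int processApples(const Box& b)  { return b.count * 2; }  // placeholder work
    int processOranges(const Box& b) { return b.count * 3; }  // placeholder work

    // In a TBB parallel filter this dispatch would live in operator()(void*).
    int operator()(Box& b) { return (this->*b.handler)(b); }
};

// The serial input filter would pick the handler when it seals a box:
Box makeBox(const std::string& fruit, int count) {
    Box b{fruit, count, nullptr};
    if (fruit == "apple")  b.handler = &ProcessFilter::processApples;
    else                   b.handler = &ProcessFilter::processOranges;
    return b;
}
```

The parallel stage then never needs a switch on the fruit type; it just invokes whatever member pointer the input stage attached.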
I'd thought about such an approach, but it looked like the parallel second stage would need to be constrained a bit. In my actual problem there are many different kinds of fruit, potentially hundreds of them, and each kind can be processed independently. So the parallelism is at the level of being able to process a box of apples at the same time as a box of oranges. The processing of each box is largely sequential in nature, or at least at present the small amount of parallelism available is not worth exploiting. I definitely need to process each apple in a box in the order it originally arrived. The processing step on each fruit is configurable though, so I suppose at some point I could find there is a small amount of parallelism that is exploitable at this level.

Each box of apples must be processed in the order the apple boxes were produced, and likewise for each of the other fruits. So if I simply classify fruit into boxes in the first stage and then pass them to the second filter, I may end up processing multiple boxes of apples in parallel. I can't simply set the number of tokens to the number of different kinds of fruit, as that still doesn't prevent the problem. If the pipeline approach is the best way to go in TBB, then presumably I'll need to implement something like the token mechanism, but at a finer level, to ensure that if I pass an apple box into the second filter I can't pass in another apple box until the first one has come out the other end. If the number of fruit types becomes too large I can always aggregate them, e.g. by introducing an AppleOrOrange hybrid, but at the cost of constraining the potential parallelism.
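One way that finer-grained, per-type token mechanism could look, sketched outside TBB (`TypeGate` is an invented name, not a TBB facility): the input filter consults the gate before releasing a box of a given fruit, and the output filter releases the gate when that box leaves the pipeline.

```cpp
#include <mutex>
#include <set>
#include <string>

// Sketch: allow at most one in-flight box per fruit type.
class TypeGate {
    std::mutex m_;
    std::set<std::string> inFlight_;
public:
    // Called by the input filter. Returns true (and records the type)
    // only if no box of this fruit is currently in the parallel stage.
    bool tryAcquire(const std::string& fruit) {
        std::lock_guard<std::mutex> lock(m_);
        return inFlight_.insert(fruit).second;  // false if already present
    }
    // Called by the output filter when a box of this fruit completes.
    void release(const std::string& fruit) {
        std::lock_guard<std::mutex> lock(m_);
        inFlight_.erase(fruit);
    }
};
```

When `tryAcquire` fails, the input filter would have to park that box (e.g. in a per-fruit pending queue) and hand the pipeline a box of some other fruit instead, so the token isn't wasted.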
The only problem with this approach may be caused by the fact that the input filter can block waiting for the next packet. Since it is executed by a thread from the TBB pool, it will temporarily exclude that thread from doing useful work (e.g. helping to process already parcelled boxes). In other words, the system will be undersubscribed.
Yes, that was a concern, but oversubscribing when the number of cores is small sounds like a good plan. My current machine has 2*4 cores, but I was worried about what would happen on a dual-core machine with one task blocking frequently. I'll try out your suggestion.
Does processing one box provide enough work to be worth parallelizing (and is it possible to parallelize it at all)?
Probably not. The purpose of introducing fruit and the boxing mechanism was to separate out useful work that can be done in parallel, subject to the constraint that two boxes of the same fruit must be processed in the order they arrived (as must the fruit within a box). The box size is chosen so that the amount of computation on each box is large enough that the cost of interacting with the rest of the world to get the next box is negligible.
How many boxes per second (approximately, of course) are generated in a typical case?
Tens or hundreds of thousands of fruit can arrive per second. The boxes-per-second rate varies, as it depends on how big each box is. For my non-TBB example, I had boxes with 100 fruit in each one. With fewer than this, the synchronization time required to get each box started to become noticeable compared to the time required to process each box. But the time to process each fruit may vary, depending on what I need to do in each scenario I'm considering. So I'm expecting to have to adjust the box size for different scenarios.
Do you need to ensure that the results of box processing arrive at the final stage in the same order as the boxes were parcelled?
Not for different fruit. But for the same fruit the order of the output should match the corresponding input order. This presumably makes it difficult to use a serial filter, as that would impose an unnecessarily strict order on the output. Well, I'm not entirely sure how a serial filter connected downstream of a parallel one actually works; from the descriptions in the documentation it sounds like it does more than just process the items one at a time in the order they were generated by the parallel filter. But I'm guessing a serial filter isn't quite right if it does more than process the outputs from the parallel filter in the order they are produced. That's why I'd originally been thinking of a dispatch mechanism to a collection of fruit-specific pipelines, as it seemed easier to visualize the various ordering constraints.
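If a globally serial output filter turns out to be too strict, the per-fruit ordering could be restored explicitly in the final stage. A sketch of such a reorderer (an illustration, not a TBB facility): the input filter stamps each box with a per-fruit sequence number, and a box is emitted only once it is the next expected one for its fruit, so apples never wait for oranges.

```cpp
#include <map>
#include <string>
#include <vector>

struct OutBox {
    std::string fruit;
    int seq;  // per-fruit sequence number assigned by the input stage
};

// Emits boxes in per-fruit order while allowing different fruits to interleave.
class PerTypeReorderer {
    std::map<std::string, int> next_;                     // next expected seq per fruit
    std::map<std::string, std::map<int, OutBox>> parked_; // early arrivals per fruit
public:
    // Accept one box; return every box that can now be emitted, in order.
    std::vector<OutBox> accept(const OutBox& b) {
        std::vector<OutBox> out;
        std::map<int, OutBox>& q = parked_[b.fruit];
        q[b.seq] = b;
        int& next = next_[b.fruit];  // value-initialized to 0 on first use
        while (!q.empty() && q.begin()->first == next) {
            out.push_back(q.begin()->second);
            q.erase(q.begin());
            ++next;
        }
        return out;
    }
};
```

Box 1 of a fruit arriving before box 0 is simply parked; when box 0 turns up, both come out together in the right order.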
Thanks for the great response, Andrey. It's given me a lot to think about. I think I need to spend a bit of time digesting all the details. I'll let you know how I get on.
> Each of these criteria can also be combined (in a whichever-happens-earlier way) with the processing stage signalling that it's done
So I grasp the general strategy you have outlined, but there's one aspect that is still puzzling me. We start with the input filter's operator() method being called, which continues reading packets off a link and building boxes of fruit until the vector satisfies some "size" criterion. This vector is then returned as the result of the method, and the processing stage can then start processing the fruit, hopefully doing a lot of this work in parallel. Given that the number of tokens is three, the input filter's operator() method will immediately be called again, from another task, to start accumulating the next batch of fruit. But what you are suggesting, which makes sense intuitively, is that when the processing stage finishes, the input filter should consider cutting the production of the vector of fruit boxes short, even if it isn't big enough to satisfy our criterion, to avoid the processing stage being stalled. But how does the code inside the input filter's operator() method know that the processing stage has finished? Or have I misunderstood your suggestion? As with all this stuff, I can see how I could implement it manually, but is there something in TBB that assists here? Maybe I've still not fully grasped how TBB's version of pipelining works.
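I don't know of anything built into TBB for this, but one conventional way to implement the "processing stage signals it's done" part manually is a shared atomic flag: the processing filter sets it when it goes idle, and the input filter's accumulation loop checks it alongside the size criterion. A minimal sketch, with invented names and plain ints standing in for fruit:

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Set to true by the processing filter when it finishes a batch,
// back to false when it starts one. (Name is illustrative.)
std::atomic<bool> processing_idle(true);

// Input-filter side: accumulate until the batch is full, or cut it short
// as soon as the consumer is starving and we have something to hand over.
template <typename ReadFn>
std::vector<int> accumulate(ReadFn read, std::size_t max_size) {
    std::vector<int> batch;
    while (batch.size() < max_size) {
        batch.push_back(read());  // may block on the link
        if (processing_idle.load() && !batch.empty())
            break;  // ship a short batch rather than stall the pipeline
    }
    return batch;
}
```

Note the flag is only a hint: checking it after each blocking read means a short batch goes out at the next packet boundary, not instantly, which seems acceptable given packets arrive tens of thousands of times per second.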
That sounds like it might work well. Initially the system would construct fairly small vectors of fruit boxes, and the processing filter would quickly process them. So there wouldn't be much scope for parallelism, but we wouldn't need it either. But if some of the processing became more intensive, because different fruit may require different amounts of work, then the vectors would have time to grow in size, and so the potential for parallel processing within the processing stage would also grow. So with a bit of luck the system might be quite adaptive, and there's probably no need for a lower bound on vector size.

In this approach I may not even need the concept of boxes of fruit; I can just store lists of individual fruit in the vector. Originally I'd introduced boxes to make sure the communication overhead of passing fruit from one stage to the next didn't form a noticeable percentage of the overall processing time. But it sounds like the pipeline overheads in TBB are appreciably smaller than in a traditional thread-based pipeline. And even if I did need some kind of minimum size to justify the overhead of passing a vector to the processing filter, a simple constraint on the overall number of fruit, rather than fruit boxes, would now seem sufficient.
For generality there's possibly still a need for an upper bound on the vector size, for the case where the data is being read from a disk rather than a network, to prevent the input filter from getting carried away and constructing vector elements faster than the processing filter can handle them.
Thanks for all your help.